Overview

This guide will take you through the steps to integrate SDF with an existing DBT project. SDF currently works with DBT v1.7.0 and above.

Example

This section walks through integrating SDF with an existing DBT project, using the jaffle_shop example project from DBT.

Prerequisites

Ensure that you have the following installed and configured locally before beginning.

  • DBT
  • SDF
  • A valid profiles.yml file configured with DBT. See here for more information.

Next, you’ll need the dbt-core Jaffle Shop example project set up locally. You can clone it from here.

Without a valid profiles.yml, the Jaffle Shop example will not be able to compile, and SDF will not work.
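
For reference, a minimal profiles.yml for this example might look like the following. This sketch assumes the dbt-duckdb adapter purely because it requires no credentials; any adapter DBT supports will work.

# ~/.dbt/profiles.yml -- illustrative sketch only; swap in the adapter
# and connection fields for your own warehouse.
jaffle_shop:
  target: dev
  outputs:
    dev:
      type: duckdb
      path: jaffle_shop.duckdb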

We use this example as it doesn’t require authentication to a database or existing warehouse. In a production scenario, you’d likely compile DBT with your own models and remote warehouse, requiring authentication. To use SDF in this context, check our authentication page to see if we support your warehouse. If not, DDLs can be manually added to SDF alongside the DBT project to enable SDF compilation.
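
As a sketch, such a manually added DDL can be quite small. The file below is hypothetical (its name, location, and how it gets included in the workspace are assumptions; consult the SDF documentation for the exact mechanism), but it shows the idea: hand-write the schema of a remote table so SDF can resolve references to it.

-- ddls/raw_customers.sql (hypothetical): a hand-written DDL that lets SDF
-- compile against a table living in a warehouse it cannot authenticate to.
CREATE TABLE raw_customers (
  id BIGINT,
  first_name VARCHAR,
  last_name VARCHAR
);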

Guide

1. Compile DBT

dbt compile will compile the DBT project and generate the manifest file. This must be run before SDF can work.

dbt compile
2. Initialize the SDF DBT Workspace

Using the next command, SDF will create a workspace.sdf.yml based on your DBT project’s configuration. This file will be placed adjacent to your dbt_project.yml file and should be committed to your repository.

sdf dbt init

Notice that the includes block in your workspace.sdf.yml points to files within the target directory. This is because SDF deals with raw SQL directly and not DBT models. To accomplish this, SDF copies the necessary compiled DBT models into an sdf directory within target/compiled (or your configured target-path).
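
For a sense of what this looks like, the generated file might resemble the sketch below; the exact fields and paths depend on your project layout and SDF version, so treat the names here as placeholders.

# workspace.sdf.yml -- illustrative sketch; sdf dbt init generates the
# real contents from your dbt_project.yml and compiled models.
workspace:
  name: jaffle_shop
  includes:
    - path: target/compiled/jaffle_shop/models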

This command will also compile DBT snapshots into a format SDF can understand, and it will configure SDF to work with your DBT seeds. It will add these DBT seed configurations to the workspace.sdf.yml file like so:

...
table:
  name: raw_orders
  location: seeds/raw_orders.csv
  file-format: csv
  with-header: true

---
table:
  name: raw_payments
  location: seeds/raw_payments.csv
  file-format: csv
  with-header: true

---
table:
  name: raw_customers
  location: seeds/raw_customers.csv
  file-format: csv
  with-header: true
...

All SDF needs to know is the name of the table, the location of the file, and the file format. From there, it can infer the schema and data types. As new seeds are added, simply run sdf dbt refresh to update the workspace.sdf.yml file with the new seeds. Seeds can also be configured manually according to the YAML specification for SDF Table Blocks.
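
For example, after adding a new seed, the update loop is just a recompile and a refresh (a sketch using only commands covered in this guide):

# After adding a new seed CSV, recompile DBT so its artifacts reflect it,
# then refresh SDF to pick up the seed in workspace.sdf.yml.
dbt compile
sdf dbt refresh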

3. Compile SDF

Now that we have our workspace.sdf.yml file configured, we can compile SDF. This will generate the column-level lineage and place it in the .sdfcache directory local to your DBT project.

sdf compile

Great, now that we have our lineage, let’s try adding some metadata to our models.

4. Classify a DBT Model

In our Jaffle Shop example, we have a table created from a DBT seed called raw_customers. This table contains two columns (first_name, last_name) with personally identifiable information (PII). Let’s classify these columns as PII in SDF and ensure any downstream usage of these columns inherits the classification.

First, let’s define our PII classifier in the workspace.sdf.yml file.

---
classifier:
  name: pii
  description: Personally Identifiable Information
  labels:
    - name: name
      description: An individual's first, middle, or last name

Next, let’s attach this classifier to the relevant columns of the raw_customers table in the workspace.sdf.yml file.

---
table:
  name: raw_customers
  location: seeds/raw_customers.csv
  file-format: csv
  with-header: true
  columns:
    - name: first_name
      classifiers:
        - pii.name
    - name: last_name
      classifiers:
        - pii.name

Great, now let’s compile SDF again and see what happens.

sdf compile --show all
Working set 6 model files, 1 .sdf file
Finished 8 models [8 reused] in 0.089 secs

Schema jaffle_shop.dbt_alice.raw_customers
+-------------+-----------+------------+
| column_name | data_type | classifier |
+-------------+-----------+------------+
| id          | bigint    |            |
| first_name  | varchar   | pii.name   |
| last_name   | varchar   | pii.name   |
+-------------+-----------+------------+

...
...

Schema jaffle_shop.dbt_alice.stg_customers
+-------------+-----------+------------+
| column_name | data_type | classifier |
+-------------+-----------+------------+
| customer_id | bigint    |            |
| first_name  | varchar   | pii.name   |
| last_name   | varchar   | pii.name   |
+-------------+-----------+------------+

...
...

Schema jaffle_shop.dbt_alice.customers
+-------------------------+-----------+------------+
| column_name             | data_type | classifier |
+-------------------------+-----------+------------+
| customer_id             | bigint    |            |
| first_name              | varchar   | pii.name   |
| last_name               | varchar   | pii.name   |
| first_order             | date      |            |
| most_recent_order       | date      |            |
| number_of_orders        | bigint    |            |
| customer_lifetime_value | bigint    |            |
+-------------------------+-----------+------------+

You’ll notice the classification is attached not only to the raw_customers table, but also to the downstream stg_customers and customers tables. This is because SDF is able to infer the lineage between these tables and propagate the classification.
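
To see why, consider roughly what the compiled stg_customers model looks like in jaffle_shop (a simplified sketch; the actual compiled SQL uses fully qualified names). Because first_name and last_name pass straight through from raw_customers, SDF carries their pii.name classifier downstream with them.

-- Simplified sketch of jaffle_shop's stg_customers model after dbt compile.
select
    id as customer_id,
    first_name,
    last_name
from raw_customers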

5. Deploy & Explore the Lineage (Optional)

Now that we have our lineage and metadata generated, we can deploy it to the SDF cloud and explore it in the UI. Note this is only available to users who have been granted access to the SDF Cloud Console.

sdf auth login
sdf push

Commands

Here we lay out the sdf dbt commands available when developing with SDF and DBT locally.

sdf dbt init

This command will initialize the SDF workspace for your DBT project. It will create a workspace.sdf.yml file adjacent to your dbt_project.yml file. It will also configure all DBT seeds and compile DBT snapshots into a format SDF can understand. Lastly, it will copy the compiled models into an sdf directory within target/compiled (or your configured target-path).

sdf dbt init
Initialize a sdf workspace from a dbt project -- best effort

Usage: sdf dbt init [OPTIONS]

Options:
      --target <TARGET>                Use this DBT target over the default target in profiles.yml
      --profiles-dir <PROFILES_DIR>    Use this DBT profile instead of the defaults at ~/.dbt/profile.yml -- (note dbt uses --profile_dir, this CLI uses --profile-dir)
      --workspace-dir <WORKSPACE_DIR>  Specifies the workspace directory where we expect to see manifest and dbt project files The SDF workspace file will be placed in the same directory. Default: current directory
  -s, --save                           Save and overwrite the workspace file
  -c, --config <CONFIG>                Supply a config yml file or provide config as yml string e.g. '{key: value}'
  -h, --help                           Print help

dbt compile must be run before running sdf dbt init

sdf dbt refresh

This command will refresh the SDF workspace for your DBT project. This is useful if you make changes to DBT models during development and would like SDF to work with your latest models without regenerating the workspace.sdf.yml file. It will recompile the DBT snapshots and move the compiled models into an sdf directory within target/compiled (or your configured target-path).

sdf dbt refresh
Re-initialize a sdf workspace from a dbt project -- best effort

Usage: sdf dbt refresh [OPTIONS]

Options:
      --target <TARGET>                Use this DBT target over the default target in profiles.yml
      --profiles-dir <PROFILES_DIR>    Use this DBT profile instead of the defaults at ~/.dbt/profile.yml -- (note dbt uses --profile_dir, this CLI uses --profile-dir)
      --workspace-dir <WORKSPACE_DIR>  Specifies the workspace directory where we expect to see manifest and dbt project files The SDF workspace file will be placed in the same directory. Default: current directory
  -s, --save                           Save and overwrite the workspace file
  -c, --config <CONFIG>                Supply a config yml file or provide config as yml string e.g. '{key: value}'
  -h, --help                           Print help
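
In practice, a typical development loop might look like this (a sketch that uses only commands covered in this guide):

# Edit your DBT models, then:
dbt compile        # regenerate DBT's compiled SQL and manifest
sdf dbt refresh    # sync the SDF workspace with the latest compiled models
sdf compile        # re-run SDF compilation and lineage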

By default, sdf dbt refresh will modify your workspace.sdf.yml to reflect any updates to your DBT project. However, if you want to preserve things like comments in your workspace.sdf.yml file, you can use the --no-save flag. This will print the changes to the console without saving them to the file.

sdf dbt refresh --no-save

Conclusion

SDF can work alongside an existing DBT project to power column-level lineage, SQL compilation and validation, impact analysis, data classification, and much more for DBT models. This integration is actively under development, with tons more coming soon.