Learn how to run pipelines in production while preserving data quality with SDF build
While you can simply run `sdf run` in your orchestrator, we strongly encourage users to leverage the SDF build process to ensure that data quality is preserved in production. Critically, build stages your data first, then tests it, and only if the tests pass does it overwrite the production data. This is the best way to run models in production, since it ensures production data only gets updated if all data quality checks pass.
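In an orchestrator, the production entry point can be a single command; conceptually, each invocation is one write-audit-publish (WAP) cycle:

```shell
# write:   materialize each model to a staged <model-name>_draft table
# audit:   run the workspace's tests against the staged tables
# publish: swap the drafts into the production table names only if every test passes
sdf build
```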
To follow along, create a new sample workspace from Mom's Flower Shop:
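A minimal sketch of the setup, assuming the sample ships with the SDF CLI under the name moms_flower_shop:

```shell
# Scaffold the tutorial workspace from the bundled sample, then work inside it
sdf new --sample moms_flower_shop
cd moms_flower_shop
```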
Setup Pipeline Testing
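The output below can be reproduced by running the workspace's models and tests together; a minimal sketch:

```shell
# Runs the models in the working set, then evaluates the tests defined against them
sdf test
```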
Working set 12 model files, 1 test file, 27 .sdf files
Running moms_flower_shop.raw.raw_inapp_events (./models/raw/raw_inapp_events.sql)
Running moms_flower_shop.staging.inapp_events (./models/staging/inapp_events.sql)
Testing moms_flower_shop.staging.test_inapp_events (./sdftarget/dbg/tests/moms_flower_shop/staging/test_inapp_events.sql)
Finished 2 models [2 succeeded], 1 test [1 passed] in 1.655 secs
[Pass] Test moms_flower_shop.staging.test_inapp_events

The model inapp_events contains a column event_value with assertion tests applied to it in the workspace. Suppose we modify inapp_events.sql in a way that violates one of those assertions. Running the pipeline with `sdf run` would overwrite production with the bad data before any test could catch it; `sdf build` would prevent this scenario from happening.

Build the Pipeline

This is where `sdf build` comes in. When we run build, SDF will stage the data by materializing each model with _draft appended to its table name, then run tests against that staged table, and publish it to overwrite production only if the tests pass. Let's try it out:
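With the broken model in place, run the build:

```shell
sdf build
```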
The build comes to a hard stop at the failing test, indicating:

- No models downstream of inapp_events were run, since this model failed the quality checks.
- The production table inapp_events was not overwritten itself, meaning any downstream dependencies like dashboards are only pulling slightly stale data, as opposed to bad data.

Now, let's fix the inapp_events.sql file back to our originally intended change and run `sdf build` again:

Working set 12 model files, 1 test file, 27 .sdf files
Staging moms_flower_shop.raw.raw_addresses (./models/raw/raw_addresses.sql)
Staging moms_flower_shop.raw.raw_inapp_events (./models/raw/raw_inapp_events.sql)
Staging moms_flower_shop.raw.raw_customers (./models/raw/raw_customers.sql)
Staging moms_flower_shop.raw.raw_marketing_campaign_events (./models/raw/raw_marketing_campaign_events.sql)
Staging moms_flower_shop.staging.inapp_events (./models/staging/inapp_events.sql)
Staging moms_flower_shop.staging.marketing_campaigns (./models/staging/marketing_campaigns.sql)
Testing moms_flower_shop.staging.test_inapp_events (./sdftarget/dbg/tests/moms_flower_shop/staging/test_inapp_events.sql)
Staging moms_flower_shop.staging.app_installs_v2 (./models/staging/app_installs_v2.sql)
Staging moms_flower_shop.staging.app_installs (./models/staging/app_installs.sql)
Staging moms_flower_shop.analytics.agg_installs_and_campaigns (./models/analytics/agg_installs_and_campaigns.sql)
Staging moms_flower_shop.staging.stg_installs_per_campaign (./models/staging/stg_installs_per_campaign.sql)
Staging moms_flower_shop.staging.customers (./models/staging/customers.sql)
Staging moms_flower_shop.analytics.dim_marketing_campaigns (./models/analytics/dim_marketing_campaigns.sql)
Publishing 12 models, 1 tests
Finished 12 models [12 succeeded], 1 test [1 passed] in 1.829 secs
[Pass] Test moms_flower_shop.staging.test_inapp_events

(Optional) Build a Specific Model or Directory of Models

You can also build a specific model or directory of models by using the `sdf build` command and combining it with the --targets-only flag. For more on target specification, check out our IO Guide. For example, to build only the inapp_events model, you can run:
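A sketch of the command, assuming the model's file path is accepted as a target (see the IO Guide for the full target syntax):

```shell
# Builds inapp_events and its dependencies, tests them, and publishes on success
sdf build --targets-only models/staging/inapp_events.sql
```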
Working set 12 model files, 1 test file, 19 .sdf files
Publishing 2 models, 1 tests
Finished 2 models [2 reused], 1 test [1 passed (reused)] in 0.941 secs
[Pass] Test moms_flower_shop.staging.test_inapp_events

Notice that SDF first builds the dependencies of inapp_events (like raw_inapp_events, as seen in the output) and then builds inapp_events itself. Think of this like a build system, where we build all the dependencies of a target before building the target itself.

In an Airflow orchestration scenario, you might manually specify that one DAG needs to run before another that depends on it. Then, when running the second DAG, you'd be duplicating work by running the first DAG again.
To avoid this and build only the models that belong to a specific DAG, you can use the --targets-only flag with a directory path. For example, if we wanted to build the models in a directory models/my_dag, we could run `sdf build --targets-only models/my_dag`. This builds all models in the models/my_dag directory first, then runs tests against all of them, and only if all tests pass on all models in the DAG will it publish and overwrite the production data.

What is the difference between `sdf run`, `sdf test`, and `sdf build`?
- `sdf run`: runs the pipeline. It materializes the specified models and updates the data in the production location. No tests are run with `sdf run`.
- `sdf test`: runs the tests. It runs the specified models, then tests them against the assertions defined in the workspace. Notably, the tests run after the models have been materialized and the data has been updated.
- `sdf build`: stages the data, runs the tests, then publishes the data to production only if the tests pass (i.e. the write-audit-publish pattern). This is the best way to run models in production, since it ensures production data only gets updated if all data quality checks pass.

As such, the `sdf build` command should be used in production scenarios to ensure data quality is preserved.

How does the publish step of SDF build work?
Once all tests pass, the publish step renames each staged _draft table to the production table name using an ALTER TABLE statement.

Does this work with incremental models and snapshots?
Yes, `sdf build` works with incremental models and snapshots, and behaves the same as `sdf run` with respect to the cache.

Furthermore, in an incremental scenario, `sdf build` will only run the incremental computation, yet still test the full history of all increments. It works by staging the new increment in the <model-name>_draft table alongside the previously published history, testing that full table, and publishing it only if the tests pass.

How can I build an SDF workspace in Airflow?
In the simplest scenario, you can write an Airflow DAG that runs the `sdf build` command on a cadence. If you have a more complex scenario where you need to run certain groups of models as DAGs in a specific order, you can use the --targets-only flag to build a specific model or directory of models. For example, you could write three Airflow DAGs, each with a simple BashOperator that runs one of:

- `sdf build --targets-only models/dag1`
- `sdf build --targets-only models/dag2`
- `sdf build --targets-only models/dag3`
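Outside of Airflow, the same ordering can be approximated in a plain shell script (a sketch only; in Airflow the ordering would instead be expressed as dependencies between the DAGs):

```shell
#!/usr/bin/env bash
set -euo pipefail   # halt the chain as soon as any group fails to build or test

# Each invocation is an independent write-audit-publish cycle;
# later groups run only if the earlier ones published successfully.
sdf build --targets-only models/dag1
sdf build --targets-only models/dag2
sdf build --targets-only models/dag3
```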
How can I build an SDF workspace in Dagster?

You can use the `sdf build` command in tandem with our Dagster integration. For more on how to set that up, check out our Dagster integration guide.

In summary, the `sdf build` command can be used to stage data, run tests, and publish data to production only if the tests pass. This is critical to preserving data quality and is a best practice for running models in production.