Overview

This guide will walk you through the process of building and deploying your pipelines in production with SDF. Although this can be as simple as running sdf run in your orchestrator, we strongly encourage users to leverage the SDF build process to ensure that data quality is preserved in production.

Critically, build stages your data first, then tests it, and overwrites the production data only if the tests pass. This is the best way to run models in production, since it ensures production data is updated only when all data quality checks pass.

SDF build implements the write-audit-publish (WAP) paradigm for you. You can think of this like a blue-green deployment in software engineering. Simply put, models are first written to a temporary location, then audited for data quality, and finally published to the production location. In an optimal scenario, the publish step requires no extra compute resources, since the data has already been processed and validated. For more on this, check out our blog post on WAP.
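
Conceptually, the flow looks something like the sketch below. This is illustrative SQL only; the table and column names are hypothetical, and SDF manages all of these steps for you.

  -- 1. Write: materialize the model into a temporary, staged location
  CREATE TABLE analytics.orders_draft AS
  SELECT * FROM raw.raw_orders;

  -- 2. Audit: run data quality checks against the staged table
  SELECT COUNT(*) AS failing_rows
  FROM analytics.orders_draft
  WHERE order_total < 0;

  -- 3. Publish: only if the audit finds zero failing rows, swap the
  --    staged table into the production location
  DROP TABLE IF EXISTS analytics.orders;
  ALTER TABLE analytics.orders_draft RENAME TO orders;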

Guide

1. Create a new sample workspace from Mom's Flower Shop

Let’s start by creating a new sample workspace based on the Mom’s Flower Shop example. This time, we’ll use the completed version (not the one designed for the tutorial) so that the tests are already in place.

  sdf new --sample moms_flower_shop_completed

2. Set Up Pipeline Testing

Great! Now that we have our workspace, let’s start by running the tests. This will run the models, then test them against the assertions we’ve defined in the workspace.

  sdf test

Working set 12 model files, 1 test file, 27 .sdf files
    Running moms_flower_shop.raw.raw_inapp_events (./models/raw/raw_inapp_events.sql)
    Running moms_flower_shop.staging.inapp_events (./models/staging/inapp_events.sql)
    Testing moms_flower_shop.staging.test_inapp_events (./sdftarget/dbg/tests/moms_flower_shop/staging/test_inapp_events.sql)
   Finished 2 models [2 succeeded], 1 test [1 passed] in 0.933 secs
[Pass] Test moms_flower_shop.staging.test_inapp_events

Great! Our tests passed. However, for the sake of this demo, let's introduce a bug that fails our data quality checks.

The model inapp_events contains a column event_value with the following assertion tests applied to it:

metadata/staging/inapp_events.sdf.yml
  columns:
    - name: event_value
      tests:
        - expect: valid_scalar("""event_value >= 0""")
          severity: error
        - expect: minimum(0)
          severity: error
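
Conceptually, each of these assertions amounts to a check along the lines of the query below. This is only an illustration, not the exact SQL SDF generates, and the table reference is assumed from this workspace's layout.

  -- Count rows that violate the expectation; any non-zero result fails the test
  SELECT COUNT(*) AS failing_rows
  FROM staging.inapp_events
  WHERE NOT (event_value >= 0);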

So, let's say we intended to modify this column to subtract one, but by mistake subtracted the value from one instead. To introduce this bug, update inapp_events.sql to:

models/staging/inapp_events.sql
  SELECT 
    event_id,
    customer_id,
    FROM_UNIXTIME(event_time/1000) AS event_time,  
    event_name,
    1 - event_value AS event_value,
    additional_details,
    platform,
    campaign_id
  FROM raw.raw_inapp_events
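
Since the assertions above require event_value to be at least 0, any source row where event_value is greater than 1 now produces a negative result and fails both checks. A quick illustration with a made-up value:

  -- With a source event_value of 5, the buggy expression goes negative
  SELECT 1 - 5 AS event_value;  -- returns -4, violating minimum(0)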

Now, let's run the tests again:

  sdf test

Uh oh, both tests failed, and since sdf test runs the models before testing them, the model's data was updated with these faulty values. In an incremental scenario, this might result in some painful backfilling. Furthermore, in a production scenario, downstream data consumers like dashboards might now be displaying this faulty data. Let's move on and show how sdf build would prevent this from happening.

3. Build the Pipeline

So, we have a workspace with a failing model. Let’s imagine this is a production scenario and we’re about to deploy this pipeline. We’d want to make sure this test passes before we overwrite the production data.

This is where sdf build comes in. When we run build, SDF stages each model by materializing it with _draft appended to its table name, runs tests against that staged table, and publishes it to overwrite production only if the tests pass. Let's try it out.

  sdf build

Nice! As you can see, SDF prints out a clear hard stop indicating:

  1. None of the downstream models of inapp_events were run since this model failed the quality checks.
  2. inapp_events itself was not overwritten, meaning downstream consumers like dashboards are only pulling slightly stale data, as opposed to bad data.

Let’s patch up this error by updating the inapp_events.sql file to our originally intended change:

models/staging/inapp_events.sql
  SELECT 
    event_id,
    customer_id,
    FROM_UNIXTIME(event_time/1000) AS event_time,  
    event_name,
    ABS(event_value - 1) AS event_value,
    additional_details,
    platform,
    campaign_id
  FROM raw.raw_inapp_events

Now, let's run the build again, this time with a fully passing workspace:

  sdf build

Working set 12 model files, 1 test file, 27 .sdf files
    Staging moms_flower_shop.raw.raw_inapp_events (./models/raw/raw_inapp_events.sql)
    Staging moms_flower_shop.raw.raw_customers (./models/raw/raw_customers.sql)
    Staging moms_flower_shop.raw.raw_addresses (./models/raw/raw_addresses.sql)
    Staging moms_flower_shop.raw.raw_marketing_campaign_events (./models/raw/raw_marketing_campaign_events.sql)
    Staging moms_flower_shop.staging.marketing_campaigns (./models/staging/marketing_campaigns.sql)
    Staging moms_flower_shop.staging.inapp_events (./models/staging/inapp_events.sql)
    Testing moms_flower_shop.staging.test_inapp_events (./sdftarget/dbg/tests/moms_flower_shop/staging/test_inapp_events.sql)
    Staging moms_flower_shop.staging.app_installs (./models/staging/app_installs.sql)
    Staging moms_flower_shop.staging.app_installs_v2 (./models/staging/app_installs_v2.sql)
    Staging moms_flower_shop.analytics.agg_installs_and_campaigns (./models/analytics/agg_installs_and_campaigns.sql)
    Staging moms_flower_shop.staging.customers (./models/staging/customers.sql)
    Staging moms_flower_shop.staging.stg_installs_per_campaign (./models/staging/stg_installs_per_campaign.sql)
    Staging moms_flower_shop.analytics.dim_marketing_campaigns (./models/analytics/dim_marketing_campaigns.sql)
 Publishing 12 models, 1 tests
   Finished 12 models [12 succeeded], 1 test [1 passed] in 1.057 secs
[Pass] Test moms_flower_shop.staging.test_inapp_events

Great! Our pipeline built successfully, and we can be confident that our data is correct.

4. (Optional) Build a Specific Model or Directory of Models

Many people find it useful to build a specific model or a directory of models that might represent a DAG. In orchestration scenarios that use something like Airflow, this is a common pattern. This can be accomplished simply by passing the model name or directory path to the sdf build command and combining it with the --targets-only flag. For more on target specification, check out our IO Guide.

For example, to build only the inapp_events model, you can run:

  sdf build --targets-only models/staging/inapp_events.sql

Working set 12 model files, 1 test file, 19 .sdf files
 Publishing 2 models, 1 tests
   Finished 2 models [2 reused], 1 test [1 passed (reused)] in 0.859 secs
[Pass] Test moms_flower_shop.staging.test_inapp_events

This will build all models upstream of inapp_events (raw_inapp_events accounts for one of the 2 models in the output) and then build inapp_events itself. Think of this like a build system, where we build all of a target's dependencies before building the target itself.
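
As another example from this sample workspace, you could build dim_marketing_campaigns together with everything upstream of it. The path below follows this sample's layout; adjust it for your own workspace.

  sdf build --targets-only models/analytics/dim_marketing_campaigns.sql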

In an Airflow orchestration scenario, you might manually specify that one DAG runs before another DAG that depends on it. If building the second DAG also rebuilt everything upstream, you'd be duplicating the work the first DAG already did. To avoid this and run only the models that belong to a specific DAG, you can use the --targets-only flag with a directory path. For example, if we wanted to build only the models in a directory models/my_dag, we could run:

  sdf build --targets-only models/my_dag

In this build scenario, SDF will stage all models in the models/my_dag directory first, then run tests against all of them, and publish over the production data only if every test on every model in the DAG passes.
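
For instance, in the Mom's Flower Shop workspace from this guide, you could build and test all of the staging models together by pointing the flag at the sample's models/staging directory:

  sdf build --targets-only models/staging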

Conclusion

In this guide, we learned how to build and deploy pipelines in production with SDF. We saw how the sdf build command can be used to stage data, run tests, and publish data to production only if the tests pass. This is a critical step in preserving data quality and a best practice for running models in production.