Overview

In the previous tutorial we learned how to create a new model with SDF. As we were working on dim_marketing_campaigns, the result of the query just didn’t make sense.

In this tutorial we will see how SDF can simplify complicated debugging scenarios and accelerate their resolution. It can also provide real-time feedback as you author your queries with unprecedented speed and accuracy.

More specifically, we will use SDF to:

  1. Identify the issue upstream with our built-in column-level lineage.
  2. Run impact analysis after fixing the issue.
  3. Test some syntax static analysis examples.

Prerequisites

Completion of the previous tutorial.

Guide

1

Understand Column Lineage

SDF provides column-level lineage out of the box, all the way up to source.

To understand the lineage of the column total_num_installs in our new model, all we need to do is run:

sdf lineage analytics.dim_marketing_campaigns --column total_num_installs

moms_flower_shop.analytics.dim_marketing_campaigns.total_num_installs │ │ copy └──────┐        moms_flower_shop.staging.stg_installs_per_campaign.total_num_installs        │        │ mod        └──────┐               moms_flower_shop.staging.app_installs.event_id               │               │ copy               └──────┐                      moms_flower_shop.raw.raw_marketing_campaign_events.event_id                      moms_flower_shop.staging.inapp_events.event_id                      │                      │ copy                      └──────┐                             moms_flower_shop.raw.raw_inapp_events.event_id

Now we can pinpoint exactly which models we need to check. Out of the entire warehouse, we only need to check 3 models!

To learn more about SDF lineage, visit our lineage guide

2

Identify the Issue Upstream

Let’s check the first upstream model — stg_installs_per_campaign.sql located under the directory models/staging/.

models/staging/stg_installs_per_campaign.sql
SELECT 
    campaign_id,
    COUNT(event_id) AS total_num_installs
FROM app_installs
GROUP BY 1

There is not much going on here. We need to keep moving up.

In the upstream model app_installs defined in models/staging/app_installs.sql, let’s zoom in on the FROM statement:

models/staging/app_installs.sql
...
FROM inapp_events i 
    LEFT OUTER JOIN raw.raw_marketing_campaign_events m
        ON (i.event_id = m.event_id) 
...

Aha! Who said that inapp_events and marketing events are the same?

3

Compare to Other Models

You remember another model, customers, has correct marketing campaign data. Let’s check out the lineage for the customers model:

sdf lineage staging.customers --column campaign_id

moms_flower_shop.staging.customers.campaign_id │ │ copy └──────┐        moms_flower_shop.staging.app_installs_v2.campaign_id        │        │ copy        └──────┐               moms_flower_shop.raw.raw_marketing_campaign_events.campaign_id

Aha #2! There’s a V2 version of app_installsapp_installs_v2

4

Update the Model

Once we identify the issue, it’s an easy fix. Update the FROM statement of models/staging/stg_installs_per_campaign.sql, to the use app_installs_v2:

models/staging/stg_installs_per_campaign.sql
...
FROM app_installs_v2  -- Change from app_installs to app_installs_v2
...
5

Analyze

Let’s analyze the impact.

First, we should make sure our fix compiles:

sdf compile 

We expect to see a successful run:

Working set 12 model files, 15 .sdf files   Compiling moms_flower_shop.staging.stg_installs_per_campaign (./models/staging/stg_installs_per_campaign.sql)   Compiling moms_flower_shop.analytics.dim_marketing_campaigns (./models/analytics/dim_marketing_campaigns.sql)    Finished 11 models [2 succeeded, 9 reused] in 0.745 secs

Next, we can look at all the tables and columns this change will impact. If we add the flag --forward to the lineage command, we will see the lineage of each column of stg_installs_per_campaign and how it is used by other models.

To do that, add a forward tag to your lineage command:

sdf lineage staging.stg_installs_per_campaign --forward

It should look like this:

Table: moms_flower_shop.staging.stg_installs_per_campaign

moms_flower_shop.staging.stg_installs_per_campaign.campaign_id

moms_flower_shop.staging.stg_installs_per_campaign.total_num_installs │ │ copy ├──────┐ │      moms_flower_shop.analytics.dim_marketing_campaigns.total_num_installs │ mod └──────┐        moms_flower_shop.analytics.dim_marketing_campaigns.avg_customer_acquisition_cost        moms_flower_shop.analytics.dim_marketing_campaigns.install_duration_ratio

Only our new model will be impacted.

6

Fix it!

To materialize the fix, all we need to do is run the queries:

sdf run analytics.dim_marketing_campaigns

Working set 12 model files, 15 .sdf files     Running moms_flower_shop.raw.raw_inapp_events (./models/raw/raw_inapp_events.sql)     Running moms_flower_shop.raw.raw_marketing_campaign_events (./models/raw/raw_marketing_campaign_events.sql)     Running moms_flower_shop.staging.inapp_events (./models/staging/inapp_events.sql)     Running moms_flower_shop.staging.marketing_campaigns (./models/staging/marketing_campaigns.sql)     Running moms_flower_shop.staging.app_installs_v2 (./models/staging/app_installs_v2.sql)     Running moms_flower_shop.staging.stg_installs_per_campaign (./models/staging/stg_installs_per_campaign.sql)     Running moms_flower_shop.analytics.dim_marketing_campaigns (./models/analytics/dim_marketing_campaigns.sql)    Finished 7 models [7 succeeded] in 1.105 secs

Table moms_flower_shop.analytics.dim_marketing_campaigns ┌─────────────┬─────────────────────────────────┬────────────────────┬───────────────────────────────┬────────────────────────┐ │ campaign_id ┆ campaign_name                   ┆ total_num_installs ┆ avg_customer_acquisition_cost ┆ install_duration_ratio │ ╞═════════════╪═════════════════════════════════╪════════════════════╪═══════════════════════════════╪════════════════════════╡ │ 3           ┆ instagram_ads_3_3e31b541d8      ┆ 192                ┆ 7.958249909930306             ┆ 0                      │ │ 1           ┆ friends_referrals_1_9b24ce4c65  ┆ 160                ┆ 10.391763883087673            ┆ 0                      │ │ 0           ┆ instagram_ads_0_8c1fdf7cb0      ┆ 91                 ┆ 4.513104055002336             ┆ 0                      │ │ 13          ┆ friends_referrals_13_c6e177ff0a ┆ 72                 ┆ 22.915686962366962            ┆ 0                      │ │ 14          ┆ google_search_14_3775b59a56     ┆ 48                 ┆ 54.12730118839374             ┆ 0                      │ │ 19          ┆ friends_referrals_19_3f4a9d9fc3 ┆ 26                 ┆ 59.38710281622366             ┆ 0                      │ │ 11          ┆ google_search_11_56e2a09ed5     ┆ 19                 ┆ 58.38329050623466             ┆ 0                      │ │ 4           ┆ friends_referrals_4_c554b77114  ┆ 17                 ┆ 163.69881288433993            ┆ 1                      │ │ 22          ┆ friends_referrals_22_69b6886d4a ┆ 15                 ┆ 116.4817634058549             ┆ 0                      │ │ 17          ┆ google_search_17_a400715565     ┆ 15                 ┆ 147.1283543422433             ┆ 1                      │ └─────────────┴─────────────────────────────────┴────────────────────┴───────────────────────────────┴────────────────────────┘ 114 rows, showing only 10 rows.   Run with —limit 0 to show all rows.

Perfect! We are all set!

Bonus

Lineage is not all we can do.

SDF provides out of the box static analysis to support the development process as you author your transformations. This capability has two benefits:

  1. Quicker development thanks to the instant feedback on your syntax and semantics
  2. Compute cost savings since analysis happens on compile, before we send the query to run

Let’s try a few things that will fail. In this case, getting an error early is a good thing. Notice how quickly you get the response!

1

Open the File

To keep it simple, we will use models/staging/stg_installs_per_campaign.sql throughout the following steps.

As a reminder, the query looks like this:

models/staging/stg_installs_per_campaign.sql
SELECT 
    campaign_id,
    COUNT(event_id) AS total_num_installs
FROM app_installs_v2
GROUP BY 1
2

Error: No Column Found

Let’s try to change the column campaign_id to a non-existing column.

For example, we can change the column name from campaign_id to non_existing_column. This column definitely doesn’t exist in app_installs_v2 (or app_installs)

models/staging/stg_installs_per_campaign.sql
SELECT 
    -- We changed campaign_id to non_existing_column below
    non_existing_column, 
    COUNT(event_id) AS total_num_installs
FROM app_installs_v2
GROUP BY 1

If we try to compile:

sdf compile staging.stg_installs_per_campaign

The run will fail and the error message will look like this:

error: SDF227: No column non_existing_column found. Available are app_installs.install_time, app_installs.campaign_name, app_installs.campaign_id, app_installs.campaign_type, app_installs.event_id, app_installs.platform, app_installs.customer_id   —> models/staging/stg_installs_per_campaign.sql:2:5

If you are using Visual Studio Code, you can easily click into the error location by CMD-Clicking on the bottom of the error:

--> models/staging/stg_installs_per_campaign.sql:2:5

3

Error: Function Arguments Do Not Match

Similarly, we can write a function incorrectly.

For example, the function ABS takes one numeric argument. What if we try to give it a string like campaign_name? In the file models/staging/stg_installs_per_campaign.sql, change the campaign_id selection with the following:

models/staging/stg_installs_per_campaign.sql
SELECT 
    -- We changed campaign_id to ABS(campaign_name) 
    ABS(campaign_name), 
    COUNT(event_id) AS total_num_installs
FROM app_installs_v2
GROUP BY 1

If we try to compile:

sdf compile staging.stg_installs_per_campaign

The run will fail and the error message will look like this:

error: SDF209: Failed to resolve function ABS: Argument type mismatch: actual: (varchar); candidates: (tinyint); (smallint); (bigint); (double); (decimal(p, s)); (real).   —> models/staging/stg_installs_per_campaign.sql:2:5

4

Error: Un-aggregated columns in aggregation context

We can even try some more complicated logic.

For example, we can add another column to the SELECT without updating the GROUP BY.

SELECT 
    campaign_id,
    campaign_name,  -- We added an extra column
    COUNT(event_id) AS total_num_installs
FROM app_installs_v2
GROUP BY 1  -- We didn't add the new column here

Watch what happens when you try to compile:

sdf compile staging.stg_installs_per_campaign

The run will fail and the error message will look like this:

error: SDF209: Failed to resolve function ABS: Argument type mismatch: actual: (varchar); candidates: (tinyint); (smallint); (bigint); (double); (decimal(p, s)); (real).   —> models/staging/stg_installs_per_campaign.sql:2:5

5

Try it Yourself

What is that one syntax mistake you always forget about until you see that run error? Try it out! See if SDF can save you the wait next time :).

6

Don't Forget to Revert!

While experimenting, we introduced some breaking logic to our data warehouse. Let’s revert the model to the original SQL code.

Update the file models/staging/stg_installs_per_campaign.sql with the original SQL:

models/staging/stg_installs_per_campaign.sql
SELECT 
    campaign_id, 
    COUNT(event_id) AS total_num_installs
FROM app_installs_v2
GROUP BY 1

Next Steps

Now that we fixed the issue, we want to prevent anything similar from happening again. In our next tutorial, “Deprecating a Model” we will deprecate app_installs, the old version of app_installs_v2 that we initially used.