Overview
In the previous tutorial, we set up guardrails based on checks. In this tutorial, we will see how SDF’s semantic understanding can help transform your data warehouse from strings and numbers to real-world business logic:- Maintain business logic consistency
- Control development environment and minimize mistakes from propagating
- Comply with privacy and GDPR requirements
Prerequisites
Completion of the previous tutorial.Guide
1
Setup
If you haven’t completed the previous tutorial,
uncomment the relevant section to reference the metadata files:
workspace.sdf.yml
2
Intro to SDF Classifiers
SDF’s has the ability to annotate columns and tables with user defined types which represent
real-world business logic. Those types enrich the data warehouse and create new SQL types -
instead of just BIGINTs, we can now have currencies, different types of IDs, zip-codes, and many more.SDF automatically propagates those types to downstream assets, enriching the entire
data warehouse with a new layer of semantic understanding.
To learn more about SDF’s classifiers, read our classifiers guide.
3
Creating New Classifiers
Let’s focus back on Mom’s Flower Shop.If you recall, V1 of Essentially, we were joining two elements that are completely different.
Like comparing Apples to Oranges. These kind of mistakes happen all the time.
Thankfully, we can leverage SDF’s semantic understanding and smart propagation
to set guardrails which will prevent future similar mistakes.Let’s use SDF classifiers to add the missing business logic.The column classifiers file
app_installs
had an incorrect JOIN
between
mobile app in-app events in the raw_inapp_events
table, and marketing
campaign events in the raw_marketing_campaign_events
table:classifications/column_classifiers.sdf.yml
already contains
the event classifiers. Take a look yourself:classifications/column_classifiers.sdf.yml
4
Assign Classifiers to Source Tables
To assign the classifiers, uncomment the relevant section in each of the
files:Notice that we only assigned the classifiers to the source
metadata/raw/raw_inapp_events.sdf.yml
metadata/raw/raw_marketing_campaign_events.sdf.yml
raw
tables.5
Propagate Classifiers
SDF actually propagates the classifiers automatically, so no extra steps
are required.
dim_marketing_campaigns
,
we can see some of its columns inherited the classifier EVENT.inapp
.Schema moms_flower_shop.analytics.dim_marketing_campaigns
┌───────────────────────────────┬───────────┬─────────────┬──────────────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════════════════════╪═══════════╪═════════════╪══════════════════════════════════════════════════════════════════╡
│ campaign_id ┆ bigint ┆ ┆ The identifier of the campaign that is associated with the event │
│ campaign_name ┆ varchar ┆ ┆ The campaign name associated with the campaign_id │
│ total_num_installs ┆ bigint ┆ EVENT.inapp ┆ │
│ avg_customer_acquisition_cost ┆ double ┆ EVENT.inapp ┆ │
│ install_duration_ratio ┆ bigint ┆ EVENT.inapp ┆ │
└───────────────────────────────┴───────────┴─────────────┴──────────────────────────────────────────────────────────────────┘
app_installs
inherited both EVENT.inapp
and EVENT.marketing
. This is due to the incorrect JOIN
we found.Schema moms_flower_shop.staging.app_installs
This table is a staging table which adds campaign information to app install events
[TABLE_STATUS.deprecated]
┌───────────────┬───────────┬─────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════╪═══════════╪═════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ event_id ┆ bigint ┆ EVENT.inapp EVENT.marketing ┆ │
│ customer_id ┆ bigint ┆ ┆ The identifier of the customer that performed the event │
│ install_time ┆ timestamp ┆ ┆ │
│ platform ┆ varchar ┆ ┆ iOS or Android │
│ campaign_id ┆ bigint ┆ ┆ The identifier of the campaign that is associated with the event │
│ campaign_name ┆ varchar ┆ ┆ The campaign name associated with the campaign_id │
│ campaign_type ┆ varchar ┆ ┆ A substring of the campaign name contain the campaign type. Supported types - instagram_ads, friends_referrals, facebook_ads, google_search │
│ ┆ ┆ ┆ │
└───────────────┴───────────┴─────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
6
Set Up Guardrails
We can create custom checks by querying SDF’s information schema.
To prevent this mistake from repeating, we can set up a check.First, create a file called Let’s run it:This check will fail because
mixed_event_ids.sql
under the folder
checks
. Add the following SQL to the new file:checks/mixed_event_ids.sql
app_installs
still has an
incorrect JOIN
.[Fail] Check moms_flower_shop.pub.mixed_event_ids
┌──────────────┬─────────────┬────────────────────────────────┐
│ table_name ┆ column name ┆ classifiers │
╞══════════════╪═════════════╪════════════════════════════════╡
│ app_installs ┆ event_id ┆ [EVENT.inapp, EVENT.marketing] │
└──────────────┴─────────────┴────────────────────────────────┘
1 rows.
7
Deprecate App Installs
In the previous tutorial, we already resolved
any downstream dependencies of
app_installs
and we now fully support the newer
version, app_installs_v2
. It is safe to deprecate the model - just delete the file.Or course, you can run sdf compile
to validate the change.Bonus
Classifiers can enrich your data warehouse in many ways. The following are just a few examples of added information layers to your static tables. With each example, you can create checks and reports to monitor your warehouse’s health and compliance.1
Privacy
Privacy is critical when storing sensitive information.
With SDF’s smart classifiers propagation, it is easier than
ever to track PII and other privacy related concerns.Open Now, when we compile, we can see the PII across our data warehouse.For example, the
metadata/raw/raw_customers.sdf.yml
and uncomment
all classifier sections in the file. They should look like this:metadata/raw/raw_customers.sdf.yml
customers
table compile output should look like this:Schema moms_flower_shop.staging.customers
┌───────────────┬───────────┬─────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════╪═══════════╪═════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ customer_id ┆ bigint ┆ ┆ A unique identifier of a mom’s floewr shop customer │
│ first_name ┆ varchar ┆ PII.name ┆ The first name of the customer │
│ last_name ┆ varchar ┆ PII.name ┆ The last name of the customer │
│ full_name ┆ varchar ┆ PII.name ┆ │
│ email ┆ varchar ┆ PII.email ┆ The emial of the customer │
│ gender ┆ varchar ┆ PII.gender ┆ The gender of the customer │
│ campaign_id ┆ bigint ┆ ┆ The identifier of the campaign that is associated with the event │
│ campaign_name ┆ varchar ┆ ┆ The campaign name associated with the campaign_id │
│ campaign_type ┆ varchar ┆ ┆ A substring of the campaign name contain the campaign type. Supported types - instagram_ads, friends_referrals, facebook_ads, google_search │
│ ┆ ┆ ┆ │
│ address_id ┆ bigint ┆ PII.address ┆ The address_id of the address associated with the customer │
│ full_address ┆ varchar ┆ ┆ The full address associated with the address_id │
│ state ┆ varchar ┆ ┆ The address US state associated with the address_id │
└───────────────┴───────────┴─────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
2
Retention
We can set up table level and column level retention classifiers.
Let’s look at a table level example.In the We can assign short term retention to our raw tables, while keeping infinite
retention for any analytics tables.For each raw table metadata found in To each analytics table metadata found in Notice that these classifiers are defined not to propagate downstream
using the flag Compile to see the results:For example, we can look at the Or
table_classifiers.sdf.yml
file you will find a retention classifier:classifications/table_classifiers.sdf.yml
metadata/raw/*
, add:metadata/analytics/*
, add:propagate: false
:raw_addresses
output
from the first command:Schema moms_flower_shop.raw.raw_addresses
All relevant information related to street addresses known to mom’s flower shop. This information comes from the user input into the mobile app.
[RETENTION.d7]
┌───────────────┬───────────┬────────────┬──────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════╪═══════════╪════════════╪══════════════════════════════════════════════════════════╡
│ index ┆ bigint ┆ ┆ Row number │
│ address_id ┆ bigint ┆ ┆ A unique identifier of an address │
│ full_address ┆ varchar ┆ ┆ The full address associated with the address_id │
│ street_number ┆ bigint ┆ ┆ The address street number associated with the address_id │
│ street_name ┆ varchar ┆ ┆ The address street name associated with the address_id │
│ state ┆ varchar ┆ ┆ The address US state associated with the address_id │
│ city ┆ varchar ┆ ┆ The address US city associated with the address_id │
└───────────────┴───────────┴────────────┴──────────────────────────────────────────────────────────┘
dim_marketing_campaigns
from the second:Schema moms_flower_shop.analytics.dim_marketing_campaigns
[RETENTION.infinity]
┌───────────────────────────────┬───────────┬─────────────┬──────────────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════════════════════╪═══════════╪═════════════╪══════════════════════════════════════════════════════════════════╡
│ campaign_id ┆ bigint ┆ ┆ The identifier of the campaign that is associated with the event │
│ campaign_name ┆ varchar ┆ ┆ The campaign name associated with the campaign_id │
│ total_num_installs ┆ bigint ┆ EVENT.inapp ┆ │
│ avg_customer_acquisition_cost ┆ double ┆ EVENT.inapp ┆ │
│ install_duration_ratio ┆ bigint ┆ EVENT.inapp ┆ │
└───────────────────────────────┴───────────┴─────────────┴──────────────────────────────────────────────────────────────────┘
3
Reports
To learn more on how to generate reports of warehouse health, see
our extended reports guide.
Summary
This tutorial only shows the tip of the iceberg of what you can do with our semantic understanding. Anything that’s possible with SQL is possible as a check or report against the information schema.We created the information schema to support custom checks and reports.
You can find the full description of SDF’s information schema here.