Enriching your warehouse
Overview
In the previous tutorial, we set up guardrails based on checks. In this tutorial, we will see how SDF’s semantic understanding can help transform your data warehouse from strings and numbers to real-world business logic:
- Maintain business logic consistency
- Control development environment and minimize mistakes from propagating
- Comply with privacy and GDPR requirements
Let’s see how we could’ve prevented the V1 version of app_installs in the first place.
Prerequisites
Completion of the previous tutorial.
Guide
Setup
If you haven’t completed the previous tutorial, uncomment the relevant section to reference the metadata files:
Intro to SDF Classifiers
SDF’s has the ability to annotate columns and tables with user defined types which represent real-world business logic. Those types enrich the data warehouse and create new SQL types - instead of just BIGINTs, we can now have currencies, different types of IDs, zip-codes, and many more.
SDF automatically propagates those types to downstream assets, enriching the entire data warehouse with a new layer of semantic understanding.
To learn more about SDF’s classifiers, read our classifiers guide.
Creating New Classifiers
Let’s focus back on Mom’s Flower Shop.
If you recall, V1 of app_installs
had an incorrect JOIN
between
mobile app in-app events in the raw_inapp_events
table, and marketing
campaign events in the raw_marketing_campaign_events
table:
Essentially, we were joining two elements that are completely different. Like comparing Apples to Oranges. These kind of mistakes happen all the time. Thankfully, we can leverage SDF’s semantic understanding and smart propagation to set guardrails which will prevent future similar mistakes.
Let’s use SDF classifiers to add the missing business logic.
The column classifiers file classifications/column_classifiers.sdf.yml
already contains
the event classifiers. Take a look yourself:
Assign Classifiers to Source Tables
To assign the classifiers, uncomment the relevant section in each of the files:
Notice that we only assigned the classifiers to the source raw
tables.
Propagate Classifiers
SDF actually propagates the classifiers automatically, so no extra steps are required.
Let’s compile to view our tables metadata:
Although we didn’t assign the classifiers to dim_marketing_campaigns
,
we can see some of its columns inherited the classifier EVENT.inapp
.
Schema moms_flower_shop.analytics.dim_marketing_campaigns
┌───────────────────────────────┬───────────┬─────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════════════════════╪═══════════╪═════════════╪═════════════╡
│ campaign_id ┆ bigint ┆ ┆ │
│ campaign_name ┆ varchar ┆ ┆ │
│ total_num_installs ┆ bigint ┆ EVENT.inapp ┆ │
│ avg_customer_acquisition_cost ┆ double ┆ EVENT.inapp ┆ │
│ install_duration_ratio ┆ bigint ┆ EVENT.inapp ┆ │
└───────────────────────────────┴───────────┴─────────────┴─────────────┘
Furthermore, we can see that app_installs
inherited both EVENT.inapp
and EVENT.marketing
. This is due to the incorrect JOIN
we found.
Schema moms_flower_shop.staging.app_installs
This table is a staging table which adds campaign information to app install events
[TABLE_STATUS.deprecated]
┌───────────────┬───────────┬─────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════╪═══════════╪═════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ event_id ┆ bigint ┆ EVENT.inapp EVENT.marketing ┆ A unique identifier of the marketing campaign event │
│ customer_id ┆ bigint ┆ ┆ │
│ install_time ┆ timestamp ┆ ┆ │
│ platform ┆ varchar ┆ ┆ │
│ campaign_id ┆ bigint ┆ ┆ The identifier of the campaign that is associated with the event │
│ campaign_name ┆ varchar ┆ ┆ The campaign name associated with the campaign_id │
│ campaign_type ┆ varchar ┆ ┆ A substring of the campaign name contain the campaign type. Supported types - instagram_ads, friends_referrals, facebook_ads, google_search │
│ ┆ ┆ ┆ │
└───────────────┴───────────┴─────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
Set Up Guardrails
We can create custom checks by querying SDF’s information schema. To prevent this mistake from repeating, we can set up a check.
First, create a file called mixed_event_ids.sql
under the folder
checks
. Add the following SQL to the new file:
Let’s run it:
This check will fail because app_installs
still has an
incorrect JOIN
.
[Fail] Check moms_flower_shop.pub.mixed_event_ids
┌──────────────┬─────────────┬────────────────────────────────┐
│ table_name ┆ column name ┆ classifiers │
╞══════════════╪═════════════╪════════════════════════════════╡
│ app_installs ┆ event_id ┆ [EVENT.inapp, EVENT.marketing] │
└──────────────┴─────────────┴────────────────────────────────┘
1 rows.
Deprecate App Installs
In the previous tutorial, we already resolved
any downstream dependencies of app_installs
and we now fully support the newer
version, app_installs_v2
. It is safe to deprecate the model - just delete the file.
Or course, you can run sdf compile
to validate the change.
Bonus
Classifiers can enrich your data warehouse in many ways. The following are just a few examples of added information layers to your static tables.
With each example, you can create checks and reports to monitor your warehouse’s health and compliance.
Privacy
Privacy is critical when storing sensitive information. With SDF’s smart classifiers propagation, it is easier than ever to track PII and other privacy related concerns.
Open metadata/raw/raw_customers.sdf.yml
and uncomment
all classifier sections in the file. They should look like this:
Now, when we compile, we can see the PII across our data warehouse.
For example, the customers
table compile output should look like this:
Schema moms_flower_shop.staging.customers
┌───────────────┬───────────┬─────────────┬────────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════╪═══════════╪═════════════╪════════════════════════════════════════════════════════════╡
│ customer_id ┆ bigint ┆ ┆ A unique identifier of a mom’s floewr shop customer │
│ first_name ┆ varchar ┆ PII.name ┆ The first name of the customer │
│ last_name ┆ varchar ┆ PII.name ┆ The last name of the customer │
│ full_name ┆ varchar ┆ PII.name ┆ │
│ email ┆ varchar ┆ PII.email ┆ The emial of the customer │
│ gender ┆ varchar ┆ PII.gender ┆ The gender of the customer │
│ campaign_id ┆ bigint ┆ ┆ │
│ campaign_name ┆ varchar ┆ ┆ │
│ campaign_type ┆ varchar ┆ ┆ │
│ address_id ┆ bigint ┆ PII.address ┆ The address_id of the address associated with the customer │
│ full_address ┆ varchar ┆ ┆ The full address associated with the address_id │
│ state ┆ varchar ┆ ┆ The address US state associated with the address_id │
└───────────────┴───────────┴─────────────┴────────────────────────────────────────────────────────────┘
Retention
We can set up table level and column level retention classifiers. Let’s look at a table level example.
In the table_classifiers.sdf.yml
file you will find a retention classifier:
We can assign short term retention to our raw tables, while keeping infinite retention for any analytics tables.
For each raw table metadata found in metadata/raw/*
, add:
To each analytics table metadata found in metadata/analytics/*
, add:
Notice that these classifiers are defined not to propagate downstream
using the flag propagate: false
:
Compile to see the results:
For example, we can look at the raw_addresses
output
from the first command:
Schema moms_flower_shop.raw.raw_addresses
All relevant information related to street addresses known to mom’s flower shop. This information comes from the user input into the mobile app.
[RETENTION.d7]
┌───────────────┬───────────┬────────────┬──────────────────────────────────────────────────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════╪═══════════╪════════════╪══════════════════════════════════════════════════════════╡
│ index ┆ bigint ┆ ┆ Row number │
│ address_id ┆ bigint ┆ ┆ A unique identifier of an address │
│ full_address ┆ varchar ┆ ┆ The full address associated with the address_id │
│ street_number ┆ bigint ┆ ┆ The address street number associated with the address_id │
│ street_name ┆ varchar ┆ ┆ The address street name associated with the address_id │
│ state ┆ varchar ┆ ┆ The address US state associated with the address_id │
│ city ┆ varchar ┆ ┆ The address US city associated with the address_id │
└───────────────┴───────────┴────────────┴──────────────────────────────────────────────────────────┘
Or dim_marketing_campaigns
from the second:
Schema moms_flower_shop.analytics.dim_marketing_campaigns
[RETENTION.infinity]
┌───────────────────────────────┬───────────┬─────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═══════════════════════════════╪═══════════╪═════════════╪═════════════╡
│ campaign_id ┆ bigint ┆ ┆ │
│ campaign_name ┆ varchar ┆ ┆ │
│ total_num_installs ┆ bigint ┆ EVENT.inapp ┆ │
│ avg_customer_acquisition_cost ┆ double ┆ EVENT.inapp ┆ │
│ install_duration_ratio ┆ bigint ┆ EVENT.inapp ┆ │
└───────────────────────────────┴───────────┴─────────────┴─────────────┘
Reports
To learn more on how to generate reports of warehouse health, see our extended reports guide.
Summary
This tutorial only shows the tip of the iceberg of what you can do with our semantic understanding. Anything that’s possible with SQL is possible as a check or report against the information schema.
We created the information schema to support custom checks and reports. You can find the full description of SDF’s information schema here.