Enriching your warehouse

Overview

In the previous tutorial, we set up guardrails based on checks. In this tutorial, we will see how SDF’s semantic understanding can help transform your data warehouse from strings and numbers to real-world business logic:

Maintain business logic consistency
Control development environment and minimize mistakes from propagating
Comply with privacy and GDPR requirements

Let’s see how we could’ve prevented the V1 version of app_installs in the first place.

Prerequisites

Completion of the previous tutorial.

Guide

Setup

If you haven’t completed the previous tutorial, uncomment the relevant section to reference the metadata files:

workspace.sdf.yml

...
# Uncomment here to begin the "Deprecating a Model" tutorial >>>>>>
- path: classifications  # Classifiers
    type: metadata
- path: reports  # Reports based on SDF s information schema
    type: report
- path: checks  # Checks against SDF's information schema
    type: check
# <<<<<<<

Intro to SDF Classifiers

SDF’s has the ability to annotate columns and tables with user defined types which represent real-world business logic. Those types enrich the data warehouse and create new SQL types - instead of just BIGINTs, we can now have currencies, different types of IDs, zip-codes, and many more.

SDF automatically propagates those types to downstream assets, enriching the entire data warehouse with a new layer of semantic understanding.

To learn more about SDF’s classifiers, read our classifiers guide.

Creating New Classifiers

Let’s focus back on Mom’s Flower Shop.

If you recall, V1 of app_installs had an incorrect JOIN between mobile app in-app events in the raw_inapp_events table, and marketing campaign events in the raw_marketing_campaign_events table:

...
FROM inapp_events i 
    LEFT OUTER JOIN raw.raw_marketing_campaign_events m
        ON (i.event_id = m.event_id) 
...

Essentially, we were joining two elements that are completely different. Like comparing Apples to Oranges. These kind of mistakes happen all the time. Thankfully, we can leverage SDF’s semantic understanding and smart propagation to set guardrails which will prevent future similar mistakes.

Let’s use SDF classifiers to add the missing business logic.

The column classifiers file classifications/column_classifiers.sdf.yml already contains the event classifiers. Take a look yourself:

classifications/column_classifiers.sdf.yml

classifier:
    name: EVENT
    labels:
        - name: inapp
        - name: marketing

Assign Classifiers to Source Tables

To assign the classifiers, uncomment the relevant section in each of the files:

metadata/raw/raw_inapp_events.sdf.yml

...
- name: event_id
    description: A unique identifier of an in-app event within mom's flower shop's mobile app
    # Uncomment to begin the "Enriching your Warehouse" tutorial >>>>>
    classifiers:
        - EVENT.inapp
    # <<<<<
...

metadata/raw/raw_marketing_campaign_events.sdf.yml

- name: event_id
    description: A unique identifier of the marketing campaign event
    # Uncomment to begin the "Enriching your Warehouse" tutorial >>>>>
    classifiers:
        - EVENT.marketing
    # <<<<<

Notice that we only assigned the classifiers to the source raw tables.

Propagate Classifiers

SDF actually propagates the classifiers automatically, so no extra steps are required.

Let’s compile to view our tables metadata:

sdf compile --show result

Although we didn’t assign the classifiers to dim_marketing_campaigns, we can see some of its columns inherited the classifier EVENT.inapp.

sdf compile analytics.dim_marketing_campaigns --show result

Schema moms_flower_shop.analytics.dim_marketing_campaigns
┌───────────────────────────────┬───────────┬─────────────┬──────────────────────────────────────────────────────────────────┐
│ column_name                   ┆ data_type ┆ classifier  ┆ description                                                      │
╞═══════════════════════════════╪═══════════╪═════════════╪══════════════════════════════════════════════════════════════════╡
│ campaign_id                   ┆ bigint    ┆             ┆ The identifier of the campaign that is associated with the event │
│ campaign_name                 ┆ varchar   ┆             ┆ The campaign name associated with the campaign_id                │
│ total_num_installs            ┆ bigint    ┆ EVENT.inapp ┆                                                                  │
│ avg_customer_acquisition_cost ┆ double    ┆ EVENT.inapp ┆                                                                  │
│ install_duration_ratio        ┆ bigint    ┆ EVENT.inapp ┆                                                                  │
└───────────────────────────────┴───────────┴─────────────┴──────────────────────────────────────────────────────────────────┘

Furthermore, we can see that app_installs inherited both EVENT.inapp and EVENT.marketing. This is due to the incorrect JOIN we found.

sdf compile staging.app_installs --show result

Schema moms_flower_shop.staging.app_installs
       This table is a staging table which adds campaign information to app install events
       [TABLE_STATUS.deprecated]
┌───────────────┬───────────┬─────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ column_name   ┆ data_type ┆ classifier                  ┆ description                                                                                                                                  │
╞═══════════════╪═══════════╪═════════════════════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ event_id      ┆ bigint    ┆ EVENT.inapp EVENT.marketing ┆                                                                                                                                              │
│ customer_id   ┆ bigint    ┆                             ┆ The identifier of the customer that performed the event                                                                                      │
│ install_time  ┆ timestamp ┆                             ┆                                                                                                                                              │
│ platform      ┆ varchar   ┆                             ┆ iOS or Android                                                                                                                               │
│ campaign_id   ┆ bigint    ┆                             ┆ The identifier of the campaign that is associated with the event                                                                             │
│ campaign_name ┆ varchar   ┆                             ┆ The campaign name associated with the campaign_id                                                                                            │
│ campaign_type ┆ varchar   ┆                             ┆ A substring of the campaign name contain the campaign type.  Supported types - instagram_ads, friends_referrals, facebook_ads, google_search │
│               ┆           ┆                             ┆                                                                                                                                              │
└───────────────┴───────────┴─────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Set Up Guardrails

We can create custom checks by querying SDF’s information schema. To prevent this mistake from repeating, we can set up a check.

First, create a file called mixed_event_ids.sql under the folder checks. Add the following SQL to the new file:

checks/mixed_event_ids.sql

SELECT
    DISTINCT c.table_name as "table_name",
    c.column_name as "column name",
    c.classifiers
FROM
    sdf.information_schema.columns c
WHERE
    -- more than one EVENT classifier is assigned
    CAST(c.classifiers AS VARCHAR) LIKE '%EVENT%EVENT%'

Let’s run it:

sdf check mixed_event_ids --show result

This check will fail because app_installs still has an incorrect JOIN.

[Fail] Check moms_flower_shop.pub.mixed_event_ids
┌──────────────┬─────────────┬────────────────────────────────┐
│ table_name   ┆ column name ┆ classifiers                    │
╞══════════════╪═════════════╪════════════════════════════════╡
│ app_installs ┆ event_id    ┆ [EVENT.inapp, EVENT.marketing] │
└──────────────┴─────────────┴────────────────────────────────┘
1 rows.

Deprecate App Installs

In the previous tutorial, we already resolved any downstream dependencies of app_installs and we now fully support the newer version, app_installs_v2. It is safe to deprecate the model - just delete the file.

Or course, you can run sdf compile to validate the change.

Bonus

Classifiers can enrich your data warehouse in many ways. The following are just a few examples of added information layers to your static tables.

With each example, you can create checks and reports to monitor your warehouse’s health and compliance.

Privacy

Privacy is critical when storing sensitive information. With SDF’s smart classifiers propagation, it is easier than ever to track PII and other privacy related concerns.

Open metadata/raw/raw_customers.sdf.yml and uncomment all classifier sections in the file. They should look like this:

metadata/raw/raw_customers.sdf.yml

...
# Uncomment to begin the "Enriching your Warehouse" tutorial >>>>>
classifiers:  <-- uncomment this part
    - name: PII.[name/address/gender/email]  <-- uncomment this part
# <<<<<
...

Now, when we compile, we can see the PII across our data warehouse.

sdf compile --show all

For example, the customers table compile output should look like this:

sdf compile staging.customers --show result

Schema moms_flower_shop.staging.customers
┌───────────────┬───────────┬─────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ column_name   ┆ data_type ┆ classifier  ┆ description                                                                                                                                  │
╞═══════════════╪═══════════╪═════════════╪══════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════════╡
│ customer_id   ┆ bigint    ┆             ┆ A unique identifier of a mom’s floewr shop customer                                                                                          │
│ first_name    ┆ varchar   ┆ PII.name    ┆ The first name of the customer                                                                                                               │
│ last_name     ┆ varchar   ┆ PII.name    ┆ The last name of the customer                                                                                                                │
│ full_name     ┆ varchar   ┆ PII.name    ┆                                                                                                                                              │
│ email         ┆ varchar   ┆ PII.email   ┆ The emial of the customer                                                                                                                    │
│ gender        ┆ varchar   ┆ PII.gender  ┆ The gender of the customer                                                                                                                   │
│ campaign_id   ┆ bigint    ┆             ┆ The identifier of the campaign that is associated with the event                                                                             │
│ campaign_name ┆ varchar   ┆             ┆ The campaign name associated with the campaign_id                                                                                            │
│ campaign_type ┆ varchar   ┆             ┆ A substring of the campaign name contain the campaign type.  Supported types - instagram_ads, friends_referrals, facebook_ads, google_search │
│               ┆           ┆             ┆                                                                                                                                              │
│ address_id    ┆ bigint    ┆ PII.address ┆ The address_id of the address associated with the customer                                                                                   │
│ full_address  ┆ varchar   ┆             ┆ The full address associated with the address_id                                                                                              │
│ state         ┆ varchar   ┆             ┆ The address US state associated with the address_id                                                                                          │
└───────────────┴───────────┴─────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Retention

We can set up table level and column level retention classifiers. Let’s look at a table level example.

In the table_classifiers.sdf.yml file you will find a retention classifier:

classifications/table_classifiers.sdf.yml

classifier: 
    name: RETENTION
    labels:
        - name: d7
        - name: d30
        - name: d90
        - name: d180
        - name: infinity

We can assign short term retention to our raw tables, while keeping infinite retention for any analytics tables.

For each raw table metadata found in metadata/raw/*, add:

...
classifiers:
    - name: RETENTION.d7
...

To each analytics table metadata found in metadata/analytics/*, add:

...
classifiers:
    - name: RETENTION.infinity
...

Notice that these classifiers are defined not to propagate downstream using the flag propagate: false:

classifier:
    ...
    propagate: false
...

Compile to see the results:

sdf compile models/raw/ --show result

For example, we can look at the raw_addresses output from the first command:

Schema moms_flower_shop.raw.raw_addresses
       All relevant information related to street addresses known to mom’s flower shop.  This information comes from the user input into the mobile app.
       [RETENTION.d7]
┌───────────────┬───────────┬────────────┬──────────────────────────────────────────────────────────┐
│ column_name   ┆ data_type ┆ classifier ┆ description                                              │
╞═══════════════╪═══════════╪════════════╪══════════════════════════════════════════════════════════╡
│ index         ┆ bigint    ┆            ┆ Row number                                               │
│ address_id    ┆ bigint    ┆            ┆ A unique identifier of an address                        │
│ full_address  ┆ varchar   ┆            ┆ The full address associated with the address_id          │
│ street_number ┆ bigint    ┆            ┆ The address street number associated with the address_id │
│ street_name   ┆ varchar   ┆            ┆ The address street name associated with the address_id   │
│ state         ┆ varchar   ┆            ┆ The address US state associated with the address_id      │
│ city          ┆ varchar   ┆            ┆ The address US city associated with the address_id       │
└───────────────┴───────────┴────────────┴──────────────────────────────────────────────────────────┘

Or dim_marketing_campaigns from the second:

Schema moms_flower_shop.analytics.dim_marketing_campaigns
       [RETENTION.infinity]
┌───────────────────────────────┬───────────┬─────────────┬──────────────────────────────────────────────────────────────────┐
│ column_name                   ┆ data_type ┆ classifier  ┆ description                                                      │
╞═══════════════════════════════╪═══════════╪═════════════╪══════════════════════════════════════════════════════════════════╡
│ campaign_id                   ┆ bigint    ┆             ┆ The identifier of the campaign that is associated with the event │
│ campaign_name                 ┆ varchar   ┆             ┆ The campaign name associated with the campaign_id                │
│ total_num_installs            ┆ bigint    ┆ EVENT.inapp ┆                                                                  │
│ avg_customer_acquisition_cost ┆ double    ┆ EVENT.inapp ┆                                                                  │
│ install_duration_ratio        ┆ bigint    ┆ EVENT.inapp ┆                                                                  │
└───────────────────────────────┴───────────┴─────────────┴──────────────────────────────────────────────────────────────────┘

Reports

To learn more on how to generate reports of warehouse health, see our extended reports guide.

Summary

This tutorial only shows the tip of the iceberg of what you can do with our semantic understanding. Anything that’s possible with SQL is possible as a check or report against the information schema.

We created the information schema to support custom checks and reports. You can find the full description of SDF’s information schema here.

Classifiers

Learn all the ways classifiers can help enhance your workspace

Checks

Learn more about setting up checks for static analysis

Checks

Learn more about checks to run impact analysis on your data warehouse

Introduction

Tutorials

Guides

Integrations

Enriching your warehouse

Overview

Prerequisites

Guide

Bonus

Summary

Classifiers

Checks

Checks

Introduction

Tutorials

Guides

Integrations

​Overview

​Prerequisites

​Guide

​Bonus

​Summary

​Related Topics

Classifiers

Checks

Checks

Overview

Prerequisites

Guide

Bonus

Summary

Related Topics