Understanding Checks

Checks are an example of Metaprogramming. SDF takes your SQL workspace, classifiers, and lineage and creates a rich information schema whenever you run sdf compile.

Checks are simply small SQL queries that run against this information schema and check for the existence of certain properties which you want to assure. Some examples of effective checks are:

  • Ensure Data Privacy: Ensure all personally identifiable information (PII) is appropriately anonymized
  • Ensure Data Ownership: Guarantee every table has an owner (a staple of GDPR)
  • Ensure Data Quality: Prevent different currency types from combining in calculations (e.g. preventing £ + $ )

Writing Your First Check

Let’s write and modify a check, starting with one of the sdf samples.

1

Create a Sample Workspace

In the terminal, run:

sdf new --sample hello_with_pii && cd hello_with_pii

    Created hello_with_pii/.gitignore     Created hello_with_pii/checks/code_check.sql     Created hello_with_pii/models/main.sql     Created hello_with_pii/workspace.sdf.yml           Welcome to your new SDF Workspace! To help you on your journey:                  💡 Join the SDF Community Slack -> https://join.slack.com/t/sdfcommunity/shared_invite/zt-2lkokt7ed-xIR0_pcmUOU3~CebxuqeKg                   📚 Read the Docs to Get Started -> https://docs.sdf.com/             Finished new in 0.296 secs

This sample workspace contains a model, and a classifier, and a check.

2

Include Checks in Workspace

Add the checks’ path to woskspace.sdf.yml.

  workspace:
    name: hello
    edition: 1.2
    defaults:
      dialect: trino
    includes:
      - path: models
      # ADD THE BELOW #
      - path: checks
        type: check

Both checks and reports must be written in the Trino SQL dialect.

3

Run the Check

To run the check, simply execute sdf check. During compile, SDF runs all its static analysis on the workspace, creating lineage, metadata, and ultimately executing the checks.

Hurray! The check passes.

Working set 1 model file, 1 .sdf file   Compiling hello.pub.main (./models/main.sql) Working set 1 test file, 1 .sdf file    Checking hello.pub.code_check (./checks/code_check.sql)    Finished 1 model [1 succeeded], 1 check [1 passed] in 1.280 secs [Pass] Check hello.pub.code_check

But, carefully read the check.

SELECT
    DISTINCT c.table_name as "table name",
    c.column_name as "column name",
    c.classifiers
FROM
    sdf.information_schema.columns c
WHERE
    CONTAINS_ARRAY_VARCHAR(c.classifiers, 'PII.name')

It queries for any column that contains PII.name and asserts that this query should be empty. But we never added the classifier in the first place!

4

Add a Classifier

To add the classifier and change the behavior, uncomment the following classifier later in the workspace.

table:
  name: main
  columns:
    - name: column_2
      # UNCOMMENT THE BELOW
      classifiers:
        - PII.name

Running sdf check again will result in a failed test. The output will look like this:

Working set 1 model file, 1 .sdf file   Compiling hello.pub.main (./models/main.sql) Working set 1 test file, 1 .sdf file    Checking hello.pub.code_check (./checks/code_check.sql)    Finished 1 model [1 succeeded], 1 check [1 failed] in 1.283 secs, for details see below.

[Fail] Check hello.pub.code_check ┌────────────┬─────────────┬─────────────┐ │ table_name ┆ column name ┆ classifiers │ ╞════════════╪═════════════╪═════════════╡ │ main       ┆ column_2    ┆ [PII.name]  │ └────────────┴─────────────┴─────────────┘ 1 rows.

------- Summary 1 model [1 succeeded], 1 check [1 failed] in 1.283 secs. -------

Modifying Classifiers

Let’s fix this failing check! We can apply an md5() hash to column_2 (the name column) to obfuscate it. This is called a reclassification since we are transforming one classifier into another.

1

Add a Reclassifying Function

Let’s add a function block that attaches a reclassify statement to the built-in md5 function. In your workspace.sdf.yml create the following function block.

---
function:
  name: md5
  reclassify:
    - from: PII.name
2

Create a New Model

Now, create a second table that uses the md5 function. In the same directory as main.sql, create sink.sql with the line

sink.sql
SELECT MD5(CAST (column_2 AS VARBINARY)) FROM main;
3

Update Check

Update the code_check.sql with the following line and c.table_name like '%.sink'. It should now look like the check below. Now, the Check specifically checks the sink table for the PII.name classifier.

code_check.sql
SELECT
    DISTINCT c.table_name as "table_name",
    c.column_name as "column name",
    c.classifiers
FROM
    sdf.information_schema.columns c
WHERE
    CONTAINS_ARRAY_VARCHAR(c.classifiers, 'PII.name')
    AND c.table_name like '%.sink'

Run sdf check again and voila! All checks pass. The source table has PII.name, the sink table does not.

Working set 2 model files, 1 .sdf file   Compiling hello.pub.main (./models/main.sql)   Compiling hello.pub.sink (./models/sink.sql) Working set 1 test file, 1 .sdf file    Checking hello.pub.code_check (./checks/code_check.sql)    Finished 2 models [2 succeeded], 1 check [1 passed] in 1.278 secs [Pass] Check hello.pub.code_check

Advanced Checks

  • To learn more about transforming classifiers across functions, see: UDFs
  • To learn more about the SDF internal information schema, see: Information Schema