Checks
Understanding Checks
Checks are an example of Metaprogramming. SDF takes your SQL workspace, classifiers, and lineage and creates a rich information schema whenever you run sdf compile
.
Checks are simply small SQL queries that run against this information schema and check for the existence of certain properties which you want to assure. Some examples of effective checks are:
- Ensure Data Privacy: Ensure all personally identifiable information (PII) is appropriately anonymized
- Ensure Data Ownership: Guarantee every table has an owner (a staple of GDPR)
- Ensure Data Quality: Prevent different currency types from combining in calculations (e.g. preventing £ + $ )
Writing Your First Check
Let’s write and modify a check, starting with one of the sdf samples.
Create a Sample Workspace
In the terminal, run:
Created hello_with_pii/.gitignore
Created hello_with_pii/checks/code_check.sql
Created hello_with_pii/models/main.sql
Created hello_with_pii/workspace.sdf.yml
Welcome to your new SDF Workspace! To help you on your journey:
💡 Join the SDF Community Slack -> https://sdf.com/join
📚 Read the Docs to Get Started -> https://docs.sdf.com/
Finished new in 0.210 secs
This sample workspace contains a model, and a classifier, and a check.
Include Checks in Workspace
Add the checks’ path to woskspace.sdf.yml
.
Both checks and reports must be written in the Trino SQL dialect.
Run the Check
To run the check, simply execute sdf check
. During compile, SDF runs all its static analysis on the workspace, creating lineage, metadata, and ultimately executing the checks.
Hurray! The check passes.
Working set 1 model file, 1 .sdf file
Compiling hello.pub.main (./models/main.sql)
Working set 1 check file, 1 .sdf file
Checking hello.pub.code_check (./checks/code_check.sql)
Finished 1 model [1 succeeded], 1 check [1 passed] in 1.240 secs
[Pass] Check hello.pub.code_check
But, carefully read the check.
It queries for any column that contains PII.name and asserts that this query should be empty. But we never added the classifier in the first place!
Add a Classifier
To add the classifier and change the behavior, uncomment the following classifier later in the workspace.
Running sdf check
again will result in a failed test. The output will look like this:
Working set 1 model file, 1 .sdf file
Compiling hello.pub.main (./models/main.sql)
Working set 1 check file, 1 .sdf file
Checking hello.pub.code_check (./checks/code_check.sql)
Finished 1 model [1 succeeded], 1 check [1 failed] in 1.262 secs, for details see below.
[Fail] Check hello.pub.code_check
┌────────────┬─────────────┬─────────────┐
│ table_name ┆ column name ┆ classifiers │
╞════════════╪═════════════╪═════════════╡
│ main ┆ column_2 ┆ [PII.name] │
└────────────┴─────────────┴─────────────┘
1 rows.
-------
Summary 1 model [1 succeeded], 1 check [1 failed] in 1.262 secs.
-------
Modifying Classifiers
Let’s fix this failing check! We can apply an md5()
hash to column_2 (the name column) to obfuscate it. This is called a reclassification since we are transforming one classifier into another.
Add a Reclassifying Function
Let’s add a function block that attaches a reclassify statement to the built-in md5 function. In your workspace.sdf.yml
create the following function block.
Create a New Model
Now, create a second table that uses the md5 function. In the same directory as main.sql, create sink.sql with the line
Update Check
Update the code_check.sql with the following line and c.table_name like '%.sink'
. It should now look like the check below.
Now, the Check specifically checks the sink table for the PII.name classifier.
Run sdf check
again and voila! All checks pass. The source table has PII.name, the sink table does not.
Working set 2 model files, 1 .sdf file
Compiling hello.pub.main (./models/main.sql)
Compiling hello.pub.sink (./models/sink.sql)
Working set 1 check file, 1 .sdf file
Checking hello.pub.code_check (./checks/code_check.sql)
Finished 2 models [2 succeeded], 1 check [1 passed] in 1.254 secs
[Pass] Check hello.pub.code_check
Advanced Checks
- To learn more about transforming classifiers across functions, see: UDFs
- To learn more about the SDF internal information schema, see: Information Schema