Ensuring data quality
Overview
In the previous tutorials we learned how SDF can help enforce metadata-based data contracts, which are defined against SDF’s information schema.
In this tutorial, we will add tests against the data in our warehouse, creating an additional layer of data quality validation throughout the data warehouse.
SDF Provides a standard and open-source testing library. The functions provided in the library are included natively in SDF.
All SDF Tests for all columns are compiled into table scans. This allows SDF to run multiple tests with a single scan. This dramatically reduces the cost of data tests.
Prerequisites
Completion of the previous tutorial.
Guide
What Are SDF Tests?
There are three types of builtin tests.
- Scalar Column Tests - validate expectations of individual values in columns (eg.
x > 100
) - Aggregate Column Tests - validate expectations of all values in a columns (eg.
sum(x) > 100
) - Table Tests - validate expectations for all columns in table (eg. columns a and b are unique).
In this tutorial, we will create one test of each type.
To learn more about SDF’s testing capabilities, read our Tests Guide.
Setup
Since SDF’s tests library leverages Jinja, we also need to add Jinja as a preprocessor.
In the workspace.sdf.yml
file, uncomment the following:
For the sake of this tutorial, we will add tests on the inapp_events
staging model located in
models/staging/inapp_events
.
To add tests to the model, we need to create a YML file to hold the model’s metadata.
Under metadata/staging
, create a new file called inapp_events.sdf.yml
containing the following definition:
In this tutorial, we use the metadata file to define tests, but we’ve seen in past tutorials how the table metadata file is also used for table and column descriptions and classifications’ application.
Scalar Column Tests
Let’s say we want to verify there are no negative orders, meaning the event_value which represents the total order amount in USD.
We can use the scalar test valid_scalar(condition)
where condition = event_value >=0
.
Add to the newly created YML file the following test:
A few things to note:
- This is a column-level test so we need to add a column definition
- To specify the condition, we need to wrap it with triple quotes
"""condition"""
- We can define a severity level for the test - either error or warning
To execute the test, simply run the command:
[Pass] Test moms_flower_shop.staging.test_inapp_events
Aggregate Column Tests
We can write a similar test as an aggregate column test. Instead of validating
each value individually, we can just look at the minimum value of event_value
and assert whether it is positive.
Let’s add the new test to the YML file:
All types of tests work together harmoniously and will be executed within one table scan to save on compute costs.
Let’s run the test again:
[Pass] Test moms_flower_shop.staging.test_inapp_events
Table Tests
On a table level, we want to make sure that our unique key is indeed unique. In this case, the table key is event_id. However, in other cases, it could be a combination of multiple columns. SDF supports all scenarios.
Let’s add this table-level test to the YML file:
Notice we wrap the column name with quotation marks "col_name"
.
We can run the tests again:
[Pass] Test moms_flower_shop.staging.test_inapp_events
Summary
Quality tests against your data are critical to ensure that the data is correct and meets the expectations of data consumers.
SDF’s built-in open-source tests lib makes it super easy to add rule-based data quality checks to your data warehouse.