Tests
Understanding Tests
Tests are a commonly used data quality feature allowing you to scan results of queries and set expectations of your data. You can validate that a particular column is never null, or within a certain thresholding range, between a certain set of values, and more.
SDF provides a standard testing library. The functions provided in the library are included natively in SDF.
All SDF Tests for all columns are compiled into table scans. This allows SDF to run multiple tests with a single scan. This dramatically reduces the cost of data tests.
An SDF data test is a simple assertion test that runs a SQL query against your database and, if any data is returned, fails. The simplest example might be a non-null test on a column.
This would look like:
When sdf test
runs this query, it will check if any data is returned and fail if so.
Working set 1 model file, 1 test file, 1 .sdf file
Running hello.pub.main (./models/main.sql)
Testing hello.pub.test_main (./sdftarget/dbg/tests/hello/pub/test_main.sql)
Finished 1 model [1 succeeded], 1 test [1 passed] in 0.775 secs
[Pass] Test hello.pub.test_main
There are three types of tests builtin.
- Scalar Column Tests - validate expectations of individual values in columns (eg.
x > 100
) - Aggregate Column Tests - validate expectations of all values in a columns (eg.
sum(x) > 100
) - Table Tests - validate expectations for all columns in a table (eg. columns a and b are unique).
Severity
Each test may be configured with a severity flag. By default all tests have a severity of error meaning that a constraint violation will result in a non-zero exit code for sdf.
Tests may have three results, with text highlighting providing an indication of the output.
Test Result | Meaning |
---|---|
🟡 | Warning |
🔴 | Error |
Using a Test
Tests are added to columns in table blocks. Each column has ‘constraints’ and test functions should be expressed as the expectation you have of the column.
Test libraries are written as SDF workspaces that are imported by your workspace.
Below, we add a simple aggregate uniqueness test, and a scalar null values test to a cool_column
.
SDF’s tests library is an out-of-the-box optimized testing mechanism. All tests associated with a table will be run within one query.
One query —> one table scan —> significant compute savings compare to other data quality solutions.
SDF Standard Library Tests
Column Tests
Test Name | Type | Description |
---|---|---|
not_null() | Scalar | Column elements must not be null |
valid_scalar(condition) | Scalar | Column elements all meet scalar condition |
valid_aggregate(condition) | Aggregate | Column elements all meet aggregate condition |
unique() | Aggregate | Column elements must be unique |
in_accepted_values([values]) | Aggregate | Column values are all elements of array |
minimum(value) | Aggregate | Column can not have values less than value |
maximum(value) | Aggregate | Column can not have values greater than value |
exclusive_minimum(value) | Aggregate | Column can not have values less or equal to value |
exclusive_maximum(value) | Aggregate | Column can not have values greater than or equal to value |
between(lower, upper) | Aggregate | Column values must be between lower and upper |
max_length(value) | Aggregate | Column values (as strings) are greater than a given length |
min_length(value) | Aggregate | Column values (as strings) are less than a given length |
like(string) | Aggregate | Column elements must contain string |
try_cast(type) | Aggregate | Column elements must include string |
primary_key(column) | Aggregate | Column is primary key of column |
Table Tests
Test Name | Type | Description |
---|---|---|
unique_columns([c1, c2]) | Table | Table with columns c1 and c2 has unique values across these two columns |
The full repo of sdf_tests
, our open-source tests library, is is available on GitHub here.
Add Your First Tests
Prerequisite
SDF’s built-in tests library sdf_tests
uses Jinja macros. Therefore, setting Jinja
as a preprocessor is required.
Create the sample workspace tests
:
In the file workspace.sdf.yml
, we have Jinja set as the default preprocessor:
Add Tests
Open the table metadata file src_metadata/raw_inapp_events.sdf.yml
.
In the file, find the three commented out tests and uncomment them. They should look like this:
Run Tests
Run the tests. In the terminal, run:
Working set 1 model file, 1 test file, 2 .sdf files
Running tests_workspace.pub.raw_inapp_events (./models/raw_inapp_events.sql)
Testing tests_workspace.pub.test_raw_inapp_events (./sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql)
Finished 1 model [1 succeeded], 1 test [1 passed] in 0.795 secs
[Pass] Test tests_workspace.pub.test_raw_inapp_events
Inspect the Test
This step is not mandatory, but sometimes it can be helpful to view the compiled test query.
To access the query, navigate to the file located in sdftarget/dbg/preprocessed/sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql
.
The file will hold the compiled tests query and will look like this:
WITH
raw_inapp_events_constraints AS (
SELECT
CASE WHEN COUNT(DISTINCT event_id) != COUNT(event_id)
THEN
‘err: column event_id is NOT unique’ ELSE NULL END AS event_id_1
,
CASE WHEN COUNT(CASE WHEN event_id IS NULL THEN 1 ELSE NULL END) > 0
THEN
‘err: column event_id has NULL values’ ELSE NULL END AS event_id_2
,
CASE WHEN COUNT(CASE WHEN event_name NOT IN (‘install’, ‘add_to_cart’, ‘go_to_checkout’, ‘place_order’) THEN 1 ELSE NULL END) > 0
THEN
‘wrn: column event_name has values outside accepted values ”install”, ”add_to_cart”, ”go_to_checkout”, ”place_order''' ELSE NULL END AS event_name_1
,
CASE WHEN NOT(MIN(event_value) >= 0) or MAX(event_value) > 1000
THEN
‘wrn: column event_value has values outside of 0..1000’ ELSE NULL END AS event_value_1
FROM tests_workspace.pub.raw_inapp_events
)
SELECT reason
FROM (
SELECT event_id_1 as reason FROM raw_inapp_events_constraints WHERE event_id_1 IS NOT NULL
UNION ALL
SELECT event_id_2 as reason FROM raw_inapp_events_constraints WHERE event_id_2 IS NOT NULL
UNION ALL
SELECT event_name_1 as reason FROM raw_inapp_events_constraints WHERE event_name_1 IS NOT NULL
UNION ALL
SELECT event_value_1 as reason FROM raw_inapp_events_constraints WHERE event_value_1 IS NOT NULL
) AS _combined_errors
ORDER BY reason;
Fail a Test
Let’s see how it’ll look like if the test fails.
In src_metadata/raw_inapp_events.sdf.yml
, edit the test associated with the column event_value
and reduce the maximum in the range to be 10 instead of 1000:
Now let’s run the test again:
warning: SDF2001: tests_workspace.pub.raw_inapp_events: attempting to execute Trino, local execution is currently experimental
—> models/raw_inapp_events.sql:1
… and 1 more warnings not shown
Add a Generic Test
SDF supports the ability to turn any SQL query into a data test. This is useful for more complex tests that are not covered by the standard library.
Create a New Workspace
Let’s start by creating a new workspace called generic_test_example
Write a Generic Test
Next, let’s write a SQL query that will represent our generic test. For our example, we’ll write the most basic of tests which will always fail (since it will always return data).
Create a new file generic_test.sql
in a new directory called tests
:
Configure Generic Tests
Next, let’s update the workspace.sdf.yml
to include our new generic test.
To do so, add - path: tests
to your workspace.sdf.yml
includes block like so:
The - path: tests
includes item does not need the type: test
property since naming your directory tests
defaults
the include type to test
. For a list of all default include types, check out our includes block documentation
Run the Test
Now that we have a our new generic tests added, let’s trying running it! Note that we are expecting this test to fail, as the SQL query we wrote will always return data.
In the terminal, run:
warning: SDF2001: hello.pub.generic_test: attempting to execute Trino, local execution is currently experimental
—> tests/generic_test.sql:1
Great! It’s the first time in your life you’ve been happy to see a test fail.
Add Custom Tests to the Testing Library
SDF supports an easy way to write custom tests that will utilize our built-in testing infrastructure to run custom testing macros.
Setup
In our workspace, let’s create a new folder for our custom test macros. We can call it custom_tests
.
The new directory will look like this:
Cleaned sdftarget
Finished clean in 0.220 secs
.
├── custom_tests
├── models
│ └── raw_inapp_events.sql
├── seeds
│ └── inapp_events.csv
├── src_metadata
│ └── raw_inapp_events.sdf.yml
└── workspace.sdf.yml
4 directories, 4 files
We also need to include the new folder in our workspace.sdf.yml
file:
Now we are ready to add some custom tests!
Create a Custom Test
To extend SDF’s built-in testing library, add a jinja file named my_test.jinja
with the following custom test in the newly created folder:
In order to use SDF’s testing infrastructure, we need to maintain a similar macro signature as in the official library.
Try It!
First, we need to add the test to our table yml file, just like we did with
SDF’s built-in tests.
In the file src_metadata/raw_inapp_events.sdf.yml
, add our new custom test
to the column event_name
:
Unlike with regular tests, to call a custom test we need to use the syntax
{sdf_workspace_name}.{custom_test_name}
. In the example above, the workspace name
is tests_workspace
and the custom_test_name is custom_column_test
.
Let’s run the test! In your terminal, execute:
The test passed!
Working set 1 model file, 1 test file, 2 .sdf files
Running tests_workspace.pub.raw_inapp_events (./models/raw_inapp_events.sql)
Testing tests_workspace.pub.test_raw_inapp_events (./sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql)
Finished 1 model [1 succeeded], 1 test [1 passed] in 0.762 secs
[Pass] Test tests_workspace.pub.test_raw_inapp_events
We can look at the compiled tests query located in the file
sdftarget/dbg/preprocessed/sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql
.
Notice our custom logic in the compiled query:
WITH
raw_inapp_events_constraints AS (
SELECT
CASE WHEN COUNT(CASE WHEN index > 1000 THEN 1 ELSE NULL END) > 0
THEN
‘err: column index is over 1000”’ ELSE NULL END AS index_1
FROM tests_workspace.pub.raw_inapp_events
)
SELECT reason
FROM (
SELECT index_1 as reason FROM raw_inapp_events_constraints WHERE index_1 IS NOT NULL
) AS _combined_errors
ORDER BY reason;