Tests
Understanding Tests
Tests are a commonly used data quality feature allowing you to scan results of queries and set expectations of your data. You can validate that a particular column is never null, or within a certain thresholding range, between a certain set of values, and more.
SDF provides a standard testing library. The functions provided in the library are included natively in SDF.
All SDF Tests for all columns are compiled into table scans. This allows SDF to run multiple tests with a single scan. This dramatically reduces the cost of data tests.
An SDF data test is a simple assertion test that runs a SQL query against your database and, if any data is returned, fails. The simplest example might be a non-null test on a column.
This would look like:
select * from table where column is null
When sdf test
runs this query, it will check if any data is returned and fail if so.
Working set 1 model file, 1 test file, 1 .sdf file
Running hello.pub.main (./models/main.sql)
Testing hello.pub.test_main (./sdftarget/dbg/tests/hello/pub/test_main.sql)
Finished 1 model [1 succeeded], 1 test [1 passed] in 0.804 secs
[Pass] Test hello.pub.test_main
There are three types of tests builtin.
- Scalar Column Tests - validate expectations of individual values in columns (eg.
x > 100
) - Aggregate Column Tests - validate expectations of all values in a columns (eg.
sum(x) > 100
) - Table Tests - validate expectations for all columns in a table (eg. columns a and b are unique).
Severity
Each test may be configured with a severity flag. By default all tests have a severity of error meaning that a constraint violation will result in a non-zero exit code for sdf.
Tests may have three results, with text highlighting providing an indication of the output.
Test Result | Meaning |
---|---|
🟡 | Warning |
🔴 | Error |
Using a Test
Tests are added to columns in table blocks. Each column has ‘constraints’ and test functions should be expressed as the expectation you have of the column.
Test libraries are written as SDF workspaces that are imported by your workspace.
Below, we add a simple aggregate uniqueness test, and a scalar null values test to a cool_column
.
table:
name: a_really_great_table
columns:
- name: cool_column
tests:
- expect: unique()
- expect: not_null()
severity: warning
SDF’s tests library is an out-of-the-box optimized testing mechanism. All tests associated with a table will be run within one query.
One query —> one table scan —> significant compute savings compare to other data quality solutions.
SDF Standard Library Tests
Column Tests
Test Name | Type | Description |
---|---|---|
not_null() | Scalar | Column elements must not be null |
valid_scalar(condition) | Scalar | Column elements all meet scalar condition |
valid_aggregate(condition) | Aggregate | Column elements all meet aggregate condition |
unique() | Aggregate | Column elements must be unique |
in_accepted_values([values]) | Aggregate | Column values are all elements of array |
minimum(value) | Aggregate | Column can not have values less than value |
maximum(value) | Aggregate | Column can not have values greater than value |
exclusive_minimum(value) | Aggregate | Column can not have values less or equal to value |
exclusive_maximum(value) | Aggregate | Column can not have values greater than or equal to value |
between(lower, upper) | Aggregate | Column values must be between lower and upper |
max_length(value) | Aggregate | Column values (as strings) are greater than a given length |
min_length(value) | Aggregate | Column values (as strings) are less than a given length |
like(string) | Aggregate | Column elements must contain string |
try_cast(type) | Aggregate | Column elements must include string |
primary_key(column) | Aggregate | Column is primary key of column |
Table Tests
Test Name | Type | Description |
---|---|---|
unique_columns([c1, c2]) | Table | Table with columns c1 and c2 has unique values across these two columns |
The full repo of sdf_tests
, our open-source tests library, is is available on GitHub here.
Writing Your Own Test
Prerequisite
SDF’s built-in tests library sdf_tests
uses Jinja macros. Therefore, setting Jinja
as a preprocessor is required.
Create the sample workspace tests
:
sdf new --sample tests
In the file workspace.sdf.yml
, we have Jinja set as the default preprocessor:
workspace:
name: Tests workspace
...
defaults:
preprocessor: jinja
Add Tests
Open the table metadata file src_metadata/raw_inapp_events.sdf.yml
.
In the file, find the three commented out tests and uncomment them. They should look like this:
Run Tests
Run the tests. In the terminal, run:
sdf test
Working set 1 model file, 1 test file, 2 .sdf files
Running tests_workspace.pub.raw_inapp_events (./models/raw_inapp_events.sql)
Testing tests_workspace.pub.test_raw_inapp_events (./sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql)
Finished 1 model [1 succeeded], 1 test [1 passed] in 0.836 secs
[Pass] Test tests_workspace.pub.test_raw_inapp_events
Inspect the Test
This step is not mandatory, but sometimes it can be helpful to view the compiled test query.
To access the query, navigate to the file located in sdftarget/dbg/preprocessed/sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql
.
The file will hold the compiled tests query and will look like this:
WITH
raw_inapp_events_constraints AS (
SELECT
CASE WHEN COUNT(DISTINCT event_id) != COUNT(event_id)
THEN
‘err: column event_id is NOT unique’ ELSE NULL END AS event_id_1
,
CASE WHEN COUNT(CASE WHEN event_id IS NULL THEN 1 ELSE NULL END) > 0
THEN
‘err: column event_id has NULL values’ ELSE NULL END AS event_id_2
,
CASE WHEN COUNT(CASE WHEN event_name NOT IN (‘install’, ‘add_to_cart’, ‘go_to_checkout’, ‘place_order’) THEN 1 ELSE NULL END) > 0
THEN
‘wrn: column event_name has values outside accepted values ”install”, ”add_to_cart”, ”go_to_checkout”, ”place_order''' ELSE NULL END AS event_name_1
,
CASE WHEN NOT(MIN(event_value) >= 0) or MAX(event_value) > 1000
THEN
‘wrn: column event_value has values outside of 0..1000’ ELSE NULL END AS event_value_1
FROM tests_workspace.pub.raw_inapp_events
)
SELECT reason
FROM (
SELECT event_id_1 as reason FROM raw_inapp_events_constraints WHERE event_id_1 IS NOT NULL
UNION ALL
SELECT event_id_2 as reason FROM raw_inapp_events_constraints WHERE event_id_2 IS NOT NULL
UNION ALL
SELECT event_name_1 as reason FROM raw_inapp_events_constraints WHERE event_name_1 IS NOT NULL
UNION ALL
SELECT event_value_1 as reason FROM raw_inapp_events_constraints WHERE event_value_1 IS NOT NULL
) AS _combined_errors
ORDER BY reason;
Fail a Test
Let’s see how it’ll look like if the test fails.
In src_metadata/raw_inapp_events.sdf.yml
, edit the test associated with the column event_value
and reduce the maximum in the range to be 10 instead of 1000:
...
- name: event_value
description: >
Any numeric value associated with the event.
For example, upon placing an order, the event value would be the total order amount in USD.
# Uncomment the following tests:
tests:
- expect: between(0, 10) <-- Change here - from 1000 to 10
severity: warning
...
Now let’s run the test again:
sdf test
Running tests_workspace.pub.raw_inapp_events (./models/raw_inapp_events.sql)
Testing tests_workspace.pub.test_raw_inapp_events (./sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql)
Finished 1 model [1 succeeded], 1 test [1 failed] in 0.859 secs, for details see below.
[Fail] Test tests_workspace.pub.test_raw_inapp_events
┌─────────────────────────────────────────────────────┐
│ reason │
╞═════════════════════════════════════════════════════╡
│ wrn: column event_value has values outside of 0..10 │
└─────────────────────────────────────────────────────┘
1 rows.
Summary 1 model [1 succeeded], 1 test [1 failed] in 0.860 secs.
Add Custom Tests
SDF supports an easy way to write custom tests that will utilize our built-in testing infrastructure to run custom testing macros.
Setup
In our workspace, let’s create a new folder for our custom test macros. We can call it custom_tests
.
The new directory will look like this:
Cleaned sdftarget
Finished clean in 0.288 secs
.
├── custom_tests
├── models
│ └── raw_inapp_events.sql
├── seeds
│ └── inapp_events.csv
├── src_metadata
│ └── raw_inapp_events.sdf.yml
└── workspace.sdf.yml
4 directories, 4 files
We also need to include the new folder in our workspace.sdf.yml
file:
workspace:
name: tests_workspace
...
include:
- path: custom_tests
type: metadata
...
Now we are ready to add some custom tests!
Create a Custom Test
To extend SDF’s built-in testing library, add a jinja file named my_test.jinja
with the following custom test in the newly created folder:
{% macro custom_column_test(severity, column_name) -%}
COUNT(CASE WHEN {{ column_name }} > 1000 THEN 1 ELSE NULL END) > 0
==>
'{{severity}}: column {{ column_name }} is over 1000"'
{%- endmacro %}
In order to use SDF’s testing infrastructure, we need to maintain a similar macro signature as in the official library.
Try It!
First, we need to add the test to our table yml file, just like we did with
SDF’s built-in tests.
In the file src_metadata/raw_inapp_events.sdf.yml
, add our new custom test
to the column event_name
:
table:
...
columns:
- name: index
description: row_number
tests:
- expect: tests_workspace.custom_column_test() <-- this is our new test!
severity: warning
...
Unlike with regular tests, to call a custom test we need to use the syntax
{sdf_workspace_name}.{custom_test_name}
. In the example above, the workspace name
is tests_workspace
and the custom_test_name is custom_column_test
.
Let’s run the test! In your terminal, execute:
sdf test
The test passed!
Working set 1 model file, 1 test file, 2 .sdf files
Running tests_workspace.pub.raw_inapp_events (./models/raw_inapp_events.sql)
Testing tests_workspace.pub.test_raw_inapp_events (./sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql)
Finished 1 model [1 succeeded], 1 test [1 passed] in 0.826 secs
[Pass] Test tests_workspace.pub.test_raw_inapp_events
We can look at the compiled tests query located in the file
sdftarget/dbg/preprocessed/sdftarget/dbg/tests/tests_workspace/pub/test_raw_inapp_events.sql
.
Notice our custom logic in the compiled query:
WITH
raw_inapp_events_constraints AS (
SELECT
CASE WHEN COUNT(CASE WHEN index > 1000 THEN 1 ELSE NULL END) > 0
THEN
‘err: column index is over 1000”’ ELSE NULL END AS index_1
FROM tests_workspace.pub.raw_inapp_events
)
SELECT reason
FROM (
SELECT index_1 as reason FROM raw_inapp_events_constraints WHERE index_1 IS NOT NULL
) AS _combined_errors
ORDER BY reason;