Data Classifiers
SDF promotes reasoning on higher level types through a rich data classification & type system.
SQL only supports primitive types like varchar
, int
, timestamp
, decimal
, etc. But, people reason over higher level data types. A varchar might be a name, or phone number. An integer might be a daily active user metric.
SDF has the ability to annotate columns and tables with user defined types and then automatically propagate those types to downstream assets while respecting aggregations, or functional transformations of data. This unique capability fosters the creation of a dynamic semantic layer which adapts as you build out your data warehouse.
Classifiers are first-class citizens of the SDF ecosystem and are an added layer of metadata on top of SQL models. They are completely compatible with all dialects and databases that SDF supports.
You can think of them like rich types in a language like Typescript. They can be defined, reused, transformed, and propagated programatically by SDF.
SDF uses the term Classifier and SQL Type interchangeably. Any column or table may have 0 or more classifiers.
Using SQL Types & classifiers
What are popular examples of SQL Types?
- Personally Identifiable Information like phone numbers, addresses, etc.
- Metrics
- Financial Information
- Unique Identifiers
- Retention Requiremenets
- Time Grains
What Can You Do With SQL Types?
- Write Reports
- Write Checks
- Audit
- Prevent Business Logic Mismatch
- Ensure Consistency Across Metrics
- Customer Segmentation
Higher order types have many advantages.
Basic Propagation
Propagation is the mechanism of inferring the downstream column types from the upstream column types. The inference process is based on the semantics of the underlying transformation.
In the following example, the lineage
sample workspace is used. If this workspace is not already
set up, it can be created with an sdf new --sample lineage
command.
Defining a Classifier Taxonomy
To work with classifiers and propagation we must first define all the relevant
classifier domains. Here is an example of a simple PII classifier domain,
defined in a separate section of the global workspace.sdf.yml
config file.
workspace:
edition: "1.3"
includes:
- path: models/
---
classifier:
name: PII
labels:
- name: Phone
- name: Address
- name: SSN
- name: UID
This configuration defines five A top level type PII
with 4 labels: PII.Phone
,
PII.Address
, PII.SSN
, and PII.UID
, where the last four labels denote
special classes of the first label. (I.e. if some data is labeled with
PII.Address
, it is also, implicitly, labeled with PII
.)
Classifier domains can also be defined in separate sdf.yml
files as long as
these files are included by the workspace.sdf.yml
file as one of the
paths specified in includes
.
Attaching Classifiers to Tables
Once a classifier domain is defined, we can manually attach classifiers to tables. Once root tables are annotated with the relevant classifiers, SDF automatically propagates classifiers downstream.
Let’s start by labeling our source
. We add a new file (models/source.sdf.yml
) containing our type:
table:
name: source
columns:
- name: user_id
classifiers:
- PII.UID
- name: phone
classifiers:
- PII.Phone
This file will automatically be picked up by SDF because the workspace already includes the /models
directory. Alternatively, the same configuratin can be included
directly into the workspace.sdf.yml file as follows:
workspace:
edition: "1.3"
includes:
- path : models/
---
classifier:
name: PII
labels:
- name: Phone
- name: Address
- name: SSN
- name: UID
- name: AreaCode
---
table:
name: source
columns:
- name: user_id
classifiers:
- PII.UID
- name: phone
classifiers:
- PII.Phone
Notice - all .sdf.yml files are composable! You may include one or more configuration in any .sdf.yml file, as long as the schema is respected.
Inspecting Propagation Results
To see the classifer, and how it was propagated to all tables downstream of our source
table, run sdf compile --show all
.
sdf compile --show all
Schema chain.pub.source
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id | bigint | false | PII.UID |
| phone | varchar | false | PII.Phone |
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Schema chain.pub.middle
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id | bigint | false | PII.UID |
| phone | varchar | false | PII.Phone |
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Schema chain.pub.knis
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Schema chain.pub.sink
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| uid | bigint | false | PII.UID |
| phone | varchar | false | PII.Phone |
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Observe how the classifiers attached to the upstream table automatically
propagated downstream to middle
and sink
, but not to knis
– because it
doesn’t have any columns derived from the PII
columns upstream.
Advanced Propagation
By default classifiers propagate through functions unchanged. For example, if a
substring is extracted from a column labeled as PII.Phone
, the result
will also be labeled with PII.Phone
by default.
However, the result is no longer a phone number - maybe it’s just the area code!
Let’s imagine the result of the substring expression is only area codes. We can use a function block in our sdf.yml
files to define the behavior of a classifier in response to the function being called.
In our current example, we’d want to reclassify PII.Phone
to PII.AreaCode
. Here’s the function overload required to do so:
---
function:
name: substring
reclassify:
- from: PII.Phone
to: PII.AreaCode
Great! Now every time the substring function is called on a column with the classifier PII.Phone
, it will be reclassified to PII.AreaCode
.
Another common case is to prevent a classifier from propagating through an aggregation. For example, if a column user_id
is labeled with the classifier PII.UID
and we SELECT COUNT(DISTINCT user_id)
from the table containing this column, we don’t want PII.UID
to propagate since the COUNT DISTINCT
is not a PII.UID
. It’s a number representing the unique count of PII.UID
s. We can use the same reclassify
block to prevent this propagation. Here’s an example:
---
function:
name: count
parameters:
- datatype: T
returns:
datatype: bigint
reclassify:
- from: PII.UID
By explicitly not setting a to
value in the reclassify
block, we remove the PII.UID
classifier downstream and achieve our desired behavior.
Function reclassifications work across all queries by default. In this way, developers have full programmatic control over how their classifiers propagate and interact with functions across their entire data warehouse.
Lastly, this same reclassification mechanism can be used on a single column or table using the same reclassify
block. Here’s an example below:
table:
name: phone_trimmed
columns:
- name: phone_first_three_digits
reclassify:
- from: PII.Phone
to: PII.AreaCode
In this example, the column phone_first_three_digits
will be reclassified to PII.AreaCode
if it is derived from a column with the classifier PII.Phone
.
Classifier States
Using labels, classifiers can also exist in states that communicate how the data is currently represented.
Here is how a classifier state could be defined in YML:
...
---
classifier:
name: PII_STATE
labels:
- name: clear_text
- name: hashed
- name: anonymized
...
PII_STATE.clear_text
is meant to represent human-readable PII, and anonymized
is meant to represent anonymized PII. We can then use these states to define the effect of functions (including User Defined Functions (UDFs)) on classifiers like the examples above. For example, we can define the effect of an md5
hash as follows:
---
function:
name: md5
reclassify:
- from: PII_STATE.clear_text
to: PII_STATE.hashed
Now let’s imagine we have a UDF called anonymize
which anonymizes PII no matter its current state. For this, we can use a glob pattern (e.g. *
) to tell SDF to reclassify a PII_STATE
classifier no matter its current state. Here’s an example:
---
function:
name: anonymize
parameters:
- name: pii
datatype: varchar
returns:
- datatype: varchar
reclassify:
- from: PII_STATE.*
to: PII_STATE.anonymized
Note the usage of PII_STATE.*
in the from
field of the reclassify
block. This tells SDF to reclassify any PII_STATE
classifier, no matter its current state, to PII_STATE.anonymized
.
Wondering why we also defined the parameters
and returns
in this function block? This is because this function is a User Defined Function (UDF)
.
Organizing Classifiers
When organizing classifiers, it is recommended to create a folder to host all Classfiers created by you and your team.
Setting up a location within the working directory called classifiers
is the recommended setting.
Once created, this location will need to be ammended to the workspace.sdf.yml
so that it is included.
---
includes:
- path: classifiers/