varchar
, int
, timestamp
, decimal
, etc. But, people reason over higher level data types. A varchar might be a name, or phone number. An integer might be a daily active user metric.
SDF has the ability to annotate columns and tables with user defined types and then automatically propagate those types to downstream assets while respecting aggregations, or functional transformations of data. This
unique capability fosters the creation of a dynamic semantic layer which adapts as you build out your data warehouse.
Classifiers are first-class citizens of the SDF ecosystem and are an added layer of metadata on top of SQL models. They are completely compatible with all dialects and databases that SDF supports.
You can think of them like rich types in a language like Typescript. They can be defined, reused, transformed, and propagated programatically by SDF.
SDF uses the term Classifier and SQL Type interchangeably. Any column or table may have 0 or more classifiers.
Using SQL Types & classifiers
What are popular examples of SQL Types?- Personally Identifiable Information like phone numbers, addresses, etc.
- Metrics
- Financial Information
- Unique Identifiers
- Retention Requiremenets
- Time Grains
- Write Reports
- Write Checks
- Audit
- Prevent Business Logic Mismatch
- Ensure Consistency Across Metrics
- Customer Segmentation
Basic Propagation
Propagation is the mechanism of inferring the downstream column types from the upstream column types. The inference process is based on the semantics of the underlying transformation.In the following example, the
lineage
sample workspace is used. If this workspace is not already
set up, it can be created with an sdf new --sample lineage
command.Defining a Classifier Taxonomy
To work with classifiers and propagation we must first define all the relevant classifier domains. Here is an example of a simple PII classifier domain, defined in a separate section of the globalworkspace.sdf.yml
config file.
workspace.sdf.yml
PII
with 4 labels: PII.Phone
,
PII.Address
, PII.SSN
, and PII.UID
, where the last four labels denote
special classes of the first label. (I.e. if some data is labeled with
PII.Address
, it is also, implicitly, labeled with PII
.)
Classifier domains can also be defined in separate
sdf.yml
files as long as
these files are included by the workspace.sdf.yml
file as one of the
paths specified in includes
.Attaching Classifiers to Tables
Once a classifier domain is defined, we can manually attach classifiers to tables. Once root tables are annotated with the relevant classifiers, SDF automatically propagates classifiers downstream. Let’s start by labeling oursource
. We add a new file (models/source.sdf.yml
) containing our type:
models/source.sdf.yml
/models
directory. Alternatively, the same configuratin can be included
directly into the workspace.sdf.yml file as follows:
workspace.sdf.yml
Inspecting Propagation Results
To see the classifer, and how it was propagated to all tables downstream of oursource
table, run sdf compile --show all
.
middle
and sink
, but not to knis
– because it
doesn’t have any columns derived from the PII
columns upstream.
Advanced Propagation
By default classifiers propagate through functions unchanged. For example, if a substring is extracted from a column labeled asPII.Phone
, the result
will also be labeled with PII.Phone
by default.
However, the result is no longer a phone number - maybe it’s just the area code!
Let’s imagine the result of the substring expression is only area codes. We can use a function block in our sdf.yml
files to define the behavior of a classifier in response to the function being called.
In our current example, we’d want to reclassify PII.Phone
to PII.AreaCode
. Here’s the function overload required to do so:
workspace.sdf.yml
PII.Phone
, it will be reclassified to PII.AreaCode
.
Another common case is to prevent a classifier from propagating through an aggregation. For example, if a column user_id
is labeled with the classifier PII.UID
and we SELECT COUNT(DISTINCT user_id)
from the table containing this column, we don’t want PII.UID
to propagate since the COUNT DISTINCT
is not a PII.UID
. It’s a number representing the unique count of PII.UID
s. We can use the same reclassify
block to prevent this propagation. Here’s an example:
workspace.sdf.yml
to
value in the reclassify
block, we remove the PII.UID
classifier downstream and achieve our desired behavior.
Function reclassifications work across all queries by default. In this way, developers have full programmatic control over how their classifiers propagate and interact with functions across their entire data warehouse.
reclassify
block. Here’s an example below:
models/phone_trimmed.sdf.yml
phone_first_three_digits
will be reclassified to PII.AreaCode
if it is derived from a column with the classifier PII.Phone
.
Classifier States
Using labels, classifiers can also exist in states that communicate how the data is currently represented. Here is how a classifier state could be defined in YML:workspace.sdf.yml
PII_STATE.clear_text
is meant to represent human-readable PII, and anonymized
is meant to represent anonymized PII. We can then use these states to define the effect of functions (including User Defined Functions (UDFs)) on classifiers like the examples above. For example, we can define the effect of an md5
hash as follows:
workspace.sdf.yml
anonymize
which anonymizes PII no matter its current state. For this, we can use a glob pattern (e.g. *
) to tell SDF to reclassify a PII_STATE
classifier no matter its current state. Here’s an example:
workspace.sdf.yml
PII_STATE.*
in the from
field of the reclassify
block. This tells SDF to reclassify any PII_STATE
classifier, no matter its current state, to PII_STATE.anonymized
.
Wondering why we also defined the
parameters
and returns
in this function block? This is because this function is a User Defined Function (UDF)
.Organizing Classifiers
When organizing classifiers, it is recommended to create a folder to host all Classfiers created by you and your team. Setting up a location within the working directory calledclassifiers
is the recommended setting.
Once created, this location will need to be ammended to the workspace.sdf.yml
so that it is included.
workspace.sdf.yml