SDF promotes reasoning about higher-level types through a rich data classification and type system.
Databases store physical types such as varchar, int, timestamp, and decimal, but people reason over higher-level data types. A varchar might be a name or a phone number. An integer might be a daily active user metric.
SDF can annotate columns and tables with user-defined types and then automatically propagate those types to downstream assets while respecting aggregations and functional transformations of data. This capability fosters the creation of a dynamic semantic layer which adapts as you build out your data warehouse.
Classifiers are first-class citizens of the SDF ecosystem and are an added layer of metadata on top of SQL models. They are completely compatible with all dialects and databases that SDF supports.
You can think of them like rich types in a language like TypeScript. They can be defined, reused, transformed, and propagated programmatically by SDF.
This guide uses the lineage sample workspace. If this workspace is not already set up, it can be created with the sdf new --sample lineage command.
workspace.sdf.yml config file.
The classifier PII has 4 labels: PII.Phone, PII.Address, PII.SSN, and PII.UID, each of which denotes a special class of PII. (That is, if some data is labeled with PII.Address, it is also implicitly labeled with PII.)
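The implicit parent labeling described above can be modeled in a few lines. This is only an illustrative sketch of the dotted-label hierarchy, not SDF's implementation; the helper name implied_labels is hypothetical.

```python
def implied_labels(label: str) -> list[str]:
    """Return the label followed by every ancestor label, most specific first.

    Hypothetical helper: it only models the dotted naming hierarchy.
    """
    parts = label.split(".")
    # Drop one trailing segment at a time: "PII.Address" -> "PII".
    return [".".join(parts[:i]) for i in range(len(parts), 0, -1)]


print(implied_labels("PII.Address"))  # ['PII.Address', 'PII']
```

A column labeled PII.Address therefore also matches any check written against the broader PII label.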
sdf.yml files as long as these files are included by the workspace.sdf.yml file as one of the paths specified in includes.
source. We add a new file (models/source.sdf.yml) containing our type:
/models directory. Alternatively, the same configuration can be included directly in the workspace.sdf.yml file as follows:
source table, run sdf compile --show all.
middle and sink, but not to knis, because it doesn’t have any columns derived from the PII columns upstream.
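To make the propagation behavior concrete, here is a minimal sketch of labels flowing along column-level lineage. It is a conceptual model only, not how SDF computes lineage; the table names come from the sample workspace, but the column names, the LINEAGE map, and the propagate function are assumptions made for illustration.

```python
# Hypothetical column-level lineage: downstream column -> upstream columns.
LINEAGE = {
    "middle.phone": ["source.phone"],
    "sink.phone": ["middle.phone"],
    "knis.txn_count": ["source.txn_id"],  # not derived from any PII column
}

# Labels attached by hand to source columns (assumed for this sketch).
SEEDS = {"source.phone": {"PII.Phone"}}


def propagate(lineage, seeds):
    """Copy labels from upstream to downstream columns until nothing changes."""
    labels = {col: set(found) for col, found in seeds.items()}
    changed = True
    while changed:
        changed = False
        for col, parents in lineage.items():
            inherited = set().union(*(labels.get(p, set()) for p in parents))
            if not inherited <= labels.get(col, set()):
                labels.setdefault(col, set()).update(inherited)
                changed = True
    return labels


result = propagate(LINEAGE, SEEDS)
print(sorted(result))
```

In this model the columns in middle and sink inherit PII.Phone, while the knis column receives no label because none of its upstream columns carry one.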
PII.Phone, the result will also be labeled with PII.Phone by default.
However, the result is no longer a phone number - maybe it’s just the area code!
Let’s imagine the result of the substring expression is only area codes. We can use a function block in our sdf.yml files to define the behavior of a classifier in response to the function being called.
In our current example, we’d want to reclassify PII.Phone to PII.AreaCode. Here’s the function overload required to do so:
PII.Phone, it will be reclassified to PII.AreaCode.
Another common case is to prevent a classifier from propagating through an aggregation. For example, if a column user_id is labeled with the classifier PII.UID and we SELECT COUNT(DISTINCT user_id) from the table containing this column, we don’t want PII.UID to propagate, since the COUNT DISTINCT result is not a PII.UID. It’s a number representing the count of unique PII.UID values. We can use the same reclassify block to prevent this propagation. Here’s an example:
to value in the reclassify block, we remove the PII.UID classifier downstream and achieve our desired behavior.
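The two reclassification cases discussed so far (rewriting a label and dropping it) can be sketched as a small rule table. This is a conceptual model under stated assumptions, not SDF's engine: the RULES structure and apply_function are hypothetical, and None is used here to stand in for a rule that has no target label.

```python
# Hypothetical rule table: function name -> {from_label: to_label}.
# A to_label of None models a rule that removes the classifier.
RULES = {
    "substring": {"PII.Phone": "PII.AreaCode"},
    "count": {"PII.UID": None},
}


def apply_function(func: str, input_labels: set[str]) -> set[str]:
    """Return the labels on a function's result, given its input labels."""
    rules = RULES.get(func, {})
    output = set()
    for label in input_labels:
        if label not in rules:
            output.add(label)  # default: propagate unchanged
        elif rules[label] is not None:
            output.add(rules[label])  # rewrite to the target label
        # else: the rule drops the label entirely
    return output


print(apply_function("substring", {"PII.Phone"}))  # {'PII.AreaCode'}
print(apply_function("count", {"PII.UID"}))        # set()
```

Functions with no rule leave labels untouched, which mirrors the default propagation described earlier.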
reclassify block. Here’s an example:
phone_first_three_digits will be reclassified to PII.AreaCode if it is derived from a column with the classifier PII.Phone.
PII_STATE.clear_text is meant to represent human-readable PII, and anonymized is meant to represent anonymized PII. We can then use these states to define the effect of functions (including User Defined Functions (UDFs)) on classifiers like the examples above. For example, we can define the effect of an md5 hash as follows:
anonymize which anonymizes PII no matter its current state. For this, we can use a glob pattern (e.g. *) to tell SDF to reclassify a PII_STATE classifier no matter its current state. Here’s an example:
PII_STATE.* in the from field of the reclassify block. This tells SDF to reclassify any PII_STATE classifier, no matter its current state, to PII_STATE.anonymized.
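The effect of a glob in the from field can be illustrated with Python's standard fnmatch module. This is a sketch of the matching idea only; SDF's own glob semantics may differ, and the reclassify function below is a hypothetical stand-in rather than an SDF API.

```python
from fnmatch import fnmatchcase


def reclassify(labels: set[str], from_pattern: str, to_label: str) -> set[str]:
    """Replace every label matching from_pattern with to_label.

    Labels that do not match the pattern pass through unchanged.
    """
    return {
        to_label if fnmatchcase(label, from_pattern) else label
        for label in labels
    }


before = {"PII_STATE.clear_text", "PII.Phone"}
after = reclassify(before, "PII_STATE.*", "PII_STATE.anonymized")
print(sorted(after))  # ['PII.Phone', 'PII_STATE.anonymized']
```

Because the pattern matches every PII_STATE label, an input that is already PII_STATE.anonymized stays anonymized, and unrelated classifiers such as PII.Phone are left alone.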
parameters and returns in this function block? This is because this function is a User Defined Function (UDF).
classifiers is the recommended setting.
Once created, this location will need to be added to the workspace.sdf.yml so that it is included.