Lineage
This documents aims to layout how SDF’s powerful lineage capabilities creates detailed and accessible visibility into your data warehouse.
SDF’s column-level lineage combined with its metadata and classifier propagation will help you build a data warehouse that is more accurate, easier to maintain, and compliant with an evolving data privacy landscape.
Background
Source SQL Files
In SDF, SQL files contain SELECT queries defining each table. Here are a few SQL statements we’ll be using to demonstrate the concepts in this document.
In the following example, the lineage
sample workspace is used. If this workspace is not already
set up, it can be created with an sdf new --sample lineage
command.
Hint: knis = sink[::-1]
These are wrapped up into an SDF workspace with a simple workspace.sdf.yml
file:
workspace:
name: lineage
edition: "1.3"
includes:
- path: models/
To use these SQL files locally and follow along, create a new SDF workspace from the lineage
sample with the following command:
sdf new --sample lineage
Note how tables can use other tables as sources in the FROM clause using either unqualified (middle) or qualified (chain.pub.middle) names.
Root tables are tables without a FROM
clause or tables that select from an external source outside of the data warehouse. SDF
supports two types of root tables – those that define their data inline using
fixed values – and those that refer to a data file in an external store (e.g.
collection of S3 blobs) The root table source is an example of the former.
With the above definitions, we can run sdf compile to list the tables and their schemas:
sdf compile
Schema chain.pub.source
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id | bigint | false | |
| phone | varchar | false | |
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Schema chain.pub.middle
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id | bigint | false | |
| phone | varchar | false | |
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Schema chain.pub.knis
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Schema chain.pub.sink
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| uid | bigint | false | |
| phone | varchar | false | |
| txn_date | varchar | false | |
| qty | bigint | false | |
+-------------+-----------+-------------+------------+
Integration with an Existing Data Warehouse
When SDF is used in the context of an existing warehouse, the YML files must
provide URLs to table definitions in the underlying systems (Snowflake, Spark,
BigQuery, etc.). SDF can then deliver the same functionality as for the tables
defined natively. This includes being able to run sdf compile
as shown above,
and perform lineage analysis and label propagation described below.
Visit our integration guides to learn more.
Basic Column-Level Lineage
Lineage is a representation of upstream-downstream dependencies between table columns. SDF understands three types of column-level dependencies out of the box: copy, transform, and inspect.
Copy Dependencies
The copy dependency indicates that a column value from some upstream table row flows unchanged into a column of some downstream table row. The cardinality of the downstream column may be different from the cardinality of the upstream column, with only a subset of upstream values finding their way downstream. Consider the column phone in the table sink above. It is a copy of the phone column in the upstream table middle, which in turn is a copy of the phone column in its upstream table source.
sink.phone │ │ copy └──────┐ middle.phone │ │ copy └──────┐ source.phone
In both steps of the above transformation, from source to middle, and from middle to sink, the cardinality of the column goes down: first because of aggregation; then because of filtering.
More than one upstream column can be a copy-dependency of a downstream column. For instance SELECT coalesce(col1, col2) as col3 results in both col1 and col2 being copy-dependencies of col3.
Transform Dependencies
The transform dependency indicates that upstream column values contribute to
downstream column values via some transformation such as aggregation or function
application. Multiple upstream columns can be transform-dependencies of a
downstream column. A column can have either copy-dependencies or
transform-dependencies, but never both. The qty
column of the knis
table
shown above is an example of transform lineage (denoted by mod
):
knis.qty │ │ mod └──────┐ middle.qty │ │ mod └──────┐ source.qty
Both transitions above are judged to be transform-dependencies because of SUM
aggregation.
Inspect Dependencies
The inspect dependency indicates that an upstream column was inspected (as
part of the WHERE
, GROUP BY
, or ON
clauses) to influence, but not directly
contribute to, a downstream column.
Here is a full dependency graph for sink.phone
which involves all three kinds
of dependencies with inspect dependencies marked by scan
for brevity.
sink.phone │ │ copy ├──────┐ │ middle.phone │ │ │ │ copy │ ├──────┐ │ │ source.phone │ │ scan │ └──────┐ │ source.txn_date │ source.user_id │ scan └──────┐ middle.qty │ │ mod ├──────┐ │ source.qty │ scan └──────┐ source.txn_date source.user_id
Lineage CLI Command
SDF provides two interfaces for accessing lineage information. This interface displays information per-table and it includes only one level of upstream dependencies. Here is an example:
sdf lineage models/sink.sql
table:
name: chain.pub.sink
lineage:
inspect:
- middle.qty
columns:
- name: uid
lineage:
copy:
- middle.user_id
- name: phone
lineage:
copy:
- middle.phone
- name: txn_date
lineage:
copy:
- middle.txn_date
- name: qty
lineage:
copy:
- middle.qty
Alternatively you can view a multi-level graphical representation of lineage:
sdf lineage sink --column phone
sink.phone │ │ copy ├──────┐ │ middle.phone │ │ │ │ copy │ ├──────┐ │ │ source.phone │ │ scan │ └──────┐ │ source.txn_date │ source.user_id │ scan └──────┐ middle.qty │ │ mod ├──────┐ │ source.qty │ scan └──────┐ source.txn_date source.user_id
In addition, sdf lineage
supports the --forward
flag to
display all the downstream dependencies of a given column.
For a deeper dive on this command, check out the Lineage CLI Command referece.
Advanced Column-Level Lineage (Coming Soon)
Soon SDF will be able to track dependencies at a more fine-grained level by
specifying which functions or operations should be tracked explicitly. Consider
this updated workspace.sdf.yml
file where we added an instruction to track the
SUM
transformation explicitly.
workspace:
edition: "1.3"
lineage:
functions:
- sum
includes:
- path: models/
Now, the lineage for the knis.qty
column will be as follows:
knis.qty │ │ sum └──────┐ middle.qty │ │ sum └──────┐ source.qty
Other transformations will continue to be tagged by the generic mod
label. By
specifying a wildcard, the user may instruct SDF to track all function
transformations explicitly.
We’re actively working on this functionality and will notify the community when it is available.