Background
Source SQL Files
In SDF, SQL files contain SELECT queries defining each table. Here are a few SQL statements we’ll be using to demonstrate the concepts in this document.In the following example, the
lineage
sample workspace is used. If this workspace is not already
set up, it can be created with an sdf new --sample lineage
command.workspace.sdf.yml
file:
workspace.sdf.yml
lineage
sample with the following command:
FROM
clause or tables that select from an external source outside of the data warehouse. SDF
supports two types of root tables – those that define their data inline using
fixed values – and those that refer to a data file in an external store (e.g.
collection of S3 blobs) The root table source is an example of the former.
With the above definitions, we can run sdf compile to list the tables and their
schemas:
Integration with an Existing Data Warehouse
When SDF is used in the context of an existing warehouse, the YML files must provide URLs to table definitions in the underlying systems (Snowflake, Spark, BigQuery, etc.). SDF can then deliver the same functionality as for the tables defined natively. This includes being able to runsdf compile
as shown above,
and perform lineage analysis and label propagation described below.
Visit our integration guides to learn more.
Basic Column-Level Lineage
Lineage is a representation of upstream-downstream dependencies between table columns. SDF understands three types of column-level dependencies out of the box: copy, transform, and inspect.Copy Dependencies
The copy dependency indicates that a column value from some upstream table row flows unchanged into a column of some downstream table row. The cardinality of the downstream column may be different from the cardinality of the upstream column, with only a subset of upstream values finding their way downstream. Consider the column phone in the table sink above. It is a copy of the phone column in the upstream table middle, which in turn is a copy of the phone column in its upstream table source.sink.phone
│
│ copy
└──────┐
middle.phone
│
│ copy
└──────┐
source.phone
More than one upstream column can be a copy-dependency of a downstream column.
For instance SELECT coalesce(col1, col2) as col3 results in both col1 and col2
being copy-dependencies of col3.
Transform Dependencies
The transform dependency indicates that upstream column values contribute to downstream column values via some transformation such as aggregation or function application. Multiple upstream columns can be transform-dependencies of a downstream column. A column can have either copy-dependencies or transform-dependencies, but never both. Theqty
column of the knis
table
shown above is an example of transform lineage (denoted by mod
):
knis.qty
│
│ mod
└──────┐
middle.qty
│
│ mod
└──────┐
source.qty
SUM
aggregation.
Inspect Dependencies
The inspect dependency indicates that an upstream column was inspected (as part of theWHERE
, GROUP BY
, or ON
clauses) to influence, but not directly
contribute to, a downstream column.
Here is a full dependency graph for sink.phone
which involves all three kinds
of dependencies with inspect dependencies marked by scan
for brevity.
sink.phone
│
│ copy
├──────┐
│ middle.phone
│ │
│ │ copy
│ ├──────┐
│ │ source.phone
│ │ scan
│ └──────┐
│ source.txn_date
│ source.user_id
│ scan
└──────┐
middle.qty
│
│ mod
├──────┐
│ source.qty
│ scan
└──────┐
source.txn_date
source.user_id
Lineage CLI Command
SDF provides two interfaces for accessing lineage information. This interface displays information per-table and it includes only one level of upstream dependencies. Here is an example:sink.phone
│
│ copy
├──────┐
│ middle.phone
│ │
│ │ copy
│ ├──────┐
│ │ source.phone
│ │ scan
│ └──────┐
│ source.txn_date
│ source.user_id
│ scan
└──────┐
middle.qty
│
│ mod
├──────┐
│ source.qty
│ scan
└──────┐
source.txn_date
source.user_id
sdf lineage
supports the --forward
flag to
display all the downstream dependencies of a given column.
For a deeper dive on this command, check out the
Lineage CLI Command referece.
Advanced Column-Level Lineage (Coming Soon)
Soon SDF will be able to track dependencies at a more fine-grained level by specifying which functions or operations should be tracked explicitly. Consider this updatedworkspace.sdf.yml
file where we added an instruction to track the
SUM
transformation explicitly.
workspace.sdf.yml
knis.qty
column will be as follows:
knis.qty
│
│ sum
└──────┐
middle.qty
│
│ sum
└──────┐
source.qty
mod
label. By
specifying a wildcard, the user may instruct SDF to track all function
transformations explicitly.
We’re actively working on this functionality and will notify the community when
it is available.