SDF’s column-level lineage combined with its metadata and classifier propagation will help you build a data warehouse that is more accurate, easier to maintain, and compliant with an evolving data privacy landscape.

Background

Source SQL Files

In SDF, SQL files contain SELECT queries defining each table. Here are a few SQL statements we’ll be using to demonstrate the concepts in this document.

In the following example, the lineage sample workspace is used. If this workspace is not already set up, it can be created with an sdf new --sample lineage command.

Hint: knis = sink[::-1]

These are wrapped up into an SDF workspace with a simple workspace.sdf.yml file:

workspace.sdf.yml
workspace:
  name: lineage
  edition: "1.3"
  includes:
    - path: models/

To use these SQL files locally and follow along, create a new SDF workspace from the lineage sample with the following command:

sdf new --sample lineage

Note how tables can use other tables as sources in the FROM clause using either unqualified (middle) or qualified (chain.pub.middle) names.

Root tables are tables without a FROM clause or tables that select from an external source outside of the data warehouse. SDF supports two types of root tables – those that define their data inline using fixed values – and those that refer to a data file in an external store (e.g. collection of S3 blobs) The root table source is an example of the former.

With the above definitions, we can run sdf compile to list the tables and their schemas:

sdf compile
Schema chain.pub.source
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       |            |
| phone       | varchar   | false       |            |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.middle
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       |            |
| phone       | varchar   | false       |            |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.knis
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.sink
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| uid         | bigint    | false       |            |
| phone       | varchar   | false       |            |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Integration with an Existing Data Warehouse

When SDF is used in the context of an existing warehouse, the YML files must provide URLs to table definitions in the underlying systems (Snowflake, Spark, BigQuery, etc.). SDF can then deliver the same functionality as for the tables defined natively. This includes being able to run sdf compile as shown above, and perform lineage analysis and label propagation described below.

Visit our integration guides to learn more.

Basic Column-Level Lineage

Lineage is a representation of upstream-downstream dependencies between table columns. SDF understands three types of column-level dependencies out of the box: copy, transform, and inspect.

Copy Dependencies

The copy dependency indicates that a column value from some upstream table row flows unchanged into a column of some downstream table row. The cardinality of the downstream column may be different from the cardinality of the upstream column, with only a subset of upstream values finding their way downstream. Consider the column phone in the table sink above. It is a copy of the phone column in the upstream table middle, which in turn is a copy of the phone column in its upstream table source.

sink.phone │ │ copy └──────┐        middle.phone        │        │ copy        └──────┐               source.phone

In both steps of the above transformation, from source to middle, and from middle to sink, the cardinality of the column goes down: first because of aggregation; then because of filtering.

More than one upstream column can be a copy-dependency of a downstream column. For instance SELECT coalesce(col1, col2) as col3 results in both col1 and col2 being copy-dependencies of col3.

Transform Dependencies

The transform dependency indicates that upstream column values contribute to downstream column values via some transformation such as aggregation or function application. Multiple upstream columns can be transform-dependencies of a downstream column. A column can have either copy-dependencies or transform-dependencies, but never both. The qty column of the knis table shown above is an example of transform lineage (denoted by mod):

knis.qty │ │ mod └──────┐        middle.qty        │        │ mod        └──────┐               source.qty

Both transitions above are judged to be transform-dependencies because of SUM aggregation.

Inspect Dependencies

The inspect dependency indicates that an upstream column was inspected (as part of the WHERE, GROUP BY, or ON clauses) to influence, but not directly contribute to, a downstream column.

Here is a full dependency graph for sink.phone which involves all three kinds of dependencies with inspect dependencies marked by scan for brevity.

sink.phone │ │ copy ├──────┐ │ middle.phone │ │ │ │ copy │ ├──────┐ │ │ source.phone │ │ scan │ └──────┐ │ source.txn_date │ source.user_id │ scan └──────┐        middle.qty        │        │ mod        ├──────┐        │ source.qty        │ scan        └──────┐               source.txn_date               source.user_id

Lineage CLI Command

SDF provides two interfaces for accessing lineage information. This interface displays information per-table and it includes only one level of upstream dependencies. Here is an example:

sdf lineage models/sink.sql
table:
  name: chain.pub.sink
  lineage:
    inspect:
      - middle.qty
  columns:
    - name: uid
      lineage:
        copy:
          - middle.user_id
    - name: phone
      lineage:
        copy:
          - middle.phone
    - name: txn_date
      lineage:
        copy:
          - middle.txn_date
    - name: qty
      lineage:
        copy:
          - middle.qty

Alternatively you can view a multi-level graphical representation of lineage:

sdf lineage sink --column phone   

sink.phone │ │ copy ├──────┐ │ middle.phone │ │ │ │ copy │ ├──────┐ │ │ source.phone │ │ scan │ └──────┐ │ source.txn_date │ source.user_id │ scan └──────┐        middle.qty        │        │ mod        ├──────┐        │ source.qty        │ scan        └──────┐               source.txn_date               source.user_id

In addition, sdf lineage supports the --forward flag to display all the downstream dependencies of a given column.

For a deeper dive on this command, check out the Lineage CLI Command referece.

Advanced Column-Level Lineage (Coming Soon)

Soon SDF will be able to track dependencies at a more fine-grained level by specifying which functions or operations should be tracked explicitly. Consider this updated workspace.sdf.yml file where we added an instruction to track the SUM transformation explicitly.

workspace.sdf.yml
workspace:
  edition: "1.3"
  lineage:
    functions:
      - sum
  includes:
    - path: models/

Now, the lineage for the knis.qty column will be as follows:

knis.qty │ │ sum └──────┐        middle.qty        │        │ sum        └──────┐               source.qty

Other transformations will continue to be tagged by the generic mod label. By specifying a wildcard, the user may instruct SDF to track all function transformations explicitly.

We’re actively working on this functionality and will notify the community when it is available.