> ## Documentation Index
> Fetch the complete documentation index at: https://docs.sdf.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Lineage

> This documents aims to layout how SDF's powerful lineage capabilities  creates detailed and accessible visibility into your data warehouse.

SDF's column-level lineage combined with its metadata and classifier propagation
will help you build a data warehouse that is more accurate, easier to maintain,
and compliant with an evolving data privacy landscape.

## Background

### Source SQL Files

In SDF, SQL files contain SELECT queries defining each table. Here are a few SQL
statements we'll be using to demonstrate the concepts in this document.

<Note>
  In the following example, the `lineage` sample workspace is used. If this workspace is not already
  set up, it can be created with an `sdf new --sample lineage` command.
</Note>

<CodeGroup>
  ```sql models/source.sql theme={null}
  select column1 as user_id,
         column2 as phone,
         column3 as txn_date,
         column4 as qty from
  (VALUES
    (1, '555-1212', '2022-01-01', 100),
    (1, '555-1212', '2022-02-01', 50),
    (1, '555-1212', '2022-03-01', 75),
    (2, '444-1313', '2022-01-01', 200),
    (2, '444-1313', '2022-02-01', 100),
    (3, '333-1414', '2022-03-01', 300))
  ```

  ```sql models/middle.sql theme={null}
  select user_id, max(phone) as phone, txn_date, sum(qty) as qty
  from source
  group by user_id, txn_date
  ```

  ```sql models/sink.sql theme={null}
  select user_id as uid, phone, txn_date, qty
  from middle
  where qty > 180
  ```

  ```sql models/knis.sql theme={null}
  select txn_date, sum(qty) as qty
  from middle
  ```
</CodeGroup>

*Hint: knis = sink\[::-1]*

These are wrapped up into an SDF workspace with a simple `workspace.sdf.yml`
file:

```yaml workspace.sdf.yml theme={null}
workspace:
  name: lineage
  edition: "1.3"
  includes:
    - path: models/
```

To use these SQL files locally and follow along, create a new SDF workspace from the `lineage` sample with the following command:

```shell theme={null}
sdf new --sample lineage
```

Note how tables can use other tables as sources in the FROM clause using either
unqualified (middle) or qualified (chain.pub.middle) names.

Root tables are tables without a `FROM` clause or tables that select from an external source outside of the data warehouse. SDF
supports two types of root tables – those that define their data inline using
fixed values – and those that refer to a data file in an external store (e.g.
collection of S3 blobs) The root table source is an example of the former.

With the above definitions, we can run sdf compile to list the tables and their
schemas:

```bash theme={null}
sdf compile
```

```text theme={null}
Schema chain.pub.source
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       |            |
| phone       | varchar   | false       |            |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.middle
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       |            |
| phone       | varchar   | false       |            |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.knis
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.sink
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| uid         | bigint    | false       |            |
| phone       | varchar   | false       |            |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+
```

### Integration with an Existing Data Warehouse

When SDF is used in the context of an existing warehouse, the YML files must
provide URLs to table definitions in the underlying systems (Snowflake, Spark,
BigQuery, etc.). SDF can then deliver the same functionality as for the tables
defined natively. This includes being able to run `sdf compile` as shown above,
and perform lineage analysis and label propagation described below.

Visit our [integration guides](/integrations/snowflake/getting-started) to learn more.

## Basic Column-Level Lineage

Lineage is a representation of upstream-downstream dependencies between table
columns. SDF understands three types of column-level dependencies out of the
box: copy, transform, and inspect.

### Copy Dependencies

The copy dependency indicates that a column value from some upstream table row
flows unchanged into a column of some downstream table row. The cardinality of
the downstream column may be different from the cardinality of the upstream
column, with only a subset of upstream values finding their way downstream.
Consider the column phone in the table sink above. It is a copy of the phone
column in the upstream table middle, which in turn is a copy of the phone column
in its upstream table source.

<div className="mt-5 mb-8 not-prose bg-[#0F1117] dark:bg-codeblock rounded-xl dark:ring-1 dark:ring-gray-800 relative p-5 text-sm leading-6 text-gray-50">
  <pre style={{ fontFamily: 'monospace' }}>
    sink.phone
    │
    │ copy
    └──────┐
           middle.phone
           │
           │ copy
           └──────┐
                  source.phone
  </pre>
</div>

In both steps of the above transformation, from source to middle, and from
middle to sink, the cardinality of the column goes down: first because of
aggregation; then because of filtering.

<Note>
  More than one upstream column can be a copy-dependency of a downstream column.
  For instance SELECT coalesce(col1, col2) as col3 results in both col1 and col2
  being copy-dependencies of col3.
</Note>

### Transform Dependencies

The transform dependency indicates that upstream column values contribute to
downstream column values via some transformation such as aggregation or function
application. Multiple upstream columns can be transform-dependencies of a
downstream column. A column can have either copy-dependencies or
transform-dependencies, but never both. The `qty` column of the `knis` table
shown above is an example of transform lineage (denoted by `mod`):

<div className="mt-5 mb-8 not-prose bg-[#0F1117] dark:bg-codeblock rounded-xl dark:ring-1 dark:ring-gray-800 relative p-5 text-sm leading-6 text-gray-50">
  <pre style={{ fontFamily: 'monospace' }}>
    knis.qty
    │
    │ mod
    └──────┐
           middle.qty
           │
           │ mod
           └──────┐
                  source.qty
  </pre>
</div>

Both transitions above are judged to be transform-dependencies because of `SUM`
aggregation.

### Inspect Dependencies

The *inspect* dependency indicates that an upstream column was inspected (as
part of the `WHERE`, `GROUP BY`, or `ON` clauses) to influence, but not directly
contribute to, a downstream column.

Here is a full dependency graph for `sink.phone` which involves all three kinds
of dependencies with inspect dependencies marked by `scan` for brevity.

<div className="mt-5 mb-8 not-prose bg-[#0F1117] dark:bg-codeblock rounded-xl dark:ring-1 dark:ring-gray-800 relative p-5 text-sm leading-6 text-gray-50">
  <pre style={{ fontFamily: 'monospace' }}>
    sink.phone
    │
    │ copy
    ├──────┐
    │      middle.phone
    │      │
    │      │ copy
    │      ├──────┐
    │      │      source.phone
    │      │ scan
    │      └──────┐
    │             source.txn\_date
    │             source.user\_id
    │ scan
    └──────┐
           middle.qty
           │
           │ mod
           ├──────┐
           │      source.qty
           │ scan
           └──────┐
                  source.txn\_date
                  source.user\_id
  </pre>
</div>

## Lineage CLI Command

SDF provides two interfaces for accessing lineage information. This interface displays information per-table and it includes only one
level of upstream dependencies. Here is an example:

```bash theme={null}
sdf lineage models/sink.sql
```

```yaml theme={null}
table:
  name: chain.pub.sink
  lineage:
    inspect:
      - middle.qty
  columns:
    - name: uid
      lineage:
        copy:
          - middle.user_id
    - name: phone
      lineage:
        copy:
          - middle.phone
    - name: txn_date
      lineage:
        copy:
          - middle.txn_date
    - name: qty
      lineage:
        copy:
          - middle.qty
```

Alternatively you can view a multi-level graphical representation of lineage:

```bash theme={null}
sdf lineage sink --column phone   
```

<div className="mt-5 mb-8 not-prose bg-[#0F1117] dark:bg-codeblock rounded-xl dark:ring-1 dark:ring-gray-800 relative p-5 text-sm leading-6 text-gray-50">
  <pre style={{ fontFamily: 'monospace' }}>
    sink.phone
    │
    │ copy
    ├──────┐
    │      middle.phone
    │      │
    │      │ copy
    │      ├──────┐
    │      │      source.phone
    │      │ scan
    │      └──────┐
    │             source.txn\_date
    │             source.user\_id
    │ scan
    └──────┐
           middle.qty
           │
           │ mod
           ├──────┐
           │      source.qty
           │ scan
           └──────┐
                  source.txn\_date
                  source.user\_id
  </pre>
</div>

In addition, `sdf lineage` supports the `--forward` flag to
display all the downstream dependencies of a given column.

For a deeper dive on this command, check out the
[Lineage CLI Command](/reference/sdf-cli#sdf-lineage) referece.

## Advanced Column-Level Lineage (Coming Soon)

Soon SDF will be able to track dependencies at a more fine-grained level by
specifying which functions or operations should be tracked explicitly. Consider
this updated `workspace.sdf.yml` file where we added an instruction to track the
`SUM` transformation explicitly.

```yaml workspace.sdf.yml theme={null}
workspace:
  edition: "1.3"
  lineage:
    functions:
      - sum
  includes:
    - path: models/
```

Now, the lineage for the `knis.qty` column will be as follows:

<div className="mt-5 mb-8 not-prose bg-[#0F1117] dark:bg-codeblock rounded-xl dark:ring-1 dark:ring-gray-800 relative p-5 text-sm leading-6 text-gray-50">
  <pre style={{ fontFamily: 'monospace' }}>
    knis.qty
    │
    │ sum
    └──────┐
           middle.qty
           │
           │ sum
           └──────┐
                  source.qty
  </pre>
</div>

Other transformations will continue to be tagged by the generic `mod` label. By
specifying a wildcard, the user may instruct SDF to track all function
transformations explicitly.

We're actively working on this functionality and will notify the community when
it is available.
