> ## Documentation Index
> Fetch the complete documentation index at: https://docs.sdf.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Data Classifiers

> SDF promotes reasoning on higher level types through a rich data classification & type system. 

SQL only supports primitive types like `varchar`, `int`, `timestamp`, `decimal`, etc. But, people reason over higher level data types. A varchar might be a name, or phone number. An integer might be a daily active user metric.

SDF has the ability to annotate columns and tables with **user defined types** and then **automatically propagate** those types to downstream assets while respecting aggregations, or functional transformations of data. This
unique capability fosters the creation of a *dynamic semantic layer* which adapts as you build out your data warehouse.

Classifiers are first-class citizens of the SDF ecosystem and are an added layer of metadata on top of SQL models. They are completely compatible with **all dialects** and databases that SDF supports.

You can think of them like rich types in a language like Typescript. They can be defined, reused, transformed, and propagated programatically by SDF.

<Note>
  SDF uses the term *Classifier* and *SQL Type* interchangeably. Any column or table may have 0 or more classifiers.
</Note>

## Using SQL Types & classifiers

**What are popular examples of SQL Types?**

* Personally Identifiable Information like phone numbers, addresses, etc.
* Metrics
* Financial Information
* Unique Identifiers
* Retention Requiremenets
* Time Grains

**What Can You Do With SQL Types?**

* Write Reports
* Write Checks
* Audit
* Prevent Business Logic Mismatch
* Ensure Consistency Across Metrics
* Customer Segmentation

Higher order types have many advantages.

## Basic Propagation

Propagation is the mechanism of inferring the downstream column types from
the upstream column types. The inference process is based on the semantics of
the underlying transformation.

<Note>
  In the following example, the `lineage` sample workspace is used. If this workspace is not already
  set up, it can be created with an `sdf new --sample lineage` command.
</Note>

### Defining a Classifier Taxonomy

To work with classifiers and propagation we must first define all the relevant
classifier domains. Here is an example of a simple PII classifier domain,
defined in a separate section of the global `workspace.sdf.yml` config file.

```yaml workspace.sdf.yml theme={null}
workspace:
  edition: "1.3"
  includes:
    - path: models/
---
classifier:
  name: PII
  labels:
    - name: Phone
    - name: Address
    - name: SSN
    - name: UID
```

This configuration defines five A top level type `PII` with 4 labels: `PII.Phone`,
`PII.Address`, `PII.SSN`, and `PII.UID`, where the last four labels denote
special classes of the first label. (I.e. if some data is labeled with
`PII.Address`, it is also, implicitly, labeled with `PII`.)

<Note>
  Classifier domains can also be defined in separate `sdf.yml` files as long as
  these files are included by the `workspace.sdf.yml` file as one of the
  paths specified in `includes`.
</Note>

### Attaching Classifiers to Tables

Once a classifier domain is defined, we can manually attach classifiers to
tables. Once root tables are annotated with the relevant classifiers, SDF
automatically propagates classifiers downstream.

Let's start by labeling our `source`. We add a new file (`models/source.sdf.yml`) containing our type:

```yaml models/source.sdf.yml theme={null}
table:
  name: source
  columns:
    - name: user_id
      classifiers:
        - PII.UID
    - name: phone
      classifiers:
        - PII.Phone
```

This file will automatically be picked up by SDF because the workspace already includes the `/models` directory. Alternatively, the same configuratin can be included
directly into the workspace.sdf.yml file as follows:

```yaml workspace.sdf.yml theme={null}
workspace:
  edition: "1.3"
  includes:
    - path : models/
---
classifier:
  name: PII
  labels:
    - name: Phone
    - name: Address
    - name: SSN
    - name: UID
    - name: AreaCode
---
table:
  name: source
  columns:
    - name: user_id
      classifiers:
        - PII.UID
    - name: phone
      classifiers:
        - PII.Phone
```

Notice - all .sdf.yml files are composable! You may include one or more configuration in any .sdf.yml file, as long as the schema is respected.

### Inspecting Propagation Results

To see the classifer, and how it was propagated to all tables downstream of our `source` table, run `sdf compile --show all`.

```bash theme={null}
sdf compile --show all
```

```text theme={null}
Schema chain.pub.source
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       | PII.UID    |
| phone       | varchar   | false       | PII.Phone  |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.middle
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| user_id     | bigint    | false       | PII.UID    |
| phone       | varchar   | false       | PII.Phone  |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.knis
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+

Schema chain.pub.sink
+-------------+-----------+-------------+------------+
| column_name | data_type | is_nullable | classifier |
+-------------+-----------+-------------+------------+
| uid         | bigint    | false       | PII.UID    |
| phone       | varchar   | false       | PII.Phone  |
| txn_date    | varchar   | false       |            |
| qty         | bigint    | false       |            |
+-------------+-----------+-------------+------------+
```

Observe how the classifiers attached to the upstream table automatically
propagated downstream to `middle` and `sink`, but not to `knis` – because it
doesn’t have any columns derived from the `PII` columns upstream.

## Advanced Propagation

By default classifiers propagate through functions unchanged. For example, if a
substring is extracted from a column labeled as `PII.Phone`, the result
will also be labeled with `PII.Phone` by default.

However, the result is no longer a phone number - maybe it's just the area code!

Let's imagine the result of the substring expression is only area codes. We can use a [function block](/reference/sdf-yml#block-function) in our `sdf.yml` files to define the behavior of a classifier in response to the function being called.

In our current example, we'd want to reclassify `PII.Phone` to `PII.AreaCode`. Here's the function overload required to do so:

```yaml workspace.sdf.yml theme={null}
---
function:
  name: substring
  reclassify:
    - from: PII.Phone 
      to: PII.AreaCode
```

Great! Now every time the substring function is called on a column with the classifier `PII.Phone`, it will be reclassified to `PII.AreaCode`.

Another common case is to *prevent* a classifier from propagating through an aggregation. For example, if a column `user_id` is labeled with the classifier `PII.UID` and we `SELECT COUNT(DISTINCT user_id)` from the table containing this column, we don't want `PII.UID` to propagate since the `COUNT DISTINCT` is not a `PII.UID`. It's a number representing the unique count of `PII.UID`s. We can use the same `reclassify` block to prevent this propagation. Here's an example:

```yaml workspace.sdf.yml theme={null}
---
function:
  name: count
  parameters:
    - datatype: T
  returns:
    datatype: bigint
  reclassify:
    - from: PII.UID
```

By explicitly not setting a `to` value in the `reclassify` block, we remove the `PII.UID` classifier downstream and achieve our desired behavior.

<Note>
  Function reclassifications work across all queries by default. In this way, developers have full programmatic control over how their classifiers propagate and interact with functions across their entire data warehouse.
</Note>

Lastly, this same reclassification mechanism can be used on a single column or table using the same `reclassify` block. Here's an example below:

```yaml models/phone_trimmed.sdf.yml theme={null}
table:
	name: phone_trimmed
	columns:
		- name: phone_first_three_digits
	reclassify:
		- from: PII.Phone
			to: PII.AreaCode
```

In this example, the column `phone_first_three_digits` will be reclassified to `PII.AreaCode` if it is derived from a column with the classifier `PII.Phone`.

### Classifier States

Using *labels*, classifiers can also exist in states that communicate how the data is currently represented.

Here is how a classifier state could be defined in YML:

```yaml workspace.sdf.yml theme={null}
...
---
classifier:
  name: PII_STATE
  labels:
    - name: clear_text
    - name: hashed
    - name: anonymized
...
```

`PII_STATE.clear_text` is meant to represent human-readable PII, and `anonymized` is meant to represent anonymized PII. We can then use these states to define the effect of functions (including [User Defined Functions (UDFs)](/guide/advanced/udf)) on classifiers like the examples above. For example, we can define the effect of an `md5` hash as follows:

```yaml workspace.sdf.yml theme={null}
---
function:
  name: md5
  reclassify:
    - from: PII_STATE.clear_text
      to: PII_STATE.hashed
```

Now let's imagine we have a UDF called `anonymize` which anonymizes PII no matter its current state. For this, we can use a glob pattern (e.g. `*`) to tell SDF to reclassify a `PII_STATE` classifier no matter its current state. Here's an example:

```yaml workspace.sdf.yml theme={null}
---
function:
  name: anonymize
  parameters:
    - name: pii
      datatype: varchar
  returns:
    - datatype: varchar
  reclassify:
    - from: PII_STATE.*
      to: PII_STATE.anonymized
```

Note the usage of `PII_STATE.*` in the `from` field of the `reclassify` block. This tells SDF to reclassify any `PII_STATE` classifier, no matter its current state, to `PII_STATE.anonymized`.

<Info>
  Wondering why we also defined the `parameters` and `returns` in this function block? This is because this function is a [`User Defined Function (UDF)`](/guide/advanced/udf).
</Info>

### Organizing Classifiers

When organizing classifiers, it is recommended to create a folder to host all Classfiers created by you and your team.
Setting up a location within the working directory called `classifiers` is the recommended setting.
Once created, this location will need to be ammended to the `workspace.sdf.yml` so that it is included.

```yaml workspace.sdf.yml theme={null}
---
includes:
  - path: classifiers/
```
