An SDF workspace is a directory which defines the data transformation, quality, and governance processes in your warehouse. It consists of SQL files representing data models or quality controls, configuration files, and additional file types supporting advanced data operations.

A YAML file, workspace.sdf.yml, defines the configuration for the project and is used by SDF to build and deploy it. This file is required in your project, and SDF will fail without it.

Our engine supports multiple dialects in the same SDF project. That means even column-level lineage will work across SQL dialects.

sdf.yml

At a high level, sdf.yml files are general-purpose metadata descriptors for describing things like:

  • Where your queries are in the directory
  • The SQL dialect your query is written in
  • Which Checks should run
  • The Schema and Catalog defaults
  • Which classifiers exist and the columns they should be attached to
  • And more…

The workspace.sdf.yml is actually a special instance of the general sdf.yml.

The sdf.yml is structured such that there are a set of specific YAML blocks titled by their purpose in your project. For example, you might have a table block, a classifier block, and a function block all in the same file. Each of these blocks will contain a set of properties that define the metadata for that block. The types of metadata differ per block type, but they all share a few properties.

sdf.yml files can be placed anywhere within your SDF project, and can be named whatever you like, as long as the file extension is .sdf.yml. For example, you could have a tables.sdf.yml file, a classifiers.sdf.yml file, and a functions.sdf.yml file. Or you could have one sdf.yml per table defined in your project: a table six_hourly_ingest.sql with a corresponding six_hourly_ingest.sdf.yml, for instance. The possibilities are truly endless.
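As a rough sketch of the one-file-per-table layout, a six_hourly_ingest.sdf.yml sitting next to six_hourly_ingest.sql might hold a table block and a classifier block together. The column name, the descriptions, and the choice to keep the classifier in the same file are assumptions for illustration. The `---` separator between blocks mirrors the workspace example on this page. Check the sdf.yml reference for the exact options.

```yaml
# six_hourly_ingest.sdf.yml (placed next to six_hourly_ingest.sql)
table:
  name: six_hourly_ingest
  description: Raw events loaded every six hours.   # made-up description
  columns:
    - name: user_email          # hypothetical column
      classifiers:
        - pii.email
---
classifier:
  name: pii
  description: Personally Identifiable Information
  labels:
    - name: email
      description: Email address
```

Whether you keep the classifier here or move it to a shared classifiers.sdf.yml is a matter of taste; SDF only cares about the .sdf.yml extension.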

Shared Properties Among All Metadata Blocks

name - The name of the metadata block. Names must be unique within their block type. For example, two tables cannot have the same name, but a table and a classifier could share the name pii.

description - This is a free-form text field that can be used to describe the metadata block. This is useful for documentation purposes.

All of this metadata is indexed and searchable in the SDF Cloud.

Metadata Blocks

Here we’ll briefly review each of the metadata block types. Each serves its own purpose, and is used in different ways. For a full list of the options available in the sdf.yml, see the sdf.yml reference.

Workspace Block

The workspace block is a special metadata block that belongs in your workspace.sdf.yml. This defines your workspace configuration, and is used by the SDF engine to build and deploy your project.

Hello World Workspace

The simplest version of a workspace.sdf.yml looks like this:

workspace:
  name: hello_world
  edition: "1.3"
  includes:
    - path: models

So what’s going on here? The first thing you’ll notice is the workspace specifies an edition. This tells the SDF engine which version of the workspace to expect when compiling your project.

Next we see an includes block. This tells the SDF engine where to look to find the code sources it will use to build your project and transform your data. Simply put, your SQL files should go into the models directory.

For a full list of the options available in the workspace.sdf.yml, see the sdf.yml reference.

Important Properties

workspace:
  name: moms_flower_shop
  edition: "1.3"
  description: >
    This workspace represents the data warehouse of mom’s flower shop.

    It contains raw data regarding:
    1. Customers
    2. Marketing campaigns
    3. Mobile in-app events
    4. Street addresses

    That data is available in the seeds folder and is referenced in models/raw
    to be loaded and used by SDF. Data transformations are performed and additional
    models are available in the staging and analytics folders under the models folder.

  includes:
    - path: models
      type: model
      index: schema-table-name
    - path: seeds/parquet
      type: resource
    - path: metadata
      type: metadata
      index: schema-table-name
    - path: classifications
      type: metadata
    - path: reports
      type: report
    - path: checks
      type: check

  defaults:
    preprocessor: jinja
---
environment:
  name: dev
  integrations:
    - provider: sdf
      type: database
      targets:
        - pattern: moms_flower_shop..
          rename-as: moms_workshed.\{1}.{2}

  • name - The name of the workspace. This is used to create the default catalog for the workspace.
  • edition - The edition of the workspace. This is used to determine which version of the workspace to expect when compiling your project. By default this is 1.2.
  • includes - The includes block references all metadata, code, and resources that will be used to build your project and transform your data. Each entry can be a file, folder, or glob pattern, and can specify additional properties like the IncludeType.
  • excludes - The excludes block defines where to explicitly ignore code sources or metadata. Can be a file, folder, or glob pattern.
  • defaults - The defaults block contains all defaults for the workspace. In the example above, it sets the default preprocessor (jinja); it can also hold defaults such as the catalog, schema, and dialect.
  • dialect - The default SQL dialect that will be used to build your project. SDF supports multi-dialect projects, so you can have different SQL dialects in the same project by configuring the dialect in the table metadata block. However, most projects will only use one.
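The properties above could be combined roughly as follows. This is a sketch, not a canonical configuration: the excluded folder is hypothetical, the assumption that excludes entries take a path key (like includes entries) should be checked against the sdf.yml reference, and the dialect value is borrowed from the generated-metadata example on this page.

```yaml
workspace:
  name: moms_flower_shop       # also used to create the default catalog
  edition: "1.3"
  includes:
    - path: models
    - path: checks
  excludes:
    - path: models/scratch     # hypothetical folder to ignore
  defaults:
    dialect: trino             # default SQL dialect for the whole workspace
```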

Includes Blocks

The includes sub-property exists only on the workspace block and the environment block. Its purpose is to tell SDF where to find the models, metadata specifications, macros, reports, and resources like seed data. The includes property contains a list of includes blocks, each of which is a dictionary with the same structure. The most commonly used properties on these includes blocks are:

  • path - The relative path to the directory or file in the workspace containing the models, metadata, resources, etc. This path is relative to the workspace.sdf.yml.
  • type - The type of assets being included by this block. Some commonly used types are model, spec, and macro.
  • defaults - This block mirrors the defaults block at the top level of the workspace or environment block. It enables you to override default properties on a per-includes-path basis.
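Here is a minimal sketch of a per-path override, assuming a workspace that mostly uses one dialect but keeps a folder of models written in another. The legacy_models directory is hypothetical, and the snowflake dialect name is an assumption; confirm the supported dialect names in the sdf.yml reference.

```yaml
workspace:
  name: hello_world
  edition: "1.3"
  defaults:
    dialect: trino             # workspace-wide default
  includes:
    - path: models
      type: model
    - path: legacy_models      # hypothetical directory
      type: model
      defaults:
        dialect: snowflake     # assumed dialect name; applies to this path only
```

Models under models keep the workspace default, while anything under legacy_models picks up the overridden dialect.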

SDF will auto-infer the type of the includes block if the directory name matches a certain naming convention. We consider this a best practice and encourage users to follow these conventions.

| Directory Name | Type     |
|----------------|----------|
| models         | model    |
| specs          | spec     |
| reports        | report   |
| checks         | check    |
| macros         | macro    |
| seeds          | seed     |
| resources      | resource |

Given the naming conventions mentioned above, the following includes block would work without any type specifications:

workspace.sdf.yml
includes:
  - path: models
  - path: specs
  - path: reports
  - path: checks
  - path: macros
  - path: seeds
  - path: resources

Table Block

The table block is used to define the metadata for a table. This includes the table name, the table description, the table’s columns, and even the table’s SQL dialect. This will likely be your most widely used metadata block. For more detailed information on the table block, see the Table Block reference.
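As a quick sketch, a table block that sets each of the pieces mentioned above might look like this. The descriptions are made up, and the column-level description key is an assumption to verify against the Table Block reference; the datatype and dialect keys follow the generated table definition shown in the Generated Metadata section.

```yaml
table:
  name: main
  description: Customer contact details.    # made-up description
  dialect: trino                             # overrides the workspace default for this table
  columns:
    - name: phone
      datatype: varchar
      description: Customer phone number     # assumed to be supported on columns
```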

Table-Level Classifiers

SDF enables you to tag tables via table-level classifiers, which lets you quickly identify all tables under a certain category. Examples of table-level classifiers include customer_facing, debugging, and sensitive_information.

Here’s an example of a table definition with a table-level classifier:

table:
  name: main
  classifiers:
    - purpose.third_party_sharing
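The reference purpose.third_party_sharing follows the same classifier.label pattern as pii.phone elsewhere on this page. A minimal sketch of a matching definition, assuming a classifier named purpose with a third_party_sharing label (both descriptions are made up):

```yaml
classifier:
  name: purpose
  description: Why the data in a table is collected or shared
  labels:
    - name: third_party_sharing
      description: Table contents may be shared with third parties
```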

Column-Level Classifiers

One of the most important use cases for the table block is defining column-level classifiers. SDF will propagate these classifiers to downstream dependencies automatically. Here’s an example of a basic table definition with a column-level classifier:

table:
  name: main
  columns:
    - name: phone
      classifiers:
        - pii.phone

This table definition attaches the pii.phone classifier to the phone column.

Classifier Block

The classifier block is used to define classifiers in your SDF projects. You can think of these as units of metadata to be reused and measured across your data warehouse. Classifiers have names, descriptions, and labels. For more detailed information on the classifier block, see the Classifier Block reference.

In light of data privacy regulations, a common use case for classifiers is to define PII columns. You can define a classifier for PII columns, then apply that classifier to any column in your project and watch it propagate automatically to downstream queries. This will allow you to easily identify PII columns across your data warehouse, and even enforce data governance checks on them in the future.

Here’s an example of a basic PII classifier:

classifier:
  name: pii
  description: Personally Identifiable Information
  labels:
    - name: email
      description: Email address
    - name: phone
      description: Phone number
    - name: ssn
      description: Social Security Number

This classifier can be used across columns and tables alike to tell the SDF engine where to find PII. See the Table Block section for an example of how to apply this classifier to a column.
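To make the connection concrete, here is a sketch of a table that attaches each label of the pii classifier above to a column. The customers table and its column names are hypothetical.

```yaml
table:
  name: customers              # hypothetical table
  columns:
    - name: email
      classifiers:
        - pii.email
    - name: phone
      classifiers:
        - pii.phone
    - name: ssn
      classifiers:
        - pii.ssn
```

Each reference is the classifier name followed by a label name, so a label has to exist in the classifier definition before a column can use it.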

SDF automatically propagates classifiers to downstream dependencies. This helps you keep your metadata up to date and consistent across your entire data warehouse without having to manually update it everywhere. This makes it easier to stay compliant with data privacy regulations like GDPR and CCPA.

Generated Metadata

SDF represents all the metadata about your project in a local sdftarget directory. After running an SDF command, you’ll find this new directory appears in your SDF workspace. This is automatically added to your .gitignore and should not be tracked in git. As the name implies, this folder enables SDF to cache intermediaries as you’re building your project locally. Along with its caching capability, it also generates a set of sdf.yml files after running any SDF command.

These sdf.yml files represent all the user-defined and generated SDF metadata across your project. This metadata is used to enable features like column-level lineage, classifier propagation, and more.

Here’s an example of a full-fledged table definition generated in the sdftarget:

table:
  name: moms_flower_shop.analytics.agg_installs_and_campaigns
  dialect: trino
  materialization: view
  purpose: model
  dependencies:
  - moms_flower_shop.staging.app_installs_v2
  columns:
  - name: install_date
    datatype: varchar
    lineage:
      modify:
      - moms_flower_shop.staging.app_installs_v2.install_time
  - name: campaign_name
    datatype: varchar
    lineage:
      copy:
      - moms_flower_shop.staging.app_installs_v2.campaign_name
  - name: platform
    datatype: varchar
    lineage:
      copy:
      - moms_flower_shop.staging.app_installs_v2.platform
  - name: distinct_installs
    datatype: bigint
    lineage:
      modify:
      - moms_flower_shop.staging.app_installs_v2.customer_id
  classifiers:
  - RETENTION.infinity
  lineage:
    scan:
    - moms_flower_shop.staging.app_installs_v2.campaign_name
    - moms_flower_shop.staging.app_installs_v2.install_time
    - moms_flower_shop.staging.app_installs_v2.platform
  source-locations:
  - path: metadata/analytics/agg_installs_and_campaigns.sdf.yml
  - path: models/analytics/agg_installs_and_campaigns.sql

These generated sdf.yml files are a powerful representation that can help you understand what SDF is doing under the hood. You can use these files to understand how SDF is interpreting your project and debug issues that might arise. We hope developers will take this representation and build their own tools and functionalities on top of it in the future.

Want to build something cool that utilizes the SDF engine? We’d love to hear about it! Inquire here to let us know what you’re working on and how we can help.