Overview

In this guide, we’ll materialize our first tables and views in Snowflake with SDF. We’ll be using some datasets from our partner, Cybersyn, to bootstrap some example data through Snowflake’s data marketplace.

Prerequisites

This guide should be followed after completing the Getting Started with Snowflake and SDF guide.

Before beginning, we’ll need to make sure we have the following:

  • A Snowflake account with this Cybersyn dataset installed.
  • Valid Snowflake username / password credentials with write access to at least one database we can materialize tables to.
  • Instantiated credentials completed in the previous guide.

When installing the Cybersyn dataset, make sure to grant the Snowflake role you’ll use with SDF read access to TECH__INNOVATION_ESSENTIALS (i.e. the Cybersyn database).

Guide

1

Create a New SDF Project from the Cybersyn Tech Innovation Sample

Create a new SDF project using the Cybersyn Tech Innovation sample. This will create a new project in your current working directory with the sample project files.

sdf new --sample cybersyn_tech_innovation 
2

Compile to Test Credentials

To ensure your credentials are working and have read access to the new Cybersyn database, let’s try compiling one of the models.

sdf compile sdf_snowflake.cybersyn_tech_innovation.funder_aggregates

Working set 3 model files, 1 .sdf file Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.OPENALEX_FUNDERS_INDEX (schema)   Compiling sdf_snowflake.cybersyn_tech_innovation.funder_aggregates (./models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql)    Finished 2 models [2 succeeded] in 1.724 secs

Schema sdf_snowflake.cybersyn_tech_innovation.FUNDER_AGGREGATES ┌──────────────────────┬────────────────┬────────────┬─────────────┐ │ column_name          ┆ data_type      ┆ classifier ┆ description │ ╞══════════════════════╪════════════════╪════════════╪═════════════╡ │ FUNDER_NAME          ┆ varchar        ┆            ┆             │ │ TOTAL_FUNDERS        ┆ decimal(38, 0) ┆            ┆             │ │ TOTAL_WORKS_COUNT    ┆ decimal(38, 0) ┆            ┆             │ │ TOTAL_CITATIONS      ┆ decimal(38, 0) ┆            ┆             │ │ TOTAL_GRANTS         ┆ decimal(38, 0) ┆            ┆             │ │ UNIQUE_COUNTRY_COUNT ┆ decimal(38, 0) ┆            ┆             │ │ FIRST_CREATED_DATE   ┆ timestamp      ┆            ┆             │ │ LAST_UPDATED_DATE    ┆ timestamp      ┆            ┆             │ └──────────────────────┴────────────────┴────────────┴─────────────┘

If you do not see a successful compilation, please ensure you’ve:

  1. Followed the Getting Started with Snowflake and SDF guide to authenticate to your Snowflake.
  2. Granted the correct role read access to the TECH__INNOVATION_ESSENTIALS database.
3

Materialize Tables in Snowflake

If you compiled successfully, great! That means SDF was able to read table schemas from the Cybersyn database. Now, let’s materialize in a separate database called sdf_snowflake.

You’ll notice in the workspace.sdf.yml file specifies the following defaults:

defaults:
    dialect: snowflake
    preprocessor: jinja
    materialization: view

Let’s go through these one by one:

  • dialect: snowflake specifies that the SQL we’re working with uses Snowflake’s syntax and functions.
  • preprocessor: jinja specifies that we’re using Jinja templating in our SQL models (this will come in handy later).
  • materialization: view specifies that we’re materializing our models as views by default in Snowflake. Note this can be overwritten on a per-model, or per-includes-path basis.

Next, let’s look at the integrations block:

integrations:
  - provider: snowflake
    type: database
    sources: 
       - pattern: tech__innovation_essentials.cybersyn.*
    targets:
       - pattern: sdf_snowflake.*.*

This tells SDF to hydrate missing table schemas from the tech__innovation_essentials.cybersyn database and schema and materialize tables to the sdf_snowflake database. For more information on integration configuration, see the integration documentation.

Now, let’s materialize the tables in the sdf_snowflake database.

sdf run

This is great, but what if we want to see the data our views produced?

Let’s add small flag to sdf run that tells SDF to pull down a sample of the data from the views we just materialized.

sdf run --show all

Congratulations! You’ve just materialized your first tables in Snowflake with SDF, and even got a sneak peek at the data they produced.

4

Materialize a New Table

Let’s say we want to materialize a new table in the sdf_snowflake database, but in a different schema. You’ll notice in the workspace.sdf.yml file, we have the following includes block:

- path: models 
  index: catalog-schema-table-name

The index here means we’ll interpret the database and schema name from the file path like so - models/<database>/<schema>/<table>.sql.

Since we want to add a new table, let’s create a new file in the models directory called sdf_snowflake/staging/latest_repo_name.sql with the following content:

WITH partitioned_repo_names AS (
    SELECT
        repo_name,
        repo_id,
        ROW_NUMBER() OVER (
            PARTITION BY repo_id
            ORDER BY
            first_seen DESC
        ) AS rn
    FROM
        tech__innovation_essentials.cybersyn.github_repos
    WHERE
        first_seen >= DATEADD(DAY, -1, CURRENT_TIMESTAMP())
)
SELECT
    repo_name,
    repo_id
FROM partitioned_repo_names
WHERE rn = 1;

This table will select the latest repo_name and repo_id from github_repos created in the last 24 hours.

Now let’s compile this new table to ensure the SQL is valid and selects columns that actually exist:

sdf compile models/sdf_snowflake/staging/latest_repo_name.sql

Working set 4 model files, 1 .sdf file Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_REPOS (schema)   Compiling sdf_snowflake.staging.latest_repo_name (./models/sdf_snowflake/staging/latest_repo_name.sql)    Finished 2 models [2 succeeded] in 1.744 secs

Schema sdf_snowflake.staging.LATEST_REPO_NAME ┌─────────────┬────────────────┬────────────┬─────────────┐ │ column_name ┆ data_type      ┆ classifier ┆ description │ ╞═════════════╪════════════════╪════════════╪═════════════╡ │ REPO_NAME   ┆ varchar        ┆            ┆             │ │ REPO_ID     ┆ decimal(38, 0) ┆            ┆             │ └─────────────┴────────────────┴────────────┴─────────────┘

Great! Everything compiled. Now, let’s make sure this table is materialized as a table instead of a view in the sdf_snowflake database. We’re doing this now since later one we’ll want to use this table as an upstream dependency in our DAG.

To accomplish this, we’ll need to overwrite our materialization default in the workspace.sdf.yml file, and specify the materialization as table for this specific model.

Under our workspace.sdf.yml block, we can add:

---
table:
    name: sdf_snowflake.staging.latest_repo_name
    materialization: table

SDF YML blocks are composable like legos. You can add as many blocks as you need to your workspace.sdf.yml file, or even split them out into separate files and include them with the type: metadata in your includes block. Creating separate YML files for each purpose or model can help keep your configuration organized and easy to maintain.

Now, let’s run the model to materialize the table in the sdf_snowflake database.

sdf run models/sdf_snowflake/staging/latest_repo_name.sql

Working set 4 model files, 1 .sdf file Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_REPOS (schema)     Running sdf_snowflake.staging.latest_repo_name (./models/sdf_snowflake/staging/latest_repo_name.sql)    Finished 2 models [2 succeeded] in 5.412 secs

Table sdf_snowflake.staging.LATEST_REPO_NAME ┌───────────────────────────────┬───────────┐ │ REPO_NAME                     ┆ REPO_ID   │ ╞═══════════════════════════════╪═══════════╡ │ 0-1CxH/megatron-wrap          ┆ 904669875 │ │ 0-CHR/my-cert                 ┆ 904731422 │ │ 0-CHR/test-cert               ┆ 904730980 │ │ 0-Porcelain/0-Porcelain       ┆ 904578837 │ │ 0-chirag-s/GymAI              ┆ 904776614 │ │ 0-entropy/AdguardFilters      ┆ 904965609 │ │ 0-entropy/breakless-adblocker ┆ 904937190 │ │ 0-entropy/uAssets             ┆ 904970966 │ │ 0-ml/speft                    ┆ 904614261 │ │ 00000000koma/koma             ┆ 904770739 │ └───────────────────────────────┴───────────┘ 10 rows.

Congratulations! You’ve just materialized your first table in Snowflake with SDF. In the next step, we’ll add one more table and a downstream view to create a DAG.

5

Create a DAG

Now that we’ve materialized a new table, let’s create a DAG by adding a downstream view that depends on the table we just created and one more table we’ll materialize now.

The DAG will look like this:

Let’s first create a new file in the models directory called sdf_snowflake/staging/push_events.sql with the following content:

SELECT
    id,
    repo_id
FROM
    tech__innovation_essentials.cybersyn.github_events
WHERE
    type like '%Push%'
AND 
    created_at_timestamp >= DATEADD(DAY, -1, CURRENT_TIMESTAMP())

This table will select all push events from github_events created in the last 24 hours.

Now let’s compile this new table to ensure the SQL is valid and selects columns that actually exist and are properly typed:

sdf compile models/sdf_snowflake/staging/push_events.sql

Working set 5 model files, 1 .sdf file Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)   Compiling sdf_snowflake.staging.push_events (./models/sdf_snowflake/staging/push_events.sql)    Finished 2 models [2 succeeded] in 1.875 secs

Schema sdf_snowflake.staging.PUSH_EVENTS ┌─────────────┬────────────────┬────────────┬─────────────┐ │ column_name ┆ data_type      ┆ classifier ┆ description │ ╞═════════════╪════════════════╪════════════╪═════════════╡ │ ID          ┆ varchar        ┆            ┆             │ │ REPO_ID     ┆ decimal(38, 0) ┆            ┆             │ └─────────────┴────────────────┴────────────┴─────────────┘

Great! Everything compiled. Now, let’s make sure this table is materialized as a table instead of a view in the sdf_snowflake database.

Let’s add another table block to our workspace.sdf.yml file:

---
table:
    name: sdf_snowflake.staging.push_events
    materialization: table

Now, let’s run the model to materialize the table in the sdf_snowflake database.

sdf run models/sdf_snowflake/staging/push_events.sql

Great! Now that we have our two tables, let’s finish up our DAG by creating a downstream view that depends on these two tables.

Create a new file in the models directory called sdf_snowflake/cybersyn_tech_innovation/events_by_repo.sql with the following content:

SELECT
    r.repo_name,
    COUNT(DISTINCT e.id) AS event_count
FROM
    sdf_snowflake.staging.push_events AS e
JOIN sdf_snowflake.staging.latest_repo_name AS r ON e.repo_id = r.repo_id
GROUP BY
    r.repo_name
ORDER BY
    event_count DESC
LIMIT 10;

Since our workspace’s default materialization is view, we don’t need to specify the materialization for this model.

Now, let’s compile again to ensure our entire workspace is still in a good state:

    sdf compile

Working set 6 model files, 1 .sdf file Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_STARS (schema) Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema) Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.USPTO_CONTRIBUTOR_INDEX (schema) Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.USPTO_PATENT_INDEX (schema) Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_REPOS (schema) Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.OPENALEX_FUNDERS_INDEX (schema) Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.USPTO_PATENT_CONTRIBUTOR_RELATIONSHIPS (schema)   Compiling sdf_snowflake.staging.push_events (./models/sdf_snowflake/staging/push_events.sql)   Compiling sdf_snowflake.cybersyn_tech_innovation.funder_aggregates (./models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql)   Compiling sdf_snowflake.cybersyn_tech_innovation.all_nvidia_patents (./models/sdf_snowflake/cybersyn_tech_innovation/all_nvidia_patents.sql)   Compiling sdf_snowflake.cybersyn_tech_innovation.most_starred_repos (./models/sdf_snowflake/cybersyn_tech_innovation/most_starred_repos.sql)   Compiling sdf_snowflake.staging.latest_repo_name (./models/sdf_snowflake/staging/latest_repo_name.sql)   Compiling sdf_snowflake.cybersyn_tech_innovation.events_by_repo (./models/sdf_snowflake/cybersyn_tech_innovation/events_by_repo.sql)    Finished 13 models [13 succeeded] in 1.970 secs

Woohoo! Another successful compile. Now, let’s try one final run of everything in this workspace to materialize our new view and run the entire DAG.

sdf run 

Congratulations! You’ve just created your first DAG in Snowflake with SDF.

Next Steps

Now that you’ve materialized your first tables and views in Snowflake with SDF, you can move on to more advanced topics like incremental materialization.