Basic Materialization
Materialize tables and views in Snowflake with SDF.
Overview
In this guide, we’ll materialize our first tables and views in Snowflake with SDF. We’ll be using some datasets from our partner, Cybersyn, to bootstrap some example data through Snowflake’s data marketplace.
Prerequisites
This guide should be followed after completing the Getting Started with Snowflake and SDF guide.
Before beginning, we’ll need to make sure we have the following:
- A Snowflake account with this Cybersyn dataset installed.
- Valid Snowflake username / password credentials with write access to at least one database we can materialize tables to.
- Instantiated credentials completed in the previous guide.
When installing the Cybersyn dataset, make sure to grant the Snowflake role you’ll use with SDF read access to TECH__INNOVATION_ESSENTIALS (i.e., the Cybersyn database).
Guide
Create a New SDF Project from the Cybersyn Tech Innovation Sample
Create a new SDF project using the Cybersyn Tech Innovation sample. This will create a new project in your current working directory with the sample project files.
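From a terminal, this looks roughly like the following (the sample name is taken from the SDF samples list and may differ across CLI versions; check `sdf new --help` for the current syntax):

```shell
sdf new --sample cybersyn_tech_innovation
```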
Compile to Test Credentials
To ensure your credentials are working and have read access to the new Cybersyn database, let’s try compiling one of the models.
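A single model can be compiled by passing its file path to `sdf compile` (the path below follows the sample project's layout):

```shell
sdf compile models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql
```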
Working set 3 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.OPENALEX_FUNDERS_INDEX (schema)
Compiling sdf_snowflake.cybersyn_tech_innovation.funder_aggregates (./models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql)
Finished 2 models [2 succeeded] in 1.724 secs
Schema sdf_snowflake.cybersyn_tech_innovation.FUNDER_AGGREGATES
┌──────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞══════════════════════╪════════════════╪════════════╪═════════════╡
│ FUNDER_NAME ┆ varchar ┆ ┆ │
│ TOTAL_FUNDERS ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_WORKS_COUNT ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_CITATIONS ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_GRANTS ┆ decimal(38, 0) ┆ ┆ │
│ UNIQUE_COUNTRY_COUNT ┆ decimal(38, 0) ┆ ┆ │
│ FIRST_CREATED_DATE ┆ timestamp ┆ ┆ │
│ LAST_UPDATED_DATE ┆ timestamp ┆ ┆ │
└──────────────────────┴────────────────┴────────────┴─────────────┘
If you do not see a successful compilation, please ensure you’ve:
- Followed the Getting Started with Snowflake and SDF guide to authenticate to your Snowflake account.
- Granted the correct role read access to the TECH__INNOVATION_ESSENTIALS database.
Materialize Tables in Snowflake
If you compiled successfully, great! That means SDF was able to read table schemas from the Cybersyn database. Now, let’s materialize our models in a separate database called sdf_snowflake.
You’ll notice the workspace.sdf.yml file specifies the following defaults:
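For reference, the defaults section looks roughly like this sketch (the surrounding workspace keys, including the name, are illustrative; your generated workspace.sdf.yml is authoritative):

```yml
workspace:
  name: cybersyn_tech_innovation  # placeholder; use the name generated by the sample
  defaults:
    dialect: snowflake
    preprocessor: jinja
    materialization: view
```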
Let’s go through these one by one:
- dialect: snowflake specifies that the SQL we’re working with uses Snowflake’s syntax and functions.
- preprocessor: jinja specifies that we’re using Jinja templating in our SQL models (this will come in handy later).
- materialization: view specifies that we’re materializing our models as views by default in Snowflake. Note this can be overridden on a per-model or per-includes-path basis.
Next, let’s look at the integrations block:
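The block is sketched below; treat the exact field names as assumptions drawn from SDF’s integration configuration, and defer to the file generated by the sample:

```yml
workspace:
  integrations:
    - provider: snowflake
      type: database
      sources:
        - pattern: tech__innovation_essentials.cybersyn.*
      targets:
        - pattern: sdf_snowflake.*.*
```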
This tells SDF to hydrate missing table schemas from the tech__innovation_essentials.cybersyn database and schema, and to materialize tables to the sdf_snowflake database. For more information on integration configuration, see the integration documentation.
Now, let’s materialize the tables in the sdf_snowflake database.
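A plain run materializes every model in the workspace:

```shell
sdf run
```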
This is great, but what if we want to see the data our views produced?
Let’s add a small flag to sdf run that tells SDF to pull down a sample of the data from the views we just materialized.
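The flag shown below is an assumption based on SDF’s `--show` option; confirm the exact name and values with `sdf run --help`:

```shell
sdf run --show all
```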
Congratulations! You’ve just materialized your first tables in Snowflake with SDF, and even got a sneak peek at the data they produced.
Materialize a New Table
Let’s say we want to materialize a new table in the sdf_snowflake database, but in a different schema. You’ll notice in the workspace.sdf.yml file, we have the following includes block:
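A sketch of the block (the index value shown is the one this path layout implies; check your generated workspace.sdf.yml for the exact keys):

```yml
workspace:
  includes:
    - path: models
      index: database-schema-table-name
```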
The index here means we’ll interpret the database and schema name from the file path like so: models/&lt;database&gt;/&lt;schema&gt;/&lt;table&gt;.sql.
Since we want to add a new table, let’s create a new file in the models directory called sdf_snowflake/staging/latest_repo_name.sql with the following content:
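A plausible version of the model is sketched below; the filter and ordering column (created_at) is an assumption inferred from the described behavior, so adjust it to the actual Cybersyn github_repos schema:

```sql
-- latest_repo_name.sql: the most recent name for each repo created in the last 24 hours
SELECT
    repo_name,
    repo_id
FROM tech__innovation_essentials.cybersyn.github_repos
-- created_at is an assumed column name; check the source schema
WHERE created_at >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
QUALIFY ROW_NUMBER() OVER (PARTITION BY repo_id ORDER BY created_at DESC) = 1
```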
This table will select the latest repo_name and repo_id from github_repos created in the last 24 hours.
Now let’s compile this new table to ensure the SQL is valid and selects columns that actually exist:
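As before, compile just this model by path:

```shell
sdf compile models/sdf_snowflake/staging/latest_repo_name.sql
```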
Working set 4 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_REPOS (schema)
Compiling sdf_snowflake.staging.latest_repo_name (./models/sdf_snowflake/staging/latest_repo_name.sql)
Finished 2 models [2 succeeded] in 1.744 secs
Schema sdf_snowflake.staging.LATEST_REPO_NAME
┌─────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═════════════╪════════════════╪════════════╪═════════════╡
│ REPO_NAME ┆ varchar ┆ ┆ │
│ REPO_ID ┆ decimal(38, 0) ┆ ┆ │
└─────────────┴────────────────┴────────────┴─────────────┘
Great! Everything compiled. Now, let’s make sure this table is materialized as a table instead of a view in the sdf_snowflake database. We’re doing this now since later on we’ll want to use this table as an upstream dependency in our DAG.
To accomplish this, we’ll need to override our materialization default in the workspace.sdf.yml file and specify the materialization as table for this specific model.
In our workspace.sdf.yml file, we can add the following table block:
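A sketch of the block (whether the name is schema-qualified or fully qualified follows your workspace’s name resolution; adjust if needed):

```yml
---
table:
  name: sdf_snowflake.staging.latest_repo_name
  materialization: table
```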
SDF YML blocks are composable like legos. You can add as many blocks as you need to your workspace.sdf.yml file, or even split them out into separate files and include them with the type: metadata in your includes block.
Creating separate YML files for each purpose or model can help keep your configuration organized and easy to maintain.
Now, let’s run the model to materialize the table in the sdf_snowflake database.
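Passing the model’s path restricts the run to just that model:

```shell
sdf run models/sdf_snowflake/staging/latest_repo_name.sql
```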
Working set 4 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_REPOS (schema)
Running sdf_snowflake.staging.latest_repo_name (./models/sdf_snowflake/staging/latest_repo_name.sql)
Finished 2 models [2 succeeded] in 5.412 secs
Table sdf_snowflake.staging.LATEST_REPO_NAME
┌───────────────────────────────┬───────────┐
│ REPO_NAME ┆ REPO_ID │
╞═══════════════════════════════╪═══════════╡
│ 0-1CxH/megatron-wrap ┆ 904669875 │
│ 0-CHR/my-cert ┆ 904731422 │
│ 0-CHR/test-cert ┆ 904730980 │
│ 0-Porcelain/0-Porcelain ┆ 904578837 │
│ 0-chirag-s/GymAI ┆ 904776614 │
│ 0-entropy/AdguardFilters ┆ 904965609 │
│ 0-entropy/breakless-adblocker ┆ 904937190 │
│ 0-entropy/uAssets ┆ 904970966 │
│ 0-ml/speft ┆ 904614261 │
│ 00000000koma/koma ┆ 904770739 │
└───────────────────────────────┴───────────┘
10 rows.
Congratulations! You’ve just materialized your first table in Snowflake with SDF. In the next step, we’ll add one more table and a downstream view to create a DAG.
Create a DAG
Now that we’ve materialized a new table, let’s create a DAG by adding a downstream view that depends on the table we just created and one more table we’ll materialize now.
The DAG will look like this:
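Conceptually, the dependencies form this shape (a plain-text sketch):

```
latest_repo_name ──┐
                   ├──> events_by_repo
push_events ───────┘
```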
Let’s first create a new file in the models directory called sdf_snowflake/staging/push_events.sql with the following content:
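A plausible version of the model is sketched below; the column names type and created_at_timestamp are assumptions inferred from the described behavior and the compiled schema (ID, REPO_ID), so check them against the Cybersyn github_events table:

```sql
-- push_events.sql: all push events from the last 24 hours
SELECT
    id,
    repo_id
FROM tech__innovation_essentials.cybersyn.github_events
-- type and created_at_timestamp are assumed column names; check the source schema
WHERE type = 'PushEvent'
  AND created_at_timestamp >= DATEADD('hour', -24, CURRENT_TIMESTAMP())
```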
This table will select all push events from github_events created in the last 24 hours.
Now let’s compile this new table to ensure the SQL is valid and selects columns that actually exist and are properly typed:
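Again, compile just this model by path:

```shell
sdf compile models/sdf_snowflake/staging/push_events.sql
```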
Working set 5 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)
Compiling sdf_snowflake.staging.push_events (./models/sdf_snowflake/staging/push_events.sql)
Finished 2 models [2 succeeded] in 1.875 secs
Schema sdf_snowflake.staging.PUSH_EVENTS
┌─────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═════════════╪════════════════╪════════════╪═════════════╡
│ ID ┆ varchar ┆ ┆ │
│ REPO_ID ┆ decimal(38, 0) ┆ ┆ │
└─────────────┴────────────────┴────────────┴─────────────┘
Great! Everything compiled. Now, let’s make sure this table is materialized as a table instead of a view in the sdf_snowflake database.
Let’s add another table block to our workspace.sdf.yml file:
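The new block mirrors the one we added earlier (same caveat about name qualification):

```yml
---
table:
  name: sdf_snowflake.staging.push_events
  materialization: table
```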
Now, let’s run the model to materialize the table in the sdf_snowflake database.
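As before, pass the model’s path to run only this model:

```shell
sdf run models/sdf_snowflake/staging/push_events.sql
```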
Great! Now that we have our two tables, let’s finish up our DAG by creating a downstream view that depends on these two tables.
Create a new file in the models directory called sdf_snowflake/cybersyn_tech_innovation/events_by_repo.sql with the following content:
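A sketch of the view, joining the two staging tables we just materialized (the aggregation shown is illustrative):

```sql
-- events_by_repo.sql: push event counts per repo, joined to the latest repo names
SELECT
    r.repo_name,
    COUNT(e.id) AS push_event_count
FROM sdf_snowflake.staging.push_events AS e
JOIN sdf_snowflake.staging.latest_repo_name AS r
    ON e.repo_id = r.repo_id
GROUP BY r.repo_name
ORDER BY push_event_count DESC
```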
Since our workspace’s default materialization is view, we don’t need to specify a materialization for this model.
Now, let’s compile again to ensure our entire workspace is still in a good state:
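With no path argument, sdf compile checks every model in the workspace:

```shell
sdf compile
```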
Working set 6 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_STARS (schema)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.USPTO_CONTRIBUTOR_INDEX (schema)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.USPTO_PATENT_INDEX (schema)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_REPOS (schema)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.OPENALEX_FUNDERS_INDEX (schema)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.USPTO_PATENT_CONTRIBUTOR_RELATIONSHIPS (schema)
Compiling sdf_snowflake.staging.push_events (./models/sdf_snowflake/staging/push_events.sql)
Compiling sdf_snowflake.cybersyn_tech_innovation.funder_aggregates (./models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql)
Compiling sdf_snowflake.cybersyn_tech_innovation.all_nvidia_patents (./models/sdf_snowflake/cybersyn_tech_innovation/all_nvidia_patents.sql)
Compiling sdf_snowflake.cybersyn_tech_innovation.most_starred_repos (./models/sdf_snowflake/cybersyn_tech_innovation/most_starred_repos.sql)
Compiling sdf_snowflake.staging.latest_repo_name (./models/sdf_snowflake/staging/latest_repo_name.sql)
Compiling sdf_snowflake.cybersyn_tech_innovation.events_by_repo (./models/sdf_snowflake/cybersyn_tech_innovation/events_by_repo.sql)
Finished 13 models [13 succeeded] in 1.970 secs
Woohoo! Another successful compile. Now, let’s try one final run of everything in this workspace to materialize our new view and run the entire DAG.
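A full run, with no path argument, materializes everything in dependency order:

```shell
sdf run
```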
Congratulations! You’ve just created your first DAG in Snowflake with SDF.
Next Steps
Now that you’ve materialized your first tables and views in Snowflake with SDF, you can move on to more advanced topics like incremental materialization.