Materialize incremental models in Snowflake to save time and compute.
This guide assumes your Snowflake account has access to the TECH__INNOVATION_ESSENTIALS database (i.e. the Cybersyn database).
Create a New SDF Project from the Cybersyn Tech Innovation Sample
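Assuming the sample is named cybersyn_tech_innovation and your sdf CLI version supports the --sample flag, creating the project looks roughly like:

sdf new --sample cybersyn_tech_innovation

This scaffolds the workspace.sdf.yml and the model files referenced throughout this guide.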
Compile to Test Credentials
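To test your credentials, compile the workspace from its root directory:

sdf compile

If the credentials are valid, SDF downloads the remote schemas it needs and compiles the local models, producing output like the following.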
Working set 3 model files, 1 .sdf file
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.OPENALEX_FUNDERS_INDEX (schema)
Compiling sdf_snowflake.cybersyn_tech_innovation.funder_aggregates (./models/sdf_snowflake/cybersyn_tech_innovation/funder_aggregates.sql)
Finished 2 models [2 succeeded] in 1.782 secs
Schema sdf_snowflake.cybersyn_tech_innovation.FUNDER_AGGREGATES
┌──────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞══════════════════════╪════════════════╪════════════╪═════════════╡
│ FUNDER_NAME ┆ varchar ┆ ┆ │
│ TOTAL_FUNDERS ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_WORKS_COUNT ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_CITATIONS ┆ decimal(38, 0) ┆ ┆ │
│ TOTAL_GRANTS ┆ decimal(38, 0) ┆ ┆ │
│ UNIQUE_COUNTRY_COUNT ┆ decimal(38, 0) ┆ ┆ │
│ FIRST_CREATED_DATE ┆ timestamp ┆ ┆ │
│ LAST_UPDATED_DATE ┆ timestamp ┆ ┆ │
└──────────────────────┴────────────────┴────────────┴─────────────┘
If compilation succeeds, your credentials are working and SDF can read from Cybersyn's TECH__INNOVATION_ESSENTIALS database.
Materialize an Incremental Model
The Cybersyn database contains a table called github_events, which gets updated frequently with new events across all repositories on GitHub. We'll use this table to demonstrate incremental materialization with the append merge strategy.
Let's say we want to track push events to the Apache DataFusion repo. We can create a new model that finds all push events to DataFusion. Since the github_events
source data is updated frequently, we can use incremental materialization to only query new push events, thereby saving on compute and optimizing our pipeline.
Since we don’t care about updates to existing rows, we can use the append
merge strategy. This strategy appends new rows to the target table without updating existing rows.
Let's start by creating a file called datafusion_push_events.sql in the directory models/sdf_snowflake/staging/ and adding the following SQL to it:
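A minimal sketch of such a model, consistent with the compiled schema shown below (the repository name filter and the self-reference to the previous materialization are illustrative assumptions):

-- datafusion_push_events.sql: new push events to the DataFusion repo
SELECT *
FROM tech__innovation_essentials.cybersyn.github_events
WHERE type = 'PushEvent'
  AND repo_name = 'apache/datafusion'  -- assumed repository name
{% if builtin.is_incremental_mode %}
  -- incremental run: only events newer than the previous materialization (exact self-reference form may vary)
  AND created_at_timestamp > (SELECT MAX(created_at_timestamp) FROM sdf_snowflake.staging.datafusion_push_events)
{% else %}
  -- initial run: backfill the last week of events
  AND created_at_timestamp > DATEADD('week', -1, CURRENT_TIMESTAMP())
{% endif %}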
The query filters events on their type, repository name, and created_at_timestamp. The {% if builtin.is_incremental_mode %} block is a Jinja conditional that checks whether the model is being materialized incrementally. If it is, we only fetch rows that are newer than the newest row in the previous materialization of this table. If not, we fetch rows from the last week.
So when does SDF set builtin.is_incremental_mode to True? SDF sets this variable to True
when the model has already been materialized in the cloud.
Before running this model, we'll need to tell SDF to overwrite the default materialization for this table. We can do this by adding the following to the workspace.sdf.yml file:
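A sketch of that addition, assuming a standard table block in workspace.sdf.yml with a fully qualified model name:

---
table:
  # exact nesting and name qualification may differ in your workspace
  name: sdf_snowflake.staging.datafusion_push_events
  materialization: incremental-table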
, as such we don’t need to specify it in this YML (we’ll explore this later in the guide).
Now, let’s first compile our workspace with this new model:Working set 4 model files, 1 .sdf file
Downloading SDF_SNOWFLAKE.STAGING.DATAFUSION_PUSH_EVENTS (exists_remotely)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)
Compiling sdf_snowflake.staging.datafusion_push_events (./models/sdf_snowflake/staging/datafusion_push_events.sql)
Finished 2 models [2 succeeded] in 2.526 secs
Schema sdf_snowflake.staging.DATAFUSION_PUSH_EVENTS
┌────────────────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞════════════════════════════════╪════════════════╪════════════╪═════════════╡
│ ID ┆ varchar ┆ ┆ │
│ CREATED_AT_TIMESTAMP ┆ timestamp ┆ ┆ │
│ TYPE ┆ varchar ┆ ┆ │
│ ACTOR_AVATAR_URL ┆ varchar ┆ ┆ │
│ ACTOR_DISPLAY_LOGIN ┆ varchar ┆ ┆ │
│ ACTOR_GRAVATAR_ID ┆ varchar ┆ ┆ │
│ ACTOR_ID ┆ decimal(38, 0) ┆ ┆ │
│ ACTOR_LOGIN ┆ varchar ┆ ┆ │
│ ACTOR_URL ┆ varchar ┆ ┆ │
│ REPO_ID ┆ decimal(38, 0) ┆ ┆ │
│ REPO_NAME ┆ varchar ┆ ┆ │
│ REPO_URL ┆ varchar ┆ ┆ │
│ ORG_AVATAR_URL ┆ varchar ┆ ┆ │
│ ORG_GRAVATAR_ID ┆ varchar ┆ ┆ │
│ ORG_ID ┆ decimal(38, 0) ┆ ┆ │
│ ORG_LOGIN ┆ varchar ┆ ┆ │
│ ORG_URL ┆ varchar ┆ ┆ │
│ PAYLOAD ┆ variant ┆ ┆ │
│ PAYLOAD_ACTION ┆ varchar ┆ ┆ │
│ PAYLOAD_DESCRIPTION ┆ varchar ┆ ┆ │
│ PAYLOAD_COMMENT ┆ varchar ┆ ┆ │
│ PAYLOAD_MASTER_BRANCH ┆ varchar ┆ ┆ │
│ PAYLOAD_PULL_REQUEST ┆ varchar ┆ ┆ │
│ PAYLOAD_PUSHER_TYPE ┆ varchar ┆ ┆ │
│ PAYLOAD_PUSH_ID ┆ decimal(38, 0) ┆ ┆ │
│ PAYLOAD_HEAD ┆ varchar ┆ ┆ │
│ PAYLOAD_REF ┆ varchar ┆ ┆ │
│ PAYLOAD_REF_TYPE ┆ varchar ┆ ┆ │
│ PAYLOAD_ISSUE_ID ┆ decimal(38, 0) ┆ ┆ │
│ PAYLOAD_ISSUE ┆ variant ┆ ┆ │
│ PUBLIC ┆ boolean ┆ ┆ │
│ LOAD_DATE ┆ varchar ┆ ┆ │
│ CREATED_AT ┆ varchar ┆ ┆ │
│ PAYLOAD_BODY ┆ varchar ┆ ┆ │
│ PAYLOAD_COMMIT_ID ┆ varchar ┆ ┆ │
│ PAYLOAD_CREATED_AT ┆ varchar ┆ ┆ │
│ PAYLOAD_USER_ID ┆ decimal(38, 0) ┆ ┆ │
│ PAYLOAD_USER_LOGIN ┆ varchar ┆ ┆ │
│ ISSUE ┆ variant ┆ ┆ │
│ ISSUE_ACTIVE_LOCK_REASON ┆ varchar ┆ ┆ │
│ ISSUE_ASSIGNEE ┆ varchar ┆ ┆ │
│ ISSUE_ASSIGNEES ┆ variant ┆ ┆ │
│ ISSUE_AUTHOR_ASSOCIATION ┆ varchar ┆ ┆ │
│ ISSUE_BODY ┆ varchar ┆ ┆ │
│ ISSUE_CLOSED_AT ┆ varchar ┆ ┆ │
│ ISSUE_COMMENTS ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_COMMENTS_URL ┆ varchar ┆ ┆ │
│ ISSUE_CREATED_AT ┆ varchar ┆ ┆ │
│ ISSUE_DRAFT ┆ boolean ┆ ┆ │
│ ISSUE_EVENTS_URL ┆ varchar ┆ ┆ │
│ ISSUE_HTML_URL ┆ varchar ┆ ┆ │
│ ISSUE_ID ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_LABELS ┆ variant ┆ ┆ │
│ ISSUE_LABELS_URL ┆ varchar ┆ ┆ │
│ ISSUE_LOCKED ┆ boolean ┆ ┆ │
│ ISSUE_MILESTONE ┆ varchar ┆ ┆ │
│ ISSUE_NODE_ID ┆ varchar ┆ ┆ │
│ ISSUE_NUMBER ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_PERFORMED_VIA_GITHUB_APP ┆ varchar ┆ ┆ │
│ ISSUE_PULL_REQUEST ┆ variant ┆ ┆ │
│ ISSUE_REACTIONS ┆ variant ┆ ┆ │
│ ISSUE_REPOSITORY_URL ┆ varchar ┆ ┆ │
│ ISSUE_STATE ┆ varchar ┆ ┆ │
│ ISSUE_TIMELINE_URL ┆ varchar ┆ ┆ │
│ ISSUE_TITLE ┆ varchar ┆ ┆ │
│ ISSUE_UPDATED_AT ┆ varchar ┆ ┆ │
│ ISSUE_URL ┆ varchar ┆ ┆ │
│ ISSUE_USER_ID ┆ decimal(38, 0) ┆ ┆ │
│ ISSUE_USER_LOGIN ┆ varchar ┆ ┆ │
│ ISSUE_USER_TYPE ┆ varchar ┆ ┆ │
│ LANGUAGE ┆ varchar ┆ ┆ │
└────────────────────────────────┴────────────────┴────────────┴─────────────┘
Next, open the compiled query at sdftarget/dbg/materialized/sdf_snowflake/staging/datafusion_push_events.sql and inspect the SQL query.
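If the model follows the sketch above, the compiled query would look roughly like this, with the Jinja conditional resolved to the non-incremental branch:

-- compiled with builtin.is_incremental_mode = False
SELECT *
FROM tech__innovation_essentials.cybersyn.github_events
WHERE type = 'PushEvent'
  AND repo_name = 'apache/datafusion'  -- assumed repository name
  AND created_at_timestamp > DATEADD('week', -1, CURRENT_TIMESTAMP())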
Notice that the query was compiled with builtin.is_incremental_mode set to False. This is because we haven't materialized the table yet. Now, let's run the model to materialize the table for the first time.
Once the table exists in Snowflake, subsequent runs set builtin.is_incremental_mode to True, so the model only fetches rows from the github_events table that were created after the last materialization. Furthermore, you might notice something slightly different in the run output, specifically a line that says Preloading.
This line indicates that SDF found the previous materialization of the table, which is what it uses to set the is_incremental_mode builtin variable. Lastly, if you inspect the compiled query output again, you should see builtin.is_incremental_mode set to True and the query's SQL reflective of that.
Utilize the Merge Incremental Strategy
Now that we've seen the append strategy, let's explore the merge strategy. The merge strategy is useful when you want to update existing rows in the target table with new data from the source table. Suppose we want to keep a running count of push events per contributor to DataFusion; since each contributor's count needs to be updated as new events arrive, the merge strategy is a perfect fit for this use case.
Let's start by adding a new SQL file to our workspace called top_datafusion_contributors.sql in the directory models/sdf_snowflake/cybersyn_tech_innovation/ with the following SQL:
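A minimal sketch of such a model, consistent with the three-column schema shown below (the exact aggregation is illustrative):

-- top_datafusion_contributors.sql: running count of push events per contributor
SELECT
  actor_id,
  actor_display_login,
  COUNT(*) AS contribution_count
FROM sdf_snowflake.staging.datafusion_push_events
GROUP BY
  actor_id,
  actor_display_login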
Next, we'll again update our workspace.sdf.yml to set the materialization to incremental-table for this model:
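A sketch of that YML block, using the options described below (the exact nesting of incremental-options may vary by SDF version):

---
table:
  name: sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors
  materialization: incremental-table
  incremental-options:
    strategy: merge
    unique-key: actor_id
    merge-update-columns:
      - contribution_count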
Within incremental-options, we've specified the merge strategy. We've also specified the unique-key as actor_id and the merge-update-columns as contribution_count.
This tells SDF to update the contribution_count column in the target table with the new count from the source table when it finds two rows that match on their actor_id.
Let's compile the workspace with this new model:
Working set 5 model files, 1 .sdf file
Downloading SDF_SNOWFLAKE.STAGING.GITHUB_PUSH_EVENTS (exists_remotely)
Downloading SDF_SNOWFLAKE.CYBERSYN_TECH_INNOVATION.TOP_DATAFUSION_CONTRIBUTORS (exists_remotely)
Downloading TECH__INNOVATION_ESSENTIALS.CYBERSYN.GITHUB_EVENTS (schema)
Compiling sdf_snowflake.staging.datafusion_push_events (./models/sdf_snowflake/staging/datafusion_push_events.sql)
Compiling sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors (./models/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql)
Finished 3 models [3 succeeded] in 2.466 secs
Schema sdf_snowflake.cybersyn_tech_innovation.TOP_DATAFUSION_CONTRIBUTORS
┌─────────────────────┬────────────────┬────────────┬─────────────┐
│ column_name ┆ data_type ┆ classifier ┆ description │
╞═════════════════════╪════════════════╪════════════╪═════════════╡
│ ACTOR_ID ┆ decimal(38, 0) ┆ ┆ │
│ ACTOR_DISPLAY_LOGIN ┆ varchar ┆ ┆ │
│ CONTRIBUTION_COUNT ┆ decimal(38, 0) ┆ ┆ │
└─────────────────────┴────────────────┴────────────┴─────────────┘
As before, open the compiled query at sdftarget/dbg/materialized/sdf_snowflake/cybersyn_tech_innovation/top_datafusion_contributors.sql and inspect the SQL query.
You'll see builtin.is_incremental_mode is set to False and the table is simply being created. Let's run the model now to materialize the table in non-incremental mode. Once the table exists in Snowflake, the next compile sets builtin.is_incremental_mode to True. Inspect the compiled query output again and you should see builtin.is_incremental_mode set to True and the query's SQL reflective of that. Run the model once more: this time builtin.is_incremental_mode is set to True and, as a byproduct, the SQL query is a merge statement that updates the contribution_count column in the target table with the new count from the source table.
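As a rough illustration (not SDF's exact generated SQL), a merge of this shape in Snowflake matches rows on actor_id and updates contribution_count:

-- illustrative incremental merge keyed on actor_id
MERGE INTO sdf_snowflake.cybersyn_tech_innovation.top_datafusion_contributors AS target
USING (
  SELECT actor_id, actor_display_login, COUNT(*) AS contribution_count
  FROM sdf_snowflake.staging.datafusion_push_events
  GROUP BY actor_id, actor_display_login
) AS source
ON target.actor_id = source.actor_id
WHEN MATCHED THEN UPDATE SET target.contribution_count = source.contribution_count
WHEN NOT MATCHED THEN INSERT (actor_id, actor_display_login, contribution_count)
  VALUES (source.actor_id, source.actor_display_login, source.contribution_count);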
The merge strategy is a powerful tool for updating existing rows in the target table with new data from the source table. It's especially useful when you want to keep a running count of events or other metrics.