Workspace Indexing
This document aims to outline the concept of indexing within the SDF CLI and outline a specific example to better articulate how to take advantage of this powerful feature.
To further increase SDF’s velocity, it is possible to restructure the workspace and allow SDF to index it.
It is recommended to store all models within a directory structure of
catalog/schema/table-name
. Following this structure, SDF auto-applies
fully qualified names to the DDLs allowing for faster search and compilation.
Within SDF, the process of indexing begins with the first sdf compile
command.
This creates an index of the local filesystem and workspace, allowing for files
and dependencies to be mapped. The default index type, unless specified within
the workspace, is set to none
.
In the following example, the csv_123
sample workspace is used. If this workspace is not already
set up, it can be created with an sdf new --sample csv_123
command.
Example
Basic Index
Within a workspace.sdf.yml file, the index field can be updated to specify the type of index desired. In this example, the following csv_123 workspace can be edited and utilized where index is specified as none.
workspace:
name: csv_123
edition: "1.3"
defaults:
dialect: trino
includes:
- path: models
index: none
- path: local
index: none
- path: data
type: resource
By setting the index value to none
, every time you run sdf compile
SDF begins a lookup to
find the SQL defining the table in question. SDF begins by parsing a query, understands that it
depends on another table and attempts to find the subsequent table. With indexing set appropriately,
SDF knows exactly where the query files needed are stored. Comparitively, without indexing, or
indexing set to none
, SDF has to scan until it located the query needed.
In this example, the current directory is set to:
csv_123
├── data
│ └── a.csv
├── local
│ └── one.sql
├── remote
│ └── one.sql
├── src
│ ├── three.sql
│ └── two.sql
└── workspace.sdf.yml
And running an SDF compile results in a total run-time of 0.162 seconds:
Working set 3 model files, 1 .sdf file
Compiling example.pub.one (./local/one.sql)
Compiling example.pub.two (./src/two.sql)
Compiling example.pub.three (./src/three.sql)
Finished 3 models [3 succeeded] in 0.162 secs
Indexing at Scale with Catalog-Schema-Table-Name
Utilizing the default index function of none
for indexing works at reasonable scale,
but when it comes to large organizations that have thousands of tables utilizing a different
index function can drastically increase the speed of the process.
Three other options are the following:
catalog-schema-table-name
schema-table-name
table-name
Let’s test the speed improvement by changing the index and directories:
Begin by specifying the index type to catalog-schema-table-name
and defining defaults for the catalog
and schema
:
workspace:
name: csv_123
edition: "1.3"
defaults:
catalog: example
schema: pub
dialect: trino
includes:
- path: models
index: catalog-schema-table-name
- path: local
index: catalog-schema-table-name
- path: data
type: resource
Making the above change to index, where we specify catalog-schema-table-name
as
the index type, requires that under src
and local
files are structured in a directories like
<catalog>/<schema>/<table-name>.sql
.
csv_123
├── data
│ └── a.csv
├── local
│ └── example
│ └── pub
│ └── one.sql
├── remote
│ └── one.sql
├── src
│ └── example
│ └── pub
│ ├── three.sql
│ └── two.sql
└── workspace.sdf.yml
SDF workspaces default the catalog name to the workspace name specified in workspace.sdf.yml
. It is
important to add and update the catalog
field within defaults
of the workspace.sdf.yml
file to reflect
the appropriate catalog SDF will use for indexing.
Running sdf compile
after making these changes and updating the index results in an improvemnet of total
time to run, reducing down to less than 0.1 seconds.
Working set 3 model files, 1 .sdf file
Compiling example.pub.one (./local/one.sql)
Compiling example.pub.two (./src/example/pub/two.sql)
Compiling example.pub.three (./src/example/pub/three.sql)
Finished 3 models [3 succeeded] in 0.094 secs
Indexing with Schema-Table-Name
Another option, if your directories are set up as <schema>/<table-name>
then you
are able to use the index type schema-table-name.
To test this example type, update the workspace.sdf.yml
to reflect the new indexing types and update
the folder directories to reflect this change. It is important to set the catalog
within defaults
on each
of the includes paths. For example, add the catalog
default to each path, src
and local
below.
If the default catalog is not set in the top-level workspace block, the catalog will default to the workspace name.
workspace:
name: csv_123
edition: "1.3"
defaults:
schema: pub
dialect: trino
includes:
- path: models
index: schema-table-name
defaults:
catalog: example
- path: local
index: schema-table-name
defaults:
catalog: example
- path: data
Working set 3 model files, 1 .sdf file
Compiling csv_123.pub.one (./local/pub/one.sql)
Compiling csv_123.pub.two (./src/pub/two.sql)
Compiling csv_123.pub.three (./src/pub/three.sql)
Finished 3 models [3 succeeded] in 0.081 secs