sdftarget
When we say that SDF is a build system, we mean that it builds data like a build system builds binaries.
This means the best parts of build systems, like dependency tracking, caching, and reuse of artifacts, are all encompassed in the SDF vision.
In this section we’ll discuss the caching strategy of SDF and how it will save you time, compute, and resources as a data developer.
If you take a look in your SDF workspace, you’ll likely find a directory called sdftarget (hence the name of this reference). This is where SDF caches the data it builds, and much more.
SDF & Build Tools
SDF builds data from queries and external tables. SDF’s scheduling and caching solution is thus similar to the famous make build tool. Make is typically used to compile binaries, and a typical Make rule says that a binary has to be rebuilt if it is older than its source. In SDF this corresponds to a simple rule of table materialization:
A materialized table has to be rebuilt if its source file (a SQL query) or any of its materialized data dependencies (in the cache) have changed.
Features
Dependency Tracking. SDF analyzes all dependencies that are established through table definitions and table uses. Tables are defined via select and create table statements, while table uses are established via from clauses. SDF uses these definition-and-use relationships to build a dependency graph. Because analytical queries typically cannot be recursive, the dependencies form a partial order, and SDF schedules the dependency graph in definition-before-use order, as sketched below.
SDF cache. When SDF builds the dependencies in order, it does not only display the results but also materializes all tables in its data cache, called sdftarget. You can find this folder within your SDF workspace directory after running any command.
SDF reuse. Users and bots constantly build data. But data that is up-to-date doesn’t need to be rebuilt. To decide whether a query needs to be re-executed or whether a past materialization can be reused, SDF uses the file modification times of source files as well as of data files, the latter all located in its cache. The handling of data located in the cloud is covered later in this reference (see Caching in the cloud).
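To make the dependency tracking and reuse rules concrete, here is a minimal sketch of two models. The file names, table names, and columns are made up for illustration and are not taken from any particular SDF sample; in a workspace each model would normally live in its own .sql file.

```sql
-- orders.sql: defines table orders; it has no upstream table dependencies
create table orders as
select 1 as order_id, 19.99 as amount;

-- daily_revenue.sql: the from clause makes daily_revenue depend on orders,
-- so SDF schedules orders before daily_revenue. A change to orders.sql, or to
-- the cached data of orders, forces daily_revenue to be rebuilt as well.
select sum(amount) as revenue
from orders;
```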
Example
Let’s see this in practice by running a simple pipeline with three stages, where the first stage is a csv file. This sample is available within the CLI; follow the steps below to create it locally. (1) We create a new sample:
sdf new --sample csv123 .
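The sample ships a seed CSV (data/a.csv) and three models named one, two, and three. The exact SQL in the sample may differ, but a plausible sketch of the chain looks like this:

```sql
-- src/one.sql: reads the raw CSV data (assumed here to be exposed as a table
-- backed by data/a.csv; the actual table name in the sample may differ)
select * from a;

-- src/two.sql: depends on one via its from clause
select * from one;

-- src/three.sql: depends on two via its from clause
select * from two;
```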
(2) Next, we build the sample:
sdf build --no-show
Stdout shows that all tables are written into sdftarget under the respective date/hr paths.
(3) Let’s rerun the build. Since nothing has changed, all tables are simply reused from the cache.
sdf build --no-show
(4) Now let’s touch the source file of three:
touch src/three.sql
This has the effect that three.sql is now newer than any of the derived artifacts for three in the cache.
(5) If we rerun the build, SDF tells us that it reuses one and two but has to rebuild three.
sdf build --no-show
(6) Finally, let’s touch the original CSV data:
touch data/a.csv
(7) Changing the original data invalidates all cache entries, so SDF simply rebuilds everything.
sdf build
Caching partitioned data
SDF supports time-partitioned tables. Queries can read an arbitrary number of partitions, but can currently only write a single partition (see the Datetime Partitioning reference).
Data Paths
All partitions are mapped to file paths. A partitioned table named c.s.t (catalog.schema.table) that is partitioned by date and hour (with formats %Y-%m-%d and %H, respectively) stores its partitions under c/s/t/date=yyyy-mm-dd/hour=hh/part-*.parquet, where yyyy-mm-dd and hh depend on the schedule of the query. For example, if the query runs on an hourly schedule, we might find data artifacts for c.s.t under a concrete path like c/s/t/date=2023-04-04/hour=13/part-*.parquet
Data Dependencies
Now that we have mapped all time partitions to files, we can apply the same trick as above. Except this time, we track data dependencies not only per table but also per datetime, as determined by a schedule. So if a table t depends on m partitions of a cached table d, then t can only reuse d if all m partitions are still up-to-date. Apart from this generalization of data dependencies to partitioned files, everything else stays the same.
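As an illustration (with hypothetical table and column names), suppose a daily table rolls up an upstream table that is materialized in 6-hour partitions. Each daily partition then depends on four cached partitions of the upstream table and can only be reused if all four are still up-to-date:

```sql
-- daily_events.sql: materialized daily, reading the 6-hourly table events_6h.
-- One daily partition of daily_events depends on four cached partitions of
-- events_6h; if any of those four is stale, the daily partition is rebuilt.
select date_trunc('day', event_time) as day,
       count(*) as events
from events_6h
group by 1;
```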
Example
Let’s see this in practice with a pipeline that again has three stages, where the first is a csv file. (1) Let’s create a new sample where tables one and two are partitioned every 6 hours, and table three is partitioned daily.
sdf new --sample part123 .
(2) After changing directories into the part123 dir (cd part123), we can run sdf build to see how it creates these partitioned files:
sdf build --from 2022-01-02 --to 2022-01-03 --no-show
Running the same build again simply reuses the cached partitions under c/s/t/date=yyyy-mm-dd/hour=hh/part-*.parquet; since nothing has changed, nothing needs to be rebuilt.
(3) But what happens if the source input changes? Let’s say the data for 2022-01-03T00:00 changes. We can simulate that by touching the corresponding source file:
touch ...
(4) Let’s run the build again. Note that only the affected 6-hour partition needs to be rebuilt, followed by the last daily partition.
sdf build --no-show
End of Example.
Caching in the cloud
You are working on a local box, but you want to work with data in the cloud. SDF distinguishes two scenarios:
- Download: You want to download external data from the cloud
- Upload: You want to upload materialized data to the cloud
Both scenarios are illustrated by the s3_123 sample. The sample differs from csv_123 in two ways, one for download and one for upload. Let’s discuss download first. Whereas csv_123.pub.one is defined by a local csv file, the s3_123.pub.one table is defined by an s3 location.
(1) Let’s build:
sdf build --no-show
The data for one is downloaded on first active use.
(2) Rebuilding is very fast, since the cloud is not consulted at all.
sdf build --no-show
(3) To check whether the cloud data has changed, add --download. This will trigger a download, but only if needed. So here we see just a sync with the cloud but no actual download:
sdf build --no-show --download
Now let’s discuss upload, the second difference of the s3_123 example. To support upload, we have enriched the workspace with an upload location. Let’s build with upload enabled:
sdf build --no-show --upload
Running the command again behaves just like in the earlier examples: data that is still up-to-date is reused from the cache.
sdf build --no-show --upload
No save, no cache
In 99% of scenarios caching works amazingly. But in the remaining 1% it might actually degrade performance, for example
- if you know that all data is always fresh and thus never usable from the cache, or
- if writing the cache would be a huge performance penalty compared to just recomputing the result.
For these cases, SDF offers two flags:
- Use --no-save to suppress writing to the cache.
- Use --no-cache to suppress reading from the cache.
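For example, a pipeline whose inputs are always fresh and whose results are cheap to recompute could skip the cache entirely. Assuming the CLI accepts both flags in a single invocation, that would look like:
sdf build --no-cache --no-save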