Executing on Data from S3
SDF DB can execute queries on data stored in S3 parquet, CSV, JSON, gzip, and more
Overview
This guide will walk you through the steps of creating an AWS S3 integration with SDF to analyze world population data.
Guide to walking through the AWS S3 example
Create Hello World S3 Workspace from Example
Begin by running the following command to create an SDF workspace from the example
hello_world_s3
.
This command will generate a new workspace containing a two environments - one configured to run data locally, the other configured to pull data down
from an SDF-hosted public S3 bucket s3://sdf-datasets
.
The project structure looks like this:
.
├── local
│ ├── pop.csv
│ ├── popdata.sql
│ └── world_metrics.sql
├── remote
│ ├── q1.sql
│ └── un_pop_data.sql
└── workspace.sdf.yml
2 directories, 6 files
Build using local population data
To start, let’s run the queries on the local CSV data packaged into the sample. To do this, we can use the local environment defined in the workspace.sdf.yml
.
The local environment is configured to run the queries on the pop.csv
file in the local
directory. Let’s run our queries to get a sense for the data we’re working with.
Working set 1 .sdf file
Finished run in 0.843 secs
Since the default environment specified in the workspace.sdf.yml
is local
we don’t need specify the environment (-e
) in the compile command.
Authenticate SDF using a local AWS Profile
Nice! We’re run the queries locally. Let’s move on to running them on remote data pulled down from S3.
Let’s start by authenticating SDF to AWS. This will be used to pull data from our public bucket (s3://sdfdatasets
),
so this can be any AWS profile you have available locally.
If you have not set up an AWS profile yet, follow the instructions on the Getting Started with S3 Guide
Run the following command:
This will create a new credential in a ~/.sdf/
directory in the root of your system. This credential will be used to authenticate with
AWS services. By default, the credential’s name is default. As such, the credential does
not need to be explicitly referenced in the integrations configuration below.
To validate the connection, run:
Run using the remote environment
Now that we’ve successfully authenticated SDF with AWS, we can try running in the remote environment.
This command will pull down the data from remote, then run the queries on the downloaded data locally. It will then cache the results and the downloaded data, so that subsequent runs will be much faster.
Let’s try it!
Working set 2 model files, 1 .sdf file
Running hello_world.pub.un_pop_data (./remote/un_pop_data.sql)
Downloading hello_world.pub.un_pop_data (s3://sdfdatasets/hello-world/world_population_full.csv)
Running hello_world.pub.q1 (./remote/q1.sql)
Finished 2 models [2 succeeded] in 1.358 secs
Table hello_world.pub.un_pop_data
┌───────┬───────────┬───────────────────┬───────────────┬─────────────────┬───────┬──────┬──────────────────────┬──────────────────────────────────┬────────────┐
│ Index ┆ Variant ┆ region_or_country ┆ Location_Code ┆ ISO3_Alpha_code ┆ Type ┆ Year ┆ Population_Thousands ┆ Population_Density_Per_Square_KM ┆ Median_Age │
╞═══════╪═══════════╪═══════════════════╪═══════════════╪═════════════════╪═══════╪══════╪══════════════════════╪══════════════════════════════════╪════════════╡
│ 1 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1950 ┆ 2 477 675 ┆ 19.2 ┆ 22.2 │
│ 2 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1951 ┆ 2 520 970 ┆ 19.5 ┆ 22.2 │
│ 3 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1952 ┆ 2 565 291 ┆ 19.9 ┆ 22.1 │
│ 4 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1953 ┆ 2 615 251 ┆ 20.2 ┆ 22.0 │
│ 5 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1954 ┆ 2 665 307 ┆ 20.6 ┆ 22.0 │
│ 6 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1955 ┆ 2 718 652 ┆ 21.1 ┆ 21.8 │
│ 7 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1956 ┆ 2 773 493 ┆ 21.5 ┆ 21.7 │
│ 8 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1957 ┆ 2 828 513 ┆ 21.9 ┆ 21.6 │
│ 9 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1958 ┆ 2 887 221 ┆ 22.4 ┆ 21.5 │
│ 10 ┆ Estimates ┆ WORLD ┆ 900 ┆ ┆ World ┆ 1959 ┆ 2 944 995 ┆ 22.8 ┆ 21.5 │
└───────┴───────────┴───────────────────┴───────────────┴─────────────────┴───────┴──────┴──────────────────────┴──────────────────────────────────┴────────────┘
20595 rows, showing only 10 rows.
Run with —limit 0 to show all rows.
Table hello_world.pub.q1
┌───────────────────┬─────────────────┬──────────────────────────────────┐
│ region_or_country ┆ ISO3_Alpha_code ┆ Population_Density_Per_Square_KM │
╞═══════════════════╪═════════════════╪══════════════════════════════════╡
│ Greenland ┆ ┆ 0.1 │
└───────────────────┴─────────────────┴──────────────────────────────────┘
1 rows.
Great! We’ve showcased how SDF can pull down remote data for local execution.
Run it again, and notice how much faster it is while using the cache :)