Overview

This guide will walk you through the steps of creating an AWS S3 integration with SDF to analyze world population data.

Guide to walking through the AWS S3 example

1

Create Hello World S3 Workspace from Example

Begin by running the following command to create an SDF workspace from the example hello_world_s3.

sdf new --sample hello_world_s3 && cd hello_world_s3

This command will generate a new workspace containing a two environments - one configured to run data locally, the other configured to pull data down from an SDF-hosted public S3 bucket s3://sdf-datasets.

The project structure looks like this:

. ├── local │   ├── pop.csv │   ├── popdata.sql │   └── world_metrics.sql ├── remote │   ├── q1.sql │   └── un_pop_data.sql └── workspace.sdf.yml

2 directories, 6 files

2

Build using local population data

To start, let’s run the queries on the local CSV data packaged into the sample. To do this, we can use the local environment defined in the workspace.sdf.yml.

The local environment is configured to run the queries on the pop.csv file in the local directory. Let’s run our queries to get a sense for the data we’re working with.

sdf run --show all

Working set 1 .sdf file    Finished run in 0.843 secs

Since the default environment specified in the workspace.sdf.yml is local we don’t need specify the environment (-e) in the compile command.

3

Authenticate SDF using a local AWS Profile

Nice! We’re run the queries locally. Let’s move on to running them on remote data pulled down from S3.

Let’s start by authenticating SDF to AWS. This will be used to pull data from our public bucket (s3://sdfdatasets), so this can be any AWS profile you have available locally.

If you have not set up an AWS profile yet, follow the instructions on the Getting Started with S3 Guide

Run the following command:

sdf auth login aws --profile <profile>

This will create a new credential in a ~/.sdf/ directory in the root of your system. This credential will be used to authenticate with AWS services. By default, the credential’s name is default. As such, the credential does not need to be explicitly referenced in the integrations configuration below.

To validate the connection, run:

sdf auth status
4

Run using the remote environment

Now that we’ve successfully authenticated SDF with AWS, we can try running in the remote environment.

This command will pull down the data from remote, then run the queries on the downloaded data locally. It will then cache the results and the downloaded data, so that subsequent runs will be much faster.

Let’s try it!

sdf run -e remote --show all

Working set 2 model files, 1 .sdf file     Running hello_world.pub.un_pop_data (./remote/un_pop_data.sql) Downloading hello_world.pub.un_pop_data  (s3://sdfdatasets/hello-world/world_population_full.csv)     Running hello_world.pub.q1 (./remote/q1.sql)    Finished 2 models [2 succeeded] in 1.358 secs

Table hello_world.pub.un_pop_data ┌───────┬───────────┬───────────────────┬───────────────┬─────────────────┬───────┬──────┬──────────────────────┬──────────────────────────────────┬────────────┐ │ Index ┆ Variant   ┆ region_or_country ┆ Location_Code ┆ ISO3_Alpha_code ┆ Type  ┆ Year ┆ Population_Thousands ┆ Population_Density_Per_Square_KM ┆ Median_Age │ ╞═══════╪═══════════╪═══════════════════╪═══════════════╪═════════════════╪═══════╪══════╪══════════════════════╪══════════════════════════════════╪════════════╡ │ 1     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1950 ┆  2 477 675           ┆ 19.2                             ┆ 22.2       │ │ 2     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1951 ┆  2 520 970           ┆ 19.5                             ┆ 22.2       │ │ 3     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1952 ┆  2 565 291           ┆ 19.9                             ┆ 22.1       │ │ 4     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1953 ┆  2 615 251           ┆ 20.2                             ┆ 22.0       │ │ 5     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1954 ┆  2 665 307           ┆ 20.6                             ┆ 22.0       │ │ 6     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1955 ┆  2 718 652           ┆ 21.1                             ┆ 21.8       │ │ 7     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1956 ┆  2 773 493           ┆ 21.5                             ┆ 21.7       │ │ 8     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1957 ┆  2 828 513           ┆ 21.9                             ┆ 21.6       │ │ 9     ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1958 ┆  2 887 221           ┆ 22.4                             ┆ 21.5       │ │ 10    ┆ Estimates ┆ WORLD             ┆ 900           ┆                 ┆ World ┆ 1959 ┆  2 944 995           ┆ 22.8                             ┆ 21.5       │ └───────┴───────────┴───────────────────┴───────────────┴─────────────────┴───────┴──────┴──────────────────────┴──────────────────────────────────┴────────────┘ 20595 rows, showing only 10 rows.   Run with —limit 0 to show all rows.

Table hello_world.pub.q1 ┌───────────────────┬─────────────────┬──────────────────────────────────┐ │ region_or_country ┆ ISO3_Alpha_code ┆ Population_Density_Per_Square_KM │ ╞═══════════════════╪═════════════════╪══════════════════════════════════╡ │ Greenland         ┆                 ┆ 0.1                              │ └───────────────────┴─────────────────┴──────────────────────────────────┘ 1 rows.

Great! We’ve showcased how SDF can pull down remote data for local execution.

Run it again, and notice how much faster it is while using the cache :)