Getting Started with BigQuery and SDF
SDF as a best-in-class transformation layer for BigQuery
Overview
SDF can hydrate table schemas and execute queries against a BigQuery DB. This guide will walk you through the steps of creating a BigQuery integration. By the end, you will be able to reference tables in your BigQuery instance in SDF.
SDF will use remote BigQuery schema information to do type bindings and other static analysis checks on your workspace.
Prerequisites
In order to connect to BigQuery, you will need to have on hand the following:
- A BigQuery account and project with billing set up
- Valid credentials with read access to table metadata in your BigQuery account.
Guide
Collect Required Information
To connect to BigQuery, you’ll need to have created a service account, and a corresponding key for the service account. The following values are needed to authenticate SDF to BigQuery:
- Project ID - the project ID of the project you want to pull table metadata from
- Client Email - the email address of the service account
- Private Key - the private key of the service account
The easiest way to extract all the necessary information is to download the service account’s key as JSON, and fill in the values.
To create a service account key and download the JSON, follow this official GCP guide.
Your BigQuery service account will need the following permissions:
Connect SDF to BigQuery
To authenticate SDF to BigQuery, run the following command with the values from the previous step filled in:
sdf auth login bigquery --project-id <PROJECT_ID> --client-email <CLIENT_EMAIL> --private-key <PRIVATE_KEY>
Run sdf auth login bigquery --help
to see all login options.
After running this command, SDF will validate that it has access to all the necessary permissions for BigQuery with these credentials. A warning will show if SDF does not have the necessary permissions, or if there was a different issue authenticating to GCP.
This will create a new credential in a ~/.sdf/
directory in the root of your system. This credential will be used to authenticate with BigQuery services. By default, the
credential’s name is default. As such, the credential does not need to be explicitly referenced in the integrations configuration below.
To validate the connection, run:
sdf auth status
Add BigQuery Provider in Workspace
Once authenticated, add an integration block to your workspace.sdf.yml
. This tells SDF to use BigQuery to hydrate missing table schemas.
workspace:
...
integrations:
- provider: bigquery
type: database
sources:
- pattern: <PROJECT_ID>.*.*
Replace <PROJECT_ID>
with the name of the project you want to pull table metadata from. Note this is configurable and can be changed to any project you have access to. For example, if I wanted SDF to pull from two projects, called proj1
and proj2
, I would write:
workspace:
...
integrations:
- provider: BigQuery
type: database
sources:
- pattern: proj1.*.*
- pattern: proj2.*.*
Integrations can also be configured to use specific credentials by referencing the credential name in the integration block. For example, if you wanted to use a credential called bq-creds
, you would write:
workspace:
...
integrations:
- provider: BigQuery
type: database
credential: bq-creds
sources:
- pattern: proj1.dataset1.mytable
- pattern: proj2.dataset2.*
For more information on integration configuration, see the integration documentation.
Try it out!
Now that you’re connected, let’s make sure SDF can pull the schema information it needs.
Run sdf compile -q "select * from <PROJECT_ID>.<DATASET>.<TABLE>" --show all
If the connection is successful, you will see the schema for the table you selected.
SDF is capable of creating projects in BigQuery on sdf run
if the project does not already exist. This is useful for creating a new project in BigQuery programmatically on execution of new tables.
If you want to enable this feature, the service account used to authenticate SDF to BigQuery will need to following additional permissions:
Furthermore, this also requires the Cloud Resource Manager API to be enabled. If not already, please enable this API in your GCP project.
The term PROJECT
in BigQuery is interchangeable with the term CATALOG
in SDF. Similarly, the term DATASET
in BigQuery is interchangeable with SCHEMA
in SDF.