Getting Started with BigQuery and SDF
SDF as a best-in-class transformation layer for BigQuery
Overview
SDF can hydrate table schemas and execute queries against a BigQuery DB. This guide will walk you through the steps of creating a BigQuery integration. By the end, you will be able to reference tables in your BigQuery instance in SDF.
SDF will use remote BigQuery schema information to do type bindings and other static analysis checks on your workspace.
Prerequisites
In order to connect to BigQuery, you will need to have on hand the following:
- A BigQuery account and project with billing set up
- Valid credentials with read access to table metadata in your BigQuery account.
Guide
Collect Required Information
To connect to BigQuery, you’ll need to have created a service account, and a corresponding key for the service account. The following values are needed to authenticate SDF to BigQuery:
- Project ID - the project ID of the project you want to pull table metadata from
- Client Email - the email address of the service account
- Private Key - the private key of the service account
The easiest way to extract all the necessary information is to download the service account’s key as JSON, and fill in the values.
To create a service account key and download the JSON, follow this official GCP guide.
Your BigQuery service account will need the following permissions:
Connect SDF to BigQuery
To authenticate SDF to BigQuery, run the following command with the values from the previous step filled in:
Run sdf auth login bigquery --help
to see all login options. If the JSON path does not work or cannot be specific, the values can be copied in
as command flags like so:
After running this command, SDF will validate that it has access to all the necessary permissions for BigQuery with these credentials. A warning will show if SDF does not have the necessary permissions, or if there was a different issue authenticating to GCP.
This will create a new credential in a ~/.sdf/
directory in the root of your system. This credential will be used to authenticate with BigQuery services. By default, the
credential’s name is default. As such, the credential does not need to be explicitly referenced in the integrations configuration below.
To validate the connection, run:
Add BigQuery Provider in Workspace
Now that we’re authenticated, we can configure our workspace.sdf.yml
to use BigQuery as a provider.
The first thing we’ll need to do is set the default catalog
to our BigQuery project ID. This tells SDF to use this project as the default project for all queries.
Next, let’s add an integration block to the workspace.sdf.yml
. This tells SDF to use BigQuery to hydrate missing table schemas.
Replace <PROJECT_ID>
with the name of the project you want to pull table metadata from. Note this is configurable and can be changed to any project you have access to. For example, if I wanted SDF to pull from two projects, called proj1
and proj2
, I would write:
Integrations can also be configured to use specific credentials by referencing the credential name in the integration block. For example, if you wanted to use a credential called bq-creds
, you would write:
For more information on integration configuration, see the integration documentation.
Try it out!
Now that you’re connected, let’s make sure SDF can pull the schema information it needs.
Run sdf compile -q "select * from <PROJECT_ID>.<DATASET>.<TABLE>" --show all
If the connection is successful, you will see the schema for the table you selected.
SDF is capable of creating projects in BigQuery on sdf run
if the project does not already exist. This is useful for creating a new project in BigQuery programmatically on execution of new tables.
If you want to enable this feature, the service account used to authenticate SDF to BigQuery will need to following additional permissions:
Furthermore, this also requires the Cloud Resource Manager API to be enabled. If not already, please enable this API in your GCP project.
The term PROJECT
in BigQuery is interchangeable with the term CATALOG
in SDF. Similarly, the term DATASET
in BigQuery is interchangeable with SCHEMA
in SDF.