Getting Started with Redshift and SDF
SDF as a best-in-class transformation layer for Redshift
Overview
This guide will walk you through the steps of creating a Redshift integration. By the end, you will be able to reference remote Redshift tables in SDF as if they were locally defined!
SDF will use remote Redshift schema information to perform type binding, column-level lineage, and other static analysis checks on your workspace.
Redshift databases, schemas, and tables map 1:1 to SDF's catalog.schema.table model.
Guide using an IAM access_key and secret_key
Collect Required Information
To connect to Redshift via IAM credentials, you need a valid keypair with the minimum permissions required to read schema information in your Redshift instance.
- `access_key`: the access key of your IAM profile
- `secret_access_key`: the secret key of your IAM profile
- `CLUSTER_IDENTIFIER`: the Redshift cluster you want to connect to
Run `sdf auth login aws --help` to see all login options.
If you already have an IAM role, you may use that one. Otherwise, simply create a new IAM user. The policy below is the minimal allow policy needed for this integration.
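A sketch of such a policy is shown below. The specific actions (`redshift:GetClusterCredentials`, `redshift:DescribeClusters`) and resource ARNs are assumptions based on a typical read-only Redshift setup; `<ACCOUNT_ID>`, `<REGION>`, and `<CLUSTER_IDENTIFIER>` are placeholders you must fill in.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "redshift:GetClusterCredentials",
        "redshift:DescribeClusters"
      ],
      "Resource": [
        "arn:aws:redshift:<REGION>:<ACCOUNT_ID>:cluster:<CLUSTER_IDENTIFIER>",
        "arn:aws:redshift:<REGION>:<ACCOUNT_ID>:dbuser:<CLUSTER_IDENTIFIER>/*",
        "arn:aws:redshift:<REGION>:<ACCOUNT_ID>:dbname:<CLUSTER_IDENTIFIER>/*"
      ]
    }
  ]
}
```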
Register the User in the AWS CLI
The newly created user needs to be registered with the AWS CLI. You can optionally provide a profile name. For the default region, use the AWS region that your Redshift cluster is located in.
The example below configures a profile called redshift
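Assuming the AWS CLI is installed, the profile can be created interactively; the command will prompt for the access key, secret key, default region, and output format:

```shell
aws configure --profile redshift
```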
Connect SDF to AWS
Connect SDF to AWS by telling SDF which of your AWS profiles to use.
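A sketch of the login command, assuming `sdf auth login aws` accepts a `--profile` flag; run `sdf auth login aws --help` to confirm the exact options for your SDF version:

```shell
sdf auth login aws --profile redshift
```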
This will create a new credential in the ~/.sdf/ directory in your home directory. This credential will be used to authenticate with AWS services. By default, the credential's name is default, so it does not need to be explicitly referenced in the integrations configuration below.
To validate the connection, run the compile command shown in the Try it out section below.
Add Redshift Provider in Workspace
Once authenticated, add an integrations block to the workspace block of your `workspace.sdf.yml`. This tells SDF to use Redshift to hydrate missing table schemas.
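A sketch of the integration configuration, following the pattern SDF uses for database providers; the exact keys may differ by SDF version, and `<DATABASE>` is a placeholder:

```yaml
workspace:
  integrations:
    - provider: redshift
      type: database
      sources:
        - pattern: <DATABASE>.*.*
```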
Replace `<DATABASE>` with the name of the database you want to hydrate. Note this is configurable and can be changed to any database you have access to. For example, if I wanted SDF to pull from two databases, called `db1` and `db2`, I would write:
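Under the same assumed configuration shape as above, a two-database setup might look like:

```yaml
workspace:
  integrations:
    - provider: redshift
      type: database
      sources:
        - pattern: db1.*.*
        - pattern: db2.*.*
```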
Lastly, integrations can also be configured to use specific credentials by referencing the credential name in the integration block. For example, if you wanted to use a credential called `redshift-creds`, you would write:
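A sketch with the credential referenced by name; the `credential` key is an assumption, so check the integration documentation for the exact field name:

```yaml
workspace:
  integrations:
    - provider: redshift
      type: database
      credential: redshift-creds
      sources:
        - pattern: <DATABASE>.*.*
```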
For more information on provider configuration, see the integration documentation.
Try it out!
Now that you’re connected, let’s make sure SDF can pull the schema information it needs.
Run `sdf compile -q "select * from <DATABASE>.<SCHEMA>.<TABLE>" --show all`
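For example, with a hypothetical table `db1.public.orders`:

```shell
sdf compile -q "select * from db1.public.orders" --show all
```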
If the connection is successful, you will see the schema for the table you selected.
The IAM user behind the ARN you created also needs access to the database itself. Depending on your database's access policy, you might also have to grant explicit access to this new user within your Redshift cluster. You can do this by adding the IAM user to the appropriate Redshift group:
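A sketch using standard Redshift statements; the user name `my_user` and group name `readonly_group` are placeholders, and the `IAM:` prefix assumes the user authenticates via IAM credentials:

```sql
-- Create the database user if it does not already exist
CREATE USER "IAM:my_user" PASSWORD DISABLE;
-- Add the user to a group that has the needed privileges
ALTER GROUP readonly_group ADD USER "IAM:my_user";
```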
Troubleshooting Resources
- Seeing errors like `ERROR: permission denied for schema schema1` from `sdf run`? This may be due to the Redshift user not having the correct permissions on this schema. To check whether you have the right access, you can run the following query.
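A sketch of such a check against the `SVV_SCHEMA_PRIVILEGES` system view, assuming the schema from the error above is named `schema1`:

```sql
SELECT *
FROM svv_schema_privileges
WHERE namespace_name = 'schema1';
```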
Use `SVV_SCHEMA_PRIVILEGES` to view the schema owner, along with `schema_acl`, which shows the access control list for the schema. If your user does not have access to the schema, you can grant access by running the following query. To run this query, you must be either a superuser or the owner of the schema.
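A sketch of the grant, with `<username>` as a placeholder for the database user name:

```sql
GRANT USAGE ON SCHEMA schema1 TO "<username>";
-- Optionally also allow reads on existing tables in the schema:
GRANT SELECT ON ALL TABLES IN SCHEMA schema1 TO "<username>";
```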
The `<username>` is the user name of your IAM user, with the `IAM:` prefix. For example, if your IAM user name is `my_user`, you can run:
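A sketch with that example user, again assuming the schema is named `schema1`:

```sql
GRANT USAGE ON SCHEMA schema1 TO "IAM:my_user";
GRANT SELECT ON ALL TABLES IN SCHEMA schema1 TO "IAM:my_user";
```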