Databricks Spark Listener
Install SDF’s Spark Listener in your Databricks Cluster
SDF does not support transformations on Databricks. The SDF Spark Listener is only used to capture lineage from your Databricks Cluster. This is unlike the Snowflake and Redshift integrations, where you can run transformations against your cloud compute with sdf run. Support for sdf run on Databricks with SparkSQL is coming soon.
The SDF Console can be configured to listen for Spark events on your Databricks Cluster and automatically ingest and analyze the lineage of your Spark Warehouse.
Prerequisites
Ensure that you have the following installed and configured locally before beginning.
Databricks CLI
For more information on how to install, see the Databricks CLI Tutorial
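If you don't already have the CLI, a common way to install and configure the legacy Python-based Databricks CLI is sketched below; the newer standalone CLI has its own installers, so follow the Databricks CLI Tutorial for your platform.

```bash
# Install the legacy Databricks CLI (the newer standalone CLI is distributed separately)
pip install databricks-cli

# Configure authentication with your workspace URL and a personal access token
databricks configure --token
```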
Installation
Follow the steps below to install the SDF Spark Listener on your Databricks Cluster.
Upload SDF Spark Listener to Databricks File System
Using the Databricks CLI, create a directory on your Databricks File System (DBFS) called databricks/spark-listener.
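For example, using the Databricks CLI (the subcommand is `mkdirs` in the legacy CLI; newer CLI releases may name it slightly differently):

```bash
# Create the DBFS directory that will hold the SDF Spark Listener jar
databricks fs mkdirs dbfs:/databricks/spark-listener
```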
Next, run the following to upload the latest version of the SDF Spark Listener to the DBFS directory you just created.
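The exact artifact name and download location are provided by SDF; as a sketch, assuming you have already downloaded the listener jar locally under the hypothetical name sdf-spark-listener.jar:

```bash
# Upload the locally downloaded listener jar (hypothetical filename) to DBFS
databricks fs cp ./sdf-spark-listener.jar dbfs:/databricks/spark-listener/
```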
Add the Spark Listener init script to your Workspace
Using the Databricks CLI, create a directory in your Databricks workspace called /Shared/sdf-listener.
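For example:

```bash
# Create the workspace directory that will hold the init script
databricks workspace mkdirs /Shared/sdf-listener
```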
Log in to the Databricks Workspace UI and navigate to Workspace in the sidebar. Navigate to the /Shared/sdf-listener directory. Once there, click the Add dropdown and select File. Name the file init-script.sh and paste the following into the file.
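The actual script contents are provided by SDF; as an illustration only, a typical Databricks init script that installs a listener jar copies it from DBFS onto the cluster's classpath (the jar filename below is hypothetical):

```bash
#!/bin/bash
# Illustrative sketch: copy the SDF listener jar from DBFS into /databricks/jars
# so Spark picks it up at cluster startup. Use the actual script provided by SDF.
cp /dbfs/databricks/spark-listener/sdf-spark-listener.jar /databricks/jars/
```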
Create a Databricks Integration in the SDF Console
Navigate to console.sdf.com and log in. Go to Settings > Integrations and click Connect Database.
Name your integration and select Databricks as the type. Click Next and follow the on-screen instructions to complete the integration.
Add generated Spark Config to Databricks Compute
In the Databricks UI, navigate to Compute in the sidebar. Select the cluster (or clusters) that you want to enable the SDF Spark Listener on.
Alternatively, you can create a new compute cluster by clicking Create compute and following these instructions.
On the cluster configuration page, click Edit. Note that if your cluster is currently running, you will need to restart it for the changes to apply. Expand the section labeled Advanced options and, in the section labeled Spark, add the Spark configuration generated in the previous step to the Spark Config section.
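The exact keys and listener class are generated for you by the SDF Console; the snippet below only illustrates the format Databricks expects in the Spark Config section (one space-separated key/value pair per line), using hypothetical key names:

```
# Illustrative format only; use the configuration generated by the SDF Console
spark.extraListeners com.sdf.SdfSparkListener
spark.sdf.endpoint <SDF_CLUSTER_ENDPOINT>
spark.sdf.workspace <SDF_WORKSPACE_NAME>
spark.sdf.project <SDF_PROJECT_NAME>
```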
Next, click the Init Scripts tab and select the init script located at /Shared/sdf-listener/init-script.sh.
Make sure to replace <SDF_CLUSTER_ENDPOINT>, <SDF_WORKSPACE_NAME>, and <SDF_PROJECT_NAME> with your values. Save your changes by clicking Confirm, then start your cluster by clicking Start.
(Optional) Run a Test Query
In a Databricks Notebook, click the Connect button and select the compute cluster that you just configured. Run the following query to test that the SDF Spark Listener is working.
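Any query that reads from or writes to a table generates Spark events for the listener to capture. As a minimal sketch (the table name is arbitrary), run something like the following in a SQL cell:

```sql
-- Create a small test table and read it back; both statements emit Spark events
-- that the SDF Spark Listener should capture as lineage.
CREATE TABLE IF NOT EXISTS default.sdf_listener_test AS
SELECT 1 AS id, 'sdf listener test' AS message;

SELECT * FROM default.sdf_listener_test;
```

Once the query completes, the resulting lineage should appear in the SDF Console for the integration you configured.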