Databricks Spark Listener
Install SDF’s Spark Listener in your Databricks Cluster
SDF does not support transformation on Databricks. The SDF Spark Listener is only used to capture lineage from your Databricks Cluster. This is unlike Snowflake and Redshift integrations where you can run transformations against your cloud compute with sdf run
.
Support for sdf run
on Databricks with SparkSQL is coming soon.
The SDF Console can be configured to listen for Spark events on your Databricks Cluster and automatically ingest and analyze the lineage of your Spark Warehouse.
Prerequisites
Ensure that you have the following installed and configured locally before beginning.
Installation
Follow the steps below to install the SDF Spark Listener on your Databricks Cluster.
Upload SDF Spark Listener to Databricks File System
Using the databricks cli, create a directory on your Databricks File System (DBFS) called databricks/spark-listener
.
databricks fs mkdirs dbfs:/databricks/sdf-listener
Next, run the following to upload the latest version of the SDF Spark Listener to the dbfs directory you just created.
databricks fs cp <(curl -L https://cdn.sdf.com/spark-listener/releases/download/sdf-spark-listener-latest.jar) dbfs:/databricks/sdf-listener/sdf-spark-listener-latest.jar
Add the Spark Listener init script to your Workspace
Using the databricks cli, create a directory in your databricks workspace called /Shared/sdf-listener
.
databricks workspace mkdirs /Shared/sdf-listener
Log into the Databricks Workspace UI and navigate to Workspace in the side bar. Navigate to the /Shared/sdf-listener
directory. Once there, click the Add dropdown and select File. Name the file init-script.sh
and paste the following into the file.
#!/bin/bash
STAGE_DIR="/dbfs/databricks/sdf-listener"
echo "BEGIN: Upload Spark Listener JARs"
cp -f $STAGE_DIR/sdf-spark-listener*.jar /mnt/driver-daemon/jars || { echo "Error copying Spark Listener library file"; exit 1;}
echo "END: Upload Spark Listener JARs"
echo "BEGIN: Modify Spark config settings"
cat << 'EOF' > /databricks/driver/conf/openlineage-spark-driver-defaults.conf
[driver] {
"spark.extraListeners" = "io.openlineage.spark.agent.OpenLineageSparkListener"
}
EOF
echo "END: Modify Spark config settings"
Create a Databricks Integration in the SDF Console
Navigate to console.sdf.com and log in. Go to Settings > Integrations and click Connect Database.
Name your integration and select Databricks as the type. Click Next and follow the on-screen instructions to complete the integration.
Add generated Spark Config to Databricks Compute
In the Databricks UI, navigate to Compute in the sidebar. Select the cluster (or clusters) that you want to enable the SDF Spark Listener on.
Alternativelly, you can create a new compute cluster by clicking Create compute and following these instructions.
In the Cluster Configuration page, click Edit. Note, if your cluster is currently running, you will need to restart it in order for changes to apply. Expand the section labeled Advanced options and in the section labeled Spark, add the spark configuration generated in the previous step to the Spark Config section.
Next, click the Init Scripts Tab and select the init script located at /Shared/sdf-listener/init-script.sh
.
<SDF_CLUSTER_ENDPOINT>
, <SDF_WORKSPACE_NAME>
, and <SDF_PROJECT_NAME>
with your values.spark.openlineage.version v1
spark.extraListeners io.openlineage.spark.agent.OpenLineageSparkListener
spark.openlineage.debugFacet enabled
spark.openlineage.transport.type http
spark.openlineage.transport.url <SDF_CLUSTER_ENDPOINT>
spark.openlineage.transport.endpoint /api/v1/open-lineage/lineage
spark.openlineage.transport.headers.workspace <SDF_WORKSPACE_NAME>
spark.openlineage.transport.headers.project <SDF_PROJECT_NAME>
spark.openlineage.transport.headers.cluster <OPTIONAL_NAME_OF_CLUSTER>
spark.openlineage.transport.auth.type api_key
spark.openlineage.transport.auth.apiKey {{secrets/sdf-integration/events-access-key}}
Save your changes by clicking Confirm and then start your cluster by clicking Start.
(Optional) Run a Test Query
In a Databricks Notebook, select the Connect button and select the compute cluster that you just configured. Run the following query to test that the SDF Spark Listener is working.
spark.createDataFrame([
{'a': 1, 'b': 2},
{'a': 3, 'b': 4}
]).write.mode("overwrite").saveAsTable("default.test_sdf_integration")