June 1st, 2023

We released an exciting new version of the CLI with many updates, bug fixes, and other improvements to SDF. Most importantly, it adds PySpark support.

With this release, you can now run sdf describe and sdf lineage on workspaces that contain PySpark pipelines, with the help of the new PySpark plugin. Here is an example of how to enable this functionality in a workspace definition file:

workspace:
  edition: "1.1"
  default-schema: scratch
  includes:
    - path: path/
---

plugin:
  name: pyspark
  binary: analyze_pyspark_pipeline.py
  extension: py
  format: spark-lp-json

Here, path/ contains Python pipeline files with the .py extension. The PySpark plugin relies on the analyze_pyspark_pipeline.py script, which accesses the user’s Databricks cluster and downloads the metadata of all the tables in the specified cluster.
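To preview which files a plugin entry like the one above would apply to, the sketch below lists files under an include path by extension. It is only an illustration: find_pipeline_files is a hypothetical helper, not part of SDF, and it assumes that matching is done purely on the file suffix and that subdirectories are searched.

```python
from pathlib import Path
from typing import List


def find_pipeline_files(include_path: str, extension: str = "py") -> List[Path]:
    """Return files under include_path whose suffix matches extension.

    Hypothetical helper: assumes suffix-based matching and a recursive
    search, which may differ from what the SDF plugin actually does.
    """
    root = Path(include_path)
    # rglob walks subdirectories; is_file() skips directories named *.py.
    return sorted(p for p in root.rglob(f"*.{extension}") if p.is_file())
```

Calling find_pipeline_files("path/") returns a sorted list of the .py files that would be candidates for analysis under these assumptions.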

This is an alpha release. Known limitations include:

  • Classifier propagation is not yet enabled for PySpark tables.
  • The analyze_pyspark_pipeline.py script accesses all tables in a catalog sequentially, which is slow. We may parallelize this or allow the user to specify a subset of schemas/databases to access.
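The two mitigations mentioned in the last limitation can be sketched in a few lines of Python. This is not how analyze_pyspark_pipeline.py is implemented; fetch_all_metadata and fake_fetch are hypothetical names, and the per-table cluster call is replaced by a stub. The sketch assumes table names have the form schema.table and that the per-table call is I/O-bound, so a thread pool helps.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List, Optional, Set


def fetch_all_metadata(
    tables: Iterable[str],
    fetch_one: Callable[[str], dict],
    schemas: Optional[Set[str]] = None,
    max_workers: int = 8,
) -> Dict[str, dict]:
    """Fetch metadata for 'schema.table' names concurrently.

    fetch_one is a hypothetical stand-in for the per-table cluster call.
    If schemas is given, only tables in those schemas are fetched.
    """
    selected: List[str] = [
        t for t in tables
        if schemas is None or t.split(".", 1)[0] in schemas
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so zip pairs names with results.
        results = list(pool.map(fetch_one, selected))
    return dict(zip(selected, results))


def fake_fetch(name: str) -> dict:
    # Stub in place of a real cluster request.
    return {"table": name, "columns": []}


metadata = fetch_all_metadata(
    ["scratch.orders", "scratch.users", "prod.events"],
    fake_fetch,
    schemas={"scratch"},
)
print(sorted(metadata))  # ['scratch.orders', 'scratch.users']
```

Restricting the schema set reduces the number of requests, while the thread pool overlaps the remaining ones; either change alone would shorten the sequential scan.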

What’s New?

  • PySpark support 😄
  • Introduced support for super types
  • Bug fixes and stability improvements

Latest Builds

| Architecture | Status | Version | Download |
|---|---|---|---|
| Linux Intel X86-64 | | 0.1.16 | Download |
| Linux Arm ARM-64 | | 0.1.16 | Download |
| Apple Intel X86-64 | | 0.1.16 | Download |
| Apple Arm AARCH-64 | | 0.1.16 | Download |