Understanding Reports

Reports are a useful way to track & analyze datatypes, metadata, and business logic throughout your data warehouse. Reports are expressed as SQL queries against the SDF Information Schema, similar to Checks. Since reports are generated with Static Analysis, executing them is fast. Reports are simply small SQL queries that run against the information schema and enable you to analyze your warehouse and metadata. Some examples of effective reports are:

  • Analyze Privacy Compliance: Get a list of all tables and columns that contain PII
  • Create a Data Deletion Report: Improve your data deletion capabilities by tracking retention and writing a report to find all tables that contain PII and are past their retention period.
  • Run a Dead Column Analysis: Optimize your warehouse by computing a list of all columns that have no dependencies, then remove them to save on storage and compute costs.

A report is very similar to a check. A check is a specialized report where the expected result of the report is empty. In this way, reports can be thought of as variables reflecting the state of your warehouse. These variables can then be used to power a check.

Writing Your First Report

Let’s write and modify a report, starting with one of the sdf samples.

1

Create a new SDF Project

sdf new --sample pii_saas_platform

    Created pii_saas_platform/.gitignore     Created pii_saas_platform/checks/no_pii_in_external.sql     Created pii_saas_platform/classification/taxonomy.sdf.yml     Created pii_saas_platform/classification/users.sdf.yml     Created pii_saas_platform/ddls/payment/public/invoices.sql     Created pii_saas_platform/ddls/payment/public/organizations.sql     Created pii_saas_platform/ddls/payment/public/users.sql     Created pii_saas_platform/models/external/invoice_stats.sql     Created pii_saas_platform/models/external/org_invoice_stats.sql     Created pii_saas_platform/models/internal/avg_invoice_amt.sql     Created pii_saas_platform/models/internal/invoice_payment_delay.sql     Created pii_saas_platform/models/internal/mau_per_org.sql     Created pii_saas_platform/models/internal/most_frequent_payer.sql     Created pii_saas_platform/models/internal/total_revenue_per_org.sql     Created pii_saas_platform/models/internal/users_per_domain.sql     Created pii_saas_platform/models/internal/users_per_org.sql     Created pii_saas_platform/reports/tables_with_pii.sql     Created pii_saas_platform/workspace.sdf.yml           Welcome to your new SDF Workspace! To help you on your journey:                  💡 Join the SDF Community Slack -> https://sdf.com/join                   📚 Read the Docs to Get Started -> https://docs.sdf.com/             Finished new in 0.279 secs

This sample contains a report and a workspace modeled after a simple SAAS platform.

Notice the report in the reports directory. This report simply lists all tables with PII. It’s configured in the workspace.sdf.yml file with includes type report like so:

workspace:
  name: hello
  ...
  includes:
    ...
    - path: reports
      type: report

The report itself lists all tables with PII by fetching all columns with classifiers that have PII. This is done with the following SQL:

SELECT 
  t.table_id, 
  t.description, 
  t.dialect
FROM 
  sdf.information_schema.tables t
JOIN 
  sdf.information_schema.columns c ON t.table_id = c.table_id
WHERE CONTAINS_ARRAY_VARCHAR(c.classifiers, 'PII')
GROUP BY 1,2,3;

Both checks and reports must be written in the Trino SQL dialect.

This report will show us tables that contain PII only because we defined a classifier like so:

    classifier:
    name: PII
    labels: 
      - name: NAME
      - name: email
      - name: phone

As such, this report will show tables that contain columns with any instance of the PII classifier (i.e. PII.name, PII.email, or PII.phone).

SDF stores the classifiers on a column as a string. As such, we can use the SQL LIKE operator to search for a substring.

2

Run & Modify a Report

sdf report --show all

Working set 12 model files, 3 .sdf files   Compiling payment.public.users (./ddls/payment/public/users.sql)   Compiling payment.public.organizations (./ddls/payment/public/organizations.sql)   Compiling payment.public.invoices (./ddls/payment/public/invoices.sql)   Compiling transformations.external.org_invoice_stats (./models/external/org_invoice_stats.sql)   Compiling transformations.external.invoice_stats (./models/external/invoice_stats.sql)   Compiling transformations.internal.mau_per_org (./models/internal/mau_per_org.sql)   Compiling transformations.internal.avg_invoice_amt (./models/internal/avg_invoice_amt.sql)   Compiling transformations.internal.invoice_payment_delay (./models/internal/invoice_payment_delay.sql)   Compiling transformations.internal.most_frequent_payer (./models/internal/most_frequent_payer.sql)   Compiling transformations.internal.total_revenue_per_org (./models/internal/total_revenue_per_org.sql)   Compiling transformations.internal.users_per_domain (./models/internal/users_per_domain.sql)   Compiling transformations.internal.users_per_org (./models/internal/users_per_org.sql) Working set 1 report file, 1 .sdf file   Reporting sdf.reports.tables_with_pii (./reports/tables_with_pii.sql)    Finished 12 models [12 succeeded], 1 report [1 succeeded] in 1.968 secs

Report sdf.reports.TABLES_WITH_PII ┌──────────────────────────────────────────────┬─────────────────────────┬───────────┐ │ table_id                                     ┆ description             ┆ dialect   │ ╞══════════════════════════════════════════════╪═════════════════════════╪═══════════╡ │ TRANSFORMATIONS.INTERNAL.MOST_FREQUENT_PAYER ┆                         ┆ snowflake │ │ payment.public.USERS                         ┆ DDL for the users table ┆ snowflake │ │ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS       ┆                         ┆ snowflake │ │ TRANSFORMATIONS.INTERNAL.USERS_PER_DOMAIN    ┆                         ┆ snowflake │ └──────────────────────────────────────────────┴─────────────────────────┴───────────┘ 4 rows.

As you can see from the output, it looks we have three tables containing columns with PII. Let’s say we also wanted to show the names of the columns that contain PII. We can do this by modifying the SQL query to include the column name like so:

SELECT 
  t.table_id, 
  c.column_name,
  t.description, 
  t.dialect
FROM 
  sdf.information_schema.tables t
JOIN 
  sdf.information_schema.columns c ON t.table_id = c.table_id
WHERE CONTAINS_ARRAY_VARCHAR(c.classifiers, 'PII')
GROUP BY 2,1,3,4;

Now, when we run the report, we can see the column names as well:

sdf report --show all

Working set 12 model files, 3 .sdf files Working set 1 report file, 1 .sdf file   Reporting sdf.reports.tables_with_pii (./reports/tables_with_pii.sql)    Finished 12 models [12 reused], 1 report [1 succeeded] in 1.740 secs

Report sdf.reports.TABLES_WITH_PII ┌──────────────────────────────────────────────┬──────────────┬─────────────────────────┬───────────┐ │ table_id                                     ┆ column_name  ┆ description             ┆ dialect   │ ╞══════════════════════════════════════════════╪══════════════╪═════════════════════════╪═══════════╡ │ TRANSFORMATIONS.INTERNAL.MOST_FREQUENT_PAYER ┆ EMAIL        ┆                         ┆ snowflake │ │ payment.public.USERS                         ┆ NAME         ┆ DDL for the users table ┆ snowflake │ │ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS       ┆ NAME         ┆                         ┆ snowflake │ │ TRANSFORMATIONS.INTERNAL.USERS_PER_DOMAIN    ┆ EMAIL_DOMAIN ┆                         ┆ snowflake │ │ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS       ┆ EMAIL        ┆                         ┆ snowflake │ │ payment.public.USERS                         ┆ EMAIL        ┆ DDL for the users table ┆ snowflake │ │ payment.public.USERS                         ┆ PHONE        ┆ DDL for the users table ┆ snowflake │ └──────────────────────────────────────────────┴──────────────┴─────────────────────────┴───────────┘ 7 rows.

3

Add a New Report

Let’s say we’d like to extend our analysis a bit but still maintain the original report. We can do this by adding a new report to the reports directory. Let’s add a new report to find the most frequently used datatypes in columns with PII. We can do this with the following SQL:

pii_datatypes.sql
  SELECT c.datatype, COUNT(*) as frequency
  FROM sdf.information_schema.columns c
  WHERE CONTAINS_ARRAY_VARCHAR(c.classifiers, 'PII')
  GROUP BY c.datatype 
  ORDER BY frequency DESC;

Now, when we run the report, we can see the most frequently used datatypes in columns with PII:

sdf report --show all

Working set 12 model files, 3 .sdf files Working set 2 report files, 1 .sdf file   Reporting sdf.reports.pii_datatypes (./reports/pii_datatypes.sql)    Finished 12 models [12 reused], 2 reports [1 succeeded, 1 reused] in 1.720 secs

Report sdf.reports.TABLES_WITH_PII ┌──────────────────────────────────────────────┬──────────────┬─────────────────────────┬───────────┐ │ table_id                                     ┆ column_name  ┆ description             ┆ dialect   │ ╞══════════════════════════════════════════════╪══════════════╪═════════════════════════╪═══════════╡ │ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS       ┆ EMAIL        ┆                         ┆ snowflake │ │ payment.public.USERS                         ┆ EMAIL        ┆ DDL for the users table ┆ snowflake │ │ payment.public.USERS                         ┆ PHONE        ┆ DDL for the users table ┆ snowflake │ │ TRANSFORMATIONS.INTERNAL.MOST_FREQUENT_PAYER ┆ EMAIL        ┆                         ┆ snowflake │ │ payment.public.USERS                         ┆ NAME         ┆ DDL for the users table ┆ snowflake │ │ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS       ┆ NAME         ┆                         ┆ snowflake │ │ TRANSFORMATIONS.INTERNAL.USERS_PER_DOMAIN    ┆ EMAIL_DOMAIN ┆                         ┆ snowflake │ └──────────────────────────────────────────────┴──────────────┴─────────────────────────┴───────────┘ 7 rows.

Report sdf.reports.PII_DATATYPES ┌──────────┬───────────┐ │ datatype ┆ frequency │ ╞══════════╪═══════════╡ │ varchar  ┆ 7         │ └──────────┴───────────┘ 1 rows.

4

Save the Output (Optional)

We can save the output of a report to a file by using the --format flag. Let’s save the output of our reports to a file called sdf_reports.csv:

sdf report --format csv --show result > sdf_reports.csv

Note that we used --show result to only show the result of the report, not any output regarding the execution of the report. In a scenario where you’d want to programatically generate valid CSVs or JSONs of your reports, this is necessary to avoid any extraneous output. See our input and output guide for more on this.

Advanced Reports

  • To learn more about transforming classifiers across functions, see: UDFs
  • To learn more about the SDF internal information schema, see: Information Schema