Reports
Understanding Reports
Reports are a useful way to track & analyze datatypes, metadata, and business logic throughout your data warehouse. Reports are expressed as SQL queries against the SDF Information Schema, similar to Checks. Since reports are generated with Static Analysis, executing them is fast. Reports are simply small SQL queries that run against the information schema and enable you to analyze your warehouse and metadata. Some examples of effective reports are:
- Analyze Privacy Compliance: Get a list of all tables and columns that contain PII
- Create a Data Deletion Report: Improve your data deletion capabilities by tracking retention and writing a report to find all tables that contain PII and are past their retention period.
- Run a Dead Column Analysis: Optimize your warehouse by computing a list of all columns that have no dependencies, then remove them to save on storage and compute costs.
A report is very similar to a check. A check is a specialized report where the expected result of the report is empty. In this way, reports can be thought of as variables reflecting the state of your warehouse. These variables can then be used to power a check.
Writing Your First Report
Let’s write and modify a report, starting with one of the sdf samples.
Create a new SDF Project
Created pii_saas_platform/.gitignore
Created pii_saas_platform/checks/no_pii_in_external.sql
Created pii_saas_platform/classification/taxonomy.sdf.yml
Created pii_saas_platform/classification/users.sdf.yml
Created pii_saas_platform/ddls/payment/public/invoices.sql
Created pii_saas_platform/ddls/payment/public/organizations.sql
Created pii_saas_platform/ddls/payment/public/users.sql
Created pii_saas_platform/models/external/invoice_stats.sql
Created pii_saas_platform/models/external/org_invoice_stats.sql
Created pii_saas_platform/models/internal/avg_invoice_amt.sql
Created pii_saas_platform/models/internal/invoice_payment_delay.sql
Created pii_saas_platform/models/internal/mau_per_org.sql
Created pii_saas_platform/models/internal/most_frequent_payer.sql
Created pii_saas_platform/models/internal/total_revenue_per_org.sql
Created pii_saas_platform/models/internal/users_per_domain.sql
Created pii_saas_platform/models/internal/users_per_org.sql
Created pii_saas_platform/reports/tables_with_pii.sql
Created pii_saas_platform/workspace.sdf.yml
Welcome to your new SDF Workspace! To help you on your journey:
💡 Join the SDF Community Slack -> https://sdf.com/join
📚 Read the Docs to Get Started -> https://docs.sdf.com/
Finished new in 0.279 secs
This sample contains a report and a workspace modeled after a simple SAAS platform.
Notice the report in the reports
directory. This report simply lists all tables with PII. It’s configured in the workspace.sdf.yml
file with includes type report
like so:
The report itself lists all tables with PII by fetching all columns with classifiers that have PII
. This is done with the following SQL:
Both checks and reports must be written in the Trino SQL dialect.
This report will show us tables that contain PII only because we defined a classifier like so:
As such, this report will show tables that contain columns with any instance of the PII classifier (i.e. PII.name
, PII.email
, or PII.phone
).
SDF stores the classifiers
on a column as a string. As such, we can use the SQL LIKE
operator to search for a substring.
Run & Modify a Report
Working set 12 model files, 3 .sdf files
Compiling payment.public.users (./ddls/payment/public/users.sql)
Compiling payment.public.organizations (./ddls/payment/public/organizations.sql)
Compiling payment.public.invoices (./ddls/payment/public/invoices.sql)
Compiling transformations.external.org_invoice_stats (./models/external/org_invoice_stats.sql)
Compiling transformations.external.invoice_stats (./models/external/invoice_stats.sql)
Compiling transformations.internal.mau_per_org (./models/internal/mau_per_org.sql)
Compiling transformations.internal.avg_invoice_amt (./models/internal/avg_invoice_amt.sql)
Compiling transformations.internal.invoice_payment_delay (./models/internal/invoice_payment_delay.sql)
Compiling transformations.internal.most_frequent_payer (./models/internal/most_frequent_payer.sql)
Compiling transformations.internal.total_revenue_per_org (./models/internal/total_revenue_per_org.sql)
Compiling transformations.internal.users_per_domain (./models/internal/users_per_domain.sql)
Compiling transformations.internal.users_per_org (./models/internal/users_per_org.sql)
Working set 1 report file, 1 .sdf file
Reporting sdf.reports.tables_with_pii (./reports/tables_with_pii.sql)
Finished 12 models [12 succeeded], 1 report [1 succeeded] in 1.968 secs
Report sdf.reports.TABLES_WITH_PII
┌──────────────────────────────────────────────┬─────────────────────────┬───────────┐
│ table_id ┆ description ┆ dialect │
╞══════════════════════════════════════════════╪═════════════════════════╪═══════════╡
│ TRANSFORMATIONS.INTERNAL.MOST_FREQUENT_PAYER ┆ ┆ snowflake │
│ payment.public.USERS ┆ DDL for the users table ┆ snowflake │
│ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS ┆ ┆ snowflake │
│ TRANSFORMATIONS.INTERNAL.USERS_PER_DOMAIN ┆ ┆ snowflake │
└──────────────────────────────────────────────┴─────────────────────────┴───────────┘
4 rows.
As you can see from the output, it looks we have three tables containing columns with PII. Let’s say we also wanted to show the names of the columns that contain PII. We can do this by modifying the SQL query to include the column name like so:
Now, when we run the report, we can see the column names as well:
Working set 12 model files, 3 .sdf files
Working set 1 report file, 1 .sdf file
Reporting sdf.reports.tables_with_pii (./reports/tables_with_pii.sql)
Finished 12 models [12 reused], 1 report [1 succeeded] in 1.740 secs
Report sdf.reports.TABLES_WITH_PII
┌──────────────────────────────────────────────┬──────────────┬─────────────────────────┬───────────┐
│ table_id ┆ column_name ┆ description ┆ dialect │
╞══════════════════════════════════════════════╪══════════════╪═════════════════════════╪═══════════╡
│ TRANSFORMATIONS.INTERNAL.MOST_FREQUENT_PAYER ┆ EMAIL ┆ ┆ snowflake │
│ payment.public.USERS ┆ NAME ┆ DDL for the users table ┆ snowflake │
│ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS ┆ NAME ┆ ┆ snowflake │
│ TRANSFORMATIONS.INTERNAL.USERS_PER_DOMAIN ┆ EMAIL_DOMAIN ┆ ┆ snowflake │
│ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS ┆ EMAIL ┆ ┆ snowflake │
│ payment.public.USERS ┆ EMAIL ┆ DDL for the users table ┆ snowflake │
│ payment.public.USERS ┆ PHONE ┆ DDL for the users table ┆ snowflake │
└──────────────────────────────────────────────┴──────────────┴─────────────────────────┴───────────┘
7 rows.
Add a New Report
Let’s say we’d like to extend our analysis a bit but still maintain the original report. We can do this by adding a new report to the reports
directory. Let’s add a new report to find the most frequently used datatypes in columns with PII. We can do this with the following SQL:
Now, when we run the report, we can see the most frequently used datatypes in columns with PII:
Working set 12 model files, 3 .sdf files
Working set 2 report files, 1 .sdf file
Reporting sdf.reports.pii_datatypes (./reports/pii_datatypes.sql)
Finished 12 models [12 reused], 2 reports [1 succeeded, 1 reused] in 1.720 secs
Report sdf.reports.TABLES_WITH_PII
┌──────────────────────────────────────────────┬──────────────┬─────────────────────────┬───────────┐
│ table_id ┆ column_name ┆ description ┆ dialect │
╞══════════════════════════════════════════════╪══════════════╪═════════════════════════╪═══════════╡
│ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS ┆ EMAIL ┆ ┆ snowflake │
│ payment.public.USERS ┆ EMAIL ┆ DDL for the users table ┆ snowflake │
│ payment.public.USERS ┆ PHONE ┆ DDL for the users table ┆ snowflake │
│ TRANSFORMATIONS.INTERNAL.MOST_FREQUENT_PAYER ┆ EMAIL ┆ ┆ snowflake │
│ payment.public.USERS ┆ NAME ┆ DDL for the users table ┆ snowflake │
│ TRANSFORMATIONS.EXTERNAL.INVOICE_STATS ┆ NAME ┆ ┆ snowflake │
│ TRANSFORMATIONS.INTERNAL.USERS_PER_DOMAIN ┆ EMAIL_DOMAIN ┆ ┆ snowflake │
└──────────────────────────────────────────────┴──────────────┴─────────────────────────┴───────────┘
7 rows.
Report sdf.reports.PII_DATATYPES
┌──────────┬───────────┐
│ datatype ┆ frequency │
╞══════════╪═══════════╡
│ varchar ┆ 7 │
└──────────┴───────────┘
1 rows.
Save the Output (Optional)
We can save the output of a report to a file by using the --format
flag. Let’s save the output of our reports to a file called sdf_reports.csv
:
Note that we used --show result
to only show the result of the report, not any output regarding the execution of the report. In a scenario where you’d want to programatically generate valid CSVs or JSONs of your reports, this is necessary to avoid any extraneous output. See our input and output guide for more on this.
Advanced Reports
- To learn more about transforming classifiers across functions, see: UDFs
- To learn more about the SDF internal information schema, see: Information Schema