SDF can be used to fully automate data deletion and retention policies for GDPR compliance.
Our approach assumes the data warehouse is updated daily. SDF introduces three features that are central to the proposed automation: classifiers, scheduled (recurring) tables with their partitions, and time grains. We introduce each of these three SDF concepts in turn.
SDF distinguishes between Table classifiers and Column classifiers.
We define a column classifier PII with a single label tag to mark columns that contain PII. PII labels are automatically propagated following the flow of information through the warehouse. Users can of course refine PII labels, for instance distinguishing user_id, phone, email, full_name, etc., but for our purposes a single tag suffices.
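As a sketch, such a classifier definition might live in an SDF YAML file along the following lines (the exact schema is an assumption and may differ across SDF versions):

```yaml
# Column classifier PII with a single label.
classifier:
  name: PII
  labels:
    - name: tag
```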
We define two table classifiers. The first, TableContent, has the labels machine_data, static_data, and user_data. The machine_data and static_data labels flow automatically along the information flow between tables; user_data marks any table containing a column labeled PII. Legal may require that certain tables be kept for a specified number of years. That's why we also introduce a second table classifier, LegalHold, with the single label is_true. LegalHold is a static table classifier, meaning that the metadata attached to a particular table does not propagate downstream.
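A hedged sketch of the two definitions; the propagate flag used to mark LegalHold as static is an assumption about the classifier schema:

```yaml
# Table classifier with three content labels; labels flow downstream.
classifier:
  name: TableContent
  labels:
    - name: machine_data
    - name: static_data
    - name: user_data
---
# Static table classifier: attached metadata does not propagate.
classifier:
  name: LegalHold
  propagate: false   # assumed flag; static classifiers stay on their table
  labels:
    - name: is_true
```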
We classify all root tables as either machine, static, or user data. A table that is classified as user data must have at least one column labeled as PII. Conversely, we can write a check to ensure that any table containing PII also carries the user_data table classifier.
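A minimal sketch of that check, assuming the columns and tables information schemas expose attached labels in a classifiers string column (the catalog and column names are assumptions):

```sql
-- Tables with a PII-labeled column that are missing the
-- TableContent.user_data classifier: policy violations.
SELECT DISTINCT c.table_name
FROM sdf.information_schema.columns c
JOIN sdf.information_schema.tables t
  ON t.table_name = c.table_name
WHERE c.classifiers LIKE '%PII.tag%'
  AND t.classifiers NOT LIKE '%TableContent.user_data%';
```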
For example, consider a daily user event table written in SQL together with its companion SDF block carrying the classifiers TableContent.user_data and PII.tag; both are sketched below.
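A hedged reconstruction; the table and column names (user_events, user_id, event_time, action) are illustrative, and the exact shape of the SDF block is an assumption:

```sql
-- Daily root table of user events; @date selects one day of raw data.
CREATE TABLE user_events AS
SELECT user_id, event_time, action
FROM raw.events
WHERE DATE(event_time) = @date;
```

```yaml
# Companion SDF block: the table holds user data, and user_id is PII.
table:
  name: user_events
  classifiers:
    - TableContent.user_data
  columns:
    - name: user_id
      classifiers:
        - PII.tag
```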
A machine event table might have a very similar structure, except that it has no PII columns and its table classifier is TableContent.machine_data.
We want to make sure that every root table is classified as machine, static, or user data. For that check we can rely on SDF's tables information schema; its essence, together with a query that finds all root tables carrying no TableContent label (and thus violating our policy), is sketched below.
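A minimal sketch, assuming the tables information schema carries at least a table name, a root-table flag, the schedule, and the attached classifiers (these column names are assumptions):

```sql
-- Assumed essence of the tables information schema:
--   table_name   VARCHAR
--   is_root      BOOLEAN   -- true for tables with no upstream tables
--   schedule     VARCHAR   -- e.g. '+1d'
--   classifiers  VARCHAR   -- e.g. 'TableContent.user_data'

-- Root tables with no TableContent label: policy violations.
SELECT table_name
FROM sdf.information_schema.tables
WHERE is_root
  AND classifiers NOT LIKE '%TableContent%';
```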
All SDF queries are parameterized by an @date parameter. Evaluating a query for a particular date writes a table partition for that date, and every materialized partition is recorded in SDF's partitions information schema.
SDF distinguishes between ad hoc and recurring tables. Ad hoc queries have a default @date value of now(). So if an ad hoc query catalog.schema.table runs for a given @date, we insert a record like the one sketched below into partitions.
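A hedged sketch of that record; the column names of the partitions information schema (table_name, date, written_at) are assumptions:

```sql
-- Register the freshly materialized partition.
INSERT INTO sdf.information_schema.partitions (table_name, date, written_at)
VALUES ('catalog.schema.table', @date, now());
```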
Each recurring query runs on a schedule. While the schedule can be an arbitrary cron expression, here we consider a simpler language with labels of the form [+-]\d+[dwmy]: a daily schedule is +1d, a weekly schedule +7d, a yearly schedule +1y, etc.
Every recurring table has a schedule field. For instance, for a daily root table of user events and for a monthly accumulating table, we would declare the schedules sketched below.
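A hedged sketch of the two declarations; the placement of the schedule field inside the SDF block is an assumption:

```yaml
# Daily root table of user events.
table:
  name: user_events
  schedule: +1d
---
# Monthly accumulating table.
table:
  name: monthly_user_events
  schedule: +1m
```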
SDF also stores a recurring query under its date parameter, but now not just any date can be passed, only those that fit the schedule.
Example: let's run a monthly accumulation of daily events, where the partition date names the last day the partition covers. We run it at midnight on 2022/01/02 with an @date value of 2022/01/01, and so we write monthly_user_events for date 2022/01/01.
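A sketch of such an accumulation, assuming @date names the last covered day and that daily partitions of user_events are addressable through a date column (an assumption about the storage layout):

```sql
-- One month of daily user events, accumulated into the partition
-- whose date is the last day it covers (@date).
CREATE TABLE monthly_user_events AS
SELECT user_id, COUNT(*) AS event_count
FROM user_events
WHERE date > @date - INTERVAL '1' MONTH
  AND date <= @date
GROUP BY user_id;
```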
The monthly user events table captures the data for one month. But often the ingestion frequency and the time range that a query captures don't align. For instance, an "L7 daily active users" metric counts the number of unique users who have interacted with a service over the course of the last seven days. So while the metric has a time range of a week, it is calculated daily.
We introduce a new table classifier, TimeGrain, to capture the fact that the accumulated data spans a week. TimeGrain propagates automatically, and its default label is set to the table's schedule label.
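A hedged sketch of the L7 table: it runs daily but each partition covers the trailing seven days, so its TimeGrain label is +7d rather than the +1d default (field names are again assumptions):

```yaml
table:
  name: l7_daily_active_users
  schedule: +1d       # computed every day
  time-grain: +7d     # each partition covers the trailing seven days
```

```sql
-- Unique users active during the seven days ending at @date.
CREATE TABLE l7_daily_active_users AS
SELECT COUNT(DISTINCT user_id) AS active_users
FROM user_events
WHERE date > @date - INTERVAL '7' DAY
  AND date <= @date;
```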
In SDF we implement policies with checks against the information schema. Recall that we have three information schemas: tables, columns, and partitions.
A simple GDPR-compliant policy might say: every partition containing user data is deleted no later than 90 days after its partition date, unless its table is under LegalHold. In addition, we want to honor individual user deletion requests. Let's start the policy evaluation with user deletion, followed by partition deletion.
Deletion of users. We distinguish three cases.
Deletion of partitions: here is a SQL snippet for a report that, for a given @date, lists all partitions that should be deleted because their retention period has expired.
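A minimal sketch under the same schema assumptions as above (classifiers exposed as a string column, 90-day retention):

```sql
-- Partitions of user-data tables that are older than 90 days and not
-- under legal hold: these should be deleted.
SELECT p.table_name, p.date
FROM sdf.information_schema.partitions p
JOIN sdf.information_schema.tables t
  ON t.table_name = p.table_name
WHERE t.classifiers LIKE '%TableContent.user_data%'
  AND t.classifiers NOT LIKE '%LegalHold.is_true%'
  AND p.date < @date - INTERVAL '90' DAY;
```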
Is this policy good enough? It is close, but not perfect. Here are two small issues we can address.
**Schedule period > deletion period.** Suppose you have a yearly query YQ capturing user data. According to the policy above, YQ's partitions will be deleted after 90 days, after which the table is of no further use; such a table makes no sense. Let's introduce a check, sketched below, that makes sure all user tables are rerun within 90 days (the UDF as_interval takes a [+-]\d+[dwmy] string and casts it to an interval).
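A sketch of the check, reusing the assumed schedule column of the tables information schema:

```sql
-- User-data tables whose schedule exceeds the 90-day retention period:
-- their partitions would expire before the table is ever recomputed.
SELECT table_name
FROM sdf.information_schema.tables
WHERE classifiers LIKE '%TableContent.user_data%'
  AND as_interval(schedule) > INTERVAL '90' DAY;
```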
**Time grain greater than +1d.** A table with a time grain greater than one day captures past events. Suppose you have the L7 daily user action table, whose time grain is +7d, with partitions for 2023-01-01, 2023-01-02, and so on. The L7 partition for 2023-01-10 captures all data from 2023-01-04 to 2023-01-10. Now suppose a user leaves on 2023-01-05: the user is only fully gone from partitions dated 2023-01-12 onward. So if we delete L7 partitions only 90 days after their partition date, we may still hold PII of a user whom we promised to delete after 90 days.
There are two ways to deal with time grains greater than one day. First, simply adapt the query above and take the time grain into account in addition to the policy value, e.g. as sketched below.
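A sketch that shortens each table's retention by its time grain, assuming a time_grain column in the tables information schema:

```sql
-- A partition dated D with time grain g contains events as old as
-- D - g + 1 day, so it must be deleted no later than D + 90 days - g.
SELECT p.table_name, p.date
FROM sdf.information_schema.partitions p
JOIN sdf.information_schema.tables t
  ON t.table_name = p.table_name
WHERE t.classifiers LIKE '%TableContent.user_data%'
  AND t.classifiers NOT LIKE '%LegalHold.is_true%'
  AND p.date < @date - (INTERVAL '90' DAY - as_interval(t.time_grain));
```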
An alternative approach is to simply require that no table has a schedule (and hence a default time grain) greater than 45 days, and to delete partitions 45 days after their partition date: since 45 + 45 = 90, this guarantees that after 90 days a user's data is automatically deleted.