Retention and Data Deletion with SDF
Overview of Approach
Our approach assumes the data warehoue is updated daily and contains the following main components for automation:- Classifiers. We have to categorize data, their purpose and retention period.
- Date ranges and partitions. We have to know which table partition captures what data range.
- User state. We have to know whether is a user is still active or left the service.
- Policy evaluation. On any change to the warehouse or at least once a day, deletion policies must be upheld and enforced.
Simple Version
SDF introduces three features which are central to the proposed automation.- SDF introduces classifiers and classifier propagation. This builds the foundation for having a perfect record of data categories and purpose for all data assets in the warehouse.
- SDF introduces a table and column information schema that captures the result of aforementioned annotation and propagation. In addition SDF has a partition schema, keeping a record of each table and its landing time.
- SDF introduces Reports and Checks, e.g. queries that run against above information schema. A check might for example be that all tables specify their purpose.
A report, for example, can represent the partitions that have expired and thus should be deleted.
Classifiers
SDF distinguishes between Table classifiers and Column classifiers. We define the column classifier PII with a single label tag to label columns that have PII. PII label are automatically propagated following the flow of information through the warehouse. Of course users can refine PII labels, for instance specify that data is user_id, phone, email, full_name, etc. But here it is sufficient to just have a tag.LegalHold
with label is_true. LegalHold
is a static table classifier, that the metadata which is attached to a particular table does not propagate.
user_data
table classifier.
For example, here is a daily user event table in SQL
Writing our First Check
We want to make sure that root tables are either machine data, static data. For that check we can rely on SDF tables information schema, whcih quintessence isAd-hoc and Recurring Queries, Dates, Date Partitions and Time Grains
All SDF queries are parameterized by an @date parameter. Evaluating a query for a particular date write a table (partition) for that date. Every materialized table(partition) is stored in the SDF’s table partition information schema.- A daily query of @date yyyy-mm-dd is run as soon as all the data of the day yyyy-mm-dd has landed
- A monthly query of @date yyyy-mm-01, where01 is constant, is run immediately after all the data of the month yyyy-mm has landed
- A yearly query of @date yyyy-01-01 is run immediately after all the data of the year yyyy has landed, etc
Time Grains
The monthly user events captured the data for one month. But often the ingestion frequency and time range that a query captures don’t align. For instance an “L7 daily active users” counts the number of unique users who have interacted with a service over the course of the last seven days. So while the metric has a time range of a week, it is calculated daily. We introduce a new table qualifier TimeGrain to capture the fact that the accumulated data is for a week.TimeGrain
propagates automatically. The default TimeGrain
label is set to the table’s schedule label.
Policy
In SDF we implement policies with checks against the information schema. Recall that we have three information schema: tables, columns, and partitions. A simple GDPR compliant policy might say:- Retain machine data for 7 days
- Retain user data for 90 days
- Retain unlabeled data for 360 days
- Retain legal hold for 7 years
- Retain static data (like country code mappings) indefinitely
- Guarantee that users that left the service are deleted after 90 days
- Implicit deletion for recurring daily time grain tables. If a user U quits the service at day 2022/12/01, there will be no more interactions from U after 2022/12/01, so if all user data partitions are deleted after 90 days, users U last trace of the system will be deleted on 2022/12/01 +90d.
- Explicit handling needed for recurring non daily time grain tables. Suppose we have an L7 daily active users table. This table keeps track of all users who were active during this period. We can handle deletion for time-grain tables in two ways:
- To implicit delete a user from this table we add a new column last_seen: date. When we join at @date a daily active user table with the accumulating L7 table, we distinguish for each user U:
- if U is in the daily table then keep the user with last_seen @date,
- if U wasn’t in the daily table, then keep the user last_seen > @date-timegrain,
- otherwise drop the user in the new table.
- To implicit delete a user from this table we add a new column last_seen: date. When we join at @date a daily active user table with the accumulating L7 table, we distinguish for each user U:
- Implicit handling for ad hoc tables. Every ad hoc table behaves like a daily recurring table. Traces of users are deleted as soon as the table is deleted.