Chapter 5 — Compliance Day 1

The Conflict

GDPR says delete everything. Data Vault says delete nothing. Both are right.


GDPR says delete everything. Data Vault says delete nothing. One is law, the other is an architectural principle, and both are right. Resolving this conflict is one of the hardest problems in data engineering.

Language Bridge — Compliance

Compliance — from Latin complēre: “to fill up, to fulfill, to complete.”

Same root as complete (fulfilled, nothing missing) and complement (what fills a gap).

Why this word: Compliance means fulfilling an obligation completely. Not partially, not approximately. When GDPR says “delete,” compliance means actually deleted — from every layer, every backup, every mart. The word itself tells you: half-measures don’t count.

Language Bridge — Erasure

Erasure — from Latin erasura, from eradere: “to scrape off.” In Roman times, writing was done on wax tablets. To erase, you scraped the wax flat.

Why this word: Article 17 is titled “right to erasure,” and the word fits, because erasure implies thoroughness. You don’t just remove the record; you scrape away all traces. In a data warehouse with 6+ layers, that scraping reaches deep.

4-Layer Explanation — GDPR Core
L1 Plain language — for anyone

GDPR is a European law that gives people control over their personal data. If someone asks a company to delete their data, the company has to do it.

L2 With mechanism — how it works

GDPR (General Data Protection Regulation) applies to organizations established in the EU and to any organization that offers goods or services to, or monitors, people in the EU (Art. 3). It grants specific rights (access, erasure, portability, rectification) and imposes obligations: purpose limitation, data minimization, accountability. It has applied since 25 May 2018. Fines reach up to EUR 20M or 4% of global annual turnover, whichever is higher, for the most serious violations (Art. 83(5)), and up to EUR 10M or 2% for other violations (Art. 83(4)).

L3 Technical — full detail

GDPR was designed for transactional databases where you find a row and delete it. Data warehouses keep everything forever for analytical purposes. These two design philosophies are in direct conflict. The regulation doesn’t distinguish between an operational database and a data warehouse — Article 17 applies to both. This means right to erasure requires a multi-layer deletion: staging, Raw Vault, Business Vault, Information Marts, and backups.

L4 Expert — nuance & edge cases

GDPR is a regulation, directly applicable in all EU member states, but its opening clauses leave member states room for national rules, and enforcement is national. Germany has 18 supervisory authorities (1 federal plus 17 at state level, because Bavaria has separate authorities for the public and private sectors) and stricter employee data protection (§26 BDSG). France has CNIL, which has issued some of the largest data protection fines in Europe. A Data Vault serving clients in both Germany and France may need different handling depending on the national authority’s guidance. A consultant with GDPR transition experience under CNIL's regime brings directly applicable knowledge to French-market clients.

Key GDPR Rights

Four data subject rights and two processing principles that directly affect how you design and operate a Data Vault:

GDPR Rights & DV Implications

| Right | Article | What It Means | DV Implication |
| --- | --- | --- | --- |
| Right to erasure | Art. 17 | Delete my personal data | Multi-layer deletion across all DV layers |
| Right of access (DSAR) | Art. 15 | Show me all data you hold about me | JOINs from Hub through all Satellites |
| Right to rectification | Art. 16 | Fix incorrect data about me | Append-only Satellite: insert corrected row |
| Right to data portability | Art. 20 | Give me my data in machine-readable format | Export from vault in structured format |
| Purpose limitation | Art. 5(1)(b) | Data collected for one purpose can’t be freely repurposed | Tracked via record_source and metadata |
| Data minimization | Art. 5(1)(c) | Only collect what’s necessary | Tension with DV’s “load everything” philosophy |

Pseudonymization ≠ Anonymization

Pseudonymization: Replace identifying info with a code. Person can still be re-identified if you have the mapping key. Data is still personal data under GDPR. Still protected.

Anonymization: Remove all identifying info so that re-identification is no longer possible by any means reasonably likely to be used (Recital 26). Only truly anonymized data falls outside GDPR’s scope.

Hash keys in Data Vault are pseudonymization, not anonymization. If the business key is an email address, anyone who has the original email and knows the hash algorithm can recompute the hash and match it to the retained key. Retaining the hash after deleting the business key does NOT achieve GDPR compliance.
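
A minimal sketch of why a retained hash key is still personal data. It assumes the hash key is an MD5 digest of the trimmed, upper-cased business key; that convention is an assumption made for illustration, and the same reasoning holds for any deterministic hash.

```python
import hashlib


def hash_key(business_key: str) -> str:
    """Derive a Hub hash key from a business key (assumed: MD5 of trimmed, upper-cased text)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Hash key still stored in the vault after the business key column was deleted.
retained_hash = hash_key("nazeem@whiterun.gov")

# Anyone who knows the email and the hashing convention can recompute the key.
candidate = hash_key(" Nazeem@Whiterun.gov ")
reidentified = candidate == retained_hash
print(reidentified)
```

Because the function is deterministic, the comparison succeeds, which is exactly the re-identification path that keeps the hash inside GDPR's scope.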

PII Satellite Isolation

Not all attributes are created equal. Some are personal data protected by law. Others are analytical data safe to keep forever. The trick: put them in different Satellites.

Language Bridge — Isolation

Isolation — from Latin insula: “island.” To isolate is to make something into an island — separated, surrounded, reachable only by deliberate crossing.

Why this word: PII Satellite isolation makes personal data into an island. It’s physically separated from the analytical data. When a deletion request arrives, you delete the island without touching the mainland.

4-Layer Explanation — PII Isolation
L1 Plain language

Instead of putting all of a character’s details in one Satellite table, you split it: personal stuff (name, email, address) goes in one Satellite, analytical stuff (race, level, faction) goes in another. When someone says “delete my personal data,” you delete the personal Satellite and leave the analytical one untouched.

L2 With mechanism

This is a specific case of the Satellite splitting you already know. In Day 2, you learned Satellites split by rate of change or by source system. Here’s a third reason: by data classification. PII attributes get their own Satellite. Both Satellites still connect to the same Hub via the same hash key. The split is physical, not logical.

L3 Technical

The PII Satellite (Sat_Character_PII) contains attributes classified as personal data: name, email, address, date of birth. The non-PII Satellite (Sat_Character_Attributes) contains analytical attributes: race, level, faction, guild membership. Both share the same hub_character_hk. On deletion: DELETE FROM Sat_Character_PII WHERE hub_character_hk = [target]. The non-PII Satellite is untouched.

L4 Expert nuance

The classification decision — which attributes are PII and which aren’t — is not purely technical. It requires legal judgment. “Race” might seem like analytical data, but under GDPR Article 9, racial/ethnic origin is special category data with even stricter protections. PII classification should be done with the Data Protection Officer, not by the data engineer alone. This is exactly where a BI consultant with legal background adds value.

The Skyrim Model — Before & After

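Before: Sat_Character_Details (all together)

| Column | Example Value | Classification |
| --- | --- | --- |
| hub_character_hk | a3f7... | |
| real_name | Nazeem | PII |
| email_address | nazeem@whiterun.gov | PII |
| race | Redguard | Non-PII* |
| level | 10 | Non-PII |
| faction | None | Non-PII |

* Racial or ethnic origin is special category data under GDPR Article 9; see the expert note above.

After: Sat_Character_PII (deletable)

| Column | Example Value |
| --- | --- |
| hub_character_hk | a3f7... |
| real_name | Nazeem |
| email_address | nazeem@whiterun.gov |
| home_address | Chillfurrow Farm, Whiterun |
| date_of_birth | 4E 175 |
| load_date | 2026-01-01 |
| record_source | Whiterun Census |

After: Sat_Character_Attributes (retained)

| Column | Example Value |
| --- | --- |
| hub_character_hk | a3f7... |
| race | Redguard |
| level | 10 |
| faction | None |
| guild_membership | None |
| load_date | 2026-01-01 |
| record_source | Whiterun Census |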

Same Hub. Same hash key. Two Satellites instead of one. Deletion targets only the PII Satellite.
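
The split and the targeted delete can be sketched with an in-memory SQLite database. This is an illustration under stated assumptions, not a production procedure: the column subset is reduced, the hash key value is a placeholder (the source shows it truncated), and `erase_pii` is a hypothetical helper name.

```python
import sqlite3

HK = "a3f7..."  # placeholder: the source shows only a truncated hash key

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sat_Character_PII (
    hub_character_hk TEXT, real_name TEXT, email_address TEXT,
    load_date TEXT, record_source TEXT);
CREATE TABLE Sat_Character_Attributes (
    hub_character_hk TEXT, race TEXT, level INTEGER, faction TEXT,
    load_date TEXT, record_source TEXT);
""")
conn.execute(
    "INSERT INTO Sat_Character_PII VALUES (?, ?, ?, ?, ?)",
    (HK, "Nazeem", "nazeem@whiterun.gov", "2026-01-01", "Whiterun Census"))
conn.execute(
    "INSERT INTO Sat_Character_Attributes VALUES (?, ?, ?, ?, ?, ?)",
    (HK, "Redguard", 10, "None", "2026-01-01", "Whiterun Census"))


def erase_pii(conn, hash_key):
    """Delete every PII Satellite row for one hash key and return the row count."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "DELETE FROM Sat_Character_PII WHERE hub_character_hk = ?",
            (hash_key,))
    return cur.rowcount


removed = erase_pii(conn, HK)
pii_left = conn.execute("SELECT COUNT(*) FROM Sat_Character_PII").fetchone()[0]
attrs_left = conn.execute(
    "SELECT race, level, faction FROM Sat_Character_Attributes").fetchall()
print(removed, pii_left, attrs_left)
```

The delete touches one table only; the analytical Satellite is never referenced by the erasure statement.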

The Deletion Walkthrough

Scenario: Nazeem wants to be forgotten. He submits an Article 17 erasure request. Here’s what happens at each layer of the Data Vault.

1. Persistent Staging Area (PSA): Delete

Challenge: PSA is a historical copy of source data — it exists for auditability. Deleting from it means losing the ability to replay loads.

Solution: Find all source records containing Nazeem’s data. Delete or overwrite with tombstone records. This is a deliberate tradeoff: compliance trumps replayability.

2. Raw Vault, Hub: Review

Challenge: Hub_Character contains Nazeem’s business key. Is the business key PII?

If business key = “Nazeem” (a name) → YES, it’s PII → must address (see Artificial Hubs).

If business key = “CHAR_00742” (a system ID) → NO → Hub can remain.

3. Raw Vault, PII Satellite: Delete

Action: Delete all rows for Nazeem’s hash key from Sat_Character_PII, OR insert tombstone records.

After this: the hash key still exists in the Hub, but points to nothing identifiable.

4. Raw Vault, Non-PII Satellite: Keep

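Action: KEEP. Sat_Character_Attributes contains race (Redguard), level (10), faction (None). Once the PII Satellite rows are gone, these do not identify Nazeem on their own, provided the combination is not rare enough to single him out. Race still needs the DPO’s sign-off as potential special category data (Art. 9).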

Analytical value preserved: “an anonymous Redguard, level 10, no faction” still contributes to aggregate analysis.

5. Raw Vault, Links: Review

If Nazeem’s Hub entry remains (non-PII business key or Artificial Hub): Links can remain.

If Nazeem’s Hub entry is deleted: Links referencing it become orphaned → clean up.

6. Business Vault: Review

Any derived calculations using Nazeem’s PII directly: recalculate without it. If only aggregated/anonymized data was used: may remain.

7. Information Marts & Backups: Review

Marts: Pre-aggregated views including Nazeem at row level → rebuild without him. Aggregate summaries → usually fine.

Backups: Right to erasure extends to backups. Practical reality: deletion is deferred to the next backup rotation cycle, documented with a retention exception and timeline.
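
For the Links review in step 5, the orphan cleanup after a Hub row has been deleted can be sketched as below. `Link_Character_Guild` and the key values are hypothetical names chosen for the example; a real vault would repeat the statement for every Link that references the Hub.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Hub_Character (hub_character_hk TEXT PRIMARY KEY);
CREATE TABLE Link_Character_Guild (
    link_hk TEXT, hub_character_hk TEXT, hub_guild_hk TEXT);
INSERT INTO Hub_Character VALUES ('hk_kept');
INSERT INTO Link_Character_Guild VALUES ('l1', 'hk_kept', 'g1');
INSERT INTO Link_Character_Guild VALUES ('l2', 'hk_deleted', 'g1');
""")

# Remove Link rows whose Hub entry no longer exists.
with conn:
    cur = conn.execute("""
        DELETE FROM Link_Character_Guild
        WHERE NOT EXISTS (
            SELECT 1 FROM Hub_Character h
            WHERE h.hub_character_hk = Link_Character_Guild.hub_character_hk)
    """)
orphans_removed = cur.rowcount
remaining = [row[0] for row in conn.execute(
    "SELECT link_hk FROM Link_Character_Guild")]
print(orphans_removed, remaining)
```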

Key Insight: Data Lineage Is Compliance Infrastructure

The deletion walkthrough above requires knowing exactly where Nazeem’s data lives across all layers. This is data lineage. Without it, you can’t fulfill the erasure request.

Data Vault’s record_source and load_date on every row make lineage traceable by design — but you still need tooling (data catalog, lineage graphs) to operationalize it.
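
As a rough stand-in for a data catalog, the sketch below scans a database's own metadata for every table that carries a given hash key column and counts the matching rows. It assumes SQLite and a hypothetical helper named `locate_hash_key`; real lineage tooling also has to cover staging files, marts, and downstream systems that this lookup cannot see.

```python
import sqlite3


def locate_hash_key(conn, hk_column, hash_key):
    """Return {table_name: row_count} for each table holding rows for hash_key."""
    hits = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # Column name is the second field of each PRAGMA table_info row.
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if hk_column not in columns:
            continue
        count = conn.execute(
            f'SELECT COUNT(*) FROM "{table}" WHERE "{hk_column}" = ?',
            (hash_key,)).fetchone()[0]
        if count:
            hits[table] = count
    return hits


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Hub_Character (hub_character_hk TEXT, business_key TEXT);
CREATE TABLE Sat_Character_PII (hub_character_hk TEXT, real_name TEXT);
CREATE TABLE Sat_Character_Attributes (hub_character_hk TEXT, race TEXT);
CREATE TABLE Ref_Faction (faction_code TEXT);
INSERT INTO Hub_Character VALUES ('hk1', 'CHAR_00742');
INSERT INTO Sat_Character_PII VALUES ('hk1', 'Nazeem');
INSERT INTO Sat_Character_PII VALUES ('hk1', 'Nazeem of Whiterun');
INSERT INTO Sat_Character_Attributes VALUES ('hk2', 'Nord');
""")
locations = locate_hash_key(conn, "hub_character_hk", "hk1")
print(locations)
```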

Artificial Hubs

PII Satellite isolation works perfectly — until the business key IS the personal data. If Nazeem’s Hub stores his email address as the business key, deleting the PII Satellite doesn’t help. The Hub itself is the problem.

Language Bridge — Artificial

Artificial — from Latin artificium: “made by human craft.” Ars (skill, craft) + facere (to make). Not fake or inferior — deliberately constructed for a purpose.

Why this word: An Artificial Hub uses a crafted, synthetic key instead of the natural business key. It’s artificial not because it’s fake, but because it was deliberately constructed to solve a specific problem: making the Hub non-identifiable.

4-Layer Explanation — Artificial Hubs
L1 Plain language

Sometimes the business key itself is personal data (like an email). You can’t just split it into a Satellite — it’s the core identifier in the Hub. So you create an Artificial Hub that uses a random number as its business key instead. The real personal identifier lives in a Satellite that can be deleted.

L2 With mechanism

The Artificial Hub replaces the PII business key with a randomly generated surrogate key (e.g., a UUID). The real personal identifier lives in a PII Satellite attached to the Artificial Hub. When a deletion request arrives: delete the PII Satellite, the Artificial Hub remains (it’s just a random number), and all analytical relationships remain intact.

L3 Technical

Standard Hub: Hub_Customer(hub_customer_hk, email_address, load_date, record_source) — the email is the business key. Artificial Hub: Hub_Customer(hub_customer_hk, artificial_key, load_date, record_source) — the business key is a UUID like a8f3e2b1-.... The email lives in Sat_Customer_PII(hub_customer_hk, email_address, real_name, ...). On deletion: DELETE all rows from Sat_Customer_PII. The Hub retains a8f3e2b1-... — meaningless without the PII Satellite.

L4 Expert nuance

The hash key itself may still be problematic. If the original business key was an email, the hash key was derived FROM that email. Anyone with the original email and the hash algorithm could recalculate the hash and match it — meaning the hash constitutes personal data under GDPR. With an Artificial Hub, the hash key is derived from the random surrogate, not from PII, so this problem disappears. This is one of the strongest arguments for Artificial Hubs in GDPR-sensitive implementations.
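
A minimal sketch of the difference, assuming an MD5-based hash key and a UUID as the artificial key. The function names are hypothetical, and the dictionaries stand in for table rows.

```python
import hashlib
import uuid


def hash_key(business_key: str) -> str:
    """Assumed convention: MD5 of the trimmed, upper-cased business key."""
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()


def new_artificial_hub_row(load_date: str, record_source: str) -> dict:
    """Create a Hub row whose business key is a random UUID rather than PII."""
    artificial_key = str(uuid.uuid4())
    return {
        "hub_customer_hk": hash_key(artificial_key),
        "artificial_key": artificial_key,
        "load_date": load_date,
        "record_source": record_source,
    }


email = "nazeem@whiterun.gov"
hub_row = new_artificial_hub_row("2026-01-01", "Whiterun Census")

# The email lives only in the deletable PII Satellite row.
pii_row = {"hub_customer_hk": hub_row["hub_customer_hk"], "email_address": email}

# Recomputing a hash from the email no longer leads to the Hub.
email_matches_hub = hash_key(email) == hub_row["hub_customer_hk"]
print(email_matches_hub)
```

One design consequence: because the key is random, loading a new record for an existing customer requires looking up the email in the PII Satellite to find the matching artificial key.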

Standard Hub vs. Artificial Hub

Standard Hub (PII business key, the problem)

| Column | Value | Issue |
| --- | --- | --- |
| hub_character_hk | hash(“Nazeem”) | Derived from PII |
| business_key | Nazeem | IS personal data |
| load_date | 2026-01-01 | |
| record_source | Whiterun Census | |

Artificial Hub (safe surrogate key)

| Column | Value | Status |
| --- | --- | --- |
| hub_character_hk | hash(“ART-00742”) | Not derived from PII |
| business_key | ART-00742 | Not personal data |
| load_date | 2026-01-01 | |
| record_source | Whiterun Census | |

Tombstone Records

Language Bridge — Tombstone

Tombstone — literally, a gravestone. A marker that says: “something was here. It is gone now.”

Why this word: A tombstone record replaces deleted data with a marker that says “this data existed and was deleted on this date, for this reason.” It preserves the audit trail without preserving the personal data.

4-Layer Explanation — Tombstone Records
L1 Plain language

Instead of just deleting a row and leaving nothing, you replace it with a special row that says “deleted.” That way the audit trail shows the deletion happened — you can prove you complied with the request.

L2 With mechanism

A tombstone record contains the hash key (to identify which entity), a deletion timestamp, the reason for deletion (e.g., “Article 17 request #2847”), and NULL or placeholder values where the PII used to be. The row itself contains no personal data — it’s just a marker.

L3 Technical

In Data Vault’s append-only model, a DELETE is philosophically uncomfortable — you’re destroying history. Tombstones resolve this: you INSERT a new Satellite row with the same hash key, a current load_date, NULLs for all PII columns, and a flag indicating “GDPR_ERASED.” The previous PII rows are then physically deleted. The tombstone remains as the audit trail.

L4 Expert nuance

The tombstone itself must not contain PII. “Deleted per request from nazeem@whiterun.gov” defeats the purpose. The deletion reason should reference a request ID, not the requester’s identity. The request ID maps to a separate, access-controlled deletion log maintained by the DPO’s office — not stored in the data warehouse itself.
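
The insert-then-delete sequence described above can be sketched as follows, using SQLite and a reduced column set. The helper name `erase_with_tombstone`, the `hk1` key, and the second sample row are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sat_Character_PII (
    hub_character_hk TEXT, real_name TEXT, email_address TEXT,
    load_date TEXT, record_source TEXT, deletion_request_id TEXT);
INSERT INTO Sat_Character_PII VALUES
    ('hk1', 'Nazeem', 'nazeem@whiterun.gov', '2026-01-01', 'Whiterun Census', NULL);
INSERT INTO Sat_Character_PII VALUES
    ('hk1', 'Nazeem of Whiterun', 'nazeem@whiterun.gov', '2026-01-15', 'Whiterun Census', NULL);
""")


def erase_with_tombstone(conn, hash_key, request_id, erased_on):
    """Insert a PII-free tombstone, then physically delete the earlier PII rows."""
    with conn:  # one transaction: both statements succeed or neither does
        conn.execute(
            "INSERT INTO Sat_Character_PII "
            "VALUES (?, NULL, NULL, ?, 'GDPR_ERASED', ?)",
            (hash_key, erased_on, request_id))
        conn.execute(
            "DELETE FROM Sat_Character_PII "
            "WHERE hub_character_hk = ? AND record_source <> 'GDPR_ERASED'",
            (hash_key,))


erase_with_tombstone(conn, "hk1", "ART17-2847", "2026-02-21")
rows = conn.execute("SELECT * FROM Sat_Character_PII").fetchall()
print(rows)
```

The tombstone carries only the hash key, the date, the marker, and the request ID, so the audit trail survives without any of the erased values.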

Tombstone Record in Sat_Character_PII

| Column | Value |
| --- | --- |
| hub_character_hk | a3f7... |
| real_name | NULL |
| email_address | NULL |
| home_address | NULL |
| date_of_birth | NULL |
| load_date | 2026-02-21 |
| record_source | GDPR_ERASED |
| deletion_request_id | ART17-2847 |

Exercise 01

PII Classification

Given these Skyrim character attributes, classify each as PII (personal data under GDPR) or Non-PII (analytical data safe to retain).

real_name
race
level
faction
home_address
date_of_birth
guild_membership
email_address
weapon_preference
bounty_amount

EU AI Act — Article 10

GDPR tells you what to delete. The AI Act tells you what to document. They’re designed to be complementary — but in practice, they create compounding obligations that only a governed data architecture can satisfy.

Language Bridge — Governance

Governance — from Greek kybernan: “to steer a ship.” Same root as cybernetics (the study of steering/control systems) and governor (the one who steers).

Why this word: Data governance is steering your data — deciding where it goes, who can use it, how it’s documented, and what happens when things go wrong. Without governance, AI on data is a ship without a rudder.

4-Layer Explanation — EU AI Act
L1 Plain language

The EU AI Act is a new law that regulates AI systems based on how risky they are. High-risk AI (like credit scoring or hiring tools) must prove their training data is high-quality, unbiased, and well-documented. Article 10 is the part that says how.

L2 With mechanism

The AI Act classifies AI into four risk tiers: prohibited (banned), high-risk (heavily regulated), limited risk (transparency required), minimal risk (no obligations). Article 10 applies to high-risk systems and requires documented data provenance, transformation documentation, quality verification, and bias examination.

L3 Technical

Article 10(2) requires: documented design choices for data selection, full data lineage including original collection purpose (DV’s record_source + load_date), documentation of all preparation operations (staging + Business Vault transformations), assessment of data quantity and suitability (pre-AI Mart validation), and bias examination for impacts on health, safety, and fundamental rights.

L4 Expert nuance

Article 10(5) creates a limited exception to GDPR’s prohibition on processing special category data (race, health, religion). Providers of high-risk AI may process this data specifically for bias detection and correction — but only when anonymized/synthetic data won’t work, with strict safeguards, no third-party transfers, and mandatory deletion after bias correction. This is a direct bridge between GDPR and the AI Act.

AI Act Requirements → Data Vault Features

Article 10 ↔ Data Vault Mapping

| AI Act Requirement | Data Vault Feature |
| --- | --- |
| Data provenance (Art. 10(2)(b)) | record_source and load_date on every row |
| Transformation documentation (Art. 10(2)(c)) | Staging → Raw Vault → Business Vault: each layer is distinct and auditable |
| Quality verification (Art. 10(3)) | Business Vault validation rules + Feature Mart quality gates |
| Bias examination (Art. 10(2)(f)) | Profiling queries on Business Vault: demographic breakdowns |
| Immutability for audit (Arts. 11-12) | Append-only Satellite design: nothing overwritten |

Penalty Comparison

| Regulation | Maximum Fine | % of Turnover |
| --- | --- | --- |
| GDPR (most serious violations, Art. 83(5)) | EUR 20,000,000 | or 4% |
| AI Act (high-risk non-compliance) | EUR 15,000,000 | or 3% |
| AI Act (prohibited practices) | EUR 35,000,000 | or 7% |

The AI Act’s top tier is 1.75 times GDPR’s (EUR 35M vs. EUR 20M, 7% vs. 4%). A system that breaches both regulations can be exposed to fines under each.
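
To see how the two caps in each row interact, here is a small arithmetic sketch. It assumes the "whichever is higher" rule that applies to undertakings in general (the AI Act applies the lower of the two amounts to SMEs), and the turnover figures are hypothetical.

```python
def fine_ceiling(fixed_cap_eur: int, turnover_pct: int, annual_turnover_eur: int) -> int:
    """Upper limit of a fine: the fixed cap or the turnover share, whichever is higher."""
    return max(fixed_cap_eur, annual_turnover_eur * turnover_pct // 100)


large_turnover = 2_000_000_000  # hypothetical worldwide annual turnover in EUR
small_turnover = 100_000_000    # hypothetical smaller undertaking

gdpr_large = fine_ceiling(20_000_000, 4, large_turnover)
ai_prohibited_large = fine_ceiling(35_000_000, 7, large_turnover)
gdpr_small = fine_ceiling(20_000_000, 4, small_turnover)

print(gdpr_large, ai_prohibited_large, gdpr_small)
```

For the large undertaking the percentage governs (EUR 80M and EUR 140M); for the smaller one the fixed EUR 20M cap is the higher figure.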

Data Mesh Governance

Four Principles of Data Mesh

1. Domain Ownership — Each business domain (sales, logistics, finance) owns its data end-to-end.

2. Data as a Product — Each domain treats its data outputs with the same rigor as customer-facing products.

3. Self-Serve Data Infrastructure — A central platform provides tools and templates so domains can build independently.

4. Federated Computational Governance — Governance policies defined centrally, executed locally by each domain.

How Data Vault Supports Data Mesh

Raw Vault = centralized — single integration layer across all domains, cross-domain consistency.

Business Vault = decentralized — each domain applies its own business rules.

Built-in governance = federated governance infrastructure: record_source, load_date, and immutability provide the audit infrastructure that federated governance relies on.

Key Takeaway
ScaleFree has presented on this at conferences. Understanding Data Mesh means being able to define it, state the four principles, and explain how Data Vault supports it.
Exercise 02

Match GDPR Rights to DV Features

For each GDPR right, select the Data Vault feature that helps implement it.

A. Right to erasure (Art. 17)
B. Right of access / DSAR (Art. 15)
C. Right to rectification (Art. 16)
D. Purpose limitation (Art. 5(1)(b))
E. Accountability (Art. 5(2))

Part 2 — Applied Concepts

The Consultant’s Day

Walking through a real compliance scenario

The Scenario

You’re a BI consultant at ScaleFree. A German financial services client calls: a customer has submitted a DSAR and a deletion request under Article 17. The client’s data warehouse is built on Data Vault 2.0.

Step 1

Understand the Request

Two separate obligations:

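DSAR (Art. 15): Show the customer all data held about them. Must respond without undue delay and within one month of receipt, extendable by two further months for complex requests (Art. 12(3)).

Deletion request (Art. 17): Delete all personal data. The same one-month deadline applies. The DPO checks for exceptions (legal retention obligations, ongoing disputes, public interest).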

Your role as BI consultant: You don’t handle the legal side — the DPO does. You handle the architectural side: finding the data, executing the deletion, documenting it.

Step 2

Map the Data

Use the data catalog and lineage tools to find every location of this customer’s data:

Data Map, Layer by Layer

| Layer | Contains PII? | Action |
| --- | --- | --- |
| Persistent Staging Area | Yes | Delete / tombstone |
| Raw Vault: Hub | Depends on BK | Keep if non-PII key; Artificial Hub if PII |
| Raw Vault: PII Satellite | Yes | Delete + tombstone |
| Raw Vault: Non-PII Sat | No | Keep |
| Raw Vault: Links | Depends | Keep if Hub remains |
| Business Vault | Possibly | Review and recalculate |
| Information Marts | Possibly | Rebuild affected views |
| Backups | Yes | Mark for next rotation |

Step 3

Execute the DSAR

Query the vault to find all data about this customer. Follow every hash key path:

Hub_Customer → Sat_Customer_PII (name, email, address)

Hub_Customer → Sat_Customer_Attributes (analytical data)

Hub_Customer → Link_Customer_Account → Hub_Account

Compile the results. Send to DPO for review and delivery to the customer.
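
The three hash key paths above can be sketched as one collection function. The table names follow the paths listed in this step; the column names, sample values, and the `collect_dsar` helper are assumptions made for the example, and SQLite stands in for the client's warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows behave like mappings, so dict(row) works
conn.executescript("""
CREATE TABLE Hub_Customer (hub_customer_hk TEXT, business_key TEXT);
CREATE TABLE Sat_Customer_PII (hub_customer_hk TEXT, real_name TEXT,
                               email_address TEXT, load_date TEXT);
CREATE TABLE Sat_Customer_Attributes (hub_customer_hk TEXT, segment TEXT, load_date TEXT);
CREATE TABLE Hub_Account (hub_account_hk TEXT, business_key TEXT);
CREATE TABLE Link_Customer_Account (hub_customer_hk TEXT, hub_account_hk TEXT);
INSERT INTO Hub_Customer VALUES ('c1', 'CUST_001');
INSERT INTO Sat_Customer_PII VALUES ('c1', 'N. Example', 'old@example.com', '2025-01-01');
INSERT INTO Sat_Customer_PII VALUES ('c1', 'N. Example', 'new@example.com', '2025-06-01');
INSERT INTO Sat_Customer_Attributes VALUES ('c1', 'retail', '2025-01-01');
INSERT INTO Hub_Account VALUES ('a9', 'ACC_9');
INSERT INTO Link_Customer_Account VALUES ('c1', 'a9');
""")


def collect_dsar(conn, customer_business_key):
    """Gather every row reachable from one customer's Hub entry."""
    hub = conn.execute(
        "SELECT hub_customer_hk FROM Hub_Customer WHERE business_key = ?",
        (customer_business_key,)).fetchone()
    if hub is None:
        return None
    hk = hub["hub_customer_hk"]

    def fetch(sql):
        return [dict(row) for row in conn.execute(sql, (hk,))]

    return {
        # Full Satellite history, not only the latest row.
        "pii_history": fetch(
            "SELECT real_name, email_address, load_date FROM Sat_Customer_PII "
            "WHERE hub_customer_hk = ? ORDER BY load_date"),
        "attributes": fetch(
            "SELECT segment, load_date FROM Sat_Customer_Attributes "
            "WHERE hub_customer_hk = ? ORDER BY load_date"),
        "accounts": fetch(
            "SELECT a.business_key AS account FROM Link_Customer_Account l "
            "JOIN Hub_Account a ON a.hub_account_hk = l.hub_account_hk "
            "WHERE l.hub_customer_hk = ?"),
    }


report = collect_dsar(conn, "CUST_001")
print(report)
```

Returning the full Satellite history matters here: a DSAR covers all data held, and an append-only Satellite holds every past version.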

Step 4

Execute the Deletion

1. Run the deletion procedure (automated via datavault4dbt if configured, manual otherwise).

2. Log every action: what was deleted, from which table, at what timestamp, by whom, under which request ID.

3. The deletion log itself is a new audit record — you need to prove you deleted the data.
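
A deletion log along these lines might look like the sketch below. The field names, the `log_deletion` helper, the table names, and the service account are all hypothetical; the point is that each entry records what, where, when, who, and which request, and nothing that identifies the data subject.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class DeletionLogEntry:
    request_id: str    # links to the DPO's access-controlled request register
    table_name: str
    rows_deleted: int
    executed_by: str
    executed_at: str   # UTC timestamp in ISO 8601 format


def log_deletion(log, request_id, table_name, rows_deleted, executed_by):
    """Append one PII-free audit entry and return it."""
    entry = DeletionLogEntry(
        request_id, table_name, rows_deleted, executed_by,
        datetime.now(timezone.utc).isoformat())
    log.append(entry)
    return entry


audit_log = []
log_deletion(audit_log, "ART17-2847", "PSA_Customer", 5, "svc_gdpr_erasure")
log_deletion(audit_log, "ART17-2847", "Sat_Customer_PII", 3, "svc_gdpr_erasure")

total_rows = sum(entry.rows_deleted for entry in audit_log)
as_json = json.dumps([asdict(entry) for entry in audit_log], indent=2)
print(as_json)
```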

Step 5

Confirm and Document

DPO confirms completion to the data subject.

Deletion record stored (without the deleted PII) for accountability.

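If data was shared with third parties: notify each recipient of the erasure (Art. 19). If the data was made public, take reasonable steps to inform other controllers processing it (Art. 17(2)).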

The Collaboration Points

Consultant vs. DPO Responsibilities

| Phase | Consultant’s Role | DPO’s Role |
| --- | --- | --- |
| Design time | Design Satellite splitting, Artificial Hub pattern | Review and approve PII classification |
| Build time | Implement deletion procedures, lineage tracking | Validate all PII locations are covered |
| Runtime | Architecture executes the deletion | Triggers the request, confirms to data subject |

Key Concepts

On the GDPR-DV Tension
“What makes this genuinely interesting is that GDPR was written for transactional systems where you find a row and delete it. Data warehouses are designed to never delete anything — that’s the whole point. Resolving that tension isn’t just a legal problem or a technical problem, it’s both at once.”
On PII Satellite Isolation
“The standard approach is to design your Satellites so PII attributes are physically separated from analytical data. When a deletion request comes in, you delete the PII Satellite rows and leave the analytical value intact. The Hub and its hash key can remain because they no longer point to identifiable data.”
On Artificial Hubs
“When the business key itself is personal data — like an email address — you use an Artificial Hub with a synthetic surrogate key. The real personal identifier lives in a deletable Satellite. That way the Hub can survive deletion without breaking referential integrity.”
On Data Vault’s GDPR Advantage
“Data Vault wasn’t designed for GDPR, but its architecture is structurally well-suited for compliance. Every row has a load timestamp and record source — that’s your audit trail built in. Nothing gets overwritten — that’s your accountability. The layered architecture means you can track data lineage from source to mart.”
On AI Act + DV
“The EU AI Act requires documented data provenance, transformation audit trails, and bias examination for high-risk AI systems. Data Vault provides most of this by design — record source, load date, and the separation between Raw Vault and Business Vault create a traceable transformation chain. The Feature Mart becomes the compliance enforcement point.”