The Conflict
GDPR says delete everything. Data Vault says delete nothing. One is law, the other is architecture; both are right. Resolving this conflict is one of the hardest problems in data engineering.
Compliance — from Latin complēre: “to fill up, to fulfill, to complete.”
Same root as complete (fulfilled, nothing missing) and complement (what fills a gap).
Why this word: Compliance means fulfilling an obligation completely. Not partially, not approximately. When GDPR says “delete,” compliance means actually deleted — from every layer, every backup, every mart. The word itself tells you: half-measures don’t count.
Erasure — from Latin erasura, from eradere: “to scrape off.” In Roman times, writing was done on wax tablets. To erase, you scraped the wax flat.
Why this word: Article 17 uses “right to erasure” rather than “right to deletion” because erasure implies thoroughness. You don’t just remove the record — you scrape away all traces. In a data warehouse with 6+ layers, that scraping reaches deep.
GDPR is a European law that gives people control over their personal data. If someone asks a company to delete their data, the company generally has to do it (with only narrow exceptions).
GDPR (General Data Protection Regulation) applies to any organization processing personal data of EU residents. It grants specific rights — access, erasure, portability, rectification — and imposes obligations: purpose limitation, data minimization, accountability. Enforceable since May 2018. Fines up to EUR 20M or 4% of global annual turnover (whichever is higher) for the most serious violations (Art. 83(5)); EUR 10M or 2% for other violations (Art. 83(4)).
GDPR was designed for transactional databases where you find a row and delete it. Data warehouses keep everything forever for analytical purposes. These two design philosophies are in direct conflict. The regulation doesn’t distinguish between an operational database and a data warehouse — Article 17 applies to both. This means the right to erasure requires multi-layer deletion: staging, Raw Vault, Business Vault, Information Marts, and backups.
GDPR is a regulation (directly applicable across all EU member states), but member states have implementation discretion. Germany has 17 supervisory authorities (1 federal + 16 state-level) and stricter employee data protection (§26 BDSG). France has CNIL, which has issued the largest GDPR fines in Europe. A Data Vault serving clients in both Germany and France may need different handling depending on the national authority’s guidance. A consultant with GDPR transition experience under CNIL's regime brings directly applicable knowledge to French-market clients.
Key GDPR Rights
Four rights and two principles that directly shape how you design and operate a Data Vault:
| Right | Article | What It Means | DV Implication |
|---|---|---|---|
| Right to erasure | Art. 17 | Delete my personal data | Multi-layer deletion across all DV layers |
| Right of access (DSAR) | Art. 15 | Show me all data you hold about me | JOINs from Hub through all Satellites |
| Right to rectification | Art. 16 | Fix incorrect data about me | Append-only Satellite: insert corrected row |
| Right to data portability | Art. 20 | Give me my data in machine-readable format | Export from vault in structured format |
| Purpose limitation | Art. 5(1)(b) | Data collected for one purpose can’t be freely repurposed | Tracked via record_source and metadata |
| Data minimization | Art. 5(1)(c) | Only collect what’s necessary | Tension with DV’s “load everything” philosophy |
Pseudonymization ≠ Anonymization
Pseudonymization: Replace identifying info with a code. Person can still be re-identified if you have the mapping key. Data is still personal data under GDPR. Still protected.
Anonymization: Remove all identifying info such that re-identification is impossible. Only truly anonymized data falls outside GDPR’s scope.
Hash keys in Data Vault are pseudonymization, not anonymization. If the business key is an email address, the hash of that email is recoverable by anyone with the original email and the hash algorithm. Retaining the hash after deleting the business key does NOT achieve GDPR compliance.
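This re-identification risk is easy to demonstrate. A minimal sketch (the MD5-over-normalized-key convention is a common Data Vault choice, but the argument holds for any deterministic hash):

```python
import hashlib

def hash_key(business_key: str) -> str:
    # Data Vault-style hash key: hash of the normalized business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# The warehouse deletes the email column but keeps the hash key...
stored_hash = hash_key("nazeem@whiterun.gov")

# ...yet anyone who still holds the email can recompute the hash and
# re-link the supposedly deleted person. Pseudonymized, not anonymized.
assert hash_key("Nazeem@Whiterun.gov") == stored_hash
```

Because the lookup works in both directions for anyone holding the email, the retained hash remains personal data under GDPR.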
PII Satellite Isolation
Not all attributes are created equal. Some are personal data protected by law. Others are analytical data safe to keep forever. The trick: put them in different Satellites.
Isolation — from Latin insula: “island.” To isolate is to make something into an island — separated, surrounded, reachable only by deliberate crossing.
Why this word: PII Satellite isolation makes personal data into an island. It’s physically separated from the analytical data. When a deletion request arrives, you delete the island without touching the mainland.
Instead of putting all of a character’s details in one Satellite table, you split it: personal stuff (name, email, address) goes in one Satellite, analytical stuff (race, level, faction) goes in another. When someone says “delete my personal data,” you delete the personal Satellite and leave the analytical one untouched.
This is a specific case of the Satellite splitting you already know. In Day 2, you learned Satellites split by rate of change or by source system. Here’s a third reason: by data classification. PII attributes get their own Satellite. Both Satellites still connect to the same Hub via the same hash key. The split is physical, not logical.
The PII Satellite (Sat_Character_PII) contains attributes classified as personal data: name, email, address, date of birth. The non-PII Satellite (Sat_Character_Attributes) contains analytical attributes: race, level, faction, guild membership. Both share the same hub_character_hk. On deletion: DELETE FROM Sat_Character_PII WHERE hub_character_hk = [target]. The non-PII Satellite is untouched.
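The split-and-delete pattern can be sketched end to end. A minimal sketch using Python’s sqlite3, with table and column names taken from this section (the schema is simplified and omits some DV metadata columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_character_pii (
    hub_character_hk TEXT, real_name TEXT, email_address TEXT,
    load_date TEXT, record_source TEXT);
CREATE TABLE sat_character_attributes (
    hub_character_hk TEXT, race TEXT, level INTEGER, faction TEXT,
    load_date TEXT, record_source TEXT);
""")
con.execute("INSERT INTO sat_character_pii VALUES "
            "('a3f7', 'Nazeem', 'nazeem@whiterun.gov', '2026-01-01', 'Whiterun Census')")
con.execute("INSERT INTO sat_character_attributes VALUES "
            "('a3f7', 'Redguard', 10, 'None', '2026-01-01', 'Whiterun Census')")

# Article 17 request: delete only the PII island, leave the analytical mainland.
con.execute("DELETE FROM sat_character_pii WHERE hub_character_hk = ?", ("a3f7",))

pii_left = con.execute("SELECT COUNT(*) FROM sat_character_pii").fetchone()[0]
attrs_left = con.execute("SELECT COUNT(*) FROM sat_character_attributes").fetchone()[0]
```

After the delete, `pii_left` is 0 and `attrs_left` is 1: the anonymous Redguard stays available for aggregate analysis.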
The classification decision — which attributes are PII and which aren’t — is not purely technical. It requires legal judgment. “Race” might seem like analytical data, but under GDPR Article 9, racial/ethnic origin is special category data with even stricter protections. PII classification should be done with the Data Protection Officer, not by the data engineer alone. This is exactly where a BI consultant with legal background adds value.
The Skyrim Model — Before & After
Before — a single Satellite holding everything:

| Column | Example Value | Classification |
|---|---|---|
| hub_character_hk | a3f7... | — |
| real_name | Nazeem | PII |
| email_address | nazeem@whiterun.gov | PII |
| race | Redguard | Non-PII* |
| level | 10 | Non-PII |
| faction | None | Non-PII |

*Race looks analytical, but racial/ethnic origin is special category data under GDPR Art. 9; see the classification caveat above.
After — Sat_Character_PII (the deletable island):

| Column | Example Value |
|---|---|
| hub_character_hk | a3f7... |
| real_name | Nazeem |
| email_address | nazeem@whiterun.gov |
| home_address | Chillfurrow Farm, Whiterun |
| date_of_birth | 4E 175 |
| load_date | 2026-01-01 |
| record_source | Whiterun Census |
After — Sat_Character_Attributes (retained for analytics):

| Column | Example Value |
|---|---|
| hub_character_hk | a3f7... |
| race | Redguard |
| level | 10 |
| faction | None |
| guild_membership | None |
| load_date | 2026-01-01 |
| record_source | Whiterun Census |
Same Hub. Same hash key. Two Satellites instead of one. Deletion targets only the PII Satellite.
The Deletion Walkthrough
Scenario: Nazeem wants to be forgotten. He submits an Article 17 erasure request. Here’s what happens at each layer of the Data Vault.
Persistent Staging Area (PSA)
Challenge: PSA is a historical copy of source data — it exists for auditability. Deleting from it means losing the ability to replay loads.
Solution: Find all source records containing Nazeem’s data. Delete or overwrite with tombstone records. This is a deliberate tradeoff: compliance trumps replayability.
Raw Vault — Hub
Challenge: Hub_Character contains Nazeem’s business key. Is the business key PII?
If business key = “Nazeem” (a name) → YES, it’s PII → must address (see Artificial Hubs).
If business key = “CHAR_00742” (a system ID) → NO → Hub can remain.
Raw Vault — PII Satellite
Action: Delete all rows for Nazeem’s hash key from Sat_Character_PII, OR insert tombstone records.
After this: the hash key still exists in the Hub, but points to nothing identifiable.
Raw Vault — Non-PII Satellite
Action: KEEP. Sat_Character_Attributes contains race (Redguard), level (10), faction (None). These are not identifiable without the PII Satellite.
Analytical value preserved: “an anonymous Redguard, level 10, no faction” still contributes to aggregate analysis.
Raw Vault — Links
If Nazeem’s Hub entry remains (non-PII business key or Artificial Hub): Links can remain.
If Nazeem’s Hub entry is deleted: Links referencing it become orphaned → clean up.
Business Vault
Any derived calculations using Nazeem’s PII directly: recalculate without it. If only aggregated/anonymized data was used: may remain.
Marts: Pre-aggregated views including Nazeem at row level → rebuild without him. Aggregate summaries → usually fine.
Backups: Right to erasure extends to backups. Practical reality: deletion is deferred to the next backup rotation cycle, documented with a retention exception and timeline.
Key Insight: Data Lineage Is Compliance Infrastructure
The deletion walkthrough above requires knowing exactly where Nazeem’s data lives across all layers. This is data lineage. Without it, you can’t fulfill the erasure request.
Data Vault’s record_source and load_date on every row make lineage traceable by design — but you still need tooling (data catalog, lineage graphs) to operationalize it.
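Underneath any catalog or lineage tool sits a map from layers to the tables that can hold a given entity’s data. A minimal sketch with hypothetical table names (every name below is illustrative, not from the source):

```python
# Hypothetical lineage register: which tables, per layer, can hold an entity's data.
LINEAGE = {
    "psa":            ["psa_whiterun_census"],
    "raw_vault":      ["hub_character", "sat_character_pii", "sat_character_attributes"],
    "business_vault": ["bsat_character_scores"],
    "marts":          ["mart_character_summary"],
}

def erasure_targets(pii_tables: set[str]) -> list[str]:
    # Tables an Article 17 request must touch: every lineage entry flagged as PII.
    return [t for tables in LINEAGE.values() for t in tables if t in pii_tables]

targets = erasure_targets({"psa_whiterun_census", "sat_character_pii"})
```

The PII flag per table comes from the classification exercise done with the DPO; the lineage register only answers “where does this live?”, not “is it personal data?”.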
Artificial Hubs
PII Satellite isolation works perfectly — until the business key IS the personal data. If Nazeem’s Hub stores his email address as the business key, deleting the PII Satellite doesn’t help. The Hub itself is the problem.
Artificial — from Latin artificium: “made by human craft.” Ars (skill, craft) + facere (to make). Not fake or inferior — deliberately constructed for a purpose.
Why this word: An Artificial Hub uses a crafted, synthetic key instead of the natural business key. It’s artificial not because it’s fake, but because it was deliberately constructed to solve a specific problem: making the Hub non-identifiable.
Sometimes the business key itself is personal data (like an email). You can’t just split it into a Satellite — it’s the core identifier in the Hub. So you create an Artificial Hub that uses a random number as its business key instead. The real personal identifier lives in a Satellite that can be deleted.
The Artificial Hub replaces the PII business key with a randomly generated surrogate key (e.g., a UUID). The real personal identifier lives in a PII Satellite attached to the Artificial Hub. When a deletion request arrives: delete the PII Satellite, the Artificial Hub remains (it’s just a random number), and all analytical relationships remain intact.
Standard Hub: Hub_Customer(hub_customer_hk, email_address, load_date, record_source) — the email is the business key. Artificial Hub: Hub_Customer(hub_customer_hk, artificial_key, load_date, record_source) — the business key is a UUID like a8f3e2b1-.... The email lives in Sat_Customer_PII(hub_customer_hk, email_address, real_name, ...). On deletion: DELETE all rows from Sat_Customer_PII. The Hub retains a8f3e2b1-... — meaningless without the PII Satellite.
The hash key itself may still be problematic. If the original business key was an email, the hash key was derived FROM that email. Anyone with the original email and the hash algorithm could recalculate the hash and match it — meaning the hash constitutes personal data under GDPR. With an Artificial Hub, the hash key is derived from the random surrogate, not from PII, so this problem disappears. This is one of the strongest arguments for Artificial Hubs in GDPR-sensitive implementations.
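The difference between the two derivations can be sketched directly (MD5 and the dict-as-Satellite are illustrative simplifications):

```python
import hashlib
import uuid

def hash_key(key: str) -> str:
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Standard Hub: the hash key is derived FROM the email, so anyone holding the
# email can recompute it. The hash itself is personal data.
standard_hk = hash_key("nazeem@whiterun.gov")

# Artificial Hub: the business key is a random surrogate minted at first load.
artificial_key = str(uuid.uuid4())
artificial_hk = hash_key(artificial_key)

# The email lives only in the deletable PII Satellite; that row is the sole
# bridge between surrogate and person.
pii_satellite = {artificial_hk: {"email_address": "nazeem@whiterun.gov"}}
del pii_satellite[artificial_hk]   # Article 17: delete the bridge

# After deletion, nothing derivable from the email matches the Hub's hash key.
assert hash_key("nazeem@whiterun.gov") != artificial_hk
```

The surrogate must be random, not derived: a surrogate computed from the email would just move the re-identification problem one step sideways.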
Standard Hub vs. Artificial Hub
Standard Hub:

| Column | Value | Issue |
|---|---|---|
| hub_character_hk | hash(“Nazeem”) | Derived from PII |
| business_key | Nazeem | IS personal data |
| load_date | 2026-01-01 | — |
| record_source | Whiterun Census | — |
Artificial Hub:

| Column | Value | Status |
|---|---|---|
| hub_character_hk | hash(“ART-00742”) | Not derived from PII |
| business_key | ART-00742 | Not personal data |
| load_date | 2026-01-01 | — |
| record_source | Whiterun Census | — |
Tombstone Records
Tombstone — literally, a gravestone. A marker that says: “something was here. It is gone now.”
Why this word: A tombstone record replaces deleted data with a marker that says “this data existed and was deleted on this date, for this reason.” It preserves the audit trail without preserving the personal data.
Instead of just deleting a row and leaving nothing, you replace it with a special row that says “deleted.” That way the audit trail shows the deletion happened — you can prove you complied with the request.
A tombstone record contains the hash key (to identify which entity), a deletion timestamp, the reason for deletion (e.g., “Article 17 request #2847”), and NULL or placeholder values where the PII used to be. The row itself contains no personal data — it’s just a marker.
In Data Vault’s append-only model, a DELETE is philosophically uncomfortable — you’re destroying history. Tombstones resolve this: you INSERT a new Satellite row with the same hash key, a current load_date, NULLs for all PII columns, and a flag indicating “GDPR_ERASED.” The previous PII rows are then physically deleted. The tombstone remains as the audit trail.
The tombstone itself must not contain PII. “Deleted per request from nazeem@whiterun.gov” defeats the purpose. The deletion reason should reference a request ID, not the requester’s identity. The request ID maps to a separate, access-controlled deletion log maintained by the DPO’s office — not stored in the data warehouse itself.
The tombstone row in Sat_Character_PII:

| Column | Value |
|---|---|
| hub_character_hk | a3f7... |
| real_name | NULL |
| email_address | NULL |
| home_address | NULL |
| date_of_birth | NULL |
| load_date | 2026-02-21 |
| record_source | GDPR_ERASED |
| deletion_request_id | ART17-2847 |
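The insert-tombstone-then-delete flow can be sketched with sqlite3 (schema simplified to the columns shown in the table above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sat_character_pii (
    hub_character_hk TEXT, real_name TEXT, email_address TEXT,
    load_date TEXT, record_source TEXT, deletion_request_id TEXT)""")
con.execute("INSERT INTO sat_character_pii VALUES "
            "('a3f7', 'Nazeem', 'nazeem@whiterun.gov', '2026-01-01', 'Whiterun Census', NULL)")

# Step 1: INSERT the tombstone — same hash key, NULL PII, request ID only.
con.execute("""INSERT INTO sat_character_pii
    VALUES ('a3f7', NULL, NULL, '2026-02-21', 'GDPR_ERASED', 'ART17-2847')""")

# Step 2: physically DELETE the earlier PII rows, keeping only the tombstone.
con.execute("""DELETE FROM sat_character_pii
    WHERE hub_character_hk = 'a3f7' AND record_source != 'GDPR_ERASED'""")

rows = con.execute("SELECT * FROM sat_character_pii").fetchall()
```

What survives is a single row proving a deletion happened, when, and under which request ID, with no personal data left to leak.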
EU AI Act — Article 10
GDPR tells you what to delete. The AI Act tells you what to document. They’re designed to be complementary — but in practice, they create compounding obligations that only a governed data architecture can satisfy.
Governance — from Greek kybernan: “to steer a ship.” Same root as cybernetics (the study of steering/control systems) and governor (the one who steers).
Why this word: Data governance is steering your data — deciding where it goes, who can use it, how it’s documented, and what happens when things go wrong. Without governance, AI on data is a ship without a rudder.
The EU AI Act is a new law that regulates AI systems based on how risky they are. High-risk AI (like credit scoring or hiring tools) must prove their training data is high-quality, unbiased, and well-documented. Article 10 is the part that says how.
The AI Act classifies AI into four risk tiers: prohibited (banned), high-risk (heavily regulated), limited risk (transparency required), minimal risk (no obligations). Article 10 applies to high-risk systems and requires documented data provenance, transformation documentation, quality verification, and bias examination.
Article 10(2) requires: documented design choices for data selection, full data lineage including original collection purpose (DV’s record_source + load_date), documentation of all preparation operations (staging + Business Vault transformations), assessment of data quantity and suitability (pre-AI Mart validation), and bias examination for impacts on health, safety, and fundamental rights.
Article 10(5) creates a limited exception to GDPR’s prohibition on processing special category data (race, health, religion). Providers of high-risk AI may process this data specifically for bias detection and correction — but only when anonymized/synthetic data won’t work, with strict safeguards, no third-party transfers, and mandatory deletion after bias correction. This is a direct bridge between GDPR and the AI Act.
AI Act Requirements → Data Vault Features
| AI Act Requirement | Data Vault Feature |
|---|---|
| Data provenance (Art. 10(2)(aa)) | record_source and load_date on every row |
| Transformation documentation (Art. 10(2)(c)) | Staging → Raw Vault → Business Vault: each layer is distinct and auditable |
| Quality verification (Art. 10(3)) | Business Vault validation rules + Feature Mart quality gates |
| Bias examination (Art. 10(2)(f)) | Profiling queries on Business Vault — demographic breakdowns |
| Immutability for audit (Arts. 11-12) | Append-only Satellite design — nothing overwritten |
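The “bias examination” row above is ultimately a set of profiling queries. A minimal sketch of a demographic breakdown against a Business Vault Satellite (table name from this chapter; the sample rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_character_attributes "
            "(hub_character_hk TEXT, race TEXT, level INTEGER)")
con.executemany("INSERT INTO sat_character_attributes VALUES (?, ?, ?)", [
    ("a3f7", "Redguard", 10), ("b1c2", "Nord", 25),
    ("c9d4", "Nord", 30),     ("d2e8", "Khajiit", 5),
])

# Group counts reveal representation skew in the would-be training population,
# which is the raw input to an Art. 10(2)(f) bias examination.
breakdown = dict(con.execute(
    "SELECT race, COUNT(*) FROM sat_character_attributes GROUP BY race"
).fetchall())
```

Spotting skew is the easy half; deciding whether it matters for health, safety, or fundamental rights is a judgment the DPO and AI provider make together.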
Penalty Comparison
| Regulation | Maximum Fine | % of Turnover |
|---|---|---|
| GDPR (most serious violations, Art. 83(5)) | EUR 20,000,000 | or 4% |
| AI Act (high-risk non-compliance) | EUR 15,000,000 | or 3% |
| AI Act (prohibited practices) | EUR 35,000,000 | or 7% |
The AI Act’s top tier is nearly double GDPR’s. Combined violations compound.
Data Mesh Governance
Four Principles of Data Mesh
1. Domain Ownership — Each business domain (sales, logistics, finance) owns its data end-to-end.
2. Data as a Product — Each domain treats its data outputs with the same rigor as customer-facing products.
3. Self-Serve Data Infrastructure — A central platform provides tools and templates so domains can build independently.
4. Federated Computational Governance — Governance policies defined centrally, executed locally by each domain.
How Data Vault Supports Data Mesh
Raw Vault = centralized — single integration layer across all domains, cross-domain consistency.
Business Vault = decentralized — each domain applies its own business rules.
Built-in governance = federated governance infrastructure — record_source, load_date, immutability provide the audit infrastructure that federated governance relies on.
ScaleFree has presented on this at conferences. Understanding Data Mesh means being able to define it, state the four principles, and explain how Data Vault supports it.
The Consultant’s Day
Walking through a real compliance scenario
The Scenario
You’re a BI consultant at ScaleFree. A German financial services client calls: a customer has submitted a DSAR and a deletion request under Article 17. The client’s data warehouse is built on Data Vault 2.0.
Understand the Request
Two separate obligations:
DSAR (Art. 15): Show the customer all data held about them. Must respond within one month (extendable by two further months for complex requests, Art. 12(3)).
Deletion request (Art. 17): Delete all personal data. Also one month. The DPO checks for exceptions (legal retention obligations, ongoing disputes, public interest).
Your role as BI consultant: You don’t handle the legal side — the DPO does. You handle the architectural side: finding the data, executing the deletion, documenting it.
Map the Data
Use the data catalog and lineage tools to find every location of this customer’s data:
| Layer | Contains PII? | Action |
|---|---|---|
| Persistent Staging Area | Yes | Delete / tombstone |
| Raw Vault — Hub | Depends on BK | Keep if non-PII key; Artificial Hub if PII |
| Raw Vault — PII Satellite | Yes | Delete + tombstone |
| Raw Vault — Non-PII Sat | No | Keep |
| Raw Vault — Links | Depends | Keep if Hub remains |
| Business Vault | Possibly | Review and recalculate |
| Information Marts | Possibly | Rebuild affected views |
| Backups | Yes | Mark for next rotation |
Execute the DSAR
Query the vault to find all data about this customer. Follow every hash key path:
Hub_Customer → Sat_Customer_PII (name, email, address)
Hub_Customer → Sat_Customer_Attributes (analytical data)
Hub_Customer → Link_Customer_Account → Hub_Account
Compile the results. Send to DPO for review and delivery to the customer.
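The three hash-key paths above translate into one query. A sketch with sqlite3, where the schemas are simplified and the `segment` column is an invented stand-in for the analytical attributes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (hub_customer_hk TEXT);
CREATE TABLE sat_customer_pii (hub_customer_hk TEXT, real_name TEXT, email_address TEXT);
CREATE TABLE sat_customer_attributes (hub_customer_hk TEXT, segment TEXT);
CREATE TABLE link_customer_account (hub_customer_hk TEXT, hub_account_hk TEXT);
INSERT INTO hub_customer VALUES ('a3f7');
INSERT INTO sat_customer_pii VALUES ('a3f7', 'Nazeem', 'nazeem@whiterun.gov');
INSERT INTO sat_customer_attributes VALUES ('a3f7', 'retail');
INSERT INTO link_customer_account VALUES ('a3f7', 'acc9');
""")

# DSAR: follow every path that starts at the customer's hash key.
dsar = con.execute("""
    SELECT p.real_name, p.email_address, a.segment, l.hub_account_hk
    FROM hub_customer h
    LEFT JOIN sat_customer_pii p        ON p.hub_customer_hk = h.hub_customer_hk
    LEFT JOIN sat_customer_attributes a ON a.hub_customer_hk = h.hub_customer_hk
    LEFT JOIN link_customer_account l   ON l.hub_customer_hk = h.hub_customer_hk
    WHERE h.hub_customer_hk = ?
""", ("a3f7",)).fetchall()
```

LEFT JOINs matter here: a DSAR must report what exists even when some Satellites hold no rows for this customer.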
Execute the Deletion
1. Run the deletion procedure (automated via datavault4dbt if configured, manual otherwise).
2. Log every action: what was deleted, from which table, at what timestamp, by whom, under which request ID.
3. The deletion log itself is a new audit record — you need to prove you deleted the data.
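The shape of such a log entry can be sketched as a small record type (field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DeletionLogEntry:
    """One audit record per deleted span: proof of compliance without PII."""
    request_id: str    # e.g. 'ART17-2847'; maps to the DPO's access-controlled log
    table_name: str
    rows_deleted: int
    deleted_at: str    # ISO-8601 UTC timestamp
    executed_by: str   # service account or operator ID, never the data subject

log = [
    DeletionLogEntry("ART17-2847", "sat_customer_pii", 3,
                     datetime.now(timezone.utc).isoformat(), "svc_gdpr_runner"),
    DeletionLogEntry("ART17-2847", "psa_customer", 12,
                     datetime.now(timezone.utc).isoformat(), "svc_gdpr_runner"),
]
# Note what is absent: no name, no email. The request ID is the only link
# back to the data subject, and resolving it requires the DPO's log.
```

One entry per table keeps the log granular enough to answer “did you really delete it everywhere?” layer by layer.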
Confirm and Document
DPO confirms completion to the data subject.
Deletion record stored (without the deleted PII) for accountability.
If data was shared with third parties: notify them (Art. 17(2)).
The Collaboration Points
| Phase | Consultant’s Role | DPO’s Role |
|---|---|---|
| Design time | Design Satellite splitting, Artificial Hub pattern | Review and approve PII classification |
| Build time | Implement deletion procedures, lineage tracking | Validate all PII locations are covered |
| Runtime | Architecture executes the deletion | Triggers the request, confirms to data subject |
Key Concepts
“What makes this genuinely interesting is that GDPR was written for transactional systems where you find a row and delete it. Data warehouses are designed to never delete anything — that’s the whole point. Resolving that tension isn’t just a legal problem or a technical problem, it’s both at once.”
“The standard approach is to design your Satellites so PII attributes are physically separated from analytical data. When a deletion request comes in, you delete the PII Satellite rows and leave the analytical value intact. The Hub and its hash key can remain because they no longer point to identifiable data.”
“When the business key itself is personal data — like an email address — you use an Artificial Hub with a synthetic surrogate key. The real personal identifier lives in a deletable Satellite. That way the Hub can survive deletion without breaking referential integrity.”
“Data Vault wasn’t designed for GDPR, but its architecture is structurally well-suited for compliance. Every row has a load timestamp and record source — that’s your audit trail built in. Nothing gets overwritten — that’s your accountability. The layered architecture means you can track data lineage from source to mart.”
“The EU AI Act requires documented data provenance, transformation audit trails, and bias examination for high-risk AI systems. Data Vault provides most of this by design — record source, load date, and the separation between Raw Vault and Business Vault create a traceable transformation chain. The Feature Mart becomes the compliance enforcement point.”