The Conflict
GDPR says delete everything. Data Vault says delete nothing. One is law, the other is architecture; both are right. Resolving this conflict is one of the hardest problems in data engineering.
Compliance — from Latin complēre: “to fill up, to fulfill, to complete.”
Same root as complete (fulfilled, nothing missing) and complement (what fills a gap).
Why this word: Compliance means fulfilling an obligation completely. Not partially, not approximately. When GDPR says “delete,” compliance means actually deleted — from every layer, every backup, every mart. The word itself tells you: half-measures don’t count.
Erasure — from Latin erasura, from eradere: “to scrape off.” In Roman times, writing was done on wax tablets. To erase, you scraped the wax flat.
Why this word: Article 17 uses “right to erasure” rather than “right to deletion” because erasure implies thoroughness. You don’t just remove the record — you scrape away all traces. In a data warehouse with 6+ layers, that scraping reaches deep.
GDPR is a European law that gives people control over their personal data. If someone asks a company to delete their data, the company generally has to do it (with only narrow exceptions).
GDPR (General Data Protection Regulation) applies to any organization processing personal data of EU residents. It grants specific rights — access, erasure, portability, rectification — and imposes obligations: purpose limitation, data minimization, accountability. Enforceable since May 2018. Fines up to EUR 20M or 4% of global annual turnover (whichever is higher) for the most serious violations (Art. 83(5)); EUR 10M or 2% for other violations (Art. 83(4)).
GDPR was designed for transactional databases where you find a row and delete it. Data warehouses keep everything forever for analytical purposes. These two design philosophies are in direct conflict. The regulation doesn’t distinguish between an operational database and a data warehouse — Article 17 applies to both. This means the right to erasure requires multi-layer deletion: staging, Raw Vault, Business Vault, Information Marts, and backups.
GDPR is a regulation (directly applicable across all EU member states), but member states have implementation discretion. Germany has 17 supervisory authorities (1 federal + 16 state-level) and stricter employee data protection (§26 BDSG). France has CNIL, which has issued the largest GDPR fines in Europe. A Data Vault serving clients in both Germany and France may need different handling depending on the national authority’s guidance. A consultant with GDPR transition experience under CNIL's regime brings directly applicable knowledge to French-market clients.
Key GDPR Rights
Four rights and two principles that directly shape how you design and operate a Data Vault:
| Right | Article | What It Means | DV Implication |
|---|---|---|---|
| Right to erasure | Art. 17 | Delete my personal data | Multi-layer deletion across all DV layers |
| Right of access (DSAR) | Art. 15 | Show me all data you hold about me | JOINs from Hub through all Satellites |
| Right to rectification | Art. 16 | Fix incorrect data about me | Append-only Satellite: insert corrected row |
| Right to data portability | Art. 20 | Give me my data in machine-readable format | Export from vault in structured format |
| Purpose limitation | Art. 5(1)(b) | Data collected for one purpose can’t be freely repurposed | Tracked via record_source and metadata |
| Data minimization | Art. 5(1)(c) | Only collect what’s necessary | Tension with DV’s “load everything” philosophy |
Pseudonymization ≠ Anonymization
Pseudonymization: Replace identifying info with a code. Person can still be re-identified if you have the mapping key. Data is still personal data under GDPR. Still protected.
Anonymization: Remove all identifying info such that re-identification is impossible. Only truly anonymized data falls outside GDPR’s scope.
Hash keys in Data Vault are pseudonymization, not anonymization. If the business key is an email address, the hash of that email is recoverable by anyone with the original email and the hash algorithm. Retaining the hash after deleting the business key does NOT achieve GDPR compliance.
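This re-identification risk is easy to demonstrate. A minimal sketch (the MD5-over-normalized-key convention is a common Data Vault choice, but the argument holds for any deterministic hash):

```python
import hashlib

def hash_key(business_key: str) -> str:
    # Data Vault-style hash key: hash of the normalized business key.
    return hashlib.md5(business_key.strip().upper().encode("utf-8")).hexdigest()

# The warehouse deletes the email column but keeps the hash key...
stored_hash = hash_key("nazeem@whiterun.gov")

# ...yet anyone who still holds the email can recompute the hash and
# re-link the supposedly deleted person. Pseudonymized, not anonymized.
assert hash_key("Nazeem@Whiterun.gov") == stored_hash
```

Because the lookup works in both directions for anyone holding the email, the retained hash remains personal data under GDPR.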
PII Satellite Isolation
Not all attributes are created equal. Some are personal data protected by law. Others are analytical data safe to keep forever. The trick: put them in different Satellites.
Isolation — from Latin insula: “island.” To isolate is to make something into an island — separated, surrounded, reachable only by deliberate crossing.
Why this word: PII Satellite isolation makes personal data into an island. It’s physically separated from the analytical data. When a deletion request arrives, you delete the island without touching the mainland.
Instead of putting all of a character’s details in one Satellite table, you split it: personal stuff (name, email, address) goes in one Satellite, analytical stuff (race, level, faction) goes in another. When someone says “delete my personal data,” you delete the personal Satellite and leave the analytical one untouched.
This is a specific case of the Satellite splitting you already know. In Day 2, you learned Satellites split by rate of change or by source system. Here’s a third reason: by data classification. PII attributes get their own Satellite. Both Satellites still connect to the same Hub via the same hash key. The split is physical, not logical.
The PII Satellite (Sat_Character_PII) contains attributes classified as personal data: name, email, address, date of birth. The non-PII Satellite (Sat_Character_Attributes) contains analytical attributes: race, level, faction, guild membership. Both share the same hub_character_hk. On deletion: DELETE FROM Sat_Character_PII WHERE hub_character_hk = [target]. The non-PII Satellite is untouched.
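The split-and-delete pattern can be sketched end to end. A minimal sketch using Python’s sqlite3, with table and column names taken from this section (the schema is simplified and omits some DV metadata columns):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sat_character_pii (
    hub_character_hk TEXT, real_name TEXT, email_address TEXT,
    load_date TEXT, record_source TEXT);
CREATE TABLE sat_character_attributes (
    hub_character_hk TEXT, race TEXT, level INTEGER, faction TEXT,
    load_date TEXT, record_source TEXT);
""")
con.execute("INSERT INTO sat_character_pii VALUES "
            "('a3f7', 'Nazeem', 'nazeem@whiterun.gov', '2026-01-01', 'Whiterun Census')")
con.execute("INSERT INTO sat_character_attributes VALUES "
            "('a3f7', 'Redguard', 10, 'None', '2026-01-01', 'Whiterun Census')")

# Article 17 request: delete only the PII island, leave the analytical mainland.
con.execute("DELETE FROM sat_character_pii WHERE hub_character_hk = ?", ("a3f7",))

pii_left = con.execute("SELECT COUNT(*) FROM sat_character_pii").fetchone()[0]
attrs_left = con.execute("SELECT COUNT(*) FROM sat_character_attributes").fetchone()[0]
```

After the delete, `pii_left` is 0 and `attrs_left` is 1: the anonymous Redguard stays available for aggregate analysis.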
The classification decision — which attributes are PII and which aren’t — is not purely technical. It requires legal judgment. “Race” might seem like analytical data, but under GDPR Article 9, racial/ethnic origin is special category data with even stricter protections. PII classification should be done with the Data Protection Officer, not by the data engineer alone. This is exactly where a BI consultant with legal background adds value.
The Skyrim Model — Before & After
Before — a single Satellite holding everything:

| Column | Example Value | Classification |
|---|---|---|
| hub_character_hk | a3f7... | — |
| real_name | Nazeem | PII |
| email_address | nazeem@whiterun.gov | PII |
| race | Redguard | Non-PII* |
| level | 10 | Non-PII |
| faction | None | Non-PII |

*Race looks analytical, but racial/ethnic origin is special category data under GDPR Art. 9; see the classification caveat above.
After — Sat_Character_PII (the deletable island):

| Column | Example Value |
|---|---|
| hub_character_hk | a3f7... |
| real_name | Nazeem |
| email_address | nazeem@whiterun.gov |
| home_address | Chillfurrow Farm, Whiterun |
| date_of_birth | 4E 175 |
| load_date | 2026-01-01 |
| record_source | Whiterun Census |
After — Sat_Character_Attributes (retained for analytics):

| Column | Example Value |
|---|---|
| hub_character_hk | a3f7... |
| race | Redguard |
| level | 10 |
| faction | None |
| guild_membership | None |
| load_date | 2026-01-01 |
| record_source | Whiterun Census |
Same Hub. Same hash key. Two Satellites instead of one. Deletion targets only the PII Satellite.
The Deletion Walkthrough
Scenario: Nazeem wants to be forgotten. He submits an Article 17 erasure request. Here’s what happens at each layer of the Data Vault.
Persistent Staging Area (PSA)
Challenge: PSA is a historical copy of source data — it exists for auditability. Deleting from it means losing the ability to replay loads.
Solution: Find all source records containing Nazeem’s data. Delete or overwrite with tombstone records. This is a deliberate tradeoff: compliance trumps replayability.
Raw Vault — Hub
Challenge: Hub_Character contains Nazeem’s business key. Is the business key PII?
If business key = “Nazeem” (a name) → YES, it’s PII → must address (see Artificial Hubs).
If business key = “CHAR_00742” (a system ID) → NO → Hub can remain.
Raw Vault — PII Satellite
Action: Delete all rows for Nazeem’s hash key from Sat_Character_PII, OR insert tombstone records.
After this: the hash key still exists in the Hub, but points to nothing identifiable.
Raw Vault — Non-PII Satellite
Action: KEEP. Sat_Character_Attributes contains race (Redguard), level (10), faction (None). These are not identifiable without the PII Satellite.
Analytical value preserved: “an anonymous Redguard, level 10, no faction” still contributes to aggregate analysis.
Raw Vault — Links
If Nazeem’s Hub entry remains (non-PII business key or Artificial Hub): Links can remain.
If Nazeem’s Hub entry is deleted: Links referencing it become orphaned → clean up.
Business Vault
Any derived calculations using Nazeem’s PII directly: recalculate without it. If only aggregated/anonymized data was used: may remain.
Marts: Pre-aggregated views including Nazeem at row level → rebuild without him. Aggregate summaries → usually fine.
Backups: Right to erasure extends to backups. Practical reality: deletion is deferred to the next backup rotation cycle, documented with a retention exception and timeline.
Key Insight: Data Lineage Is Compliance Infrastructure
The deletion walkthrough above requires knowing exactly where Nazeem’s data lives across all layers. This is data lineage. Without it, you can’t fulfill the erasure request.
Data Vault’s record_source and load_date on every row make lineage traceable by design — but you still need tooling (data catalog, lineage graphs) to operationalize it.
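Underneath any catalog or lineage tool sits a map from layers to the tables that can hold a given entity’s data. A minimal sketch with hypothetical table names (every name below is illustrative, not from the source):

```python
# Hypothetical lineage register: which tables, per layer, can hold an entity's data.
LINEAGE = {
    "psa":            ["psa_whiterun_census"],
    "raw_vault":      ["hub_character", "sat_character_pii", "sat_character_attributes"],
    "business_vault": ["bsat_character_scores"],
    "marts":          ["mart_character_summary"],
}

def erasure_targets(pii_tables: set[str]) -> list[str]:
    # Tables an Article 17 request must touch: every lineage entry flagged as PII.
    return [t for tables in LINEAGE.values() for t in tables if t in pii_tables]

targets = erasure_targets({"psa_whiterun_census", "sat_character_pii"})
```

The PII flag per table comes from the classification exercise done with the DPO; the lineage register only answers “where does this live?”, not “is it personal data?”.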
Artificial Hubs
PII Satellite isolation works perfectly — until the business key IS the personal data. If Nazeem’s Hub stores his email address as the business key, deleting the PII Satellite doesn’t help. The Hub itself is the problem.
Artificial — from Latin artificium: “made by human craft.” Ars (skill, craft) + facere (to make). Not fake or inferior — deliberately constructed for a purpose.
Why this word: An Artificial Hub uses a crafted, synthetic key instead of the natural business key. It’s artificial not because it’s fake, but because it was deliberately constructed to solve a specific problem: making the Hub non-identifiable.
Sometimes the business key itself is personal data (like an email). You can’t just split it into a Satellite — it’s the core identifier in the Hub. So you create an Artificial Hub that uses a random number as its business key instead. The real personal identifier lives in a Satellite that can be deleted.
The Artificial Hub replaces the PII business key with a randomly generated surrogate key (e.g., a UUID). The real personal identifier lives in a PII Satellite attached to the Artificial Hub. When a deletion request arrives: delete the PII Satellite, the Artificial Hub remains (it’s just a random number), and all analytical relationships remain intact.
Standard Hub: Hub_Customer(hub_customer_hk, email_address, load_date, record_source) — the email is the business key. Artificial Hub: Hub_Customer(hub_customer_hk, artificial_key, load_date, record_source) — the business key is a UUID like a8f3e2b1-.... The email lives in Sat_Customer_PII(hub_customer_hk, email_address, real_name, ...). On deletion: DELETE all rows from Sat_Customer_PII. The Hub retains a8f3e2b1-... — meaningless without the PII Satellite.
The hash key itself may still be problematic. If the original business key was an email, the hash key was derived FROM that email. Anyone with the original email and the hash algorithm could recalculate the hash and match it — meaning the hash constitutes personal data under GDPR. With an Artificial Hub, the hash key is derived from the random surrogate, not from PII, so this problem disappears. This is one of the strongest arguments for Artificial Hubs in GDPR-sensitive implementations.
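The difference between the two derivations can be sketched directly (MD5 and the dict-as-Satellite are illustrative simplifications):

```python
import hashlib
import uuid

def hash_key(key: str) -> str:
    return hashlib.md5(key.encode("utf-8")).hexdigest()

# Standard Hub: the hash key is derived FROM the email, so anyone holding the
# email can recompute it. The hash itself is personal data.
standard_hk = hash_key("nazeem@whiterun.gov")

# Artificial Hub: the business key is a random surrogate minted at first load.
artificial_key = str(uuid.uuid4())
artificial_hk = hash_key(artificial_key)

# The email lives only in the deletable PII Satellite; that row is the sole
# bridge between surrogate and person.
pii_satellite = {artificial_hk: {"email_address": "nazeem@whiterun.gov"}}
del pii_satellite[artificial_hk]   # Article 17: delete the bridge

# After deletion, nothing derivable from the email matches the Hub's hash key.
assert hash_key("nazeem@whiterun.gov") != artificial_hk
```

The surrogate must be random, not derived: a surrogate computed from the email would just move the re-identification problem one step sideways.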
Standard Hub vs. Artificial Hub
Standard Hub:

| Column | Value | Issue |
|---|---|---|
| hub_character_hk | hash(“Nazeem”) | Derived from PII |
| business_key | Nazeem | IS personal data |
| load_date | 2026-01-01 | — |
| record_source | Whiterun Census | — |
Artificial Hub:

| Column | Value | Status |
|---|---|---|
| hub_character_hk | hash(“ART-00742”) | Not derived from PII |
| business_key | ART-00742 | Not personal data |
| load_date | 2026-01-01 | — |
| record_source | Whiterun Census | — |
Tombstone Records
Tombstone — literally, a gravestone. A marker that says: “something was here. It is gone now.”
Why this word: A tombstone record replaces deleted data with a marker that says “this data existed and was deleted on this date, for this reason.” It preserves the audit trail without preserving the personal data.
Instead of just deleting a row and leaving nothing, you replace it with a special row that says “deleted.” That way the audit trail shows the deletion happened — you can prove you complied with the request.
A tombstone record contains the hash key (to identify which entity), a deletion timestamp, the reason for deletion (e.g., “Article 17 request #2847”), and NULL or placeholder values where the PII used to be. The row itself contains no personal data — it’s just a marker.
In Data Vault’s append-only model, a DELETE is philosophically uncomfortable — you’re destroying history. Tombstones resolve this: you INSERT a new Satellite row with the same hash key, a current load_date, NULLs for all PII columns, and a flag indicating “GDPR_ERASED.” The previous PII rows are then physically deleted. The tombstone remains as the audit trail.
The tombstone itself must not contain PII. “Deleted per request from nazeem@whiterun.gov” defeats the purpose. The deletion reason should reference a request ID, not the requester’s identity. The request ID maps to a separate, access-controlled deletion log maintained by the DPO’s office — not stored in the data warehouse itself.
The tombstone row in Sat_Character_PII:

| Column | Value |
|---|---|
| hub_character_hk | a3f7... |
| real_name | NULL |
| email_address | NULL |
| home_address | NULL |
| date_of_birth | NULL |
| load_date | 2026-02-21 |
| record_source | GDPR_ERASED |
| deletion_request_id | ART17-2847 |
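The insert-tombstone-then-delete flow can be sketched with sqlite3 (schema simplified to the columns shown in the table above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE sat_character_pii (
    hub_character_hk TEXT, real_name TEXT, email_address TEXT,
    load_date TEXT, record_source TEXT, deletion_request_id TEXT)""")
con.execute("INSERT INTO sat_character_pii VALUES "
            "('a3f7', 'Nazeem', 'nazeem@whiterun.gov', '2026-01-01', 'Whiterun Census', NULL)")

# Step 1: INSERT the tombstone — same hash key, NULL PII, request ID only.
con.execute("""INSERT INTO sat_character_pii
    VALUES ('a3f7', NULL, NULL, '2026-02-21', 'GDPR_ERASED', 'ART17-2847')""")

# Step 2: physically DELETE the earlier PII rows, keeping only the tombstone.
con.execute("""DELETE FROM sat_character_pii
    WHERE hub_character_hk = 'a3f7' AND record_source != 'GDPR_ERASED'""")

rows = con.execute("SELECT * FROM sat_character_pii").fetchall()
```

What survives is a single row proving a deletion happened, when, and under which request ID, with no personal data left to leak.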
EU AI Act — Article 10
GDPR tells you what to delete. The AI Act tells you what to document. They’re designed to be complementary — but in practice, they create compounding obligations that only a governed data architecture can satisfy.
Governance — from Greek kybernan: “to steer a ship.” Same root as cybernetics (the study of steering/control systems) and governor (the one who steers).
Why this word: Data governance is steering your data — deciding where it goes, who can use it, how it’s documented, and what happens when things go wrong. Without governance, AI on data is a ship without a rudder.
The EU AI Act is a new law that regulates AI systems based on how risky they are. High-risk AI (like credit scoring or hiring tools) must prove their training data is high-quality, unbiased, and well-documented. Article 10 is the part that says how.
The AI Act classifies AI into four risk tiers: prohibited (banned), high-risk (heavily regulated), limited risk (transparency required), minimal risk (no obligations). Article 10 applies to high-risk systems and requires documented data provenance, transformation documentation, quality verification, and bias examination.
Article 10(2) requires: documented design choices for data selection, full data lineage including original collection purpose (DV’s record_source + load_date), documentation of all preparation operations (staging + Business Vault transformations), assessment of data quantity and suitability (pre-AI Mart validation), and bias examination for impacts on health, safety, and fundamental rights.
Article 10(5) creates a limited exception to GDPR’s prohibition on processing special category data (race, health, religion). Providers of high-risk AI may process this data specifically for bias detection and correction — but only when anonymized/synthetic data won’t work, with strict safeguards, no third-party transfers, and mandatory deletion after bias correction. This is a direct bridge between GDPR and the AI Act.
AI Act Requirements → Data Vault Features
| AI Act Requirement | Data Vault Feature |
|---|---|
| Data provenance (Art. 10(2)(aa)) | record_source and load_date on every row |
| Transformation documentation (Art. 10(2)(c)) | Staging → Raw Vault → Business Vault: each layer is distinct and auditable |
| Quality verification (Art. 10(3)) | Business Vault validation rules + Feature Mart quality gates |
| Bias examination (Art. 10(2)(f)) | Profiling queries on Business Vault — demographic breakdowns |
| Immutability for audit (Arts. 11-12) | Append-only Satellite design — nothing overwritten |
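The “bias examination” row above is ultimately a set of profiling queries. A minimal sketch of a demographic breakdown against a Business Vault Satellite (table name from this chapter; the sample rows are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sat_character_attributes "
            "(hub_character_hk TEXT, race TEXT, level INTEGER)")
con.executemany("INSERT INTO sat_character_attributes VALUES (?, ?, ?)", [
    ("a3f7", "Redguard", 10), ("b1c2", "Nord", 25),
    ("c9d4", "Nord", 30),     ("d2e8", "Khajiit", 5),
])

# Group counts reveal representation skew in the would-be training population,
# which is the raw input to an Art. 10(2)(f) bias examination.
breakdown = dict(con.execute(
    "SELECT race, COUNT(*) FROM sat_character_attributes GROUP BY race"
).fetchall())
```

Spotting skew is the easy half; deciding whether it matters for health, safety, or fundamental rights is a judgment the DPO and AI provider make together.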
Penalty Comparison
| Regulation | Maximum Fine | % of Turnover |
|---|---|---|
| GDPR (most serious violations, Art. 83(5)) | EUR 20,000,000 | or 4% |
| AI Act (high-risk non-compliance) | EUR 15,000,000 | or 3% |
| AI Act (prohibited practices) | EUR 35,000,000 | or 7% |
The AI Act’s top tier is nearly double GDPR’s. Combined violations compound.
Data Mesh Governance
Four Principles of Data Mesh
1. Domain Ownership — Each business domain (sales, logistics, finance) owns its data end-to-end.
2. Data as a Product — Each domain treats its data outputs with the same rigor as customer-facing products.
3. Self-Serve Data Infrastructure — A central platform provides tools and templates so domains can build independently.
4. Federated Computational Governance — Governance policies defined centrally, executed locally by each domain.
How Data Vault Supports Data Mesh
Raw Vault = centralized — single integration layer across all domains, cross-domain consistency.
Business Vault = decentralized — each domain applies its own business rules.
Built-in governance = federated governance infrastructure — record_source, load_date, immutability provide the audit infrastructure that federated governance relies on.
ScaleFree has presented on this at conferences. Understanding Data Mesh means being able to define it, state the four principles, and explain how Data Vault supports it.
The Consultant’s Day
Walking through a real compliance scenario
The Scenario
You’re a BI consultant at ScaleFree. A German financial services client calls: a customer has submitted a DSAR and a deletion request under Article 17. The client’s data warehouse is built on Data Vault 2.0.
Understand the Request
Two separate obligations:
DSAR (Art. 15): Show the customer all data held about them. Must respond within one month (extendable by two further months for complex requests, Art. 12(3)).
Deletion request (Art. 17): Delete all personal data. Also one month. The DPO checks for exceptions (legal retention obligations, ongoing disputes, public interest).
Your role as BI consultant: You don’t handle the legal side — the DPO does. You handle the architectural side: finding the data, executing the deletion, documenting it.
Map the Data
Use the data catalog and lineage tools to find every location of this customer’s data:
| Layer | Contains PII? | Action |
|---|---|---|
| Persistent Staging Area | Yes | Delete / tombstone |
| Raw Vault — Hub | Depends on BK | Keep if non-PII key; Artificial Hub if PII |
| Raw Vault — PII Satellite | Yes | Delete + tombstone |
| Raw Vault — Non-PII Sat | No | Keep |
| Raw Vault — Links | Depends | Keep if Hub remains |
| Business Vault | Possibly | Review and recalculate |
| Information Marts | Possibly | Rebuild affected views |
| Backups | Yes | Mark for next rotation |
Execute the DSAR
Query the vault to find all data about this customer. Follow every hash key path:
Hub_Customer → Sat_Customer_PII (name, email, address)
Hub_Customer → Sat_Customer_Attributes (analytical data)
Hub_Customer → Link_Customer_Account → Hub_Account
Compile the results. Send to DPO for review and delivery to the customer.
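The three hash-key paths above translate into one query. A sketch with sqlite3, where the schemas are simplified and the `segment` column is an invented stand-in for the analytical attributes:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (hub_customer_hk TEXT);
CREATE TABLE sat_customer_pii (hub_customer_hk TEXT, real_name TEXT, email_address TEXT);
CREATE TABLE sat_customer_attributes (hub_customer_hk TEXT, segment TEXT);
CREATE TABLE link_customer_account (hub_customer_hk TEXT, hub_account_hk TEXT);
INSERT INTO hub_customer VALUES ('a3f7');
INSERT INTO sat_customer_pii VALUES ('a3f7', 'Nazeem', 'nazeem@whiterun.gov');
INSERT INTO sat_customer_attributes VALUES ('a3f7', 'retail');
INSERT INTO link_customer_account VALUES ('a3f7', 'acc9');
""")

# DSAR: follow every path that starts at the customer's hash key.
dsar = con.execute("""
    SELECT p.real_name, p.email_address, a.segment, l.hub_account_hk
    FROM hub_customer h
    LEFT JOIN sat_customer_pii p        ON p.hub_customer_hk = h.hub_customer_hk
    LEFT JOIN sat_customer_attributes a ON a.hub_customer_hk = h.hub_customer_hk
    LEFT JOIN link_customer_account l   ON l.hub_customer_hk = h.hub_customer_hk
    WHERE h.hub_customer_hk = ?
""", ("a3f7",)).fetchall()
```

LEFT JOINs matter here: a DSAR must report what exists even when some Satellites hold no rows for this customer.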
Execute the Deletion
1. Run the deletion procedure (automated via datavault4dbt if configured, manual otherwise).
2. Log every action: what was deleted, from which table, at what timestamp, by whom, under which request ID.
3. The deletion log itself is a new audit record — you need to prove you deleted the data.
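The shape of such a log entry can be sketched as a small record type (field names are illustrative, not a prescribed schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class DeletionLogEntry:
    """One audit record per deleted span: proof of compliance without PII."""
    request_id: str    # e.g. 'ART17-2847'; maps to the DPO's access-controlled log
    table_name: str
    rows_deleted: int
    deleted_at: str    # ISO-8601 UTC timestamp
    executed_by: str   # service account or operator ID, never the data subject

log = [
    DeletionLogEntry("ART17-2847", "sat_customer_pii", 3,
                     datetime.now(timezone.utc).isoformat(), "svc_gdpr_runner"),
    DeletionLogEntry("ART17-2847", "psa_customer", 12,
                     datetime.now(timezone.utc).isoformat(), "svc_gdpr_runner"),
]
# Note what is absent: no name, no email. The request ID is the only link
# back to the data subject, and resolving it requires the DPO's log.
```

One entry per table keeps the log granular enough to answer “did you really delete it everywhere?” layer by layer.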
Confirm and Document
DPO confirms completion to the data subject.
Deletion record stored (without the deleted PII) for accountability.
If data was shared with third parties: notify them (Art. 17(2)).
The Collaboration Points
| Phase | Consultant’s Role | DPO’s Role |
|---|---|---|
| Design time | Design Satellite splitting, Artificial Hub pattern | Review and approve PII classification |
| Build time | Implement deletion procedures, lineage tracking | Validate all PII locations are covered |
| Runtime | Architecture executes the deletion | Triggers the request, confirms to data subject |
Key Concepts
“What makes this genuinely interesting is that GDPR was written for transactional systems where you find a row and delete it. Data warehouses are designed to never delete anything — that’s the whole point. Resolving that tension isn’t just a legal problem or a technical problem, it’s both at once.”
“The standard approach is to design your Satellites so PII attributes are physically separated from analytical data. When a deletion request comes in, you delete the PII Satellite rows and leave the analytical value intact. The Hub and its hash key can remain because they no longer point to identifiable data.”
“When the business key itself is personal data — like an email address — you use an Artificial Hub with a synthetic surrogate key. The real personal identifier lives in a deletable Satellite. That way the Hub can survive deletion without breaking referential integrity.”
“Data Vault wasn’t designed for GDPR, but its architecture is structurally well-suited for compliance. Every row has a load timestamp and record source — that’s your audit trail built in. Nothing gets overwritten — that’s your accountability. The layered architecture means you can track data lineage from source to mart.”
“The EU AI Act requires documented data provenance, transformation audit trails, and bias examination for high-risk AI systems. Data Vault provides most of this by design — record source, load date, and the separation between Raw Vault and Business Vault create a traceable transformation chain. The Feature Mart becomes the compliance enforcement point.”