Data Vault Foundations · Chapter 2

Data Vault Deep Cuts

Hash keys, the three architectures, and when to split a Satellite. Everything that makes Data Vault 2.0 industrial-strength.

5 sections · 4 key concepts · 1 exercise · Builds on Chapter 1
Section 1

Hash Keys

You don't decrypt a hash key. There's nothing to decrypt. It was never encrypted. So what's the point of it?

Language Bridge

Hash comes from the French hacher — to chop up. Same root as hachis parmentier. A hash function chops the input and scrambles it into a fixed-length output. The word was chosen because the process literally chops up information and mixes it together — like mincing meat into a hash.

In Data Vault, "key" means the same as in everyday life: the thing that uniquely opens one specific door. A hash key is a chopped-up identifier that unlocks one specific row.

4-Layer Explanation
Simple · For anyone, no background needed

A hash key is like a barcode on a product. Scan the same product information twice, you get the same barcode. It's generated from the data, it's faster for computers to work with than reading the full label, and every identical product gets the exact same barcode. But it's not a lock — anyone who knows the system can reproduce it.

Mechanism · Adds why it works this way

It's generated by a formula called SHA-256 that always produces the same output for the same input. So two different computers loading the same customer — say "FR-8834" — get the exact same key, without ever talking to each other. No central counter needed. No "what number is next?" bottleneck. That's why Data Vault can load data in parallel from multiple sources simultaneously.

Technical · Correct terminology, full detail

SHA-256 is a cryptographic hash function that maps any input to a 256-bit fixed-length string, deterministically — same input, same output, always. The three properties that matter: deterministic (reproducible on any machine), collision-resistant (it is computationally infeasible to find two different inputs that produce the same output), and fixed-length (whether you hash "FR-8834" or the entire text of the Lusty Argonian Maid, the output is always 64 hex characters). This enables parallel loading across distributed systems without a central sequence generator.
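
To make the three properties concrete, here is a minimal Python sketch using the standard library's hashlib. The normalization step (trim and upper-case) is a common convention shown for illustration, not a fixed rule — real implementations define their own normalization.

```python
import hashlib

def hash_key(business_key: str) -> str:
    """Derive a deterministic hash key from a business key (sketch)."""
    # Normalize before hashing so trivial formatting differences
    # ("fr-8834" vs " FR-8834 ") don't produce different keys.
    normalized = business_key.strip().upper()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

# Deterministic: same input, same output, on any machine.
assert hash_key("FR-8834") == hash_key("FR-8834")

# Fixed-length: short or long input, the digest is always 64 hex characters.
assert len(hash_key("FR-8834")) == 64
assert len(hash_key("x" * 1_000_000)) == 64
```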

Expert · Edge cases and nuance

Hash keys in Data Vault are not used for security. The business key is stored in plaintext right next to the hash in the Hub, and anyone with the algorithm and input can reproduce it. They exist purely for consistency and join performance, not obfuscation. The Data Vault community emphasizes: do not rely on hash keys for security. Also notable: the choice of SHA-256 specifically is a convention, not a requirement — MD5 was common in DV 1.0, but SHA-256 is preferred in 2.0 for its lower collision probability at scale.

Why not just count 1, 2, 3?

Three problems that sequence numbers create — and hash keys solve:

Problem 1: Load Order Dependency

The Companions guild hall in Whiterun. You're loading warrior data from two sources. With sequence numbers, whoever loads first gets ID 1. Load the game DB first? Vilkas is #1. Load the wiki first? Farkas is #1. Identity changes based on something completely irrelevant — the order you pressed the button.

A hash of "COMPANION_VILKAS" is always the same hash. Load order doesn't matter. Identity is stable.

Problem 2: Cross-Source Collisions

Your French CRM has a customer #1. Your German ERP also has a customer #1. Different people, same number. Sequence numbers from different systems crash into each other.

But HASH("FR-CRM-0001") and HASH("DE-ERP-0001") produce completely different keys. Collision solved.

Problem 3: The Bottleneck

Sequence numbers need a central counter — one system asking "what's next?" before every insert. That's a queue. One thing at a time.

A hash is calculated independently, on any machine, with zero coordination. Ten machines hash ten business keys simultaneously. This is what makes Data Vault scalable.
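
A small simulation of that claim in Python: ten worker threads each derive a key with no shared state, and the result is identical to doing it one at a time. The thread pool here merely stands in for ten independent loading machines.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def hash_key(business_key: str) -> str:
    return hashlib.sha256(business_key.encode("utf-8")).hexdigest()

keys = [f"COMPANION_{i}" for i in range(10)]

# Ten "machines" (threads) hash ten business keys at once,
# with no shared counter and no coordination between them.
with ThreadPoolExecutor(max_workers=10) as pool:
    parallel = list(pool.map(hash_key, keys))

# Whatever the scheduling order, every key's identity is stable:
serial = [hash_key(k) for k in keys]
assert parallel == serial
```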

Common Misconception

Hash keys are not a security mechanism. In Data Vault, you know the algorithm (SHA-256), you know the input (the business key), and you store both right next to each other in the Hub. There's nothing to "decrypt" — it was never encrypted. Hash keys are a consistency and performance mechanism.

Key Concept

"A hash key gives you the same identifier regardless of when or where you load the data. That's what makes parallel loading and multi-source integration possible — no central counter, no collisions, no load-order dependency."

Section 2

Business Keys

The hash key is the barcode. The business key is what's actually printed on the label. One is for machines. The other is for humans. The Hub keeps both.

Language Bridge

Straightforward for once. A business key is the key that the business uses — the real-world identifier that exists before any database gets involved. Customer number, product SKU, employee badge ID. If a customer calls the helpline and says "my account number is FR-8834," that's the business key. They'll never say "my hash key is a3f8b72e."

4-Layer Explanation
Simple · For anyone, no background needed

A business key is the real name that the real world uses for something. A customer number. An invoice number. A product code. It's what people actually say and use, not what a computer generated.

Mechanism · Adds why it works this way

The business key is the source of truth for identity. The hash key is just a technical optimization built on top of the business key. The Hub stores both side by side: the hash key for fast computer operations (joins, lookups), and the business key so humans can still read and trace what's in the vault.

Technical · Correct terminology, full detail

Business keys originate from source systems, not from the data warehouse. They're the anchor of identity in Data Vault. A Hub row is defined by its business key — the hash key is deterministically derived from it. A Hub has exactly four column types: Hash Key (PK), Business Key (unique, not null), Load Date (first appearance), and Record Source (which system sent it). No descriptive attributes. Ever.

Expert · Edge cases and nuance

Composite business keys handle the case where different sources use the same identifier for different entities. If French CRM and German ERP both have customer "0001" (different people), the composite business key becomes source_system + customer_id. The hash is computed on the composite. This is also why Record Source matters: it disambiguates origin. In practice, defining the correct business key is one of the hardest design decisions in Data Vault modeling — get it wrong, and entities merge or split incorrectly.
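
A sketch of the composite-key case in Python. The `||` delimiter and the normalization are assumptions for illustration; what matters is that the delimiter is unambiguous and applied consistently everywhere.

```python
import hashlib

DELIM = "||"  # delimiter choice is a convention: pick one and never change it

def composite_hash_key(*parts: str) -> str:
    """Hash a composite business key, e.g. source system + customer id."""
    # A delimiter prevents ("FR-CRM", "0001") and ("FR-CRM0", "001")
    # from concatenating to the same string before hashing.
    joined = DELIM.join(p.strip().upper() for p in parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

# Same local id "0001" from two sources: two different identities.
assert composite_hash_key("FR-CRM", "0001") != composite_hash_key("DE-ERP", "0001")
```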

The Complete Hub: Four Columns, Nothing More

Hub_Character — Skyrim Example
| Hash Key (PK) | Business Key | Load Date  | Record Source |
|---------------|--------------|------------|---------------|
| a3f8b7...     | CHAR_ULFRIC  | 2026-02-19 | skyrim_npc_db |
| c91e2d...     | CHAR_LYDIA   | 2026-02-19 | skyrim_npc_db |
| f4a0bc...     | CHAR_NAZEEM  | 2026-02-19 | skyrim_npc_db |

Hash key for fast joins. Business key for human readability. Load Date for when this entity first entered the vault. Record Source for where it came from. That's it — no name, no race, no city. Those live in Satellites.
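
The four-column shape can be expressed as a small data structure. A hedged Python sketch: the dataclass and function names are illustrative, not part of any Data Vault standard.

```python
import hashlib
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class HubRow:
    """A Hub row: exactly four columns, no descriptive attributes."""
    hash_key: str       # PK, deterministically derived from the business key
    business_key: str   # real-world identifier, unique, not null
    load_date: date     # when this entity first entered the vault
    record_source: str  # which system sent it

def make_hub_row(business_key: str, record_source: str) -> HubRow:
    hk = hashlib.sha256(business_key.encode("utf-8")).hexdigest()
    return HubRow(hk, business_key, date.today(), record_source)

row = make_hub_row("CHAR_ULFRIC", "skyrim_npc_db")
```

Note what is absent: no name, no race, no city. Descriptive attributes belong in Satellites.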

Section 3

Three Schools of Thought

Data Vault isn't the only way to build a data warehouse. There are three major approaches — and the surprising thing is, they're not rivals. They're layers.

4-Layer Explanation
Simple · For anyone, no background needed

Think of three ways to organize a massive collection of documents. The library (3NF) files everything in exactly one place, cross-referenced by catalog cards — no duplicates, but finding stuff requires following lots of references. The shopping mall directory (Star Schema) organizes everything by what you want to do — fast to navigate, but information is duplicated across directories. The legal archive (Data Vault) files by entity, cross-references every relationship, and dates every change — nothing is ever erased.

Mechanism · What each optimizes for

3NF (Bill Inmon): Maximum normalization. Every fact stored once. Lots of tables, lots of joins. Optimizes for data integrity. Star Schema (Ralph Kimball): Denormalized for speed. One big fact table surrounded by dimension tables. Optimizes for query performance. Data Vault (Dan Linstedt): Separates identity, relationships, and context. Optimizes for flexibility, auditability, and handling change over time.

Technical · Full comparison with trade-offs

3NF excels at integrity but struggles with adding new sources (schema changes ripple). Star Schema excels at query speed but struggles with history (slowly changing dimensions are complex) and new sources (fact table redesign). Data Vault excels at absorbing new sources (just add Hubs/Links/Sats) and tracking history (new Satellite rows, timestamped) but requires a star schema reporting layer on top because raw vault queries involve many joins.

Expert · How they stack in practice

In a modern enterprise, you often use all three: 3NF in the source systems (OLTP), Data Vault as the resilient middle layer (absorbs everything, tracks everything), and star schemas as the Information Marts on top (fast dashboards). Data Vault doesn't replace star schema — it feeds it. ScaleFree builds that middle layer. That's the entire business model.

3NF — Third Normal Form

Bill Inmon · "The Father of Data Warehousing"

The librarian. Every fact stored exactly once. Maximum normalization. Many tables, many joins.

Pro: No redundancy, strong integrity. Con: Slow queries (many JOINs), painful to add new sources.

Star Schema

Ralph Kimball · "The Dimensional Modeler"

The shopping mall directory. One big fact table in the center, dimension tables around it. Pre-joined, denormalized for speed.

Pro: Fast queries, easy for dashboards. Con: Redundant data, hard to restructure, doesn't handle history well.

Data Vault 2.0

Dan Linstedt · Extended by Michael Olschimke

The legal archive. Separates identity (Hubs), relationships (Links), and context (Satellites). Everything timestamped. Nothing overwritten.

Pro: Absorbs any source, tracks any change, full auditability, parallel loading. Con: More tables, needs star schema marts on top.

|                | 3NF (Inmon)   | Star (Kimball)    | Data Vault 2.0     |
|----------------|---------------|-------------------|--------------------|
| New sources?   | Painful       | Painful           | Easy (add H/L/S)   |
| History?       | Overwrites    | SCD (complex)     | Natural (new rows) |
| Query speed?   | Slow          | Fast (pre-joined) | Needs marts on top |
| Auditability?  | Limited       | Limited           | Complete           |
| Parallel load? | Difficult     | Difficult         | Built for it       |
| Typical layer  | EDW / sources | Marts / reporting | Core warehouse     |
The Enterprise Analogy — Data Vault → Star Schema → Dashboard

The Enterprise's computer core stores everything — every sensor reading, every crew log, every system status. Nothing deleted. Full history. That's Data Vault.

When Picard needs a tactical decision, he doesn't query the raw core. He says: "Computer, display all Borg incursion events in Sector 001, organized by stardate and ship class." The display appears instantly — because the computer has a pre-built structure for common tactical queries. Not searching the entire core each time. Pre-organized for the most common questions.

That pre-built structure = Star Schema. The LCARS screen Picard looks at = Dashboard.

Data Vault = raw computer core (complete, nothing lost)  •  Star Schema = pre-built tactical query structure (organized for speed on common questions)  •  Dashboard = the LCARS display on the bridge

Key Concept

"Data Vault isn't a replacement for star schema — it's the resilient middle layer. You still build star schemas on top for reporting. But the vault underneath means you can absorb any source, track any change, and audit anything."

Section 4

What 2.0 Added

Data Vault 1.0 was the blueprint. Data Vault 2.0 made it industrial. The difference? Automation, speed, and a way to detect change with a single comparison.

4-Layer Explanation
Simple · For anyone, no background needed

Version 1.0 invented the filing system (Hubs, Links, Satellites). Version 2.0 added better labels (hash keys so any office can file independently), a quick change-detector (hash diffs so you can tell if anything changed without reading every field), and a set of standard procedures so machines can do most of the filing automatically.

Mechanism · What each addition solves

Hash keys replaced sequence numbers — enabling parallel loading from multiple sources (Section 1 of this tutorial). Hash diffs on Satellites let you detect changes in one comparison instead of checking every column individually. Standardized loading patterns turned hand-crafted ETL into repeatable recipes that tools can automate. That's where ScaleFree's datavault4dbt comes in.

Technical · The five additions, formally

1. Hash keys — deterministic, parallel-safe primary keys. 2. Hash diffs on Satellites — a single hash of all attribute values for fast change detection. 3. Standardized loading patterns — formal, repeatable ETL recipes (the foundation of automation). 4. Big Data / NoSQL support — the methodology extended beyond relational databases. 5. Automation-first design — DV 2.0 is designed to be generated by tools, not hand-coded.

Expert · ScaleFree's role in 2.0

DV 2.0 was co-formalized by Dan Linstedt and Michael Olschimke (ScaleFree's CEO). The Data Vault 2.0 book and the ScaleFree training program are the canonical sources. ScaleFree's toolchain — datavault4dbt (macro library), TurboVault4dbt (model generator), and datavault4coalesce (visual interface) — is the direct implementation of DV 2.0's automation-first principle. When they say "automation," they mean their own tools.

Hash Diffs: Change Detection in One Step

Language Bridge

Diff from "difference." A hash diff is a hash of all attribute values in a Satellite row. It's a fingerprint of "everything in this row." When new data arrives, you compare one hash instead of checking every column individually. Same hash = nothing changed. Different hash = something changed, insert a new row.

Hash Diff in Action — Ulfric's Satellite

Initial load: Sat_Character_Details has: name=Ulfric, race=Nord, faction=Stormcloaks, city=Windhelm

Hash diff = HASH("Ulfric|Nord|Stormcloaks|Windhelm") → e7c3a1...

Next data load: Same values arrive. Hash diff = e7c3a1.... Matches. Skip — nothing to do.

Civil war ends: New data: name=Ulfric, race=Nord, faction=None, city=Sovngarde

New hash diff = HASH("Ulfric|Nord|None|Sovngarde") → 9b2f7d... Different. Insert new row.

Both rows preserved. Full history. One hash comparison instead of four column comparisons.
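
The whole walkthrough fits in a few lines of Python. A sketch only: the `|` delimiter is an assumed convention, and the in-memory list stands in for a real Satellite table.

```python
import hashlib

def hash_diff(*attributes: str) -> str:
    """Fingerprint of all attribute values in a Satellite row (sketch)."""
    return hashlib.sha256("|".join(attributes).encode("utf-8")).hexdigest()

satellite = []  # list of (hash_diff, attributes) rows, newest last

def load(attributes: tuple) -> bool:
    """Insert a new row only if something actually changed."""
    hd = hash_diff(*attributes)
    if satellite and satellite[-1][0] == hd:
        return False            # same fingerprint: skip, nothing to do
    satellite.append((hd, attributes))
    return True                 # new fingerprint: insert a new row

assert load(("Ulfric", "Nord", "Stormcloaks", "Windhelm")) is True
assert load(("Ulfric", "Nord", "Stormcloaks", "Windhelm")) is False  # unchanged
assert load(("Ulfric", "Nord", "None", "Sovngarde")) is True         # civil war ends
assert len(satellite) == 2      # both versions kept: full history
```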

DV 1.0 — Early 2000s

Core architecture. Hubs, Links, Satellites. Business keys. Load dates. Sequence-based keys. Manual loading.

DV 2.0 — ~2013

Hash keys. Hash diffs. Standardized patterns. NoSQL support. Automation-first. Co-formalized by Linstedt & Olschimke. This is what ScaleFree teaches and builds.

Key Concept

"DV 2.0 made the methodology industrial-strength — hash keys for parallel loading, hash diffs for change detection, and standardized patterns that tools like datavault4dbt can automate."

Section 5

Splitting Satellites

One Hub, multiple filing cabinets. Why would you keep separate files about the same person?

4-Layer Explanation
Simple · For anyone, no background needed

Sometimes you need separate filing cabinets for the same person. One cabinet for info from the hospital, another for info from the insurance company — because they have different levels of trust. Or one cabinet for things that rarely change (your name, date of birth) and another for things that change all the time (your address, your phone number) — so you don't have to refile everything just because you moved.

Mechanism · The two reasons to split

Reason 1: Different sources. Each source has its own reliability, update frequency, and data quality. Mixing them in one table means you can't tell where a fact came from. Reason 2: Different rates of change. If location changes daily but race never changes, a single Satellite rewrites the entire row (including unchanged race) every time location updates. Splitting means you only write what actually changed.

Technical · The splitting rule

Group attributes that share the same source and the same rate of change. Each resulting Satellite has its own hash diff, its own load frequency, and its own record source tracking. The foreign key to the Hub (the Hub's hash key) is what connects all Satellites back to the same entity. Multiple Satellites, one identity.

Expert · The balancing act

Over-splitting creates join complexity — if you need to reconstruct a full picture, you're joining many Satellites. Under-splitting creates write amplification — rewriting unchanged data wastes storage and processing. The balance is pragmatic: split when there's a clear difference in source or volatility, not for every individual attribute. In practice, 2–4 Satellites per Hub is common. 10 would be a red flag.

Reason 1: Different Sources

Hub_Character
- Sat_Character_Details_GameDB: name, race, class, base_health (Source: Skyrim NPC database)
- Sat_Character_Details_Wiki: lore_description, first_appearance, voice_actor (Source: Fan wiki)

Both Satellites attach to the same Hub. Both keyed by the same hash key. But they're separate tables because they come from different sources with different trust levels and update frequencies.

Reason 2: Different Rates of Change

Hub_Character
- Sat_Character_Identity: name, race, gender (Changes: almost never)
- Sat_Character_Location: current_city, current_hold, current_building (Changes: constantly)

If Lydia moves from Whiterun to Solitude, you only insert a new row in Sat_Character_Location. Sat_Character_Identity is untouched. No redundant rewrites.
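
The same scenario as a Python sketch, with two in-memory lists standing in for the two Satellite tables. The tuple layout (hash key, load date, attributes) is illustrative only.

```python
import hashlib

def hash_key(business_key: str) -> str:
    return hashlib.sha256(business_key.encode("utf-8")).hexdigest()

hk = hash_key("CHAR_LYDIA")

# Two Satellites, one identity: both keyed by the same Hub hash key.
sat_identity = [(hk, "2026-02-19", ("Lydia", "Nord", "female"))]
sat_location = [(hk, "2026-02-19", ("Whiterun", "Whiterun Hold", "Dragonsreach"))]

# Lydia moves: only the volatile Satellite gets a new row.
sat_location.append((hk, "2026-02-20", ("Solitude", "Haafingar", "Proudspire Manor")))

assert len(sat_identity) == 1   # untouched: no redundant rewrite
assert len(sat_location) == 2   # full location history preserved
assert {r[0] for r in sat_identity + sat_location} == {hk}  # same entity
```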

This connects back to Chapter 1's Aristotelian framework. Satellites hold accidents — properties that describe but aren't essential to identity. Some accidents are nearly permanent (race). Others are fleeting (location). Splitting Satellites is Data Vault's way of saying: not all accidents are equal.

Key Concept

"You split Satellites when the source or rate of change differs — it's about isolating what changes from what doesn't, so you only write what's actually new."

Exercise — Upgrade Your Skyrim Model

Take your model from Chapter 1 and level it up

  1. Add the four metadata columns to Hub_Character: Hash Key, Business Key, Load Date, Record Source. What would you use as the business key for a Skyrim character?
  2. Split one Satellite into two and explain why — is it different sources, or different rates of change?
  3. In one sentence: why does Data Vault use hash keys instead of sequence numbers?

You already know the answers — this is just about saying them out loud. We'll check them together.