Chapter 4: SQL JOINs — Data Vault Foundations

Section 1

You Already Found the Problem

In the last session, you queried Sat_Character_Details and got race, level, faction — but not the character’s name. Because names live in Hub_Character, and you were looking at one table. JOIN is how you look at two tables at once.

That moment in Day 3 — when you realized you couldn’t get names from the Satellite alone — is exactly the problem JOINs solve. You separated data into Hubs and Satellites for good reasons (no repetition, parallel loading, auditability). JOIN is the tool that reads it back together.

Key Concept

“A JOIN combines data from two tables based on a shared key. In Data Vault, that’s typically the hash key — the Satellite has it as a foreign key pointing back to the Hub. The JOIN follows that link to combine identity with attributes.”

Section 2

Why the Word “JOIN”

Language Bridge

JOIN — from Latin jungere: “to yoke, to bind, to connect.”

Same root as junction (where roads connect), conjunction (in grammar: “and,” “but” — words that connect clauses), and in logic: the AND operator (∧).

This is conjunction from symbolic logic. P ∧ Q produces a result only when both P and Q are true. A SQL JOIN produces a result row only when both tables have a matching value. Same structural shape: “give me something only when BOTH sides match.”

4-Layer Explanation — What a JOIN Is

L1 Plain language — for anyone ↓

A JOIN combines rows from two tables based on a value they share. Like looking up a character’s name on one sheet and their stats on another, then reading both together by matching the character ID.

L2 With mechanism — how it works ↓

The shared value is a key — specifically, a foreign key in one table that points to a primary key in another. The JOIN reads both tables and pairs up rows where the key matches. Rows without a partner get excluded (in the default INNER JOIN).

L3 Technical — full detail ↓

A JOIN is a set operation that produces the Cartesian product of two relations, filtered by a join predicate (the ON clause). INNER JOIN returns only rows where the predicate is TRUE. The query optimizer decides the physical execution strategy (nested loop, hash join, merge join) — the SQL only declares what to match, not how.

L4 Expert — nuance & edge cases ↓

In Data Vault, JOINs follow a predictable pattern because the model is standardized. Every Satellite JOINs to its parent Hub on the hash key. Every Link JOINs to its participant Hubs. This standardization is why DataVault4dbt can generate JOIN logic automatically — the foreign key relationships are always in the same place. When you can read a DV JOIN, you can read any query the automation produces.

Section 3

ON & Table Aliases

ON specifies the condition that must be true for two rows to be joined. It answers: “on what basis are these tables connected?” The answer is always: on matching keys.

Language Bridge — ON

ON — everyday: “based on,” “conditional upon.” “I’ll go on one condition.” In SQL: the condition that must hold for two rows to be paired.

ON is to JOIN what WHERE is to the full query. WHERE filters rows after reading. ON filters which rows get paired during the JOIN. Both are boolean conditions — they operate at different stages.

Table Aliases — Nicknames

When two tables share a column name (like hub_character_hk exists in both Hub_Character and Sat_Character_Details), SQL needs to know which one you mean. Aliases disambiguate:

Table aliases

FROM Hub_Character h            -- h is the alias
JOIN Sat_Character_Details s  -- s is the alias
  ON h.hub_character_hk = s.hub_character_hk

Convention: h for Hub, s for Satellite, l for Link, hc for Hub_Character, hw for Hub_Weapon. These are nicknames the database doesn’t care about — but humans reading the query do.

+ What happens if you forget ON? EXPLORE

Without ON, the database has no rule for matching rows. Two things can happen:

1. An error — most databases reject JOIN without ON as a syntax error. The query won’t run.

2. A Cartesian product — if the syntax is technically valid (old-style comma joins), every row from the left table pairs with every row from the right. 5 Hub rows × 5 Satellite rows = 25 result rows. Most of them nonsense.

This is exactly the bug in Exercise 4 at the end of this tutorial. When you get there, you’ll spot it instantly.

Section 4

The Basic JOIN: Hub ↔ Satellite

The question: “Give me each character’s name AND their race, level, and faction.”

Names are in Hub_Character. Race, level, faction are in Sat_Character_Details. These are separate tables. JOIN connects them.

Hub ↔ Satellite JOIN

SELECT h.business_key AS character_name,
       s.race,
       s.level,
       s.faction
FROM Hub_Character h
JOIN Sat_Character_Details s
  ON h.hub_character_hk = s.hub_character_hk;

How to Read This Query

1. FROM Hub_Character h — Start with the Hub. This is our “left” table.

2. JOIN Sat_Character_Details s — Bring in the Satellite.

3. ON h.hub_character_hk = s.hub_character_hk — The bridge: pair rows where hash keys match.

4. SELECT ... — From the paired result, show name (from Hub), race, level, faction (from Satellite).

The Result

JOIN Result

character_name	race	level	faction
Ulfric	Nord	30	Stormcloak
Lydia	Nord	25	Companions
Nazeem	Redguard	10	None
Aela	Nord	28	Companions
Balgruuf	Nord	35	Jarl

This is the query you couldn’t write in Day 3. Now you can.

Adding WHERE

JOINs and WHERE combine naturally. JOIN pairs the tables, then WHERE filters the paired result:

JOIN + WHERE

SELECT h.business_key AS character_name,
       s.race, s.level
FROM Hub_Character h
JOIN Sat_Character_Details s
  ON h.hub_character_hk = s.hub_character_hk
WHERE s.faction = 'Companions';

Result: Lydia and Aela only. The JOIN paired all characters with their details, then WHERE kept only Companions.

Section 5

The Chain: Hub → Link → Hub

“Which weapons does Ulfric carry?” Characters are in one Hub. Weapons are in another. The connection lives in Link_Character_Weapon — which stores the hash keys of both.

The Link Table

Link_Character_Weapon

link_character_weapon_hk	hub_character_hk	hub_weapon_hk	load_date
x1y2...	a3f7...	w8k2...	2026-01-01
z3a4...	a3f7...	w9m3...	2026-01-01
b5c6...	b82c...	w8k2...	2026-01-01

Each row = one relationship: “character X carries weapon Y.”

The Link stores only hash key references. To get names, we JOIN through the Link to both Hubs.

Chain JOIN — Hub → Link → Hub

SELECT hc.business_key AS character_name,
       hw.business_key AS weapon_name
FROM Hub_Character hc
JOIN Link_Character_Weapon l
  ON hc.hub_character_hk = l.hub_character_hk
JOIN Hub_Weapon hw
  ON l.hub_weapon_hk = hw.hub_weapon_hk
WHERE hc.business_key = 'Ulfric';

How to read this chain:

1. Start at Hub_Character (aliased hc).

2. Follow the hash key into Link_Character_Weapon (aliased l). Now we have every character–weapon pair.

3. From the Link, follow the weapon hash key into Hub_Weapon (aliased hw). Now both sides have names.

4. WHERE filters to Ulfric only.

The Result

Chain JOIN Result

character_name	weapon_name
Ulfric	Steel Sword
Ulfric	Voice of the Sky

The structural insight: The Link table IS the JOIN table. In relational design, many-to-many relationships always need a junction table. In Data Vault, that’s exactly what a Link is — except it also adds load_date and record_source for auditability.

When you built Link_Character_Weapon in Day 1, you were designing the table this query now reads.

+ Why does the Link need its own hash key? EXPLORE

Look at link_character_weapon_hk in the Link table. It’s not a foreign key to either Hub — it’s the Link’s own primary key, computed from the combination of both participant hash keys.

Why? The same two reasons as Hub hash keys: (1) it gives the Link a single-column primary key for fast lookups, and (2) it’s deterministic — the same character+weapon pair always produces the same Link hash key, regardless of which source system sent it.

If Ulfric carrying a Steel Sword arrives from both the Stormcloak roster and the Whiterun armory log, both sources produce the same link_character_weapon_hk. The vault recognizes it as the same relationship. No duplicates.

The Link hash key also becomes a foreign key target — if a Link ever gets its own Satellite (storing attributes of the relationship, like “equipped_date”), that Satellite would point to the Link’s hash key, exactly like a Hub Satellite points to a Hub hash key.

Section 6

INNER vs LEFT JOIN

INNER JOIN (the default)

JOIN without a qualifier means INNER JOIN. Only rows with a match in both tables appear. No match = excluded.

Analogy: A guest list where both the invitation (Hub) AND the RSVP (Satellite) must exist. No RSVP, not on the final list.

LEFT JOIN

Returns all rows from the left table (FROM), plus matching rows from the right. No match? The right side fills with NULLs.

LEFT JOIN

SELECT h.business_key, s.race
FROM Hub_Character h
LEFT JOIN Sat_Character_Details s
  ON h.hub_character_hk = s.hub_character_hk;

If a character exists in the Hub but has no Satellite data, they still appear — with race = NULL. This catches entities that exist but don’t have details yet.

Key Concept

“In a properly loaded Data Vault, INNER JOIN and LEFT JOIN usually give the same result — Hubs and Satellites load together. But LEFT JOIN is useful during development, testing, or when late-arriving data creates temporary gaps.”

+ What about RIGHT JOIN and FULL JOIN? EXPLORE

RIGHT JOIN — the mirror image of LEFT JOIN. All rows from the right table (the JOINed one), plus matches from the left. In practice, almost nobody uses it. You can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order. It exists for completeness, not convenience.

FULL OUTER JOIN — returns all rows from both tables, with NULLs where there’s no match on either side. Useful for finding orphans in both directions: “show me everything, matched or not.”

CROSS JOIN — the Cartesian product. Every row paired with every other row, no ON clause. Rarely wanted, but sometimes used to generate combinations (e.g., all possible character+weapon pairings for a game matrix).

For Data Vault work, you’ll use INNER JOIN 90% of the time, LEFT JOIN for development/debugging. The others are good to know, but not daily tools.

Section 7

Late Arriving Data

What happens when data arrives out of order? The Stormcloak roster arrives Monday, but Ulfric’s character details don’t come until Wednesday. On Tuesday, the Hub has his name — but the Satellite has nothing to JOIN to.

Late arriving data = data that arrives after the entity it describes has already been loaded. The Hub row exists, but the Satellite data comes later.

Business Date vs. Load Date

load_date — when the data physically arrived in the vault. A system timestamp. “When did we receive this?”

business date (or effective date) — when the data was actually true in the real world. “When was this actually the case?”

If Ulfric’s level was updated to 30 on January 15 in the source system, but the update didn’t arrive until January 18:

load_date = January 18 (arrival)
business_date = January 15 (truth)

4-Layer Explanation — Late Arriving Data

L1 Plain language ↓

Sometimes data arrives late. The vault handles it by recording both when the data was true and when it arrived.

L2 With mechanism ↓

Data Vault separates load_date (arrival time) from business date (when the fact was true). Late data gets its own row with the correct business date, even if it arrived after other data. No overwrites.

L3 Technical ↓

Late data creates a temporal gap between the Hub’s load_date and the Satellite’s load_date. Standard INNER JOIN on hash key still works, but “what did we know at time X?” queries need both dates. Point-In-Time (PIT) tables — pre-computed snapshots — optimize this pattern in production.

L4 Expert nuance ↓

Late-arriving data is one of the strongest arguments for the Data Vault approach. In 3NF, late data means updating existing rows (destructive). In Star Schema, it means recalculating dimensions. In Data Vault, it’s just another INSERT — the hash key ties it to the right Hub, load_date records arrival, business date records truth. No updates, no lost history.

Key Concept — Late Arriving Data

“Late arriving data is when information arrives after the entity is already in the vault. Data Vault handles this well because it separates when data was true from when it arrived. A late record becomes another INSERT with the correct dates — no updates needed, no history lost.”

+ What are PIT tables? PREVIEW

Point-In-Time (PIT) tables solve a performance problem. When a Hub has multiple Satellites that change at different rates, asking “what did we know about Ulfric on January 17?” requires finding the latest record before that date in each Satellite — expensive at scale.

A PIT table pre-computes this. For every snapshot date, it stores the correct Satellite hash-diff pointer for each Satellite. Instead of complex subqueries, you JOIN once to the PIT table and get the right version of every Satellite instantly.

Think of it like a bookmark table: “on this date, the right row in Sat_Character_Details is this one, and the right row in Sat_Character_Stats is that one.”

You won’t need to build PIT tables yourself — but knowing they exist and why shows you understand temporal query challenges. ScaleFree has presented on late-arriving data patterns that touch this topic.

Section 8

How JOINs Map to Data Vault

Hub_Character

hub_character_hk
business_key

Sat_Character_Details

hub_character_hk (FK)
race, level, faction

Link_Char_Weapon

hub_character_hk (FK)
hub_weapon_hk (FK)

Hub_Weapon

hub_weapon_hk
business_key

0 Characters in Vault

0 DV Table Types

0 JOIN Patterns Learned

DV Concept → JOIN Application

What You Know	What JOINs Add
Hub stores identity	JOIN from Sat back to Hub gets the business key (name)
Satellite stores attributes	JOIN from Hub to Sat enriches identity with attributes
Link connects Hubs	Chain JOINs through Link connect entities
Hash key = Hub PK	Hash key is the JOIN column (what ON uses)
FK points to another PK	JOINs follow those arrows
SQL at every DV layer	DataVault4dbt generates INSERT...SELECT...JOIN

Section 9

Final Exercise — You Know Enough For This

Read, Write, Chain, Debug

Four tasks. Each uses what you just learned. All completable right here.

Task 1 — Read This JOIN

What does this query return? Describe in plain English.

Read this

SELECT h.business_key AS character_name,
       s.level
FROM Hub_Character h
JOIN Sat_Character_Details s
  ON h.hub_character_hk = s.hub_character_hk
WHERE s.level > 25;

Answer: This query JOINs Hub_Character to Sat_Character_Details on the hash key, then filters to characters above level 25. It returns the character name (from Hub) and level (from Satellite). Result: Ulfric (30), Aela (28), Balgruuf (35).

Task 2 — Write a Hub ↔ Sat JOIN

Write the SQL to answer: “Give me the name and race of all Nord characters.”

You need Hub_Character for the name, Sat_Character_Details for race. Use table aliases.

Answer

SELECT h.business_key AS character_name,
       s.race
FROM Hub_Character h
JOIN Sat_Character_Details s
  ON h.hub_character_hk = s.hub_character_hk
WHERE s.race = 'Nord';

Key parts: SELECT from both tables, FROM one, JOIN the other, ON the hash key, WHERE the race filter. Aliases are flexible.

Task 3 — Write a Chain JOIN

Write the SQL to answer: “Which weapons does Lydia carry?”

You need Hub_Character → Link_Character_Weapon → Hub_Weapon. Two JOINs.

Answer

SELECT hc.business_key AS character_name,
       hw.business_key AS weapon_name
FROM Hub_Character hc
JOIN Link_Character_Weapon l
  ON hc.hub_character_hk = l.hub_character_hk
JOIN Hub_Weapon hw
  ON l.hub_weapon_hk = hw.hub_weapon_hk
WHERE hc.business_key = 'Lydia';

Three tables, two JOINs, one WHERE. Each JOIN follows a hash key through the Link.

Task 4 — Spot the Bug

This query has an error. What’s wrong?

Broken query

SELECT h.business_key, s.race
FROM Hub_Character h
JOIN Sat_Character_Details s
WHERE s.faction = 'Companions';

The bug: The ON clause is missing. Without ON h.hub_character_hk = s.hub_character_hk, the database doesn’t know how to match rows between the two tables. Either it errors, or it produces a Cartesian product (every Hub row paired with every Satellite row).

JOINs

You Already Found the Problem

Why the Word “JOIN”

ON & Table Aliases

Table Aliases — Nicknames

The Basic JOIN: Hub ↔ Satellite

How to Read This Query

The Result

Adding WHERE

The Chain: Hub → Link → Hub

The Link Table

The Result

INNER vs LEFT JOIN

INNER JOIN (the default)

LEFT JOIN

Late Arriving Data

Business Date vs. Load Date

How JOINs Map to Data Vault

Read, Write, Chain, Debug

All Four Complete