JOINs
The bridge that reunites what normalization separated. Every JOIN follows a foreign key — and in Data Vault, that means following the hash key from Satellite back to Hub, or through a Link to another Hub.
You Already Found the Problem
In the last session, you queried Sat_Character_Details and got race, level, faction — but not the character’s name. Because names live in Hub_Character, and you were looking at one table. JOIN is how you look at two tables at once.
That moment in Day 3 — when you realized you couldn’t get names from the Satellite alone — is exactly the problem JOINs solve. You separated data into Hubs and Satellites for good reasons (no repetition, parallel loading, auditability). JOIN is the tool that reads it back together.
“A JOIN combines data from two tables based on a shared key. In Data Vault, that’s typically the hash key — the Satellite has it as a foreign key pointing back to the Hub. The JOIN follows that link to combine identity with attributes.”
Why the Word “JOIN”
JOIN — from Latin jungere: “to yoke, to bind, to connect.”
Same root as junction (where roads connect), conjunction (in grammar: “and,” “but” — words that connect clauses), and in logic: the AND operator (∧).
This is conjunction from symbolic logic. P ∧ Q produces a result only when both P and Q are true. A SQL JOIN produces a result row only when both tables have a matching value. Same structural shape: “give me something only when BOTH sides match.”
A JOIN combines rows from two tables based on a value they share. Like looking up a character’s name on one sheet and their stats on another, then reading both together by matching the character ID.
The shared value is a key — specifically, a foreign key in one table that points to a primary key in another. The JOIN reads both tables and pairs up rows where the key matches. Rows without a partner get excluded (in the default INNER JOIN).
A JOIN is a set operation that produces the Cartesian product of two relations, filtered by a join predicate (the ON clause). INNER JOIN returns only rows where the predicate is TRUE. The query optimizer decides the physical execution strategy (nested loop, hash join, merge join) — the SQL only declares what to match, not how.
In Data Vault, JOINs follow a predictable pattern because the model is standardized. Every Satellite JOINs to its parent Hub on the hash key. Every Link JOINs to its participant Hubs. This standardization is why DataVault4dbt can generate JOIN logic automatically — the foreign key relationships are always in the same place. When you can read a DV JOIN, you can read any query the automation produces.
ON & Table Aliases
ON specifies the condition that must be true for two rows to be joined. It answers: “on what basis are these tables connected?” The answer is always: on matching keys.
ON — everyday: “based on,” “conditional upon.” “I’ll go on one condition.” In SQL: the condition that must hold for two rows to be paired.
ON is to JOIN what WHERE is to the full query. WHERE filters rows after reading. ON filters which rows get paired during the JOIN. Both are boolean conditions — they operate at different stages.
Table Aliases — Nicknames
When two tables share a column name (like hub_character_hk exists in both Hub_Character and Sat_Character_Details), SQL needs to know which one you mean. Aliases disambiguate:
FROM Hub_Character h -- h is the alias JOIN Sat_Character_Details s -- s is the alias ON h.hub_character_hk = s.hub_character_hk
Convention: h for Hub, s for Satellite, l for Link, hc for Hub_Character, hw for Hub_Weapon. These are nicknames the database doesn’t care about — but humans reading the query do.
Without ON, the database has no rule for matching rows. Two things can happen:
1. An error — most databases reject JOIN without ON as a syntax error. The query won’t run.
2. A Cartesian product — if the syntax is technically valid (old-style comma joins), every row from the left table pairs with every row from the right. 5 Hub rows × 5 Satellite rows = 25 result rows. Most of them nonsense.
This is exactly the bug in Exercise 4 at the end of this tutorial. When you get there, you’ll spot it instantly.
The Basic JOIN: Hub ↔ Satellite
The question: “Give me each character’s name AND their race, level, and faction.”
Names are in Hub_Character. Race, level, faction are in Sat_Character_Details. These are separate tables. JOIN connects them.
SELECT h.business_key AS character_name, s.race, s.level, s.faction FROM Hub_Character h JOIN Sat_Character_Details s ON h.hub_character_hk = s.hub_character_hk;
How to Read This Query
1. FROM Hub_Character h — Start with the Hub. This is our “left” table.
2. JOIN Sat_Character_Details s — Bring in the Satellite.
3. ON h.hub_character_hk = s.hub_character_hk — The bridge: pair rows where hash keys match.
4. SELECT ... — From the paired result, show name (from Hub), race, level, faction (from Satellite).
The Result
| character_name | race | level | faction |
|---|---|---|---|
| Ulfric | Nord | 30 | Stormcloak |
| Lydia | Nord | 25 | Companions |
| Nazeem | Redguard | 10 | None |
| Aela | Nord | 28 | Companions |
| Balgruuf | Nord | 35 | Jarl |
This is the query you couldn’t write in Day 3. Now you can.
Adding WHERE
JOINs and WHERE combine naturally. JOIN pairs the tables, then WHERE filters the paired result:
SELECT h.business_key AS character_name, s.race, s.level FROM Hub_Character h JOIN Sat_Character_Details s ON h.hub_character_hk = s.hub_character_hk WHERE s.faction = 'Companions';
Result: Lydia and Aela only. The JOIN paired all characters with their details, then WHERE kept only Companions.
The Chain: Hub → Link → Hub
“Which weapons does Ulfric carry?” Characters are in one Hub. Weapons are in another. The connection lives in Link_Character_Weapon — which stores the hash keys of both.
The Link Table
| link_character_weapon_hk | hub_character_hk | hub_weapon_hk | load_date |
|---|---|---|---|
| x1y2... | a3f7... | w8k2... | 2026-01-01 |
| z3a4... | a3f7... | w9m3... | 2026-01-01 |
| b5c6... | b82c... | w8k2... | 2026-01-01 |
Each row = one relationship: “character X carries weapon Y.”
The Link stores only hash key references. To get names, we JOIN through the Link to both Hubs.
SELECT hc.business_key AS character_name, hw.business_key AS weapon_name FROM Hub_Character hc JOIN Link_Character_Weapon l ON hc.hub_character_hk = l.hub_character_hk JOIN Hub_Weapon hw ON l.hub_weapon_hk = hw.hub_weapon_hk WHERE hc.business_key = 'Ulfric';
How to read this chain:
1. Start at Hub_Character (aliased hc).
2. Follow the hash key into Link_Character_Weapon (aliased l). Now we have every character–weapon pair.
3. From the Link, follow the weapon hash key into Hub_Weapon (aliased hw). Now both sides have names.
4. WHERE filters to Ulfric only.
The Result
| character_name | weapon_name |
|---|---|
| Ulfric | Steel Sword |
| Ulfric | Voice of the Sky |
The structural insight: The Link table IS the JOIN table. In relational design, many-to-many relationships always need a junction table. In Data Vault, that’s exactly what a Link is — except it also adds load_date and record_source for auditability.
When you built Link_Character_Weapon in Day 1, you were designing the table this query now reads.
Look at link_character_weapon_hk in the Link table. It’s not a foreign key to either Hub — it’s the Link’s own primary key, computed from the combination of both participant hash keys.
Why? The same two reasons as Hub hash keys: (1) it gives the Link a single-column primary key for fast lookups, and (2) it’s deterministic — the same character+weapon pair always produces the same Link hash key, regardless of which source system sent it.
If Ulfric carrying a Steel Sword arrives from both the Stormcloak roster and the Whiterun armory log, both sources produce the same link_character_weapon_hk. The vault recognizes it as the same relationship. No duplicates.
The Link hash key also becomes a foreign key target — if a Link ever gets its own Satellite (storing attributes of the relationship, like “equipped_date”), that Satellite would point to the Link’s hash key, exactly like a Hub Satellite points to a Hub hash key.
INNER vs LEFT JOIN
INNER JOIN (the default)
JOIN without a qualifier means INNER JOIN. Only rows with a match in both tables appear. No match = excluded.
Analogy: A guest list where both the invitation (Hub) AND the RSVP (Satellite) must exist. No RSVP, not on the final list.
LEFT JOIN
Returns all rows from the left table (FROM), plus matching rows from the right. No match? The right side fills with NULLs.
SELECT h.business_key, s.race FROM Hub_Character h LEFT JOIN Sat_Character_Details s ON h.hub_character_hk = s.hub_character_hk;
If a character exists in the Hub but has no Satellite data, they still appear — with race = NULL. This catches entities that exist but don’t have details yet.
“In a properly loaded Data Vault, INNER JOIN and LEFT JOIN usually give the same result — Hubs and Satellites load together. But LEFT JOIN is useful during development, testing, or when late-arriving data creates temporary gaps.”
RIGHT JOIN — the mirror image of LEFT JOIN. All rows from the right table (the JOINed one), plus matches from the left. In practice, almost nobody uses it. You can always rewrite a RIGHT JOIN as a LEFT JOIN by swapping the table order. It exists for completeness, not convenience.
FULL OUTER JOIN — returns all rows from both tables, with NULLs where there’s no match on either side. Useful for finding orphans in both directions: “show me everything, matched or not.”
CROSS JOIN — the Cartesian product. Every row paired with every other row, no ON clause. Rarely wanted, but sometimes used to generate combinations (e.g., all possible character+weapon pairings for a game matrix).
For Data Vault work, you’ll use INNER JOIN 90% of the time, LEFT JOIN for development/debugging. The others are good to know, but not daily tools.
Late Arriving Data
What happens when data arrives out of order? The Stormcloak roster arrives Monday, but Ulfric’s character details don’t come until Wednesday. On Tuesday, the Hub has his name — but the Satellite has nothing to JOIN to.
Late arriving data = data that arrives after the entity it describes has already been loaded. The Hub row exists, but the Satellite data comes later.
Business Date vs. Load Date
load_date — when the data physically arrived in the vault. A system timestamp. “When did we receive this?”
business date (or effective date) — when the data was actually true in the real world. “When was this actually the case?”
If Ulfric’s level was updated to 30 on January 15 in the source system, but the update didn’t arrive until January 18:
load_date= January 18 (arrival)business_date= January 15 (truth)
Sometimes data arrives late. The vault handles it by recording both when the data was true and when it arrived.
Data Vault separates load_date (arrival time) from business date (when the fact was true). Late data gets its own row with the correct business date, even if it arrived after other data. No overwrites.
Late data creates a temporal gap between the Hub’s load_date and the Satellite’s load_date. Standard INNER JOIN on hash key still works, but “what did we know at time X?” queries need both dates. Point-In-Time (PIT) tables — pre-computed snapshots — optimize this pattern in production.
Late-arriving data is one of the strongest arguments for the Data Vault approach. In 3NF, late data means updating existing rows (destructive). In Star Schema, it means recalculating dimensions. In Data Vault, it’s just another INSERT — the hash key ties it to the right Hub, load_date records arrival, business date records truth. No updates, no lost history.
“Late arriving data is when information arrives after the entity is already in the vault. Data Vault handles this well because it separates when data was true from when it arrived. A late record becomes another INSERT with the correct dates — no updates needed, no history lost.”
Point-In-Time (PIT) tables solve a performance problem. When a Hub has multiple Satellites that change at different rates, asking “what did we know about Ulfric on January 17?” requires finding the latest record before that date in each Satellite — expensive at scale.
A PIT table pre-computes this. For every snapshot date, it stores the correct Satellite hash-diff pointer for each Satellite. Instead of complex subqueries, you JOIN once to the PIT table and get the right version of every Satellite instantly.
Think of it like a bookmark table: “on this date, the right row in Sat_Character_Details is this one, and the right row in Sat_Character_Stats is that one.”
You won’t need to build PIT tables yourself — but knowing they exist and why shows you understand temporal query challenges. ScaleFree has presented on late-arriving data patterns that touch this topic.
How JOINs Map to Data Vault
business_key
race, level, faction
hub_weapon_hk (FK)
business_key
| What You Know | What JOINs Add |
|---|---|
| Hub stores identity | JOIN from Sat back to Hub gets the business key (name) |
| Satellite stores attributes | JOIN from Hub to Sat enriches identity with attributes |
| Link connects Hubs | Chain JOINs through Link connect entities |
| Hash key = Hub PK | Hash key is the JOIN column (what ON uses) |
| FK points to another PK | JOINs follow those arrows |
| SQL at every DV layer | DataVault4dbt generates INSERT...SELECT...JOIN |
Read, Write, Chain, Debug
Four tasks. Each uses what you just learned. All completable right here.
What does this query return? Describe in plain English.
SELECT h.business_key AS character_name, s.level FROM Hub_Character h JOIN Sat_Character_Details s ON h.hub_character_hk = s.hub_character_hk WHERE s.level > 25;
Answer: This query JOINs Hub_Character to Sat_Character_Details on the hash key, then filters to characters above level 25. It returns the character name (from Hub) and level (from Satellite). Result: Ulfric (30), Aela (28), Balgruuf (35).
Write the SQL to answer: “Give me the name and race of all Nord characters.”
You need Hub_Character for the name, Sat_Character_Details for race. Use table aliases.
SELECT h.business_key AS character_name, s.race FROM Hub_Character h JOIN Sat_Character_Details s ON h.hub_character_hk = s.hub_character_hk WHERE s.race = 'Nord';
Key parts: SELECT from both tables, FROM one, JOIN the other, ON the hash key, WHERE the race filter. Aliases are flexible.
Write the SQL to answer: “Which weapons does Lydia carry?”
You need Hub_Character → Link_Character_Weapon → Hub_Weapon. Two JOINs.
SELECT hc.business_key AS character_name, hw.business_key AS weapon_name FROM Hub_Character hc JOIN Link_Character_Weapon l ON hc.hub_character_hk = l.hub_character_hk JOIN Hub_Weapon hw ON l.hub_weapon_hk = hw.hub_weapon_hk WHERE hc.business_key = 'Lydia';
Three tables, two JOINs, one WHERE. Each JOIN follows a hash key through the Link.
This query has an error. What’s wrong?
SELECT h.business_key, s.race FROM Hub_Character h JOIN Sat_Character_Details s WHERE s.faction = 'Companions';
The bug: The ON clause is missing. Without ON h.hub_character_hk = s.hub_character_hk, the database doesn’t know how to match rows between the two tables. Either it errors, or it produces a Cartesian product (every Hub row paired with every Satellite row).