CHAPTER 09 // SQL PART 3

CREATE TABLE
and Aggregates

Writing blank character sheets. Counting the party.

CREATE TABLE Data Types Hub / Sat / Link COUNT + GROUP BY WHERE vs HAVING Creation Order
Core

You’ve been reading character sheets. Now you’re going to write blank ones.

LANGUAGE BRIDGE

“CREATE” — from Latin creare, “to make, to produce.” In SQL: make a new, empty table structure.

The grammatical contrast: SELECT reads data from a table that already exists. CREATE TABLE defines the structure of a new, empty table. One is a question you ask again and again. The other is a blueprint you draw once.

THE GRAMMAR OF CREATE TABLE
CREATE TABLE table_name ( column_name DATA_TYPE CONSTRAINT, column_name DATA_TYPE CONSTRAINT, PRIMARY KEY (column_name) );

Read it as a sentence: “Create a table called Hub_Character, with these columns, each accepting this kind of data, under these rules.”

GRAMMATICAL ROLES

CREATE TABLE = the verb (“make this thing”)

table_name = the name you’re giving it (like alias assignment — a naming act)

Each line inside parentheses = one column definition (name + type + rule)

DATA_TYPE = what kind of data that column accepts — text, number, timestamp

CONSTRAINT = a rule for that column — NOT NULL means “this field is required”

PRIMARY KEY (column) = which column uniquely identifies each row — no duplicates allowed

Parentheses = contain the full column list

Semicolon = end of statement

4-Layer Explanation — CREATE TABLE
L1Plain language — for anyone

CREATE TABLE builds an empty table. You specify column names, what type of data each column holds, and which columns are required. Nothing gets stored yet — it just defines the structure. Think of D&D: SELECT is reading a filled-in character sheet. CREATE TABLE is drawing the blank template — Name here, Race here, Level here — before anyone fills it in.

L2With mechanism — how it works

The database reads your CREATE TABLE statement and creates the table’s metadata — column names, types, constraints. From that point on, any INSERT or SELECT has to follow those rules. If you declared level INTEGER, you cannot insert the text “high” into that column — the database rejects it.

L3Technical — full detail

CREATE TABLE is a DDL (Data Definition Language) statement — it defines schema, not data. SELECT is DML (Data Manipulation Language) — it queries data. DDL runs once to set up the schema; DML runs every time you need to read or change data. In Data Vault, every Hub, Link, and Satellite starts as a CREATE TABLE statement. The schema IS the data model.

L4Expert — nuance & edge cases

In production Data Vault projects, CREATE TABLE statements are rarely written by hand. Tools like datavault4dbt generate them from metadata. But understanding the structure matters: when you read a dbt model’s output SQL or debug a loading error (“column X violates NOT NULL constraint”), you need to know what the CREATE TABLE defined. The schema is the contract that every data load must honor.

IS / IS NOT
IS: CREATE TABLE IS a DDL statement that defines an empty table’s structure — column names, data types, constraints.
IS NOT: CREATE TABLE IS NOT a way to store data. It builds the container; INSERT fills it; SELECT reads it.
IS NOT: CREATE TABLE IS NOT something that runs repeatedly. It runs once to define the schema. Running it again on an existing table throws an error.
IS NOT: CREATE TABLE IS NOT the same as SELECT. SELECT reads from a table that already exists. CREATE TABLE makes a table exist in the first place.
KEY CONCEPT

“CREATE TABLE defines structure — it builds a new, empty table with declared columns, types, and constraints. SELECT reads data from a table that already exists. CREATE TABLE runs once to define the schema. SELECT runs every time you query.”

Core
EVERY COLUMN NEEDS A TYPE

When you define a column, you tell the database what kind of data it accepts. hub_character_hk CHAR(64) means: this column holds exactly 64 characters of text. Try to put a number in there? The database rejects it. Try to put 65 characters? Rejected.

Think of each column as a form field with a specific input type: a text box, a date picker, a number field. The data type IS the input type.

THE FOUR TYPES YOU NEED FOR DATA VAULT
Type What it stores Example value DV use case
CHAR(64) Exactly 64 characters, always a1b2c3d4… (SHA-256 hash) Hash keys, hash_diff
VARCHAR(n) Variable text, up to n characters Ulfric (6 chars), Archmage Savos Aren (19 chars) Business keys, record sources, names
TIMESTAMP Date + time down to the second 2026-02-27 10:30:00 load_date, load_end_date
INTEGER Whole numbers, no decimals 42, 1, 999 Levels, counts, quantities
CHAR vs VARCHAR — THE KEY DISTINCTION

CHAR(64) always reserves exactly 64 characters of storage. If your data is 60 characters, it pads the remaining 4 with spaces. This is efficient when the data is ALWAYS the same length — and SHA-256 hashes are always exactly 64 hex characters. Every hash key in every Hub, Link, and Satellite is CHAR(64).

VARCHAR(255) uses only as much space as the actual data needs. “Ulfric” uses 6 characters of storage. “Archmage Savos Aren” uses 19. You set the maximum (255), but storage adapts. This is efficient for names, labels, descriptions — anything where the length varies.

+ Why TIMESTAMP instead of just text? Explore

You could store 2026-02-27 10:30:00 as VARCHAR — it’s technically just characters. But then the database can’t do time math. With TIMESTAMP, you can ask: “show me all rows loaded in the last 24 hours” or “what’s the time difference between load_date and load_end_date?” The type gives the database permission to treat the value as a point in time, not just a string of characters.

Core

The Hub is the simplest table in Data Vault. Four columns. All required. One row per entity, forever.

CREATE TABLE Hub_Character ( hub_character_hk CHAR(64) NOT NULL, load_date TIMESTAMP NOT NULL, record_source VARCHAR(100) NOT NULL, business_key VARCHAR(255) NOT NULL, PRIMARY KEY (hub_character_hk) );
LINE BY LINE

hub_character_hk CHAR(64) NOT NULL — The hash key. SHA-256 always produces exactly 64 hex characters, so CHAR(64). NOT NULL because every row MUST have an identity — a Hub row without a hash key is meaningless.

load_date TIMESTAMP NOT NULL — When this row first entered the vault. TIMESTAMP because you need date AND time precision. NOT NULL because every row needs a record of when it arrived.

record_source VARCHAR(100) NOT NULL — Which system sent this data. Could be 'Skyrim_NPC_Database' (20 chars) or 'Elder_Scrolls_Online_API' (24 chars) — lengths vary, so VARCHAR. NOT NULL because you always need to know where data came from (audit trail).

business_key VARCHAR(255) NOT NULL — The real-world name. “Ulfric” is 6 characters, “Archmage Savos Aren” is 19 — lengths vary, so VARCHAR. NOT NULL because the business key is the reason this Hub row exists.

PRIMARY KEY (hub_character_hk) — This column uniquely identifies every row. No two rows can have the same hash key. The database enforces this — try to insert a duplicate and it rejects the row.

THE HUB PATTERN

Every Hub follows the same formula: 4 columns, all NOT NULL, single-column primary key on the hash key.

The only things that change between Hub_Character, Hub_Weapon, Hub_Location are the table name and the hash key column name. The structure is identical.

4-Layer Explanation — Hub Schema
L1Plain language — for anyone

A Hub CREATE TABLE defines 4 required columns: a hash key (the ID), a load date (when it arrived), a record source (where it came from), and a business key (the name). Metadata first, then content. The hash key is the primary key — no duplicates allowed.

L2With mechanism — how it works

The hash key is CHAR(64) because SHA-256 output is always exactly 64 characters — fixed-length data gets a fixed-length column. The business key is VARCHAR because “Ulfric” and “Archmage Savos Aren” are different lengths. Every column is NOT NULL because a Hub row with missing data is structurally broken — you can’t have an identity without an ID, a name, a timestamp, or a source.

L3Technical — full detail

The Hub is the simplest DV 2.0 entity type. Its CREATE TABLE is a template: {entity}_hk CHAR(64) as PK, load_date TIMESTAMP, record_source VARCHAR(n), business_key VARCHAR(n). Metadata first, content last. All NOT NULL. The PK constraint on the hash key enforces entity uniqueness — one row per business entity, forever. Satellites and Links reference this hash key.

L4Expert — nuance & edge cases

In multi-source environments, the same business entity might arrive from different systems with different business keys. The hash key resolves this: it is generated from the business key value, so the same entity from different sources produces the same hash key and maps to the same Hub row. In datavault4dbt, the Hub CREATE TABLE is generated from a YAML configuration — but the underlying structure is always these same 4 columns.

IS / IS NOT
IS: A Hub IS one row per unique business entity, forever. Ulfric has exactly one Hub row regardless of how many times he appears in source data.
IS NOT: A Hub IS NOT where you store changing details. Level, faction, equipment — those go in a Satellite.
IS NOT: A Hub IS NOT an entity table in the traditional sense. It stores nothing about the entity except its identity (hash key, business key, when it arrived, where it came from).
IS: NOT NULL on all Hub columns IS a design requirement, not a convention. A Hub row with missing data is structurally broken — it cannot serve as an anchor point.
KEY CONCEPT

“A Hub has four columns: hash key as CHAR(64), business key as VARCHAR, load_date as TIMESTAMP, and record_source as VARCHAR. All NOT NULL. The hash key is the primary key.”

Core
HUB vs SATELLITE — THE KEY IDEA

A Hub gives Ulfric one row, forever — his identity. A Satellite gives Ulfric a NEW row every time his data changes. Level 20 in January, level 30 in March — that is 2 rows in the Satellite, both pointing to the same 1 row in the Hub.

CREATE TABLE Sat_Character_Details ( hub_character_hk CHAR(64) NOT NULL, load_date TIMESTAMP NOT NULL, load_end_date TIMESTAMP, record_source VARCHAR(100) NOT NULL, hash_diff CHAR(64) NOT NULL, race VARCHAR(50), level INTEGER, faction VARCHAR(100), PRIMARY KEY (hub_character_hk, load_date) );
SCALEFREE / DATAVAULT4DBT STANDARD

General DV pattern vs ScaleFree practice: The schema above shows the general Data Vault pattern with load_end_date as a stored column. In ScaleFree’s datavault4dbt, the v0 Satellite (the physical, insert-only table) has no load_end_date column at all. Instead, a v1 Satellite (a view on top of v0) computes load_end_date virtually at query time by looking at adjacent rows. The v0 table stays purely append-only — nothing is ever updated after insert.

FIVE DIFFERENCES FROM A HUB

1. Composite PRIMARY KEY — (hub_character_hk, load_date)

In a Hub, hub_character_hk alone is unique — Ulfric has exactly one row. In a Satellite, Ulfric has MULTIPLE rows (one per change). So hub_character_hk alone would have duplicates. The primary key needs BOTH columns together: “Ulfric’s hash key + January 15 at 10:00” is unique. “Ulfric’s hash key + March 3 at 14:00” is also unique. Neither column alone guarantees uniqueness. Both together do.

2. load_end_date is nullable — no NOT NULL

Look at load_end_date: just TIMESTAMP, no NOT NULL after it. The column is allowed to be empty. Ulfric’s March row is his CURRENT data — it hasn’t ended yet. NULL means “still active.” When April data arrives, March’s load_end_date gets filled in, and April becomes the new current row with NULL.

3. Descriptive columns can be NULL — race, level, faction

The structural columns (hub_character_hk, load_date, record_source, hash_diff) are always NOT NULL — the Satellite row itself must be complete. But the descriptive data it carries can have gaps. Maybe the source system has Ulfric’s level but not his faction.

4. No business_key column

The Satellite connects to the Hub through the hash key column. If you want to see Ulfric’s name alongside his Satellite data, you JOIN the Satellite to the Hub. This avoids storing “Ulfric Stormcloak” in 50 different places.

5. hash_diff — change detection

hash_diff CHAR(64) NOT NULL is a DV 2.0 feature. It is a SHA-256 hash of ALL the descriptive columns combined (race + level + faction). Before inserting a new Satellite row, compare the hash_diff to the current row’s hash_diff. If they match, nothing changed — skip the insert. If they differ, something changed — insert the new row.

HUB
  • 4 columns
  • All NOT NULL
  • PK = hash key alone
  • Has business_key
  • One row per entity
  • No hash_diff
SATELLITE
  • Structural + descriptive columns
  • Descriptive columns can be NULL
  • PK = hash key + load_date
  • No business_key
  • Multiple rows per entity (snapshots)
  • Has hash_diff for change detection
4-Layer Explanation — Satellite
L1Plain language — for anyone

A Satellite stores the changing details of an entity over time. Ulfric at level 20 is one row. Ulfric at level 30 is another row. The primary key uses two columns — who (hash key) and when (load date) — because the who alone would have duplicates.

L2With mechanism — how it works

The composite primary key (hub_character_hk, load_date) works because no entity can have two different snapshots at the exact same timestamp. load_end_date being nullable marks which row is current (NULL = still active). hash_diff prevents unnecessary inserts — if the source sends Ulfric’s data again but nothing changed, the hash_diff matches and no new row is created.

L3Technical — full detail

The Satellite schema separates structural columns (all NOT NULL: hash key, load_date, record_source, hash_diff) from descriptive columns (nullable: race, level, faction). This separation means the Satellite’s own integrity is always guaranteed even when source data has gaps. The foreign key relationship to the Hub means the Satellite cannot reference a nonexistent entity — the Hub row must exist first.

L4Expert — nuance & edge cases

Satellites are the highest-volume DV table type — one entity might generate hundreds of rows over its lifetime. The hash_diff optimization is critical at scale: a source that sends Ulfric’s data daily but only changes his level monthly produces 1 new Satellite row per month, not 30. load_end_date enables “point-in-time” queries: “What was Ulfric’s data as of February 15?” — find the row where load_date <= Feb 15 and (load_end_date > Feb 15 OR load_end_date IS NULL).

IS / IS NOT
IS: A Satellite IS a time-series of snapshots for one entity. Ulfric at level 20 is one row. Ulfric at level 30 is a second row. Both point to the same Hub row.
IS NOT: A Satellite IS NOT a table of unique entities. The composite PK means the same entity can appear many times — that is the point.
IS: load_end_date NULL IS an active signal — NULL means this is the current snapshot, no end date yet.
IS NOT: load_end_date NULL IS NOT missing data or an error. It is a deliberate design choice marking the current row.
IS: hash_diff IS a SHA-256 hash of all descriptive columns combined. Its purpose is change detection.
IS NOT: hash_diff IS NOT the same as the hub hash key. The hub hash key identifies the entity. hash_diff detects whether the entity’s attributes changed.
KEY CONCEPT

“A Satellite has a composite primary key — hash key plus load_date — because it stores multiple snapshots over time. It adds hash_diff for change detection and allows NULL in descriptive columns. It connects to the Hub via the hash key, not via business_key.”

Core
THE LINK — RELATIONSHIPS AS FIRST-CLASS OBJECTS

A Link records that a relationship EXISTS between two entities. Ulfric wields a Steel Sword — that pairing is the Link row. The Link does not store details about the relationship (like “equipped since January”). It just records: these two things are connected.

CREATE TABLE Link_Character_Weapon ( link_character_weapon_hk CHAR(64) NOT NULL, load_date TIMESTAMP NOT NULL, record_source VARCHAR(100) NOT NULL, hub_character_hk CHAR(64) NOT NULL, hub_weapon_hk CHAR(64) NOT NULL, PRIMARY KEY (link_character_weapon_hk) );
LINE BY LINE

link_character_weapon_hk CHAR(64) NOT NULL — The Link’s own hash key. Generated by hashing the COMBINATION of hub_character_hk + hub_weapon_hk. Same pairing always produces the same link hash key.

load_date TIMESTAMP NOT NULL — When this relationship was first recorded.

record_source VARCHAR(100) NOT NULL — Which system reported the relationship.

hub_character_hk CHAR(64) NOT NULL — Points to Ulfric’s row in Hub_Character. This is a foreign key — it references the Hub.

hub_weapon_hk CHAR(64) NOT NULL — Points to Steel Sword’s row in Hub_Weapon. Another foreign key.

PRIMARY KEY (link_character_weapon_hk) — Simple primary key on the link hash key, just like a Hub. NOT composite like a Satellite.

5 DIFFERENCES — LINK vs HUB AND SATELLITE

1. Multiple foreign keys. A Hub has 0 foreign keys. A Link has one per Hub it connects. This Link has 2: hub_character_hk and hub_weapon_hk. A Link connecting 3 Hubs would have 3 foreign keys.

2. Its own hash key. Derived from the combination of the Hub hash keys it references. “Ulfric’s hash + Steel Sword’s hash” → hashed together → produces the link hash key.

3. No business_key. The RELATIONSHIP between them does not have a human-readable name. The link hash key serves as the identifier.

4. No descriptive columns. “When did Ulfric equip the Steel Sword?” goes in a Link Satellite, NOT in the Link itself. The Link records existence of the relationship — nothing more.

5. Simple PRIMARY KEY. Like a Hub: one column (the link hash key). Not composite like a Satellite. A specific pairing only gets recorded once.

All columns NOT NULL — like a Hub. A relationship where you do not know what it connects is not a relationship.

HUB
  • PK: hash key
  • Has business_key
  • All NOT NULL
  • One entity
SATELLITE
  • PK: hash key + load_date
  • No business_key
  • Descriptive = nullable
  • Snapshots over time
  • PK: link hash key
  • No business_key
  • All NOT NULL
  • Multi-FK (Hub HKs)
IS / IS NOT
IS: A Link IS a first-class object that records the existence of a relationship between two or more Hub entities.
IS NOT: A Link IS NOT a junction table in the traditional relational sense. A Link records a relationship as a business fact — it is as real as the entities it connects.
IS NOT: A Link IS NOT where you store details about a relationship. “Equipped since January at proficiency level 3” goes in a Link Satellite, not the Link itself.
IS: The Link’s hash key IS derived from the combination of the Hub hash keys it references. Same pairing = same key.
IS NOT: The Link’s hash key IS NOT generated from a business key. The Link has no natural business key — the relationship is identified by its participants.
KEY CONCEPT

“A Link connects Hubs by storing their hash keys as foreign keys, plus its own link hash key as the primary key. No business key, no descriptive columns — if the relationship has attributes, those go in a Link Satellite.”

Core

How many characters per faction? You can’t answer that by reading one row.

LANGUAGE BRIDGE

“GROUP BY” — everyday meaning: arrange into groups. SQL meaning: collapse rows that share a value into one result row per group.

“COUNT” — everyday meaning: tally. SQL meaning: an aggregate function that counts rows in each group.

“Aggregate” — from Latin aggregare, “to add to a flock.” Aggregation = gathering individual items into a collective answer.

HOW GROUP BY + COUNT WORKS

You write it in this order (what you want the output to look like):

SELECT faction, COUNT(*) AS member_count FROM Sat_Character_Details GROUP BY faction;

The database runs it in this order (execution order, not write order):

Step 1 — FROM: The database starts with all 6 rows in Sat_Character_Details.

Step 2 — GROUP BY faction: The database sorts the 6 rows into piles based on the faction column value.

Step 3 — COUNT(*): For each pile, the database counts the rows. This runs after GROUP BY, even though it appears before it in the query.

Step 4 — SELECT: Assembles the output: faction label + count, named member_count (alias — “henceforth called”).

INPUT → OUTPUT
business_keyfaction
UlfricStormcloak
GalmarStormcloak
RalofStormcloak
LydiaCompanions
AelaCompanions
NazeemNone
↓ GROUP BY faction + COUNT(*) ↓
factionmember_count
Stormcloak3
Companions2
None1

6 input rows became 3 output rows. Each output row represents one group. GROUP BY collapsed the individual rows into group summaries.

+ What happens if you use COUNT without GROUP BY? Explore
SELECT COUNT(*) AS total_characters FROM Sat_Character_Details;

Result: one single row containing 6. Without GROUP BY, COUNT treats the ENTIRE table as one group and counts all rows. No faction breakdown — just the total.

4-Layer Explanation — GROUP BY
L1Plain language — for anyone

GROUP BY sorts rows into piles by a column value, then COUNT tells you how many rows are in each pile. 6 character rows grouped by faction becomes 3 output rows: Stormcloak (3), Companions (2), None (1).

L2With mechanism — how it works

The database processes FROM first (get all rows), then WHERE (filter individual rows), then GROUP BY (form piles), then HAVING (filter piles), then SELECT (compute COUNT per pile). WHERE cannot use COUNT because groups have not been formed yet. HAVING can because groups already exist.

L3Technical — full detail

GROUP BY + aggregate functions (COUNT, SUM, AVG, MIN, MAX) transform row-level data into group-level summaries. Every column in SELECT must either appear in GROUP BY or be inside an aggregate function — because the output has one row per group. SELECT faction, COUNT(*) works because faction is in GROUP BY and COUNT is an aggregate. SELECT faction, level would fail because level is not grouped and not aggregated — which of Ulfric’s, Galmar’s, and Ralof’s levels should it show?

L4Expert — nuance & edge cases

In Data Vault contexts, COUNT + GROUP BY is how you answer operational questions against Satellites: “How many snapshot rows per entity?” (quality check), “How many characters changed faction last month?” (business intelligence). Combined with HAVING, you can find anomalies: HAVING COUNT(*) > 100 flags entities with suspiciously many changes — potential data quality issues.

IS / IS NOT
IS: GROUP BY IS an instruction to collapse rows with the same column value into one output row per group, then compute aggregates for each group.
IS NOT: GROUP BY IS NOT a filter. It does not remove rows — it reorganizes them into piles and produces one summary row per pile.
IS: COUNT(*) IS a count of rows in a group (or in the entire table if no GROUP BY).
IS NOT: COUNT(*) IS NOT a count of distinct values. COUNT(*) counts all rows including those with NULL values. COUNT(column_name) counts rows where that column is not NULL.
IS: HAVING IS a filter that runs AFTER groups are formed — it can use COUNT, SUM, and other aggregates.
IS NOT: HAVING IS NOT the same as WHERE. WHERE filters individual rows BEFORE grouping. HAVING filters groups AFTER grouping.
WHERE vs HAVING — THE CONTRASTIVE PAIR
WHEREHAVING
FiltersIndividual rowsGroups
WhenBefore GROUP BYAfter GROUP BY
Can use COUNT?No — groups don’t exist yetYes — groups already formed
ExampleWHERE faction != 'None'HAVING COUNT(*) > 1
WHERE EXAMPLE
SELECT faction, COUNT(*) FROM Sat_Character_Details WHERE faction != 'None' GROUP BY faction;

Start with 6 rows. WHERE removes Nazeem (faction = ‘None’). 5 rows remain. GROUP BY sorts them: Stormcloak (3), Companions (2). Result: 2 output rows.

HAVING EXAMPLE
SELECT faction, COUNT(*) FROM Sat_Character_Details GROUP BY faction HAVING COUNT(*) > 1;

Start with 6 rows. No WHERE, all 6 proceed. GROUP BY sorts: Stormcloak (3), Companions (2), None (1). HAVING checks: 3 > 1? Keep. 2 > 1? Keep. 1 > 1? Remove. Result: 2 output rows — same result, different mechanism.

EXECUTION ORDER (COMPLETE)

FROMWHEREGROUP BYHAVINGSELECTORDER BY

WHERE runs before GROUP BY. HAVING runs after. That’s why WHERE can’t use COUNT(*) — the groups don’t exist yet when WHERE runs.

KEY CONCEPT

“GROUP BY answers questions about groups of rows. Given 6 character rows, to find how many are in each faction, GROUP BY faction and COUNT the rows in each group. The output has one row per group. HAVING filters groups after counting — for example, only factions with more than two members.”

Important
WHY ORDER MATTERS

When building a full Data Vault in SQL, the order matters because of dependencies. You cannot reference a table that does not exist yet.

1. HUBS FIRST
No dependencies — stand alone
Reference Hub hash keys
3. SATELLITES LAST
Reference Hub or Link hash keys
THE REASONING

Hubs first — they stand alone. Hub_Character and Hub_Weapon have no foreign keys. They reference nothing. Can be created in any order, even in parallel.

Links second — Link_Character_Weapon references hub_character_hk and hub_weapon_hk. Those columns point to the Hubs. If the Hubs do not exist yet, the Link cannot reference them.

Satellites last — Sat_Character_Details references hub_character_hk. The Hub must exist first. A Link Satellite references link_character_weapon_hk. The Link must exist first.

This is the same as the loading order in production: Hubs load first (parallel, no dependencies), then Links (need Hub hash keys to exist), then Satellites (need Hub or Link hash keys to exist).

KEY CONCEPT

“Hubs first because they have no dependencies — they stand alone. Links second because they reference Hub hash keys. Satellites last because they reference Hub or Link hash keys. Same order as loading in production.”

Core
WORKED EXAMPLE

“Which SQL statement defines structure vs reads data?”

Runs once to set up the table: CREATE TABLE — because it is a DDL statement that defines the schema.

Runs every time you query: SELECT — because it is a DML statement that reads existing data.

The key distinction is define vs read. CREATE TABLE defines structure once. SELECT reads data repeatedly.

Exercise 1 — CREATE vs SELECT

Match each description to the correct SQL statement.

Defines the structure of a new, empty table
Retrieves data from an existing table
Specifies column names, data types, and constraints
Can be run many times on the same table
WORKED EXAMPLE

“What data type should hub_character_hk use?”

Answer: CHAR(64)

Hash keys are always exactly 64 characters (SHA-256 hex output). CHAR is for fixed-length data. VARCHAR would also work but wastes overhead since the length never varies. CHAR(64) is the standard for every hash key in Data Vault.

Exercise 2 — Choose the Data Type

For each DV column, pick the correct data type.

hub_character_hk — SHA-256 hash, always 64 characters
business_key — character name, variable length
load_date — when the data was loaded into the vault
level — numeric character level
record_source — name of the source system
WORKED EXAMPLE

“Write the CREATE TABLE for a Hub_Player with hash key hub_player_hk.”

CREATE TABLE Hub_Player (
    hub_player_hk       CHAR(64)     NOT NULL,
    load_date           TIMESTAMP    NOT NULL,
    record_source       VARCHAR(100) NOT NULL,
    business_key        VARCHAR(255) NOT NULL,
    PRIMARY KEY (hub_player_hk)
);

Every Hub follows the same 4-column pattern: hash key as CHAR(64), load_date as TIMESTAMP, record_source as VARCHAR, business_key as VARCHAR. Metadata first, then content. All NOT NULL. The hash key is the PRIMARY KEY. Only the table name and hash key column name change between Hubs.

Exercise 3 — Write Hub_Weapon

Your turn: Write the CREATE TABLE for Hub_Weapon

Hub_Weapon stores weapon identities. Same 4-column pattern as Hub_Character. The hash key column is hub_weapon_hk.

WORKED EXAMPLE

“Write the CREATE TABLE for Sat_Character_Details with descriptive columns race, level, and faction.”

CREATE TABLE Sat_Character_Details (
    hub_character_hk    CHAR(64)     NOT NULL,
    load_date           TIMESTAMP    NOT NULL,
    load_end_date       TIMESTAMP,
    record_source       VARCHAR(100) NOT NULL,
    hash_diff           CHAR(64)     NOT NULL,
    race                VARCHAR(50),
    level               INTEGER,
    faction             VARCHAR(100),
    PRIMARY KEY (hub_character_hk, load_date)
);

Satellite pattern: composite PK (hash key + load_date), nullable load_end_date (NULL = current), hash_diff for change detection, descriptive columns can be NULL, no business_key. Structural columns are all NOT NULL; descriptive columns are not.

Exercise 4 — Write Sat_Weapon_Details

Your turn: Write the CREATE TABLE for Sat_Weapon_Details

This Satellite tracks weapon attributes over time. It connects to Hub_Weapon via hub_weapon_hk. Descriptive columns: damage (INTEGER), weapon_type (VARCHAR(50)), enchantment (VARCHAR(100)). Remember: composite PK, nullable descriptive columns, no business_key, hash_diff for change detection.

WORKED EXAMPLE

“Write the CREATE TABLE for Link_Character_Weapon connecting Hub_Character and Hub_Weapon.”

CREATE TABLE Link_Character_Weapon (
    link_character_weapon_hk    CHAR(64)     NOT NULL,
    load_date                   TIMESTAMP    NOT NULL,
    record_source               VARCHAR(100) NOT NULL,
    hub_character_hk            CHAR(64)     NOT NULL,
    hub_weapon_hk               CHAR(64)     NOT NULL,
    PRIMARY KEY (link_character_weapon_hk)
);

Link pattern: own hash key (derived from Hub HKs), load_date, record_source, then one foreign key per Hub it connects. Metadata first, then content. All NOT NULL. Simple PK on the link hash key (not composite). No business_key, no descriptive columns.

Exercise 5 — Write Link_Character_Location

Your turn: Write the CREATE TABLE for Link_Character_Location

This Link connects Hub_Character to Hub_Location. The link hash key is link_character_location_hk. It references hub_character_hk and hub_location_hk. Same pattern as Link_Character_Weapon.

WORKED EXAMPLE

“Given 4 rows where 2 have race = ‘Nord’ and 2 have race = ‘Breton’, what does SELECT race, COUNT(*) FROM Sat_Character_Details GROUP BY race return?”

Answer: 2 output rows — Nord (2) and Breton (2). GROUP BY creates one pile per unique race value, COUNT(*) tallies each pile.

4 input rows → 2 groups (Nord, Breton) → 2 output rows. Each output row represents one group.

Exercise 6 — Predict the Output

Given the faction data (Ulfric/Galmar/Ralof = Stormcloak, Lydia/Aela = Companions, Nazeem = None) and the query SELECT faction, COUNT(*) FROM Sat_Character_Details GROUP BY faction:

How many rows in the result?
What is the count for “Stormcloak”?
What is the count for “Companions”?
What is the count for “None” (Nazeem)?
WORKED EXAMPLE

“I want only factions with more than 1 member. WHERE or HAVING?”

Answer: HAVING — because “more than 1 member” requires COUNT, and COUNT only works after groups are formed. HAVING runs after GROUP BY; WHERE runs before.

The rule: if the filter needs an aggregate (COUNT, SUM, AVG), use HAVING. If the filter is about individual row values, use WHERE.

Exercise 7 — WHERE or HAVING?

For each scenario, choose whether you’d use WHERE or HAVING.

Filter out characters with level < 10 before grouping
Only show factions with more than 2 members
Exclude rows where faction IS NULL
Only show groups where the average level exceeds 20
All exercises complete! You can now define Hubs, Satellites, and Links in SQL — and count groups with GROUP BY.