CREATE TABLE and Aggregates
Writing blank character sheets. Counting the party.
You’ve been reading character sheets. Now you’re going to write blank ones.
“CREATE” — from Latin creare, “to make, to produce.” In SQL: make a new, empty table structure.
The grammatical contrast: SELECT reads data from a table that already exists. CREATE TABLE defines the structure of a new, empty table. One is a question you ask again and again. The other is a blueprint you draw once.
Read it as a sentence: “Create a table called Hub_Character, with these columns, each accepting this kind of data, under these rules.”
CREATE TABLE = the verb (“make this thing”)
table_name = the name you’re giving it (like alias assignment — a naming act)
Each line inside parentheses = one column definition (name + type + rule)
DATA_TYPE = what kind of data that column accepts — text, number, timestamp
CONSTRAINT = a rule for that column — NOT NULL means “this field is required”
PRIMARY KEY (column) = which column uniquely identifies each row — no duplicates allowed
Parentheses = contain the full column list
Semicolon = end of statement
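As a concrete sketch of this anatomy, here is the statement run end to end in SQLite via Python's sqlite3 module. The column names follow the lesson's Hub_Character example; the in-memory database is just for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The verb, the table name, then one column definition (name + type + rule)
# per line, all inside parentheses, ended with a semicolon:
cur.execute("""
    CREATE TABLE Hub_Character (
        hub_character_hk  CHAR(64)     NOT NULL,
        load_date         TIMESTAMP    NOT NULL,
        record_source     VARCHAR(100) NOT NULL,
        business_key      VARCHAR(255) NOT NULL,
        PRIMARY KEY (hub_character_hk)
    );
""")

# The blueprint exists, but nothing has been stored yet:
row_count = cur.execute("SELECT COUNT(*) FROM Hub_Character").fetchone()[0]
print(row_count)  # 0
```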
CREATE TABLE builds an empty table. You specify column names, what type of data each column holds, and which columns are required. Nothing gets stored yet — it just defines the structure. Think of D&D: SELECT is reading a filled-in character sheet. CREATE TABLE is drawing the blank template — Name here, Race here, Level here — before anyone fills it in.
The database reads your CREATE TABLE statement and creates the table’s metadata — column names, types, constraints. From that point on, any INSERT or SELECT has to follow those rules. If you declared level INTEGER, you cannot insert the text “high” into that column — the database rejects it.
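A runnable sketch of both claims, again in SQLite: the statement produces metadata, and inserts must follow the declared rules. One caveat worth labeling: SQLite enforces NOT NULL and PRIMARY KEY, but unlike most databases it does not enforce declared column types unless the table is created as STRICT, so the NOT NULL rule is demonstrated here instead of the type rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Hub_Character (
        hub_character_hk CHAR(64) NOT NULL,
        load_date TIMESTAMP NOT NULL,
        record_source VARCHAR(100) NOT NULL,
        business_key VARCHAR(255) NOT NULL,
        PRIMARY KEY (hub_character_hk)
    )
""")

# The CREATE TABLE statement produced metadata: column names, types, constraints.
columns = [row[1] for row in cur.execute("PRAGMA table_info(Hub_Character)")]
print(columns)  # ['hub_character_hk', 'load_date', 'record_source', 'business_key']

# From this point on, every INSERT must honor the rules.
# A NULL in a NOT NULL column is rejected:
try:
    cur.execute(
        "INSERT INTO Hub_Character VALUES "
        "(NULL, '2026-01-15 10:00:00', 'Skyrim_NPC_Database', 'Ulfric')"
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```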
CREATE TABLE is a DDL (Data Definition Language) statement — it defines schema, not data. SELECT is DML (Data Manipulation Language) — it queries data. DDL runs once to set up the schema; DML runs every time you need to read or change data. In Data Vault, every Hub, Link, and Satellite starts as a CREATE TABLE statement. The schema IS the data model.
In production Data Vault projects, CREATE TABLE statements are rarely written by hand. Tools like datavault4dbt generate them from metadata. But understanding the structure matters: when you read a dbt model’s output SQL or debug a loading error (“column X violates NOT NULL constraint”), you need to know what the CREATE TABLE defined. The schema is the contract that every data load must honor.
“CREATE TABLE defines structure — it builds a new, empty table with declared columns, types, and constraints. SELECT reads data from a table that already exists. CREATE TABLE runs once to define the schema. SELECT runs every time you query.”
When you define a column, you tell the database what kind of data it accepts. hub_character_hk CHAR(64) means: this column holds exactly 64 characters of text. Try to put a number in there? The database rejects it. Try to put 65 characters? Rejected.
Think of each column as a form field with a specific input type: a text box, a date picker, a number field. The data type IS the input type.
| Type | What it stores | Example value | DV use case |
|---|---|---|---|
| CHAR(64) | Exactly 64 characters, always | a1b2c3d4… (SHA-256 hash) | Hash keys, hash_diff |
| VARCHAR(n) | Variable text, up to n characters | Ulfric (6 chars), Archmage Savos Aren (19 chars) | Business keys, record sources, names |
| TIMESTAMP | Date + time down to the second | 2026-02-27 10:30:00 | load_date, load_end_date |
| INTEGER | Whole numbers, no decimals | 42, 1, 999 | Levels, counts, quantities |
CHAR(64) always reserves exactly 64 characters of storage. If your data is 60 characters, it pads the remaining 4 with spaces. This is efficient when the data is ALWAYS the same length — and SHA-256 hashes are always exactly 64 hex characters. Every hash key in every Hub, Link, and Satellite is CHAR(64).
VARCHAR(255) uses only as much space as the actual data needs. “Ulfric” uses 6 characters of storage. “Archmage Savos Aren” uses 19. You set the maximum (255), but storage adapts. This is efficient for names, labels, descriptions — anything where the length varies.
You could store 2026-02-27 10:30:00 as VARCHAR — it’s technically just characters. But then the database can’t do time math. With TIMESTAMP, you can ask: “show me all rows loaded in the last 24 hours” or “what’s the time difference between load_date and load_end_date?” The type gives the database permission to treat the value as a point in time, not just a string of characters.
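The "time math" claim can be sketched in SQLite. A note on the assumption: SQLite stores timestamps as text but provides date functions such as julianday over them; databases with a native TIMESTAMP type do this arithmetic directly. The timestamps below are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sat (load_date TIMESTAMP, load_end_date TIMESTAMP)")
cur.execute(
    "INSERT INTO sat VALUES ('2026-01-15 10:00:00', '2026-03-03 14:00:00')"
)

# Because the values are treated as points in time rather than strings,
# the difference between load_date and load_end_date is computable:
hours = cur.execute(
    "SELECT (julianday(load_end_date) - julianday(load_date)) * 24 FROM sat"
).fetchone()[0]
print(round(hours))  # 1132 hours between the two timestamps
```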
The Hub is the simplest table in Data Vault. Four columns. All required. One row per entity, forever.
hub_character_hk CHAR(64) NOT NULL — The hash key. SHA-256 always produces exactly 64 hex characters, so CHAR(64). NOT NULL because every row MUST have an identity — a Hub row without a hash key is meaningless.
load_date TIMESTAMP NOT NULL — When this row first entered the vault. TIMESTAMP because you need date AND time precision. NOT NULL because every row needs a record of when it arrived.
record_source VARCHAR(100) NOT NULL — Which system sent this data. Could be 'Skyrim_NPC_Database' (19 chars) or 'Elder_Scrolls_Online_API' (24 chars) — lengths vary, so VARCHAR. NOT NULL because you always need to know where data came from (audit trail).
business_key VARCHAR(255) NOT NULL — The real-world name. “Ulfric” is 6 characters, “Archmage Savos Aren” is 19 — lengths vary, so VARCHAR. NOT NULL because the business key is the reason this Hub row exists.
PRIMARY KEY (hub_character_hk) — This column uniquely identifies every row. No two rows can have the same hash key. The database enforces this — try to insert a duplicate and it rejects the row.
Every Hub follows the same formula: 4 columns, all NOT NULL, single-column primary key on the hash key.
The only things that change between Hub_Character, Hub_Weapon, Hub_Location are the table name and the hash key column name. The structure is identical.
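The PRIMARY KEY rule can be watched in action. A sketch in SQLite: the same hash key inserted twice is rejected, which is exactly what "one row per entity, forever" means mechanically. The hash value and source names are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Hub_Character (
        hub_character_hk CHAR(64) NOT NULL,
        load_date TIMESTAMP NOT NULL,
        record_source VARCHAR(100) NOT NULL,
        business_key VARCHAR(255) NOT NULL,
        PRIMARY KEY (hub_character_hk)
    )
""")

hk = "a1" * 32  # stand-in for a 64-character SHA-256 hex hash
cur.execute(
    "INSERT INTO Hub_Character VALUES (?, '2026-01-15 10:00:00', "
    "'Skyrim_NPC_Database', 'Ulfric')", (hk,)
)

try:
    # The same entity arrives again from a second source, same hash key:
    cur.execute(
        "INSERT INTO Hub_Character VALUES (?, '2026-02-01 09:00:00', "
        "'Elder_Scrolls_Online_API', 'Ulfric')", (hk,)
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)  # True: one row per entity, forever
```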
A Hub CREATE TABLE defines 4 required columns: a hash key (the ID), a load date (when it arrived), a record source (where it came from), and a business key (the name). Metadata first, then content. The hash key is the primary key — no duplicates allowed.
The hash key is CHAR(64) because SHA-256 output is always exactly 64 characters — fixed-length data gets a fixed-length column. The business key is VARCHAR because “Ulfric” and “Archmage Savos Aren” are different lengths. Every column is NOT NULL because a Hub row with missing data is structurally broken — you can’t have an identity without an ID, a name, a timestamp, or a source.
The Hub is the simplest DV 2.0 entity type. Its CREATE TABLE is a template: {entity}_hk CHAR(64) as PK, load_date TIMESTAMP, record_source VARCHAR(n), business_key VARCHAR(n). Metadata first, content last. All NOT NULL. The PK constraint on the hash key enforces entity uniqueness — one row per business entity, forever. Satellites and Links reference this hash key.
In multi-source environments, the same business entity might arrive from different systems with different business keys. The hash key resolves this: it is generated from the business key value, so the same entity from different sources produces the same hash key and maps to the same Hub row. In datavault4dbt, the Hub CREATE TABLE is generated from a YAML configuration — but the underlying structure is always these same 4 columns.
“A Hub has four columns: hash key as CHAR(64), business key as VARCHAR, load_date as TIMESTAMP, and record_source as VARCHAR. All NOT NULL. The hash key is the primary key.”
A Hub gives Ulfric one row, forever — his identity. A Satellite gives Ulfric a NEW row every time his data changes. Level 20 in January, level 30 in March — that is 2 rows in the Satellite, both pointing to the same 1 row in the Hub.
General DV pattern vs ScaleFree practice: The schema above shows the general Data Vault pattern with load_end_date as a stored column. In ScaleFree’s datavault4dbt, the v0 Satellite (the physical, insert-only table) has no load_end_date column at all. Instead, a v1 Satellite (a view on top of v0) computes load_end_date virtually at query time by looking at adjacent rows. The v0 table stays purely append-only — nothing is ever updated after insert.
1. Composite PRIMARY KEY — (hub_character_hk, load_date)
In a Hub, hub_character_hk alone is unique — Ulfric has exactly one row. In a Satellite, Ulfric has MULTIPLE rows (one per change). So hub_character_hk alone would have duplicates. The primary key needs BOTH columns together: “Ulfric’s hash key + January 15 at 10:00” is unique. “Ulfric’s hash key + March 3 at 14:00” is also unique. Neither column alone guarantees uniqueness. Both together do.
2. load_end_date is nullable — no NOT NULL
Look at load_end_date: just TIMESTAMP, no NOT NULL after it. The column is allowed to be empty. Ulfric’s March row is his CURRENT data — it hasn’t ended yet. NULL means “still active.” When April data arrives, March’s load_end_date gets filled in, and April becomes the new current row with NULL.
3. Descriptive columns can be NULL — race, level, faction
The structural columns (hub_character_hk, load_date, record_source, hash_diff) are always NOT NULL — the Satellite row itself must be complete. But the descriptive data it carries can have gaps. Maybe the source system has Ulfric’s level but not his faction.
4. No business_key column
The Satellite connects to the Hub through the hash key column. If you want to see Ulfric’s name alongside his Satellite data, you JOIN the Satellite to the Hub. This avoids storing “Ulfric Stormcloak” in 50 different places.
5. hash_diff — change detection
hash_diff CHAR(64) NOT NULL is a DV 2.0 feature. It is a SHA-256 hash of ALL the descriptive columns combined (race + level + faction). Before inserting a new Satellite row, compare the hash_diff to the current row’s hash_diff. If they match, nothing changed — skip the insert. If they differ, something changed — insert the new row.
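The hash_diff comparison can be sketched with Python's hashlib. The delimiter and column order below are illustrative conventions, not the exact standardization rules a tool like datavault4dbt applies, but the mechanism is the same: hash all descriptive columns, compare against the current row.

```python
import hashlib

def hash_diff(race, level, faction):
    # Combine ALL descriptive columns with a delimiter, then hash.
    # (Real implementations also standardize casing, NULL handling, etc.)
    payload = "|".join([race, str(level), faction])
    return hashlib.sha256(payload.encode()).hexdigest()

current = hash_diff("Nord", 20, "Stormcloak")  # what the Satellite holds now
resend  = hash_diff("Nord", 20, "Stormcloak")  # source sends the same data again
changed = hash_diff("Nord", 30, "Stormcloak")  # source sends a level-up

print(len(current))       # 64, SHA-256 hex output, hence CHAR(64)
print(resend == current)  # True: nothing changed, skip the insert
print(changed == current) # False: something changed, insert a new row
```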
| Hub | Satellite |
|---|---|
| 4 columns | Structural + descriptive columns |
| All NOT NULL | Descriptive columns can be NULL |
| PK = hash key alone | PK = hash key + load_date |
| Has business_key | No business_key |
| One row per entity | Multiple rows per entity (snapshots) |
| No hash_diff | Has hash_diff for change detection |
A Satellite stores the changing details of an entity over time. Ulfric at level 20 is one row. Ulfric at level 30 is another row. The primary key uses two columns — who (hash key) and when (load date) — because the who alone would have duplicates.
The composite primary key (hub_character_hk, load_date) works because no entity can have two different snapshots at the exact same timestamp. load_end_date being nullable marks which row is current (NULL = still active). hash_diff prevents unnecessary inserts — if the source sends Ulfric’s data again but nothing changed, the hash_diff matches and no new row is created.
The Satellite schema separates structural columns (all NOT NULL: hash key, load_date, record_source, hash_diff) from descriptive columns (nullable: race, level, faction). This separation means the Satellite’s own integrity is always guaranteed even when source data has gaps. The foreign key relationship to the Hub means the Satellite cannot reference a nonexistent entity — the Hub row must exist first.
Satellites are the highest-volume DV table type — one entity might generate hundreds of rows over its lifetime. The hash_diff optimization is critical at scale: a source that sends Ulfric’s data daily but only changes his level monthly produces 1 new Satellite row per month, not 30. load_end_date enables “point-in-time” queries: “What was Ulfric’s data as of February 15?” — find the row where load_date <= Feb 15 and (load_end_date > Feb 15 OR load_end_date IS NULL).
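That point-in-time predicate can be run as-is. A minimal sketch in SQLite, with two Satellite snapshots for Ulfric (the hash key value and dates are illustrative placeholders, and string comparison works here because ISO timestamps sort lexicographically):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Sat_Character_Details (
        hub_character_hk CHAR(64),
        load_date TIMESTAMP,
        load_end_date TIMESTAMP,
        level INTEGER
    )
""")
cur.executemany("INSERT INTO Sat_Character_Details VALUES (?, ?, ?, ?)", [
    ("hk_ulfric", "2026-01-15 10:00:00", "2026-03-03 14:00:00", 20),  # ended
    ("hk_ulfric", "2026-03-03 14:00:00", None, 30),  # current (NULL end date)
])

# "What was Ulfric's level as of February 15?"
level = cur.execute("""
    SELECT level FROM Sat_Character_Details
    WHERE load_date <= '2026-02-15'
      AND (load_end_date > '2026-02-15' OR load_end_date IS NULL)
""").fetchone()[0]
print(level)  # 20: the January snapshot was current on February 15
```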
“A Satellite has a composite primary key — hash key plus load_date — because it stores multiple snapshots over time. It adds hash_diff for change detection and allows NULL in descriptive columns. It connects to the Hub via the hash key, not via business_key.”
A Link records that a relationship EXISTS between two entities. Ulfric wields a Steel Sword — that pairing is the Link row. The Link does not store details about the relationship (like “equipped since January”). It just records: these two things are connected.
link_character_weapon_hk CHAR(64) NOT NULL — The Link’s own hash key. Generated by hashing the COMBINATION of hub_character_hk + hub_weapon_hk. Same pairing always produces the same link hash key.
load_date TIMESTAMP NOT NULL — When this relationship was first recorded.
record_source VARCHAR(100) NOT NULL — Which system reported the relationship.
hub_character_hk CHAR(64) NOT NULL — Points to Ulfric’s row in Hub_Character. This is a foreign key — it references the Hub.
hub_weapon_hk CHAR(64) NOT NULL — Points to Steel Sword’s row in Hub_Weapon. Another foreign key.
PRIMARY KEY (link_character_weapon_hk) — Simple primary key on the link hash key, just like a Hub. NOT composite like a Satellite.
1. Multiple foreign keys. A Hub has 0 foreign keys. A Link has one per Hub it connects. This Link has 2: hub_character_hk and hub_weapon_hk. A Link connecting 3 Hubs would have 3 foreign keys.
2. Its own hash key. Derived from the combination of the Hub hash keys it references. “Ulfric’s hash + Steel Sword’s hash” → hashed together → produces the link hash key.
3. No business_key. The RELATIONSHIP between them does not have a human-readable name. The link hash key serves as the identifier.
4. No descriptive columns. “When did Ulfric equip the Steel Sword?” goes in a Link Satellite, NOT in the Link itself. The Link records existence of the relationship — nothing more.
5. Simple PRIMARY KEY. Like a Hub: one column (the link hash key). Not composite like a Satellite. A specific pairing only gets recorded once.
All columns NOT NULL — like a Hub. A relationship where you do not know what it connects is not a relationship.
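The "derived hash key" point is worth seeing concretely. A sketch using hashlib and SQLite: the link hash key is computed from the two Hub hash keys, so the same pairing always produces the same key. The hashing convention (delimiter, order) is illustrative.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE Link_Character_Weapon (
        link_character_weapon_hk CHAR(64) NOT NULL,
        load_date TIMESTAMP NOT NULL,
        record_source VARCHAR(100) NOT NULL,
        hub_character_hk CHAR(64) NOT NULL,
        hub_weapon_hk CHAR(64) NOT NULL,
        PRIMARY KEY (link_character_weapon_hk)
    )
""")

char_hk = hashlib.sha256(b"Ulfric").hexdigest()
weapon_hk = hashlib.sha256(b"Steel Sword").hexdigest()
# The Link's own hash key: the COMBINATION of the two Hub hash keys, hashed.
link_hk = hashlib.sha256(f"{char_hk}|{weapon_hk}".encode()).hexdigest()

cur.execute(
    "INSERT INTO Link_Character_Weapon VALUES (?, ?, ?, ?, ?)",
    (link_hk, "2026-01-15 10:00:00", "Skyrim_NPC_Database", char_hk, weapon_hk),
)
row_total = cur.execute("SELECT COUNT(*) FROM Link_Character_Weapon").fetchone()[0]
print(len(link_hk), row_total)  # 64 1
```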
| Hub | Satellite | Link |
|---|---|---|
| PK: hash key | PK: hash key + load_date | PK: link hash key |
| Has business_key | No business_key | No business_key |
| All NOT NULL | Descriptive = nullable | All NOT NULL |
| One entity | Snapshots over time | Multi-FK (Hub HKs) |
A Link records that two things are connected. Ulfric wields a Steel Sword — that is a Link row. It stores pointers to both Hubs (character + weapon) and has its own ID.
The Link’s hash key is generated from the combination of the Hub hash keys. The same pairing always produces the same link hash key, even when reported by different source systems. No descriptive data — if the relationship has attributes, those go in a separate Link Satellite.
Links are structurally similar to Hubs: all NOT NULL, simple PK. The difference is the foreign key columns — one per connected Hub. In a normalized Data Vault, Links capture M:N relationships natively. The link hash key is a deterministic function of its component hub hash keys, enabling parallel loading without coordination.
Links are the reason Data Vault handles schema evolution gracefully. Adding a new entity to a relationship means adding a new foreign key column or creating a new Link. Relationships in traditional star schemas are baked into fact table grain; in DV, they are first-class objects that can be queried, versioned (via Link Satellites), and audited independently.
“A Link connects Hubs by storing their hash keys as foreign keys, plus its own link hash key as the primary key. No business key, no descriptive columns — if the relationship has attributes, those go in a Link Satellite.”
How many characters per faction? You can’t answer that by reading one row.
“GROUP BY” — everyday meaning: arrange into groups. SQL meaning: collapse rows that share a value into one result row per group.
“COUNT” — everyday meaning: tally. SQL meaning: an aggregate function that counts rows in each group.
“Aggregate” — from Latin aggregare, “to add to a flock.” Aggregation = gathering individual items into a collective answer.
You write it in this order (what you want the output to look like):
The database runs it in this order (execution order, not write order):
Step 1 — FROM: The database starts with all 6 rows in Sat_Character_Details.
Step 2 — GROUP BY faction: The database sorts the 6 rows into piles based on the faction column value.
Step 3 — COUNT(*): For each pile, the database counts the rows. This runs after GROUP BY, even though it appears before it in the query.
Step 4 — SELECT: Assembles the output: faction label + count, named member_count (alias — “henceforth called”).
| business_key | faction |
|---|---|
| Ulfric | Stormcloak |
| Galmar | Stormcloak |
| Ralof | Stormcloak |
| Lydia | Companions |
| Aela | Companions |
| Nazeem | None |
| faction | member_count |
|---|---|
| Stormcloak | 3 |
| Companions | 2 |
| None | 1 |
6 input rows became 3 output rows. Each output row represents one group. GROUP BY collapsed the individual rows into group summaries.
Run SELECT COUNT(*) FROM Sat_Character_Details with no GROUP BY and the result is one single row containing 6. Without GROUP BY, COUNT treats the ENTIRE table as one group and counts all rows. No faction breakdown — just the total.
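The whole grouping example runs end to end in SQLite with the same six rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Sat_Character_Details "
    "(business_key VARCHAR(255), faction VARCHAR(100))"
)
cur.executemany("INSERT INTO Sat_Character_Details VALUES (?, ?)", [
    ("Ulfric", "Stormcloak"), ("Galmar", "Stormcloak"), ("Ralof", "Stormcloak"),
    ("Lydia", "Companions"), ("Aela", "Companions"), ("Nazeem", "None"),
])

# 6 input rows collapse into one output row per faction:
grouped = cur.execute("""
    SELECT faction, COUNT(*) AS member_count
    FROM Sat_Character_Details
    GROUP BY faction
    ORDER BY member_count DESC
""").fetchall()
print(grouped)  # [('Stormcloak', 3), ('Companions', 2), ('None', 1)]

# Without GROUP BY, the entire table is one group:
total = cur.execute("SELECT COUNT(*) FROM Sat_Character_Details").fetchone()[0]
print(total)  # 6
```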
GROUP BY sorts rows into piles by a column value, then COUNT tells you how many rows are in each pile. 6 character rows grouped by faction becomes 3 output rows: Stormcloak (3), Companions (2), None (1).
The database processes FROM first (get all rows), then WHERE (filter individual rows), then GROUP BY (form piles), then HAVING (filter piles), then SELECT (compute COUNT per pile). WHERE cannot use COUNT because groups have not been formed yet. HAVING can because groups already exist.
GROUP BY + aggregate functions (COUNT, SUM, AVG, MIN, MAX) transform row-level data into group-level summaries. Every column in SELECT must either appear in GROUP BY or be inside an aggregate function — because the output has one row per group. SELECT faction, COUNT(*) works because faction is in GROUP BY and COUNT is an aggregate. SELECT faction, level would fail because level is not grouped and not aggregated — which of Ulfric’s, Galmar’s, and Ralof’s levels should it show?
In Data Vault contexts, COUNT + GROUP BY is how you answer operational questions against Satellites: “How many snapshot rows per entity?” (quality check), “How many characters changed faction last month?” (business intelligence). Combined with HAVING, you can find anomalies: HAVING COUNT(*) > 100 flags entities with suspiciously many changes — potential data quality issues.
COUNT(*) counts all rows including those with NULL values. COUNT(column_name) counts rows where that column is not NULL.

| | WHERE | HAVING |
|---|---|---|
| Filters | Individual rows | Groups |
| When | Before GROUP BY | After GROUP BY |
| Can use COUNT? | No — groups don’t exist yet | Yes — groups already formed |
| Example | WHERE faction != 'None' | HAVING COUNT(*) > 1 |
Start with 6 rows. WHERE removes Nazeem (faction = ‘None’). 5 rows remain. GROUP BY sorts them: Stormcloak (3), Companions (2). Result: 2 output rows.
Start with 6 rows. No WHERE, all 6 proceed. GROUP BY sorts: Stormcloak (3), Companions (2), None (1). HAVING checks: 3 > 1? Keep. 2 > 1? Keep. 1 > 1? Remove. Result: 2 output rows — same result, different mechanism.
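Both routes can be checked side by side in SQLite, using the same six faction rows as above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(
    "CREATE TABLE Sat_Character_Details "
    "(business_key VARCHAR(255), faction VARCHAR(100))"
)
cur.executemany("INSERT INTO Sat_Character_Details VALUES (?, ?)", [
    ("Ulfric", "Stormcloak"), ("Galmar", "Stormcloak"), ("Ralof", "Stormcloak"),
    ("Lydia", "Companions"), ("Aela", "Companions"), ("Nazeem", "None"),
])

# WHERE filters rows BEFORE grouping: Nazeem never reaches a pile.
where_rows = cur.execute("""
    SELECT faction, COUNT(*) FROM Sat_Character_Details
    WHERE faction != 'None'
    GROUP BY faction
""").fetchall()

# HAVING filters piles AFTER counting: the 'None' pile forms, then is dropped.
having_rows = cur.execute("""
    SELECT faction, COUNT(*) FROM Sat_Character_Details
    GROUP BY faction
    HAVING COUNT(*) > 1
""").fetchall()

print(sorted(where_rows) == sorted(having_rows))  # True
```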
FROM → WHERE → GROUP BY → HAVING → SELECT → ORDER BY
WHERE runs before GROUP BY. HAVING runs after. That’s why WHERE can’t use COUNT(*) — the groups don’t exist yet when WHERE runs.
“GROUP BY answers questions about groups of rows. Given 6 character rows, to find how many are in each faction, GROUP BY faction and COUNT the rows in each group. The output has one row per group. HAVING filters groups after counting — for example, only factions with more than two members.”
When building a full Data Vault in SQL, the order matters because of dependencies. You cannot reference a table that does not exist yet.
Hubs first — they stand alone. Hub_Character and Hub_Weapon have no foreign keys. They reference nothing. Can be created in any order, even in parallel.
Links second — Link_Character_Weapon references hub_character_hk and hub_weapon_hk. Those columns point to the Hubs. If the Hubs do not exist yet, the Link cannot reference them.
Satellites last — Sat_Character_Details references hub_character_hk. The Hub must exist first. A Link Satellite references link_character_weapon_hk. The Link must exist first.
This is the same as the loading order in production: Hubs load first (parallel, no dependencies), then Links (need Hub hash keys to exist), then Satellites (need Hub or Link hash keys to exist).
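The dependency can be made enforceable with explicit FOREIGN KEY clauses. A sketch under two stated assumptions: the lesson's DDL leaves the foreign keys implicit (they are added here via REFERENCES to make the dependency visible), and SQLite needs PRAGMA foreign_keys = ON before it checks them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite has FK checks off by default
cur = conn.cursor()

# 1. Hubs first: no dependencies, any order, even in parallel.
cur.execute("CREATE TABLE Hub_Character (hub_character_hk CHAR(64) PRIMARY KEY)")
cur.execute("CREATE TABLE Hub_Weapon (hub_weapon_hk CHAR(64) PRIMARY KEY)")

# 2. Links second: they reference the Hub hash keys.
cur.execute("""
    CREATE TABLE Link_Character_Weapon (
        link_character_weapon_hk CHAR(64) PRIMARY KEY,
        hub_character_hk CHAR(64) REFERENCES Hub_Character (hub_character_hk),
        hub_weapon_hk CHAR(64) REFERENCES Hub_Weapon (hub_weapon_hk)
    )
""")

# 3. Satellites last: they reference a Hub (or Link) hash key.
cur.execute("""
    CREATE TABLE Sat_Character_Details (
        hub_character_hk CHAR(64) REFERENCES Hub_Character (hub_character_hk),
        load_date TIMESTAMP NOT NULL,
        PRIMARY KEY (hub_character_hk, load_date)
    )
""")

# Loading follows the same order: a Satellite row for a missing Hub row fails.
try:
    cur.execute(
        "INSERT INTO Sat_Character_Details VALUES "
        "('no_such_hash_key', '2026-01-15 10:00:00')"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```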
“Hubs first because they have no dependencies — they stand alone. Links second because they reference Hub hash keys. Satellites last because they reference Hub or Link hash keys. Same order as loading in production.”
“Which SQL statement defines structure vs reads data?”
Runs once to set up the table: CREATE TABLE — because it is a DDL statement that defines the schema.
Runs every time you query: SELECT — because it is a DML statement that reads existing data.
The key distinction is define vs read. CREATE TABLE defines structure once. SELECT reads data repeatedly.
Exercise 1 — CREATE vs SELECT
Match each description to the correct SQL statement.
“What data type should hub_character_hk use?”
Answer: CHAR(64)
Hash keys are always exactly 64 characters (SHA-256 hex output). CHAR is for fixed-length data. VARCHAR would also work but adds needless overhead, since the length never varies. CHAR(64) is the standard for every hash key in Data Vault.
Exercise 2 — Choose the Data Type
For each DV column, pick the correct data type.
hub_character_hk — SHA-256 hash, always 64 characters
business_key — character name, variable length
load_date — when the data was loaded into the vault
level — numeric character level
record_source — name of the source system

“Write the CREATE TABLE for a Hub_Player with hash key hub_player_hk.”
CREATE TABLE Hub_Player (
hub_player_hk CHAR(64) NOT NULL,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL,
business_key VARCHAR(255) NOT NULL,
PRIMARY KEY (hub_player_hk)
);
Every Hub follows the same 4-column pattern: hash key as CHAR(64), load_date as TIMESTAMP, record_source as VARCHAR, business_key as VARCHAR. Metadata first, then content. All NOT NULL. The hash key is the PRIMARY KEY. Only the table name and hash key column name change between Hubs.
Exercise 3 — Write Hub_Weapon
Your turn: Write the CREATE TABLE for Hub_Weapon
Hub_Weapon stores weapon identities. Same 4-column pattern as Hub_Character. The hash key column is hub_weapon_hk.
“Write the CREATE TABLE for Sat_Character_Details with descriptive columns race, level, and faction.”
CREATE TABLE Sat_Character_Details (
hub_character_hk CHAR(64) NOT NULL,
load_date TIMESTAMP NOT NULL,
load_end_date TIMESTAMP,
record_source VARCHAR(100) NOT NULL,
hash_diff CHAR(64) NOT NULL,
race VARCHAR(50),
level INTEGER,
faction VARCHAR(100),
PRIMARY KEY (hub_character_hk, load_date)
);
Satellite pattern: composite PK (hash key + load_date), nullable load_end_date (NULL = current), hash_diff for change detection, descriptive columns can be NULL, no business_key. Structural columns are all NOT NULL; descriptive columns are not.
Exercise 4 — Write Sat_Weapon_Details
Your turn: Write the CREATE TABLE for Sat_Weapon_Details
This Satellite tracks weapon attributes over time. It connects to Hub_Weapon via hub_weapon_hk. Descriptive columns: damage (INTEGER), weapon_type (VARCHAR(50)), enchantment (VARCHAR(100)). Remember: composite PK, nullable descriptive columns, no business_key, hash_diff for change detection.
“Write the CREATE TABLE for Link_Character_Weapon connecting Hub_Character and Hub_Weapon.”
CREATE TABLE Link_Character_Weapon (
link_character_weapon_hk CHAR(64) NOT NULL,
load_date TIMESTAMP NOT NULL,
record_source VARCHAR(100) NOT NULL,
hub_character_hk CHAR(64) NOT NULL,
hub_weapon_hk CHAR(64) NOT NULL,
PRIMARY KEY (link_character_weapon_hk)
);
Link pattern: own hash key (derived from Hub HKs), load_date, record_source, then one foreign key per Hub it connects. Metadata first, then content. All NOT NULL. Simple PK on the link hash key (not composite). No business_key, no descriptive columns.
Exercise 5 — Write Link_Character_Location
Your turn: Write the CREATE TABLE for Link_Character_Location
This Link connects Hub_Character to Hub_Location. The link hash key is link_character_location_hk. It references hub_character_hk and hub_location_hk. Same pattern as Link_Character_Weapon.
“Given 4 rows where 2 have race = ‘Nord’ and 2 have race = ‘Breton’, what does SELECT race, COUNT(*) FROM Sat_Character_Details GROUP BY race return?”
Answer: 2 output rows — Nord (2) and Breton (2). GROUP BY creates one pile per unique race value, COUNT(*) tallies each pile.
4 input rows → 2 groups (Nord, Breton) → 2 output rows. Each output row represents one group.
Exercise 6 — Predict the Output
Given the faction data (Ulfric/Galmar/Ralof = Stormcloak, Lydia/Aela = Companions, Nazeem = None) and the query SELECT faction, COUNT(*) FROM Sat_Character_Details GROUP BY faction:
“I want only factions with more than 1 member. WHERE or HAVING?”
Answer: HAVING — because “more than 1 member” requires COUNT, and COUNT only works after groups are formed. HAVING runs after GROUP BY; WHERE runs before.
The rule: if the filter needs an aggregate (COUNT, SUM, AVG), use HAVING. If the filter is about individual row values, use WHERE.
Exercise 7 — WHERE or HAVING?
For each scenario, choose whether you’d use WHERE or HAVING.