From Knowing
to Producing
No new concepts. Just you, the SQL, and the schemas you already understand.
The difference between * and % is tiny on a keyboard. In SQL, it’s the difference between seeing everything and searching for a pattern.
| Symbol | Where | Meaning |
|---|---|---|
| * | SELECT | All columns |
| % | LIKE | Any characters (wildcard) |
| . | table.column | Possessive (owner.property) |
| _ | LIKE | Exactly one character |
| 'text' | WHERE | String value (data) |
| "name" | FROM/SELECT | Identifier (schema name) |
| != | WHERE | Not equal (same as <>) |
| AS | SELECT / FROM | Rename (column or table) |
* vs % — Column Selector vs String Wildcard
SELECT * FROM Hub_Character — the asterisk means “give me every column.” It lives in SELECT and has nothing to do with string matching.
WHERE business_key LIKE 'Ulfric%' — the percent sign means “any characters after this point” inside a LIKE pattern. It lives in WHERE and has nothing to do with columns.
They look similar on a keyboard. In SQL, they live in completely different clauses and do completely different jobs.
. vs _ — Possessive vs Single-Char Wildcard
h.business_key — the dot is the possessive operator. Read it as “h’s business_key.” It separates an owner (table or alias) from a property (column).
WHERE name LIKE 'Ulfr_c' — the underscore matches exactly one character. “Ulfric” matches. So does “Ulfroc.” It is a pattern wildcard, not a structural separator.
One is syntax structure. The other is pattern matching. No overlap.
'value' vs "identifier" — Data vs Schema Names
WHERE record_source = 'Skyrim_NPC_DB' — single quotes wrap string values. This is data you’re comparing against.
SELECT "my column" FROM "weird table" — double quotes wrap identifiers (column or table names) that need quoting because they have spaces or conflict with reserved words.
In practice: single quotes almost always. Double quotes almost never unless a name is unusual.
!= vs <> — Two Ways to Say Not Equal
Both are valid in Snowflake. WHERE record_source != 'Skyrim_NPC_DB' and WHERE record_source <> 'Skyrim_NPC_DB' do the same thing.
No trick here. Either form is valid and both are widely used.
AS — Two Grammatical Roles, Same Keyword
SELECT h.business_key AS customer_name — after a column in SELECT, AS renames the output column.
FROM Hub_Character AS h — after a table in FROM, AS gives the table a shorthand name for this query.
Same keyword, two positions, two meanings. The position tells you which role it plays.
- * is a column selector in SELECT — “all columns.” It is NOT a wildcard for string patterns; that is % in LIKE.
- . is a possessive separator between owner and property (table.column). _ is NOT a separator; it is a single-character wildcard inside LIKE patterns.
- 'single quotes' wrap string data values. "double quotes" do NOT wrap data; they wrap identifiers (column/table names).

“You want all characters from Hub_Character (alias: h) whose name starts with ‘Dragonborn’ from the source ‘Skyrim_DB’. Write the query.”
SELECT * FROM Hub_Character AS h WHERE h.business_key LIKE 'Dragonborn%' AND h.record_source = 'Skyrim_DB'
* in SELECT: all columns. AS h in FROM: alias. h.business_key: the dot is possessive (h’s business_key). LIKE 'Dragonborn%': % is the string wildcard. 'Skyrim_DB': single quotes around the data value. Every symbol in its correct domain.
Exercise 1 — Symbol Discrimination
Match each scenario to the correct symbol or syntax.
A Hub has exactly 4 columns. You know all of them. Can you write them from memory?
hub_character_hk — CHAR(64) NOT NULL
The hash key. SHA-256 always produces exactly 64 hex characters — never shorter, never longer. Fixed length means CHAR, not VARCHAR. This is the primary key: one hash per unique business key, forever.
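You can see the fixed length for yourself in Snowflake (the input value is just an illustration):

```sql
-- Snowflake: SHA2 with bit length 256 returns a 64-character hex string
SELECT
    SHA2('Ulfric', 256)         AS hash_key,
    LENGTH(SHA2('Ulfric', 256)) AS hash_length;  -- always 64, for any input
```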
load_date — TIMESTAMP NOT NULL
When this entity was first loaded into the vault. TIMESTAMP gives you date AND time precision. This is metadata — the Hub records when it learned about each entity.
record_source — VARCHAR(100) NOT NULL
Which source system this entity came from. “Skyrim_NPC_Database” or “Elder_Scrolls_Online_API” — lengths vary, so VARCHAR. Data provenance: where did this come from?
business_key — VARCHAR(255) NOT NULL
The original identifier from the source system. “Ulfric” is 6 characters; “Archmage Savos Aren” is 19. Names vary in length, so VARCHAR stretches to fit. The hash key is derived from this, but we keep the original too.
All NOT NULL — Every Column Required
A Hub row without any of these four values is meaningless. No hash key? Can’t identify it. No business key? Don’t know what it represents. No load date? Don’t know when it arrived. No source? Don’t know where it came from. Every column is structurally essential.
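Assembled, the four columns described above give a Hub definition along these lines (a reference sketch; try writing it yourself first):

```sql
CREATE TABLE Hub_Character (
    hub_character_hk CHAR(64)     NOT NULL,  -- SHA-256 hash key: fixed 64 hex chars
    load_date        TIMESTAMP    NOT NULL,  -- when the vault first saw this entity
    record_source    VARCHAR(100) NOT NULL,  -- which source system it came from
    business_key     VARCHAR(255) NOT NULL,  -- the original source identifier
    PRIMARY KEY (hub_character_hk)
);
```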
“Write CREATE TABLE for Hub_Faction — a Hub tracking faction entities (Stormcloaks, Companions, etc.).”
```sql
CREATE TABLE Hub_Faction (
    hub_faction_hk CHAR(64)     NOT NULL,
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL,
    business_key   VARCHAR(255) NOT NULL,
    PRIMARY KEY (hub_faction_hk)
);
```
Same 4-column pattern as Hub_Character. Only two things change: the table name and the hash key column name. The structure is always the same for every Hub.
WRITE FROM MEMORY — Hub_Character
Write the complete CREATE TABLE for Hub_Character. All 4 columns, correct types, correct constraints, PRIMARY KEY declaration.
Links connect Hubs. They have their own hash key plus the hash keys of every Hub they connect.
link_character_faction_hk — The Link’s Own Hash Key
Every Link has its own hash key, derived from the combination of the foreign hash keys it connects. CHAR(64) — same as any hash key. This is the PK.
load_date + record_source — Same Metadata
Same pattern as the Hub: when was this relationship recorded, and where did it come from? TIMESTAMP and VARCHAR(100), both NOT NULL. Metadata always comes right after the hash key.
hub_character_hk + hub_faction_hk — Foreign Hash Keys
These reference the PKs of Hub_Character and Hub_Faction. They are both CHAR(64) because they ARE hash keys — the same hash keys stored in the corresponding Hubs. This is how the Link “points to” two entities.
No business_key. No descriptive columns.
A Link is purely structural. It records that a relationship exists between two entities. It does not store names, levels, or any attributes. Those belong in Satellites: descriptive relationship attributes in a Satellite hung off the Link, and the relationship’s active/inactive status in an Effectivity Satellite.
“Write a Link connecting Hub_Character and Hub_Weapon. What columns does it have?”
5 columns: link_character_weapon_hk (own HK), load_date, record_source, hub_character_hk (FK), hub_weapon_hk (FK). All CHAR(64) for hash keys, TIMESTAMP for date, VARCHAR(100) for source. All NOT NULL.
Pattern is identical to Link_Character_Faction. Only the table name, the Link’s own hash key name, and the second foreign hash key name change. The structure never varies.
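Written out as DDL, the five columns listed above form a sketch like this:

```sql
CREATE TABLE Link_Character_Weapon (
    link_character_weapon_hk CHAR(64)     NOT NULL,  -- the Link's own hash key
    load_date                TIMESTAMP    NOT NULL,
    record_source            VARCHAR(100) NOT NULL,
    hub_character_hk         CHAR(64)     NOT NULL,  -- FK: Hub_Character's hash key
    hub_weapon_hk            CHAR(64)     NOT NULL,  -- FK: Hub_Weapon's hash key
    PRIMARY KEY (link_character_weapon_hk)
);
```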
WRITE FROM MEMORY — Link_Character_Faction
Write the complete CREATE TABLE for Link_Character_Faction. 5 columns: own hash key, load_date, record_source, two foreign hash keys. Correct types and constraints.
Satellites track change. Their secret weapon: a composite primary key that says who changed AND when.
Composite PK: (hub_character_hk, load_date)
The same character can have multiple rows as their attributes change over time. The combination of WHO (hash key) and WHEN (load date) uniquely identifies each version. This is why it’s a composite key — neither column alone is sufficient.
hash_diff — CHAR(64) NOT NULL
A fingerprint of all descriptive columns. If the incoming hash_diff matches the last known one for this entity, nothing changed — don’t insert. If it differs, something changed — insert a new row. Change detection without comparing columns one by one.
Descriptive Columns ARE Nullable
character_name VARCHAR(255), guild VARCHAR(100), level INTEGER — notice no NOT NULL. You might receive a record where only some attributes are populated. NULL is a valid state for descriptive data. This is a key contrast: Hub/Link columns are all NOT NULL. Satellite descriptive columns are nullable.
No load_end_date in v0
In the ScaleFree/datavault4dbt implementation, the physical (v0) Satellite table does NOT have a load_end_date column. The v1 VIEW computes it virtually using the LEAD() window function. If you add load_end_date to the v0 CREATE TABLE, it’s incorrect for the ScaleFree pattern.
v0 Satellite = the physical table you write with CREATE TABLE (no load_end_date). v1 Satellite = a VIEW on top of v0 that computes load_end_date using LEAD(). The key point about load_end_date: it exists as a computed value, not a stored column.
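A minimal sketch of such a v1 VIEW — the view and table names and the column list are illustrative, and datavault4dbt’s generated version is more elaborate:

```sql
CREATE VIEW sat_character_details_v1 AS
SELECT
    hub_character_hk,
    load_date,
    -- the next version's load_date becomes this version's end date;
    -- NULL for the current (latest) row
    LEAD(load_date) OVER (
        PARTITION BY hub_character_hk
        ORDER BY load_date
    ) AS load_end_date,
    record_source,
    hash_diff,
    character_name,
    guild,
    level
FROM sat_character_details_v0;  -- the physical v0 table
```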
“Write a Satellite tracking a faction’s description and member_count, attached to Hub_Faction.”
Sat_Faction_Details with: hub_faction_hk CHAR(64) NOT NULL, load_date TIMESTAMP NOT NULL, record_source VARCHAR(100) NOT NULL, hash_diff CHAR(64) NOT NULL, description VARCHAR(500) (nullable), member_count INTEGER (nullable), PRIMARY KEY (hub_faction_hk, load_date).
Same structural pattern as Sat_Character_Details. The structural columns (hash key, load_date, hash_diff, record_source) are always the same and always NOT NULL. Only the descriptive columns change between Satellites — and they’re always nullable.
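The column list above, written out as a DDL sketch:

```sql
CREATE TABLE Sat_Faction_Details (
    hub_faction_hk CHAR(64)     NOT NULL,  -- parent Hub's hash key
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL,
    hash_diff      CHAR(64)     NOT NULL,  -- fingerprint of the descriptive columns
    description    VARCHAR(500),           -- nullable: descriptive data may be missing
    member_count   INTEGER,                -- nullable
    PRIMARY KEY (hub_faction_hk, load_date)  -- composite: WHO + WHEN
);
```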
WRITE FROM MEMORY — Sat_Character_Details
Write the complete CREATE TABLE for Sat_Character_Details. Structural columns (hub_character_hk, load_date, record_source, hash_diff) + descriptive columns (character_name, guild, level). Correct constraints. Composite PK.
You wrote a Hub from scratch in 7 lines of SQL. datavault4dbt writes it from 7 lines of Jinja.
config(materialized='incremental')
Tells dbt to create this as an incremental table — only process new records on each run, not reload everything. This matches the DV insert-only pattern: never delete old rows, only add new ones.
set yaml_metadata — Your Configuration
This block tells the macro what YOUR data looks like. It’s a YAML-formatted variable: the hash key name, which columns form the business key, and which staging model to read from. This is the only thing that changes between Hub model files.
hashkey + business_keys + source_models
hashkey = the output hash key column name. business_keys = the column(s) from staging that get hashed. source_models = which upstream model to read from. Three config values — the macro handles everything else.
datavault4dbt.hub() — The Macro Call
This single line expands into ~50+ lines of SQL. The macro knows the DV loading pattern — insert-only, deduplication, hash key matching — and generates all of it. You specify the “what” (3 config values). The macro handles the “how” (the full SQL).
Where are load_date and record_source?
You only specify the business_key. The macro adds load_date and record_source automatically using dbt’s standard metadata columns (LDTS and RSRC from the staging model). The macro knows DV conventions — you only tell it what’s specific to your entity.
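Assembled from the pieces above, the Hub model file looks roughly like this. Treat it as a sketch: the exact yaml_metadata keys and macro syntax can differ between datavault4dbt versions, and the file path is illustrative.

```sql
-- models/raw_vault/hub_character.sql (illustrative path)
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: 'hub_character_hk'
business_keys:
    - character_name
source_models: stage_character
{%- endset -%}

{{ datavault4dbt.hub(yaml_metadata=yaml_metadata) }}
```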
Jinja is a templating language (originally from Python). In dbt, it lets you write SQL templates with variables, loops, and macros. The {{ }} syntax wraps Jinja expressions. The {%- -%} syntax wraps Jinja statements (like setting variables). dbt compiles these templates into plain SQL before running them against the database.
An incremental model only processes NEW rows since the last run. On first run, it loads everything. On subsequent runs, it checks what’s new in the staging model and only inserts those rows. This is efficient for large datasets — you don’t re-process data you’ve already loaded. It aligns perfectly with DV’s insert-only pattern.
The staging model (stage_character) has columns named LDTS (load date timestamp) and RSRC (record source). The macro maps these to load_date and record_source in the Hub table. These column names are configured globally in the dbt project settings, so every model uses the same convention without repeating it.
Exercise 3 — dbt Model Components
Match each dbt model component to what it does.
- config(materialized='incremental')
- hashkey: 'hub_character_hk'
- business_keys: - character_name
- source_models: stage_character
- datavault4dbt.hub(yaml_metadata=...)

This is the SQL your 7-line model file actually generates. Every line maps to something you already know.
INSERT INTO hub_character (...)
DML — adds rows to an existing table. The table was already created by CREATE TABLE (which ran at deployment time). This is the load operation that runs on each dbt execution.
SELECT DISTINCT
Deduplication. The staging area might contain the same character_name multiple times from multiple loads. DISTINCT collapses duplicates before inserting — each character enters the Hub exactly once per run.
SHA2(UPPER(TRIM(...)), 256) — Hash Key Computation
TRIM removes whitespace. UPPER normalizes case (“Ulfric” and “ULFRIC” hash the same). SHA2(..., 256) is Snowflake’s SHA-256 function — it hashes the result to a 64-character hex string, matching the CHAR(64) column. This is where the hash key is actually computed — in the INSERT SQL, not in the CREATE TABLE.
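You can verify the normalization directly (illustrative values):

```sql
-- TRIM strips the padding, UPPER removes the case difference,
-- so both sides hash to the same value
SELECT SHA2(UPPER(TRIM('  ulfric ')), 256) = SHA2('ULFRIC', 256) AS same_hash;  -- TRUE
```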
WHERE NOT EXISTS (...) — The Insert-Only Guard
Only insert a row if that hash key doesn’t already exist in the Hub. If it does, the entity was already registered — do nothing. This enforces the DV principle: a business key enters the Hub exactly once, ever. No updates. No deletes.
Why No UPDATE Statement?
Data Vault is insert-only. Hub rows never change. If a character already exists in Hub_Character, you don’t update anything. Changes in attributes go into Satellites, not Hubs. The WHERE NOT EXISTS clause is the enforcement mechanism.
The staging table might have the same character loaded twice (from two different batch loads, or from two source files). Without DISTINCT, both rows would pass the NOT EXISTS check (since neither is in the Hub yet) and you’d insert duplicates. DISTINCT collapses them to one row before the insert.
SHA2 is Snowflake’s name for the SHA-256 function. The second argument (256) specifies the bit length — SHA-256 produces a 256-bit hash, which encodes to a 64-character hex string. That’s why hash key columns are CHAR(64). MD5 only produces 32 hex characters and has known collision weaknesses. datavault4dbt uses SHA-256 by default. The hash function is configured globally in the dbt project — you set it once and every model uses it.
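Putting the components together, the compiled Hub load is conceptually something like this. It is a simplified sketch: the macro’s actual output is considerably longer, and the column names on the staging side (LDTS, RSRC, character_name) follow the conventions described above.

```sql
INSERT INTO hub_character (hub_character_hk, load_date, record_source, business_key)
SELECT DISTINCT
    SHA2(UPPER(TRIM(stage.character_name)), 256) AS hub_character_hk,  -- hash key computed here
    stage.LDTS           AS load_date,
    stage.RSRC           AS record_source,
    stage.character_name AS business_key
FROM stage_character AS stage
WHERE NOT EXISTS (
    -- insert-only guard: skip business keys already registered in the Hub
    SELECT 1
    FROM hub_character AS hub
    WHERE hub.hub_character_hk = SHA2(UPPER(TRIM(stage.character_name)), 256)
);
```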
Satellite compiled SQL is similar but more complex. Instead of WHERE NOT EXISTS checking the hash key, it compares the hash_diff — only insert a new row if the descriptive data actually changed. The macro handles all of this. The important thing is: same pattern (INSERT with a guard), different guard logic (hash_diff comparison instead of hash_key existence).
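A simplified sketch of that different guard — insert only when the incoming hash_diff differs from the entity’s most recent stored one. The real macro output handles ordering, deduplication, and initial loads more carefully; staging column names here are illustrative.

```sql
INSERT INTO sat_character_details
    (hub_character_hk, load_date, record_source, hash_diff,
     character_name, guild, level)
SELECT stage.hub_character_hk, stage.LDTS, stage.RSRC, stage.hash_diff,
       stage.character_name, stage.guild, stage.level
FROM stage_character AS stage
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_character_details AS sat
    WHERE sat.hub_character_hk = stage.hub_character_hk
      AND sat.hash_diff = stage.hash_diff        -- unchanged data: skip
      AND sat.load_date = (
          -- compare against the latest stored version only
          SELECT MAX(load_date)
          FROM sat_character_details
          WHERE hub_character_hk = stage.hub_character_hk
      )
);
```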
Exercise 4 — Compiled SQL Components
Match each SQL component to its purpose in the compiled Hub INSERT.
- INSERT INTO hub_character
- SELECT DISTINCT
- SHA2(UPPER(TRIM(stage.character_name)), 256)
- WHERE NOT EXISTS (SELECT 1 FROM hub_character ...)
- FROM stage_character AS stage

Symbol Discrimination: * vs %, . vs _, 'value' vs "identifier", != vs <>, AS in two positions. No more mix-ups under pressure.
Hub CREATE TABLE: 4 columns, all NOT NULL, single-column PK on hash key. Written from memory.
Link CREATE TABLE: 5 columns (own HK + 2 foreign HKs + metadata), all NOT NULL, no business_key, no descriptive columns.
Satellite CREATE TABLE: Composite PK (hash_key, load_date), hash_diff for change detection, nullable descriptive columns, no load_end_date in v0.
dbt Model Reading: 7-line model file → ~50 lines of compiled SQL. Config, metadata, macro call. You specify the “what”; the macro generates the “how.”
Compiled SQL: INSERT INTO + SELECT DISTINCT + hash computation + WHERE NOT EXISTS. The insert-only pattern enforced in code.
“After this chapter, you can write Hub, Link, and Satellite CREATE TABLE statements from memory and read datavault4dbt model files — understanding both the SQL the automation generates and why each component exists.”
| What you wrote by hand | What datavault4dbt automates |
|---|---|
| CREATE TABLE Hub_Character (...) | Generated on dbt run from the model's yaml_metadata |
| Hash key is CHAR(64) | SHA2(UPPER(TRIM(business_key)), 256) — 64-char hex |
| NOT NULL on all Hub columns | INSERT only succeeds if staging provides all values |
| Insert-only pattern | WHERE NOT EXISTS enforces no duplicates |
| load_date = when it arrived | Captured automatically from staging LDTS column |
| record_source = where it came from | Captured automatically from staging RSRC column |
Next session: Integration Day. We build a mini Data Vault end to end — CREATE TABLE for Hubs, Links, and Satellites, INSERT sample data, then JOIN queries across them. Everything you’ve done converges.