From Knowing
to Producing
No new concepts. Just you, the SQL, and the schemas you already understand.
The difference between * and % is tiny on a keyboard. In SQL, it’s the difference between seeing everything and searching for a pattern.
| Symbol | Where | Meaning |
|---|---|---|
| * | SELECT | All columns |
| % | LIKE | Any characters (wildcard) |
| . | table.column | Possessive (owner.property) |
| _ | LIKE | Exactly one character |
| 'text' | WHERE | String value (data) |
| "name" | FROM/SELECT | Identifier (schema name) |
| != | WHERE | Not equal (same as <>) |
| AS | SELECT / FROM | Rename (column or table) |
* vs % — Column Selector vs String Wildcard
SELECT * FROM Hub_Character — the asterisk means “give me every column.” It lives in SELECT and has nothing to do with string matching.
WHERE business_key LIKE 'Ulfric%' — the percent sign means “any characters after this point” inside a LIKE pattern. It lives in WHERE and has nothing to do with columns.
They look similar on a keyboard. In SQL, they live in completely different clauses and do completely different jobs.
. vs _ — Possessive vs Single-Char Wildcard
h.business_key — the dot is the possessive operator. Read it as “h’s business_key.” It separates an owner (table or alias) from a property (column).
WHERE name LIKE 'Ulfr_c' — the underscore matches exactly one character. “Ulfric” matches. So does “Ulfroc.” It is a pattern wildcard, not a structural separator.
One is syntax structure. The other is pattern matching. No overlap.
'value' vs "identifier" — Data vs Schema Names
WHERE record_source = 'Skyrim_NPC_DB' — single quotes wrap string values. This is data you’re comparing against.
SELECT "my column" FROM "weird table" — double quotes wrap identifiers (column or table names) that need quoting because they have spaces or conflict with reserved words.
In practice: single quotes almost always. Double quotes almost never unless a name is unusual.
!= vs <> — Two Ways to Say Not Equal
Both are valid in Snowflake. WHERE record_source != 'Skyrim_NPC_DB' and WHERE record_source <> 'Skyrim_NPC_DB' do the same thing.
No trick here. Either form is valid and both are widely used.
AS — Two Grammatical Roles, Same Keyword
SELECT h.business_key AS customer_name — after a column in SELECT, AS renames the output column.
FROM Hub_Character AS h — after a table in FROM, AS gives the table a shorthand name for this query.
Same keyword, two positions, two meanings. The position tells you which role it plays.
- * is a column selector in SELECT — “all columns.” It is NOT a wildcard for string patterns; that is % in LIKE.
- . is a possessive separator between owner and property (table.column). _ is NOT a separator; it is a single-character wildcard inside LIKE patterns.
- 'single quotes' wrap string data values. "double quotes" do NOT wrap data; they wrap identifiers (column/table names).

“You want all characters from Hub_Character (alias: h) whose name starts with ‘Dragonborn’ from the source ‘Skyrim_DB’. Write the query.”
SELECT * FROM Hub_Character AS h WHERE h.business_key LIKE 'Dragonborn%' AND h.record_source = 'Skyrim_DB'
* in SELECT: all columns. AS h in FROM: alias. h.business_key: the dot is possessive (h’s business_key). LIKE 'Dragonborn%': % is the string wildcard. 'Skyrim_DB': single quotes around the data value. Every symbol in its correct domain.
Exercise 1 — Symbol Discrimination
Match each scenario to the correct symbol or syntax.
A Hub has exactly 4 columns. You know all of them. Can you write them from memory?
hub_character_hk — CHAR(64) NOT NULL
The hash key. SHA-256 always produces exactly 64 hex characters — never shorter, never longer. Fixed length means CHAR, not VARCHAR. This is the primary key: one hash per unique business key, forever.
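You can see the fixed length for yourself in Snowflake (the input value is just an illustration):

```sql
-- Snowflake: SHA2 with bit length 256 returns a 64-character hex string
SELECT
    SHA2('Ulfric', 256)         AS hash_key,
    LENGTH(SHA2('Ulfric', 256)) AS hash_length;  -- always 64, for any input
```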
load_date — TIMESTAMP NOT NULL
When this entity was first loaded into the vault. TIMESTAMP gives you date AND time precision. This is metadata — the Hub records when it learned about each entity.
record_source — VARCHAR(100) NOT NULL
Which source system this entity came from. “Skyrim_NPC_Database” or “Elder_Scrolls_Online_API” — lengths vary, so VARCHAR. Data provenance: where did this come from?
business_key — VARCHAR(255) NOT NULL
The original identifier from the source system. “Ulfric” is 6 characters; “Archmage Savos Aren” is 19. Names vary in length, so VARCHAR stretches to fit. The hash key is derived from this, but we keep the original too.
All NOT NULL — Every Column Required
A Hub row without any of these four values is meaningless. No hash key? Can’t identify it. No business key? Don’t know what it represents. No load date? Don’t know when it arrived. No source? Don’t know where it came from. Every column is structurally essential.
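Assembled, the four columns described above give a Hub definition along these lines (a reference sketch; try writing it yourself first):

```sql
CREATE TABLE Hub_Character (
    hub_character_hk CHAR(64)     NOT NULL,  -- SHA-256 hash key: fixed 64 hex chars
    load_date        TIMESTAMP    NOT NULL,  -- when the vault first saw this entity
    record_source    VARCHAR(100) NOT NULL,  -- which source system it came from
    business_key     VARCHAR(255) NOT NULL,  -- the original source identifier
    PRIMARY KEY (hub_character_hk)
);
```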
“Write CREATE TABLE for Hub_Faction — a Hub tracking faction entities (Stormcloaks, Companions, etc.).”
```sql
CREATE TABLE Hub_Faction (
    hub_faction_hk CHAR(64)     NOT NULL,
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL,
    business_key   VARCHAR(255) NOT NULL,
    PRIMARY KEY (hub_faction_hk)
);
```
Same 4-column pattern as Hub_Character. Only two things change: the table name and the hash key column name. The structure is always the same for every Hub.
WRITE FROM MEMORY — Hub_Character
Write the complete CREATE TABLE for Hub_Character. All 4 columns, correct types, correct constraints, PRIMARY KEY declaration.
Links connect Hubs. They have their own hash key plus the hash keys of every Hub they connect.
link_character_faction_hk — The Link’s Own Hash Key
Every Link has its own hash key, derived from the combination of the foreign hash keys it connects. CHAR(64) — same as any hash key. This is the PK.
load_date + record_source — Same Metadata
Same pattern as the Hub: when was this relationship recorded, and where did it come from? TIMESTAMP and VARCHAR(100), both NOT NULL. Metadata always comes right after the hash key.
hub_character_hk + hub_faction_hk — Foreign Hash Keys
These reference the PKs of Hub_Character and Hub_Faction. They are both CHAR(64) because they ARE hash keys — the same hash keys stored in the corresponding Hubs. This is how the Link “points to” two entities.
No business_key. No descriptive columns.
A Link is purely structural. It records that a relationship exists between two entities. It does not store names, levels, or any attributes. Those belong in Satellites: descriptive relationship attributes in a Satellite hung off the Link, and the relationship’s active/inactive status in an Effectivity Satellite.
“Write a Link connecting Hub_Character and Hub_Weapon. What columns does it have?”
5 columns: link_character_weapon_hk (own HK), load_date, record_source, hub_character_hk (FK), hub_weapon_hk (FK). All CHAR(64) for hash keys, TIMESTAMP for date, VARCHAR(100) for source. All NOT NULL.
Pattern is identical to Link_Character_Faction. Only the table name, the Link’s own hash key name, and the second foreign hash key name change. The structure never varies.
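Written out as DDL, the five columns listed above form a sketch like this:

```sql
CREATE TABLE Link_Character_Weapon (
    link_character_weapon_hk CHAR(64)     NOT NULL,  -- the Link's own hash key
    load_date                TIMESTAMP    NOT NULL,
    record_source            VARCHAR(100) NOT NULL,
    hub_character_hk         CHAR(64)     NOT NULL,  -- FK: Hub_Character's hash key
    hub_weapon_hk            CHAR(64)     NOT NULL,  -- FK: Hub_Weapon's hash key
    PRIMARY KEY (link_character_weapon_hk)
);
```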
WRITE FROM MEMORY — Link_Character_Faction
Write the complete CREATE TABLE for Link_Character_Faction. 5 columns: own hash key, load_date, record_source, two foreign hash keys. Correct types and constraints.
Satellites track change. Their secret weapon: a composite primary key that says who changed AND when.
Composite PK: (hub_character_hk, load_date)
The same character can have multiple rows as their attributes change over time. The combination of WHO (hash key) and WHEN (load date) uniquely identifies each version. This is why it’s a composite key — neither column alone is sufficient.
hash_diff — CHAR(64) NOT NULL
A fingerprint of all descriptive columns. If the incoming hash_diff matches the last known one for this entity, nothing changed — don’t insert. If it differs, something changed — insert a new row. Change detection without comparing columns one by one.
Descriptive Columns ARE Nullable
character_name VARCHAR(255), guild VARCHAR(100), level INTEGER — notice no NOT NULL. You might receive a record where only some attributes are populated. NULL is a valid state for descriptive data. This is a key contrast: Hub/Link columns are all NOT NULL. Satellite descriptive columns are nullable.
No load_end_date in v0
In the ScaleFree/datavault4dbt implementation, the physical (v0) Satellite table does NOT have a load_end_date column. The v1 VIEW computes it virtually using the LEAD() window function. If you add load_end_date to the v0 CREATE TABLE, it’s incorrect for the ScaleFree pattern.
v0 Satellite = the physical table you write with CREATE TABLE (no load_end_date). v1 Satellite = a VIEW on top of v0 that computes load_end_date using LEAD(). The key point about load_end_date: it exists as a computed value, not a stored column.
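A minimal sketch of such a v1 VIEW — the view and table names and the column list are illustrative, and datavault4dbt’s generated version is more elaborate:

```sql
CREATE VIEW sat_character_details_v1 AS
SELECT
    hub_character_hk,
    load_date,
    -- the next version's load_date becomes this version's end date;
    -- NULL for the current (latest) row
    LEAD(load_date) OVER (
        PARTITION BY hub_character_hk
        ORDER BY load_date
    ) AS load_end_date,
    record_source,
    hash_diff,
    character_name,
    guild,
    level
FROM sat_character_details_v0;  -- the physical v0 table
```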
“Write a Satellite tracking a faction’s description and member_count, attached to Hub_Faction.”
Sat_Faction_Details with: hub_faction_hk CHAR(64) NOT NULL, load_date TIMESTAMP NOT NULL, record_source VARCHAR(100) NOT NULL, hash_diff CHAR(64) NOT NULL, description VARCHAR(500) (nullable), member_count INTEGER (nullable), PRIMARY KEY (hub_faction_hk, load_date).
Same structural pattern as Sat_Character_Details. The structural columns (hash key, load_date, hash_diff, record_source) are always the same and always NOT NULL. Only the descriptive columns change between Satellites — and they’re always nullable.
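The column list above, written out as a DDL sketch:

```sql
CREATE TABLE Sat_Faction_Details (
    hub_faction_hk CHAR(64)     NOT NULL,  -- parent Hub's hash key
    load_date      TIMESTAMP    NOT NULL,
    record_source  VARCHAR(100) NOT NULL,
    hash_diff      CHAR(64)     NOT NULL,  -- fingerprint of the descriptive columns
    description    VARCHAR(500),           -- nullable: descriptive data may be missing
    member_count   INTEGER,                -- nullable
    PRIMARY KEY (hub_faction_hk, load_date)  -- composite: WHO + WHEN
);
```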
WRITE FROM MEMORY — Sat_Character_Details
Write the complete CREATE TABLE for Sat_Character_Details. Structural columns (hub_character_hk, load_date, record_source, hash_diff) + descriptive columns (character_name, guild, level). Correct constraints. Composite PK.
You wrote a Hub from scratch in 7 lines of SQL. datavault4dbt writes it from 7 lines of Jinja.
config(materialized='incremental')
Tells dbt to create this as an incremental table — only process new records on each run, not reload everything. This matches the DV insert-only pattern: never delete old rows, only add new ones.
set yaml_metadata — Your Configuration
This block tells the macro what YOUR data looks like. It’s a YAML-formatted variable: the hash key name, which columns form the business key, and which staging model to read from. This is the only thing that changes between Hub model files.
hashkey + business_keys + source_models
hashkey = the output hash key column name. business_keys = the column(s) from staging that get hashed. source_models = which upstream model to read from. Three config values — the macro handles everything else.
datavault4dbt.hub() — The Macro Call
This single line expands into ~50+ lines of SQL. The macro knows the DV loading pattern — insert-only, deduplication, hash key matching — and generates all of it. You specify the “what” (3 config values). The macro handles the “how” (the full SQL).
Where are load_date and record_source?
You only specify the business_key. The macro adds load_date and record_source automatically using dbt’s standard metadata columns (LDTS and RSRC from the staging model). The macro knows DV conventions — you only tell it what’s specific to your entity.
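Assembled from the pieces above, the Hub model file looks roughly like this. Treat it as a sketch: the exact yaml_metadata keys and macro syntax can differ between datavault4dbt versions, and the file path is illustrative.

```sql
-- models/raw_vault/hub_character.sql (illustrative path)
{{ config(materialized='incremental') }}

{%- set yaml_metadata -%}
hashkey: 'hub_character_hk'
business_keys:
    - character_name
source_models: stage_character
{%- endset -%}

{{ datavault4dbt.hub(yaml_metadata=yaml_metadata) }}
```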
Jinja is a templating language (originally from Python). In dbt, it lets you write SQL templates with variables, loops, and macros. The {{ }} syntax wraps Jinja expressions. The {%- -%} syntax wraps Jinja statements (like setting variables). dbt compiles these templates into plain SQL before running them against the database.
An incremental model only processes NEW rows since the last run. On first run, it loads everything. On subsequent runs, it checks what’s new in the staging model and only inserts those rows. This is efficient for large datasets — you don’t re-process data you’ve already loaded. It aligns perfectly with DV’s insert-only pattern.
The staging model (stage_character) has columns named LDTS (load date timestamp) and RSRC (record source). The macro maps these to load_date and record_source in the Hub table. These column names are configured globally in the dbt project settings, so every model uses the same convention without repeating it.
Exercise 3 — dbt Model Components
Match each dbt model component to what it does.
- config(materialized='incremental')
- hashkey: 'hub_character_hk'
- business_keys: - character_name
- source_models: stage_character
- datavault4dbt.hub(yaml_metadata=...)

This is the SQL your 7-line model file actually generates. Every line maps to something you already know.
INSERT INTO hub_character (...)
DML — adds rows to an existing table. The table was already created by CREATE TABLE (which ran at deployment time). This is the load operation that runs on each dbt execution.
SELECT DISTINCT
Deduplication. The staging area might contain the same character_name multiple times from multiple loads. DISTINCT collapses duplicates before inserting — each character enters the Hub exactly once per run.
SHA2(UPPER(TRIM(...)), 256) — Hash Key Computation
TRIM removes whitespace. UPPER normalizes case (“Ulfric” and “ULFRIC” hash the same). SHA2(..., 256) is Snowflake’s SHA-256 function — it hashes the result to a 64-character hex string, matching the CHAR(64) column. This is where the hash key is actually computed — in the INSERT SQL, not in the CREATE TABLE.
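You can verify the normalization directly (illustrative values):

```sql
-- TRIM strips the padding, UPPER removes the case difference,
-- so both sides hash to the same value
SELECT SHA2(UPPER(TRIM('  ulfric ')), 256) = SHA2('ULFRIC', 256) AS same_hash;  -- TRUE
```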
WHERE NOT EXISTS (...) — The Insert-Only Guard
Only insert a row if that hash key doesn’t already exist in the Hub. If it does, the entity was already registered — do nothing. This enforces the DV principle: a business key enters the Hub exactly once, ever. No updates. No deletes.
Why No UPDATE Statement?
Data Vault is insert-only. Hub rows never change. If a character already exists in Hub_Character, you don’t update anything. Changes in attributes go into Satellites, not Hubs. The WHERE NOT EXISTS clause is the enforcement mechanism.
The staging table might have the same character loaded twice (from two different batch loads, or from two source files). Without DISTINCT, both rows would pass the NOT EXISTS check (since neither is in the Hub yet) and you’d insert duplicates. DISTINCT collapses them to one row before the insert.
SHA2 is Snowflake’s name for the SHA-256 function. The second argument (256) specifies the bit length — SHA-256 produces a 256-bit hash, which encodes to a 64-character hex string. That’s why hash key columns are CHAR(64). MD5 only produces 32 hex characters and has known collision weaknesses. datavault4dbt uses SHA-256 by default. The hash function is configured globally in the dbt project — you set it once and every model uses it.
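Putting the components together, the compiled Hub load is conceptually something like this. It is a simplified sketch: the macro’s actual output is considerably longer, and the column names on the staging side (LDTS, RSRC, character_name) follow the conventions described above.

```sql
INSERT INTO hub_character (hub_character_hk, load_date, record_source, business_key)
SELECT DISTINCT
    SHA2(UPPER(TRIM(stage.character_name)), 256) AS hub_character_hk,  -- hash key computed here
    stage.LDTS           AS load_date,
    stage.RSRC           AS record_source,
    stage.character_name AS business_key
FROM stage_character AS stage
WHERE NOT EXISTS (
    -- insert-only guard: skip business keys already registered in the Hub
    SELECT 1
    FROM hub_character AS hub
    WHERE hub.hub_character_hk = SHA2(UPPER(TRIM(stage.character_name)), 256)
);
```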
Satellite compiled SQL is similar but more complex. Instead of WHERE NOT EXISTS checking the hash key, it compares the hash_diff — only insert a new row if the descriptive data actually changed. The macro handles all of this. The important thing is: same pattern (INSERT with a guard), different guard logic (hash_diff comparison instead of hash_key existence).
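A simplified sketch of that different guard — insert only when the incoming hash_diff differs from the entity’s most recent stored one. The real macro output handles ordering, deduplication, and initial loads more carefully; staging column names here are illustrative.

```sql
INSERT INTO sat_character_details
    (hub_character_hk, load_date, record_source, hash_diff,
     character_name, guild, level)
SELECT stage.hub_character_hk, stage.LDTS, stage.RSRC, stage.hash_diff,
       stage.character_name, stage.guild, stage.level
FROM stage_character AS stage
WHERE NOT EXISTS (
    SELECT 1
    FROM sat_character_details AS sat
    WHERE sat.hub_character_hk = stage.hub_character_hk
      AND sat.hash_diff = stage.hash_diff        -- unchanged data: skip
      AND sat.load_date = (
          -- compare against the latest stored version only
          SELECT MAX(load_date)
          FROM sat_character_details
          WHERE hub_character_hk = stage.hub_character_hk
      )
);
```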
Exercise 4 — Compiled SQL Components
Match each SQL component to its purpose in the compiled Hub INSERT.
- INSERT INTO hub_character
- SELECT DISTINCT
- SHA2(UPPER(TRIM(stage.character_name)), 256)
- WHERE NOT EXISTS (SELECT 1 FROM hub_character ...)
- FROM stage_character AS stage

Symbol Discrimination: * vs %, . vs _, 'value' vs "identifier", != vs <>, AS in two positions. No more mix-ups under pressure.
Hub CREATE TABLE: 4 columns, all NOT NULL, single-column PK on hash key. Written from memory.
Link CREATE TABLE: 5 columns (own HK + 2 foreign HKs + metadata), all NOT NULL, no business_key, no descriptive columns.
Satellite CREATE TABLE: Composite PK (hash_key, load_date), hash_diff for change detection, nullable descriptive columns, no load_end_date in v0.
dbt Model Reading: 7-line model file → ~50 lines of compiled SQL. Config, metadata, macro call. You specify the “what”; the macro generates the “how.”
Compiled SQL: INSERT INTO + SELECT DISTINCT + hash computation + WHERE NOT EXISTS. The insert-only pattern enforced in code.
“After this chapter, you can write Hub, Link, and Satellite CREATE TABLE statements from memory and read datavault4dbt model files — understanding both the SQL the automation generates and why each component exists.”
| What you wrote by hand | What datavault4dbt automates |
|---|---|
| CREATE TABLE Hub_Character (...) | Generated on dbt run from the model's yaml_metadata |
| Hash key is CHAR(64) | SHA2(UPPER(TRIM(business_key)), 256) — 64-char hex |
| NOT NULL on all Hub columns | INSERT only succeeds if staging provides all values |
| Insert-only pattern | WHERE NOT EXISTS enforces no duplicates |
| load_date = when it arrived | Captured automatically from staging LDTS column |
| record_source = where it came from | Captured automatically from staging RSRC column |
Next session: Integration Day. We build a mini Data Vault end to end — CREATE TABLE for Hubs, Links, and Satellites, INSERT sample data, then JOIN queries across them. Everything you’ve done converges.