Chapter 6 — The Tools

You’ve been learning Data Vault by hand. In real projects, almost nobody writes the SQL manually. Let’s figure out what actually builds it.

Core

The Order Changed Everything

For decades, data teams cleaned their data before putting it in the warehouse. Then cloud warehouses got powerful enough that everyone asked: wait, why don’t we just load it first and clean it there?

Language Bridge — Extract, Transform, Load

Extract — from Latin extrahere: “to pull out.” Pull data out of source systems.

Transform — from Latin transformare: “to change shape.” Reshape data into the structure you need.

Load — from Old English lad: “course, journey.” Send data on its journey into the warehouse.

Why the order matters: ETL = transform then load. ELT = load then transform. Same three steps, different sequence. That sequence change is the entire revolution.

ETL — The Old Way

Extract → Transform → Load

Data is cleaned, reshaped, and validated before it enters the warehouse.

Like cooking at home, bringing the finished meal to a restaurant, and plating it there.

Why it existed: warehouses were expensive. You didn’t waste precious warehouse compute on messy data.

ELT — The Modern Way

Extract → Load → Transform

Raw data is loaded into the warehouse first, then transformed inside it.

Like delivering all the raw ingredients to a professional kitchen and letting the chefs handle it.

Why it won: cloud warehouses (Snowflake, BigQuery) have massive, elastic compute. Transforming there is cheaper and faster.

4-Layer Explanation — ELT
Layer 1 — Plain language (for anyone)

ELT means: grab data from wherever it lives, dump it into the warehouse as-is, then reshape it there. It’s like moving all your stuff into a new apartment first, then organizing it, instead of sorting everything before the move.

Layer 2 — With mechanism (how it works)

In ETL, a separate tool (like Informatica or Talend) does the heavy transformation work on a dedicated server. In ELT, raw data lands in the warehouse (staging layer), and the warehouse itself runs the transformations using SQL. Cloud warehouses can spin up massive compute clusters on demand, making this approach both faster and cheaper at scale.
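The load-then-transform sequence can be sketched in a few lines of Python, using an in-memory SQLite database as a stand-in warehouse. The table and column names here are invented for illustration; the point is only that raw rows land untouched, and the cleanup is SQL running inside the warehouse.

```python
import sqlite3

# Stand-in "warehouse": an in-memory SQLite database.
conn = sqlite3.connect(":memory:")

# LOAD: raw rows land as-is in a staging table. No cleanup yet.
conn.execute("CREATE TABLE stg_customers (id TEXT, email TEXT)")
raw_rows = [
    ("1", " ALICE@EXAMPLE.COM "),   # messy whitespace and casing
    ("2", "bob@example.com"),
    ("2", "bob@example.com"),       # duplicate row
]
conn.executemany("INSERT INTO stg_customers VALUES (?, ?)", raw_rows)

# TRANSFORM: cleaning happens inside the warehouse, in SQL.
conn.execute("""
    CREATE TABLE customers AS
    SELECT DISTINCT id, LOWER(TRIM(email)) AS email
    FROM stg_customers
""")

rows = conn.execute("SELECT id, email FROM customers ORDER BY id").fetchall()
print(rows)  # duplicates removed, emails normalized
```

In ETL, the deduplication and normalization would have happened in a separate tool before the warehouse ever saw the data; here the warehouse does both.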

Layer 3 — Technical (full detail)

Data Vault is naturally ELT-aligned. Raw data loads into the staging area and Raw Vault with minimal transformation — just hash key generation and structural conformance. Business rules and complex transformations happen in the Business Vault, inside the warehouse. This preserves the source data in its original form (audit trail) while still producing clean, queryable output.

Layer 4 — Expert (nuance & edge cases)

The ETL/ELT distinction is less binary than it sounds. Modern pipelines often do light transformations during extraction (e.g., format conversion, deduplication) before loading. What changed is where the business logic runs. In ETL, business rules were embedded in the extraction tool — opaque, hard to version-control, coupled to the tool vendor. In ELT, business logic lives in SQL inside the warehouse — transparent, version-controlled (via dbt), and tool-agnostic. That’s the real win: not just performance, but engineering discipline.

Why did cloud warehouses change everything?

Traditional warehouses (Oracle, Teradata) had fixed compute capacity. Every query competed for the same resources. Running heavy transformations during business hours could slow down reports.

Cloud warehouses separate storage from compute. Snowflake, for example, lets you spin up a “warehouse” (compute cluster) specifically for transformations, run it for 20 minutes, then shut it down. You pay only for those 20 minutes. Meanwhile, analysts query on a completely separate cluster. No contention, elastic scaling, pay-per-use.
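The pay-per-use arithmetic is simple enough to sketch. The credit rates and price below are made-up round numbers, not Snowflake's actual pricing, and real billing granularity differs; this only illustrates why a 20-minute run costs 20 minutes of compute.

```python
# Illustrative only: credit rates and price are invented round numbers,
# not Snowflake's actual pricing.
CREDITS_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8}  # by warehouse size
PRICE_PER_CREDIT = 3.00  # dollars, hypothetical

def run_cost(size: str, minutes: float) -> float:
    """Cost of running a virtual warehouse for `minutes`, then shutting it down."""
    credits = CREDITS_PER_HOUR[size] * (minutes / 60)
    return credits * PRICE_PER_CREDIT

# A 20-minute transformation run on a Medium warehouse:
cost = run_cost("M", 20)
print(f"${cost:.2f}")
```

The fixed-capacity model had no equivalent of this calculation: you paid for the whole box whether transformations ran or not.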

Core

What dbt Actually Is

dbt is not a database. It’s not a pipeline. It’s not an ETL tool. It’s the transformation layer — the part that takes raw data sitting in a warehouse and reshapes it into something useful.

Language Bridge — dbt

dbt = data build tool. “Build” like building a house. You have blueprints (models), raw materials (data in the warehouse), and dbt is the construction crew that reads the blueprints and builds the structure.

It doesn’t fetch the materials (that’s the Extract + Load part). It doesn’t store them (that’s the warehouse). It just builds — the T in ELT.

4-Layer Explanation — dbt
Layer 1 — Plain language

dbt lets you write transformation instructions as simple SQL files. Each file says “take this data, reshape it like this.” dbt figures out the right order to run them, runs them, and creates the resulting tables. One command: dbt run.

Layer 2 — With mechanism

Each dbt “model” is a SQL SELECT statement saved as a .sql file. When you run dbt run, dbt reads all your models, builds a dependency graph (model B references model A, so A must run first), compiles each model into the warehouse’s SQL dialect, executes them in order, and creates tables or views. Models are version-controlled in Git, testable, and documented.
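The dependency-graph step can be sketched as a toy in Python. The model names are invented, and real dbt resolves dependencies by parsing `ref()` calls out of the SQL; here they are just listed by hand to show the ordering logic.

```python
# Toy version of what `dbt run` does: build a dependency graph from
# model definitions, then execute models in dependency order.
# Model names are invented for illustration.
models = {
    "stg_orders": [],                                   # reads a source, no deps
    "stg_customers": [],
    "orders_enriched": ["stg_orders", "stg_customers"], # ref()s two models
    "daily_revenue": ["orders_enriched"],
}

def run_order(models):
    ordered, done = [], set()
    def visit(name):
        if name in done:
            return
        for dep in models[name]:  # upstream models must run first
            visit(dep)
        done.add(name)
        ordered.append(name)
    for name in models:
        visit(name)
    return ordered

print(run_order(models))
```

Whatever order you list the models in, every model is guaranteed to run after everything it references, which is exactly the guarantee `dbt run` gives you.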

Layer 3 — Technical

dbt brings software engineering practices to data transformation: version control (Git), modular code (models reference other models via ref()), testing (dbt test checks uniqueness, not-null, referential integrity), documentation (auto-generated from YAML), and environments (dev/staging/prod). It compiles Jinja-templated SQL into raw SQL, handles incremental loading logic, and supports multiple warehouse backends. It is open-source (dbt Core) with a paid cloud version (dbt Cloud).

Layer 4 — Expert nuance

dbt’s conceptual breakthrough isn’t technical — it’s cultural. Before dbt, data transformation was done by “ETL developers” using proprietary GUI tools (Informatica, SSIS). The logic was trapped in XML configs and drag-and-drop workflows — un-reviewable, un-testable, un-versionable. dbt said: transformations are just SQL. Put them in files. Version-control them. Review them in pull requests. Test them in CI/CD. This brought analytics engineering into the same discipline as software engineering.

What dbt Does NOT Do

dbt does not extract data from source systems. It does not load data into the warehouse. It does not schedule itself (you need an orchestrator like Airflow or dbt Cloud). It does not store data. It sits inside the warehouse and transforms what’s already there.

Think of it this way: the data is already in the kitchen (warehouse). dbt is the recipe book and the chef. Someone else delivered the groceries (extraction and loading tools).

Core

Two Meanings of “Model”

In Data Vault world, “model” means the Hub/Link/Satellite design. In dbt world, “model” means a SQL file. Same word, completely different things. Getting them mixed up will confuse you fast.

⚠ Disambiguation Zone

dbt “model”

A SQL file that produces a table or view. A recipe. hub_character.sql is a model.

One file = one model = one table.

DV “model”

The schema design — which Hubs, Links, and Satellites exist and how they connect. The blueprint.

The architecture = the model.

When someone at ScaleFree says “let me show you the model,” context tells you which one they mean. If they open a SQL file, it’s a dbt model. If they draw a diagram with Hubs and Satellites, it’s a DV model.

Key Concept
“The word ‘model’ means two different things in this context. In dbt, a model is a SQL file that produces a table. In Data Vault, the model is the schema design — the Hubs, Links, and Satellites and how they connect. One is a file, the other is an architecture.”
Core

DataVault4dbt

A Hub always works the same way. A Satellite always works the same way. If the patterns are standardized, they can be automated. That’s what datavault4dbt does.

Language Bridge — Macro

Macro — from Greek makros: “large, long.” A macro is a single short instruction that expands into many instructions. Like a keyboard shortcut that types a whole paragraph.

In dbt: A macro is a reusable template. You call it with your specific parameters (which column is the business key?), and it generates the full SQL. datavault4dbt is a library of these macros — one for Hubs, one for Satellites, one for Links, and so on.

The Automation Insight

Remember how a Hub always has the same four columns? hash_key, business_key, load_date, record_source. And the loading rule is always: insert if the business key is new, skip if it already exists, never update or delete.

Writing that SQL once is engineering. Writing it for 30 Hubs is copy-paste. datavault4dbt says: write the pattern once as a macro, then configure it for each entity.

You write 10 lines of configuration. The macro generates 200 lines of correct SQL. Your Hub is built.
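Here is the idea as a toy in Python. datavault4dbt's real macros are written in Jinja and generate considerably more careful SQL; this sketch, with invented table and column names, only shows the shape of "configuration in, full loading pattern out."

```python
# Toy "macro": one template, expanded per entity from a few lines of config.
# datavault4dbt's real macros are Jinja; this only mirrors the idea.
def hub_macro(source: str, business_key: str, hub: str) -> str:
    return f"""
INSERT INTO {hub} (hash_key, business_key, load_date, record_source)
SELECT s.hash_key, s.{business_key}, s.load_date, s.record_source
FROM {source} s
WHERE NOT EXISTS (          -- insert-only: skip keys already in the Hub
    SELECT 1 FROM {hub} h WHERE h.hash_key = s.hash_key
)""".strip()

# A few lines of configuration...
sql = hub_macro(
    source="stg_character",
    business_key="character_name",
    hub="hub_character",
)
# ...expand into the always-identical Hub loading pattern.
print(sql)
```

Configure 30 Hubs and the macro expands 30 times; the pattern never drifts, because nobody copy-pasted it.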

What datavault4dbt Automates
DV Entity  | What the Macro Does                                          | What You Provide
Hub        | Generates insert-only loading with hash key, deduplication  | Source table, business key column
Satellite  | Generates hash-diff comparison, inserts new rows on change   | Source table, attribute columns, Hub reference
Link       | Generates multi-key hash, insert-only relationship tracking  | Source table, foreign key columns to Hubs
Staging    | Generates hash keys + hash diffs from source columns         | Source table, column mapping
PIT Table  | Generates point-in-time snapshots across multiple Satellites | Hub reference, list of Satellites
Key Concept
“The key insight about datavault4dbt is that Data Vault loading patterns are standardized enough to encode once and reuse across every entity. The staging hash calculations, the insert-only Hub and Satellite patterns, and PIT table generation for the Business Vault — these are all consistent enough to automate. The methodology stays the same; only the configuration changes per project.”
ScaleFree built this — why does that matter?

datavault4dbt encodes years of ScaleFree consulting expertise into a reusable, open-source package. It is described as “congruent to the original Data Vault 2.0 definition by Dan Linstedt” — meaning the macros don’t just implement an interpretation of DV 2.0, they implement the specification itself.

ScaleFree built datavault4dbt. The value lies not in memorizing macro syntax, but in understanding why the automation works: the underlying methodology is standardized enough to encode once and reuse everywhere.
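The staging-layer hash calculations the macros generate can be sketched directly. Real implementations vary in hash algorithm, delimiter, casing, and null handling; the choices below (MD5, `||` delimiter, uppercase-and-trim normalization) are assumptions for illustration.

```python
import hashlib

# Sketch of the staging-layer hash calculations. Real implementations
# differ in algorithm, delimiter, and null handling; these are assumptions.
def hash_key(*business_keys: str) -> str:
    """Hash of the normalized business key(s): identifies the entity."""
    normalized = "||".join(k.strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode()).hexdigest()

def hash_diff(attributes: dict) -> str:
    """Hash of all descriptive attributes: changes when any attribute changes."""
    normalized = "||".join(f"{k}={v}" for k, v in sorted(attributes.items()))
    return hashlib.md5(normalized.encode()).hexdigest()

hk = hash_key("Khajiit")
hd1 = hash_diff({"level": 42, "location": "Whiterun"})
hd2 = hash_diff({"level": 43, "location": "Whiterun"})  # level changed

print(hk == hash_key(" khajiit "))  # normalization: same key, same hash
print(hd1 != hd2)                   # attribute change: new hash diff
```

A Satellite load compares the incoming hash diff to the latest stored one: if they match, nothing changed and no row is inserted; if they differ, a new row captures the change.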

Bonus

TurboVault4dbt

datavault4dbt automates the SQL. TurboVault4dbt automates the automation. It generates the dbt model files themselves from a metadata spreadsheet.

The Three-Layer Stack

Think of it as three levels of abstraction, each building on the last:

Level 3 — TurboVault4dbt: You describe your data in a metadata spreadsheet (Excel, Google Sheets, or directly in Snowflake/BigQuery). TurboVault reads that description and auto-generates all the dbt model files.

Level 2 — datavault4dbt: Those generated model files call datavault4dbt macros, which expand into the correct Data Vault SQL.

Level 1 — dbt Core: dbt compiles and executes the SQL against the warehouse, creating the actual tables.

Result: Describe your data → code writes itself → Data Vault is built. Metadata-driven automation.
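The metadata-driven step can be sketched as a toy. TurboVault4dbt's actual metadata schema and generated files are richer than this; the column names, entity names, and the simplified macro call below are all invented to show the mechanism of "one metadata row in, one model file out."

```python
# Toy version of metadata-driven generation: one "spreadsheet" row per
# entity, expanded into a dbt-style model file body. The metadata columns
# and macro call are simplified inventions, not TurboVault4dbt's real format.
metadata = [
    {"entity": "character", "type": "hub", "source": "stg_character", "bk": "character_name"},
    {"entity": "quest",     "type": "hub", "source": "stg_quest",     "bk": "quest_id"},
]

def generate_model(row: dict) -> tuple:
    filename = f"{row['type']}_{row['entity']}.sql"
    body = f"{{{{ hub(source_model='{row['source']}', business_key='{row['bk']}') }}}}"
    return filename, body

files = dict(generate_model(row) for row in metadata)
for name, body in files.items():
    print(name, "->", body)
```

Add a row to the metadata and a new model file appears on the next generation run; nobody writes the file by hand.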

Key Concept
“TurboVault4dbt takes the automation a level further by generating the dbt model files from metadata. You move from ‘manually configure each entity’ to ‘describe your data in a spreadsheet and the code writes itself.’ That’s a natural next step once the loading patterns are standardized.”
Important

Snowflake

dbt is the chef. datavault4dbt is the recipe book. Snowflake is the kitchen — the place where the data actually lives and where transformations happen.

4-Layer Explanation — Snowflake
Layer 1 — Plain language

Snowflake is a cloud database for analytics. You store your data there, and you run queries there. It runs on AWS, Azure, or Google Cloud — you don’t manage servers, Snowflake handles the infrastructure.

Layer 2 — With mechanism

Snowflake’s key innovation is separating storage from compute. Your data sits in cheap cloud storage (S3, Azure Blob, GCS). When you need to query it, Snowflake spins up a “virtual warehouse” (compute cluster) of whatever size you need. Multiple warehouses can query the same data simultaneously without competing. When you’re done, the compute shuts down. You pay for storage always, compute only when used.

Layer 3 — Technical

For Data Vault projects, Snowflake is a primary target platform. The staging area, Raw Vault, Business Vault, and Information Marts all live as schemas within Snowflake. dbt connects to Snowflake via a connector, compiles SQL to Snowflake’s SQL dialect, and executes transformations using Snowflake’s compute. datavault4dbt natively supports Snowflake as a target platform.

Layer 4 — Expert nuance

Snowflake is one of several cloud data warehouses. The competitors: Google BigQuery (serverless, auto-scaling, Google’s analytics platform), Amazon Redshift (AWS’s warehouse, older architecture), Databricks (Spark-based, strong on ML/data science workloads, uses a “lakehouse” architecture). ScaleFree works across platforms — datavault4dbt supports all of them. But Snowflake is the most common target in their project portfolio.

Important

Python — Reading, Not Writing

Python doesn’t replace SQL in data pipelines. It orchestrates SQL, connects APIs, runs custom logic, and handles the parts that SQL can’t. You need to read it, not write it.

Language Bridge — Python

Python — named after Monty Python’s Flying Circus, not the snake. The language was designed to be readable — almost like English pseudocode. That’s why it’s the most common language in data engineering.

Here’s what Python looks like. Don’t memorize syntax — just practice reading what it does:

skyrim_pipeline.py
# Variables — like naming something in algebra
character_name = "Khajiit"
level = 42

# Lists — a collection, like an inventory
sources = ["Whiterun_CRM", "Solitude_ERP", "Riften_API"]

# Function — a reusable recipe
def load_source(source_name):
    data = extract_from(source_name)
    load_to_staging(data)
    return len(data)

# Loop — do this for each item in the list
for source in sources:
    count = load_source(source)
    print(f"Loaded {count} records from {source}")
Python Syntax — Reading Guide
Python          | What It Means                             | Like in SQL / English
x = "hello"     | Assign a value to a name                  | No type needed — Python figures it out
def func():     | Define a function (reusable recipe)       | “def” = “define, henceforth called”
for x in list:  | Do something for each item                | Reads like English: “for each x in the list”
import dbt      | Bring in an external library              | Like #include or using — “bring in this tool”
if x > 5:       | Conditional — do this only if true        | Like SQL WHERE, but for code flow
indentation     | Indentation = scope (what’s inside what)  | No curly braces — whitespace is structure
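To practice reading, here is a runnable variant of the pipeline sketch above. The two helpers the original leaves undefined (`extract_from`, `load_to_staging`) are stubbed out here with fake data; a real extractor would call an API or read a source database.

```python
sources = ["Whiterun_CRM", "Solitude_ERP", "Riften_API"]

def extract_from(source_name):
    # Stub: returns three fake records per source instead of calling a real system.
    return [{"source": source_name, "row": i} for i in range(3)]

staging = []  # stand-in for the warehouse staging layer

def load_to_staging(data):
    staging.extend(data)

def load_source(source_name):
    data = extract_from(source_name)
    load_to_staging(data)
    return len(data)

for source in sources:
    count = load_source(source)
    print(f"Loaded {count} records from {source}")

print(len(staging))  # 3 sources x 3 records = 9 staged rows
```

Notice how the reading guide applies: `def` defines the recipes, the `for` loop runs one per source, and indentation marks what belongs inside what.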
What does Python actually do in data pipelines?

Orchestration: Python scripts (using tools like Airflow or Prefect) schedule and coordinate pipeline steps. “First extract from CRM, then run dbt, then notify the team.”

API connections: Pulling data from REST APIs (e.g., Salesforce, Stripe) — HTTP requests, JSON parsing, pagination handling. SQL can’t do this.

Custom transformations: Complex logic that’s awkward in SQL — machine learning preprocessing, fuzzy matching, natural language processing.

Data quality checks: Libraries like Great Expectations validate data against rules: “this column should never be null,” “values should be between 0 and 100.”
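A hand-rolled sketch shows the kind of rule such a library checks. Great Expectations has its own declarative API, which this does not reproduce; the record shape and rules below are invented examples of the two quoted checks.

```python
# Hand-rolled sketch of data quality rules. Great Expectations has its own
# API; this only shows the idea. Records and rules are invented examples.
rows = [
    {"character": "Khajiit", "level": 42},
    {"character": "Lydia",   "level": 12},
    {"character": None,      "level": 250},  # violates both rules
]

def check(rows):
    failures = []
    for i, row in enumerate(rows):
        if row["character"] is None:
            failures.append((i, "character should never be null"))
        if not 0 <= row["level"] <= 100:
            failures.append((i, "level should be between 0 and 100"))
    return failures

for index, message in check(rows):
    print(f"row {index}: {message}")
```

In a real pipeline, failures like these would block the load or raise an alert rather than just print.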

Core

The Full Picture

Every tool you’ve just learned has a specific place in the pipeline. Here’s how they all connect.

Source
Systems
Extract
Python/Fivetran
Staging
Snowflake
Raw Vault
dbt + DV4dbt
Biz Vault
dbt
Info Mart
Star Schema
Dashboard
Power BI/Tableau

Where Each Tool Lives

Python / extraction tools — pull data from source systems (CRM, ERP, APIs). The E in ELT.

Snowflake — the warehouse where everything lives. Staging, Raw Vault, Business Vault, and Marts are all schemas inside Snowflake.

dbt + datavault4dbt — transforms data inside Snowflake. Builds the Raw Vault and Business Vault. The T in ELT.

Power BI / Tableau — visualization tools that connect to the Information Marts and display dashboards. The end of the pipeline — what business users actually see.

Data Vault — not a tool. It’s the architecture that governs how data moves through all these layers. The blueprint that all the tools follow.

Power BI vs Tableau — what’s the difference?

Power BI (Microsoft): Most common in enterprise. Integrates with the Microsoft ecosystem (Azure, Excel, Teams). Strong on dashboards and scheduled reports. License model tied to Microsoft 365.

Tableau (Salesforce): More exploratory and visual. Favored by analysts who want to “play with” data interactively. Historically stronger on ad-hoc visualization, now competitive on dashboards too.

Both connect to Snowflake and query the Information Marts. At this stage, knowing they exist and where they sit in the pipeline is sufficient.

Exercise 1 — Tool Matching

Match the Tool to What It Does

For each description, select the correct tool.

Transforms data inside the warehouse using SQL models
Provides pre-built macros for Hub, Satellite, and Link loading
Generates dbt model files from a metadata spreadsheet
Cloud data warehouse with separated storage and compute
Connects to Information Marts and displays dashboards
Exercise 2 — ELT or ETL?

Identify the Approach

For each scenario, is it ETL or ELT?

A team uses Informatica to clean and validate data before sending it to an Oracle warehouse
Raw data from Salesforce is loaded into Snowflake, then dbt transforms it into a star schema
Source data is loaded into a staging layer as-is, then datavault4dbt generates the Hubs and Satellites
A middleware server applies business rules and deduplicates records before the warehouse ever sees them
Exercise 3 — Pipeline Order

Put the Steps in Order

What’s the correct order of data flow? Select 1–7 for each step.

Raw Vault (Hubs, Links, Satellites)
Source Systems (CRM, ERP, APIs)
Dashboard (Power BI / Tableau)
Staging Area
Information Mart (Star Schema)
Business Vault
Extraction (Python / Fivetran)

All Exercises Complete

You understand the tooling ecosystem. You can now talk about what each tool does and where it fits.

Key Concepts

On ELT + dbt
“dbt handles the T in ELT — it transforms data that’s already inside the warehouse. The key shift from ETL to ELT is that business logic moves from opaque middleware into version-controlled SQL models. That’s better engineering: reviewable, testable, documented.”
On datavault4dbt
datavault4dbt automates the loading patterns that Data Vault standardizes. Those patterns — insert-only Hub loads, hash-diff Satellite comparisons, PIT table generation — are consistent enough to encode once and reuse across every entity. The value is that the methodology stays the same; only the configuration changes per project.
On the Ecosystem
ScaleFree built a three-layer tooling stack: datavault4dbt for the loading patterns, TurboVault4dbt for metadata-driven code generation, and datavault4coalesce for a visual interface. The insight is that standardized patterns enable standardized automation — and the expertise of configuring that automation is a core part of what a Data Vault consultancy delivers.