The Tools
You’ve been learning Data Vault by hand. In real projects, almost nobody writes the SQL manually. Let’s figure out what actually builds it.
The Order Changed Everything
For decades, data teams cleaned their data before putting it in the warehouse. Then cloud warehouses got powerful enough that everyone asked: wait, why don’t we just load it first and clean it there?
Extract — from Latin extrahere: “to pull out.” Pull data out of source systems.
Transform — from Latin transformare: “to change shape.” Reshape data into the structure you need.
Load — from Old English lad: “course, journey.” Send data on its journey into the warehouse.
Why the order matters: ETL = transform then load. ELT = load then transform. Same three steps, different sequence. That sequence change is the entire revolution.
Extract → Transform → Load
Data is cleaned, reshaped, and validated before it enters the warehouse.
Like cooking at home, bringing the finished meal to a restaurant, and plating it there.
Why it existed: warehouses were expensive. You didn’t waste precious warehouse compute on messy data.
Extract → Load → Transform
Raw data is loaded into the warehouse first, then transformed inside it.
Like delivering all the raw ingredients to a professional kitchen and letting the chefs handle it.
Why it won: cloud warehouses (Snowflake, BigQuery) have massive, elastic compute. Transforming there is cheaper and faster.
ELT means: grab data from wherever it lives, dump it into the warehouse as-is, then reshape it there. It’s like moving all your stuff into a new apartment first, then organizing it, instead of sorting everything before the move.
In ETL, a separate tool (like Informatica or Talend) does the heavy transformation work on a dedicated server. In ELT, raw data lands in the warehouse (staging layer), and the warehouse itself runs the transformations using SQL. Cloud warehouses can spin up massive compute clusters on demand, making this approach both faster and cheaper at scale.
Data Vault is naturally ELT-aligned. Raw data loads into the staging area and Raw Vault with minimal transformation — just hash key generation and structural conformance. Business rules and complex transformations happen in the Business Vault, inside the warehouse. This preserves the source data in its original form (audit trail) while still producing clean, queryable output.
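That "hash key generation" step is simple enough to sketch. Below is a minimal illustration in Python — trim + uppercase + MD5 over the business key is a common Data Vault convention, but the exact normalization and hash function vary by project, so treat this as a sketch, not the specification:

```python
import hashlib

def hash_key(*business_keys, delimiter="||"):
    """Hash one or more business keys into a single surrogate key.
    Trim + uppercase + MD5 is a common Data Vault convention, but
    normalization rules and hash functions differ between projects."""
    normalized = delimiter.join(str(k).strip().upper() for k in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

# The same business key always produces the same hash key, no matter
# which source system or load batch delivered it.
print(hash_key("  arya-stark ") == hash_key("ARYA-STARK"))  # True
```

Because the hash is deterministic, two systems that know the same business key independently compute the same Hub key — no lookup table needed.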
The ETL/ELT distinction is less binary than it sounds. Modern pipelines often do light transformations during extraction (e.g., format conversion, deduplication) before loading. What changed is where the business logic runs. In ETL, business rules were embedded in the extraction tool — opaque, hard to version-control, coupled to the tool vendor. In ELT, business logic lives in SQL inside the warehouse — transparent, version-controlled (via dbt), and tool-agnostic. That’s the real win: not just performance, but engineering discipline.
Traditional warehouses (Oracle, Teradata) had fixed compute capacity. Every query competed for the same resources. Running heavy transformations during business hours could slow down reports.
Cloud warehouses separate storage from compute. Snowflake, for example, lets you spin up a “warehouse” (compute cluster) specifically for transformations, run it for 20 minutes, then shut it down. You pay only for those 20 minutes. Meanwhile, analysts query on a completely separate cluster. No contention, elastic scaling, pay-per-use.
What dbt Actually Is
dbt is not a database. It’s not a pipeline. It’s not an ETL tool. It’s the transformation layer — the part that takes raw data sitting in a warehouse and reshapes it into something useful.
dbt = data build tool. “Build” like building a house. You have blueprints (models), raw materials (data in the warehouse), and dbt is the construction crew that reads the blueprints and builds the structure.
It doesn’t fetch the materials (that’s the Extract + Load part). It doesn’t store them (that’s the warehouse). It just builds — the T in ELT.
dbt lets you write transformation instructions as simple SQL files. Each file says “take this data, reshape it like this.” dbt figures out the right order to run them, runs them, and creates the resulting tables. One command: dbt run.
Each dbt “model” is a SQL SELECT statement saved as a .sql file. When you run dbt run, dbt reads all your models, builds a dependency graph (model B references model A, so A must run first), compiles each model into the warehouse’s SQL dialect, executes them in order, and creates tables or views. Models are version-controlled in Git, testable, and documented.
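The dependency-graph idea can be shown with Python's standard library. The model names below are hypothetical; dbt does the same thing at project scale by parsing `ref()` calls:

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical project: each model maps to the models it ref()'s.
deps = {
    "stg_characters": [],                                  # reads a raw source
    "hub_character": ["stg_characters"],
    "sat_character_details": ["stg_characters", "hub_character"],
}

# dbt's core trick: build the graph from ref() calls, then run models
# so that every dependency is built before anything that uses it.
run_order = list(TopologicalSorter(deps).static_order())
print(run_order)  # stg_characters comes first, the Satellite last
```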
dbt brings software engineering practices to data transformation: version control (Git), modular code (models reference other models via ref()), testing (dbt test checks uniqueness, not-null, referential integrity), documentation (auto-generated from YAML), and environments (dev/staging/prod). It compiles Jinja-templated SQL into raw SQL, handles incremental loading logic, and supports multiple warehouse backends. It is open-source (dbt Core) with a paid cloud version (dbt Cloud).
dbt’s conceptual breakthrough isn’t technical — it’s cultural. Before dbt, data transformation was done by “ETL developers” using proprietary GUI tools (Informatica, SSIS). The logic was trapped in XML configs and drag-and-drop workflows — un-reviewable, un-testable, un-versionable. dbt said: transformations are just SQL. Put them in files. Version-control them. Review them in pull requests. Test them in CI/CD. This brought analytics engineering into the same discipline as software engineering.
What dbt Does NOT Do
dbt does not extract data from source systems. It does not load data into the warehouse. It does not schedule itself (you need an orchestrator like Airflow or dbt Cloud). It does not store data. It sits inside the warehouse and transforms what’s already there.
Think of it this way: the data is already in the kitchen (warehouse). dbt is the recipe book and the chef. Someone else delivered the groceries (extraction and loading tools).
Two Meanings of “Model”
In Data Vault world, “model” means the Hub/Link/Satellite design. In dbt world, “model” means a SQL file. Same word, completely different things. Getting them mixed up will confuse you fast.
dbt “model”
A SQL file that produces a table or view. A recipe. hub_character.sql is a model.
One file = one model = one table or view.
DV “model”
The schema design — which Hubs, Links, and Satellites exist and how they connect. The blueprint.
The architecture = the model.
When someone at ScaleFree says “let me show you the model,” context tells you which one they mean. If they open a SQL file, it’s a dbt model. If they draw a diagram with Hubs and Satellites, it’s a DV model.
“The word ‘model’ means two different things in this context. In dbt, a model is a SQL file that produces a table. In Data Vault, the model is the schema design — the Hubs, Links, and Satellites and how they connect. One is a file, the other is an architecture.”
DataVault4dbt
A Hub always works the same way. A Satellite always works the same way. If the patterns are standardized, they can be automated. That’s what datavault4dbt does.
Macro — from Greek makros: “large, long.” A macro is a single short instruction that expands into many instructions. Like a keyboard shortcut that types a whole paragraph.
In dbt: A macro is a reusable template. You call it with your specific parameters (which column is the business key?), and it generates the full SQL. datavault4dbt is a library of these macros — one for Hubs, one for Satellites, one for Links, and so on.
The Automation Insight
Remember how a Hub always has the same four columns? hash_key, business_key, load_date, record_source. And the loading rule is always: insert if the business key is new, skip if it already exists, never update or delete.
Writing that SQL once is engineering. Writing it for 30 Hubs is copy-paste. datavault4dbt says: write the pattern once as a macro, then configure it for each entity.
You write 10 lines of configuration. The macro generates 200 lines of correct SQL. Your Hub is built.
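The insert-only rule itself is easy to sketch in plain Python. This simulates the logic of a Hub load; the actual macro expresses the same pattern in SQL:

```python
def load_hub(hub_rows, staged_rows):
    """Insert-only Hub load: add a staged row only when its hash key
    is new; never update, never delete. A simulation of the pattern,
    not datavault4dbt's actual implementation."""
    existing = {row["hash_key"] for row in hub_rows}
    for row in staged_rows:
        if row["hash_key"] not in existing:
            hub_rows.append(row)
            existing.add(row["hash_key"])
    return hub_rows

hub = []
batch = [{"hash_key": "a1", "business_key": "ARYA-STARK",
          "load_date": "2024-01-01", "record_source": "crm"}]
load_hub(hub, batch)
load_hub(hub, batch)   # replaying the same batch changes nothing
print(len(hub))        # still 1 row
```

Note the property this buys you: loads are idempotent. Re-running a batch after a failure can never create duplicates.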
| DV Entity | What the Macro Does | What You Provide |
|---|---|---|
| Hub | Generates insert-only loading with hash key, deduplication | Source table, business key column |
| Satellite | Generates hash-diff comparison, insert new rows on change | Source table, attribute columns, Hub reference |
| Link | Generates multi-key hash, insert-only relationship tracking | Source table, foreign key columns to Hubs |
| Staging | Generates hash keys + hash diffs from source columns | Source table, column mapping |
| PIT Table | Generates point-in-time snapshots across multiple Satellites | Hub reference, list of Satellites |
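The Satellite row of that table — "hash-diff comparison, insert new rows on change" — can be sketched the same way. This is an illustrative simulation in Python, not the macro's real SQL:

```python
import hashlib

def hashdiff(attrs):
    """One hash over all descriptive attributes, used to detect change.
    Sorting the keys keeps the hash stable across column order."""
    payload = "||".join(f"{k}={v}" for k, v in sorted(attrs.items()))
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

def load_satellite(sat_rows, hash_key, attrs, load_date):
    """Insert a new row only when the attributes changed since the
    most recent row for this hash key -- never update in place."""
    latest = None
    for row in sat_rows:
        if row["hash_key"] == hash_key:
            if latest is None or row["load_date"] > latest["load_date"]:
                latest = row
    if latest is None or latest["hashdiff"] != hashdiff(attrs):
        sat_rows.append({"hash_key": hash_key, "hashdiff": hashdiff(attrs),
                         "load_date": load_date, **attrs})
    return sat_rows

sat = []
load_satellite(sat, "a1", {"house": "Stark"}, "2024-01-01")
load_satellite(sat, "a1", {"house": "Stark"}, "2024-01-02")         # no change: skipped
load_satellite(sat, "a1", {"house": "Faceless Men"}, "2024-01-03")  # change: new row
print(len(sat))  # 2 rows of history
```

Comparing one hash-diff instead of every attribute column is why the pattern scales: the comparison cost stays constant no matter how wide the Satellite is.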
“The key insight about datavault4dbt is that Data Vault loading patterns are standardized enough to encode once and reuse across every entity. The staging hash calculations, the insert-only Hub and Satellite patterns, and PIT table generation for the Business Vault — these are all consistent enough to automate. The methodology stays the same; only the configuration changes per project.”
datavault4dbt encodes years of ScaleFree consulting expertise into a reusable, open-source package. It is described as “congruent to the original Data Vault 2.0 definition by Dan Linstedt” — meaning the macros don’t just implement an interpretation of DV 2.0, they implement the specification itself.
ScaleFree built datavault4dbt. The value lies not in memorizing macro syntax, but in understanding why the automation works: the underlying methodology is standardized enough to encode once and reuse everywhere.
TurboVault4dbt
datavault4dbt automates the SQL. TurboVault4dbt automates the automation. It generates the dbt model files themselves from a metadata spreadsheet.
The Three-Layer Stack
Think of it as three levels of abstraction, each building on the last:
Level 3 — TurboVault4dbt: You describe your data in a metadata spreadsheet (Excel, Google Sheets, or directly in Snowflake/BigQuery). TurboVault reads that description and auto-generates all the dbt model files.
Level 2 — datavault4dbt: Those generated model files call datavault4dbt macros, which expand into the correct Data Vault SQL.
Level 1 — dbt Core: dbt compiles and executes the SQL against the warehouse, creating the actual tables.
Result: Describe your data → code writes itself → Data Vault is built. Metadata-driven automation.
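A toy version of Level 3's idea, in Python. The metadata columns and the macro parameter names below are invented for illustration — they are not TurboVault4dbt's actual spreadsheet format or datavault4dbt's actual macro signature:

```python
# Hypothetical metadata row -- the kind of description TurboVault4dbt
# reads from a spreadsheet or warehouse table. Column names invented.
entity = {"name": "character",
          "source_model": "stg_characters",
          "business_key": "character_name"}

# Toy generator: expand the metadata into a dbt model file that calls
# a hub macro. Parameter names are placeholders, not the real signature.
model_sql = (
    f"{{{{ hub(\n"
    f"    hashkey='hk_{entity['name']}',\n"
    f"    business_key='{entity['business_key']}',\n"
    f"    source_model='{entity['source_model']}'\n"
    f") }}}}"
)

print(model_sql)  # the contents of a generated hub model file
```

One metadata row in, one model file out — run it over every row of the spreadsheet and the whole project writes itself.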
“TurboVault4dbt takes the automation a level further by generating the dbt model files from metadata. You move from ‘manually configure each entity’ to ‘describe your data in a spreadsheet and the code writes itself.’ That’s a natural next step once the loading patterns are standardized.”
Snowflake
dbt is the chef. datavault4dbt is the recipe book. Snowflake is the kitchen — the place where the data actually lives and where transformations happen.
Snowflake is a cloud database for analytics. You store your data there, and you run queries there. It runs on AWS, Azure, or Google Cloud — you don’t manage servers, Snowflake handles the infrastructure.
Snowflake’s key innovation is separating storage from compute. Your data sits in cheap cloud storage (S3, Azure Blob, GCS). When you need to query it, Snowflake spins up a “virtual warehouse” (compute cluster) of whatever size you need. Multiple warehouses can query the same data simultaneously without competing. When you’re done, the compute shuts down. You pay for storage always, compute only when used.
For Data Vault projects, Snowflake is a primary target platform. The staging area, Raw Vault, Business Vault, and Information Marts all live as schemas within Snowflake. dbt connects to Snowflake via a connector, compiles SQL to Snowflake’s SQL dialect, and executes transformations using Snowflake’s compute. datavault4dbt natively supports Snowflake as a target platform.
Snowflake is one of several cloud data warehouses. The competitors: Google BigQuery (serverless, auto-scaling, Google’s analytics platform), Amazon Redshift (AWS’s warehouse, older architecture), Databricks (Spark-based, strong on ML/data science workloads, uses a “lakehouse” architecture). ScaleFree works across platforms — datavault4dbt supports all of them. But Snowflake is the most common target in their project portfolio.
Python — Reading, Not Writing
Python doesn’t replace SQL in data pipelines. It orchestrates SQL, connects APIs, runs custom logic, and handles the parts that SQL can’t. You need to read it, not write it.
Python — named after Monty Python’s Flying Circus, not the snake. The language was designed to be readable — almost like English pseudocode. That’s why it’s the most common language in data engineering.
Here’s what Python looks like. Don’t memorize syntax — just practice reading what it does:
| Python | What It Means | Like in SQL / English |
|---|---|---|
| x = "hello" | Assign a value to a name | No type needed — Python figures it out |
| def func(): | Define a function (reusable recipe) | “def” = “define, henceforth called” |
| for x in list: | Do something for each item | Reads like English: “for each x in the list” |
| import dbt | Bring in an external library | Like #include or using — “bring in this tool” |
| if x > 5: | Conditional — do this only if true | Like SQL WHERE, but for code flow |
| indentation | Indentation = scope (what’s inside what) | No curly braces — whitespace is structure |
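Here is a short script that uses every construct in the table — again, practice reading it, not writing it:

```python
import math                       # bring in an external library

def describe(numbers):            # define a reusable function
    labels = []
    for n in numbers:             # do something for each item
        if n > 5:                 # conditional: only runs when true
            labels.append(f"{n} is big (sqrt ~ {math.sqrt(n):.1f})")
        else:
            labels.append(f"{n} is small")
    return labels                 # indentation marks what's inside what

x = [3, 9, 16]                    # assign a value to a name
print(describe(x))
```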
Orchestration: Python scripts (using tools like Airflow or Prefect) schedule and coordinate pipeline steps. “First extract from CRM, then run dbt, then notify the team.”
API connections: Pulling data from REST APIs (e.g., Salesforce, Stripe) — HTTP requests, JSON parsing, pagination handling. SQL can’t do this.
Custom transformations: Complex logic that’s awkward in SQL — machine learning preprocessing, fuzzy matching, natural language processing.
Data quality checks: Libraries like Great Expectations validate data against rules: “this column should never be null,” “values should be between 0 and 100.”
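The data-quality idea in its simplest form: hand-rolled rules in plain Python. This shows the concept only — it is not Great Expectations' API:

```python
# Minimal, hand-rolled version of the checks a library like Great
# Expectations formalizes. The rules mirror the examples above:
# "name should never be null", "score should be between 0 and 100".
rows = [{"name": "Arya", "score": 42},
        {"name": "Bran", "score": 101}]

def check(rows):
    failures = []
    for i, row in enumerate(rows):
        if row.get("name") is None:
            failures.append(f"row {i}: name is null")
        if not (0 <= row.get("score", 0) <= 100):
            failures.append(f"row {i}: score out of range")
    return failures

print(check(rows))  # row 1 fails the 0-100 range rule
```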
The Full Picture
Every tool you’ve just learned has a specific place in the pipeline. Here’s how they all connect.
Source systems → Python/Fivetran → Snowflake → dbt + DV4dbt (Raw & Business Vault) → dbt (Information Marts) → Star Schema → Power BI/Tableau
Where Each Tool Lives
Python / extraction tools — pull data from source systems (CRM, ERP, APIs). The E in ELT.
Snowflake — the warehouse where everything lives. Staging, Raw Vault, Business Vault, and Marts are all schemas inside Snowflake.
dbt + datavault4dbt — transforms data inside Snowflake. Builds the Raw Vault and Business Vault. The T in ELT.
Power BI / Tableau — visualization tools that connect to the Information Marts and display dashboards. The end of the pipeline — what business users actually see.
Data Vault — not a tool. It’s the architecture that governs how data moves through all these layers. The blueprint that all the tools follow.
Power BI (Microsoft): Most common in enterprise. Integrates with the Microsoft ecosystem (Azure, Excel, Teams). Strong on dashboards and scheduled reports. License model tied to Microsoft 365.
Tableau (Salesforce): More exploratory and visual. Favored by analysts who want to “play with” data interactively. Historically stronger on ad-hoc visualization, now competitive on dashboards too.
Both connect to Snowflake and query the Information Marts. At this stage, knowing they exist and where they sit in the pipeline is sufficient.
Key Concepts
“dbt handles the T in ELT — it transforms data that’s already inside the warehouse. The key shift from ETL to ELT is that business logic moves from opaque middleware into version-controlled SQL models. That’s better engineering: reviewable, testable, documented.”
datavault4dbt automates the loading patterns that Data Vault standardizes. Those patterns — insert-only Hub loads, hash-diff Satellite comparisons, PIT table generation — are consistent enough to encode once and reuse across every entity. The value is that the methodology stays the same; only the configuration changes per project.
ScaleFree built a three-layer tooling stack: datavault4dbt for the loading patterns, TurboVault4dbt for metadata-driven code generation, and datavault4coalesce for a visual interface. The insight is that standardized patterns enable standardized automation — and the expertise of configuring that automation is a core part of what a Data Vault consultancy delivers.