Data Vault Modeling: Architecture, Examples, and Best Practices

 · 

5 min read

notion-image
In the ever-changing world of enterprise data management, organizations need a way to store, integrate, and audit data at scale without constantly breaking their models when business rules or sources evolve.
This is where Data Vault Modeling (DVM) shines. Unlike traditional Star Schemas or 3NF models, Data Vault focuses on adaptability, auditability, and scalability, making it especially powerful for modern lakehouse and cloud-native data platforms.

Core Building Blocks of Data Vault

  • Hubs represent core business concepts such as Customer ID, Product Number, or Vehicle Identification Number (VIN) and contain the business key.
  • Links define relationships between hubs, capturing physical or logical associations without holding attribute data themselves.
  • Satellites store contextual and descriptive details about hubs or links, such as timestamps, status, or metadata, allowing chang
At the heart of Data Vault lie three fundamental entities:

1.Hubs

The Business Keys

  • Represent core business concepts: e.g., Customer, Product, Vehicle.
  • Contain only the business key (immutable, unique identifier).
  • Provide semantic stability in the model.
👉 Example Hub table:
Plain Text
CREATE TABLE Hub_Customer (
    Customer_HK   CHAR(32) NOT NULL, -- Hash Key of Business Key
    Customer_ID   VARCHAR(50) NOT NULL, -- Natural Business Key
    Load_Date     TIMESTAMP NOT NULL,
    Record_Source VARCHAR(50) NOT NULL,
    PRIMARY KEY (Customer_HK)
);

2.Links

The Relationships

  • Represent associations between Hubs.
  • Contain only keys, no attributes.
  • Capture n-ary relationships (not limited to 1-to-1 or 1-to-many).
👉 Example Link table:
Plain Text
CREATE TABLE Link_OrderCustomer (
    OrderCustomer_HK CHAR(32) NOT NULL,
    Order_HK         CHAR(32) NOT NULL,
    Customer_HK      CHAR(32) NOT NULL,
    Load_Date        TIMESTAMP NOT NULL,
    Record_Source    VARCHAR(50) NOT NULL,
    PRIMARY KEY (OrderCustomer_HK)
);

3.Satellites

The Context

  • Store descriptive attributes and historical changes.
  • Can be attached to either a Hub or a Link.
  • Provide auditable, time-variant history.
👉 Example Satellite:
Plain Text
CREATE TABLE Sat_Customer (
    Customer_HK     CHAR(32) NOT NULL,
    Customer_Name   VARCHAR(100),
    Customer_Status VARCHAR(20),
    Effective_Date  TIMESTAMP NOT NULL,
    Load_Date       TIMESTAMP NOT NULL,
    Record_Source   VARCHAR(50) NOT NULL,
    PRIMARY KEY (Customer_HK, Effective_Date)
);

Data Vault Architecture

A typical modern implementation follows the Bronze → Silver → Gold layered pattern:
Plain Text
 ┌───────────┐      ┌───────────┐      ┌───────────────┐
 │ Bronze    │      │ Silver    │      │ Gold          │
 │ (Raw)     │ ---> │ Data Vault│ ---> │ Analytics     │
 │ Landing   │      │ Hubs,Links│      │ Star Schema   │
 └───────────┘      │ Satellites│      │ Reports, AI   │
                    └───────────┘      └───────────────┘
  • Bronze (Raw Ingest Layer): Unchanged ingestion from source systems.
  • Silver (Data Vault Layer): Business keys (Hubs), relationships (Links), descriptive data (Satellites).
  • Gold (Presentation Layer): Star schemas, materialized views, BI-ready marts.

Typical Architecture and Implementation

  • Often part of a multi-layer lakehouse or warehouse setup: raw data is staged (Bronze), then modeled with the data vault technique (Silver), before being refined for analytics or reporting in more familiar formats like star schemas (Gold).
  • Satellites may be added or modified as source schema evolves, ensuring flexibility without disturbing other model components.
  • Data vault supports scalability up to petabyte-level volumes and promotes ETL code generation.

⚙️ Unique Features of Data Vault

  • The data vault approach separates structural information (hubs and links) from descriptive attributes (satellites), enabling robust tracking, auditing, and ease of adaptation when business rules or data sources change.
  • Each row uniquely carries load metadata, such as source and load date, supporting full traceability and auditability.
  • Data vault models are designed to handle “all the data, all of the time,” storing facts, including both clean and “bad” data, which makes them resilient and flexible in rapidly changing environments.
  • Insert-only, parallel-load capabilities mean new data—including from new sources—can be added without expensive refactoring or risk of error during sequential batch loads
Insert-only → No destructive updates, enabling full history.
Parallel loads → New data sources can be added without disruption.
Auditability → Every row has load metadata (who, when, from where).
Scalable to petabytes → Cloud-native warehouses love this model.
Handles “all the data” → Including dirty, incomplete, or raw values.

Data Vault vs Star Schema

Feature
Data Vault
Star Schema / Dimensional Model
Structure
Hubs, Links, Satellites (normalized)
Facts & Dimensions (denormalized)
Change Handling
Agile: add satellites or hubs easily
Expensive & disruptive changes
Query Performance
Needs transformations for analytics
Optimized for BI/reporting
Historical Tracking
Full history with metadata
Usually only current “truth”
Auditability
Full traceability
Limited to logs

Example Diagram: Mini Data Vault

Imagine tracking Customers placing Orders for Products:
Plain Text
     ┌────────────┐     ┌────────────┐     ┌────────────┐
     │ HubCustomer│     │ HubOrder   │     │ HubProduct │
     └─────┬──────┘     └─────┬──────┘     └─────┬──────┘
           │                  │                  │
           └──────────┐ ┌─────┘                  │
                      ▼ ▼                        │
                 ┌───────────────┐               │
                 │ Link_OrderLine│ <─────────────┘
                 └──────┬────────┘
                        │
           ┌────────────┴─────────────┐
           │ Satellites (Attributes) │
           └─────────────────────────┘
  • Hubs: Customer, Order, Product.
  • Link: OrderLine connecting them.
  • Satellites: Store changing attributes like Order Status, Customer Address.

Code Example: Hashing Keys

To avoid surrogate key collisions and support distributed loads, Data Vault often uses hash keys:
Plain Text
import hashlib

def hash_key(value: str) -> str:
    return hashlib.md5(value.encode('utf-8')).hexdigest()

# Example:
customer_id = "CUST12345"
print(hash_key(customer_id))
# Output: 827ccb0eea8a706c4c34a16891f84e7b

When to Use Data Vault

  • You need to integrate multiple, changing sources (ERP, CRM, APIs).
  • You want full auditability and traceability (regulatory compliance, finance).
  • Your organization is scaling towards big data or lakehouse platforms.
  • You want an agile, evolvable model that avoids constant redesign.
Not ideal if:
  • You need fast BI queries directly → You’ll need a star schema layer on top.
  • Your data is small and stable → Dimensional modeling may be simpler.