Incremental Data Load Approaches


Full vs Incremental Load:

With a full load, the entire dataset in the destination is deleted and replaced by the source dataset. Every record from the source is loaded anew, overwriting the existing data. This method guarantees that the destination is a complete, exact replica of the source at the time of the load, but it can be time-consuming and resource-intensive, especially for large datasets.
In contrast, an incremental load selectively updates the destination by loading only the differences between the source and the target. Only new, updated, or deleted records are processed and applied, which makes incremental loading far more efficient in terms of time and resources because it minimizes the volume of data transferred and processed. It is particularly advantageous for large datasets and real-time data integration scenarios, where only the necessary changes need to be reflected in the destination rather than a complete overwrite.
 
 

Column-Based Incremental Load:

In a column-based incremental load approach, specific columns in the dataset are used to track and manage changes. Two common columns used for this purpose are the Numeric Primary Key Column and the Date/Time Column.

Numeric Primary Key Column:

  • An auto-incrementing numeric key makes it straightforward to identify records added since the last load: only rows whose key is greater than the highest key already present in the destination need to be transferred (a query sketch follows the example below).
  • Example:
    • Source Table:
      • Plain Text
        ID  Name       Age
        -------------------
        1   Alice      30
        2   Bob        25
        3   Charlie    35
        
    • Destination Table before Incremental Load:
      • Plain Text
        ID  Name       Age
        -------------------
        1   Alice      30
        2   Bob        25
        
    • After Incremental Load (new records with an ID above the destination's maximum are loaded):
      • Plain Text
        ID  Name       Age
        -------------------
        1   Alice      30
        2   Bob        25
        3   Charlie    35
        
 
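A minimal query sketch for this pattern, assuming illustrative table names source_customers and dest_customers with the layout shown above:
SQL
-- Load only the rows whose ID is greater than the highest ID already in the destination
INSERT INTO dest_customers (ID, Name, Age)
SELECT s.ID, s.Name, s.Age
FROM source_customers s
WHERE s.ID > (SELECT COALESCE(MAX(ID), 0) FROM dest_customers);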

Date/Time Column:

  • This column provides temporal granularity, allowing changes to be tracked precisely based on creation or modification timestamps (a query sketch follows the example below).
  • Example:
    • Source Table:
      • Plain Text
        ID  Name       Age  Last_Modified
        -----------------------------------
        1   Alice      30   2024-01-01 10:00:00
        2   Bob        25   2024-01-02 11:00:00
        3   Charlie    35   2024-01-03 12:00:00
        
    • Destination Table before Incremental Load:
      • Plain Text
        ID  Name       Age  Last_Modified
        -----------------------------------
        1   Alice      30   2024-01-01 10:00:00
        2   Bob        25   2024-01-02 11:00:00
        
    • After Incremental Load (records with Last_Modified > 2024-01-02 11:00:00 are loaded):
      • Plain Text
        ID  Name       Age  Last_Modified
        -----------------------------------
        1   Alice      30   2024-01-01 10:00:00
        2   Bob        25   2024-01-02 11:00:00
        3   Charlie    35   2024-01-03 12:00:00
        
In both examples, the incremental load process selectively updates the destination table based on the specified columns, ensuring only new or modified records are loaded, thus optimizing the ETL process.
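For the timestamp-driven variant, the load can be sketched as an upsert so that both new and updated rows are applied (table names are illustrative; MERGE syntax varies slightly by dialect):
SQL
-- Apply all rows modified after the last processed watermark ('2024-01-02 11:00:00' in the example)
MERGE INTO dest_customers d
USING (
    SELECT ID, Name, Age, Last_Modified
    FROM source_customers
    WHERE Last_Modified > '2024-01-02 11:00:00'
) s ON d.ID = s.ID
WHEN MATCHED THEN UPDATE SET d.Name = s.Name, d.Age = s.Age, d.Last_Modified = s.Last_Modified
WHEN NOT MATCHED THEN INSERT (ID, Name, Age, Last_Modified) VALUES (s.ID, s.Name, s.Age, s.Last_Modified);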

Log-Based Incremental Load:

Log-based incremental load leverages transaction logging to capture granular data changes, including inserts, updates, and deletes. This approach ensures that all changes in the source system are accurately and efficiently propagated to the destination system.

Transaction Logging:

Transaction logging involves recording every data modification event in a log file, capturing details such as the type of operation (insert, update, delete), the affected data, and timestamps. This method allows for precise and sequential application of changes, maintaining data integrity and consistency.
  • Example:
    • Source System Transaction Log:
      • Plain Text
        Timestamp           Operation   Table     ID  Name     Age
        ----------------------------------------------------------
        2024-01-01 10:00:00 INSERT      Users     1   Alice    30
        2024-01-02 11:00:00 INSERT      Users     2   Bob      25
        2024-01-03 12:00:00 UPDATE      Users     1   Alice    31
        2024-01-04 13:00:00 DELETE      Users     2
        2024-01-05 14:00:00 INSERT      Users     3   Charlie  35
        
    • Destination Table before Incremental Load:
      • Plain Text
        ID  Name       Age
        -------------------
        1   Alice      30
        2   Bob        25
        
    • After Incremental Load (applying log events in sequence):
      • Plain Text
        ID  Name       Age
        -------------------
        1   Alice      31
        3   Charlie    35
        
Process:
  1. Log Capture: Every transaction (insert, update, delete) in the source system is recorded in a transaction log with a timestamp.
  2. Log Reading: The ETL process reads the transaction log to identify the changes that need to be applied to the destination system.
  3. Change Application: The identified changes are applied to the destination system in the exact order they occurred, ensuring data consistency and integrity (a SQL sketch of this step follows this list).
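A set-based SQL sketch of the apply step, using the ChangeLog and Users names from the example above (this collapses the log to its net effect per ID rather than replaying it row by row; MERGE syntax varies by dialect):
SQL
-- 1) Apply deletes: IDs whose most recent log entry is a DELETE
DELETE FROM Users
WHERE ID IN (
    SELECT c1.ID
    FROM ChangeLog c1
    WHERE c1.Operation = 'DELETE'
      AND c1.Timestamp = (SELECT MAX(c2.Timestamp) FROM ChangeLog c2 WHERE c2.ID = c1.ID)
);

-- 2) Upsert the latest INSERT/UPDATE image for the remaining IDs
MERGE INTO Users u
USING (
    SELECT c1.ID, c1.Name, c1.Age
    FROM ChangeLog c1
    WHERE c1.Operation IN ('INSERT', 'UPDATE')
      AND c1.Timestamp = (SELECT MAX(c2.Timestamp) FROM ChangeLog c2 WHERE c2.ID = c1.ID)
) c ON u.ID = c.ID
WHEN MATCHED THEN UPDATE SET u.Name = c.Name, u.Age = c.Age
WHEN NOT MATCHED THEN INSERT (ID, Name, Age) VALUES (c.ID, c.Name, c.Age);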
Advantages:
  • Efficiency: Only the changes are processed and transferred, reducing the volume of data moved and the time required for data synchronization.
  • Accuracy: Changes are applied in the exact sequence they occurred, maintaining the integrity and consistency of the data.
  • Minimal Impact: The source system's performance is minimally affected, as only the transaction log is read without requiring full table scans.
Log-based incremental loading is particularly useful for real-time data integration scenarios where timely and accurate reflection of changes is critical.

Hash-Based Incremental Load:

Hash-based incremental load involves using hash functions to generate unique hash values for records in a dataset. These hash values represent the state of the data, allowing for efficient comparison and identification of altered records. This approach minimizes the amount of data processed during incremental loads, optimizing performance and ensuring data integrity.

Hash Function Selection:

Selecting the optimal hash function is crucial for generating unique hashes that accurately reflect the data state. The chosen hash function must minimize collision risks, where different records produce the same hash value, to ensure precise identification of changes.
  • Example:
    • Source Table:
      • Plain Text
        ID  Name       Age
        -------------------
        1   Alice      30
        2   Bob        25
        3   Charlie    35
        
    • Hash Function Applied:
      • Plain Text
        ID  Hash
        ------------
        1   h1a2b3c4
        2   d5e6f7g8
        3   i9j0k1l2
        

Hash Comparison Optimization:

Efficient algorithms for comparing hashes are implemented to swiftly identify altered records. This enhances load performance by quickly detecting which records have changed, allowing only those records to be processed and loaded.
  • Example:
    • Destination Table before Incremental Load:
      • Plain Text
        ID  Name       Age  Hash
        -------------------------
        1   Alice      30   h1a2b3c4
        2   Bob        25   d5e6f7g8
        
    • Source Table with Changes:
      • Plain Text
        ID  Name       Age  Hash
        -------------------------
        1   Alice      31   m2n3o4p5
        2   Bob        25   d5e6f7g8
        3   Charlie    35   i9j0k1l2
        
Process:
  1. Hash Calculation: Compute hash values for records in both the source and destination tables using a selected hash function.
  2. Hash Comparison: Compare the hash values of records in the source table with those in the destination table (see the query sketch after this list).
  3. Change Detection: Identify records with differing hash values, indicating modifications or new records, and records missing in the destination, indicating deletions.
  4. Incremental Load: Apply the detected changes (inserts, updates, deletes) to the destination table.
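A comparison query for steps 2-3 might look like the following sketch (illustrative table names source_table and destination_table; the hash function and delimiter depend on your database):
SQL
-- New or changed rows: missing from the destination, or hash differs
SELECT s.ID, s.Name, s.Age,
       MD5(CONCAT(s.Name, '|', s.Age)) AS NewHash
FROM source_table s
LEFT JOIN destination_table d ON s.ID = d.ID
WHERE d.ID IS NULL
   OR MD5(CONCAT(s.Name, '|', s.Age)) <> d.Hash;

-- Deleted rows: present in the destination but no longer in the source
SELECT d.ID
FROM destination_table d
LEFT JOIN source_table s ON d.ID = s.ID
WHERE s.ID IS NULL;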
Advantages:
  • Efficiency: Only records with changed hash values are processed, significantly reducing the data volume and load time.
  • Accuracy: Hash functions provide a reliable way to detect even minor changes in data, ensuring accurate updates.
  • Scalability: Suitable for large datasets, as hash comparisons are computationally efficient and can be parallelized.
Hash-based incremental loading is particularly beneficial in scenarios with large datasets and frequent updates, where performance and accuracy are critical. This method ensures efficient and reliable data synchronization while minimizing resource consumption.

Slowly Changing Dimensions (SCD):

Slowly Changing Dimensions (SCD) are a concept in data warehousing used to manage and track changes in dimension attributes over time. SCDs are essential for maintaining historical accuracy and enabling robust data analysis. Two key aspects of SCD are surrogate key management and temporal validity considerations.

Surrogate Key Management:

Incorporating surrogate keys is crucial for tracking and managing changes in dimension attributes. Surrogate keys are unique, non-business keys that are used to uniquely identify each record in a dimension table. They facilitate seamless tracking of historical changes and ensure data consistency.
  • Example:
    • Original Customer Dimension:
      • Plain Text
        CustomerID  Name       City
        ----------------------------
        1           Alice      New York
        2           Bob        Los Angeles
        
    • After a change in Alice's city:
      • Plain Text
        SurrogateKey  CustomerID  Name       City       StartDate   EndDate
        ----------------------------------------------------------------------
        1             1           Alice      New York   2023-01-01  2024-01-01
        2             1           Alice      Boston     2024-01-02  NULL
        3             2           Bob        Los Angeles 2023-01-01  NULL
        

Temporal Validity Considerations:

Implementing temporal validity models enables precise historical data querying and analysis. These models allow for the accurate representation of when specific changes to dimension attributes occurred, which is essential for auditing, trend analysis, and understanding the state of data at any given point in time.
  • Example:
    • Customer Dimension with Temporal Validity:
      • Plain Text
        SurrogateKey  CustomerID  Name       City       StartDate   EndDate
        ----------------------------------------------------------------------
        1             1           Alice      New York   2023-01-01  2024-01-01
        2             1           Alice      Boston     2024-01-02  NULL
        3             2           Bob        Los Angeles 2023-01-01  NULL
        
    • Query to find Alice's city as of 2023-12-31:
      • SQL
        SELECT Name, City
        FROM CustomerDimension
        WHERE CustomerID = 1
          AND '2023-12-31' BETWEEN StartDate AND COALESCE(EndDate, '9999-12-31');  -- COALESCE covers current rows (EndDate IS NULL)
        
      • Result:
        • Plain Text
          Name    City
          --------------
          Alice   New York
          

SCD Types:

  1. Type 1: Overwrites old data with new data, without maintaining any historical records.
      • Example:
        • Plain Text
          CustomerID  Name       City
          ----------------------------
          1           Alice      Boston
          2           Bob        Los Angeles
          
  2. Type 2: Maintains full historical records by adding new rows for changes.
      • Example:
        • Plain Text
          SurrogateKey  CustomerID  Name       City       StartDate   EndDate
          ----------------------------------------------------------------------
          1             1           Alice      New York   2023-01-01  2024-01-01
          2             1           Alice      Boston     2024-01-02  NULL
          
  3. Type 3: Tracks changes using additional columns to store previous values (an update sketch follows the example below).
      • Example:
        • Plain Text
          CustomerID  Name       CurrentCity   PreviousCity
          -----------------------------------------------
          1           Alice      Boston        New York
          2           Bob        Los Angeles   NULL
          
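A Type 3 change can be applied with a simple in-place update; a minimal sketch using the layout above (the table name is illustrative):
SQL
-- Shift the current value into the "previous" column, then store the new value
UPDATE CustomerDimensionType3
SET PreviousCity = CurrentCity,
    CurrentCity  = 'Boston'
WHERE CustomerID = 1;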
Advantages:
  • Surrogate Key Management: Ensures unique identification of records, facilitates tracking of changes, and maintains data consistency.
  • Temporal Validity: Enables precise historical querying, auditing, and trend analysis by capturing the start and end dates of each record's validity.
Slowly Changing Dimensions provide a structured approach to managing changing data over time, ensuring that historical data is accurately captured and easily queryable for various analytical needs.
 
Implementing Type 2 Slowly Changing Dimensions (SCD) involves creating new rows for each change in the dimension attributes, preserving historical data. Here are the steps and techniques to implement SCD Type 2:

Steps to Implement Type 2 SCD:

  1. Create the Dimension Table with SCD Type 2 Structure:
      • Include surrogate keys, natural keys, and the necessary attributes.
      • Add columns for StartDate, EndDate, and a current flag (IsCurrent).
  2. Initial Load:
      • Populate the dimension table with the initial dataset, setting the StartDate to the current date, EndDate to a high future date (e.g., '9999-12-31'), and IsCurrent to True.
  3. Detect Changes:
      • Compare the incoming dataset with the existing dimension table to identify new, updated, and unchanged records.
  4. Handle Inserts:
      • Insert new records directly into the dimension table with the current date as StartDate, high future date as EndDate, and IsCurrent as True.
  5. Handle Updates:
      • For records with changes, update the existing records by setting their EndDate to the current date and IsCurrent to False.
      • Insert the updated records as new rows with a new surrogate key, the current date as StartDate, high future date as EndDate, and IsCurrent as True.
  6. Handle Unchanged Records:
      • No action is required for unchanged records.
Example Implementation:
  1. Dimension Table Structure:
    1. SQL
      CREATE TABLE CustomerDimension (
          SurrogateKey INT PRIMARY KEY,
          CustomerID INT,
          Name VARCHAR(100),
          City VARCHAR(100),
          StartDate DATE,
          EndDate DATE,
          IsCurrent BOOLEAN
      );
      
  2. Initial Load:
    1. SQL
      INSERT INTO CustomerDimension (SurrogateKey, CustomerID, Name, City, StartDate, EndDate, IsCurrent)
      VALUES
          (1, 101, 'Alice', 'New York', '2023-01-01', '9999-12-31', TRUE),
          (2, 102, 'Bob', 'Los Angeles', '2023-01-01', '9999-12-31', TRUE);
      
  3. Detect Changes:
      • Assume you have an incoming dataset in a staging table CustomerStaging:
      SQL
      CREATE TABLE CustomerStaging (
          CustomerID INT,
          Name VARCHAR(100),
          City VARCHAR(100)
      );
      
      INSERT INTO CustomerStaging (CustomerID, Name, City)
      VALUES
          (101, 'Alice', 'Boston'),
          (103, 'Charlie', 'Chicago');
      
  4. Identify New and Updated Records:
    1. SQL
      -- New Records
      SELECT s.CustomerID, s.Name, s.City
      FROM CustomerStaging s
      LEFT JOIN CustomerDimension d
      ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
      WHERE d.CustomerID IS NULL;
      
      -- Updated Records
      SELECT s.CustomerID, s.Name, s.City
      FROM CustomerStaging s
      JOIN CustomerDimension d
      ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
      WHERE s.Name != d.Name OR s.City != d.City;
      
  5. Handle Inserts and Updates:
    1. SQL
      -- Update Existing Records (end current validity)
      UPDATE CustomerDimension
      SET EndDate = '2024-01-01', IsCurrent = FALSE
      WHERE IsCurrent = TRUE  -- close only the currently active row
        AND CustomerID IN (
          SELECT s.CustomerID
          FROM CustomerStaging s
          JOIN CustomerDimension d
          ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
          WHERE s.Name != d.Name OR s.City != d.City
      );
      
      -- Insert Updated Records
      INSERT INTO CustomerDimension (SurrogateKey, CustomerID, Name, City, StartDate, EndDate, IsCurrent)
      SELECT NEXTVAL('SurrogateKeySequence'), CustomerID, Name, City, '2024-01-01', '9999-12-31', TRUE
      FROM CustomerStaging
      WHERE CustomerID IN (
          SELECT s.CustomerID
          FROM CustomerStaging s
          JOIN CustomerDimension d
          ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
          WHERE s.Name != d.Name OR s.City != d.City
      );
      
      -- Insert New Records
      INSERT INTO CustomerDimension (SurrogateKey, CustomerID, Name, City, StartDate, EndDate, IsCurrent)
      SELECT NEXTVAL('SurrogateKeySequence'), CustomerID, Name, City, '2024-01-01', '9999-12-31', TRUE
      FROM CustomerStaging s
      LEFT JOIN CustomerDimension d
      ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
      WHERE d.CustomerID IS NULL;
      
Summary:
  • Initial Load: Populate the dimension table with the initial data.
  • Detect Changes: Identify new and updated records by comparing the incoming dataset with the existing data.
  • Handle Inserts: Insert new records directly.
  • Handle Updates: End the validity of existing records by updating their EndDate and IsCurrent, then insert the updated records as new rows.
  • Unchanged Records: No action is required.
By following these steps, you can effectively implement Type 2 SCD, ensuring that historical changes are tracked and managed accurately.
When a dimension has a large number of columns, tracking changes for SCD Type 2 can become cumbersome. One effective technique for handling this is to use a hash column. Here's how you can implement this approach:

Using a Hash Column to Track Changes

  1. Add a Hash Column:
      • Create a new column in your dimension table to store hash values generated from the relevant columns.
  2. Generate Hash Values:
      • Calculate the hash value based on the values of the columns you want to track. This can be done using a suitable hash function (e.g., SHA256).
  3. Compare Hash Values:
      • Compare the hash value from the incoming dataset with the existing hash value in the dimension table to detect changes.
Implementation Steps:
  1. Modify Dimension Table:
      • Add a hash column to the dimension table.
      SQL
      CREATE TABLE CustomerDimension (
          SurrogateKey INT PRIMARY KEY,
          CustomerID INT,
          Name VARCHAR(100),
          City VARCHAR(100),
          Age INT,
          Email VARCHAR(100),
          StartDate DATE,
          EndDate DATE,
          IsCurrent BOOLEAN,
          HashValue VARCHAR(64)  -- Add a column for the hash value
      );
      
  2. Initial Load with Hash Calculation:
      • Populate the dimension table and calculate the hash value for each record.
      SQL
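      -- Note: SHA256(...) below stands in for your database's hash function (e.g., SHA2(expr, 256) in MySQL);
      -- concatenating with an explicit delimiter between columns helps avoid accidental collisions.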
      INSERT INTO CustomerDimension (SurrogateKey, CustomerID, Name, City, Age, Email, StartDate, EndDate, IsCurrent, HashValue)
      VALUES
          (1, 101, 'Alice', 'New York', 30, '[email protected]', '2023-01-01', '9999-12-31', TRUE, SHA256(CONCAT('Alice', 'New York', 30, '[email protected]'))),
          (2, 102, 'Bob', 'Los Angeles', 25, '[email protected]', '2023-01-01', '9999-12-31', TRUE, SHA256(CONCAT('Bob', 'Los Angeles', 25, '[email protected]')));
      
  3. Detect Changes Using Hash Values:
      • Calculate the hash value for the incoming dataset and compare it with the existing hash values in the dimension table.
      SQL
      -- Assume you have an incoming dataset in a staging table CustomerStaging
      CREATE TABLE CustomerStaging (
          CustomerID INT,
          Name VARCHAR(100),
          City VARCHAR(100),
          Age INT,
          Email VARCHAR(100)
      );
      
      INSERT INTO CustomerStaging (CustomerID, Name, City, Age, Email)
      VALUES
          (101, 'Alice', 'Boston', 31, '[email protected]'),
          (103, 'Charlie', 'Chicago', 28, '[email protected]');
      
      -- Detect changes using hash comparison
      SELECT s.CustomerID, s.Name, s.City, s.Age, s.Email, SHA256(CONCAT(s.Name, s.City, s.Age, s.Email)) AS NewHash
      FROM CustomerStaging s
      JOIN CustomerDimension d
      ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
      WHERE SHA256(CONCAT(s.Name, s.City, s.Age, s.Email)) != d.HashValue;
      
  4. Handle Inserts and Updates:
      • For records with different hash values, end the validity of the existing records and insert the updated records as new rows.
      SQL
      -- Update Existing Records (end current validity)
      UPDATE CustomerDimension
      SET EndDate = '2024-01-01', IsCurrent = FALSE
      WHERE IsCurrent = TRUE  -- close only the currently active row
        AND CustomerID IN (
          SELECT s.CustomerID
          FROM CustomerStaging s
          JOIN CustomerDimension d
          ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
          WHERE SHA256(CONCAT(s.Name, s.City, s.Age, s.Email)) != d.HashValue
      );
      
      -- Insert Updated Records
      INSERT INTO CustomerDimension (SurrogateKey, CustomerID, Name, City, Age, Email, StartDate, EndDate, IsCurrent, HashValue)
      SELECT NEXTVAL('SurrogateKeySequence'), CustomerID, Name, City, Age, Email, '2024-01-01', '9999-12-31', TRUE, SHA256(CONCAT(Name, City, Age, Email))
      FROM CustomerStaging
      WHERE CustomerID IN (
          SELECT s.CustomerID
          FROM CustomerStaging s
          JOIN CustomerDimension d
          ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
          WHERE SHA256(CONCAT(s.Name, s.City, s.Age, s.Email)) != d.HashValue
      );
      
      -- Insert New Records
      INSERT INTO CustomerDimension (SurrogateKey, CustomerID, Name, City, Age, Email, StartDate, EndDate, IsCurrent, HashValue)
      SELECT NEXTVAL('SurrogateKeySequence'), CustomerID, Name, City, Age, Email, '2024-01-01', '9999-12-31', TRUE, SHA256(CONCAT(Name, City, Age, Email))
      FROM CustomerStaging s
      LEFT JOIN CustomerDimension d
      ON s.CustomerID = d.CustomerID AND d.IsCurrent = TRUE
      WHERE d.CustomerID IS NULL;
      
Summary:
  • Modify Dimension Table: Add a hash column to store hash values.
  • Initial Load: Populate the dimension table and calculate hash values for each record.
  • Detect Changes: Calculate and compare hash values between incoming and existing records.
  • Handle Inserts and Updates: End the validity of existing records with different hash values and insert new records with updated hash values.
Using hash columns allows for efficient tracking of changes across a large number of columns, ensuring data integrity and optimizing the SCD Type 2 process.

Change Data Capture (CDC):

Change Data Capture (CDC) is a technique used to identify and capture changes made to data in a database. This method enables real-time or near-real-time data integration, ensuring that downstream systems are updated with the latest data changes. Implementing CDC involves selecting appropriate capture methods and optimizing the change set generation process.

Capture Method Selection:

When implementing CDC, it's crucial to evaluate different capture methods based on the source system architecture and data change patterns. The two primary capture methods are trigger-based and log-based CDC.
  1. Trigger-Based CDC:
      • Description: Uses database triggers to capture changes. Triggers are fired automatically when INSERT, UPDATE, or DELETE operations occur.
      • Advantages:
        • Immediate change capture.
        • Easy to implement in systems where database triggers are supported.
      • Disadvantages:
        • Can introduce overhead and impact database performance.
        • Complex to manage in systems with high transaction volumes.
      • Example:
        • SQL
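          -- Illustrative Oracle-style row trigger (INSERTING/UPDATING/DELETING predicates, :NEW/:OLD, SYSDATE);
          -- trigger syntax differs in other databases.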
          CREATE TRIGGER capture_changes
          AFTER INSERT OR UPDATE OR DELETE ON Customer
          FOR EACH ROW
          BEGIN
              IF INSERTING THEN
                  INSERT INTO ChangeLog (Operation, CustomerID, Name, City, ChangeTime)
                  VALUES ('INSERT', :NEW.CustomerID, :NEW.Name, :NEW.City, SYSDATE);
              ELSIF UPDATING THEN
                  INSERT INTO ChangeLog (Operation, CustomerID, Name, City, ChangeTime)
                  VALUES ('UPDATE', :NEW.CustomerID, :NEW.Name, :NEW.City, SYSDATE);
              ELSIF DELETING THEN
                  INSERT INTO ChangeLog (Operation, CustomerID, Name, City, ChangeTime)
                  VALUES ('DELETE', :OLD.CustomerID, :OLD.Name, :OLD.City, SYSDATE);
              END IF;
          END;
          
  2. Log-Based CDC:
      • Description: Reads the database transaction log to capture changes. This method is non-intrusive as it doesn't require changes to the database schema or triggers.
      • Advantages:
        • Minimal impact on database performance.
        • Suitable for high-transaction environments.
      • Disadvantages:
        • Requires access to transaction logs and possibly specific database privileges.
        • Can be complex to implement and configure.
      • Example: Using a tool like Debezium to capture changes from MySQL binlog.
        • JSON
          {
              "name": "inventory-connector",
              "config": {
                  "connector.class": "io.debezium.connector.mysql.MySqlConnector",
                  "tasks.max": "1",
                  "database.hostname": "localhost",
                  "database.port": "3306",
                  "database.user": "debezium",
                  "database.password": "dbz",
                  "database.server.id": "184054",
                  "database.server.name": "dbserver1",
                  "database.include.schema.changes": "true",
                  "database.history.kafka.bootstrap.servers": "kafka:9092",
                  "database.history.kafka.topic": "schema-changes.inventory"
              }
          }
          

Change Set Optimization:

Optimizing the change set generation process is essential to minimize overhead and latency, enhancing CDC performance and scalability.
  1. Batch Processing:
      • Description: Instead of processing each change individually, changes are grouped into batches to reduce the number of operations and improve efficiency.
      • Example:
        • SQL
          -- Periodically move changes from the log table to the target system
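          -- LAST_PROCESSED_TIME is a placeholder for a stored watermark (e.g., read from a metadata table)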
          INSERT INTO TargetTable (CustomerID, Name, City, ChangeTime)
          SELECT CustomerID, Name, City, ChangeTime
          FROM ChangeLog
          WHERE ChangeTime > LAST_PROCESSED_TIME
          ORDER BY ChangeTime
          LIMIT 1000;
          
  2. Change Data Compression:
      • Description: Compress changes by merging multiple changes for the same record within a short time window to reduce the number of operations.
      • Example:
        • SQL
          -- Compress changes by keeping only the latest change for each record within a time window
          WITH LatestChanges AS (
              SELECT CustomerID, Name, City, ChangeTime,
                     ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY ChangeTime DESC) AS rn
              FROM ChangeLog
              WHERE ChangeTime > LAST_PROCESSED_TIME
          )
          INSERT INTO TargetTable (CustomerID, Name, City, ChangeTime)
          SELECT CustomerID, Name, City, ChangeTime
          FROM LatestChanges
          WHERE rn = 1;
          
  3. Change Filtering:
      • Description: Filter out unnecessary changes to reduce the volume of data processed. For example, ignore changes to columns that are not relevant to downstream systems.
      • Example:
        • SQL
          -- Only propagate log entries where the relevant columns differ from the target
          INSERT INTO TargetTable (CustomerID, Name, City, ChangeTime)
          SELECT c.CustomerID, c.Name, c.City, c.ChangeTime
          FROM ChangeLog c
          LEFT JOIN TargetTable t ON c.CustomerID = t.CustomerID
          WHERE c.ChangeTime > LAST_PROCESSED_TIME
            AND (t.CustomerID IS NULL OR c.Name != t.Name OR c.City != t.City);
          
Summary:
Capture Method Selection:
  • Trigger-Based CDC: Suitable for immediate change capture but can impact performance.
  • Log-Based CDC: Minimal performance impact, suitable for high-transaction environments, but may require complex setup.
Change Set Optimization:
  • Batch Processing: Group changes into batches to reduce operations.
  • Change Data Compression: Merge multiple changes for the same record within a short time window.
  • Change Filtering: Filter out unnecessary changes to reduce data volume.
By carefully selecting the appropriate CDC method and optimizing change set generation, you can achieve efficient, scalable, and low-latency data integration.
 
Another approach is journal-based Change Data Capture (CDC). This method captures data changes by utilizing a journal or log maintained by the database or application. The journal records all modifications to data, such as inserts, updates, and deletes, providing an efficient and reliable means of tracking changes.

Journal-Based CDC:

Concept: Journal-based CDC leverages a journal or log, which is a sequential record of changes made to the database. This method can capture changes at a granular level, including detailed information about each transaction, such as the type of operation, affected data, and timestamps. It is especially useful in systems where a robust logging mechanism is already in place.
Key Features:
  1. Granular Change Capture:
      • Captures detailed information about each data change, including the type of operation (insert, update, delete), the affected rows, and the specific columns that were changed.
  2. Historical Data Maintenance:
      • Maintains a complete history of all changes, allowing for precise data recovery, auditing, and historical analysis.
  3. Non-Intrusive:
      • Operates independently of the main application, minimizing the impact on performance and reducing the need for intrusive changes to the existing system.
Implementation Steps:
  1. Set Up Journal Logging:
      • Ensure that the database or application is configured to maintain a detailed journal or log of all data changes.
  2. Extract Changes from the Journal:
      • Develop a process to periodically extract changes from the journal. This can involve reading the journal entries, filtering out irrelevant changes, and preparing the changes for further processing.
  3. Apply Changes to the Target System:
      • Apply the extracted changes to the target system, ensuring that they are processed in the correct order to maintain data consistency and integrity.
Example Implementation:
Step 1: Set Up Journal Logging
  • Configure the database to enable journaling (this will vary depending on the database system).
Step 2: Extract Changes from the Journal
  • Periodically read the journal to extract changes. This can be done using a script or an ETL tool.
SQL
-- Example: Extract changes from the journal
SELECT JournalID, OperationType, CustomerID, Name, City, ChangeTime
FROM Journal
WHERE ChangeTime > LAST_PROCESSED_TIME
ORDER BY ChangeTime;
Step 3: Apply Changes to the Target System
  • Apply the extracted changes to the target system. This can involve inserting new records, updating existing ones, or deleting records as needed.
SQL
-- Example: Apply changes to the target system
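-- (Illustrative append-only apply; a complete implementation would branch on OperationType,
--  e.g. updating or deleting existing rows via MERGE/DELETE.)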
INSERT INTO TargetTable (JournalID, CustomerID, Name, City, ChangeTime, OperationType)
SELECT JournalID, CustomerID, Name, City, ChangeTime, OperationType
FROM ExtractedChanges;
Step 4: Update the Last Processed Time
  • After processing the changes, update the LAST_PROCESSED_TIME to ensure that the next extraction starts from the correct point.
SQL
-- Update the last processed time
UPDATE MetadataTable
SET LAST_PROCESSED_TIME = (SELECT MAX(ChangeTime) FROM ExtractedChanges);
Advantages:
  1. Accuracy: Provides precise and detailed change tracking, ensuring that all changes are captured and applied correctly.
  2. Performance: Minimizes impact on the main application by operating independently of the main data processing activities.
  3. Historical Data: Maintains a complete history of changes, useful for auditing, recovery, and historical analysis.
Summary:
Journal-based CDC is a powerful method for tracking changes in a database. By leveraging detailed journal logs, it provides granular, accurate, and efficient change tracking with minimal impact on the main application. This method is particularly useful in environments where a robust logging mechanism is already in place, allowing for efficient and reliable data integration and historical analysis.
Advanced CDC Techniques:
  1. Hybrid CDC:
      • Concept: Combines multiple CDC methods (e.g., log-based and trigger-based) to provide more robust and comprehensive change capture.
      • Implementation: Use log-based CDC for capturing high-volume changes and trigger-based CDC for capturing complex transformations or business logic changes.
  2. Event-Based CDC:
      • Concept: Uses an event-driven architecture where changes are captured and published as events to a message queue or event stream (e.g., Apache Kafka).
      • Implementation: Configure the database to publish changes to an event stream, and then consume those events to update the target system.
      • Example Tools: Debezium with Kafka, AWS DMS with Kinesis.
 

Other Relevant Concepts:

  1. Delta Lake:
      • Concept: A storage layer that brings ACID transactions to big data workloads, enabling incremental data processing and ensuring data reliability.
      • Implementation: Use Delta Lake on top of your existing data lake (e.g., S3, HDFS) to manage incremental data efficiently.
      • Example:
      Python
      from delta.tables import DeltaTable
      
      delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")
      # Upsert a DataFrame of incoming changes (source_df, illustrative) into the Delta table
      delta_table.alias("target").merge(
          source_df.alias("source"),
          "target.id = source.id"
      ).whenMatchedUpdate(set={"value": "source.value"}) \
       .whenNotMatchedInsertAll() \
       .execute()
      
      
      Delta Lake and Delta Table Architecture and Concepts
      Delta Lake is an open-source storage layer that brings reliability, performance, and ACID (Atomicity, Consistency, Isolation, Durability) transactions to data lakes. It builds on top of existing storage solutions (such as Amazon S3, Azure Data Lake Storage, or HDFS) and enables scalable and reliable data management and processing.

      Key Concepts of Delta Lake:

    1. ACID Transactions:
        • Atomicity: Ensures that a series of operations either all succeed or all fail, preventing partial updates.
        • Consistency: Ensures that the database remains in a valid state before and after a transaction.
        • Isolation: Ensures that concurrently running transactions do not interfere with each other.
        • Durability: Ensures that once a transaction is committed, it will remain so, even in the case of a system failure.
    2. Schema Enforcement and Evolution:
        • Schema Enforcement: Ensures that data being written to a table adheres to the specified schema, preventing data corruption.
        • Schema Evolution: Allows for changes in the schema (like adding new columns) without impacting existing data or requiring a complete rewrite.
    3. Unified Batch and Streaming:
        • Delta Lake can handle both batch and streaming data, allowing for real-time data processing alongside historical batch data.
    4. Time Travel:
        • Provides the ability to query previous versions of the data, enabling easy rollbacks, auditing, and historical analysis.
    5. Efficient Upserts and Deletes:
        • Delta Lake supports efficient merge operations (upserts) and deletes, making it easier to manage slowly changing dimensions and other data update requirements.

      Delta Lake Architecture:

    1. Delta Table:
        • A Delta Table is a table stored in the Delta Lake format. It consists of Parquet files stored on disk along with a transaction log.
    2. Transaction Log:
        • The transaction log (a series of JSON files) keeps track of all the changes (commits) made to the Delta Table, enabling ACID transactions. Each transaction is recorded as a new JSON file in the _delta_log directory.

      How Delta Lake Works:

    1. Writing Data:
        • When data is written to a Delta Table, Delta Lake performs schema enforcement and records the operation in the transaction log.
        • Example using PySpark:
          • Python
            from pyspark.sql import SparkSession
            
            spark = SparkSession.builder \
                .appName("DeltaLakeExample") \
                .getOrCreate()
            
            data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
            
            data.write.format("delta").save("/path/to/delta-table")
            
    2. Reading Data:
        • When data is read from a Delta Table, Delta Lake uses the transaction log to provide a consistent view of the data.
        • Example using PySpark:
          • Python
            df = spark.read.format("delta").load("/path/to/delta-table")
            df.show()
            
    3. Updating Data:
        • Delta Lake supports update, delete, and merge (upsert) operations.
        • Example using PySpark:
          • Python
            from delta.tables import DeltaTable
            
            delta_table = DeltaTable.forPath(spark, "/path/to/delta-table")
            
            # Update
            delta_table.update(
                condition="id == 1",
                set={"name": "'Alice_updated'"}
            )
            
            # Delete
            delta_table.delete(condition="id == 2")
            
            # Merge (Upsert)
            new_data = spark.createDataFrame([(1, "Alice_new"), (3, "Charlie")], ["id", "name"])
            delta_table.alias("old_data").merge(
                new_data.alias("new_data"),
                "old_data.id = new_data.id"
            ).whenMatchedUpdate(set={"name": "new_data.name"}) \
             .whenNotMatchedInsert(values={"id": "new_data.id", "name": "new_data.name"}) \
             .execute()
            
    4. Time Travel:
        • Query previous versions of the Delta Table by specifying a timestamp or version number.
        • Example using PySpark:
          • Python
            df_version = spark.read.format("delta").option("versionAsOf", 0).load("/path/to/delta-table")
            df_timestamp = spark.read.format("delta").option("timestampAsOf", "2024-01-01").load("/path/to/delta-table")
            
      Summary:
      Delta Lake enhances traditional data lakes by adding a robust transactional layer that provides ACID transactions, schema enforcement and evolution, unified batch and streaming capabilities, time travel, and efficient upserts and deletes. Its architecture leverages a transaction log to ensure data reliability and consistency, making it a powerful solution for scalable and reliable data management and processing.
  2. Data Versioning:
      • Concept: Maintaining different versions of data to track historical changes and enable rollbacks.
      • Implementation: Use versioned data storage systems or frameworks (e.g., Apache Iceberg, Delta Lake).
      • Example:
        • SQL
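          -- Time-travel query using Delta Lake SQL syntax; the exact syntax depends on the table format/engine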
          SELECT * FROM my_table VERSION AS OF 5;
          
  3. Change Propagation:
      • Concept: Ensuring that changes captured from the source are propagated to multiple downstream systems or microservices.
      • Implementation: Use a CDC tool to capture changes and then publish those changes to a message broker (e.g., Kafka, RabbitMQ) for consumption by multiple systems.
  4. Data Lineage:
      • Concept: Tracking the origin and flow of data through various transformations and processes in the data pipeline.
      • Implementation: Use data lineage tools and frameworks (e.g., Apache Atlas, DataHub) to automatically capture and visualize data lineage.
      • Example:
        • YAML
          lineage:
            - source: SourceTable
            - target: TargetTable
            - transformations:
              - type: SQL
                script: "SELECT * FROM SourceTable WHERE LastModified > '2024-01-01 00:00:00'"
          
Summary:
While traditional incremental load methods such as timestamp-based and primary key-based approaches are commonly used, there are numerous advanced techniques and concepts that can enhance data integration processes. These include hybrid CDC, event-based CDC, Delta Lake for ACID transactions, data versioning, change propagation, and data lineage. Each method and concept has its own advantages and use cases, providing a comprehensive toolkit for handling incremental data loads and ensuring efficient, reliable, and scalable data integration.