Apache Iceberg Architecture – What Is It?

What Is Apache Iceberg?

Apache Iceberg is an open table format specifically designed for handling massive analytical datasets within data lakes, adding a schema-aware, transactional cataloging system directly on top of file-based storage. This allows organizations to treat their storage as tables, not just files, enabling capabilities like atomic commits, time travel, and seamless schema evolution.
Modern data platforms need a way to store and manage massive datasets—petabytes or more—across cloud object storage like S3, GCS, or Azure Data Lake. While traditional file formats such as Parquet, ORC, or Avro help with efficient storage and querying, they don’t solve the harder problem of table management: schema evolution, concurrent writes, deletes, and time travel.
This is where table formats like Apache Iceberg come in.

Table Formats vs. File Formats

  • File formats (e.g., Parquet, ORC, Avro):
    • Define how individual files store data.
    • Optimized for compression, encoding, and efficient columnar reads.
    • But they lack metadata management at the dataset (table) level.
  • Table formats (e.g., Iceberg, Delta Lake, Hudi):
    • Manage collections of files as a single logical table.
    • Provide transactional guarantees (ACID).
    • Support schema evolution, partitioning, and time travel queries.
    • Handle concurrent reads/writes in distributed systems.
In short, file formats store data efficiently, while table formats manage data intelligently.

What Is Apache Iceberg?

Apache Iceberg is an open table format designed for analytic datasets in a lakehouse environment. Originally created at Netflix and now an Apache Software Foundation project, Iceberg provides:
  • High-performance metadata management for large datasets.
  • ACID transactions without relying on a central server.
  • Hidden partitioning to simplify queries.
  • Schema evolution with full backward/forward compatibility.
  • Time travel to query historical versions of data.
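
To make this concrete, here is a minimal sketch of creating an Iceberg table with Spark SQL. It assumes a Spark session with the Iceberg runtime and a catalog configured as demo; the table and column names are purely illustrative.

SQL
-- Create an Iceberg table with hidden partitioning on the event timestamp.
-- The days(ts) transform is maintained by Iceberg; queries never reference it.
CREATE TABLE demo.db.events (
    id      BIGINT,
    level   STRING,
    message STRING,
    ts      TIMESTAMP
)
USING iceberg
PARTITIONED BY (days(ts));

-- A filter on ts is pruned to the matching daily partitions automatically.
SELECT count(*) FROM demo.db.events
WHERE ts >= TIMESTAMP '2025-01-01 00:00:00';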

Architecture Overview

Iceberg’s architecture has three key layers:
  1. Metadata Files
      • Store table-level metadata (schema, partitioning, snapshot history).
      • Point to manifest lists, which point to manifest files, which point to the actual data files.
  2. Manifest Files
      • Contain metadata about which data files belong to the table.
      • Track file-level stats (row counts, min/max values, null counts) for query pruning.
  3. Data Files
      • The actual data stored in Parquet, ORC, or Avro format inside object storage.
This hierarchical structure means queries can skip scanning unnecessary files, improving efficiency at scale.
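
One way to see these layers is through the metadata tables that Iceberg's Spark integration exposes alongside each table. The sketch below reuses the illustrative demo.db.events table; exact column sets can vary between Iceberg versions.

SQL
-- Snapshot history (table-level metadata).
SELECT snapshot_id, committed_at, operation FROM demo.db.events.snapshots;

-- Manifest files tracked by the current snapshot.
SELECT path, added_data_files_count FROM demo.db.events.manifests;

-- Data files with per-file statistics used for pruning.
SELECT file_path, record_count FROM demo.db.events.files;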

How Iceberg Works

  1. Write Operations
      • New data is written to Parquet/ORC files.
      • Manifest files are updated with pointers to the new files.
      • A new snapshot is created in the metadata.
  2. Read Operations
      • Query engines (e.g., Spark, Trino, Flink, Snowflake) read table metadata.
      • Based on filters, Iceberg prunes out irrelevant data files before scanning.
      • Readers only scan necessary files, reducing I/O and costs.
  3. Schema Evolution
      • Columns can be added, removed, or renamed without breaking queries.
      • Iceberg tracks column IDs instead of names for backward compatibility.
  4. Time Travel
      • Each snapshot is immutable.
      • Users can query historical states of the table (e.g., TIMESTAMP AS OF), as shown in the example below.
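
For illustration, time travel in Spark SQL (3.3+) looks roughly like this; the timestamp and snapshot ID are placeholders, and other engines use slightly different syntax.

SQL
-- Read the table as it existed at a point in time.
SELECT * FROM demo.db.events TIMESTAMP AS OF '2025-06-01 00:00:00';

-- Read a specific immutable snapshot by its ID.
SELECT * FROM demo.db.events VERSION AS OF 8781234567890123456;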

Key Features of Apache Iceberg

  • ACID Transactions → Ensures data consistency in distributed environments.
  • Hidden Partitioning → Automatically manages partitions; users don’t need to specify them in queries.
  • Schema Evolution → Add, drop, rename, or reorder columns without rewriting data.
  • Snapshot Isolation & Time Travel → Roll back to older snapshots or reproduce past results.
  • Pluggable Engines → Works with Spark, Trino, Flink, Presto, Snowflake, and more.
  • Scalable Metadata → Efficiently handles petabyte-scale tables with millions of files.
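
As a quick illustration of schema evolution: the statements below are metadata-only changes and assume Spark with Iceberg's SQL extensions enabled, again using the illustrative demo.db.events table.

SQL
-- Each statement updates table metadata only; existing data files are not rewritten.
ALTER TABLE demo.db.events ADD COLUMN severity INT;
ALTER TABLE demo.db.events RENAME COLUMN message TO body;
ALTER TABLE demo.db.events DROP COLUMN level;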

How Updates Work in Iceberg

When you run an UPDATE (say with Spark SQL or Flink):
  1. Identify Affected Rows
      • The query engine reads the table metadata (manifests) to find which data files contain the rows that need updating.
  2. Write New Data Files
      • Instead of modifying files directly, Iceberg creates new data files with the updated rows.
      • Old rows are marked as deleted using delete files.
  3. Update Metadata (Snapshot)
      • Iceberg commits a new snapshot whose manifests point to the new files and logically remove the outdated ones.
      • Old snapshots are still available for time travel.
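
A minimal sketch of such an update in Spark SQL, using the same illustrative table:

SQL
-- Iceberg rewrites only the data files containing matching rows (or writes
-- delete files in merge-on-read mode) and then commits a new snapshot.
UPDATE demo.db.events
SET severity = 1
WHERE id = 123;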

How Merges Work in Iceberg

A MERGE INTO operation (similar to SQL’s merge) usually involves upserts (insert + update):
  1. Join Source and Target
      • Iceberg performs a join between the source dataset (e.g., a staging table) and the target table.
  2. Row-Level Actions
      • For each matched row → update/delete depending on the conditions.
      • For unmatched rows → insert new records.
  3. File Rewrites
      • Like UPDATE, affected files are rewritten, and new snapshots are committed.
      • Depending on the execution engine, this can use row-level delete files.

Delete Files and Position Deletes

Iceberg introduces the concept of delete files to avoid rewriting huge Parquet/ORC files when only a few rows change:
  • Equality Deletes → Specify rows to delete based on column values (e.g., WHERE id=123).
  • Position Deletes → Specify row positions in a file to be deleted.
This makes MERGE and UPDATE more efficient, since only delete markers and new rows are written rather than rewriting full files.
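
For example, a row-level delete in Spark SQL; with merge-on-read write modes (see the next section), this typically produces a small delete file instead of rewriting the Parquet file that holds the row.

SQL
-- In merge-on-read mode this writes a delete file recording the removed row;
-- in copy-on-write mode the containing data file would be rewritten instead.
DELETE FROM demo.db.events WHERE id = 123;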

Execution Modes

  • Copy-on-Write (COW):
    • Rewrites entire data files with updated rows.
    • Slower for frequent updates but faster reads (clean data files).
  • Merge-on-Read (MOR):
    • Writes separate delete files and new inserts.
    • Readers merge base + deletes at query time.
    • Faster writes, but slightly slower reads.
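
These modes can be chosen per operation through table properties. The property names below are Iceberg's documented write-mode settings, but defaults depend on the Iceberg and table format version in use, so treat this as a sketch.

SQL
-- Switch row-level DELETE, UPDATE, and MERGE to merge-on-read.
ALTER TABLE demo.db.events SET TBLPROPERTIES (
    'write.delete.mode' = 'merge-on-read',
    'write.update.mode' = 'merge-on-read',
    'write.merge.mode'  = 'merge-on-read'
);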

Example: MERGE in Iceberg with Spark SQL

SQL
MERGE INTO target t
USING source s
ON t.id = s.id
WHEN MATCHED THEN UPDATE SET t.value = s.value
WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value);
  • Iceberg commits a new metadata snapshot.
  • Only the affected data files are rewritten (or delete files added).
  • Old snapshots remain available for rollback, as sketched below.
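
Rolling back is itself a metadata operation. A sketch using the snapshot-management procedure shipped with Iceberg's Spark integration (the catalog name and snapshot ID are placeholders):

SQL
-- Point the table's current state back at an earlier snapshot.
CALL demo.system.rollback_to_snapshot('db.events', 8781234567890123456);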

Why Iceberg Matters

Apache Iceberg provides the transactional foundation for a modern lakehouse. It allows organizations to:
  • Replace rigid data warehouses with flexible cloud-native storage.
  • Enable consistent data access across multiple compute engines.
  • Support streaming + batch workloads in one unified architecture.
  • Reduce operational complexity with open standards.