Modern Table Formats: Iceberg, Delta Lake, and Hudi

Data Lakes made it possible to store raw data at scale, but they lacked the reliability and governance of data warehouses. Files could be dropped into storage (S3, HDFS, MinIO), but analysts struggled with schema changes, updates, and deletes.

To solve these issues, the community created modern table formats that brought ACID transactions, schema evolution, and time travel to Data Lakes. The three leading projects are Apache Iceberg, Delta Lake, and Apache Hudi.


Why Table Formats Matter

Without a table format, a Data Lake is just a collection of files (e.g., Parquet, ORC). Problems include:

  • No atomic operations (partial writes could corrupt data).
  • Schema changes break queries.
  • No standard way to handle deletes or updates.
  • Hard to manage snapshots or roll back to an earlier state.

Table formats add a metadata layer on top of files, transforming raw object storage into a Lakehouse.
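
As a concrete illustration, the statements below sketch what a table format makes possible on lake storage; they assume a hypothetical table named sales managed by one of these formats (on a bare Parquet directory, the same operations would require rewriting files by hand):

-- Row-level delete, committed atomically by the table format
DELETE FROM sales WHERE order_id = 1001;

-- Schema evolution as a metadata-only change, with no data rewrite
ALTER TABLE sales ADD COLUMN discount DOUBLE;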


Apache Iceberg

  • Origin: Created at Netflix, donated to the Apache Software Foundation.
  • Goal: Replace Hive tables with a scalable, open table format.
  • Key Features:
    • Full ACID transactions.
    • Hidden partitioning (queries don't need to know the physical folder layout).
    • Schema evolution without rewriting data.
    • Time travel (query past snapshots).
    • Strong integration with Trino, Spark, Flink.

Example query in Trino:

SELECT *
FROM sales FOR VERSION AS OF 123456789;
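
Hidden partitioning and schema evolution can also be expressed directly in DDL. A minimal sketch for Trino's Iceberg connector, assuming a hypothetical sales table with an order_ts timestamp column:

-- Iceberg applies the month(order_ts) transform internally;
-- queries never need to reference a partition folder.
CREATE TABLE sales (
    order_id BIGINT,
    amount DOUBLE,
    order_ts TIMESTAMP
)
WITH (partitioning = ARRAY['month(order_ts)']);

-- Adding a column is a metadata-only change.
ALTER TABLE sales ADD COLUMN customer_id BIGINT;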

Delta Lake

  • Origin: Created by Databricks.
  • Goal: Bring warehouse reliability to Data Lakes with tight Spark integration.
  • Key Features:
    • ACID transactions using a transaction log (_delta_log).
    • Schema enforcement and evolution.
    • Time travel via versioned commits in the transaction log.
    • Optimized for Apache Spark.
  • Ecosystem: Open source (now hosted by the Linux Foundation), with Databricks providing additional enterprise features.

Example query in Spark SQL:

SELECT * FROM sales VERSION AS OF 42;
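
The same transaction log powers row-level upserts. A minimal sketch in Spark SQL, assuming a hypothetical staging table sales_updates keyed by order_id:

-- Upsert staged rows into the Delta table in a single atomic commit
MERGE INTO sales AS target
USING sales_updates AS source
ON target.order_id = source.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;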

Apache Hudi

  • Origin: Created at Uber.
  • Goal: Optimize for streaming ingestion and incremental data processing.
  • Key Features:
    • ACID transactions.
    • Two table types: Copy-on-Write (COW) and Merge-on-Read (MOR).
    • Built-in support for upserts and deletes.
    • Incremental queries for near-real-time analytics.
    • Strong integration with Spark and Flink.

Example incremental query, filtering on Hudi's commit-time metadata column:

SELECT *
FROM sales
WHERE _hoodie_commit_time > '20250901';
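
The choice between Copy-on-Write and Merge-on-Read is made when the table is created. A sketch using Hudi's Spark SQL support, with hypothetical table and column names:

-- Merge-on-Read table keyed by order_id; ts breaks ties between
-- duplicate records during upserts and compaction.
CREATE TABLE sales (
    order_id BIGINT,
    amount DOUBLE,
    ts TIMESTAMP
) USING hudi
TBLPROPERTIES (
    type = 'mor',
    primaryKey = 'order_id',
    preCombineField = 'ts'
);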

Comparing Iceberg, Delta, and Hudi

Feature           | Iceberg              | Delta Lake          | Hudi
Origin            | Netflix / Apache     | Databricks          | Uber / Apache
Transaction model | ACID, snapshots      | ACID, delta log     | ACID, commit log
Best for          | Batch + Interactive  | Spark-centric BI    | Streaming + upserts
Schema evolution  | Yes (flexible)       | Yes                 | Yes
Time travel       | Yes                  | Yes                 | Limited
Engine support    | Trino, Spark, Flink  | Spark (best), Trino | Spark, Flink

How They Fit in the Lakehouse

  • Iceberg: Best for open, multi-engine environments (Trino, Spark, Flink).
  • Delta Lake: Strongest in Databricks/Spark ecosystems.
  • Hudi: Best fit for real-time ingestion and incremental pipelines.

All three bring warehouse-like reliability to Data Lakes, enabling the Lakehouse model.


Conclusion

Modern table formats are the foundation of the Lakehouse. By adding ACID transactions, schema evolution, and time travel, they turn raw storage (S3, MinIO, HDFS) into reliable analytical platforms.

  • Iceberg: open and engine-agnostic.
  • Delta Lake: Spark-first, Databricks-friendly.
  • Hudi: optimized for streaming and upserts.

No matter which you choose, table formats are the key to bridging the gap between lakes and warehouses.