Comparing Open Table Formats: Iceberg, Delta, and Hudi
Evaluate the trade-offs between the major transactional storage layers that provide ACID guarantees on top of cloud object storage.
The Evolution of Data Storage Paradigms
Data engineering has historically been divided between two distinct architectures: the structured data warehouse and the flexible data lake. Data warehouses like Snowflake or Redshift provide high performance and ACID (Atomicity, Consistency, Isolation, Durability) guarantees but often become cost-prohibitive as data volumes grow into the petabyte range. They also struggle with unstructured data types such as images or raw sensor logs that modern machine learning workflows require.
Data lakes built on cloud object storage like Amazon S3 or Azure Data Lake Storage offer nearly infinite scalability at a fraction of the cost. However, these lakes often turn into data swamps because they lack a formal schema enforcement mechanism and transactional integrity. Without a way to manage concurrent writes, engineers frequently encounter corrupted files or partial data reads that break downstream analytics pipelines.
The Data Lakehouse architecture attempts to solve this dichotomy by implementing a transactional metadata layer directly on top of open-source file formats like Parquet or Avro. This design allows developers to treat a collection of flat files in a cloud bucket as a governed relational table. By separating the storage of data from the management of state, organizations can achieve warehouse-level reliability while maintaining the agility of a data lake.
The primary goal of a Lakehouse is not just to store data cheaply, but to provide a consistent view of that data across diverse compute engines without the need for proprietary lock-in.
The Need for ACID on Object Storage
Object storage does not support atomic multi-file updates out of the box: writing a single object may be atomic, but there is no way to commit a group of files as one unit. If a Spark job fails halfway through writing a partition, the data lake is left in an inconsistent state with orphaned files that must be manually cleaned. This lack of atomicity makes it nearly impossible to run reliable ETL processes that require high-frequency updates or deletes.
A Lakehouse metadata layer solves this by using a write-ahead log or a system of manifest files to track which data files are part of a valid table snapshot. When a write operation begins, the system writes the new data files to storage but does not make them visible to readers until the transaction is committed in the metadata. This ensures that readers always see a consistent version of the data, even while heavy write operations are occurring in the background.
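The commit mechanism described above can be illustrated with a minimal pure-Python sketch. It uses a local directory in place of an object store and a single `manifest.json` as the snapshot pointer; all file and function names here are illustrative, not part of any real format's API.

```python
import json
import os
import tempfile

def commit_snapshot(table_dir: str, new_data_files: list[str]) -> None:
    """Atomically publish a new table snapshot by swapping a manifest file."""
    manifest_path = os.path.join(table_dir, "manifest.json")
    # Read the current snapshot (an empty table if no manifest exists yet)
    current = []
    if os.path.exists(manifest_path):
        with open(manifest_path) as f:
            current = json.load(f)["files"]
    # Write the new manifest to a temporary file first...
    fd, tmp_path = tempfile.mkstemp(dir=table_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"files": current + new_data_files}, f)
    # ...then atomically rename it into place. Readers see either the old
    # snapshot or the new one, never a half-written state.
    os.replace(tmp_path, manifest_path)

def read_snapshot(table_dir: str) -> list[str]:
    """Return the data files belonging to the current committed snapshot."""
    with open(os.path.join(table_dir, "manifest.json")) as f:
        return json.load(f)["files"]
```

Real object stores do not offer an atomic rename, so the production formats delegate the pointer swap to a catalog service or a conditional-put operation, but the principle is the same: data files become visible only when the metadata commit lands.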
Architectural Deep Dive: The Three Major Contenders
Three major open-source projects currently dominate the Lakehouse landscape: Delta Lake, Apache Iceberg, and Apache Hudi. While all three provide ACID transactions and time travel capabilities, they differ significantly in their internal implementation and target use cases. Choosing the right format depends heavily on your specific workload patterns, such as whether you prioritize fast streaming upserts or high-performance analytical queries.
Delta Lake was originally developed by Databricks and relies on a JSON-based transaction log to track changes to the table state. It is highly optimized for Spark environments and excels at handling large-scale batch processing and structured streaming. Delta focuses on simplicity by maintaining a linear history of commits, which makes features like time travel and audit logging straightforward for developers to implement.
Apache Iceberg, which originated at Netflix, takes a different approach by focusing on snapshot management through a hierarchy of manifest files. Unlike Delta, Iceberg does not rely on a centralized log but instead uses a tree structure to map files to specific table versions. This design allows for advanced features like hidden partitioning, where the engine automatically handles partition evolution without requiring users to rewrite their queries.
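The idea behind hidden partitioning can be sketched in a few lines: the engine derives the partition value from a source column with a transform (such as Iceberg's `days(ts)`), so queries filter on the column itself and never reference the partition layout. The helper names below are hypothetical stand-ins for engine internals.

```python
from datetime import datetime

def day_transform(ts: datetime) -> str:
    """A partition transform in the spirit of Iceberg's days(ts):
    the partition value is derived from the data, never stored by the user."""
    return ts.strftime("%Y-%m-%d")

def route_row(row: dict) -> str:
    # The engine computes the physical location from the source column,
    # so users query on `ts` directly and the layout can evolve freely.
    return f"data/day={day_transform(row['ts'])}/"
```

Because the mapping from column to partition lives in table metadata rather than in user queries, the table can later switch to, say, an hourly transform without breaking a single existing query.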
Apache Hudi, born at Uber, was specifically designed to handle the challenges of incremental processing and high-frequency upserts. It introduces two storage types: Copy on Write and Merge on Read. Merge on Read allows for extremely fast writes by appending changes to delta logs and merging them during read time, which is ideal for real-time streaming applications that need to reflect changes in seconds.
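The write/read trade-off between the two storage types can be modeled with a toy in-memory sketch: Copy on Write rewrites its base on every update, while Merge on Read appends cheap change batches and reconciles them at query time. The classes below are illustrative models, not Hudi APIs.

```python
class CopyOnWriteTable:
    """Copy on Write: every upsert rewrites the base. Slow writes, fast reads."""
    def __init__(self):
        self.base = {}  # key -> row value

    def upsert(self, rows: dict) -> None:
        merged = dict(self.base)
        merged.update(rows)
        self.base = merged  # the whole base is rewritten on each write

    def read(self) -> dict:
        return self.base

class MergeOnReadTable:
    """Merge on Read: upserts append to a log. Fast writes, pricier reads."""
    def __init__(self):
        self.base = {}
        self.delta_log = []  # append-only list of change batches

    def upsert(self, rows: dict) -> None:
        self.delta_log.append(rows)  # appending is cheap

    def read(self) -> dict:
        merged = dict(self.base)
        for batch in self.delta_log:
            merged.update(batch)  # reconcile changes at query time
        return merged

    def compact(self) -> None:
        # Fold the log into the base, like Hudi's background compaction.
        self.base = self.read()
        self.delta_log = []
```

Both tables always return the same logical contents; they differ only in where the merge cost is paid, which is exactly the knob Hudi exposes to streaming pipelines.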
Delta Lake and the Transaction Log
The heart of Delta Lake is the Delta Log, an ordered record of every transaction performed on the table since its creation. When a user queries a Delta table, the engine first checks the log to determine which Parquet files are currently valid and ignores any files that have been marked as deleted or superseded. This mechanism allows Delta to provide snapshot isolation, ensuring that a long-running query is not affected by concurrent updates.
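The replay logic a reader performs can be sketched in plain Python. The two-field action records below are a deliberate simplification of Delta's actual JSON log entries.

```python
def current_files(log: list[dict]) -> set[str]:
    """Replay an ordered transaction log to find the live data files.

    Each action is a simplified stand-in for a Delta log entry:
    {"op": "add", "path": ...} or {"op": "remove", "path": ...}.
    """
    live = set()
    for action in log:
        if action["op"] == "add":
            live.add(action["path"])
        elif action["op"] == "remove":
            # The file still exists in storage but is no longer part of the
            # current snapshot (it may be reclaimed by VACUUM later).
            live.discard(action["path"])
    return live
```

Truncating the replay at an earlier log position yields an earlier snapshot, which is all that time travel fundamentally is.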
```python
from delta.tables import DeltaTable

# Reference the existing table in S3 storage
target_table = DeltaTable.forPath(spark, "s3://analytics-bucket/user_profiles")

# New data arriving from a streaming source
new_data_df = spark.read.parquet("s3://staging-bucket/daily_updates")

# Perform an upsert (merge) operation based on user_id
target_table.alias("target").merge(
    new_data_df.alias("updates"),
    "target.user_id = updates.user_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()
```

In the code above, the merge operation is atomic, meaning that if the job fails, no changes are committed to the target table. This pattern replaces the brittle overwrite-and-replace logic used in traditional data lakes. Developers can also use the history function to inspect previous versions of the table, which is invaluable for debugging data quality issues or rolling back accidental deletes.
Handling Concurrency and Conflicts
As multiple teams start interacting with the same Lakehouse, concurrency control becomes a critical bottleneck. Most Lakehouse formats use Optimistic Concurrency Control, which assumes that multiple transactions can complete without interfering with each other. If two writers attempt to modify the same data simultaneously, the system checks for conflicts and allows the first one to succeed while forcing the second to retry.
However, conflict resolution strategies differ between the formats depending on the granularity of their metadata. Delta Lake and Iceberg generally handle conflicts at the file level, meaning two updates can succeed as long as they touch different files. Hudi offers more advanced row-level concurrency control options, which are necessary for high-contention environments where multiple processes might update different rows within the same physical file.
- Optimistic Concurrency Control (OCC): Assumes conflicts are rare and validates at commit time.
- Multi-Version Concurrency Control (MVCC): Maintains multiple versions of data to allow simultaneous reads and writes.
- Write Conflicts: Occur when two processes attempt to modify the same partition or file concurrently.
- Schema Evolution: The ability to add, drop, or rename columns without breaking existing data readers.
Managing these conflicts effectively requires developers to understand their data arrival patterns. For example, if your pipeline involves late-arriving data that frequently updates historical partitions, you should choose a format that supports fine-grained conflict detection. Failing to tune these settings can lead to high retry rates and significantly increased processing costs in high-volume environments.
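File-level optimistic concurrency can be sketched as follows: each writer records the snapshot version it read, and its commit is rejected if any later commit touched an overlapping set of files. The class below is a toy model under those assumptions, not any format's real implementation.

```python
class OptimisticTable:
    """Toy file-level Optimistic Concurrency Control (OCC).

    A writer notes the snapshot version it read; at commit time the commit
    is rejected if a concurrent commit touched any of the same files,
    forcing the caller to re-read the table and retry.
    """
    def __init__(self):
        self.version = 0
        self.history = []  # list of (version, files_touched)

    def try_commit(self, read_version: int, files_touched: set) -> bool:
        # Find commits that landed after this writer took its snapshot.
        concurrent = [files for v, files in self.history if v > read_version]
        if any(files_touched & files for files in concurrent):
            return False  # conflict: caller must refresh and retry
        self.version += 1
        self.history.append((self.version, files_touched))
        return True
```

Notice that two writers touching disjoint files both succeed without coordination; the contention cost only appears when their file sets overlap, which is why write patterns matter so much when choosing a format.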
Schema Evolution and Safety
Schema drift is a common cause of pipeline failures in production environments. Lakehouse formats provide schema enforcement, ensuring that any data being written to a table matches the predefined structure. If a new field is added to the source data that does not exist in the target table, the write operation will fail by default, preventing the accidental ingestion of corrupt data.
Schema evolution allows developers to gracefully update the table structure when requirements change. With a simple command, you can add new columns or change data types, and the metadata layer will handle the translation for all future reads. This approach is far more robust than the manual ALTER TABLE commands used in traditional databases, as it maintains a clear audit trail of how the schema has changed over time.
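The enforcement-versus-evolution distinction can be sketched with a simple validator. The schema here is a plain mapping of column name to Python type, a stand-in for the much richer type systems the real formats use; the function name is illustrative.

```python
def validate_batch(table_schema: dict, batch: list,
                   allow_evolution: bool = False) -> dict:
    """Enforce a schema on incoming rows, optionally evolving it.

    Returns the (possibly extended) schema; raises on rejected writes.
    """
    new_schema = dict(table_schema)
    for row in batch:
        for col, value in row.items():
            if col not in new_schema:
                if not allow_evolution:
                    # Schema enforcement: unknown columns fail the write
                    raise ValueError(f"unknown column {col!r}: write rejected")
                # Schema evolution: admit the new column for future reads too
                new_schema[col] = type(value)
            elif not isinstance(value, new_schema[col]):
                raise TypeError(f"column {col!r} expects "
                                f"{new_schema[col].__name__}")
    return new_schema
```

The key property is that evolution is an explicit, recorded decision: the returned schema becomes part of the table's metadata history rather than a silent side effect of one writer's data.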
Operational Challenges and Performance Tuning
While Lakehouses simplify many aspects of data management, they introduce new operational challenges that engineers must address. The most notorious of these is the small file problem, where frequent incremental updates create thousands of tiny Parquet files. This fragmentation leads to massive metadata overhead and slow query performance because the storage engine must perform thousands of individual network requests to read the data.
To combat this, all Lakehouse formats include a compaction or optimization process. This background task takes many small files and merges them into larger, more efficient files, typically around 128MB to 1GB in size. Compaction must be scheduled carefully to avoid interfering with production workloads, as it involves significant CPU and I/O resources to rewrite the data.
```sql
-- Consolidate small files into larger chunks to improve read speed
OPTIMIZE user_activity_logs
WHERE event_date >= '2023-10-01';

-- Organize data by frequently filtered columns for faster skipping
OPTIMIZE user_activity_logs
ZORDER BY (user_id, session_id);

-- Remove old files that are no longer referenced by the transaction log
VACUUM user_activity_logs RETAIN 168 HOURS;
```

Z-Ordering is another powerful optimization technique shown in the code above. It rearranges the data within the files so that related information is stored together physically. This allows the query engine to skip entire files that do not contain the relevant data points, drastically reducing the amount of data scanned and improving query latency for large datasets.
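The file-skipping mechanism that clustering enables can be sketched directly: the metadata stores per-file min/max statistics for each column, and the planner prunes any file whose range cannot contain the filter value. The statistics layout below is a simplified, hypothetical version of what the real formats record.

```python
def prune_files(file_stats: list, column: str, value: int) -> list:
    """Return only the files whose [min, max] range can contain value.

    Each entry looks like {"path": ..., "min": {...}, "max": {...}},
    a simplified stand-in for the per-file statistics that Lakehouse
    formats keep in their metadata layer.
    """
    survivors = []
    for f in file_stats:
        if f["min"][column] <= value <= f["max"][column]:
            survivors.append(f["path"])
        # Otherwise the file is skipped without a single byte being read.
    return survivors
```

Z-Ordering pays off precisely here: by clustering related rows together it narrows each file's min/max ranges on several columns at once, so far more files fall outside the filter and get pruned.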
Storage Maintenance and Vacuuming
Because Lakehouse formats use MVCC to support time travel, they do not immediately delete old files when data is updated or removed. Instead, the metadata simply stops pointing to those files. Over time, these unreferenced files can accumulate and lead to significant storage costs if they are not cleaned up regularly.
The vacuuming process is used to permanently delete files that are older than a specific retention period. Developers must strike a balance between maintaining enough history for debugging and keeping storage costs under control. Once a table is vacuumed, the ability to time travel back to versions older than the retention threshold is lost forever, so this operation should be handled with caution.
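The retention logic can be sketched as a simple filter over the storage listing: a file is deleted only if the current metadata no longer references it and it has aged past the retention window. The function and its arguments are illustrative, not any format's real API.

```python
import time

def vacuum(all_files: dict, live_files: set, retention_hours: float) -> list:
    """Return the unreferenced files older than the retention window.

    all_files maps path -> last-modified timestamp (epoch seconds);
    live_files is the set of paths the current metadata still references.
    """
    cutoff = time.time() - retention_hours * 3600
    deleted = []
    for path, mtime in all_files.items():
        if path not in live_files and mtime < cutoff:
            # Deleting this file irreversibly removes the ability to time
            # travel to any snapshot that referenced it.
            deleted.append(path)
    return deleted
```

Both conditions matter: dropping the liveness check corrupts the current table, while dropping the age check breaks readers still holding an older snapshot, which is why vacuum retention periods are usually kept at least as long as the longest-running query.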
Conclusion: Choosing Your Storage Layer
Selecting the right transactional storage layer is a foundational decision that will impact your data platform for years. Delta Lake is often the best choice for organizations already invested in the Databricks ecosystem or those who want the most mature and performance-optimized experience with Spark. Its tightly integrated features and commercial support make it a safe bet for mission-critical enterprise workloads.
Apache Iceberg is the ideal choice for companies that prioritize engine interoperability and want to avoid vendor lock-in. Its design is cleaner and more modular than Delta's, making it easier to use with a variety of engines like Trino, Flink, and Presto. Iceberg’s handling of partition evolution is particularly valuable for teams managing rapidly changing schemas in massive analytical datasets.
Apache Hudi remains the premier choice for low-latency streaming and complex incremental processing. If your primary goal is to provide a real-time view of a transactional database with sub-minute latency, Hudi’s Merge-on-Read capabilities and built-in indexing offer a significant advantage. It is, however, the most complex of the three to configure and maintain properly due to its wide range of tuning parameters.
Ultimately, the Data Lakehouse represents a significant leap forward in making cloud data platforms more reliable and accessible. By providing ACID guarantees on top of open storage, it allows software engineers to apply the same rigorous engineering standards to data pipelines that they have long applied to application development. The choice between formats should be driven by your specific performance needs, existing toolchain, and the long-term flexibility required by your business.
Decision Framework for Engineers
When evaluating these technologies, start by running a benchmark with your actual production data and query patterns. Observe how each format handles your specific join types, filter conditions, and write volumes. Often, the theoretical advantages of one format are outweighed by how well another integrates with your existing monitoring and deployment infrastructure.
Remember that the Lakehouse market is moving incredibly fast, and features that are missing today may be released in a few months. Focus on the core architectural principles that align with your team's expertise. Regardless of the format you choose, the shift toward a unified storage layer will invariably reduce architectural complexity and help your team deliver data-driven insights more reliably.
