How Apache Iceberg Won the Open Table Wars



Apache Iceberg has recently emerged as the de facto open-table standard for large-scale datasets, with a thriving community and support from many of the leading data infrastructure vendors. But why did Iceberg emerge as the preferred format? And what should you know before you wade in?

Iceberg is a high-performance table format that brings the reliability and simplicity of SQL tables to large-scale data analytics. Its ecosystem has grown rapidly, with robust tooling and support from engines like Apache Spark, Trino, and Apache Flink, as well as from vendors including Snowflake, Amazon, Dremio, and Confluent. Even Databricks is betting on Iceberg, having spent more than $1B on Tabular, a startup co-founded by some of the Iceberg co-creators.

To understand why an open table format has attracted so much attention lately, consider the complex reality of today’s enterprise data environments. As much as we like to talk about the elegance of modern solutions like cloud data lakes and cloud data warehouses, these technologies don’t exist in isolation. Instead, most large enterprises comprise a patchwork of incompatible data stores and data applications from multiple vendors.

How We Got Here

At one time, online transaction processing (OLTP) databases were the dominant architecture for storing and analyzing data. These gave way to data warehouses and online analytical processing (OLAP) systems, which allowed for higher-performance analytics but were costly and hard to scale. Then the data lake emerged, providing a way to pool structured and unstructured data in a single location.

A big advantage of data lakes is that they provide a single, unified pool of data in an architecture that decouples storage from compute, making them cost-efficient to scale. The widespread use of Apache Parquet, an open-source columnar storage format, reduces storage costs further with efficient data compression and encoding schemes.

That’s all well and good, but as we know, existing technologies have a habit of sticking around, which means many of these architectures exist side by side in the same enterprise. Iceberg has risen to the fore now because it provides a way to elegantly bridge these different worlds.

The fractured reality that most enterprises are living with isn’t necessarily due to bad decision-making. The past few years have seen a surge in mergers and acquisitions, which often results in different technology platforms existing in the same company. Human nature also plays a role: One team of engineers may believe passionately in Databricks while another may love Snowflake, perhaps because of a positive experience at a previous company. These quasi-religious attachments can further complicate the reality of enterprise data architectures.

Whatever the reason, these fractured environments cause data accessibility and data management problems. Data teams often want to combine data from different systems, wherever it's stored, and incompatible systems make that impractical and costly. They can copy the data sets they need into a different format to allow access, but maintaining those copies is expensive, and the copies rarely stay current for long.

Why Iceberg Emerged On Top

Iceberg isn’t necessarily technologically superior to other open table formats — everything the Iceberg working group does is in plain sight and could be copied by other projects. But Iceberg is a truly open standard that has secured the support of big companies like Confluent, Amazon, Snowflake, and Databricks. It’s not the case that Iceberg is the only format that could have attracted a critical mass of users and industry support, but it’s the one that did, and it serves its purpose very well indeed.


If your organization is using Iceberg, you can plug in any Iceberg-compatible processing engine and have it handle tasks like rewriting files in real time under the hood or compacting tables for better read performance. Iceberg gives you a clean separation between your data layer (storage, management, and optimization) and the processing engines that write, query, and update the data.

The best part about Iceberg is that it enables you to manage your data separately from your query and processing engines. It slots in as part of the “headless data architecture”, where data is made available as both a table AND stream, and you can use either (or both) for analytics, operations, and everything in between. Iceberg provides a reliable, widely adopted, and performant technology for ensuring that data is easy to write, discover, and use, regardless of your use case.
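
To make that separation concrete, here is a minimal sketch (not from the article) of pointing one engine, PySpark, at a shared Iceberg catalog. The catalog name ("lake"), REST endpoint, warehouse path, and jar version are illustrative placeholders; any other Iceberg-compatible engine, such as Trino or Flink, could be configured against the same catalog and operate on the same tables.

```python
# Minimal sketch: one engine (PySpark) plugged into a shared Iceberg catalog.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("iceberg-demo")
    # Pull in the Iceberg Spark runtime (version is illustrative).
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    # Enable Iceberg's SQL extensions and register a catalog named "lake".
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com")     # hypothetical endpoint
    .config("spark.sql.catalog.lake.warehouse", "s3://my-bucket/warehouse")  # hypothetical path
    .getOrCreate()
)

# The table's data and metadata live in object storage behind the catalog,
# independent of this particular engine.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("CREATE TABLE IF NOT EXISTS lake.sales.orders (id BIGINT, amount DOUBLE) USING iceberg")
spark.sql("INSERT INTO lake.sales.orders VALUES (1, 19.99), (2, 5.00)")
spark.sql("SELECT count(*) AS orders FROM lake.sales.orders").show()
```

Because the files and metadata sit in object storage behind the catalog rather than inside any one engine, adding or swapping engines doesn’t require moving or copying the data.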

There’s Still Work For You To Do

While Apache Iceberg has many benefits, it doesn’t provide everything out of the box. If you choose to implement the technology on your own, versus using a managed service, you will need to build some things from scratch.

  • Iceberg lacks some of the basic maintenance features that are part of commercial or managed offerings. For example, it lacks an out-of-the-box implementation for data compaction, expiring old snapshots, and other routine maintenance needs. The APIs exist and are part of Iceberg, but the jobs that invoke them need to be built, scheduled, and managed by the developer; see the sketch after this list. (Note that one of Tabular’s value propositions was providing exactly this functionality; expect to see more Iceberg services offering the same in the future.)
  • Iceberg doesn’t include a packaged way to handle security and governance, so the developer will need to integrate an access-control layer that grants the right permissions to the processing engines that read and write the tables.
  • There is not yet an agreed-upon standard for an Iceberg metadata catalog. Snowflake recently made its Polaris catalog open source, while Databricks, which acquired Tabular, has released an open source version of its own catalog. But there is still no clear de facto standard for the Iceberg catalog.
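
As a rough illustration of the first point above, here is what running that maintenance yourself might look like, using Iceberg’s built-in Spark procedures. It assumes a Spark session (spark) already configured with an Iceberg catalog named "lake", as in the earlier sketch; the table name and cutoff timestamp are placeholders, and a managed service would typically run jobs like these for you.

```python
# Routine Iceberg table maintenance via Iceberg's Spark procedures.
# In a self-managed deployment, these calls run on a schedule you own.

# Compact small data files into larger ones for better read performance.
spark.sql("CALL lake.system.rewrite_data_files(table => 'sales.orders')")

# Expire old snapshots to bound metadata growth and storage cost,
# keeping at least the ten most recent snapshots.
spark.sql("""
    CALL lake.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2024-07-01 00:00:00',
        retain_last => 10
    )
""")

# Remove files in the warehouse that no table metadata references anymore.
spark.sql("CALL lake.system.remove_orphan_files(table => 'sales.orders')")
```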

In a landscape marked by a mosaic of OLTP, OLAP, and data lake configurations, Iceberg’s promise lies in its ability to bring order to chaos, allowing data to be accessed wherever it resides without the need to create brittle, one-off connections. Despite its ease of integration and wide support, the open table format isn’t yet plug and play, but it continues to mature and provides a foundation for resilient data strategies that can pivot and scale with the needs of the business.

About the author: Adam Bellemare is a Staff Technologist in the Technology Strategy Group at Confluent. He has worked on a wide range of projects, including event-driven data mesh theory and proof of concepts, event-driven microservice strategies, and event and event stream design principles. Before Confluent, Adam worked at multiple e-commerce companies as a big data platform engineer, focusing on building batch solutions using Apache Spark, HDFS, and early S3, before turning his attention to event-driven architectures. Since then he has been largely focused on building micro (and regular) services with Apache Kafka, and evangelizing the benefits of publishing useful business facts as a general-purpose data access layer. Adam is the author of O’Reilly’s Building Event-Driven Microservices (2020) and Building an Event-Driven Data Mesh (2023).

Related Items:

What the Big Fuss Over Table Formats and Metadata Catalogs Is All About

Snowflake, AWS Warm Up to Apache Iceberg

It’s Go Time for Open Data Lakehouses


