In announcing its Polaris Catalog last month, Snowflake Inc. Executive Vice President Christian Kleinerman said the product “extends Snowflake’s commitment to Apache Iceberg as the open standard of choice.”
Statements like that raise the hackles of Apache Hudi adherents. They maintain that their preferred open table format — whose name stands for Hadoop Upserts, Deletes and Incrementals — is superior to Iceberg and the open-source Delta Lake framework developed by Databricks Inc. But Hudi appears to be sliding toward “also-ran” status in the data platform race as Iceberg surges.
Developed at Uber Technologies Inc. in 2016 and later released to open source, Hudi has been adopted, mostly in niche applications, by numerous big-brand companies, including Walmart Inc., General Electric Co. Aviation, Walt Disney Co. and Amazon.com Inc.’s transportation service. In recent months, however, Iceberg has been ascendant, gaining endorsements from cloud analytics giants such as Snowflake and Databricks while increasingly being cited as a standard.
Lakehouse effect
All this matters because of the growing popularity of data lakehouses, architectures that combine the flexibility of data lakes with the performance of data warehouses. Lakehouses can accommodate a wider variety of data than warehouses, including both structured and unstructured data types. They use low-cost, flexible storage and run on commodity hardware, making them a more cost-effective alternative to data warehouses.
A table format is critical to a data lakehouse architecture. It enforces data consistency, enhances query performance through indexing, stores data in the columnar format preferred for analytical queries, and ensures reliable and consistent transactions. The faster a de facto table standard emerges, the faster the data lakehouse market will grow.
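To make what that layer does concrete, here is a minimal sketch of creating and querying an Iceberg table from PySpark. It assumes a Spark session launched with the iceberg-spark-runtime package; the "demo" catalog, warehouse path and table name are illustrative rather than tied to any of the products discussed here.

```python
# Minimal sketch: creating and querying an Iceberg table from PySpark.
# Assumes the iceberg-spark-runtime package is on the classpath; the "demo"
# catalog, warehouse path and table name are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "/tmp/iceberg-warehouse")
    .getOrCreate()
)

# The table format supplies ACID commits, schema enforcement and snapshot
# metadata; the data itself lands as Parquet files under the warehouse path.
spark.sql("CREATE NAMESPACE IF NOT EXISTS demo.db")
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events
        (id BIGINT, ts TIMESTAMP, payload STRING)
    USING iceberg
""")
spark.sql("INSERT INTO demo.db.events VALUES (1, current_timestamp(), 'hello')")
spark.sql("SELECT * FROM demo.db.events").show()
```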
And there’s plenty of growth expected. Dremio Corp.’s 2024 State of the Data Lakehouse survey found that 70% of enterprises expect more than half of their analytics to run on a data lakehouse within three years, and that 42% have already moved from a cloud data warehouse to a data lakehouse for cost-efficiency and ease-of-use reasons.
Starburst Data Inc., which sells a commercial version of the open-source Trino distributed query engine, supports Iceberg, Delta Lake and Hudi, “but when we’re asked to make a recommendation, we say go with Iceberg because we believe that’s the de facto choice,” said Chief Executive Officer Justin Borgman.
Star endorsements
Starburst cast its own vote for Iceberg in April when it announced a fully managed data lakehouse based on that platform. The biggest Iceberg endorsements by far came last month with Snowflake’s Polaris Catalog and Databricks’ blockbuster acquisition of Tabular Technologies Inc., whose founders built Iceberg while working at Netflix Inc. The purchase price of more than $1 billion for a startup with little revenue indicated just how badly Databricks wants to own the format standard.
“They did that because I think Databricks saw Iceberg’s momentum,” Borgman said. “Databricks is still very committed to Delta Lake, but the moment that the competing format says we support both, they’ve just endorsed Iceberg, whether intentionally or not.”
Databricks co-founder and Chief Technologist Matei Zaharia said the acquisition should be seen less as an endorsement of Iceberg and more as a step toward consolidation. “Our hope is to make these formats converge, so hopefully, in a few years, you don’t care about the format anymore,” he said. Either way, Hudi was left out.
George Gilbert, data and artificial intelligence analyst at SiliconANGLE sister firm theCUBE Research, said the omission of Hudi from Snowflake’s and Databricks’ road maps isn’t good news for that community.
“It’s going to be very difficult for a query engine to support both Iceberg and Delta Lake,” he said. “You build your engine with a certain assumption about how data is stored. Getting first-class support for Hudi is going to be difficult.”
Nontrivial migration
“It’s a nontrivial task to do an Iceberg migration,” Starburst Chief Technology Officer Dain Sundstrom said in an interview on theCUBE, SiliconANGLE’s streaming media platform.
Table formats such as Delta Lake, Iceberg and Hudi can handle large amounts of data and work well with popular analytics tools such as Apache Spark, Apache Hive and Presto/Trino. All three use the Parquet columnar storage file format, which is optimized for data processing frameworks such as Apache Hadoop and Apache Spark.
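Because the data files are ordinary Parquet, they can be read by any Parquet-aware tool even without the table format’s metadata. A quick, illustrative sketch follows; the path is hypothetical, and bypassing the metadata loses snapshots and delete handling, so it only shows the shared storage layer.

```python
# Sketch: under any of the three formats, the data files themselves are plain
# Parquet; the table format adds a metadata layer on top. The path below is
# hypothetical, and reading files directly bypasses that metadata (snapshots,
# delete files), so this only illustrates the shared storage layer.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-peek").getOrCreate()
spark.read.parquet("/tmp/iceberg-warehouse/db/events/data/").printSchema()
```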
Delta Lake was an early leader, but Iceberg is on track to eclipse it. The Dremio study found that 39% of respondents currently use Delta Lake, with another 23% expecting to add support within the next two years. Apache Iceberg stood at 31% adoption, but a larger share, 29%, expects to add it over the next three years. Hudi was a distant third at 12.5% adoption.
What frustrates Hudi advocates is that they believe their table format is better than the alternatives. Hudi is functionally equivalent to its more popular siblings but is considered to be better at handling inserts and deletes, supports highly efficient changed-data processing, and stores multiple versions of data to enable users to query a specific point in time.
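For a sense of what those capabilities look like in practice, here is a minimal PySpark sketch of a Hudi upsert followed by a point-in-time read. It assumes the hudi-spark bundle is on the classpath and that a table already exists at the path shown; the table name, columns, keys and commit timestamp are all illustrative.

```python
# Minimal sketch of a Hudi upsert keyed on a record key, followed by a
# point-in-time ("time travel") read. Assumes the hudi-spark bundle is on the
# classpath and a table already exists at the (hypothetical) path; columns,
# keys and the commit timestamp are illustrative.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

path = "/tmp/hudi/rides"
hudi_opts = {
    "hoodie.table.name": "rides",
    "hoodie.datasource.write.recordkey.field": "ride_id",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.partitionpath.field": "city",
    "hoodie.datasource.write.operation": "upsert",
}

# Upsert a batch of changed records; existing keys are updated, new keys inserted.
updates = spark.createDataFrame(
    [(1, "2024-06-01 10:00:00", "sf", "completed")],
    ["ride_id", "ts", "city", "status"],
)
updates.write.format("hudi").options(**hudi_opts).mode("append").save(path)

# Query the table as of an earlier commit instant (placeholder timestamp).
as_of = spark.read.format("hudi").option("as.of.instant", "20240601000000").load(path)
as_of.show()
```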
Hudi is especially popular in real-time scenarios, a function of its roots at Uber, which processes millions of live data streams from its fleets of drivers around the world.
Shines in real-time
“Incremental data workloads, where you have some changes you’re extracting from a Kafka data stream to incrementally process and write into a downstream table, is where Hudi shines,” said Vinoth Chandar, the co-creator of Hudi and chief executive officer of Onehouse Inc., which makes an open data lakehouse platform. “It can index records at a very large scale. It lets you manage a table without blocking writes. It’s also the only storage format that supports incremental changes, so you can tell exactly what records have changed at a point in time.”
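The incremental pull Chandar describes can be sketched in a few lines of PySpark. The table path and begin instant below are placeholders; the query returns only records committed after that instant.

```python
# Sketch of an incremental pull: read only the records that changed after a
# given Hudi commit. Assumes a Hudi table already exists at the (hypothetical)
# path; the begin instant is a placeholder commit timestamp.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental").getOrCreate()

changed = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "20240601000000")
    .load("/tmp/hudi/rides")
)
changed.show()  # only rows committed after the begin instant
```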
Uber couldn’t afford to wait for the slow table recomputation processes needed to accommodate new data, so Hudi was built to constantly evolve the schema, or the blueprint for how data is stored in a database, without manual intervention or halting processing.
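A rough sketch of that behavior, continuing the hypothetical rides table from the earlier example: a batch that carries a new column is written as usual, and the column shows up in the table schema without manual intervention. How strictly this works depends on the Hudi version and its schema-reconciliation settings.

```python
# Sketch: a batch carrying a new "rating" column is upserted into the same
# (hypothetical) Hudi table, and the column is added to the table schema
# without manual DDL or halting the pipeline. Exact behavior depends on the
# Hudi version and its schema-reconciliation settings.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-schema-evolution")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

batch = spark.createDataFrame(
    [(2, "2024-06-01 11:00:00", "sf", "completed", 4.9)],
    ["ride_id", "ts", "city", "status", "rating"],  # "rating" is new
)
(batch.write.format("hudi")
    .option("hoodie.table.name", "rides")
    .option("hoodie.datasource.write.recordkey.field", "ride_id")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.partitionpath.field", "city")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/tmp/hudi/rides"))
```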
On paper, Hudi and Iceberg look almost identical. Starburst recently published a comparison of the two, including a table showing near feature-for-feature parity. Iceberg gets the nod for query-intensive uses like data analytics, whereas Hudi is considered superior for processing transactions, Gilbert said.
“Hudi was designed for near-real-time data ingestion, whereas the other ones aren’t quite as good at that,” he said. “They’re more optimized for query performance, for reads than writes.”
He noted that this is a natural advantage for Iceberg because data lakes are used more for queries than transactions.
Hudi’s current third-place status “has mainly to do with its architecture that, on first look, isn’t as intuitive as the others,” said Alex Merced, senior tech evangelist at Dremio. “But the big thing Iceberg has over Hudi and Delta is its ecosystem, not just of tools that read or write to the format but solutions for overall lakehouse management.”
That includes seamless integration with open-source analytical frameworks like Spark, Trino, PrestoDB, Flink and Hive, as well as support from a growing number of metadata management and governance tools.
Ecosystem advantage
Building an ecosystem was one of the original goals, said Ryan Blue, a member of the technical staff at Databricks and Iceberg’s creator when he was at Netflix. “It’s a good format technically, but it’s also one that everyone trusts and will use,” he said in an interview on theCUBE. “I think that is probably the biggest reason to use it.”
Hudi is no slouch when it comes to winning the affection of developers, Chandar wrote in a detailed explanation of Hudi’s strengths on the Onehouse blog. He noted that Hudi logged more than 25,000 GitHub interactions over the past 12 months, has contributors from 50 companies, and has a higher average “star” – or favorability – rating than Iceberg, according to Redpoint Ventures LLC’s OSS Index and Dashboard.
Developer approval and market share don’t always correlate, however. “I don’t think any one of these is grossly outperforming the others. I think the real issue comes down to adoption,” said Starburst’s Borgman. “Iceberg clearly seems to be like the people’s choice.”
The XTable project, currently incubating at the Apache Software Foundation, is a possible solution to the compatibility issues. “It basically aims to make all the tables look the same from the interface API level,” Borgman said. However, integration layers often bring a performance penalty, and XTable is a relatively new project that doesn’t yet support some table types or synchronized timestamps.
“Projects like Apache XTable make using multiple formats easier, but I would still say the momentum is for Iceberg to be the default format for most use cases with Hudi being used for streaming ingestion,” said Dremio’s Merced.
Gilbert was blunter. Hudi, he said, “will be a niche.”