Onehouse, the Apache Hudi-backer that bills itself as the most open data platform in the world, further opened up its platform today with the launch of a data catalog synchronization feature that streamlines user access to data residing in major cloud platforms. The feature complements the company’s investment in developing XTable, an open-source offering that delivers read-write interoperability among Hudi, Delta, and Apache Iceberg table formats.
The advent of open table formats like Hudi, Delta, and Iceberg revolutionized data openness by enabling multiple query engines access the same piece of data without fear of data corruption. As the key technological underpinning to data lakehouses, open table formats have enabled organizations to get the benefits of traditional data warehouses (data integrity, correctness) without giving up the benefits of modern data lakes (scalability, flexibility).
So it’s somewhat ironic that a battle has erupted over the table formats in the big data ecosystem, with some vendors and customers standardizing on Iceberg while others back Delta. Hudi, which Onehouse CEO Vinoth Chandar lead the development of while working at Uber nearly 10 years ago, has been relegated to third place in the horse race.
If you’re in the Databricks ecosystem, you’ll be using Delta. If you’re in the Snowflake ecosystem, you’ll be using Iceberg. You can forget about using query engines, data science notebooks, or even stream processing engines from certain vendors if the table formats are incompatible.
A technology designed to open data instead has turned into yet another way for vendors to lock customers in and keep competitors out. That’s why Onehouse developed XTable (formerly Onetable): to regain the openness and freedom to choose the query engine of your choice that was the original idea behind table formats.
“XTable basically solve this burning need in the industry right now where you have a writer in one of the table formats and your reader has an affinity… to another thing,” Chandar says. “Users are forced into migrations. That to us defeats the purpose of having open data formats, and this nice decoupling between the compute engines and open data.”
The technology, which Onehouse donated to the Apache Software Foundation (where it is currently incubating), delivers out-of-the-box read-write compatibility among Hudi, Iceberg, and Delta.
“We built the world’s first lakehouse before it was called a lakehouse in 2016 at Uber,” Chandar tells Datanami. “One copy of data can be accessed from Hive, Spark, Presto, and Flink for stream processing, ETL, interactive query and data science notebooks. This format war has kind of taken way that very essence of the power that these things unlock, so that was basically why at the end we decided to build XTable.”
Google and Microsoft are among the vendors backing XTable. For instance, Google may want to enable Iceberg tables written by BigQuery to be queried as either Delta or Hudi tables for Spark via Dataproc, Chandar says, while maybe Microsoft wants to enable Delta tables to be read by Hudi or Iceberg.
“We’re trying to really foster, from a first-principle way, some open standards in there,” he says. “These are really important interoperability capabilities to have for customers out there, so that they don’t feel locked into one thing. Options are always good. It fosters loyalty, healthier competition, and a more vibrant ecosystem.”
Anybody can adopt XTable, and some companies are already incorporating it into their data pipelines, Chandar says. It’s also available for customers of Onehouse, which runs a managed data lakehouse on AWS and Google Cloud. In Onehouse, customer data is stored as Parquet files in S3 and Google’s object store, along with a tiny bit of Hudi metadata that gives it that all-important transactionality.
While delivering “omnidirectional interoperability” among Hudi, Iceberg, and Delta will foster openness among users, it doesn’t do any good if the customers can’t find the data. Data catalogs are emerging as critical pieces of tech for linking users to the data they seek. The problem is that every cloud data platform has its own data catalog. And—surprise, surprise—the cloud platform catalogs have limited visibility into data that it doesn’t control.
That’s why Onehouse today launched a new data catalog synchronization feature to Databricks, Snowflake, and Google platforms, to go along with pre-existing support for the Hive Metasore, AWS’s Glue Data Catalog, and Onehouse’s Onetable Catalog.
“What this means is you can have a single copy of data in Onehouse and with a click of a button, we make tables appear inside Snowflake, Unity and BigLake catalogs,” Chandar says. “We are essentially creating pointers, if you will, from these different catalogs and maintaining those references to the actual data stored in the warehouse.”
In addition to showing users what tables are accessible and where they reside, the data catalog synch feature also extends Onehouse’s data governance capabilities into the supported catalogs. Customers can define their data access policies in Onehouse, and they will be enforced when customers try to access data residing in other platforms, Chandar says.
Since it’s all open source, Onehouse customers can pack up and leave if they no longer feel they’re getting value from Onehouse’s data services. “We maintain that principle of giving the customer choice to even not be locked into us,” Chandar says. “They can go use open source Hudi if they want to by themselves and build the same architecture.”
Chandar says he’s pleased that the industry in general is pushing towards more openness. Customers are demanding open formats to reduce lock-in, and vendors are giving them what they want via open table formats, which is a positive direction.
Related Items:
Open Table Formats Square Off in Lakehouse Data Smackdown
Onehouse Emerges from Stealth to Deliver Data Lakes in ‘Months, Not Years’
Snowflake, AWS Warm Up to Apache Iceberg
Source link
lol