The Open Optimism of Apache Polaris

Since it was first unveiled in June, interest in the Apache Polaris project has soared, as organizations look to the metadata catalog to help them get a handle on their big data and control access to their Apache Iceberg tables. As the project drives toward becoming a Top Level Project sometime in 2025, members of the Apache Software Foundation took some time to discuss the current state of the project with BigDATAwire, as well as where it may go in the future.

Apache Polaris, which made its big debut at Snowflake’s Data Cloud Summit 2024, is a technical metadata catalog that uses the Apache Iceberg REST specification to help broker access to Iceberg tables by the various compute engines that would consume the data. Snowflake donated Polaris to the Apache Software Foundation this summer, and it became an incubating project in August.
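To make that brokering concrete, here is a minimal sketch of how a client or compute engine might connect to a Polaris-served Iceberg REST catalog using PyIceberg. The endpoint URL, warehouse name, table name, and credentials are placeholders assumed for the example, not details from the article.

```python
# Minimal sketch: connecting a client to an Iceberg REST catalog,
# such as one served by Apache Polaris, using PyIceberg.
# The URI, warehouse, credentials, and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",                                # local name for this catalog
    type="rest",                              # use the Iceberg REST catalog client
    uri="http://localhost:8181/api/catalog",  # assumed Polaris REST endpoint
    warehouse="analytics_catalog",            # assumed catalog (warehouse) name
    credential="client-id:client-secret",     # assumed OAuth2 client credentials
)

# Any engine or tool that speaks the Iceberg REST spec can browse the same
# namespaces and tables, subject to the access controls Polaris enforces.
print(catalog.list_namespaces())
table = catalog.load_table("web.page_views")  # hypothetical namespace.table
print(table.schema())
```

Because every engine reaches the catalog through the same REST interface, swapping one query engine for another does not change how tables are tracked or secured.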

Polaris has the potential to become a Top Level Project (TLP) by the middle of 2025, says Jean-Baptiste (JB) Onofré, Dremio’s principal software engineer and a longtime member of the ASF, where he is a permanent member of the board and sits on a variety of project management committees (PMCs).

“I mentor a lot of Apache projects,” Onofré says. “I think the fastest that we could do is probably something around 10 months [from August 2024]. That’s probably the fastest we can do. More reasonably, I think a year is what we can target.”

There are various hurdles that a project has to clear before the ASF will allow an incubating project to become a TLP, including copyright checks, licensing checks, and demonstrating growth in the project’s community, he says.

“We have a release both internally to the PPMC [Podling PMC], and then we go to the IPMC [Incubator PMC] just to double check that everything is okay,” Onofré tells BDW. “By experience, the first release is always a little bit painful. We know that. So I would say that the release is the next milestone.”

When it comes to executable software, however, Polaris is good to go right now, says Snowflake Principal Software Engineer Russell Spitzer, who is a PMC member for Apache Iceberg and a PPMC member for Apache Polaris.

“I want to be clear: Polaris is ready to use right now. From a technical standpoint, ready to go,” he says. “I can’t make too many forward-looking statements, but I think managed Polaris offerings are going to be available soon.”

The open lakehouse market has already coalesced around Iceberg, which became the de facto standard table format when Databricks acquired Tabular, the company founded by Iceberg’s creators, the day after Snowflake announced Polaris in early June. That momentum behind Iceberg appears to be translating into momentum behind Polaris, Spitzer says.

“From my own individual one-on-one conversations with folks at other companies, they are thrilled,” Spitzer says. They are “way more excited about the project than they thought they were going to be. They just see it taking a lot of burden off of what they used to have to do.”

Apache Iceberg is one of three open table formats that emerged about five years ago, along with Databricks Delta Lake and Apache Hudi, to solve one of the key data management challenges facing members of the Hadoop community. Many customers used the Apache Hive Metastore (HMS) to keep track of changes made to data tables, but it left a lot to be desired. Developers were on their own to prevent data corruption issues, until the table formats got the situation under control.

“Almost everyone in the Iceberg community used to be on the basic Hive metastore integration, which is that old style of catalog …and all of those folks were looking for the next option,” Spitzer says. “I’ve got folks from all different companies who keep pinging us and are like, how do I get involved? Because I want to scrap what we were doing and I want to move to this. I want to be in the project that we’re all working on, so I don’t have to maintain my own version.”

The Iceberg and Polaris projects are closely linked by the nature of their work, and many people, including Spitzer, serve on the committees of both. That raises the question: Why are two projects even needed? But as Spitzer and Onofré explained, there is a clear separation of responsibilities between the two.

The most important difference is that it’s the Iceberg community’s responsibility to define the specification for the REST API that Polaris uses, and it’s the Polaris project’s job to expose that REST spec to the outside world. “It’s super important that we don’t deviate from the Iceberg REST specification,” Onofré says. “It’s clearly a requirement, a strong requirement.”
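To illustrate that division of labor, the sketch below issues two calls that the Iceberg REST specification defines and that any compliant server, Polaris included, must answer. It uses Python’s requests library; the base URL, catalog prefix, namespace, and bearer token are assumptions made for the sketch, not details from the spec or the article.

```python
# Minimal sketch: calling endpoints defined by the Apache Iceberg REST
# catalog specification against a server that implements it (for example,
# Apache Polaris). BASE_URL, PREFIX, the namespace, and TOKEN are placeholders.
import requests

BASE_URL = "http://localhost:8181/api/catalog"  # assumed Polaris endpoint
PREFIX = "analytics_catalog"                    # assumed catalog prefix
TOKEN = "<bearer-token>"                        # assumed OAuth2 access token
headers = {"Authorization": f"Bearer {TOKEN}"}

# Spec-defined endpoint: list the namespaces in the catalog.
namespaces = requests.get(f"{BASE_URL}/v1/{PREFIX}/namespaces", headers=headers)
print(namespaces.json())

# Spec-defined endpoint: list the tables in a namespace (here, 'web').
tables = requests.get(f"{BASE_URL}/v1/{PREFIX}/namespaces/web/tables", headers=headers)
print(tables.json())
```

The shape of these requests and responses is owned by the Iceberg project; Polaris’s job is to answer them faithfully.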

Mixing an open specification with a server-side implementation of that specification in a single project is a bad recipe, according to Spitzer. By having Iceberg set the specification and Polaris serve as a server-side implementation of it, each team can move forward without making compromises, he says.

“I think a lot of people who are involved in the Iceberg project have been burned on previous open source server-side components,” he says. “When you are on that side, as well as the format side, you end up having to make compromises sometimes between what you want to focus on and what you want actually in the spec versus out of the spec.”

That separation also gives Polaris the freedom to potentially work with other databases and become a sort of super metadata catalog that stands on its own. Down the road, the Polaris team may look at helping to manage access to data stored in things like Apache Kafka or Apache Cassandra, Spitzer says.

Looking back at the history of catalogs, each compute engine needed its own catalog, Onofré says. But each catalog worked in slightly different ways and had different requirements. With Polaris, there is an opportunity to provide a single catalog that spans today’s distributed data environment across query engines, data stores, and languages.
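As a rough illustration of that single-catalog idea, the sketch below configures Apache Spark to use the same REST catalog that the PyIceberg example above connected to, so both engines see one shared set of namespaces and tables. The Iceberg runtime version, endpoint, warehouse name, and credentials are assumptions made for the sketch.

```python
# Minimal sketch: pointing a second engine (Spark) at the same Iceberg REST
# catalog, so PyIceberg and Spark share one view of namespaces and tables.
# The package version, URI, warehouse, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("polaris-catalog-sketch")
    # Iceberg's Spark runtime; the version shown is only an example.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Spark catalog named 'polaris' backed by the Iceberg REST client.
    .config("spark.sql.catalog.polaris", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.polaris.type", "rest")
    .config("spark.sql.catalog.polaris.uri", "http://localhost:8181/api/catalog")
    .config("spark.sql.catalog.polaris.warehouse", "analytics_catalog")
    .config("spark.sql.catalog.polaris.credential", "client-id:client-secret")
    .getOrCreate()
)

# The tables PyIceberg saw earlier are now queryable from Spark SQL as well.
spark.sql("SHOW NAMESPACES IN polaris").show()
```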

“Personally, I think that it was a missing piece in the ecosystem,” he says. “We had the REST specification, which is a great improvement in Iceberg, but we didn’t have an Apache Foundation project that fully implemented this specification, so it was a kind of missing thing in the ecosystem.”

While the long-term potential of Polaris is bright, the short-term list of work items is getting longer. That’s a consequence of an interested user base that is eager to hook Polaris into its big data environments, Spitzer says.

“People are like, we need open authentication integrations, we need this kind of back-end storage,” he says. “We’re looking to get table maintenance in as quickly as we can. Just all the stuff that folks were working on. It’s been great. It’s been way more popular than I thought it would be.”

Related Items:

Databricks Nabs Iceberg-Maker Tabular to Spawn Table Uniformity

Snowflake Embraces Open Data with Polaris Catalog

Apache Iceberg: The Hub of an Emerging Data Service Ecosystem?


