All Eyes on Databricks as Data + AI Summit Kicks Off

The streets of San Francisco are adorned with Databricks red for Data + AI Summit (Image courtesy Databricks)

This week, it’s Databricks’ turn to welcome thousands of users, vendors, and members of the data community to San Francisco for its annual Data + AI Summit. Coming off the earth-shattering news last week around Apache Iceberg, the anticipation is building for Databricks to make more news in big data, advanced analytics, and AI.

Over the next three days, Databricks will offer more than 500 sessions at the Data + AI Summit, which is taking place at the Moscone Center in downtown San Francisco. The event comes just a week after Databricks’ rival Snowflake hosted its own conference at the famous convention center, thereby completing the industry’s first “Snowbricks” event series (which certainly sounds better than “Dataflake”).

The big data community is still reeling from last week’s news, which saw the industry coalesce around Apache Iceberg as the de facto standard for open table formats. First, Snowflake unveiled Polaris, a metadata catalog for Iceberg data, then Databricks announced the acquisition of Tabular, the company formed by Iceberg’s creators.

Databricks co-founder and CTO Matei Zaharia at Data + AI Summit June 28, 2023

While Databricks executives aren’t conceding that their own open table format, Delta, has lost the table format war, the company’s decision to spend between $1 billion and $2 billion on Tabular represents a significant investment in Iceberg and indicates that it doesn’t want the table format to be an issue for its customers.

“It’s not going to matter [which one they choose]. We want them to work together, to make the best of both, and allow customers to choose what’s right for you,” Joel Minnick, Databricks vice president of marketing, told Datanami last week. “[We want] you to choose what data format you want to store it in, but not have that be a limiting factor on what you’re able to go do with that data.”

It’s unclear at this point what will become of Delta, which Databricks launched in October 2017 as the linchpin of its lakehouse architecture that combines the scalability and flexibility of Hadoop-style data lakes with the transactionality and accuracy of traditional analytics databases (i.e. data warehouses). Minnick indicated that Databricks will continue making investments in both Delta and Iceberg for the time being.

“What we’re looking at in the short term [is] how do we make this work together,” Minnick continued. “And the Delta Lake UniForm file format that was out there, that we announced last year, is something that we’re going to work together even more now, on how do we help these formats talk together. But it is very much about keeping the community of both of these projects alive…For now we have no plans to do anything different than keep working with the communities.”

Now that the industry has essentially decided that Iceberg is the de facto standard for table formats, the attention shifts to the metadata catalogs, which sit between the query engines and the data. Because they’re another potential pinch point that can create data silos, the community is concerned that the metadata catalogs could help vendors lock customers into their platforms.

That is why Snowflake committed to donating its new Polaris metadata catalog, which adheres to Iceberg’s REST-based API, to the open source community within 90 days. (Ron Ortloff, the head of Snowflake’s Iceberg and data lake strategy, confirmed to Datanami that the company is leaning toward donating Polaris to the Apache Software Foundation.)

The ball is now in Databricks’ court in terms of what it will do with Unity Catalog, the metadata catalog that it developed to work with Delta and the rest of its platform, which includes batch analytics, streaming analytics, machine learning, and generative AI capabilities. Unity Catalog is currently not open source, and there is speculation that the company may change that to address concerns over lock-in.

Wednesday is shaping up to be the big day for Databricks news. CEO Ali Ghodsi will take the stage to deliver his keynote address starting at 8:30 a.m. PT. Joining him during the keynote will be fellow Databricks co-founder and Chief Architect Reynold Xin, as well as Fei-Fei Li, a professor at Stanford University’s Human-Centered AI Institute, and Jensen Huang, the founder and CEO of Nvidia.

The keynote will be livestreamed for free on the Web.
