Today, Databricks kicked off its annual Data and AI Summit with a long-awaited move: the open sourcing of its three-year-old Unity Catalog platform, which provides customers a unified solution for their data governance needs.
Formerly a proprietary Databricks product, Unity Catalog now falls under an Apache 2.0 license, meaning other firms can take the underlying architecture and code, stand up their own catalogs and tweak them without paying Databricks a dime. The project also ships with an OpenAPI specification, a server and clients.
The move gives enterprises the flexibility to access the data and AI assets managed in the catalog without vendor lock-in. Essentially, they can use the information hosted in the catalog with their tools of choice, including a large ecosystem of Delta Lake and Apache Iceberg-compatible query engines.
This comes mere days after Snowflake, Databricks’ major competitor, made a similar move with the announcement of Polaris Catalog, its own open catalog implementation for enterprises.
However, unlike Unity Catalog, which has been immediately open-sourced (Databricks CTO Matei Zaharia published the code live on stage), Snowflake’s Polaris Catalog is set to be open-sourced over the next 90 days.
Unity Catalog OSS: Much-needed for customer control
Databricks launched Unity Catalog as a proprietary, closed-source governance solution for accessing and managing data and AI assets within its platform ecosystem.
The catalog provided users with features such as centralized data access management, auditing, data discovery, lineage tracking and secure data sharing.
However, the tight integration of the closed-source offering with the open Delta Lake table format and a few other formats restricted users’ ability to mix and match it with other technologies, such as querying with engines compatible with Apache Iceberg or Hudi — the other two mainstream open table formats.
Databricks realized the problem and started solving it last year with Delta Lake Universal Format (UniForm).
The offering, which went into general availability a couple of weeks ago, automatically generates the metadata needed for Apache Iceberg or Hudi and unifies the table formats into a single copy that can be queried from any supporting engine.
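To make this concrete, here is a minimal sketch of how a Delta table opts into UniForm by setting a table property so that Iceberg (or Hudi) metadata is generated alongside the Delta metadata. The property key `delta.universalFormat.enabledFormats` follows Databricks’ documented naming, but treat the exact SQL as an assumption and confirm against the Delta Lake/Databricks docs for your runtime; the table name is hypothetical.

```python
# Sketch (assumption-laden): build the ALTER TABLE statement that turns on
# UniForm metadata generation for a Delta table. The property key below is
# based on Databricks' documented 'delta.universalFormat.enabledFormats';
# verify it against your platform version before use.
def uniform_enable_sql(table: str, formats: list[str]) -> str:
    """Return an ALTER TABLE statement enabling UniForm for the given formats."""
    enabled = ",".join(formats)  # e.g. "iceberg" or "iceberg,hudi"
    return (
        f"ALTER TABLE {table} SET TBLPROPERTIES ("
        f"'delta.universalFormat.enabledFormats' = '{enabled}')"
    )

# Hypothetical table name, for illustration only.
print(uniform_enable_sql("sales.orders", ["iceberg"]))
```

Once the property is set, the single Delta copy of the table can be read by any engine that speaks one of the enabled formats.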
Now, by opening Unity Catalog with open APIs and an Apache 2.0 licensed open source server, the company is building on that work, giving enterprises a universal interface that supports any of the three open data formats (via UniForm) and interoperates across various query engines, tools, and cloud platforms.
“With open-sourced Unity Catalog, existing Databricks customers can leverage a large ecosystem of Delta Lake and Apache Iceberg compatible engines and many more clients — it gives them the flexibility to access their data and AI assets managed in the Unity Catalog from the tools of their choice. Existing Unity Catalog deployments implement the same open APIs, enabling external clients to read from all tables (including managed and external tables), volumes, and functions in hosted Unity Catalog from day one, with their existing access controls in place,” Joel Minnick, VP of product marketing at Databricks, told VentureBeat.
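As a rough illustration of what “external clients read over open APIs” looks like in practice, the sketch below builds (but does not send) the HTTP request a third-party tool might issue to list tables in a schema. The base URL and endpoint path are assumptions modeled on the open-source server’s local defaults; adjust both for a real deployment.

```python
import urllib.parse
import urllib.request

# Assumption: a local open-source Unity Catalog server with a REST API under
# this base path. Substitute your own host, port and path as needed.
BASE = "http://localhost:8080/api/2.1/unity-catalog"

def list_tables_request(catalog: str, schema: str) -> urllib.request.Request:
    """Build (without sending) the GET request an external client would issue
    to enumerate tables in a given catalog and schema."""
    query = urllib.parse.urlencode({"catalog_name": catalog, "schema_name": schema})
    return urllib.request.Request(f"{BASE}/tables?{query}", method="GET")

req = list_tables_request("unity", "default")
print(req.full_url)
```

The point is that any HTTP-capable client can enumerate and read catalog assets this way, with the server enforcing the existing access controls Minnick describes.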
This way, the Unity Catalog delivers interoperability with all major cloud platforms (Microsoft Azure, AWS, GCP and Salesforce), compute engines like Apache Spark, Presto, Trino, DuckDB, Daft, PuppyGraph and StarRocks as well as data and AI platforms such as dbt Labs, Confluent, Eventual, Fivetran, Granica, Immuta, Informatica, LanceDB, LangChain, Tecton and Unstructured.
In addition to different open formats and engines, the catalog supports the Iceberg REST Catalog and Hive Metastore (HMS) interface standards. Plus, it ensures unified governance across tabular and non-tabular data and AI assets, such as machine learning (ML) models and generative AI tools, letting organizations simplify management at scale.
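Support for the Iceberg REST Catalog standard means an Iceberg-capable engine can attach to Unity Catalog with ordinary catalog configuration. The sketch below shows the Spark conf entries involved: the keys and the `org.apache.iceberg.spark.SparkCatalog` class come from Apache Iceberg’s Spark integration, while the Unity Catalog URI is an assumption (a local server default) to be replaced with your own endpoint.

```python
# Sketch: Spark conf entries that register a catalog named `name` backed by an
# Iceberg REST Catalog endpoint. Keys follow Apache Iceberg's Spark docs; the
# URI passed in below is whatever your Unity Catalog server exposes.
def iceberg_rest_conf(name: str, uri: str) -> dict[str, str]:
    """Return Spark conf entries registering `name` as an Iceberg REST catalog."""
    prefix = f"spark.sql.catalog.{name}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "rest",
        f"{prefix}.uri": uri,
    }

# Hypothetical local endpoint; substitute your deployment's REST URI.
conf = iceberg_rest_conf("unity", "http://localhost:8080/api/2.1/unity-catalog/iceberg")
for key, value in conf.items():
    print(f"{key}={value}")
```

With those entries applied to a Spark session, tables governed in Unity Catalog become queryable as `unity.<schema>.<table>` from the Iceberg side.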
How is it different from Snowflake’s Polaris Catalog?
With Polaris Catalog, Snowflake has also focused on an open catalog implementation for interoperability without lock-in. However, that offering covers only data conforming to the Apache Iceberg table format. Unity Catalog OSS, by contrast, covers data in any format: Iceberg, Delta and Hudi as well as Parquet, CSV and JSON (as the proprietary version already did).
Further, Minnick said, Databricks’ offering also supports unstructured datasets (volumes) as well as AI tools and functions, letting organizations manage images, documents and other files used in generative AI applications — which is not the case with Polaris.
“Snowflake proprietary storage format Tables cannot be accessed via Polaris, whereas with Unity Catalog OSS APIs, external clients can read from all tables, volumes and functions in Databricks Unity Catalog,” Minnick added. He also noted that Polaris needs to be connected to Snowflake’s governance solution (Horizon) to get governance, while Unity Catalog OSS comes with object-level access controls right out of the box.
Globally, over 10,000 organizations, including NASDAQ, Rivian and AT&T, use Unity Catalog within the Databricks Data Intelligence Platform. It will be interesting to see how adoption changes now that the project has gone open source.
Databricks Data and AI Summit runs from June 10 to June 13, 2024.