Data catalogs and metadata catalogs share some similarities, particularly in their nearly identical names. And while they have some common functions, there are also important differences between the two entities that big data practitioners should know about.
Metadata catalogs, which are sometimes called metastores or technical data catalogs, have been in the news lately. If you’re a regular Datanami reader (and we certainly hope you are!), you would have read a lot about metadata catalogs at the Snowflake and Databricks conferences last month, when the two competitors committed to open sourcing their respective metadata catalogs, Polaris and Unity Catalog.
So what is a metadata catalog, and why does it matter? (We’re glad you asked!) Read on to learn more.
Metadata Catalogs
A metadata catalog is the place where you store the technical metadata describing the data held in tabular structures in a data lake or a lakehouse.
The most commonly used metadata catalog has been the Hive Metastore, the central repository for metadata describing the contents of Apache Hive tables. Hive, of course, was the relational framework that allowed Hadoop users to query HDFS-based data using good old SQL, as opposed to MapReduce.
Hive and the Hive Metastore are still around, but they’re in the process of being replaced by a newer generation of technology. Table formats, such as Apache Iceberg, Apache Hudi, and Databricks Delta Lake, bring many advantages over Hive tables, including support for ACID transactions, which improves data consistency.
These table formats also require a technical layer–the metadata catalog–to help users know what data exists in the tables and to grant or deny access to that data. Databricks supports this function in its Unity Catalog. For Iceberg, products such as Project Nessie, which was developed by engineers at Dremio, sought to be the “transactional catalog” brokering data access to various open and commercial data engines, including Hive, Dremio, Spark, and AWS Athena (based on Presto), among others.
Snowflake developed and released (or pledged to release, anyway) Polaris to be the standard metadata catalog for the Apache Iceberg ecosystem. Like Nessie, Polaris uses Iceberg’s open REST-based API to get access to the descriptive metadata of the Parquet data that Iceberg stores. This REST API then serves as the interface between the data stored in Iceberg tables and data processing engines, such as Snowflake’s native SQL engine as well as a variety of open-source engines.
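To make the REST interface a bit more concrete, the sketch below builds the request paths that the Iceberg REST catalog specification defines for listing namespaces and loading a table’s metadata. The catalog prefix, namespace, and table names used here are hypothetical placeholders, and this is only an illustration of the path scheme, not any vendor’s implementation.

```python
# Sketch of the request paths defined by the Iceberg REST catalog spec.
# The prefix, namespace, and table names below are hypothetical placeholders.
from urllib.parse import quote


def list_namespaces_path(prefix: str = "") -> str:
    """Path an engine calls to discover namespaces in a REST catalog."""
    base = f"/v1/{quote(prefix)}" if prefix else "/v1"
    return f"{base}/namespaces"


def load_table_path(namespace: str, table: str, prefix: str = "") -> str:
    """Path an engine calls to load a table's metadata
    (schema, snapshots, and data file locations)."""
    base = f"/v1/{quote(prefix)}" if prefix else "/v1"
    return f"{base}/namespaces/{quote(namespace)}/tables/{quote(table)}"


print(list_namespaces_path())                     # /v1/namespaces
print(load_table_path("marketing", "customers"))  # /v1/namespaces/marketing/tables/customers
```

Any engine that can issue these HTTP calls, whether Snowflake’s SQL engine, Spark, or something else, can discover and read the same Iceberg tables, which is the interoperability point these vendors are making.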
Data Catalogs
Data catalogs are typically third-party tools that companies use to organize all of the data they have stored across their organizations. They typically include some facility that allows users to search for data their organization may own, which means data catalogs often have some data discovery component.
Many data catalogs, such as Alation’s catalog, have also evolved to include access control functionality, as well as data lineage tracking and governance capabilities. In some cases, data management tool vendors that started out providing data governance and access control, such as Collibra, have evolved the other way, to also include data catalogs and data discovery capabilities.
And like metadata catalogs, regular data catalogs–or what some in the industry term “enterprise” data catalogs–are also fully involved in gobbling up metadata to help them track various data assets. One enterprise data catalog vendor, Atlan, focuses its efforts on unifying the metadata generated by different datasets and synchronizing them through a metadata “control plane,” thereby ensuring that the business metrics don’t get too out of whack.
By now, you’re probably wondering: “So what the heck is the difference?!” They both track metadata, and they both have “data catalog” in their names. So what’s the difference between a metadata catalog and a data catalog?
So What’s The Difference?!
To help us decode the differences between these two catalog types, Datanami recently talked to Felix Van de Maele, the CEO and co-founder of Collibra, one of the leading data catalog vendors in the big data space.
“They’re very different things,” Van de Maele said. “If you think about Polaris catalog and Unity Catalog from Databricks–and AWS and Google and Microsoft all have their catalogs–it’s really this idea that you’re able to store your data anywhere, on any clouds…And I can use any kind of data engine like a Databricks, like a Snowflake, like a Google, AWS, and so forth, to consume that data.”
But what Collibra and other enterprise data catalogs do is quite different, Van de Maele said.
“What we do is we provide much more of the business context,” he said. “We provide what we call that knowledge graph, that business context where you’re actually defining and managing your policies. Policies such as what’s the quality of my data? What business rules does my data need to comply to? What privacy policies does my data need to comply to? Who needs to approve it? How do we capture attestations? How do we do certification? How do I build a business glossary with business terms and clear definitions?
“That’s very different than a Polaris catalog on top of Iceberg that’s the physical metadata. And that’s a real differentiation,” he said.
Van de Maele supports the open data lakehouse architecture that has emerged, which gives customers the freedom to store their data in open table formats, such as Iceberg, Delta, and Hudi, and query it with any engine. His customers, many of which are Fortune 500 enterprises, store data across many data platforms and use the Collibra Data Intelligence platform to help control and govern access to that data.
Different Roles
Customers should understand that, while the names are similar, metadata catalogs and data catalogs play very different roles.
“The way I differentiate between the two is we do policy definition and management, they do policy enforcement,” Van de Maele said. “And actually I think that’s the right architecture.”
The metadata catalogs typically do not have functionality to allow users to set up business policies around data access. For instance, they won’t let you set up access controls to enable a marketing team to access all customer data except for anything that’s been marked “classified,” in which case it must be masked, Van de Maele said.
“We can have marketing data in Databricks, we have marketing data in Salesforce, we have marketing data in Google, and anywhere people are using marketing data, I need to make sure that the right data is classified and masked,” he said. “So we push that down in Databricks, in Snowflake, in Google, in Amazon and in Microsoft.”
Customers could define their own data access policies without a tool like Collibra’s, Van de Maele said. After all, it’s just SQL at the end of the day. But then they would need some other method to keep track of the millions of columns spread across various data platforms. Providing insight into what data exists and where, and then ensuring customers are accessing it according to the company’s governance rules, is the role Collibra serves.
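To illustrate the division of labor being described, here is a toy sketch of the enforcement side: a policy defined once (“mask anything tagged classified for unauthorized roles”) applied to rows at read time. Every name in it is invented for illustration; real platforms push this down as native column-masking rules rather than filtering rows in application code.

```python
# Toy sketch of policy enforcement: mask columns tagged "classified"
# for users outside an allowed role. All names here are invented for
# illustration; real platforms apply such rules natively in the engine.

MASK = "***MASKED***"


def apply_masking(rows, column_tags, user_role, allowed_role="compliance"):
    """Return rows with classified columns masked for unauthorized roles."""
    if user_role == allowed_role:
        return rows
    classified = {col for col, tags in column_tags.items() if "classified" in tags}
    return [
        {col: (MASK if col in classified else val) for col, val in row.items()}
        for row in rows
    ]


tags = {"email": {"classified", "pii"}, "region": set()}
rows = [{"email": "a@example.com", "region": "EMEA"}]
print(apply_masking(rows, tags, user_role="marketing"))
# [{'email': '***MASKED***', 'region': 'EMEA'}]
```

The point of the architecture Van de Maele describes is that this kind of rule is defined and managed once, centrally, and then pushed down to each platform’s native enforcement mechanism rather than re-implemented per system.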
At the same time, Collibra is dependent upon metadata catalogs for the enforcement mechanisms. Other enforcement mechanisms have been tried, such as proxies and drivers, Van de Maele said, but none of them has worked well.
“We think the metadata catalog approach with open table format is actually the right approach,” he said. “We want to have those data platforms be able to do that natively, otherwise scalability and performance always become a problem.”
Databricks Unity Catalog appears to be the exception here. Unity Catalog, which Databricks just open sourced last month, provides the low-level control over technical metadata as well as higher-level functions, such as data governance, access control, auditing, and lineage. In that respect, Unity Catalog appears to compete with the enterprise data catalog vendors.
Related Items:
What the Big Fuss Over Table Formats and Metadata Catalogs Is All About
Databricks to Open Source Unity Catalog
What to Look for in a Data Catalog