Companies across all industries want to share data with each other to enable collaboration and accelerate innovation. However, these organizations often use different data or cloud platforms, which creates friction or blocks collaboration. Databricks and the Linux Foundation developed Delta Sharing, marking a significant milestone in the democratization of data exchange with the first open source approach to data sharing across platforms, clouds, and regions. With Delta Sharing, customers are no longer limited to collaborating within their own platform and customer base but can instead go beyond and share data with all of their customers, partners, and any other collaborators.
Since announcing general availability of Delta Sharing in 2022, we have seen many enterprises adopt it to maximize their reach and collaborate with their customers and partners, regardless of cloud or platform. Databricks customers use the natively offered managed Delta Sharing service, which supports both Databricks-to-Databricks (D2D) sharing and Databricks-to-Open (D2O) sharing with non-Databricks recipients. For example, Databricks customers Atlassian and Nasdaq use D2O to deliver data to all their partners and customers on any computing platform, anywhere. Data and software platforms such as Oracle have also adopted Delta Sharing for Oracle-to-Open sharing to help enable their customers.
Databricks-to-Open (D2O) Delta Sharing lets organizations share data managed in a Unity Catalog-enabled workspace with any user on any computing platform, anywhere. This approach enables Databricks customers to collaborate with all of their partners, customers, and suppliers, regardless of which data or cloud platform they use.
This blog will showcase the pivotal role of D2O in modern data sharing strategies with internal metrics and real-world applications. We will explore D2O scenarios that empower organizations to extend their data sharing capabilities, enabling interoperability with both internal business units and external partners’ systems, and effectively reaching customers anywhere.
In addition, we will highlight the most commonly used Delta Sharing open source connectors, such as Python, Apache Spark™, Excel, Tableau, and PowerBI, which are part of the growing, open Delta Sharing ecosystem. We will also showcase how Databricks customers leverage D2O combined with the Delta Sharing REST API to build a cohesive data fabric architecture, customizing their data sharing experiences across their entire customer base.
Finally, we will review Databricks Marketplace’s recent support for D2O, which now enables recipients to access Marketplace listings via the Delta Sharing open connectors. For example, we will explain how the Python connector or Spark connector can be used to consume a Delta Sharing listing in systems where there is no native connector, such as Amazon EMR, Google BigQuery, and Snowflake.
Increasingly, enterprises are implementing a D2O workflow to simplify collaboration both internally and externally across multiple platforms to unlock the potential of their data to drive innovation, ensure robust governance, and accelerate growth.
Open Ecosystem of Connectors
Consuming data shared using the Delta Sharing open sharing protocol requires an OSS connector, authenticated using a credential file that is typically obtained when a provider shares an activation token with a recipient.
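As a minimal sketch of that flow with the Python connector: the credential (profile) file is a small JSON document, and the connector's `SharingClient` reads it to list the tables a recipient can access. The endpoint, token, and file name below are placeholders, and `check_profile` and `list_shared_tables` are hypothetical helpers written for this illustration, not part of the connector.

```python
import json
import tempfile
from pathlib import Path

# Fields the open-sharing profile file is expected to carry.
REQUIRED_FIELDS = ("shareCredentialsVersion", "endpoint", "bearerToken")


def check_profile(path):
    """Hypothetical helper: load a profile file and confirm the required fields exist."""
    profile = json.loads(Path(path).read_text())
    missing = [name for name in REQUIRED_FIELDS if name not in profile]
    if missing:
        raise ValueError(f"profile is missing fields: {missing}")
    return profile


def list_shared_tables(path):
    """List every table visible to this recipient (needs `pip install delta-sharing`)."""
    import delta_sharing  # imported lazily so check_profile works without the package

    client = delta_sharing.SharingClient(str(path))
    return [f"{t.share}.{t.schema}.{t.name}" for t in client.list_all_tables()]


# Placeholder profile for illustration only; a real one comes from the provider.
example_profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token>",
}
profile_path = Path(tempfile.mkdtemp()) / "config.share"
profile_path.write_text(json.dumps(example_profile))

loaded = check_profile(profile_path)
print(loaded["endpoint"])
```

With a real profile file from a provider, `list_shared_tables(profile_path)` would return names of the form `share.schema.table`, which is the same addressing the connectors use when loading data.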
The table below summarizes the OSS connectors that Delta Sharing currently supports, with download links and the major features of each. For example, the Python connector supports querying metadata, reading table snapshots, reading the Change Data Feed (CDF), and loading data into Pandas. The Apache Spark connector provides similar capabilities and fits directly into Spark users’ workflows. These connectors are part of the broader OSS Delta Sharing project, which aims to simplify data sharing and consumption through familiar APIs and to promote open, accessible data sharing. All of these connectors also let recipients who are not yet on Unity Catalog (UC) read data shared from UC.
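To make the Python connector's snapshot and CDF features concrete, here is a short sketch. The share, schema, and table names are placeholders, the three functions are hypothetical wrappers, and reading changes assumes the provider has shared the table with its change data feed enabled.

```python
def table_url(profile_path, share, schema, table):
    """Build the '<profile-path>#<share>.<schema>.<table>' address the connectors expect."""
    return f"{profile_path}#{share}.{schema}.{table}"


def read_snapshot(url, limit=None):
    """Read the current snapshot of a shared table into a pandas DataFrame."""
    import delta_sharing  # lazy import: requires `pip install delta-sharing`

    return delta_sharing.load_as_pandas(url, limit=limit)


def read_changes(url, starting_version):
    """Read change data feed rows from a given table version onward."""
    import delta_sharing  # lazy import: requires `pip install delta-sharing`

    return delta_sharing.load_table_changes_as_pandas(
        url, starting_version=starting_version
    )


# Placeholder names; substitute the values your provider shared with you.
url = table_url("config.share", "my_share", "my_schema", "my_table")
print(url)  # config.share#my_share.my_schema.my_table
```

Passing a small `limit` to `read_snapshot` is a cheap way to preview a table before pulling the full snapshot.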
Earlier this year, a new Tableau Delta Sharing connector was announced to support seamless data sharing between Tableau and Databricks. It enables sharing internally between business units and externally with partners and customers. In addition, the new “Explore in Tableau” feature simplifies data analysis by providing one-click connectivity to data sources. These new developments highlight the next steps in Tableau and Databricks’ shared vision. They also leverage Delta Lake’s capabilities to enhance data quality and governance, significantly advancing data utilization and management.
Using Open Connectors in Your Platform of Choice: BigQuery and Snowflake Examples
When integrating Delta Sharing with systems that lack native connectors, such as BigQuery and Snowflake, the Delta Sharing Python connector can bridge the gap. For BigQuery users, PySpark can authenticate and access shared data via the `delta_sharing` library, load the data into a DataFrame, and write it directly to BigQuery. This process uses Google Cloud Dataproc for scalable data processing, so that data handling is both efficient and secure. To learn more about how to use Delta Sharing with BigQuery, read the Medium blog post from Databricks experts.
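The BigQuery path might look roughly like the sketch below. It assumes a Spark session (for example on Dataproc) that already has the Delta Sharing Spark connector and the Spark BigQuery connector on its classpath; `copy_share_to_bigquery`, the bucket, and the table names are hypothetical, and the `temporaryGcsBucket` option reflects the BigQuery connector's indirect write path.

```python
def copy_share_to_bigquery(spark, table_url, bq_table, temp_gcs_bucket):
    """Read a shared table with Spark and write it to a BigQuery table.

    spark           -- an existing SparkSession with both connectors available
    table_url       -- '<profile-path>#<share>.<schema>.<table>'
    bq_table        -- destination in 'dataset.table' form (placeholder)
    temp_gcs_bucket -- staging bucket used by the BigQuery connector (placeholder)
    """
    # Load the shared table through the Delta Sharing data source.
    df = spark.read.format("deltaSharing").load(table_url)

    # Write the DataFrame to BigQuery, staging the data in Cloud Storage.
    (
        df.write.format("bigquery")
        .option("temporaryGcsBucket", temp_gcs_bucket)
        .mode("overwrite")
        .save(bq_table)
    )
    return df


# Usage on a cluster (not executed here):
# copy_share_to_bigquery(spark, "config.share#my_share.my_schema.my_table",
#                        "my_dataset.my_table", "my-staging-bucket")
```

Using `overwrite` keeps the example idempotent; an incremental load would more likely append rows read from the change data feed.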
Similarly, for Snowflake integration, recipients can utilize the Python connector with the Pandas library to import data into a DataFrame. Following the data import, Snowflake’s Snowpark Python API facilitates the connection to Snowflake databases, allowing for seamless data writing from the Pandas DataFrame into Snowflake tables.
Code example:
<span class="subtle">pip install delta-sharing snowflake-snowpark-python pandas
import delta_sharing
import pandas as pd
from snowflake.snowpark import Session
# Path to the Delta Sharing profile JSON file
profile_file = "path/to/your/profile.delta-sharing.json"
# Create a client for browsing the shares, schemas, and tables in the profile
client = delta_sharing.SharingClient(profile_file)
# Load a specific table into a pandas DataFrame
# URL format: <profile-file-path>#<share-name>.<schema-name>.<table-name>
table_url = profile_file + "#share_name.schema_name.table_name"
df = delta_sharing.load_as_pandas(table_url)
# Snowflake Snowpark session setup
connection_parameters = { …}
# Create a Snowflake session
session = Session.builder.configs(connection_parameters).create()
# Write the pandas DataFrame directly to a Snowflake table
session.write_pandas(df, "your_snowflake_table_name", auto_create_table=True)</span>
This method offers significant advantages because it eliminates the need for providers to replicate data in a separate system simply for sharing purposes, which would otherwise require additional computing, storage, and technical effort. By using Delta Sharing, data providers can directly share from their Databricks environment, enabling recipients to access the live data across various platforms, without the need for replication. This approach not only demonstrates the flexibility and cost-effectiveness of Delta Sharing but also enhances efficiency by consolidating data in a single system.
Enhance Your Data Services with the Delta Sharing API
The development of custom interfaces for Databricks’ Delta Sharing is revolutionizing how organizations share data with their external customers. This trend is a testament to the flexibility and open nature of Delta Sharing’s REST API, which companies are using to create tailored data sharing applications. Such applications are designed not only to enhance user experience but also to fit seamlessly into a comprehensive data fabric strategy.
Clients are leveraging these custom-built applications to control their data exchange environments, enabling them to share data hosted on Databricks with their customers who may not be using the same platform. This capability is crucial as it extends the reach of data sharing beyond the limitations of specific technologies or platforms. The use of the Delta Sharing open protocol allows for unrestricted access to shared data, democratizing data access and supporting customers across diverse technological landscapes. This approach enhances the overall data fabric by making data exchanges secure, scalable, and significantly more accessible.
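For a sense of what building on the REST API involves, the sketch below lists resources from a sharing server using only the Python standard library. It assumes the open protocol's bearer-token authentication and its paginated listing responses (`items` plus an optional `nextPageToken`); the endpoint and token are placeholders, and `http_get_json`, `list_all`, and `fake_fetch` are hypothetical names for this illustration.

```python
import json
import urllib.parse
import urllib.request


def http_get_json(url, token):
    """Send an authenticated GET request and decode the JSON response."""
    request = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def list_all(endpoint, resource, token, fetch=http_get_json, max_results=100):
    """Collect every item from a paginated listing such as 'shares'."""
    items, page_token = [], None
    while True:
        params = {"maxResults": max_results}
        if page_token:
            params["pageToken"] = page_token
        url = f"{endpoint.rstrip('/')}/{resource}?{urllib.parse.urlencode(params)}"
        page = fetch(url, token)
        items.extend(page.get("items", []))
        page_token = page.get("nextPageToken")
        if not page_token:
            return items


def fake_fetch(url, token):
    """Stand-in for a real server so the example runs offline."""
    if "pageToken=p2" in url:
        return {"items": [{"name": "share_b"}]}
    return {"items": [{"name": "share_a"}], "nextPageToken": "p2"}


shares = list_all(
    "https://sharing.example.com/delta-sharing/", "shares", "<token>", fetch=fake_fetch
)
print([s["name"] for s in shares])  # ['share_a', 'share_b']
```

Dropping the `fetch=fake_fetch` argument sends real requests, and the same loop can list `shares/<share>/schemas` or `shares/<share>/all-tables`, which is typically the first thing a custom sharing portal needs to render.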
By customizing user interfaces to external partners’ needs, organizations enhance collaboration and drive innovation, transforming data exchange into a strategic asset that improves business relationships and customer engagement. This approach strengthens their competitive edge in a data-driven market. The emphasis on flexibility and adaptability in these customized interfaces marks a new era of strategic data exchange.
For example, Atlassian integrates with Delta Sharing to help their customers drive insights with a flexible, open ecosystem. Atlassian Analytics’ latest feature, Data Shares, is powered by Databricks Delta Sharing’s open-source protocol. Data Shares lets you access Atlassian data in your own environments and in any BI tool. Watch Atlassian’s 2024 Data + AI Summit session, “Empowering Enterprise Grade Customers with Delta Sharing – an Atlassian Analytics Story.”
“Atlassian Analytics recently launched Data Shares, leveraging Delta Sharing from Databricks, to boost flexibility and accelerate customers’ time-to-insight. Whether users choose to work within Atlassian Analytics or continue using dashboards they’re already familiar with, Delta Sharing’s open ecosystem of connectors, including Tableau, PowerBI, and Spark, enables customers to easily power their environments with data directly from the Atlassian Data Lake.”
— Ben Jackson, Senior Group Product Manager, Data & Analytics, Atlassian
Another Databricks customer, Nasdaq, has been using Delta Sharing for its Data Link platform, which delivers market data, alternative data, and partner data to its users. As its data sets grew, Nasdaq needed a scalable solution to deliver terabytes of data securely and efficiently while reducing egress costs. Nasdaq uses Delta Sharing, customized for its specific needs, together with built-in governance from Databricks. To learn more about how Nasdaq uses D2O sharing, hear from them in the 2024 Data + AI Summit session, “Delta Sharing unlocks the value of your data to partners and customers.”
Oracle announced Delta Sharing integration for their Oracle Autonomous Database users last year to connect with Databricks across clouds. Customers no longer have to deal with data locked in one platform or copy their data to share it with another platform. With Delta Sharing, these platforms can read each other’s data without copying it. This helps avoid stale data, unnecessary compute usage, and extra work. Read Oracle’s blog post to learn more about this integration. You can also learn more from Oracle in the 2024 Data + AI Summit session “Delta Sharing: Open Protocol for Secure Data Sharing (OSS).”
Databricks Marketplace D2O
Databricks Marketplace is an open marketplace for all your data and AI assets, such as AI models, tabular data, file-based data, as well as industry-based Solution Accelerators.
The Databricks Marketplace D2O (Databricks-to-Open) feature extends the capabilities of Marketplace to support recipients across non-Databricks platforms, leveraging the power of Delta Sharing. This extension enables a broader range of data sharing possibilities beyond the conventional Databricks-to-Databricks (D2D) interactions, by implementing a unique credential system for recipient identification. Unlike the standard procedure that relies on mutual authentication between Databricks account metastores, D2O facilitates the sharing of data through an open protocol, allowing recipients to access shared assets without the necessity of a Databricks account. Furthermore, after the listing is installed, the feature offers the functionality for users to download and renew the credential token needed to access the shared data. This enhances the Databricks Marketplace’s utility by enabling integration with external tools such as Spark, PowerBI, Excel, and non-UC Databricks accounts, thus broadening the scope of data accessibility and collaboration.
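Because the credential token can expire and be renewed, a recipient-side job may want to check the downloaded credential file before each run. The sketch below assumes the file carries the open protocol's optional `expirationTime` field as a UTC timestamp ending in `Z`; the seven-day margin, the sample values, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_expiration(value):
    """Parse a UTC timestamp such as '2025-01-31T00:00:00.0Z' (assumed format)."""
    base = value.rstrip("Z").split(".")[0]  # drop the 'Z' suffix and fractional seconds
    return datetime.strptime(base, "%Y-%m-%dT%H:%M:%S").replace(tzinfo=timezone.utc)


def needs_renewal(profile, now, margin=timedelta(days=7)):
    """Return True when the credential expires within `margin` of `now`."""
    expiration = profile.get("expirationTime")
    if expiration is None:
        return False  # no expiry recorded in the file
    return parse_expiration(expiration) - now <= margin


# Placeholder profile contents for illustration only.
profile = {
    "shareCredentialsVersion": 1,
    "endpoint": "https://sharing.example.com/delta-sharing/",
    "bearerToken": "<token>",
    "expirationTime": "2025-01-31T00:00:00.0Z",
}
now = datetime(2025, 1, 28, tzinfo=timezone.utc)
print(needs_renewal(profile, now))  # True
```

A scheduled pipeline could run this check first and alert an operator to download a renewed credential from the installed listing before reads start failing.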
Advancing Data Collaboration through D2O
Our exploration of D2O Delta Sharing highlights its pivotal role in facilitating data exchange across Databricks and non-Databricks platforms. By deploying connectors, D2O enhances data accessibility and ensures seamless integration with various platforms, including Spark, PowerBI, Tableau, and Excel. This strategic interoperability fosters a more inclusive data ecosystem, improving the utility and applicability of data in diverse analytical and operational scenarios.
D2O’s approach to data sharing marks a significant advancement in data democratization, empowering organizations to spread insights and foster collaboration beyond traditional boundaries. The impact of this feature is substantial, simplifying data operations, sparking innovation, and opening new avenues for growth and efficiency.
Reflecting on the capabilities and potential of D2O Delta Sharing, it is clear that this innovation is more than just technological progress; it is a commitment to open, accessible, and collaborative data exchange. With the advancements made by D2O, the future of data sharing looks promising, cementing data’s role as a crucial element in decision-making and innovation in today’s digital world.
Getting Started with Delta Sharing
To learn more about how to implement Delta Sharing within your organization, check out the latest resources including new eBooks and related blogs below, or deep dive into the Delta Sharing technical documentation.
If you are already a Delta Sharing customer, you can also reach out to the team with questions or to provide feedback at datasharing[at]databricks.com.