Introduction
Databricks has joined forces with the Virtue Foundation through Databricks for Good, a grassroots initiative providing pro bono professional services to drive social impact. Through this partnership, the Virtue Foundation will advance its mission of delivering quality healthcare worldwide by optimizing a cutting-edge data infrastructure.
Current State of the Data Model
The Virtue Foundation utilizes both static and dynamic data sources to connect doctors with volunteer opportunities. To ensure data remains current, the organization’s data team implemented API-based data retrieval pipelines. While the extraction of basic information such as organization names, websites, phone numbers, and addresses is automated, specialized details like medical specialties and regions of activity require significant manual effort. This reliance on manual processes limits scalability and reduces the frequency of updates. Additionally, the dataset’s tabular format presents usability challenges for the Foundation’s primary users, such as doctors and academic researchers.
Desired State of the Data Model
In short, the Virtue Foundation aims to ensure its core datasets are consistently up-to-date, accurate, and readily accessible. To realize this vision, Databricks professional services designed and built the following components.
As depicted in the diagram above, we utilize a classic medallion architecture to structure and process our data. Our data sources include a range of API and web-based inputs, which we first ingest into a bronze landing zone via batch Spark processes. This raw data is then refined in a silver layer, where we clean and extract metadata via incremental Spark processes, typically implemented with structured streaming.
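As a rough illustration of this bronze-to-silver hop, the sketch below pairs a batch append into the landing zone with an incremental structured streaming refinement. The table names, paths, and cleaning rules are placeholders for illustration, not the Foundation's actual schema.

```python
# Minimal sketch of the bronze -> silver flow (illustrative names and paths only).
from pyspark.sql import functions as F

# Batch ingestion of raw API/web payloads into the bronze landing zone.
raw_df = spark.read.json("/Volumes/vf/raw/api_dumps/")  # hypothetical source path
(raw_df
 .withColumn("_ingested_at", F.current_timestamp())
 .write.mode("append")
 .saveAsTable("vf.bronze.org_payloads"))

# Incremental refinement into the silver layer via structured streaming.
silver_stream = (
    spark.readStream.table("vf.bronze.org_payloads")
    .withColumn("org_name", F.trim(F.col("name")))
    .withColumn("website", F.lower(F.col("website")))
    .dropDuplicates(["org_name", "website"])
)

(silver_stream.writeStream
 .option("checkpointLocation", "/Volumes/vf/checkpoints/silver_orgs")
 .trigger(availableNow=True)
 .toTable("vf.silver.organizations"))
```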
Once processed, the data is sent to two production systems. In the first, we create a robust, tabular dataset that contains essential information about hospitals, NGOs, and related entities, including their location, contact information, and medical specialties. In the second, we implement a LangChain-based ingestion pipeline that incrementally chunks and indexes raw text data into a Databricks Vector Search.
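The second path can be sketched roughly as follows, assuming a LangChain text splitter feeding a Delta table that backs a Vector Search delta-sync index. The table, endpoint, index, and column names are illustrative assumptions, not the production configuration.

```python
# Sketch: chunk raw page text with LangChain and index it via Databricks Vector Search.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from pyspark.sql import functions as F, types as T

@F.udf(returnType=T.ArrayType(T.StringType()))
def chunk_text(text):
    # Split long page text into overlapping chunks suited for retrieval.
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    return splitter.split_text(text or "")

chunks = (
    spark.readStream.table("vf.silver.page_text")          # assumed source table
    .withColumn("chunk", F.explode(chunk_text("raw_text")))
    .withColumn("chunk_id", F.md5(F.concat_ws("|", "url", "chunk")))
)

(chunks.writeStream
 .option("checkpointLocation", "/Volumes/vf/checkpoints/rag_chunks")
 .trigger(availableNow=True)
 .toTable("vf.silver.rag_chunks"))

# A delta-sync index keeps embeddings current as the chunk table changes
# (the source table needs Change Data Feed enabled).
from databricks.vector_search.client import VectorSearchClient

VectorSearchClient().create_delta_sync_index(
    endpoint_name="vf_vector_search",                       # assumed endpoint name
    index_name="vf.silver.rag_chunks_index",                # assumed index name
    source_table_name="vf.silver.rag_chunks",
    pipeline_type="TRIGGERED",
    primary_key="chunk_id",
    embedding_source_column="chunk",
    embedding_model_endpoint_name="databricks-gte-large-en",
)
```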
From a user perspective, these processed data sets are accessible through vfmatch.org and are integrated into a Retrieval-Augmented Generation (RAG) chatbot, hosted in the Databricks AI Playground, providing users with a powerful, interactive data exploration tool.
Interesting Design Choices
The vast majority of this project leveraged standard ETL techniques; however, a few intermediate and advanced techniques proved especially valuable in this implementation.
MongoDB Bi-Directional CDC Sync
The Virtue Foundation uses MongoDB as the serving layer for their website. Connecting Databricks to an external database like MongoDB can be complex due to compatibility limitations—certain Databricks operations may not be fully supported in MongoDB and vice versa, complicating the flow of data transformations across platforms.
To address this, we implemented a bidirectional sync that gives us full control over how data from the silver layer is merged into MongoDB. This sync maintains two identical copies of the data, so changes in one platform are reflected in the other based on the sync trigger frequency. At a high level, there are two components:
- Syncing MongoDB to Databricks: Using MongoDB change streams, we capture any updates made in MongoDB since the last sync. With structured streaming in Databricks, we apply a merge statement within forEachBatch() to keep the Databricks tables updated with these changes (a minimal sketch of this pattern follows the list).
- Syncing Databricks to MongoDB: Whenever updates occur on the Databricks side, structured streaming’s incremental processing capabilities allow us to push these changes back to MongoDB. This ensures that MongoDB remains in sync and accurately reflects the latest data, which is then served through the vfmatch.org website.
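To make the first leg concrete, here is a minimal sketch of the change-stream-to-Delta merge, assuming the MongoDB Spark Connector (v10+) as the streaming source; the target table, secret scope, and checkpoint path are placeholders rather than the Foundation's actual configuration. The reverse leg can use the same connector as a streaming sink.

```python
# Sketch: MongoDB -> Databricks leg of the sync via change streams + foreachBatch merge.
from delta.tables import DeltaTable

def upsert_to_delta(batch_df, batch_id):
    """Merge a micro-batch of MongoDB changes into the Delta target table."""
    target = DeltaTable.forName(spark, "vf.silver.organizations")  # assumed target
    (target.alias("t")
     .merge(batch_df.alias("s"), "t._id = s._id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())

(spark.readStream
 .format("mongodb")
 .option("spark.mongodb.connection.uri", dbutils.secrets.get("vf", "mongo_uri"))
 .option("spark.mongodb.database", "vfmatch")          # assumed database name
 .option("spark.mongodb.collection", "organizations")  # assumed collection name
 .load()
 .writeStream
 .foreachBatch(upsert_to_delta)
 .option("checkpointLocation", "/Volumes/vf/checkpoints/mongo_cdc")
 .trigger(availableNow=True)
 .start())
```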
This bidirectional setup ensures that data flows seamlessly between Databricks and MongoDB, keeping both systems up-to-date and eliminating data silos.
Thank you Alan Reese for owning this piece!
GenAI-based Upsert
To streamline data integration, we implemented a GenAI-based approach for extracting and merging hospital information from blocks of website text. This process involves two key steps:
- Extracting Information: First, we use GenAI to extract critical hospital details from unstructured text on various websites. This is done with a simple call to Meta’s llama-3.1-70B on Databricks Foundational Model Endpoints.
- Primary Key Creation and Merging: Once the information is extracted, we generate a primary key based on a combination of city, country, and entity name. We then use embedding distance thresholds to determine whether the entity already exists in the production database (both steps are sketched below).
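A rough sketch of both steps might look like the following, assuming the pay-per-token Foundation Model chat endpoint reached through mlflow.deployments and a JSON-only extraction prompt; the endpoint name, prompt, and field names are illustrative assumptions.

```python
# Sketch: LLM-based extraction followed by natural-key construction.
import json
import mlflow.deployments

client = mlflow.deployments.get_deploy_client("databricks")

def extract_entity(page_text: str) -> dict:
    """Ask the LLM to pull structured hospital fields out of raw website text."""
    resp = client.predict(
        endpoint="databricks-meta-llama-3-1-70b-instruct",  # assumed endpoint name
        inputs={
            "messages": [
                {"role": "system",
                 "content": "Extract entity_name, city, country_code as a JSON object."},
                {"role": "user", "content": page_text[:8000]},
            ],
            "max_tokens": 300,
        },
    )
    # Assumes the model returns well-formed JSON; production code would validate this.
    return json.loads(resp["choices"][0]["message"]["content"])

def natural_key(entity: dict) -> str:
    """Primary key built from country, city, and entity name."""
    return "|".join(
        (entity.get(k) or "").strip().lower()
        for k in ("country_code", "city", "entity_name")
    )
```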
Traditionally, this would have required fuzzy matching techniques and complex rule sets. However, by combining embedding distance with simple deterministic rules, for instance, exact match for country, we were able to create a solution that is both effective and relatively simple to build and maintain.
For the current iteration of the product, we use the following matching criteria:
- Country code exact match.
- State/Region or City fuzzy match, allowing for slight differences in spelling or formatting.
- Entity Name embedding cosine similarity, allowing for common variations in name representation, e.g., “St. John’s” and “Saint Johns”. Note that we also include a tunable distance threshold to determine whether a human should review the change prior to merging (a sketch of these matching rules follows below).
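A minimal sketch of that rule set, assuming normalized candidate and production records with precomputed name embeddings, is shown below; the thresholds and the fuzzy matcher are illustrative stand-ins, not the tuned production values.

```python
# Sketch: deterministic rules plus embedding similarity for entity matching.
from difflib import SequenceMatcher
import numpy as np

NAME_SIM_AUTO_MERGE = 0.90   # above this, merge automatically (assumed value)
NAME_SIM_REVIEW = 0.75       # between the two, route to human review (assumed value)

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_decision(candidate: dict, existing: dict) -> str:
    # 1) Country code must match exactly.
    if candidate["country_code"] != existing["country_code"]:
        return "no_match"
    # 2) City/region fuzzy match tolerates small spelling or formatting differences.
    city_sim = SequenceMatcher(
        None, candidate["city"].lower(), existing["city"].lower()).ratio()
    if city_sim < 0.8:
        return "no_match"
    # 3) Entity-name cosine similarity, with a tunable band that escalates
    #    borderline cases to a human reviewer before merging.
    name_sim = cosine(candidate["name_embedding"], existing["name_embedding"])
    if name_sim >= NAME_SIM_AUTO_MERGE:
        return "merge"
    if name_sim >= NAME_SIM_REVIEW:
        return "human_review"
    return "no_match"
```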
Thank you Patrick Leahey for the amazing design idea and implementing it end to end!
Additional Implementations
As mentioned, the broader infrastructure follows standard Databricks architecture and practices. Here’s a breakdown of the key components and the team members who made it all possible:
- Data Source Ingestion: We utilized Python-based API requests and batch Spark for efficient data ingestion. Huge thanks to Niranjan Sarvi for leading this effort!
- Medallion ETL: The medallion architecture is powered by structured streaming and LLM-based entity extraction, which enriches our data at every layer. Special thanks to Martina Desender for her invaluable work on this component!
- RAG Source Table Ingestion: To populate our Retrieval-Augmented Generation (RAG) source table, we used LangChain, structured streaming, and Databricks agents. Kudos to Renuka Naidu for building and optimizing this crucial element!
- Vector Store: For vectorized data storage, we implemented Databricks Vector Search and the supporting DLT infrastructure. Big thanks to Theo Randolph for designing and building the initial version of this component!
Summary
Through our collaboration with Virtue Foundation, we’re demonstrating the potential of data and AI to create lasting global impact in healthcare. From data ingestion and entity extraction to Retrieval-Augmented Generation, each phase of this project is a step toward creating an enriched, automated, and interactive data marketplace. Our combined efforts are setting the stage for a data-driven future where healthcare insights are accessible to those who need them most.
If you have ideas on similar engagements with other global non-profits, let us know at [email protected].