From Data to Destinations: How Skyscanner Optimizes Traveler Experiences with Databricks Unity Catalog

This blog is authored by Michael Ewins, Director of Engineering at Skyscanner

At Skyscanner, we’re more than just a flight search engine. We are a global leader in travel, helping more than 110 million users each month plan and book their trips with confidence and ease. Operating in over 30 languages, our platform connects travelers with a wide range of flights, hotels, and car rental options from over 1,200 travel partners across 180 countries.

We use data and AI to enhance the traveler experience as well as support internal decision-making. For our travelers, we use machine learning (ML) models to check over 80 billion prices every day, ranking and recommending hotels, flights, and car rentals, aiming to provide the best options based on journey time and costs. The Databricks Data Intelligence Platform powers some of these travel insights. In this blog, we discuss our journey with Databricks and how Unity Catalog helped us streamline our data management and governance.

To learn more, attend the Data + AI Summit 2024 for our session titled Skyscanner’s Journey of Enabling Practical Data and AI Governance.

Understanding Our Data Landscape and Challenges

Data has always been central to Skyscanner’s operations. Every day, our platform handles 35 million searches, generating 30 to 35 billion analytical events. The sheer volume of data, approximately 15 to 20 petabytes stored at any given time, poses significant challenges in data management and utilization. Our data is crucial for both consumer-facing features and internal decision-making processes, making its effective management a top priority for our engineering teams. This scale of data operations presents several challenges:

Volume and Velocity: Handling billions of events generated daily requires robust infrastructure and efficient data processing capabilities.
Scalability and Performance Issues: As Skyscanner grew, the data infrastructure struggled to keep pace with the increasing demand. Our legacy systems could not scale efficiently, leading to delays in data processing and an inability to handle large-scale data workloads effectively.
Complexity and Cost: Before transitioning to more streamlined solutions, our data management involved multiple systems, which often led to inefficiencies and increased operational costs.
Data Silos and Inconsistency: The disparate systems led to data being siloed, which hindered data accessibility and quality, affecting decision-making processes.
Compliance and Security Risks: With data spread across various systems, ensuring comprehensive security and compliance with international data protection regulations (like GDPR) was increasingly challenging. This risk was compounded by the lack of centralized control over data access and processing.

Databricks: A Game-Changer for Skyscanner

At Skyscanner, our commitment to leveraging cutting-edge technology is evident in our strategic partnership with Databricks. Databricks has been instrumental in transforming our approach to data management, enabling us to streamline operations and enhance the traveler experience.

All our data pipelines are built on top of the Databricks Data Intelligence Platform. We’ve established a robust data ingestion framework that captures data from a variety of sources, incorporating both batch and real-time streams. We utilize AWS Kinesis for streaming and Fivetran for batch data ingestion, ensuring that all incoming data is collected efficiently into our initial staging area, which we refer to as the ‘bronze layer’ of our medallion architecture. This stage is crucial as it handles the raw data collected from our diverse channels, including direct interactions from our web and mobile platforms.

Once in the bronze layer, the data undergoes a series of transformations and enrichments to prepare it for deeper analytical tasks. It then moves to the ‘silver layer,’ where it is cleaned, consolidated, and structured, ready for analytical consumption. In this phase, Databricks’ powerful Spark engine plays a crucial role, enabling fast and scalable data transformations.

In the ‘gold layer,’ our data is optimized for consumption by various business units: it is modeled and aggregated into metrics that directly support decision-making across the company. We leverage MLflow to manage the complete machine learning lifecycle. This includes everything from experimentation and reproducibility to the deployment of ML models, allowing us to track experiments, package code into reproducible runs, and deploy models directly into production seamlessly. While we’re currently serving these models into production using our own model-serving architecture, we’re in the process of evaluating Databricks’ model-serving capabilities that are part of the Databricks Mosaic AI offering.
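To make the bronze, silver, and gold layering concrete, here is a minimal plain-Python sketch. It is an illustration only and not Skyscanner's production pipeline, which runs on Spark. The event fields, the deduplication rule, and the "average price per market" metric are all assumptions chosen for the example.

```python
from collections import defaultdict

# Hypothetical raw events as they might land in a bronze layer.
# Field names and values are illustrative, not a real schema.
bronze = [
    {"event_id": "e1", "market": "UK", "price": "120.50"},
    {"event_id": "e1", "market": "UK", "price": "120.50"},  # duplicate delivery
    {"event_id": "e2", "market": "uk ", "price": "99.00"},  # untidy market code
    {"event_id": "e3", "market": "ES", "price": None},      # missing price
    {"event_id": "e4", "market": "ES", "price": "80.00"},
]

def to_silver(rows):
    """Clean bronze rows: drop missing prices and duplicates, normalize types."""
    seen = set()
    silver = []
    for row in rows:
        if row["price"] is None or row["event_id"] in seen:
            continue
        seen.add(row["event_id"])
        silver.append({
            "event_id": row["event_id"],
            "market": row["market"].strip().upper(),
            "price": float(row["price"]),
        })
    return silver

def to_gold(rows):
    """Aggregate silver rows into one metric: average price per market."""
    totals = defaultdict(lambda: [0.0, 0])  # market -> [sum, count]
    for row in rows:
        totals[row["market"]][0] += row["price"]
        totals[row["market"]][1] += 1
    return {market: total / count for market, (total, count) in totals.items()}

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)
```

The point of the split is that each layer has one job: bronze keeps the data exactly as it arrived, silver applies cleaning rules once so every consumer sees the same records, and gold holds the business-facing aggregates.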

Beyond processing and machine learning, we utilize Databricks for operational reporting and analytics. Databricks SQL allows our teams to perform SQL queries directly against our data lake, create dashboards, and execute complex analytical operations at scale. Integration with BI tools like Tableau Cloud enhances our capabilities, enabling us to visualize data and extract actionable insights efficiently.

Our Migration Journey to Unity Catalog

Data governance is a critical component of Skyscanner’s architecture. It underpins our ability to manage data securely and efficiently, ensuring that we can trust our data for making business decisions and maintaining compliance with global data protection regulations, including GDPR. As a subsidiary of a company listed on NASDAQ, adhering to strict regulatory standards such as the Sarbanes-Oxley Act is paramount for ensuring transparency and accountability in our operations. Databricks Unity Catalog, being built into the platform, helped us streamline these requirements.

Before implementing Unity Catalog, we faced several significant challenges:

Low Levels of Data Ownership: One of the more significant challenges we faced was the low level of ownership over datasets across the company. This often led to accountability issues, where no specific team or individual was responsible for the accuracy, privacy, and security of particular datasets.
Lack of Centralized Oversight: Managing data across disparate systems made it difficult to enforce consistent data governance policies. This lack of centralized control led to inefficiencies and increased the risk of non-compliance with data regulations such as GDPR.
Access Control Difficulties: Without a unified system, managing who had access to what data was cumbersome and often insecure. Handling IAM policies was particularly challenging, requiring substantial manual effort and being prone to errors. Ensuring the right level of access for various teams involved navigating complex IAM roles, which often led to either overly permissive access or overly restrictive practices, both of which could impede operational efficiency.
Inadequate Data Lineage and Auditing: We lacked automated tools for tracking data lineage and auditing changes, which are essential for troubleshooting and understanding the impact of data modifications. As a result, lineage graphs had to be prepared manually.

Recognizing these challenges, we developed a strategic approach to migrate to Unity Catalog. Our strategy included:

Prioritizing Business-Critical Tables: We conducted a comprehensive review of all data assets to classify them according to their importance to business operations, sensitivity, and compliance requirements. Although we had 30,000 tables in total, our active tables numbered only about 1,500, and of those, only about 350 were business-critical. That discovery was a game changer for us because it simplified our migration process.
Leveraging Automation: Initially, our teams manually migrated tables into Unity Catalog and adapted them to fit our domain model, which was a slow and time-consuming process. By leveraging Databricks’ automation tools, we significantly accelerated the migration without needing to rewrite our pipelines. To expedite the integration of all our data into Unity Catalog, we became less rigid about adhering strictly to the medallion architecture, which requires all data to be classified into bronze, silver, and gold layers. Instead, we adopted a more flexible approach: “We’ll meet you where your data is.” This strategy allowed us to make data visible in Unity Catalog immediately, with the intention of aligning it with the bronze, silver, and gold definitions over time.
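The prioritization step above can be pictured as a simple triage over table metadata. The Python sketch below is hypothetical: the 90-day activity window, the `last_read` and `business_critical` fields, and the table names are assumptions made for illustration, not the criteria Skyscanner actually used. The closing arithmetic uses only the counts quoted in the text.

```python
from datetime import date, timedelta

def triage(tables, today, active_window_days=90):
    """Bucket tables as critical, active (but not critical), or dormant.

    A table counts as dormant if it was never read or was last read
    before the activity window. The 90-day window is an assumption.
    """
    cutoff = today - timedelta(days=active_window_days)
    buckets = {"critical": [], "active": [], "dormant": []}
    for table in tables:
        last_read = table["last_read"]
        if last_read is None or last_read < cutoff:
            buckets["dormant"].append(table["name"])
        elif table["business_critical"]:
            buckets["critical"].append(table["name"])
        else:
            buckets["active"].append(table["name"])
    return buckets

# Hypothetical inventory; names and dates are made up for the example.
today = date(2024, 6, 1)
tables = [
    {"name": "bookings.daily_revenue", "last_read": date(2024, 5, 30), "business_critical": True},
    {"name": "search.events_sample", "last_read": date(2024, 5, 1), "business_critical": False},
    {"name": "tmp.old_experiment", "last_read": date(2022, 1, 15), "business_critical": False},
    {"name": "tmp.never_read", "last_read": None, "business_critical": False},
]
buckets = triage(tables, today)

# Shares implied by the approximate counts quoted in the text.
total, active, critical = 30_000, 1_500, 350
active_pct = round(active / total * 100, 2)      # share of tables that are active
critical_pct = round(critical / total * 100, 2)  # share that are business-critical
print(buckets, active_pct, critical_pct)
```

Seen this way, the migration effort concentrates on roughly 5% of the estate, with a little over 1% needing the most careful handling.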

Improving Data Visibility and Governance with Unity Catalog

Unity Catalog has become a pivotal element in our data governance framework at Skyscanner. It now manages and governs a significant volume, approximately 15 to 20 petabytes, of our data. This data includes everything from raw data in our ‘bronze’ layer to processed data in our ‘silver’ and ‘gold’ layers, which are used extensively across various business functions for analytical and operational purposes.

The implementation of Unity Catalog has brought substantial improvements to our data management and governance capabilities, yielding several key benefits:

Enhanced Data Security and Compliance: Unity Catalog has enabled us to centralize our data governance, providing robust security features and streamlined compliance processes. This centralization has reduced the complexity of managing permissions across disparate systems and helps ensure that only authorized personnel have access to sensitive data, which is crucial for adhering to stringent data protection laws, including GDPR.
Cost Optimization: The streamlined data management process enabled by Unity Catalog has led to more efficient use of our data storage and computing resources.
Scalability and Future-Proofing: Unity Catalog has provided a scalable architecture that accommodates our growing data needs. As Skyscanner continues to expand and evolve, Unity Catalog supports this growth by enabling us to manage increasing volumes of data without compromising on performance or security.
Enhanced Data Lineage: With Unity Catalog, we’ve significantly enhanced our data lineage capabilities. This means we now have a clear and detailed view of where our data originates, how it’s processed along the way, and where it ends up. This level of transparency is crucial not just for day-to-day operations but also for our compliance efforts, particularly with GDPR. Being able to trace the entire journey of our data helps us ensure that we’re handling it correctly and staying compliant with all necessary regulations. It also simplifies the audit process, as we can readily provide detailed mappings of our data flows.
Data Observability: Building on our data in Unity Catalog, we have integrated Monte Carlo to improve data reliability across our active datasets. We have introduced a healthy data framework so that we can measure the adoption of data governance across Skyscanner.
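To illustrate what a lineage lookup answers, the following plain-Python sketch walks a small, made-up lineage graph. It does not use Unity Catalog's lineage API; the table names and edges are hypothetical, and the traversal simply shows the kind of "what does this table feed?" and "where did this table come from?" questions that automated lineage makes cheap to answer.

```python
from collections import deque

# Hypothetical lineage edges: upstream table -> tables derived from it.
LINEAGE = {
    "bronze.search_events": ["silver.searches"],
    "bronze.partner_prices": ["silver.prices"],
    "silver.searches": ["gold.search_metrics", "gold.conversion"],
    "silver.prices": ["gold.conversion"],
}

def downstream(table, lineage):
    """Return every table reachable from `table`, in breadth-first order."""
    seen = []
    queue = deque(lineage.get(table, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.append(current)
        queue.extend(lineage.get(current, []))
    return seen

def upstream(table, lineage):
    """Return every table that `table` derives from, by inverting the edges."""
    inverted = {}
    for source, targets in lineage.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return downstream(table, inverted)

# Which tables are affected if the raw search events change?
impacted = downstream("bronze.search_events", LINEAGE)
# Which sources would an auditor need to review for the conversion metric?
sources = upstream("gold.conversion", LINEAGE)
print(impacted, sources)
```

The downstream direction supports impact analysis before a change, while the upstream direction supports audits, for example tracing which raw inputs contribute to a reported metric.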

Planning for the Future: Capitalizing on New Opportunities

As we look ahead, I think the value in generative AI will come from the unique, valuable data we have at Skyscanner. There’s a lot of potential, but a key step for us is making sure we have everything, including ML models, managed and governed with Unity Catalog to capitalize on any opportunities.

Currently, we’re evaluating Databricks’ Model Serving capability. We’re also looking at enabling Unity Catalog in multiple regions, using Delta Sharing to move data between regions. We’re thinking about using Delta Sharing for external data sharing as well, since we have some data products where we share data with third-party companies.

In the future, we want our data teams to focus on problems unique to Skyscanner. Databricks does a lot of the heavy lifting when it comes to model serving and provides a good framework for thinking about the AI journey—from prompt engineering to building your own model. We have confidence in our ability to realize the opportunities we’re identifying using the Databricks ecosystem.

Learn more about Skyscanner’s journey at the Data + AI Summit 2024 by joining Michael’s session, Skyscanner’s Journey of Enabling Practical Data and AI Governance.
