A few years back, while working at cignal.io, I led the development of a smart ad exchange built around Real-Time Bidding (RTB). RTB is an automated process in which advertisers bid in real time for the chance to display their ads to specific users visiting websites. When a partner sent an ad opportunity, our platform ran it through a series of real-time machine learning (ML) models to predict which advertising partner should receive the opportunity to bid. These models handled tasks such as fraud detection, auction-win prediction, matching advertising partners based on their buying patterns, and identifying repeating opportunities. Ultimately, the system ensured that the highest bidder’s ad was displayed, optimizing efficiency and relevance for advertisers and users alike.
The scale of the platform was staggering, handling 100,000 to 150,000 ad opportunities per second. Each opportunity was represented as a JSON object of roughly 2-3 KB. Not every opportunity received a bid; in fact, around 40-50% were filtered out by predictive models and never sent forward. For the remaining opportunities, if a bid was placed and won the auction, a notification was generated. This activity produced over 1 TB of data every hour. The sheer volume posed significant challenges for training ML models, especially since more than 90% of the data consisted of opportunities without bids.
Initial Steps to Manage Data Volume
To address the data explosion, we implemented a selective data writing approach. Only a small percentage of the ad opportunities were written to storage, focusing primarily on those that resulted in bids. For these, we added a flag to indicate whether the opportunity was part of the reduced write set. This allowed us to maintain balanced statistical information—for example, the number of ad opportunities originating from New York—while significantly reducing the volume of stored data.
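A minimal sketch of what that selective writing could look like at ingest time is below. The sampling rate, field names, and the idea of storing the rate alongside the flag are illustrative assumptions, not the exact production logic.

```python
import json
import random

# Hypothetical rate for opportunities that never received a bid;
# the real production value is not stated in this article.
NO_BID_SAMPLE_RATE = 0.05

def maybe_write(opportunity: dict, received_bid: bool, writer) -> None:
    """Always write opportunities that got a bid; write the rest at a
    reduced, known rate and flag them so counts can be re-weighted later
    (e.g. the number of opportunities originating from New York)."""
    if received_bid:
        record = {**opportunity, "sampled": False, "sample_rate": 1.0}
    elif random.random() < NO_BID_SAMPLE_RATE:
        # Keeping the rate with the record lets downstream jobs scale
        # statistics back up to the full population.
        record = {**opportunity, "sampled": True, "sample_rate": NO_BID_SAMPLE_RATE}
    else:
        return  # dropped entirely, never stored
    writer.write(json.dumps(record) + "\n")
```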
This strategy improved the preprocessing workflow for Spark, which was used to join data fragments and prepare it for ML tasks. However, as the platform scaled, the demands on Spark clusters grew, increasing processing time. Delays in updating the models with new data affected the quality of real-time predictions, and the rising resource costs reduced the platform’s return on investment (ROI).
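For context, the preprocessing step looked roughly like the following PySpark join. The paths, column names, and schemas here are assumptions for illustration, not the real pipeline.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("rtb-preprocessing").getOrCreate()

# Hypothetical inputs: raw opportunities plus bid and win notifications.
opportunities = spark.read.json("s3://example-bucket/opportunities/2020-01-01/")
bids = spark.read.json("s3://example-bucket/bids/2020-01-01/")
wins = spark.read.json("s3://example-bucket/win-notifications/2020-01-01/")

# Stitch the fragments back together into one training record per opportunity.
training_set = (
    opportunities
    .join(bids, on="opportunity_id", how="left")
    .join(wins, on="opportunity_id", how="left")
    .withColumn("won_auction", F.col("win_price").isNotNull())
)

training_set.write.mode("overwrite").json("s3://example-bucket/training/2020-01-01/")
```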
Transitioning to Apache Parquet
To solve these issues, we transitioned to storing all our data in Apache Parquet. Parquet is an open-source, columnar storage file format optimized for large-scale data processing and analytics. Developed collaboratively by Twitter and Cloudera and inspired by Google’s Dremel paper, Parquet became a top-level Apache project in 2015. Its columnar structure and support for efficient compression and encoding schemes made it an ideal choice for our use case.
We chose Snappy as the compression codec for Parquet, which balanced speed and efficiency. Parquet’s columnar format stores values of the same column, and therefore similar data types, together, significantly improving compression ratios and reducing storage requirements. Because Parquet applies compression per column chunk within each row group, the Snappy-compressed files remained splittable and could be processed in a distributed manner, letting us leverage our large Spark clusters effectively. The columnar design also enabled selective reading of only the relevant columns during query execution, drastically reducing I/O operations and speeding up data processing.
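As a rough illustration of how this looks in Spark (the paths, column names, and partitioning column are assumptions), writing with Snappy and reading back only the needed columns might be:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-snappy-example").getOrCreate()

events = spark.read.json("s3://example-bucket/raw-events/")

# Snappy is Spark's default Parquet codec, but it can be set explicitly.
(events.write
    .option("compression", "snappy")
    .partitionBy("event_date")  # assumed partitioning column
    .mode("append")
    .parquet("s3://example-bucket/events-parquet/"))

# Column pruning: only the referenced columns are read from disk,
# which is what cuts I/O so sharply compared with raw JSON.
new_york = (spark.read.parquet("s3://example-bucket/events-parquet/")
            .select("opportunity_id", "city")
            .where("city = 'New York'"))
new_york.groupBy("city").count().show()
```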
Benefits of Using Parquet
The switch to Parquet had a transformative impact on our platform:
- Reduced Resource Usage: The improved storage efficiency and compression reduced the amount of hardware and computational resources required for data processing.
- Faster Data Processing: By storing data in Parquet, we dramatically decreased the processing time for Spark jobs. This allowed us to update ML models more frequently, improving their real-time prediction accuracy.
- Enhanced Scalability: As our data flow grew, Parquet’s efficient format allowed us to handle increased volumes without proportional increases in infrastructure costs.
- Empowered Data Scientists: The ability to process larger volumes of data during research and testing enabled our data scientists to refine and enhance all our ML models. Parquet’s schema evolution feature also allowed for seamless updates to data structures without breaking existing workflows (see the sketch after this list).
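As a small, self-contained illustration of the schema evolution point above (all names and paths are hypothetical), Spark can merge Parquet schemas across partitions so that older data simply reports NULL for newly added columns:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema-evolution").getOrCreate()

# Suppose newer partitions gained a column (e.g. "fraud_score")
# that older partitions lack.
df_old = spark.createDataFrame(
    [("opp-1", "New York")], ["opportunity_id", "city"])
df_new = spark.createDataFrame(
    [("opp-2", "Boston", 0.12)], ["opportunity_id", "city", "fraud_score"])

df_old.write.mode("overwrite").parquet("/tmp/events/day=2020-01-01")
df_new.write.mode("overwrite").parquet("/tmp/events/day=2020-01-02")

# mergeSchema reconciles the partition schemas; old rows get NULL for the
# new column, so existing jobs keep working unchanged.
merged = spark.read.option("mergeSchema", "true").parquet("/tmp/events")
merged.printSchema()
merged.show()
```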
Conclusion
By adopting Apache Parquet and following its best practices, we not only overcame the challenges of scaling our ad exchange platform but also improved the overall efficiency and quality of our ML models. The shift to Parquet enhanced our ability to react to real-time changes in data, optimized resource usage, and provided our data science team with the tools to innovate further. This experience underscored the value of choosing the right data storage format for high-scale, data-intensive applications.