The IT team at Arity are cruising on the homestretch of a big project to load more than a trillion miles of driving data into a new database on Amazon S3. But if it wasn’t for a decision to switch out its engine from Spark to Starburst, the project would still be stuck in neutral.
Arity is a subsidiary of Allstate that collects, aggregates, and sells driving data for all sorts of uses. For instance, auto insurers use Arity’s mobility data–composed of more than 2 trillion miles of driving data by more than 50 million drivers–to find ideal customers, retailers use it to assess customer driving patterns, and mobile app developers, such as Life360, use it to enable real-time monitoring of drivers.
On occasion, Arity is contacted by state departments of transportation who are interested in using its geolocation data to study traffic patterns on specific stretches of roadways. Because Arity’s data includes both the volume and speed of drivers, the DOTs figured they could use the data to eliminate the need to conduct on-site traffic assessments, which are both expensive and dangerous for the crews who deploy the “ropes” across the road.
As the frequency of these DOT requests increased, Arity decided it needed to automate the process. Instead of asking a data engineer to write and execute ad hoc queries to obtain the data requested, the company opted to build a system that could deliver the data to DOTs more quickly, more easily, and for less cost.
The company’s first inclination was to use the technology, Apache Spark, that they had been using for the past decade, said Reza Banikazemi, Arity’s director of system architecture.
“Traditionally, we use Spark and AWS EMR clusters,” Banikazemi said. “For this particular project, it was about six years’ worth of driving data, so over a petabyte that we wanted to run and process through. The cost was obviously a big factor, but also the amount of runtime that it would take. These were big challenges.”
Arity’s data engineers are skilled at writing highly efficient Spark routines in Scala, which is Spark’s native language. Artity’s team began the project by testing whether this approach would be feasible with the first phase of the project, which was doing the initial load of the 1PB of historical driving data that was stored as Parquet and ORC files on S3. The routines involved aggregating the road segment data, and loading them into S3 as Apache Iceberg tables (this was the company’s first Iceberg project).
“When we did our first POC earlier this year, we took a small sample of data,” Banikazemi said. “We ran the most highly optimized Spark that we could. We got 45 minutes.”
At that rate, it would be very difficult to complete the project on time. But in addition to timeliness, the expense of the EMR approach was also a concern.
“The cost just didn’t make a lot of sense,” Banikazemi told BigDATAwire. “What happens on Spark was, number one, every time you run a job, you’ve got to boot up the cluster. Now, if we’re going with [Amazon EC2] Spot instances for a huge cluster, you have to fight for the availability of the Spot instance if you want to get any kind of decent savings. If you go on demand, you’ve got to deal with extreme amount of cost.”
The stability of the EMR clusters and their tendency to fail in the middle of a job was another concern, Banikazemi said. Arity assessed the possibility of using Amazon Athena, which is AWS’s serverless Trino service, but observed that Athena “fails on large queries very frequently,” he said.
That’s when Arity decided to try another approach. The company had heard of a company called Starburst that sells a managed Trino service, called Galaxy. Banikazemi tested out the Galaxy service on the same test data that EMR took 45 minutes to process, and was surprised to see that it took only four-and-a-half minutes.
“It was almost like a no brainer when we saw those initial results, that this is the right path for us,” Banikazemi said.
Arity decided to go with Starburst for this particular job. Running in Arity’s virtual private cloud (VPC) on AWS, Starburst is executing the initial data load and “backfill” processes, and it will also be the query engine that Arity sales engineers use to obtain the road segment data for DOT clients.
What used to require a data engineer to write complex Spark Scala code can now be written by any competent data analyst with plain old SQL, Banikazemi said.
“Something that we needed engineering to do, now we have we can give it to our professional services people, to our sales engineers,” he said. “We’re giving them access to Starburst now, and they’re able to go in there and do stuff which previously they couldn’t.”
In addition to saving Arity hundreds of thousands in EMR processing costs, Starburst also met Arity’s demands for data security and privacy. Despite the need for tight privacy and security controls, Starburst was able to get the job on time, Banikazemi said.
“At the end of the day, Starburst hit all the marks,” he said. “We’re able to not only get the data done at a much lower cost, but we were able to get it done much faster, and so it was a huge win for us this year.”
Related Items:
Starburst CEO Justin Borgman Talks Trino, Iceberg, and the Future of Big Data
Starburst Debuts Icehouse, Its Managed Apache Iceberg Service
Starburst Brings Dataframes Into Trino Platform
apache spark, Arity, big data, data engineer, driving data, emr, mobility data, Scala Spark, sql, Starburst Galaxy, Trino
Source link
lol