Searching for the perfect campsite can be a hit-and-miss affair, as one seeks the perfect combination of a view, sufficient parking, and proximity to neighbors and services, among other factors. When it came to selecting a tool to manage its big data pipeline, the online reservation company Campspot didn’t have to look any further than Apache Airflow and, eventually, the hosted Airflow service from Astronomer.
If you’ve got a hankering for some camping, then Campspot is a good place to start. Founded in 2015 in Grand Rapids, Michigan, the software-as-a-service (SaaS) company lets customers make reservations at more than 2,700 private campgrounds, RV resorts, cabins, and “glamping” locations in the United States and Canada. All told, Campspot manages the reservations for more than 230,000 campsites across North America, which has helped earn the company the nickname “the Expedia of campgrounds.”
While campers might measure their overall satisfaction by the number of s’mores consumed per day, Campspot’s partners–the campground owners–need just a bit more data. For instance, every day, they need to know which of their campsites are reserved, how many total are reserved, and how that compares to previous time periods.
The responsibility of keeping the campground owners’ data appetite properly sated falls to John Marriott, manager of Campspot’s data platform team. According to Marriott, the company runs a nightly batch job that takes the latest data from the homegrown reservation management system and rolls it up into its data warehouse. This data is then bundled up into PDF or CSV reports that are either emailed to Campspot partners or made available for viewing on a Web-based dashboard. The company also offers a “signals” product to its partners that compares their existing reservations to an anonymized set of competitors in their space.
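The nightly rollup described above can be pictured with a minimal sketch. Note this is purely illustrative: the record fields, campground names, and `rollup_to_csv` helper are hypothetical stand-ins, not Campspot’s actual schema or code.

```python
import csv
import io
from collections import Counter

# Hypothetical reservation rows pulled from the reservation system.
# Field names and values are made up for illustration.
reservations = [
    {"campground": "Pine Ridge", "site": "A1", "date": "2025-07-04"},
    {"campground": "Pine Ridge", "site": "A2", "date": "2025-07-04"},
    {"campground": "Lakeview",   "site": "B7", "date": "2025-07-04"},
]

def rollup_to_csv(rows):
    """Aggregate reservation rows into a per-campground CSV summary,
    the kind of output a nightly partner report might contain."""
    counts = Counter(r["campground"] for r in rows)
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["campground", "reserved_sites"])
    for campground, n in sorted(counts.items()):
        writer.writerow([campground, n])
    return buf.getvalue()

print(rollup_to_csv(reservations))
```

In practice this kind of aggregation would run inside the warehouse rather than in application code, but the shape of the job — read latest reservations, roll up, emit a report — is the same.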
Prior to 2022, managing all of these data transformation jobs was mostly a manual affair. It was up to individual engineers to decide how a data pipeline should be constructed to move data from the reservation system, which runs on a mix of Postgres, MySQL, and DynamoDB databases, into the data warehouse, which runs on a combination of Snowflake and Postgres.
“The company was just sort of setting programmers on these problems, and they were just writing things in any which way,” Marriott says. “So we had a lot of jobs that were either kind of bolted onto the side of our application or just lived in a variety of places and were orchestrated in different ways.”
Getting the nightly batch job done on time became a problem. While it should have taken about five minutes, it would sometimes take two or three hours to complete. With campgrounds spread across seven different time zones, Campspot was under the gun to deliver the information critical to campground owners.
“After the third or fourth time that you bump up the timeout on this batch job from one hour to two hours to three hours or something, it’s like, all right, this isn’t the right solution just to keep letting this thing run,” he tells BigDATAwire. “If it’s taking two hours, that’s just a red flag. Like, there’s got to be a better way to do this.”
When problems arose, troubleshooting issues in this decentralized, ad-hoc environment hinged on yet more decentralized, ad-hoc work.
“When something fails, first you need to figure out what’s the actual infrastructure for this job, and then go think about how to fix it,” Marriott says. “And so you’re always kind of juggling those things.”
Marriott and his team realized they needed to get a handle on these data pipeline jobs. They had heard of tools that can automate the execution of thousands of data pipelines. They perceived that Apache Airflow was the early leader in this space, and after investigating Airflow, they adopted it in 2022.
“We saw Airflow as our solution of ‘Let’s get everything under one roof,’ instead of just having things sort of mixed around,” Marriott says.
Marriott’s engineers immediately took to Airflow. While Airflow offers a few different ways to work with the product, including GUIs, Campspot’s developers are code-first types, and they gravitated to Airflow’s command line and programmatic interfaces. Similarly, they also liked how Airflow and its Python-based batch jobs easily fit into their existing DevOps workflows.
“We are used to using GitHub and having everything be code, as opposed [to going through a GUI],” Marriott says. “I mean, those tools are great, but once you know how to write code, you kind of feel like your hands are tied a little bit [using a GUI]. Almost all of our work is done in code. So it’s a pull request, it goes through our approval process, and Airflow just fits in really naturally with the rest of the software engineering that we’re doing.”
Campspot engineers found it easy to define their data transformation jobs in Airflow using Python, Airflow’s native language; Marriott estimates that 95% of Airflow jobs are in Python. The software also allows Campspot to set up different data pipelines to process campground owners’ data depending on the time zone they’re in, further speeding up the nightly batch run.
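Splitting the nightly run by time zone might look roughly like the following sketch, which uses only the Python standard library. The campground names, zone assignments, and helper functions are illustrative assumptions, not Campspot’s actual pipeline code.

```python
from datetime import date, datetime, timezone
from zoneinfo import ZoneInfo

# Hypothetical mapping of campgrounds to IANA time zones.
campgrounds = {
    "Pine Ridge": "America/New_York",
    "Lakeview": "America/Chicago",
    "Dune Flats": "America/Los_Angeles",
}

def group_by_zone(sites):
    """Bucket campgrounds by time zone so each group can get
    its own pipeline run instead of one monolithic batch job."""
    groups = {}
    for name, tz in sites.items():
        groups.setdefault(tz, []).append(name)
    return groups

def local_midnight_utc(tz, on_date):
    """The UTC moment of local midnight in a given zone — a natural
    trigger time for that zone's nightly rollup."""
    local = datetime(on_date.year, on_date.month, on_date.day,
                     tzinfo=ZoneInfo(tz))
    return local.astimezone(timezone.utc)
```

In Airflow terms, each time-zone bucket would map to its own scheduled DAG, so east-coast partners get their reports without waiting on a continent-wide batch to finish.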
As an AWS shop, Campspot decided to take advantage of AWS’s Amazon Managed Workflows for Apache Airflow (MWAA) offering out of the gate. While AWS’s managed Airflow environment was better than what they had in place before, Campspot found that MWAA wasn’t as easy to manage as they had initially hoped.
“Setting up the deployment pipeline was not as smooth,” Marriott recalls. “Having multiple environments was costly. If we wanted a separate dev and staging and production environments, those were just a straight multiple of the cost.”
The company then looked to another hosted Airflow environment, Astro, from Astronomer, the company behind the open source Airflow project. Astro also runs on AWS, but does not double (or triple) the cost of running development and testing environments alongside production, Marriott says. Moving to Astro also lowered the operational burden on Campspot engineers, he adds.
“We’d rather pay the platform fee than pay that same amount in labor for us to be maintaining the platform,” he says. “They take care of everything, except for the part that we have to be doing. We need to write the jobs that are specific to our use cases, and we don’t have to do anything more than that.”
However, moving to Astronomer didn’t totally streamline the management of Airflow, at least not initially. Since Campspot was running Astro in its own VPC, it was still exposed to additional complexity.
When troubles arose with an Airflow job, Campspot engineers needed to investigate several systems, including the AWS batch job that was used to kick off the Airflow job, the Amazon CloudWatch job that monitors it, and the Amazon EventBridge job that scheduled it.
“When something fails, you’re going and looking in all these places and getting the logs and then those batch jobs are triggering, either hitting an endpoint in the code, something that was just sort of bolted on, maybe a Lambda or who knows what,” Marriott says. “And it’s just a lot to juggle, a lot to keep in your head.”
So about a year ago, Campspot moved its Astro deployment from its own VPC into Astronomer’s environment, further reducing the number of different environments involved and the surface area where things can go wrong.
“All of the scheduling and the running of it and the logging and investigating failures–it’s just all in one space,” Marriott says. “So that’s the advantage for us.”
As Americans and Canadians set out in 2025 to find their favorite campgrounds, they probably aren’t thinking about how their stays are triggering data transformation jobs flowing across the Internet. But for the folks at Campspot who are responsible for keeping the data flowing, the existence of Airflow and Astronomer’s Astro service means that they, too, are happy campers.
Related Items:
Astronomer’s High Hopes for New DataOps Platform
Airflow Available as a New Managed Service Called Astro
Apache Airflow to Power Google’s New Workflow Service