Astronomer last month rolled out a new observability product called Astro Observe that’s aimed at giving customers a full picture of how their data flows through Apache Airflow, the open source data orchestration tool that it backs. As Astronomer CTO Julian LaNeve explains, the goal is for Observe to become a full-fledged DataOps platform.
Astro Observe is a cloud-based observability tool designed to give customers “an actionable view of the data supply chain,” as Astronomer says. The offering, which is in private preview, extends the company’s products beyond the core data orchestration capabilities of open source Airflow and the company’s cloud-based version of Airflow, dubbed Astro, giving customers a deeper understanding of the state of their data.
During a recent interview with BigDATAwire, Astronomer’s LaNeve explained how Astro Observe will build upon Airflow to help customers stay on top of their data flows.
“As these pipelines run, you get lots of metadata from them, whether it’s how long they took, who owns them, the type of data that they’re interacting with,” LaNeve said. “And we’re taking all that metadata and turning it into an experience designed around the reliability and efficiency of your data platform.”
The new product will be particularly applicable for companies that are investing in centralized data lake and data warehouse platforms, such as Databricks, Snowflake, or Google Cloud BigQuery, he said.
“When you go buy…some of these very expensive but very powerful tools, you want to make sure that you’re using them in the right way,” LaNeve said. “And our thinking is very much that orchestration is the right place to start to more intelligently manage those tools over time instead of just triggering processes in those tools.”
For instance, the process of turning raw data into a finished good that’s fit for consumption for analytics or machine learning/AI systems typically involves moving data through pipelines and executing transformations upon that data. As an orchestration tool, Airflow allows organizations to control and coordinate how the various ETL/ELT and transformation tools, such as Matillion and dbt, interact with the data.
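To make that concrete, here is a minimal Airflow DAG sketch of that kind of coordination; the task names, schedule, and dbt project path are hypothetical, and the extract step stands in for whatever ELT tool an organization actually uses:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def extract_and_load():
    # Hypothetical extract/load step; in practice this might invoke
    # an ELT tool such as Matillion or a custom loader.
    print("loading raw data into the warehouse...")


with DAG(
    dag_id="etl_with_dbt",  # hypothetical name
    start_date=datetime(2025, 1, 1),
    schedule="@daily",
    catchup=False,
):
    load_raw = PythonOperator(
        task_id="extract_and_load",
        python_callable=extract_and_load,
    )

    # Run dbt transformations only after the raw load completes.
    transform = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt/my_project",  # hypothetical path
    )

    load_raw >> transform
```

The point is not the individual tasks but the dependency between them: Airflow owns the order of operations, while the ETL and transformation tools do the actual work.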
Many organizations today will follow some version of the “medallion architecture,” where bronze corresponds to the raw data, silver corresponds to the first step in the data’s transformation journey, and then gold represents published tables, perhaps in Apache Iceberg or some other open table format.
Each of those steps is dependent on the previous step being completed. While those data transformation steps can be scheduled to run in a batch manner, in the real world, things don’t always complete on time or complete with 100% accuracy. That’s ultimately why something like Observe needs to exist: to detect when things go awry, and react accordingly.
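One way that dependency chain might look in Airflow, as a sketch with hypothetical layer names and a placeholder freshness check, is a short-circuit task that stops the silver and gold steps from running when the bronze data never arrived:

```python
from datetime import datetime

from airflow.decorators import dag, task


def bronze_loaded_on_time() -> bool:
    # Placeholder: a real check would query load metadata or row counts.
    return True


@dag(schedule="@hourly", start_date=datetime(2025, 1, 1), catchup=False)
def medallion_pipeline():  # hypothetical DAG name

    @task.short_circuit
    def check_bronze() -> bool:
        # Returning False skips everything downstream, so stale or
        # missing raw data never propagates into silver and gold.
        return bronze_loaded_on_time()

    @task
    def build_silver():
        print("cleaning and conforming bronze data into silver tables...")

    @task
    def build_gold():
        print("publishing gold tables, perhaps as Apache Iceberg...")

    check_bronze() >> build_silver() >> build_gold()


medallion_pipeline()
```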
“That is an orchestration process that you need to run. If the raw tables don’t update, you don’t want to run things downstream,” LaNeve said. “And when you start to add ML and AI into the picture, oftentimes you’re doing that on this data that’s in your data warehouse or data lake. And what we found more and more is there’s a very strong desire to get those ML and AI workloads as part of orchestration, because you want to run your ML jobs as soon as the data is ready. You want your AI models to have access to the latest data.”
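Airflow’s data-aware scheduling is one way to express that “run as soon as the data is ready” pattern today. A minimal sketch, assuming a hypothetical gold table URI and training task:

```python
from datetime import datetime

from airflow.datasets import Dataset
from airflow.decorators import dag, task

# Hypothetical URI identifying the published gold table.
gold_table = Dataset("s3://lake/gold/events")


@dag(schedule="@daily", start_date=datetime(2025, 1, 1), catchup=False)
def publish_gold():
    @task(outlets=[gold_table])
    def build_gold():
        print("refreshing the gold table...")

    build_gold()


# Scheduled on the dataset, not on a clock: this DAG runs whenever the
# gold table is updated, so the ML job always sees the latest data.
@dag(schedule=[gold_table], start_date=datetime(2025, 1, 1), catchup=False)
def train_model():
    @task
    def train():
        print("training the model on fresh gold data...")

    train()


publish_gold()
train_model()
```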
This is essentially what Ford is doing with Airflow. According to LaNeve, the automaker is using Astronomer to move video data from its self-driving car experiments into a data lake where it can be used to develop computer vision models.
“I think it’s a great example, where part of that is traditional ETL, where the car is run, you get a ton of data, you extract, you load that into a data warehouse or data lake, and then you use some transformation,” LaNeve said. “But then on the tail end, you’re training or running inference on these computer vision models. And at Ford, that is one whole process that they run as part of Airflow. So there are no bottlenecks, there are no gaps in the process. They have full visibility across everything.”
Ford built its own observability system for Airflow; it’s not one of the private beta testers for Astro Observe. But the need for full observability across that data supply chain, as it were, is something that exists at many companies, which is why Astronomer developed Observe.
“I think all of this is indicative of this broader DataOps trend, of you want everything unified in one platform so that you have full control and visibility over all workloads,” LaNeve said. “You need access to strong orchestration, lots of compute. If you’re training ML models, you need strong observability to make sure that you understand how everything is working together. And that’s very much how we view building our products and kind of influencing the market over the next couple of years towards this full DataOps platform, where you don’t have to go buy six different tools. You can just come to one.”
Astro Observe relies on an open source project called OpenLineage to help it collect and consume metadata (logs and metrics) from different orchestration jobs, whether they’re running under Airflow or other data processing engines, such as dbt, Apache Spark, and Apache Flink. The software uses that data to populate a series of dashboards and dependency graphs that show how data transformation jobs are flowing. It also measures those jobs against data freshness and timeliness SLAs, and provides predictive alerting and a recommendation engine to help optimize data flows.
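For a sense of what that lineage metadata looks like, here is a minimal sketch using the OpenLineage Python client; the endpoint URL, namespaces, and job and dataset names are all hypothetical, and in practice integrations such as the Airflow provider emit these events automatically:

```python
from datetime import datetime, timezone
from uuid import uuid4

from openlineage.client import OpenLineageClient
from openlineage.client.run import Dataset, Job, Run, RunEvent, RunState

# Hypothetical OpenLineage-compatible endpoint collecting the events.
client = OpenLineageClient(url="http://localhost:5000")

producer = "https://example.com/my-pipelines"  # hypothetical producer URI
run = Run(runId=str(uuid4()))
job = Job(namespace="my_namespace", name="build_gold")  # hypothetical names

# A START event marks the beginning of a job run...
client.emit(RunEvent(
    eventType=RunState.START,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=producer,
))

# ...and a COMPLETE event records the input and output datasets,
# which is the raw material for lineage graphs and freshness SLAs.
client.emit(RunEvent(
    eventType=RunState.COMPLETE,
    eventTime=datetime.now(timezone.utc).isoformat(),
    run=run,
    job=job,
    producer=producer,
    inputs=[Dataset(namespace="warehouse", name="silver.events")],
    outputs=[Dataset(namespace="warehouse", name="gold.events")],
))
```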
The feedback from the dozen or so early adopters of Astro Observe has been positive, LaNeve said. One customer told Astronomer that it used to take them two to three weeks to figure out that their data was bad.
“Now that’s down to, they said, one to two hours to figure it out,” LaNeve said. “So especially in an age of AI and ML, data quality is essential and timeliness is essential, because you feed an AI model bad data, it’s going to give you a bad answer.”
Astro Observe, which LaNeve anticipates entering public preview early next month, will eventually form the basis for a full-fledged DataOps product. That will extend the product even further into the nuts and bolts of data engineering in the age of AI.
“Ultimately [it will] give you an experience designed around root cause analysis, like if something goes wrong, how do you immediately know what went wrong and how do you know what to go fix?” LaNeve said. “I think over time we’ll start to extend that into things like data quality monitoring, data contracts, and schema changes outside of this data product’s experience, especially because we have access to all this very rich metadata. I’d say the more we can do with it in general, the better.”
For more information or to request access to the Astro Observe preview program, click here.
Related Items:
Airflow Available as a New Managed Service Called Astro
Apache Airflow to Power Google’s New Workflow Service
2024 State of Apache Airflow Report Shows Rapid Growth in Airflow Adoption