Observability is key to successful and cost-efficient LLMOps. The demand for observability and the scale of the required infrastructure vary significantly along the LLMOps value chain.
Training foundation models is expensive, time-consuming, and happens at a scale where infrastructure failures are inevitable, making fine-grained observability a core requirement.
Developers of RAG systems and agents benefit from tracing capabilities, allowing them to understand the interplay between components and assess the responses to user requests.
The distributed structure of agentic networks adds another level of complexity that is not yet fully addressed by LLM observability tools and practices.
Observability is invaluable in LLMOps. Whether we’re talking about pretraining or agentic networks, it’s paramount that we understand what’s going on inside our systems to control, optimize, and evolve them.
The infrastructure, effort, and scale required to achieve observability vary significantly. I recently gave a talk about this topic at the AI Engineer World’s Fair 2024 in San Francisco, which I’ll summarize in this article.
The value chain of LLMOps
When I think about LLMOps, I consider the entire value chain, from training foundation models to creating agentic networks. Each step has different observability needs and requires different scales of observability tooling and infrastructure.
- Pretraining is undoubtedly the most expensive activity. We’re working with super-large GPU clusters and are looking at training runs that take weeks or months. Implementing observability at this scale is challenging but vital for training and business success.
- In the post-training phase of the LLMOps value chain, cost is less of a concern. RLHF is relatively cheap, resulting in less pressure to spend on infrastructure and observability tooling. Compared to training LLMs from scratch, fine-tuning requires far less compute and data, making it an affordable activity with lower demands for observability.
- Retrieval Augmented Generation (RAG) systems add a vector database and embeddings to the mix, which require dedicated observability tooling. When operated at scale, assessing retrieval relevance can become costly.
- LLM agents and agentic networks rely on the interplay of multiple retrieval and generative components, all of which have to be instrumented and monitored to be able to trace requests.
Now that we have an overview, let’s examine the three steps of the LLMOps value chain with the biggest infrastructure scale—pretraining, RAG systems, and agents.
Scalability drivers in LLM pretraining
At neptune.ai, I work with many organizations that use our software to manage and monitor the training of foundation models. Three facts mainly drive their observability needs:
- Training foundation models is incredibly expensive. Let’s say it costs us $500 million to train an LLM over three months. Losing just one day of training costs a whopping $5 million or more.
- At scale, rare events aren’t rare. When you run tens of thousands of GPUs on thousands of machines for a long time, there will inevitably be hardware failures and network issues. The earlier we can identify (or, ideally, anticipate) them, the more effectively we can prevent downtime and data loss.
- Training foundation models takes a long time. If we can use our resources more efficiently, we can accelerate training. Thus, we want to track how the layers of our models evolve and collect granular GPU metrics, ideally down to the level of a single GPU core (see the sketch after this list). Understanding bottlenecks and inefficiencies helps save time and money.
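To make the monitoring side of this concrete, here is a minimal sketch of sampling per-GPU health metrics from a training loop. It assumes NVIDIA's `pynvml` bindings and a placeholder `log_metric` function standing in for whatever experiment tracker you use; NVML reports metrics per device, so true per-core visibility would require additional profiling tools on top.

```python
import time

import pynvml  # NVIDIA Management Library bindings (pip install nvidia-ml-py)


def log_metric(name: str, value: float, step: int) -> None:
    """Placeholder: forward the value to your experiment tracker of choice."""
    print(f"step={step} {name}={value:.2f}")


def sample_gpu_metrics(step: int) -> None:
    """Sample utilization, memory, and temperature for every local GPU."""
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
        log_metric(f"gpu/{i}/utilization_pct", util.gpu, step)
        log_metric(f"gpu/{i}/memory_used_gb", mem.used / 1e9, step)
        log_metric(f"gpu/{i}/temperature_c", temp, step)


if __name__ == "__main__":
    pynvml.nvmlInit()
    try:
        for step in range(3):  # stand-in for the training loop's logging hook
            sample_gpu_metrics(step)
            time.sleep(1)
    finally:
        pynvml.nvmlShutdown()
```

In a real setup, each node would run a sampler like this alongside the training process and stream the values to a central tracker, so anomalies on a single GPU surface before they cost a day of training.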
neptune.ai is the experiment tracker for teams that train foundation models, designed with a strong focus on collaboration and scalability.
It lets you monitor months-long model training, track massive amounts of data, and compare thousands of metrics in the blink of an eye.
Neptune is known for its user-friendly UI and seamlessly integrates with popular ML/AI frameworks, enabling quick adoption with minimal disruption.
RAG observability challenges
Retrieval Augmented Generation (RAG) is the backbone of many LLM applications today. At first glance, the idea is simple: We embed the user’s query, retrieve related information from a vector database, and pass it to the LLM as context. However, quite a few components have to work together, and embeddings are a data type that’s hard for humans to grasp.
Tracing requests is key for RAG observability. It allows us to observe the embedding procedure and inspect what context is added to the query and how. We can use LLM evaluations to analyze retrieval performance and the relevance of the returned documents and generated answers.
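As an illustration, here is a minimal sketch of a traced RAG request path using the OpenTelemetry Python SDK. The `embed`, `vector_search`, and `generate` functions are placeholders for a real embedding model, vector database, and LLM; the point is the span structure and the attributes that make a trace inspectable later.

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Export spans to the console for illustration; a production setup would
# send them to a trace backend instead.
trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("rag-demo")


# Placeholder components standing in for an embedding model, a vector
# database, and an LLM.
def embed(text: str) -> list[float]:
    return [float(len(text))]


def vector_search(vector: list[float], k: int = 3) -> list[str]:
    return ["doc-1", "doc-2", "doc-3"][:k]


def generate(query: str, context: list[str]) -> str:
    return f"Answer to '{query}' based on {len(context)} documents."


def answer(query: str) -> str:
    with tracer.start_as_current_span("rag.request") as request_span:
        request_span.set_attribute("rag.query", query)

        with tracer.start_as_current_span("rag.embed"):
            vector = embed(query)

        with tracer.start_as_current_span("rag.retrieve") as retrieve_span:
            documents = vector_search(vector)
            retrieve_span.set_attribute("rag.num_documents", len(documents))
            retrieve_span.set_attribute("rag.document_ids", documents)

        with tracer.start_as_current_span("rag.generate") as generate_span:
            response = generate(query, documents)
            generate_span.set_attribute("rag.response_length", len(response))

        return response


if __name__ == "__main__":
    print(answer("What drives observability costs in RAG systems?"))
```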
From a scalability and cost perspective, it would be ideal to identify low-quality results and focus our optimization efforts on them. However, since assessing a retrieval result takes significant time, in practice we often end up storing all traces.
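If a cheap enough relevance signal is available, a simple retention policy can reduce storage costs: keep every trace that scores poorly and only a sample of the healthy ones. A minimal sketch, assuming a relevance score in [0, 1] produced by some evaluator (the evaluation step itself is usually the expensive part):

```python
import random

KEEP_BELOW = 0.6    # always keep traces whose relevance score is poor
SAMPLE_RATE = 0.05  # keep only a small share of the healthy traces


def should_store_trace(relevance_score: float) -> bool:
    """Decide whether to persist a full RAG trace based on its quality."""
    if relevance_score < KEEP_BELOW:
        return True  # low-quality result: always keep for debugging
    return random.random() < SAMPLE_RATE  # healthy result: sample a fraction


# A poor retrieval is always kept; a good one only occasionally.
print(should_store_trace(0.35))  # True
print(should_store_trace(0.92))  # True roughly 5% of the time
```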
Towards observability of agentic networks
Observability in LLM agents requires tracking queries to knowledge bases, memory accesses, and tool calls. The resulting amount of telemetry data is significantly higher than for a RAG system, which may itself be just one component of an agent.
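A common pattern is to wrap every tool and memory call in a tracing span so each agent step shows up in the request trace. Here is a minimal sketch with the OpenTelemetry Python SDK, using placeholder `web_search` and `memory_lookup` tools:

```python
import functools
import time

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("agent-demo")


def traced_tool(tool_name: str):
    """Wrap a tool function so every call emits a span with its arguments."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            with tracer.start_as_current_span(f"tool.{tool_name}") as span:
                span.set_attribute("tool.name", tool_name)
                span.set_attribute("tool.args", repr(args))
                start = time.perf_counter()
                result = func(*args, **kwargs)
                span.set_attribute(
                    "tool.latency_ms", (time.perf_counter() - start) * 1000
                )
                return result
        return wrapper
    return decorator


@traced_tool("web_search")
def web_search(query: str) -> list[str]:
    return [f"result for {query}"]  # placeholder tool


@traced_tool("memory_lookup")
def memory_lookup(key: str) -> str:
    return f"memory[{key}]"  # placeholder memory access


if __name__ == "__main__":
    with tracer.start_as_current_span("agent.run"):
        memory_lookup("user_preferences")
        web_search("LLM observability")
```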
Agentic networks take this complexity a step further. By connecting multiple agents into a graph, we create a distributed system. Observing such networks requires tracking communication between agents in a way that makes the traces searchable. While we can borrow from microservice observability, I don’t think we’re quite there yet, and I’m excited to see what the next years will bring.
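One building block we can borrow from microservice observability is trace context propagation: the sending agent injects its trace context into the outgoing message, and the receiving agent continues the same trace. Below is a minimal sketch using OpenTelemetry's W3C propagation API, with two placeholder agents that call each other in-process instead of over a real network:

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

trace.set_tracer_provider(TracerProvider())
trace.get_tracer_provider().add_span_processor(
    SimpleSpanProcessor(ConsoleSpanExporter())
)
tracer = trace.get_tracer("agent-network-demo")


def planner_agent() -> None:
    """First agent: starts the trace and hands its context to a second agent."""
    with tracer.start_as_current_span("planner.handle_task"):
        headers: dict[str, str] = {}
        inject(headers)  # writes W3C traceparent headers for the outgoing call
        # In a real network this would be an HTTP request or queue message.
        research_agent(headers, task="summarize observability practices")


def research_agent(headers: dict[str, str], task: str) -> None:
    """Second agent: continues the same trace from the propagated context."""
    parent_context = extract(headers)
    with tracer.start_as_current_span(
        "researcher.handle_task", context=parent_context
    ) as span:
        span.set_attribute("agent.task", task)


if __name__ == "__main__":
    planner_agent()  # both spans end up in one searchable trace
```

With the context propagated, both agents' spans share a trace ID, which is what makes a request searchable end-to-end across the network.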