LLM Observability: Fundamentals, Practices, and Tools

LLM observability is the practice of gathering data about an LLM-based system in production to understand, evaluate, and optimize it.

Developers and operators gain insight by recording prompts and user feedback, tracing user requests through the components, monitoring latency and API usage, performing LLM evaluations, and assessing retrieval performance.

A range of frameworks and platforms supports the implementation of LLM observability. As new types of models are released and best practices emerge, these tools will continue to adapt and evolve.

Large Language Models (LLMs) have become the driving force behind AI-powered applications, ranging from translation services to chatbots and RAG systems.

Along with these applications, a new tech stack has emerged. Beyond LLMs, it comprises components such as vector databases and orchestration frameworks. Developers apply architectural patterns like chains and agents to create powerful applications that pose several challenges: They are non-deterministic, resource-hungry, and – since much of the application logic lies in LLMs – challenging to test and control.

LLM observability addresses these challenges by providing developers and operators insight into the application flow and performance.

At a high level, observability aims to enable an understanding of a system’s behavior without altering or directly accessing it. Observability allows developers and DevOps specialists to ask arbitrary questions about applications, even if their questions only emerge after a system is already in production.

In this tradition, LLM observability is the practice of gathering data (telemetry) while an LLM-powered system is running to analyze, assess, and enhance its performance. It augments the repertoire of software and machine-learning observability approaches with new tools and practices tailored to the unique characteristics of LLM applications.

Navigating the field of LLM observability is not easy. Best practices are just emerging, and new tools and vendors enter the market monthly. After reading this article, you’ll be able to answer the following questions:

  • What is LLM observability, and why do we need it?
  • What are the essential LLM observability practices, and how can you implement them?
  • Which specialized frameworks and platforms are available?

Why do we need LLM observability?

In terms of overall complexity, LLM applications are comparable to other software systems, and like them, they might run on-premises or behind an API. The main distinction is that LLMs – and consequently LLM-driven systems – accept open-ended input, resulting in non-deterministic behavior.

An LLM is a relatively unpredictable piece of software. While its output can be somewhat controlled by adjusting the sampling, ML engineers and application developers can make only a few assumptions about it. Due to their algorithmic structure and stochastic nature, LLMs can generate incorrect or misleading outputs. They are known to make up information if they cannot correctly respond to a prompt, a phenomenon referred to as “hallucinations.”

Since LLMs process language (encoded as sequences of tokens), their input space is vast. Users can input arbitrary information, making it impossible to foresee and analyze all potential inputs. Therefore, the traditional software testing approach, where we verify that a specific input leads to a specific output and thereby derive guarantees about the system, does not apply to LLM applications.

Thus, it is a given that new model behaviors become apparent in production, making observability all the more critical.

Further, developers of LLM applications typically face the challenge of users expecting low-latency responses. At the same time, the models are computationally expensive, or the system requires multiple calls to third-party APIs, queries to RAG systems, or tool invocations.

Observability is key to successful and cost-efficient LLMOps. The demand for observability and the scale of the required infrastructure vary significantly along the LLMOps value chain.

Listen to Neptune’s Chief Product Officer Aurimas Griciūnas talk about the demands for observability when training foundational models, how RAG and agent developers benefit from tracing, and observability challenges in agentic networks.

Anatomy of an LLM application

Understanding a system’s makeup is paramount to accomplishing and improving its observability. So, let’s examine an LLM system’s architecture and main components in more detail.

Most LLM applications can be divided into the following components:

  • Large Language Models (LLMs) are complex transformer models trained on massive amounts of text data. Few organizations have the capability and need to train LLMs from scratch. Most applications use pre-trained foundational models and adapt them through fine-tuning or prompting.

LLMs often range from several hundred megabytes to hundreds of gigabytes in size, which makes their operation challenging and resource-intensive. There are two main approaches to LLM integration: Either the model is deployed to on-premise hardware or a cloud environment together with the remainder of the LLM application, or model hosting is outsourced to a third-party provider, and the LLM is accessed via an API.

LLMs – and, in turn, LLM applications – are non-deterministic because they generate their output through stochastic sampling processes. Further, a slight tweak to the input prompt can lead to a dramatically different outcome. Paired with the fact that many LLM applications are heavily context-driven, an LLM is a relatively unpredictable system component.

  • Vector databases are an integral component of many LLM applications. They act as an information source for the LLM, providing information beyond what is encoded in the model or included in the user’s request.

    Converting documents to abstract embedding vectors and subsequently retrieving them through similarity search is an opaque process. Assessing retrieval performance requires using metrics that often do not fully reflect human perception.

  • Chains and agents have emerged as the dominant architectural patterns for LLM applications.

    A chain prescribes a series of specific steps for processing the input and generating a response. For example, a chain could insert a user-provided text into a prompt template with instructions to extract specific information, pass the prompt to an LLM, and parse the model’s output into a well-defined JSON object (a minimal sketch of such a chain follows this list).

    In an agentic system, there is no fixed order of steps, but an LLM repeatedly selects between several possible actions to perform the desired task. For example, an agent for answering programming questions might have the option to query an internet search engine, prompt a specialized LLM to adapt or generate a code example, or execute a piece of code and observe the outcome. The agent can select the most suitable action based on the user’s request and intermediate outputs.

  • User interface: End users typically interact with an LLM application through a well-known UI like a chat, a mobile or desktop app, or a plugin. In many cases, this means that the LLM application exposes an API that handles requests for the entire user base.

    While the UI is a part of an LLM application that does not differ from a traditional software application, it is nevertheless important to include it in observability considerations. After all, the goal is to acquire end-to-end traces and improve the user experience.
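To make the chain pattern concrete, here is a minimal sketch of the extraction chain described above. The call_llm() helper, the prompt template, and the JSON keys are illustrative assumptions rather than part of any specific framework:

```python
# A minimal chain sketch: fill a prompt template, call the model, parse JSON.
# call_llm() is a placeholder for whatever model API the application uses.
import json

PROMPT_TEMPLATE = (
    "Extract the product name and the reported issue from the support "
    "message below. Respond with a JSON object with the keys "
    '"product" and "issue".\n\nMessage:\n{message}'
)


def call_llm(prompt: str) -> str:
    """Placeholder for the actual model call (self-hosted or via an API)."""
    raise NotImplementedError


def extraction_chain(user_message: str) -> dict:
    # Step 1: populate the prompt template with the user-provided text.
    prompt = PROMPT_TEMPLATE.format(message=user_message)
    # Step 2: pass the prompt to the LLM.
    raw_output = call_llm(prompt)
    # Step 3: parse the model's output into a well-defined JSON object.
    return json.loads(raw_output)
```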

Overview of a typical LLM application built around a Retrieval Augmented Generation (RAG) system. Users interact with a chat interface. A controller component generates a query that is fed through an embedding model and used to retrieve relevant information from a vector database. This information is then embedded as context into the prompt template sent to the LLM, which generates the answer to the user’s request. | Source

Goals of LLM observability

LLM observability practices help with the following:

  • Root cause analysis. When an LLM application returns an unexpected response or fails with an error, we’re often left in the dark. Do we have an implementation error in our software? Did our knowledge base not return a sufficient amount of relevant data? Does the LLM struggle to parse our prompt? Observability aims to collect data about everything that’s going on inside our application in a way that enables us to trace individual requests across components.
  • Identifying performance bottlenecks. Users of LLM-based applications such as chat-based assistants or code-completion tools expect fast response times. Due to the large number of resource-hungry components, meeting latency requirements is challenging. While we can monitor individual components, tracking metrics such as request rates, latency, and resource utilization, this alone does not tell us where to place our focus. Observability enables us to see where requests take a long time and allows us to dig into outliers.
  • Assessing LLM outputs. As the developer of an LLM application, it’s easy to be fooled by satisfactory responses to sample requests. Even with thorough testing, it’s inevitable that a user’s input will result in an unsatisfactory answer, and experience shows that we usually need several rounds of refinement to maintain high quality consistently. Observability measures help notice when LLM applications fail to appropriately respond to requests, for example, through automated evaluations or the ability to correlate user feedback and behavior with LLM outputs.
  • Detecting patterns in inadequate responses. By providing means of identifying wrong and substandard responses, implementing LLM observability enables us to identify commonalities and patterns. These insights allow us to optimize prompts, processing steps, and retrieval mechanisms more systematically.
  • Developing guardrails. While we can resolve many issues in LLM applications through a combination of software engineering, prompt optimization, and fine-tuning, there are still plenty of scenarios where this is not enough. Observability helps identify where guardrails are needed and assess their efficacy and impact on the system.

What is LLM observability?

Large Language Model (LLM) observability comprises methods for monitoring, tracing, and analyzing an LLM system’s behavior and responses. Like traditional observability of IT systems, it’s not defined as a fixed set of capabilities but best described as a practice, approach, or mindset.

At a high level, implementing LLM observability involves:

  • Instrumenting the system to gather relevant data in production, which collectively is referred to as “telemetry.”
  • Identifying and analyzing successful and problematic requests made to the LLM application. Over time, this builds an understanding of the system’s baseline performance and weaknesses.
  • Taking action to improve the system or adding additional observability instruments to remove blind spots.

Before we discuss the different pillars of LLM observability in detail, let’s clarify how LLM observability relates to LLM monitoring and ML observability.

LLM monitoring vs. LLM observability

The difference between LLM monitoring and LLM observability is analogous to the one between traditional monitoring and observability.

Monitoring focuses on the “what.” By collecting and aggregating metrics data, we track and assess our application’s performance. Examples of typical LLM application metrics are the number of requests to an LLM, the time it takes an API to respond, or the model server’s GPU utilization. Through dashboards and alerts, we keep an eye on key performance indicators and verify that our systems fulfill service-level agreements (SLAs).
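As an illustration, the following minimal sketch tracks two such metrics – the number of LLM requests and their latency – using the prometheus_client library. The library choice, the metric names, and the call_llm() placeholder are assumptions for the sake of the example:

```python
# A minimal monitoring sketch: count LLM requests and record response times,
# exposing them for a Prometheus scraper. Metric names are illustrative.
import time

from prometheus_client import Counter, Histogram, start_http_server

LLM_REQUESTS = Counter("llm_requests_total", "Number of requests sent to the LLM")
LLM_LATENCY = Histogram("llm_request_latency_seconds", "LLM response time in seconds")


def call_llm(prompt: str) -> str:
    """Placeholder for the actual model or API call."""
    raise NotImplementedError


def monitored_call(prompt: str) -> str:
    LLM_REQUESTS.inc()
    start = time.perf_counter()
    try:
        return call_llm(prompt)
    finally:
        LLM_LATENCY.observe(time.perf_counter() - start)


if __name__ == "__main__":
    start_http_server(8000)  # exposes the metrics at /metrics
```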

Observability goes a step further, asking “why.” It aims to enable developers and operators to find the root cause of issues and understand the interactions between the system’s components. While it often draws from the same logs and metrics used for monitoring, it’s an investigative approach that requires the data about a system to be collected in a way that allows it to be queried and linked. Ideally, we can trace the path of every individual request through the system and find correlations between specific groups of requests.

LLM observability vs ML observability

Machine learning (ML) observability is an established practice, so it is natural to ask why LLM applications require a new approach.

ML models are predictive. They aim to map input to output in a deterministic fashion. A typical ML model behaves like a mathematical function, taking a specific data point and computing an output. Accordingly, ML observability mainly revolves around analyzing data drift to understand degrading model performance as input data distributions change over time.

In contrast, LLM applications rely heavily on context and are non-deterministic. They integrate information beyond the user’s input, often from hard-to-predict sources, and LLMs generate their output through a stochastic sampling process.

Another difference between LLM and ML applications is that for many of the latter, ground truth data becomes available eventually and can be compared using metrics. This is typically not the case for an LLM application, where we have to work with heuristics or indirect user feedback.

Further, ML observability includes interpretability. While it is possible to apply methods like feature attributions to LLMs, they provide little actionable insight for developers and data scientists—in contrast to ML models, where a similar approach might surface that an ML model over- or undervalues a particular feature or points towards the need for changes in the model’s capacity. Thus, LLM interpretability remains, first and foremost, a tool researchers use to uncover the rich inner structures of language models.

Pillars of LLM observability

Traditional observability rests on four pillars: metrics, events, logs, and traces. Together, these data types are known as “MELT” and serve as different lenses into a system.

In LLM applications, MELT data remains the backbone. It is extended by a new set of pillars that build on and augment it:

  1. Prompts and user feedback
  2. Tracing
  3. Latency and usage monitoring
  4. LLM evaluations
  5. Retrieval analysis 

Prompts and user feedback

The prompts fed to an LLM are a core part of any LLM application. They are either provided by the users directly or generated by populating a prompt template with user input.

A first step towards observability is structured logging of prompts and resulting LLM outputs, annotated with metadata such as the prompt template’s version, the invoked API endpoint, or any encountered errors.

Logging prompts allows us to identify scenarios where prompts do not yield the desired output and to optimize prompt templates. When working with structured outputs, tracking the raw LLM response and the parsed version enables us to refine our schemas and assess whether additional fine-tuning is warranted.

We can track user feedback to assess whether an LLM application’s output meets expectations. Even a simple “thumbs up” and “thumbs down” feedback is often sufficient to point out instances where our application fell short.
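The sketch below shows one way to log prompts, outputs, and user feedback in a structured way. The field names and the JSON-lines destination are assumptions; in practice, the records would typically be shipped to a log pipeline or an observability platform:

```python
# A minimal sketch of structured prompt and feedback logging to a JSON-lines
# file. The record fields are illustrative; adapt them to your application.
import json
import time
import uuid

LOG_FILE = "llm_interactions.jsonl"


def log_interaction(prompt: str, output: str, *, template_version: str,
                    endpoint: str, error: str | None = None) -> str:
    record_id = str(uuid.uuid4())
    record = {
        "id": record_id,
        "timestamp": time.time(),
        "prompt": prompt,
        "output": output,
        "template_version": template_version,
        "endpoint": endpoint,
        "error": error,
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record_id  # returned so feedback can later be linked to this record


def log_feedback(record_id: str, thumbs_up: bool) -> None:
    record = {
        "id": record_id,
        "timestamp": time.time(),
        "feedback": "up" if thumbs_up else "down",
    }
    with open(LOG_FILE, "a") as f:
        f.write(json.dumps(record) + "\n")
```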

Tracing

Tracing requests through a system’s different components is an integral part of observability.

In an LLM application, a trace represents a single user interaction from the initial user input to the final application response.

A trace consists of spans representing specific workflow steps or operations, such as assembling a prompt or calling a model API. Each span can encompass many child spans, giving a holistic view of the application.

A full trace makes it obvious at first glance how components are connected and where a system spends time responding to a request. In chains and agents, where the steps taken are different for each request and cannot be known beforehand, traces are an invaluable aid in understanding the application’s behavior.

Trace of a user request to an LLM chain. The root span encompasses the entire request. The chain consists of a retrieval and a generation step, each of which is divided into several sub-steps. The trace shows how the user request triggers the chain, which invokes the retrieval component where the user’s request is embedded before it is used in the subsequent retrieval of related information from a vector database. The following generation step is subdivided into a call to an LLM API, after which the output is parsed into a structured format. The length of the spans indicates the duration of each step.
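The nested spans of such a trace can be produced with standard tracing libraries. The following minimal sketch uses the OpenTelemetry Python SDK (an assumption; any tracing backend works) and mirrors the retrieval and generation steps described above:

```python
# A minimal tracing sketch with the OpenTelemetry SDK. Spans are printed to
# the console; swap the exporter for your tracing backend in production.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag-app")


def handle_request(user_input: str) -> str:
    with tracer.start_as_current_span("chain") as root:  # root span: entire request
        root.set_attribute("user_input", user_input)
        with tracer.start_as_current_span("retrieval"):
            with tracer.start_as_current_span("embed_query"):
                ...  # embed the user's request
            with tracer.start_as_current_span("vector_search"):
                ...  # retrieve related information from the vector database
        with tracer.start_as_current_span("generation"):
            with tracer.start_as_current_span("llm_call"):
                answer = "..."  # call the LLM API
            with tracer.start_as_current_span("parse_output"):
                ...  # parse the output into a structured format
    return answer
```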

Latency and usage monitoring

Due to their size and complexity, LLMs can take a long time to generate a response. Thus, managing and reducing latency is a key concern for LLM application developers.

When hosting LLMs, monitoring resource utilization and tracking response times is essential. In addition, keeping track of prompt length and the number and rate of produced tokens helps optimize resources and identify bottlenecks.

Recording response latency is indispensable for applications that call third-party APIs. As many vendors’ pricing models are based on the number of input and output tokens, monitoring these metrics is crucial for cost management. When API calls fail, error codes and messages help distinguish between application errors, exceeded rate limits, and outages.
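The sketch below wraps a third-party API call to record latency, token usage, and an estimated cost. The client object and its response attributes are modeled on typical OpenAI-style APIs, and the per-token prices are placeholders rather than real figures:

```python
# A sketch of latency, token-usage, and cost tracking around an API call.
# The client, response attributes, and prices are illustrative assumptions.
import time

PRICE_PER_INPUT_TOKEN = 0.000005   # placeholder value, not a real price
PRICE_PER_OUTPUT_TOKEN = 0.000015  # placeholder value, not a real price


def tracked_completion(client, model: str, prompt: str) -> dict:
    start = time.perf_counter()
    try:
        response = client.chat.completions.create(
            model=model, messages=[{"role": "user", "content": prompt}]
        )
    except Exception as exc:
        # Error codes and messages help distinguish application errors,
        # exceeded rate limits, and outages.
        return {"error": str(exc), "latency_s": time.perf_counter() - start}
    usage = response.usage  # prompt_tokens / completion_tokens in OpenAI-style APIs
    return {
        "latency_s": time.perf_counter() - start,
        "prompt_tokens": usage.prompt_tokens,
        "completion_tokens": usage.completion_tokens,
        "estimated_cost": usage.prompt_tokens * PRICE_PER_INPUT_TOKEN
        + usage.completion_tokens * PRICE_PER_OUTPUT_TOKEN,
        "output": response.choices[0].message.content,
    }
```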

LLM evaluations

It’s typically not possible to directly measure the success or failure of an LLM application. While we can compare the output of a software component or ML model to an expected value, an LLM application usually has many different ways of “responding correctly” to a user’s request.

LLM evaluation is the practice of assessing LLM outputs. Four different types of evaluations are employed:

  • Validating the output’s structure by attempting to parse it into the pre-defined schema (see the sketch after this list).
  • Comparing the LLM output with a reference utilizing heuristics such as BLEU or ROUGE.
  • Using another LLM to assess the output. This second LLM can be a stronger and more capable model solving the same task or a model specialized in detecting, e.g., hate speech or sentiment.
  • Asking humans to evaluate an LLM response is a highly valuable but expensive option.
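The first two evaluation types lend themselves to a compact example. The sketch below assumes pydantic v2 for structural validation and the rouge-score package for the reference heuristic; the TicketSummary schema and the example strings are purely illustrative:

```python
# A minimal sketch of two evaluation styles: structural validation against a
# schema and a heuristic reference comparison with ROUGE-L.
from pydantic import BaseModel, ValidationError
from rouge_score import rouge_scorer


class TicketSummary(BaseModel):  # hypothetical target schema
    title: str
    priority: str


def validate_structure(llm_output: str) -> bool:
    """Check whether the raw LLM output parses into the expected schema."""
    try:
        TicketSummary.model_validate_json(llm_output)
        return True
    except ValidationError:
        return False


def reference_score(llm_output: str, reference: str) -> float:
    """Compare the output against a reference answer using ROUGE-L."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    return scorer.score(reference, llm_output)["rougeL"].fmeasure


print(validate_structure('{"title": "Login fails", "priority": "high"}'))  # True
print(reference_score("Users see a 500 error when logging in.",
                      "The login page returns a 500 error."))
```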

All categories of LLM evaluations require data to work with. Collecting prompts and outputs is a prerequisite for ensuring that the evaluation examples match users’ input. Any evaluation data set must be representative, truly capturing the application’s performance.

However, even for humans, it can be difficult to infer a user’s intent just by looking at a short textual input. Thus, collecting user feedback and analyzing interactions (e.g., if a user repeatedly asks the same question in varied forms) is often necessary to obtain the complete picture.

Retrieval analysis

LLMs can only replicate information they encountered in their training data or the prompt. Retrieval-augmented generation (RAG) systems use the user’s input to retrieve information from a knowledge base that they then include in the prompt fed to the LLM.

Observing the retrieval component and underlying databases is paramount. On a basic level, this means including the RAG sub-system in traces and tracking latency and cost.

Beyond that, evaluations for retrievers focus on the relevancy of the returned information. As in the case of LLM evaluations, we can employ heuristics, LLMs, or humans to assess retrieval results. When an LLM application utilizes contextual compression or re-ranking, these steps must also be included.
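As a minimal illustration, the sketch below computes two common heuristic retrieval metrics, precision@k and reciprocal rank, against a set of human-labeled relevant documents; the document IDs are purely illustrative:

```python
# A minimal sketch of heuristic retrieval evaluation: precision@k and
# reciprocal rank against human-labeled relevance judgments.
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


retrieved = ["doc-7", "doc-3", "doc-9", "doc-1"]  # ranked output of the retriever
relevant = {"doc-3", "doc-1"}                     # human-labeled ground truth
print(precision_at_k(retrieved, relevant, k=3))   # 0.33...
print(reciprocal_rank(retrieved, relevant))       # 0.5
```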

LLM observability frameworks and platforms

In this section, we’ll explore a range of LLMOps tools and how they contribute to LLM observability.

As with any emerging field, the market is volatile, and products are announced, refocused, or discontinued regularly. If you think we’re missing a tool, let us know.

  1. Arize Phoenix
  2. Arize AI
  3. Langfuse
  4. LangSmith
  5. Helicone
  6. Confident AI and DeepEval
  7. Galileo
  8. Aporia
  9. WhyLabs and LangKit

Arize Phoenix

Phoenix is an open-source LLM observability platform developed by Arize AI. Phoenix provides features for diagnosing, troubleshooting, and monitoring the entire LLM application lifecycle, from experimentation to production deployment.

Visualization of a trace of an RAG application. The left-hand panel shows the spans and their nesting. In the central panel on the right, the user can inspect detailed information about the selected span, such as input and output messages, as well as the utilized prompt template and invocation parameters. | Source

Arize Phoenix key features:

LangSmith

LangSmith, developed by LangChain, is a SaaS LLM observability platform that lets AI engineers test, evaluate, and monitor chains and agents. It seamlessly integrates with the LangChain framework, popular among LLM application developers for its wide range of integrations.

Example of a trace of a chat application request in LangSmith. The trace shows that document retrieval and response generation were completed in 5.13 seconds using 5,846 tokens. Steps include retrieving relevant context and calling OpenAI. The user inquired about using a RecursiveURLLoader, and the AI provided detailed instructions. Options to add the interaction to a dataset, annotate, or use the playground are available. | Source

LangSmith key features

Langfuse

Langfuse is an open-source LLM observability platform that provides LLM engineers with tools for developing prompts, tracing, monitoring, and testing.

Langfuse’s dashboard provides insights into usage and performance metrics. It visualizes costs, scores, latency, and utilization in one place, giving users a 360-degree view of their applications. | Source

Langfuse key features

Helicone

Helicone is an open-source LLM observability platform that can be self-hosted or used through a SaaS subscription.

Helicone’s dashboard provides a comprehensive view of API usage, summary metrics, costs, and errors, allowing developers to monitor their LLM applications. | Source

Helicone key features

Confident AI and DeepEval

DeepEval is an open-source LLM evaluation framework that allows developers to define tests for LLM applications similar to the pytest framework. Users can submit test results and metrics to the Confident AI SaaS platform.

Overview of test cases in Confident AI. Users can filter test runs based on status and inspect details like the LLM’s input and output. | Source

Confident AI key features

  • User feedback: Confident AI includes different means of collecting, managing, and analyzing human feedback. Developers can submit feedback from their application flow using DeepEval’s send_feedback method.
  • Tracing and retrieval analysis: Confident AI provides framework-agnostic tracing for LLM applications through DeepEval. When submitting traces, users can choose from a wide variety of pre-defined trace types and add corresponding attributes. For example, when tracing the retrieval from a vector database, these attributes include the query, the average chunk size, and the similarity metric.
  • LLM evaluations: DeepEval structures LLM evaluations as test cases. Similar to unit testing frameworks, each test case defines an input and expected output, which is compared to the LLM’s actual output when the test suite is executed. Confident AI collects test results and allows developers to query, analyze, and comment on them.
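As a rough illustration of this test-case style, here is a minimal sketch based on DeepEval’s documented pytest-style interface; the metric choice, threshold, and example strings are assumptions, so check the DeepEval documentation for the current API:

```python
# A minimal DeepEval test sketch (run with `pytest`). The metric, threshold,
# and strings are illustrative; the actual_output would normally come from
# the LLM application under test.
from deepeval import assert_test
from deepeval.metrics import AnswerRelevancyMetric
from deepeval.test_case import LLMTestCase


def test_refund_policy_answer():
    test_case = LLMTestCase(
        input="What is your refund policy?",
        actual_output="You can return any item within 30 days for a full refund.",
        expected_output="Purchases can be refunded within 30 days.",
    )
    # Scores answer relevancy with an LLM judge and fails below the threshold.
    assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])
```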

Galileo

Galileo is an LLM evaluation and observability platform centered around the GenAI Studio. It is exclusively available through customer-specific enterprise contracts.

Galileo GenAI Studio dashboard with LLM-specific metrics. Users can see the cost, latency, requests, API failures, input and output token counts as well as evaluation results on a single page. | Source

Galileo key features

Aporia

The Aporia ML observability platform includes a range of capabilities to provide observability and guardrails for LLM applications.

Aporia can detect problems such as hallucinations, prompt injection, or inappropriate output and enforce corresponding “guardrails” for LLM applications. | Source

Aporia key features

WhyLabs and LangKit

LangKit is an open-source LLM metrics and monitoring toolkit designed by WhyLabs, who provide an associated observability platform as a SaaS product.

Metrics generated with LangKit can be reported into the WhyLabs platform, where they can be filtered and analyzed. | Source

LangKit key features

Comparison table 

This overview was compiled in August 2024. Let us know if we’re missing something.

The present and future of LLM observability

LLM observability enables developers and operators to understand and improve their LLM applications. The tools and platforms available on the market enable teams to adopt practices like prompt management, tracing, and LLM evaluations with limited effort.

With each new development in LLMs and generative AI, new challenges emerge. It is likely that LLM observability will require new pillars to ensure the same level of insight for multi-modal models or LLMs deployed on the edge.
