Long Context RAG Performance of LLMs

Retrieval Augmented Generation (RAG) is the most widely adopted generative AI use case among our customers. RAG enhances the accuracy of LLMs by retrieving information from external sources such as unstructured documents or structured data. With the availability of LLMs with longer context lengths like Anthropic Claude (200k context length), GPT-4-turbo (128k context length) and Google Gemini 1.5 Pro (2 million context length), LLM app developers are able to feed more documents into their RAG applications. Taking longer context lengths to the extreme, there is even a debate about whether long context language models will eventually subsume RAG workflows. Why retrieve individual documents from a database if you can insert the entire corpus into the context window?

 

This blog post explores the impact of increased context length on the quality of RAG applications. We ran over 2,000 experiments on 13 popular open source and commercial LLMs to uncover their performance on various domain-specific datasets. We found that:

  • Retrieving more documents can indeed be beneficial: Retrieving more information for a given query increases the likelihood that the right information is passed on to the LLM. Modern LLMs with long context lengths can take advantage of this and thereby improve the overall RAG system.
  • Longer context is not always optimal for RAG: Most models’ performance decreases after a certain context size. Notably, Llama-3.1-405b performance starts to decrease after 32k tokens, GPT-4-0125-preview starts to decrease after 64k tokens, and only a few models can maintain consistent long context RAG performance on all datasets.

  • Models fail on long context in highly distinct ways: We conducted deep dives into the long-context performance of Llama-3.1-405b, GPT-4, Claude-3-sonnet, DBRX and Mixtral and identified unique failure patterns such as rejecting due to copyright concerns or always summarizing the context. Many of the behaviors suggest a lack of sufficient long context post-training.

Figure 1: Long context performance of GPT, Claude, Llama, Mistral and DBRX models on 4 curated RAG datasets (Databricks DocsQA, FinanceBench, HotPotQA and Natural Questions)

Background

RAG: A typical RAG workflow involves at least two steps:

  • Retrieval: given the user’s question, retrieve the relevant information from a corpus or database. Information retrieval is a rich area of system design. However, a simple, contemporary approach is to embed individual documents to produce a collection of vectors that are then stored in a vector database. The system then retrieves relevant documents based on the similarity of the user’s question to each document. A key design parameter in retrieval is the number of documents and, hence, the total number of tokens to return.
  • Generation: given the user’s question and the retrieved information, generate the corresponding response (or refuse to answer if there is not enough information to generate one). The generation step can employ a wide range of techniques. However, a simple, contemporary approach is to prompt an LLM with a simple prompt that introduces the retrieved information as context for the question to be answered (a minimal sketch of both steps follows this list).
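To make these two steps concrete, below is a minimal sketch of such a pipeline in Python. It assumes the OpenAI Python client and a FAISS index built over embeddings of the document chunks; the prompt wording and helper names are illustrative, not the exact setup used in our experiments.

```python
import numpy as np
import faiss  # vector index used for similarity search
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # Embed a batch of texts with the same embedding model used to index the corpus.
    resp = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return np.array([d.embedding for d in resp.data], dtype="float32")

def retrieve(question, index, chunks, k=10):
    # Retrieval step: return the k chunks closest (L2 distance) to the question embedding.
    _, ids = index.search(embed([question]), k)
    return [chunks[i] for i in ids[0]]

def generate(question, retrieved_chunks, model="gpt-4o"):
    # Generation step: prompt an LLM with the retrieved context and the user question.
    context = "\n\n".join(retrieved_chunks)
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0.0,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```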

RAG has been shown to increase the quality of QA systems across many domains and tasks (Lewis et al., 2020).

A block diagram depicting the architecture of a RAG application
Figure 2: typical RAG workflow

Long context language models: modern LLMs support increasingly large context lengths.

While the original GPT-3.5 only had a context length of 4k tokens, GPT-4-turbo and GPT-4o have a context length of 128k. Similarly, Claude 2.1 has a context length of 200k tokens and Gemini 1.5 Pro boasts a context length of 2 million tokens. The maximum context length of open source LLMs has followed a similar trend: while the first generation of Llama models only had a context length of 2k tokens, more recent models such as Mixtral and DBRX have a 32k token context length. The recently released Llama 3.1 has a maximum context length of 128k tokens.

The benefit of using long context for RAG is that the system can augment the retrieval step to include more retrieved documents in the generation model’s context, which increases the probability that a document relevant to answering the question is available to the model.

On the other hand, recent evaluations of long context models have surfaced two widespread limitations: 

  • The “lost in the middle” problem: the “lost in the middle” problem happens when models struggle to retain and effectively utilize information from the middle portions of long texts. This issue can lead to a degradation in performance as the context length increases, with models becoming less effective at integrating information spread across extensive contexts.
  • Effective context length: the RULER paper explored the performance of long context models on several categories of tasks including retrieval, variable tracking, aggregation and question answering, and found that the effective context length (the amount of usable context beyond which model performance begins to decrease) can be much shorter than the claimed maximum context length.

With these research observations in mind, we designed multiple experiments to probe the potential value of long context models, measure their effective context length in RAG workflows, and assess when and how long context models fail.

Methodology

To examine the effect of long context on retrieval and generation, both individually and on the entire RAG pipeline, we explored the following research questions:

  1. The effect of long context on retrieval: How does the quantity of documents retrieved affect the probability that the system retrieves a relevant document?
  2. The effect of long context on RAG: How does generation performance change as a function of more retrieved documents?
  3. The failure modes for long context on RAG: How do different models fail at long context?

We used the following retrieval settings for experiments 1 and 2 (a minimal indexing sketch follows this list):

  • embedding model: (OpenAI) text-embedding-3-large
  • chunk size: 512 tokens (we split the documents from the corpus into chunks of 512 tokens)
  • stride size: 256 tokens (the overlap between adjacent chunks is 256 tokens)
  • vector store: FAISS (with IndexFlatL2 index)
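For reference, a minimal sketch of this indexing setup is shown below. The tokenizer used for counting tokens and the batching logic are assumptions made for illustration; only the chunk size, overlap, embedding model and index type come from the settings above.

```python
import numpy as np
import faiss
import tiktoken
from openai import OpenAI

client = OpenAI()
enc = tiktoken.get_encoding("cl100k_base")  # assumed tokenizer for token counting

def chunk_document(text, chunk_size=512, overlap=256):
    # Split a document into 512-token chunks; adjacent chunks overlap by 256 tokens.
    tokens = enc.encode(text)
    step = chunk_size - overlap
    return [enc.decode(tokens[i:i + chunk_size]) for i in range(0, len(tokens), step)]

def build_index(documents, batch_size=256):
    # Embed every chunk with text-embedding-3-large and store it in a flat L2 FAISS index.
    chunks = [c for doc in documents for c in chunk_document(doc)]
    embeddings = []
    for i in range(0, len(chunks), batch_size):
        resp = client.embeddings.create(model="text-embedding-3-large",
                                        input=chunks[i:i + batch_size])
        embeddings.extend(d.embedding for d in resp.data)
    matrix = np.array(embeddings, dtype="float32")
    index = faiss.IndexFlatL2(matrix.shape[1])  # exact L2 search (IndexFlatL2)
    index.add(matrix)
    return index, chunks
```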

We used the following LLM generation settings for experiment 2:

  • generation models: gpt-4o, claude-3-5-sonnet, claude-3-opus, claude-3-haiku, gpt-4o-mini, gpt-4-turbo, claude-3-sonnet, gpt-4, meta-llama-3.1-405b, meta-llama-3-70b, mixtral-8x7b, dbrx, gpt-3.5-turbo
  • temperature: 0.0
  • max_output_tokens: 1024

When benchmarking the performance at context length X, we used the following method to calculate how many tokens to use for the prompt: 

  • Given the context length X, we first subtracted 1k tokens, which are reserved for the model output
  • We then left a buffer size of 512 tokens

The rest is the cap on how long the prompt can be (this is why we used a context length of 125k instead of 128k: we wanted to leave enough buffer to avoid hitting out-of-context errors).
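Putting the arithmetic together, the prompt budget and the number of 512-token chunks that fit at a given context length work out as follows. This is a small worked example of the rule above; the resulting chunk counts match the “# Retrieved chunks” row of Table 1 below.

```python
def prompt_budget(context_length, output_tokens=1024, buffer=512):
    # Tokens left for the prompt (retrieved chunks + template + question)
    # after reserving 1k tokens for the output and a 512-token buffer.
    return context_length - output_tokens - buffer

def num_chunks(context_length, chunk_size=512):
    # Number of 512-token retrieved chunks that fit within the prompt budget.
    return prompt_budget(context_length) // chunk_size

# A 16k (16384-token) context leaves 14848 prompt tokens, i.e. 29 chunks;
# a 128k (131072-token) context fits 253 chunks.
print(num_chunks(16 * 1024), num_chunks(128 * 1024))  # -> 29 253
```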

Evaluation datasets

In this study, we benchmarked all LLMs on 4 curated RAG datasets that were formatted for both retrieval and generation. These included Databricks DocsQA and FinanceBench, which represent industry use cases, and Natural Questions (NQ) and HotPotQA, which represent more academic settings. Below are the dataset details:

| Dataset | Category | Corpus #docs | # queries | Avg doc length (tokens) | Min doc length (tokens) | Max doc length (tokens) | Description |
|---|---|---|---|---|---|---|---|
| Databricks DocsQA (v2) | Use case specific: corporate question-answering | 7563 | 139 | 2856 | 35 | 225941 | DocsQA is an internal question-answering dataset using information from public Databricks documentation and real user questions and labeled answers. Each of the documents in the corpus is a web page. |
| FinanceBench (150 tasks) | Use case specific: finance question-answering | 53399 | 150 | 811 | 0 | 8633 | FinanceBench is an academic question-answering dataset that includes pages from 360 SEC 10k filings from public companies and the corresponding questions and ground truth answers based on SEC 10k documents. More details can be found in the paper Islam et al. (2023). We use a proprietary (closed source) version of the full dataset from Patronus. Each of the documents in our corpus corresponds to a page from the SEC 10k PDF files. |
| Natural Questions (dev split) | Academic: general knowledge (Wikipedia) question-answering | 7369 | 534 | 11354 | 716 | 13362 | Natural Questions is an academic question-answering dataset from Google, discussed in their 2019 paper (Kwiatkowski et al., 2019). The queries are Google search queries. Each question is answered using content from Wikipedia pages in the search results. We use a simplified version of the wiki pages where most of the non-natural-language text has been removed, but some HTML tags remain to define useful structure in the documents (for example, tables). The simplification is done by adapting the original implementation. |
| BEIR-HotpotQA | Academic: multi-hop general knowledge (Wikipedia) question-answering | 5233329 | 7405 | 65 | 0 | 3632 | HotpotQA is an academic question-answering dataset collected from the English Wikipedia; we use the version of HotpotQA from the BEIR paper (Thakur et al., 2021). |

Evaluation Metrics: 

  • Retrieval metrics: we used recall to measure the performance of retrieval. The recall score is defined as the number of relevant documents retrieved divided by the total number of labeled relevant documents for the query.
  • Generation metrics: we used the answer correctness metric to measure the performance of generation. We implemented answer correctness through our calibrated LLM-as-a-judge system powered by GPT-4o. Our calibration results demonstrated that the judge-to-human agreement rate is as high as the human-to-human agreement rate (a conceptual sketch of such a judge follows this list).
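The exact calibrated judge prompt is not reproduced here, but conceptually an answer correctness judge boils down to a single call per sample, along the lines of the sketch below. The prompt wording is an illustrative assumption, not our production judge.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a question-answering system.
Question: {question}
Reference answer: {expected}
Candidate answer: {generated}
Is the candidate answer factually consistent with the reference answer?
Reply with exactly one word: CORRECT or INCORRECT."""

def answer_correctness(question, expected, generated):
    # Ask the judge model (GPT-4o) whether the generated answer matches the reference.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, expected=expected, generated=generated)}],
    )
    verdict = resp.choices[0].message.content.strip().upper()
    return verdict.startswith("CORRECT")
```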

Why long context for RAG?

Experiment 1: The benefits of retrieving more documents

In this experiment, we assessed how retrieving more results would affect the amount of relevant information placed in the context of the generation model. Specifically, we assumed that the retriever returns X tokens and then calculated the recall score at that cutoff. From another perspective, the recall performance is the upper bound on the performance of the generation model when the model is required to use only the retrieved documents for generating answers.
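A sketch of this recall-at-a-token-budget computation is shown below, assuming each query comes with a set of labeled relevant document IDs and a similarity-ranked list of retrieved chunks (the data layout is an assumption for illustration):

```python
def recall_at_budget(ranked_doc_ids, relevant_doc_ids, token_budget, chunk_size=512):
    # ranked_doc_ids: source document ID of each retrieved chunk, best match first.
    # Keep as many 512-token chunks as fit in the token budget, then compute
    # the fraction of labeled relevant documents that were retrieved.
    k = token_budget // chunk_size
    retrieved = set(ranked_doc_ids[:k])
    relevant = set(relevant_doc_ids)
    return len(retrieved & relevant) / len(relevant)
```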

Below are the recall results for the OpenAI text-embedding-3-large embedding model on the 4 datasets at different context lengths. We use a chunk size of 512 tokens and leave a 1.5k-token buffer for the prompt and generation.

| Context length | 2k | 4k | 8k | 16k | 32k | 64k | 96k | 128k | 160k | 192k |
|---|---|---|---|---|---|---|---|---|---|---|
| # Retrieved chunks | 1 | 5 | 13 | 29 | 61 | 125 | 189 | 253 | 317 | 381 |
| Databricks DocsQA | 0.547 | 0.856 | 0.906 | 0.957 | 0.978 | 0.986 | 0.993 | 0.993 | 0.993 | 0.993 |
| FinanceBench | 0.097 | 0.287 | 0.493 | 0.603 | 0.764 | 0.856 | 0.916 | 0.916 | 0.916 | 0.916 |
| NQ | 0.845 | 0.992 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| HotPotQA | 0.382 | 0.672 | 0.751 | 0.797 | 0.833 | 0.864 | 0.880 | 0.890 | 0.890 | 0.890 |
| Average | 0.468 | 0.702 | 0.788 | 0.839 | 0.894 | 0.927 | 0.947 | 0.95 | 0.95 | 0.95 |

Table 1: Recall@k of the OpenAI text-embedding-3-large retriever on each dataset at different context lengths.

Saturation point: as can be observed in the table, each dataset’s retrieval recall score saturates at a different context length. For the NQ dataset, recall saturates early, at 8k context length, whereas DocsQA and FinanceBench saturate at 96k and HotpotQA at 128k context length. These results demonstrate that with a simple retrieval approach, there is additional relevant information available to the generation model all the way up to 96k or 128k tokens. Hence, the increased context size of modern models offers the promise of capturing this additional information to increase overall system quality.

Using longer context does not uniformly increase RAG performance

Experiment 2: Long context on RAG

In this experiment, we put together the retrieval step and the generation step as a simple RAG pipeline. To measure RAG performance at a certain context length, we increase the number of chunks returned by the retriever to fill up the generation model’s context up to the given context length. We then prompt the model to answer the questions of a given benchmark. Below are the results of these models at different context lengths.
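Concretely, the evaluation loop for a given context length can be sketched as below, where `retrieve_top_k`, `llm` and `judge` stand in for the retriever, the generation model, and the answer correctness judge described earlier. The names and prompt wording are illustrative, not the exact templates used in our experiments.

```python
def rag_answer(question, context_length, retrieve_top_k, llm,
               chunk_size=512, output_tokens=1024, buffer=512):
    # Fill the prompt with as many retrieved 512-token chunks as the context allows,
    # reserving 1k tokens for the output plus a 512-token buffer.
    k = (context_length - output_tokens - buffer) // chunk_size
    context = "\n\n".join(retrieve_top_k(question, k))
    prompt = ("Answer the question based only on the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm(prompt)

def evaluate(benchmark, context_length, retrieve_top_k, llm, judge):
    # Average answer correctness over (question, expected_answer) pairs
    # at a fixed context length.
    scores = [judge(q, gold, rag_answer(q, context_length, retrieve_top_k, llm))
              for q, gold in benchmark]
    return sum(scores) / len(scores)
```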

Line graph showing RAG performance on the NQ dataset as a function of context length for various LLMs
Figure 3.1:  RAG performance on the NQ (dev) dataset across models

The Natural Questions dataset is a general question-answering dataset that’s publicly available. We speculate that most language models have been trained or fine-tuned on tasks similar to Natural Questions, and therefore we observe relatively small score differences among different models at short context lengths. As the context length grows, some models start to show decreased performance.

Line graph showing RAG performance on the Databricks DocsQA dataset as a function of context length for various LLMs
Figure 3.2: RAG performance on the Databricks DocsQA dataset across models

Compared to Natural Questions, the Databricks DocsQA dataset is not publicly available (although the dataset was curated from publicly available documents). The tasks are more use case specific, and focus on enterprise question-answering based on Databricks documentation. We speculate that because models are less likely to have been trained on similar tasks, RAG performance varies more across models than it does on Natural Questions. Additionally, because the average document length for the dataset is 3k tokens, which is much shorter than that of FinanceBench, the performance saturation happens earlier than it does on FinanceBench.

Line graph showing RAG performance on the FinanceBench dataset as a function of context length for various LLMs
Figure 3.3:  RAG performance on the FinanceBench dataset across models
Line graph showing RAG performance on the HotPotQA dataset as a function of context length for various LLMs
Figure 3.4:  RAG performance on the HotPotQA dataset across models

The FinanceBench dataset is another use case specific benchmark that consists of longer documents, namely SEC 10k filings. In order to correctly answer the questions in the benchmark, the model needs a larger context length to capture relevant information from the corpus. This is likely the reason that, compared to other benchmarks, the recall for FinanceBench is low for small context sizes (Table 1). As a result, most models’ performance saturates at a longer context length than it does on other datasets.

By averaging these RAG task results together, we derived the long context RAG performance table (found in the appendix section) and we also plotted the data as a line chart in Figure 1. 

Figure 1 at the beginning of the blog shows the performance average across 4 datasets. We report the average scores in Table 2 in the Appendix.

As can be noticed from Figure 1:

  • Increasing context size enables models to take advantage of additional retrieved documents: We can observe an increase in performance across all models from 2k to 4k context length, and the increase persists for many models up to 16k–32k context length.
  • However, for most models there is a saturation point after which performance decreases; for example, 16k for gpt-4-turbo and claude-3-sonnet, 4k for mixtral-instruct and 8k for dbrx-instruct.
  • Nonetheless, recent models, such as gpt-4o, claude-3.5-sonnet and gpt-4o-mini, show improved long context behavior with little to no performance deterioration as context length increases.

Taken together, these results mean a developer must be mindful in selecting the number of documents to include in the context. The optimal choice likely depends on both the generation model and the task at hand.

LLMs Fail at Long Context RAG in Different Ways

Experiment 3: Failure analysis for long context LLMs

To assess the failure modes of generation models at longer context lengths, we analyzed samples from llama-3.1-405b-instruct, claude-3-sonnet, gpt-4, Mixtral-instruct and DBRX-instruct, which cover a selection of both SOTA open source and commercial models.

Due to time constraints, we chose the NQ dataset for analysis since the performance decrease on NQ in Figure 3.1 is especially noticeable.

We extracted the answers for each model at different context lengths, manually inspected several samples, and – based on those observations – defined the following broad failure categories:

  • repeated_content: the LLM answer consists entirely of nonsensical repeated words or characters.
  • random_content: the model produces an answer that is completely random, irrelevant to the content, or doesn’t make logical or grammatical sense.
  • fail_to_follow_instruction: the model doesn’t understand the intent of the instruction or fails to follow the instruction specified in the question. For example, when the instruction asks for an answer based on the given context, the model instead tries to summarize the context.
  • wrong_answer: the model attempts to follow the instruction but the provided answer is wrong.
  • others: the failure doesn’t fall under any of the categories listed above.

We developed prompts that describe each category and used GPT-4o to classify all of the failures of the models under consideration into the above categories. We also note that the failure patterns on this dataset may not be representative of other datasets; it’s also possible for the patterns to change with different generation settings and prompt templates.
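The classification itself amounts to one judge call per failed sample; a sketch is shown below. The category descriptions are abbreviated and the prompt wording is an assumption, not the exact prompts we used.

```python
from openai import OpenAI

client = OpenAI()

FAILURE_CATEGORIES = ["repeated_content", "random_content",
                      "fail_to_follow_instruction", "wrong_answer", "others"]

CLASSIFY_PROMPT = """Classify the failure of a question-answering model into one category:
- repeated_content: the answer is nonsensical repeated words or characters
- random_content: the answer is random or irrelevant to the question
- fail_to_follow_instruction: the model ignores the instruction (e.g. summarizes instead of answering)
- wrong_answer: the model tries to answer but the answer is incorrect
- others: none of the above

Question: {question}
Expected answer: {expected}
Model answer: {generated}
Reply with the category name only."""

def classify_failure(question, expected, generated):
    # Ask GPT-4o to assign one failure category to a failed sample.
    resp = client.chat.completions.create(
        model="gpt-4o",
        temperature=0.0,
        messages=[{"role": "user", "content": CLASSIFY_PROMPT.format(
            question=question, expected=expected, generated=generated)}],
    )
    label = resp.choices[0].message.content.strip()
    return label if label in FAILURE_CATEGORIES else "others"
```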

Commercial model long context failure analysis

The two bar charts below show the failure attribution for the two commercial language models gpt-4 and claude-3-sonnet.

Bar graph showing GPT-4 performance on the NQ dataset at different context lengths

Bar graph showing Claude 3 Sonnet performance on the NQ dataset at different context lengths

GPT-4 failures: GPT-4 tends to fail by providing the wrong answer (e.g. the question is “who sang once upon a dream at the end of maleficent”, where the correct answer is “Lana Del Rey” but the generated answer is “Ariana Grande & John Legend”). Additionally, GPT-4 occasionally provides answers that are irrelevant or random. For example, when asked “who was elected president in mexico in 2000”, GPT-4’s answer is “15th largest in nominal terms and 11th largest by purchasing power parity.”

| question | expected_answer | generated_answer |
|---|---|---|
| who sang once upon a dream at the end of maleficent | Lana Del Rey | Ariana Grande & John Legend |
| who did michael showers play on vampire diaries | The Man | Michael Trevino as Tyler Lockwood |

(Samples for gpt-4 providing wrong answer)

| question | expected_answer | generated_answer |
|---|---|---|
| who was elected president in mexico in 2000 | Vicente Fox | 15th largest in nominal terms and 11th largest by purchasing power parity |
| what is the genus of a bald eagle | Haliaeetus | access farm buildings to nest and roost |

(Samples for gpt-4 answering with random content)

Claude-3-sonnet failures: on the other hand, Claude frequently fails to follow the instructions, and we found that the majority of the failures are caused by it refusing to provide an answer due to copyright concerns. Below is an example failure mode from claude-3-sonnet:

| question | expected_answer | generated_answer |
|---|---|---|
| who played mrs warboys in one foot in the grave | Doreen Mantle | I’m afraid I cannot reproduce or quote copyrighted material such as song lyrics, book excerpts, or long passages from periodicals without permission. However, I’d be happy to summarize or paraphrase the relevant information in my own words to the best of my abilities. Please let me know if you would like me to provide a summary instead. |

(Samples for claude-3-sonnet failing to follow the instructions)

OSS model long context failure analysis

The three bar charts below show the failure attribution for the 3 OSS models: Llama-3.1-405b-instruct, Mixtral-8x7b-instruct and DBRX-instruct:

Bar graph showing Llama 3.1 405B performance on the NQ dataset at different context lengths

Bar graph showing Mixtral 8x7B performance on the NQ dataset at different context lengths

Bar graph showing DBRX-Instruct performance on the NQ dataset at different context lengths

The failure pattern for Llama-3.1-405b-instruct is consistent; the model mostly fails by providing the wrong answer. The model rarely generates repeated or random content, and the instruction following performance is also very good, with an overall failure rate and attribution similar to GPT-4. 

| question | expected_answer | generated_answer |
|---|---|---|
| season 2 attack on titan how many episodes | 12 | 25 |
| who plays faith on when calls the heart | Andrea Brooks | Not specified in the provided context. |
| who has been appointed as the election commissioner of india | Om Prakash Rawat | Achal Kumar Jyoti |
| when was the chain first used for f1 | 1978 | 1973 Canadian Grand Prix |

(Samples for llama-3.1-405b-instruct providing wrong answer)

 

On the other hand, Mixtral-instruct and DBRX-instruct fail with different patterns: Mixtral-instruct consistently outputs repeated content such as “梦梦梦梦梦梦” (i.e. “dream” in Chinese) or random content that is irrelevant to the question. In contrast, DBRX-instruct consistently fails to follow instructions, summarizing the context instead of answering the question. The bar chart for DBRX also shows a higher rate of repeated or random content; upon closer examination, those outputs are still attempts to summarize the context, just with repeated sentences. Below are samples from Mixtral-instruct and DBRX-instruct:

| question | expected_answer | generated_answer |
|---|---|---|
| who wrote the book the origin of species | Charles Darwin | 梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦梦 |

(Sample for Mixtral-instruct generating repeated content)


