LLM Hallucinations 101



Hallucinations are an inherent feature of LLMs that becomes a bug in LLM-based applications.

Causes of hallucinations include insufficient training data, misalignment, attention limitations, and tokenizer issues.

Hallucinations can be detected by verifying the accuracy and reliability of the model’s responses.

Effective mitigation strategies involve enhancing data quality, alignment, information retrieval methods, and prompt engineering.

In 2022, when GPT-3.5 was introduced with ChatGPT, many, like me, started experimenting with various use cases. A friend asked me if it could read an article, summarize it, and answer some questions, like a research assistant. At that time, ChatGPT had no tools to explore websites, but I was unaware of this. So, I gave it the article’s link. It responded with an abstract of the article. Since the article was a medical research paper, and I had no medical background, I was amazed by the result and eagerly shared my enthusiasm with my friend. However, when he reviewed the abstract, he noticed it had almost nothing to do with the article.

Then, I realized what had happened. As you might guess, ChatGPT had taken the URL, which included the article’s title, and “made up” an abstract. This “making up” event is what we call a hallucination, a term popularized by Andrej Karpathy in 2015 in the context of RNNs and extensively used nowadays for large language models (LLMs).

What are LLM hallucinations?

LLMs like GPT-4o, Llama 3.1, Claude 3.5, or Gemini 1.5 Pro have made a huge jump in quality compared to the first of their class, GPT-3.5. However, they are all based on the same decoder-only transformer architecture, with the sole goal of predicting the next token based on a sequence of given or already predicted tokens. This is called causal language modeling. Optimizing for this objective while looping (pre-training) over a gigantic dataset of text (15T tokens for Llama 3.1), trying to predict each one of its tokens, is how an LLM acquires its ability to understand natural language.

There is a whole field of study on how LLMs select the next token in a sequence. In the following, we’ll exclusively talk about LLMs with greedy decoding, which means always choosing the most probable token as the next-token prediction. Given that, talking about hallucinations is hard because, in some sense, all an LLM does is hallucinate tokens.
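To make greedy decoding concrete, here is a minimal sketch with a toy vocabulary and made-up logits (the words and numbers are illustrative, not taken from any real model): the softmax turns the output layer’s scores into a probability distribution, and greedy decoding simply picks the most probable token.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability before exponentiating
    z = np.exp(logits - np.max(logits))
    return z / z.sum()

def greedy_next_token(logits, vocab):
    # Greedy decoding: always pick the single most probable token
    probs = softmax(logits)
    return vocab[int(np.argmax(probs))], float(probs.max())

# Toy vocabulary and made-up logits standing in for a real model's output layer
vocab = ["Paris", "London", "banana", "the"]
logits = np.array([3.1, 2.4, 0.2, 1.0])

token, p = greedy_next_token(logits, vocab)
print(f"Next token: {token!r} (p={p:.2f})")
```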

LLM hallucinations become a problem in LLM-based applications

Most of the time, if you use an LLM, you probably won’t use a base LLM but an LLM-based assistant whose goal is to help with your requests and reliably answer your questions. Ultimately, the assistant has been trained (post-training) to follow your instructions. Here’s when hallucinations become an undesirable bug.

In short, hallucinations occur when a user instruction (prompt) leads the LLM to predict tokens that are not aligned with the expected answer or ground truth. These hallucinations mainly happen either because the correct token was not available or because the LLM failed to retrieve it.

Before we dive into this further, I’d like to stress that when thinking about LLM hallucinations, it’s important to keep in mind the difference between a base LLM and an LLM-based assistant. When we talk about LLM hallucinations as a problematic phenomenon, it’s in the context of an LLM-based assistant or system.

Where in the transformer architecture are hallucinations generated?

The statement “all an LLM does is hallucinate tokens” conceals a lot of meaning. To uncover this, let’s walk through the transformer architecture to understand how tokens are generated during inference and where hallucinations may be happening.

Decoder-only transformer architecture. The input tokens are embedded, combined with a positional encoding, and fed through a stack of transformer blocks. Each of these blocks consists of multiple attention heads and a feed-forward layer. The output probabilities are obtained by computing the softmax over the output layer. | Source

Hallucinations can occur throughout the process to predict the next token in a sequence of tokens:

  1. Initially, the sequence is split into words or subwords (collectively referred to as tokens), which are transformed into numerical values. This is the first potential source of hallucinations, as what is happening is a literal translation between words and numbers.
  2. The encoded tokens pass through an embedding layer that has learned how to represent these tokens in a vector space where tokens with similar meanings are placed close and vice versa. If this representation is not good enough, the embedding vectors of two tokens could be close even though the tokens are not similar. This could lead to hallucinations downstream.
  3. An additional embedding is used to represent the position of tokens in the original sequence. If this representation does not work properly, the transformer may not be able to understand the sentence. (Have you ever tried to read a randomly sorted sentence?)
  4. Within the transformer block, tokens first undergo self-attention multiple times (multi-head). Self-attention is the mechanism where tokens interact with each other (auto-regressively) and with the knowledge acquired during pre-training. The interactions between the Query, Key, and Value matrices determine which information is emphasized or prioritized and will carry more weight in the final prediction. This is where most “factual hallucinations” (the LLM making up convincing-sounding fake information) are generated.
  5. Still, within the transformer block, the Feed Forward layer has the role of processing self-attention output, learning complex patterns over it, and improving the output. While it’s unlikely that this process introduces new hallucinations, hallucinations seeded upstream are amplified.
  6. Last, at the output layer, the softmax calculates the next-token probability distribution. Here, hallucinations materialize.
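To tie these six steps together, here is a toy, single-block, single-head decoder written with NumPy and random (untrained) weights. It is purely illustrative, not a real model or a real tokenizer, but the comments mark where each step above sits in the computation.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<bos>", "the", "capital", "of", "france", "is", "paris", "london"]
d = 16  # embedding / model dimension (toy size)

# Random, untrained weights stand in for what pre-training would learn
W_emb = rng.normal(size=(len(vocab), d))                     # step 2: token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))  # step 4: attention projections
W_ff1, W_ff2 = rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d))  # step 5: feed-forward
W_out = rng.normal(size=(d, len(vocab)))                     # step 6: output projection

def positional_encoding(n, d):
    # Step 3: sinusoidal positions so the model can tell token order apart
    pos = np.arange(n)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def next_token_distribution(token_ids):
    x = W_emb[token_ids] + positional_encoding(len(token_ids), d)   # steps 2 + 3
    q, key, v = x @ W_q, x @ W_k, x @ W_v                           # step 4: Q, K, V
    scores = q @ key.T / np.sqrt(d)
    mask = np.triu(np.full((len(token_ids),) * 2, -np.inf), k=1)    # causal mask
    x = softmax(scores + mask) @ v                                  # attention output
    x = np.maximum(0, x @ W_ff1) @ W_ff2                            # step 5: feed-forward (ReLU)
    return softmax(x[-1] @ W_out)                                   # step 6: next-token probabilities

# Step 1: a real tokenizer would map text to IDs; here we look them up by hand
ids = [vocab.index(t) for t in ["<bos>", "the", "capital", "of", "france", "is"]]
probs = next_token_distribution(ids)
print(vocab[int(np.argmax(probs))])  # with random weights, the "answer" is arbitrary
```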

What causes LLMs to hallucinate?

While there are many origins of hallucinations within an LLM’s architecture, we can simplify and categorize the root causes into four main origins of hallucinations:


Lack of or scarce data during training

As a rule of thumb, an LLM cannot give you any information that was not present in its training data. Trying to get such information out of it is one of the fastest ways to obtain a hallucination.

How an LLM actually learns factual knowledge is not yet fully understood, and a lot of research is ongoing. But we do know that for an LLM to learn some knowledge, it is not enough to show it the information once. In fact, it benefits from being exposed to a piece of knowledge from diverse sources and perspectives, avoiding duplicated data, and maximizing the LLM’s opportunities to link it with other related knowledge (like a field of study). This is why scarce knowledge, commonly known as long-tail knowledge, usually shows high hallucination rates.

There’s also certain knowledge that an LLM could not have possibly seen during training:

  • Future data. It is not possible to tell if some future event will happen or not. In this context, any data related to the future is speculative. For any LLM, “future” equals anything happening after the last date covered in the training dataset. This is what we call the “knowledge cut-off date.”
  • Private data. Assuming that LLMs are trained with publicly available or licensed data, there’s no chance that an LLM knows, for example, about your company’s balance sheet, your friend’s group chat, or your parents’ home address unless you provide the info in the prompt.

Lack of alignment

Another rule of thumb is that an LLM is just trying to follow your instructions and answer with the most probable response it has. But what happens if an LLM doesn’t know how to follow instructions properly? That is due to a lack of alignment.

Alignment is the process of teaching an LLM how to follow instructions and respond helpfully, safely, and reliably to match our human expectations. This process happens during the post-training stage, which includes different fine-tuning methods.

Imagine using an LLM-based meal assistant. You ask for a nutritious and tasty breakfast suitable for someone with celiac disease. The assistant recommends salmon, avocado, and toast. Why? The model likely knows that toast contains gluten, but when asked for a breakfast suggestion, it failed to ensure that all items met the dietary requirements.

Instead, it defaulted to the most probable and common pairing with salmon and avocado, which happened to be toast. This is an example of a hallucination caused by misalignment. The assistant’s response didn’t meet the requirements for a celiac-friendly menu, not because the LLM didn’t understand what celiac disease is but because it failed to accurately follow the instructions provided.

Although the example may seem simplistic, and modern LLMs have largely addressed these issues, similar mistakes can still be observed with smaller or older language models.

Poor attention performance

Attention is the process of modeling the interaction between input tokens via the dot product of Query and Key matrices, generating an attention matrix, which is then multiplied with a Value matrix to get the attention output. This operation represents a mathematical way of expressing a lookup of knowledge related to the input tokens, weighing it, and then responding to the request based on it.

Poor attention performance means not properly attending to all relevant parts of the prompt and thus not having available the information needed to respond appropriately. Attention performance is an inherent property of LLMs, fundamentally determined by architecture and hyperparameter choices. Nevertheless, it seems like a combination of fine-tuning and some tweaks to the positional embedding brings huge improvements in attention performance.

Typical attention-based hallucinations are those where, after a relatively long conversation, the model is unable to remember a certain date you mentioned or your name, or even forgets the instructions given at the very beginning. We can measure this using the “needle in a haystack” evaluation, which tests whether an LLM can accurately retrieve a specific fact placed at different depths within contexts of varying lengths.
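Here is a minimal sketch of such a test, assuming a placeholder call_llm function standing in for whatever model client you use (the function name, the filler text, and the “needle” are all made up for illustration):

```python
def build_haystack(needle: str, filler_sentence: str, n_sentences: int, depth: float) -> str:
    # Place the "needle" fact at a relative depth (0.0 = start, 1.0 = end) of a long filler context
    sentences = [filler_sentence] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def needle_in_haystack_test(call_llm, needle, question, expected, depths, n_sentences=500):
    filler = "The sky was a uniform shade of grey that afternoon."
    results = {}
    for depth in depths:
        context = build_haystack(needle, filler, n_sentences, depth)
        prompt = f"{context}\n\nQuestion: {question}\nAnswer briefly."
        answer = call_llm(prompt)  # placeholder: plug in your model client here
        results[depth] = expected.lower() in answer.lower()
    return results

# Example usage with a fake model that always fails, just to show the shape of the output
fake_llm = lambda prompt: "I don't recall."
print(needle_in_haystack_test(
    fake_llm,
    needle="Marta's flight leaves at 9:45 am on October 3rd.",
    question="At what time does Marta's flight leave?",
    expected="9:45",
    depths=[0.0, 0.25, 0.5, 0.75, 1.0],
))
```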

Tokenizer

The tokenizer is a core part of an LLM because of its singular role: it is the only component in the transformer architecture that can be, at the same time, the root cause of hallucinations and the place where they are generated.

The tokenizer is the component that chunks input text into small pieces of characters, each represented by a numeric ID: the tokens. Tokenizers learn the correspondences between word chunks and tokens separately from the LLM training. Hence, the tokenizer is the only component that is not necessarily trained on the same dataset as the transformer.

This can lead to words being interpreted with a totally different meaning. In extreme cases, certain tokens can completely break an LLM. One of the first widely discussed examples was the SolidGoldMagikarp token, which GPT-3 internally understood as the verb “distribute,” resulting in weird conversation completions.
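To see tokenization in action, here is a short sketch using tiktoken as an example BPE tokenizer (the choice of library and encoding is ours, purely for illustration; any tokenizer would do):

```python
# Requires: pip install tiktoken (OpenAI's open-source BPE tokenizer library)
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one of tiktoken's built-in encodings

for text in ["hello world", "SolidGoldMagikarp", "celiac-friendly breakfast"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r} -> {ids} -> {pieces}")

# Rare or unusual strings map to unusual token sequences (or, in older vocabularies,
# to single rarely-seen tokens), which is where tokenizer-driven hallucinations can start.
```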

Is it possible to detect hallucinations?

When it comes to detecting hallucinations, what you actually want to do is evaluate if the LLM responds reliably and truthfully. We can classify evaluations based on whether the ground truth (reference) is available or not.

Reference-based evaluations

Comparing ground truth against LLM-generated answers is based on the same principles as classic machine learning model evaluation. However, unlike with other models, language predictions cannot be compared word by word. Instead, semantic and fact-based metrics must be used. Here are some of the main ones:

  • BLEU (Bilingual Evaluation Understudy) works by comparing the n-grams (contiguous sequences of words) in the generated text to those in one or more reference texts, calculating the precision of these matches, and applying a brevity penalty to discourage overly short outputs.
  • BERTScore evaluates the semantic similarity between the generated text and reference texts by converting them into dense vector embeddings using a pre-trained model like BERT and then calculating the similarity between these embeddings with a metric like cosine similarity, allowing it to account for meaning and paraphrasing rather than just exact word matches.
  • Answer Correctness. Proposed by the evaluation framework RAGAS, it consists of two steps:
    • Factual correctness: the factual overlap between the generated answer and the ground truth answer. This is done by leveraging the F1 score and redefining its components over facts rather than classes (a minimal sketch of this computation follows the list):
      • TP (True Positive): Facts or statements present in both the ground truth and the generated answer.
      • FP (False Positive): Facts or statements present in the generated answer but not in the ground truth.
      • FN (False Negative): Facts or statements present in the ground truth but not in the generated answer.
    • Semantic similarity: this step is, indeed, the same as in BERTScore.
  • Hallucination classifiers. Models like Vectara’s HHEM-2.1-Open are encoder-decoder trained to detect hallucinations given a ground truth and an LLM response.
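If the individual statements have already been extracted (in practice, RAGAS uses an LLM for that step), the factual-correctness part reduces to an F1 score over sets of statements. A minimal sketch, assuming the statements are already normalized to comparable strings:

```python
def factual_correctness_f1(answer_statements: set[str], ground_truth_statements: set[str]) -> float:
    # TP: statements in both; FP: only in the answer; FN: only in the ground truth
    tp = len(answer_statements & ground_truth_statements)
    fp = len(answer_statements - ground_truth_statements)
    fn = len(ground_truth_statements - answer_statements)
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)

# Toy example with statements already normalized to comparable strings
truth = {"the eiffel tower is in paris", "it was completed in 1889"}
answer = {"the eiffel tower is in paris", "it was completed in 1925"}
print(factual_correctness_f1(answer, truth))  # 0.5: one fact matches, one is made up, one is missing
```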

Reference-free evaluations

When there is no ground truth, evaluation methods may be separated based on whether the LLM response is generated from a given context (i.e., RAG-like frameworks) or not:

  • Context-based evaluations. Again, RAGAS covered this and proposed a series of metrics for evaluating how well an LLM attends to the provided context. Here are the two most representative:
    • Faithfulness. An external LLM (or a human) breaks the answer into individual statements. Then, each statement is checked against the provided context to see whether it can be inferred from it, and a precision-like score is calculated over all statements (sketched after this list).
    • Context utilization. For each chunk in the top-k retrieved context, check whether it is relevant to arriving at the answer for the given question. Then, calculate a weighted precision that takes into account the rank of the relevant chunks.
  • Context-free evaluations. Here, the only valid approach is supervision by an external agent that can be:
    • LLM supervisor. Having an LLM assess the output requires this second LLM to be a stronger, more capable model at the same task, or a model specialized in detecting specific issues, e.g., hate speech or sentiment.
    • LLM self-supervisor. The same LLM can evaluate its own output if enabled with self-critique or self-reflection agentic patterns.
    • Human supervision or feedback. They can be either “teachers” responsible for LLM supervision during any training stage or just users reporting hallucinations as feedback.
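Below is a minimal sketch of the faithfulness idea as an LLM-as-judge loop. It is not the RAGAS implementation; call_llm is a placeholder for your judge-model client, and the prompts are illustrative:

```python
def faithfulness_score(call_llm, answer: str, context: str) -> float:
    # 1) Ask a judge LLM to split the answer into atomic statements (one per line)
    statements = call_llm(
        f"Break the following answer into short, self-contained factual statements, "
        f"one per line:\n\n{answer}"
    ).splitlines()
    statements = [s.strip() for s in statements if s.strip()]
    if not statements:
        return 0.0

    # 2) Check each statement against the retrieved context
    supported = 0
    for statement in statements:
        verdict = call_llm(
            f"Context:\n{context}\n\nStatement: {statement}\n"
            f"Can the statement be inferred from the context alone? Answer YES or NO."
        )
        supported += verdict.strip().upper().startswith("YES")

    # Faithfulness = fraction of statements supported by the context
    return supported / len(statements)
```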

How to reduce hallucinations in LLMs?

Hallucinations have been one of the main obstacles to the adoption of LLM assistants in enterprises. LLMs are persuasive to the point of fooling PhDs in their own field. The potential harm to non-expert users is high when talking, for example, about health. So, preventing hallucinations is one of the main focuses for different stakeholders:

  • AI labs, owners of the top models, want to solve it to foster adoption.
  • Start-ups have a strong market incentive to solve it and productize the solution.
  • Academia is drawn to it due to high paper impact and research funding.

Hence, an overwhelming amount of new hallucination-prevention methods are constantly being released. (If you’re curious, try searching the recent posts on X talking about “hallucination mitigation” or the latest papers on Google Scholar talking about “LLM hallucination.” By the way, this is a good way to stay updated.)

Broadly speaking, we can reduce hallucinations in LLMs by filtering responses, prompt engineering, achieving better alignment, and improving the training data. To navigate the space, we can use a simple taxonomy to organize current and upcoming methods. Hallucinations can be prevented at different steps of the process an LLM uses to generate an output, and we can use this as the foundation for our categorization.

After the response

Correcting a hallucination after the LLM output has been generated is still beneficial, as it prevents the user from seeing the incorrect information. This approach can effectively transform correction into prevention by ensuring that the erroneous response never reaches the user. The process can be broken down into the following steps:

  • Detect the hallucination in the generated response, for example, by leveraging observability tool capabilities.
  • Block the incorrect information before it reaches the user. This is just an extra step in the response-processing pipeline.
  • Replace the hallucination with accurate information by making the LLM aware of the hallucination and having it elaborate a new answer accordingly. Any scaffolding strategy may be used for this.

This method is part of multi-step reasoning strategies, which are increasingly important in handling complex problems. These strategies, often referred to as “agents,” are gaining popularity. One well-known agent pattern is reflection. By identifying hallucinations early, you can address and correct them before they impact the user.
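Here is a minimal sketch of such a detect-and-correct loop. Both call_llm and detect_hallucination are placeholders (your model client and your detector of choice, e.g., a judge LLM or a hallucination classifier); the prompts are illustrative only:

```python
def answer_with_correction(call_llm, detect_hallucination, prompt: str, max_retries: int = 2) -> str:
    """Generate an answer and, if a hallucination is detected, ask the model to revise it
    before anything is shown to the user. `call_llm` and `detect_hallucination` are
    placeholders for your model client and your detector."""
    answer = call_llm(prompt)
    for _ in range(max_retries):
        issue = detect_hallucination(prompt, answer)  # returns None or a description of the problem
        if issue is None:
            return answer  # nothing suspicious: release the response to the user
        # Reflection step: make the model aware of the problem and ask for a corrected answer
        answer = call_llm(
            f"{prompt}\n\nYour previous answer was:\n{answer}\n\n"
            f"A reviewer flagged this issue: {issue}\n"
            f"Rewrite the answer, fixing the issue and stating clearly if you are unsure."
        )
    return answer  # best effort after the retry budget is exhausted
```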

During the response (in context)

Since the LLM will directly respond to the user’s request, we can inject information before generation starts to condition the model’s response. Here are the most relevant strategies for conditioning the response:

  • Prompt engineering techniques: Single-step prompt engineering strategies condition how the model generates its response, steering the LLM to think in a specific way that results in better responses less prone to hallucinations. For instance, the Chain of Thought (CoT) technique works by:
    • Adding to the original prompt some examples of questions together with the explicit reasoning process that leads to the correct answer.
    • During generation, the LLM emulates this reasoning process and thus avoids mistakes in its response.

A good example of the “Chain of Thought” approach is Anthropic’s Claude using <antthinking> tags to give itself space to reflect, as well as the common practice of adding “Let’s think step by step” at the end of a prompt.
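Here is a small sketch of how a few-shot CoT prompt can be assembled; the worked examples and the helper name are made up for illustration:

```python
COT_EXAMPLES = """\
Q: A train leaves at 14:10 and the trip takes 2 hours and 35 minutes. When does it arrive?
A: Let's think step by step. 14:10 plus 2 hours is 16:10. Adding 35 minutes gives 16:45.
The answer is 16:45.

Q: I have 3 boxes with 12 apples each and give away 7 apples. How many are left?
A: Let's think step by step. 3 boxes times 12 apples is 36 apples. 36 minus 7 is 29.
The answer is 29.
"""

def cot_prompt(question: str) -> str:
    # Prepend worked examples with explicit reasoning, then ask the new question
    return f"{COT_EXAMPLES}\nQ: {question}\nA: Let's think step by step."

print(cot_prompt("A recipe needs 250 g of flour per batch. How much flour for 4 batches?"))
```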

  • Grounding or Retrieval-Augmented Generation (RAG) consists of fetching external information related to the question’s topic and adding it to the user’s prompt. The LLM then responds based on the provided information instead of relying only on its own knowledge. The success of this strategy depends on retrieving proper, relevant information. There are two main approaches to retrieving information sources:
    • Internet search engines. Information is constantly being updated, and there is news every day. In the same way we search for information on Google, an LLM could do the same and then answer based on what it finds.
    • Private data. The idea is to build a search engine over a private set of data (e.g., company-internal documentation or a database) and retrieve relevant data from it to ground the response. There are lots of frameworks, like LangChain, that implement RAG abstractions for private data. A minimal sketch of this retrieve-then-generate flow follows the figure below.
Overview of a RAG application. The prompt is used to retrieve relevant documents from a document store, which are added to the input sent to the LLM. This provides the LLM with knowledge it has not learned during training.
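The sketch below uses a toy word-overlap score as the retriever so it stays self-contained; a real system would use embeddings and a vector store (or a framework like LangChain), and call_llm is again a placeholder for your model client:

```python
def score(query: str, doc: str) -> float:
    # Toy relevance score: word overlap (a real system would use embeddings + a vector store)
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def rag_answer(call_llm, question: str, documents: list[str], top_k: int = 2) -> str:
    # 1) Retrieve: pick the top-k most relevant chunks for the question
    retrieved = sorted(documents, key=lambda doc: score(question, doc), reverse=True)[:top_k]
    context = "\n".join(f"- {doc}" for doc in retrieved)
    # 2) Augment: put the retrieved context in front of the user's question
    prompt = (
        "Answer the question using ONLY the context below. "
        "If the context is not enough, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
    # 3) Generate: the LLM answers grounded in the provided information
    return call_llm(prompt)
```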

As an alternative to retrieving information, if an LLM context window is long enough, any document or data source could be directly added to the prompt, leveraging in-context learning. This would be a brute-force approach, and while costly, it could be effective when reasoning over an entire knowledge base instead of just some retrieved parts.

Post-training or alignment

It is hypothesized that an LLM instructed not only to respond and follow instructions but also to take time to reason and reflect on a problem could largely mitigate the hallucination issue—either by providing the correct answer or by stating that it does not know how to answer.

Furthermore, you can teach a model to use external tools during the reasoning process, like getting information from a search engine. A lot of different fine-tuning techniques are being tested to achieve this. Some LLMs already working with this reasoning strategy are Matt Shumer’s Reflection-Llama-3.1-70B and OpenAI’s o1 family of models.

Pre-training

Increasing the pre-training dataset or introducing new knowledge directly leads to broader knowledge coverage and fewer hallucinations, especially regarding facts and recent events. Additionally, better data processing and curation enhance LLM learning. Unfortunately, pre-training requires vast computational resources, mainly GPUs, which are only accessible to large companies and frontier AI labs. Despite that, if the problem is big enough, pre-training may still be a viable solution, as the OpenAI and Harvey case showed.

Is it possible to achieve hallucination-free LLM applications?

Hallucination-free LLM applications are the Holy Grail or the One Piece of the LLM world. Over time, with a growing availability of resources, invested money, and brains researching the topic, it is hard not to be optimistic.

Ilya Sutskever, one of the researchers behind GPT, is quite sure that hallucinations can be solved with better alignment alone. LLM-based applications are becoming more sophisticated and complex, and the combination of the previously discussed hallucination-prevention strategies is conquering milestones one after another. Despite that, whether the goal is achievable or not remains a hypothesis.

Some, like Yann LeCun, Chief AI Scientist at Meta, have stated that hallucination problems are inherent to auto-regressive models and that we should move toward architectures that can reason and plan. Others, like Gary Marcus, argue strongly that transformer-based LLMs are fundamentally unable to eliminate hallucinations. Instead, he bets on neurosymbolic AI. But the good news is that even those not optimistic about mitigating hallucinations in today’s LLMs are optimistic about the broader goal.

On average, experts’ opinions point either to moderate optimism or uncertainty. After all, my intuition is that there is enough evidence to believe that hallucination-free LLM applications are possible. But remember, when it comes to state-of-the-art research, intuitions must always be built on top of solid knowledge and previous research.

Where does this leave us?

Hallucinations are a blessing and a curse at the same time. Throughout this article, you’ve gained a structured understanding of why, how, and where LLMs hallucinate. Equipped with this base knowledge, you’re ready to face hallucination problems with the different tools and techniques we’ve explored.
