Vector databases play a key role in Retrieval-Augmented Generation (RAG) systems. They enable efficient context retrieval or dynamic few-shot prompting to improve the factual accuracy of LLM-generated responses.
When implementing a RAG system, start with a simple Naive RAG and iteratively improve the system:
- Refine the contextual information available to the LLM: use multi-modal models to extract information from documents, optimize the chunk size, and pre-process chunks to filter out irrelevant information.
- Look into techniques like parent-document retrieval and hybrid search to improve retrieval accuracy.
- Use re-ranking or contextual compression techniques to ensure only the most relevant information is provided to the LLM, improving response accuracy and reducing cost.
As a Machine Learning Engineer working with many companies, I repeatedly encounter the same interaction. They tell me how happy they are with ChatGPT and how much general knowledge it has. So, “all” they want me to do is teach ChatGPT the company’s data, services, and procedures. And then this new chatbot will revolutionize the world. “Just train it on our data”—easy, right?
Then, it’s my turn to explain why we can’t “just train it.” LLMs can’t simply read thousands of documents and remember them forever. We would need to perform foundational training, which, let’s face it, the vast majority of companies can’t afford. While fine-tuning is within reach for many, it mostly steers how models respond rather than teaching them new knowledge. Often, the best option is retrieving the relevant knowledge dynamically at runtime on a per-query basis.
The flexibility provided by being able to retrieve context at runtime is the primary motivation behind using vector databases in LLM applications, or, as this is more commonly referred to, Retrieval Augmented Generation (RAG) systems: We find clever ways to dynamically retrieve and provide the LLM with the most relevant information it needs to perform a particular task. This retrieval process remains hidden from the end user. From their point of view, they’re talking to an all-knowing AI that can answer any question.
I often have to explain the ideas and concepts around RAG to business stakeholders. Further, talking to data scientists and ML engineers, I noticed quite a bit of confusion around RAG systems and terminology. After reading this article, you’ll know different ways to use vector databases to enhance the task performance of LLM-based systems. Starting from a naive RAG system, we’ll discuss why and how to upgrade different parts to improve performance and reduce hallucinations, all while avoiding cost increases.
How does Retrieval Augmented Generation work?
Integrating retrieval of relevant contextual information into LLM systems has become a common design pattern to mitigate the LLMs’ lack of domain-specific knowledge.
The main components of a Retrieval-Augmented Generation (RAG) system are:
- Embedding Model: A machine-learning model that receives chunks of text as inputs and produces a vector (usually between 256 and 1024 dimensions). This so-called embedding represents the meaning of the chunk of text in an abstract space. The similarity/proximity of the embedding vectors is interpreted as semantic similarity (similarity in meaning).
- Vector Database: A database purpose-built for handling storage and retrieval of vectors. These databases typically have highly efficient ways to compare vectors according to predetermined similarity measures.
- Large Language Model (LLM): A machine-learning model that takes in a textual prompt and outputs an answer. In a RAG system, this prompt is usually a combination of retrieved contextual information, instructions to the model, and the user’s query.
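To make these pieces concrete, here is a minimal, library-free sketch of how they fit together. The hashed bag-of-words embedding and the sample chunks are toy stand-ins; a real system would plug in a proper embedding model and a purpose-built vector database:

```python
import numpy as np

# Toy embedding: a hashed bag-of-words. A real system would use a trained
# embedding model (e.g., sentence-transformers or an embeddings API).
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for token in text.lower().split():
        vec[hash(token) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# "Vector database": here just an in-memory list of (embedding, chunk) pairs.
chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Our support team is available Monday to Friday, 9am-5pm CET.",
    "Enterprise customers get a dedicated account manager.",
]
index = [(embed(c), c) for c in chunks]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k chunks whose embeddings are closest to the query embedding."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: float(q @ pair[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]

# The LLM prompt combines retrieved context, instructions, and the user's query.
question = "How long do refunds take?"
context = "\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)  # this prompt is what gets sent to the LLM
```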
Methods for building LLM applications with vector databases
Vector databases for context retrieval
The simplest way to leverage vector databases in LLM systems is to use them to efficiently search for context that can help your LLM provide accurate answers.
At first, building a RAG system seems straightforward: We use a vector database to run a semantic search, find the most relevant documents in the database, and add them to the original prompt. This is what you see in most PoCs or demos for LLM systems: a simple Langchain notebook where everything just works.
But let me tell you, this falls apart completely on first contact with end users.
You will quickly encounter a number of problematic edge cases. For instance, consider the case that your database only contains three relevant documents, but you’re retrieving the top five. Even with a perfect embedding system, you’re now feeding two irrelevant documents to your LLM. In turn, it will output irrelevant or even wrong information.
Later on, we’ll learn how to mitigate these issues to build production-grade RAG applications. But for now, let’s understand how adding documents to the original user query enables the LLM to solve tasks on which it was not trained.
Vector databases for dynamic few-shot prompting
The benefits and effectiveness of “few-shot prompting” have been widely studied. By providing several examples along with our original prompt, we can steer an LLM to provide the desired output. However, it can be challenging to select the proper examples.
It’s pretty popular to pick an example for each “type” of answer we might want to get. For example, say we’re trying to classify texts as “positive” or “negative” in sentiment. Here, we should add an equal number of positive and negative examples to our prompt to avoid class imbalance.
To find these examples on behalf of our users, we need a mechanism that automatically picks the right ones. We can accomplish this by using a vector database that contains any examples we might want to add to our prompts and finding the most relevant samples through semantic search. This approach is quite helpful and fully supported by Langchain and Llamaindex.
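As a rough sketch of what such an example selector can look like under the hood, here is a version that balances the selected examples per class, as discussed above. It assumes a sentence-transformers embedding model and made-up sample reviews; in practice you would likely reach for the example-selector utilities Langchain and Llamaindex provide:

```python
from sentence_transformers import SentenceTransformer  # any embedding model works here

model = SentenceTransformer("all-MiniLM-L6-v2")

# Labeled examples we might want to include as few-shot demonstrations.
examples = [
    {"text": "The delivery was fast and the product works great.", "label": "positive"},
    {"text": "Absolutely love the new interface, very intuitive.", "label": "positive"},
    {"text": "The app crashes every time I open it.", "label": "negative"},
    {"text": "Customer support never answered my emails.", "label": "negative"},
]
example_vecs = model.encode([e["text"] for e in examples], normalize_embeddings=True)

def select_examples(query: str, per_class: int = 1) -> list[dict]:
    """Pick the most similar examples to the query, balanced across labels."""
    q = model.encode(query, normalize_embeddings=True)
    scores = example_vecs @ q  # cosine similarity, since the vectors are normalized
    selected = []
    for label in ("positive", "negative"):
        candidates = [(s, e) for s, e in zip(scores, examples) if e["label"] == label]
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        selected.extend(e for _, e in candidates[:per_class])
    return selected

query = "I waited three weeks and the parcel never arrived."
shots = "\n".join(f"Review: {e['text']}\nSentiment: {e['label']}" for e in select_examples(query))
prompt = f"{shots}\nReview: {query}\nSentiment:"
```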
The way we build this vector database of examples can also get quite interesting. We can add a set of selected samples and then iteratively add more manually validated examples. Going even further, we can save the LLM’s previous mistakes and manually correct the outputs to ensure we have “hard examples” to provide the LLM with. Have a look into Active Prompting to learn more about this.
How to build LLM applications with vector databases: step-by-step guide
Building applications with Large Language Models (LLMs) using vector databases allows for dynamic and context-rich responses. But, implementing a Retrieval-Augmented Generation (RAG) system that lives up to this promise is not easy.
This section guides you through developing a RAG system, starting with a basic setup and moving towards advanced optimizations, iteratively adding more features and complexity as needed.
Step 1: Naive RAG
Start with a so-called Naive RAG with no bells and whistles.
Take your documents, extract any text you can from them, split them into fixed-size chunks, run the chunks through an embedding model, and store the resulting embeddings in a vector database. Then, use this vector database to find the most similar chunks to add to the prompt.
You can follow the quickstart guides for any LLM orchestration library that supports RAG to do this. Langchain, Llamaindex, and Haystack are all great starting points.
Don’t worry too much about vector database selection. All you need is something capable of building a vector index. FAISS, Chroma, and Qdrant have excellent support for quickly putting together local versions. Most RAG frameworks abstract the vector database, so they should be easily hot-swappable unless you use a database-specific feature.
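A minimal local sketch of this pipeline, assuming Chroma's in-memory client with its default embedding function (the document contents, collection name, and chunk size below are placeholders):

```python
import chromadb

# In-memory Chroma instance using its built-in default embedding function.
client = chromadb.Client()
collection = client.create_collection("company_docs")

def chunk(text: str, size: int = 500) -> list[str]:
    """Naive fixed-size character chunking."""
    return [text[i : i + size] for i in range(0, len(text), size)]

# Replace with the text extracted from your own documents.
documents = {
    "handbook": (
        "Employees can claim travel reimbursements within 30 days of a trip. "
        "Flights above 500 EUR require prior approval by a manager. "
    ) * 20,
}

for name, text in documents.items():
    pieces = chunk(text)
    collection.add(
        ids=[f"{name}-{i}" for i in range(len(pieces))],
        documents=pieces,
        metadatas=[{"source": name}] * len(pieces),
    )

# Retrieve the most similar chunks and assemble the prompt.
question = "What is our travel reimbursement policy?"
results = collection.query(query_texts=[question], n_results=3)
context = "\n\n".join(results["documents"][0])
prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer based only on the context."
```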
Once the Naive RAG is in place, all subsequent steps should be informed by a thorough evaluation of its successes and failures. A good starting point for performing RAG evaluation is the RAGAS framework, which supports multiple ways of validating your results, helping you identify where your RAG system needs improvement.
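As a rough sketch of what a RAGAS evaluation run can look like (the exact column names and metric imports depend on the RAGAS version you install, and the metrics call an LLM judge under the hood, so an LLM API key must be configured):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, context_recall, faithfulness

# One record per question your RAG system answered; "contexts" holds the retrieved chunks.
eval_data = {
    "question": ["What is our travel reimbursement policy?"],
    "answer": ["Travel expenses can be reimbursed within 30 days of the trip."],
    "contexts": [["Employees can claim travel reimbursements within 30 days of a trip."]],
    "ground_truth": ["Travel expenses are reimbursed if submitted within 30 days."],
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)  # per-metric scores that hint at whether retrieval or generation needs work
```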
Step 2: Building a better vector database
The documents you use are arguably the most critical part of a RAG system. Here are some potential paths for improvement:
- Increase the information available to the LLM: Internal knowledge bases often consist of a lot of unstructured data that’s hard for LLMs to process. Thus, carefully analyze the documents and extract as much textual information as possible. If your documents contain many images, diagrams, or tables essential to understanding their content, consider adding a preprocessing step with a multi-modal model to convert them into text that your LLM can interpret.
- Optimize the chunk size: A universally best chunk size doesn’t exist. To find the appropriate chunk size for your system, embed your documents using different chunk sizes and evaluate which chunk size yields the best retrieval results. To learn more about chunk size sensitivity, I recommend this guide by LlamaIndex, which details how to perform RAG performance evaluation for different chunk sizes.
- Consider how you turn chunks into embeddings: We’re not forced to stick to the (chunk embedding, chunk) pairs of the Naive RAG approach. Instead, we can modify the embeddings we use as the index for retrieval. For example, we can summarize our chunks using an LLM before running them through the embedding model. These summaries will be much shorter and contain less meaningless filler text, which might “confuse” or “distract” our embedding model.
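To illustrate that last point, here is a rough sketch of indexing LLM-written summaries while still returning the full original chunks at query time. The `summarize` stand-in and sample chunks are placeholders, and Chroma is again assumed as the local vector store:

```python
import chromadb

client = chromadb.Client()
summary_index = client.create_collection("chunk_summaries")

chunks = {
    "chunk-0": "Long, rambling passage about the refund process, including boilerplate "
               "and legal disclaimers. Refunds are processed within 14 days.",
    "chunk-1": "Another long passage, mostly about shipping partners and delivery times.",
}

def summarize(text: str) -> str:
    # Stand-in: a real pipeline would call an LLM with a "summarize this passage" prompt.
    return text.split(".")[0]

# Index the *summaries*, but keep the chunk id so we can return the original text.
summary_index.add(
    ids=list(chunks.keys()),
    documents=[summarize(text) for text in chunks.values()],
)

# At query time: search over the concise summaries, hand the full chunks to the LLM.
hits = summary_index.query(query_texts=["How long do refunds take?"], n_results=1)
retrieved = [chunks[chunk_id] for chunk_id in hits["ids"][0]]
```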
When dealing with hierarchical documents, such as books or research papers, it’s essential to capture context for accurate information retrieval. Parent Document Retrieval involves indexing smaller chunks (e.g., paragraphs) in a vector database and, when retrieving a chunk, also fetching its parent document or surrounding sections. Alternatively, a windowed approach retrieves a chunk along with its neighboring chunks. Both methods ensure the retrieved information is understood within its broader context, improving relevance and comprehension.
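Here is a minimal sketch of the windowed variant, again assuming a local Chroma collection: each paragraph is indexed with its position, and every hit is returned together with its neighbors.

```python
import chromadb

client = chromadb.Client()
collection = client.create_collection("paragraphs")

# Paragraphs in document order; their list index doubles as their position.
paragraphs = [
    "Chapter 3 covers the evaluation protocol.",
    "We evaluate retrieval quality with recall@k on a held-out set.",
    "Generation quality is judged by human annotators.",
    "Chapter 4 discusses limitations of the approach.",
]
collection.add(
    ids=[str(i) for i in range(len(paragraphs))],
    documents=paragraphs,
    metadatas=[{"position": i} for i in range(len(paragraphs))],
)

def retrieve_with_window(query: str, k: int = 1, window: int = 1) -> list[str]:
    """Return each hit together with its neighboring paragraphs for context."""
    hits = collection.query(query_texts=[query], n_results=k)
    positions = sorted({
        p
        for meta in hits["metadatas"][0]
        for p in range(meta["position"] - window, meta["position"] + window + 1)
        if 0 <= p < len(paragraphs)
    })
    return [paragraphs[p] for p in positions]

print(retrieve_with_window("How is retrieval quality measured?"))
```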
Step 3: Going beyond semantic search
Vector databases effectively return the vectors associated with semantically similar documents. However, this is not necessarily what we want in all cases. Let’s say we are implementing a chatbot to answer questions about the Windows operating system, and a user asks, “Is Windows 8 any good?”
If we simply run a semantic search on our database of software reviews, we’ll most likely retrieve many reviews that cover a different version of Windows. This is because semantic similarity tends to break down when precise keyword matching matters. You can’t fix this unless you train your own embedding model for this specific use case, one that treats “Windows 8” and “Windows 10” as distinct entities. In most circumstances, this is too costly.
The best way to mitigate these issues is to adopt a hybrid search approach. Semantic search might be the more capable option in 80% of cases. However, for the other 20%, we can rely on more traditional word-matching systems that produce sparse vectors, like BM25 or TF-IDF.
Since we don’t know ahead of time which kind of search will perform better, in hybrid search, we don’t exclusively choose between semantic search and word-matching search. Instead, we combine results from both approaches to leverage their respective strengths. We determine the top matches by merging the results from each search tool or using a scoring system that incorporates the similarity scores from both systems. This approach allows us to benefit from the nuanced understanding of context provided by semantic search while capturing the precise keyword matches identified by traditional word-matching algorithms.
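To make the merging step concrete, here is a small sketch that fuses a BM25 ranking (via the rank_bm25 package) with a dense ranking (via sentence-transformers) using reciprocal rank fusion. The example documents and the fusion constant k=60 are assumptions of this sketch, not requirements of the approach:

```python
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer

docs = [
    "Windows 8 review: the new tile interface is confusing on desktops.",
    "Windows 10 review: a solid upgrade with a familiar start menu.",
    "Windows 11 brings rounded corners and stricter hardware requirements.",
]
query = "Is Windows 8 any good?"

# Sparse ranking: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([d.lower().split() for d in docs])
sparse_scores = bm25.get_scores(query.lower().split())
sparse_rank = sorted(range(len(docs)), key=lambda i: -sparse_scores[i])

# Dense ranking: cosine similarity of sentence embeddings.
model = SentenceTransformer("all-MiniLM-L6-v2")
dense_scores = model.encode(docs, normalize_embeddings=True) @ model.encode(query, normalize_embeddings=True)
dense_rank = sorted(range(len(docs)), key=lambda i: -dense_scores[i])

# Reciprocal rank fusion: documents ranked highly by either system float to the top.
K = 60  # conventional smoothing constant
rrf = {i: 0.0 for i in range(len(docs))}
for ranking in (sparse_rank, dense_rank):
    for position, doc_id in enumerate(ranking):
        rrf[doc_id] += 1.0 / (K + position + 1)

for doc_id in sorted(rrf, key=rrf.get, reverse=True):
    print(f"{rrf[doc_id]:.4f}  {docs[doc_id]}")
```

The point of the fusion step is that a document only needs to rank well in one of the two systems to surface, which is exactly the behavior managed hybrid search gives you out of the box.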
Vector databases are specifically designed for semantic search. However, most modern vector databases, like Qdrant and Pinecone, already support hybrid search approaches, making it extremely simple to implement these upgrades without significantly changing your previous systems or hosting two separate databases.
Step 4: Contextual compression and re-rankers
So far, we’ve talked about improving our usage of vector databases and search systems. However, especially when using hybrid search approaches, the sheer amount of retrieved context can confuse your LLM. Further, if the relevant documents end up buried deep in the prompt, the LLM is likely to ignore them.
An intermediate step of rearranging or compressing the retrieved context can mitigate this. After a preliminary similarity search that yields many documents, we re-rank them with a more precise relevance model (for example, a cross-encoder). Once again, we can decide to take the top n documents or define thresholds for what’s acceptable to send to the large language model.
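As a sketch of that re-ranking step, here is what it can look like with a cross-encoder from sentence-transformers. The specific public checkpoint is just a common choice, and Cohere's hosted re-ranker mentioned below plays the same role:

```python
from sentence_transformers import CrossEncoder

# Cross-encoder re-ranker: scores each (query, document) pair jointly,
# which is slower than embedding search but considerably more precise.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

query = "How long do refunds take?"
candidates = [  # e.g., the top results from the preliminary (hybrid) search stage
    "Refunds are processed within 14 days of the return request.",
    "Our support team is available Monday to Friday.",
    "Shipping usually takes 3-5 business days.",
]

scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(scores, candidates), reverse=True)

# Keep only the top documents (or apply a score threshold) before prompting the LLM.
top_context = [doc for _, doc in reranked[:2]]
```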
Another way to implement context pre-processing is to use a (usually smaller) LLM to decide which context is relevant for a particular purpose. This discards irrelevant examples that would only confuse the main model and drive up your costs.
I strongly recommend LangChain for implementing these features. They have an excellent implementation of Contextual Compression and support Cohere’s re-ranker, allowing you to integrate them into your applications easily.
Step 5: Fine-tuning Large Language Models for RAG
Fine-tuning and RAG tend to be presented as opposing concepts. However, practitioners have recently started combining both approaches.
The idea behind Retrieval-Augmented Fine-Tuning (RAFT) is that you start by building a RAG system, and as a final step of optimization, you train the LLM being used to handle this new retrieval system. This way, the model becomes less sensitive to mistakes in the retrieval process and more effective overall.
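Based on my reading of the RAFT paper, each fine-tuning sample pairs a question with a shuffled mix of the “oracle” document and distractor documents, plus a reference answer grounded in the oracle (the paper also uses chain-of-thought-style answers). A rough sketch of assembling such a sample, with made-up field names and data:

```python
import json
import random

def build_raft_example(question: str, oracle_doc: str, distractors: list[str], answer: str) -> dict:
    """Assemble one fine-tuning sample: question + shuffled oracle/distractor docs + answer."""
    docs = [oracle_doc] + distractors
    random.shuffle(docs)  # the model must learn to find the oracle among distractors
    context = "\n\n".join(f"[Document {i + 1}]\n{d}" for i, d in enumerate(docs))
    return {
        "prompt": f"{context}\n\nQuestion: {question}\nAnswer using only the documents above.",
        "completion": answer,
    }

sample = build_raft_example(
    question="How long do refunds take?",
    oracle_doc="Refunds are processed within 14 days of the return request.",
    distractors=["Shipping takes 3-5 business days.", "Support is open Mon-Fri."],
    answer="Refunds are processed within 14 days of the return request.",
)
print(json.dumps(sample, indent=2))  # one line of your fine-tuning dataset
```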
If you want to learn more about RAFT, I recommend this post by Cedric Vidal and Suraj Subramanian, which summarizes the original paper and discusses the practical implementation.
Into the future
Building Large Language Model (LLM) applications with vector databases is a game-changer for creating dynamic, context-rich interactions without costly retraining or fine-tuning.
We’ve covered the essentials of iterating on efficient LLM applications, from Naive RAG to more complex topics like hybrid search strategies and contextual compression.
I’m sure many new techniques will emerge in the upcoming years. I’m particularly excited about future developments in multi-modal RAG workflows and improvements in agentic RAG, which I think will fundamentally change how we interact with LLMs and computers in general.