LLM For Structured Data

Large Language Models (LLMs) can be used to extract insightful information from structured data, help users perform queries, and generate new datasets.

  • Retrieval-Augmented Generation is great for filtering data and extracting observations.
  • LLMs can be used to generate code that executes complex queries against structured datasets.
  • LLMs can also generate synthetic data with user-defined types and statistical properties.

It is estimated that 80% to 90% of the data worldwide is unstructured. However, when we look for data in a specific domain or organization, we often end up finding structured data. The most likely reason is that structured data is still the de facto standard for quantitative information.

Consequently, in the age of Large Language Models (LLMs), structured data is still relevant and will continue to be. Even Microsoft is working on adding LLMs to Excel!

LLMs are mostly used with unstructured data, particularly text, but with the proper tools, they can also help tackle tasks with structured data. Given some context or examples of the structured data in the prompt, together with a sentence stating what information we want to be retrieved, LLMs can get insights and patterns from the data, generate code to extract statistics and other metrics, or even generate new data with the same characteristics.

In this article, I’ll describe and demonstrate with examples three structured data use cases for LLMs, namely:

  • filtering data with Retrieval-Augmented Generation,
  • generating code for operations that span the entire dataset,
  • generating synthetic structured data.

Prerequisites for hands-on examples

All examples in this article use OpenAI GPT-3.5 Turbo, the OpenAI library, and LangChain on Python 3.11. We’ll use the well-known Titanic dataset, which you can download as a CSV file and load as a Pandas DataFrame.

I’ve prepared Jupyter notebooks for the three examples in the article. Note that you’ll need an API key to use the GPT-3.5 Turbo model, which you can create on the OpenAI platform after registering an account. This usage has costs, but running all the examples from this article will cost less than 5 cents (as per pricing in the fall of 2024).

Here are the complete installation instructions for all dependencies:
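A plausible install command covering the libraries the examples rely on looks like this (package names inferred from the imports used below; pin versions as needed):

```bash
pip install openai langchain langchain-openai langchain-community langchain-experimental chromadb pandas
```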

Use case 1: Filtering data with Retrieval-Augmented Generation

Structured datasets usually present large quantities of data organized in many rows and columns. When retrieving information from a dataset, one of the most common tasks is to find and filter rows that match specific criteria.

With SQL, we need to write a query with WHERE clauses that map our criteria. For example, to find the names of all teenagers on the Titanic:
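A query along these lines does the job (assuming a titanic table with Name and Age columns, and defining teenagers as passengers aged 13 to 19):

```sql
SELECT Name
FROM titanic
WHERE Age BETWEEN 13 AND 19;
```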

While this query is relatively straightforward, it can get complicated quickly as we add more conditions. Let’s see how LLMs can simplify this.

What is Retrieval-Augmented Generation?

Retrieval-Augmented Generation (RAG) is a common approach for building LLM applications. The idea is to include information from external documents in the prompt so that the LLM can use the additional context to provide better answers.

Including external data in the prompt solves two common issues with LLMs:

  • The knowledge cutoff problem arises because any LLM is trained with data curated up to a certain point in time. It will not be able to answer questions about events that happened after that date.
  • Hallucinations occur because an LLM generates its output based on probabilistic reasoning, which can make it produce factually incorrect content.

RAG can overcome these limitations because the additional information included in the prompt extends the knowledge base of the model and reduces the probability of it generating outcomes that are not aligned with the provided context.

Retrieval-Augmented Generation. The documents that provide external context are processed by an embedding model that converts the data into a numerical representation. These so-called embeddings are saved in a vector database. When the user queries the LLM, the same embedding model processes the query. Using the embedded query, a similarity search is conducted on the vector database, yielding the most similar documents. These documents, which represent information related to the user query, are added to the user’s query in their original, human-readable format, providing the necessary context for the LLM to return a more accurate answer. | Source: Author

Applying Retrieval-Augmented Generation to structured data

To understand how we can leverage RAG for structured data, let’s consider a simple tabular dataset with a few observations and variables.

The first step is to convert the data into a numeric representation so that the LLM can use it. This can be achieved by using an embedding model, whose goal is to transform each dataset observation into an abstract multi-dimensional numeric representation. All variables of the dataset are mapped to a new set of numeric features that capture characteristics and relationships of the data in a way that similar observations have similar embedding representations.

There are many pre-trained embedding models ready to be used. Choosing the one most suitable for our use case is not always easy: we can focus on specific characteristics of the model (e.g., size of the embedding representation), and we can check leaderboards to see what models achieve state-of-the-art results in different tasks (e.g., the MTEB Leaderboard).

To validate that the model we have chosen is suitable, we can assess the retrieval quality by checking the relevance of the retrieved observations considering the provided query. For example, we can select a query, retrieve the k most similar observations based on the embeddings, and calculate how many retrieved items are relevant (i.e., precision).
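As an illustration, a small helper for precision at k might look like this (a hypothetical sketch; it assumes you have manually labeled which observations are relevant to each test query):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """Fraction of the retrieved observations that are relevant to the query."""
    if not retrieved_ids:
        return 0.0
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant_ids)
    return hits / len(retrieved_ids)
```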

We store the embeddings representing our observations in a vector database. Afterward, when the user sends a query, we can fetch the most similar observations from the database and use them as context in the prompt. This works great for queries whose goal is to find and filter specific records. However, RAG is not suitable for computing statistics or summary information over the entire structured dataset.

Hands-on example: Finding data points with RAG

We start by importing all the required LangChain modules: the OpenAI wrappers, the document loader for CSV files, the Chroma vector database, and the RAG pipeline wrapper.
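A minimal sketch of those imports, assuming the langchain-openai and langchain-community integration packages:

```python
from langchain_openai import ChatOpenAI, OpenAIEmbeddings   # OpenAI chat and embedding wrappers
from langchain_community.document_loaders import CSVLoader  # loads each CSV row as a document
from langchain_community.vectorstores import Chroma         # Chroma vector database wrapper
from langchain.chains import RetrievalQA                    # RAG pipeline wrapper
```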

We then instantiate the client for the embedding model and load the Titanic dataset from the CSV file. Afterward, we use the model to generate the embedding representations of the data and save them into a Chroma vector database.
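Roughly as follows (the titanic.csv file path is an assumption, and an OPENAI_API_KEY environment variable is expected to be set):

```python
# Client for the embedding model
embedding_model = OpenAIEmbeddings()

# Load the Titanic dataset; each row becomes one document
documents = CSVLoader(file_path="titanic.csv").load()

# Embed the rows and store the vectors in a Chroma database
vector_db = Chroma.from_documents(documents, embedding_model)
```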

To create the RAG pipeline, we start by configuring a retriever for the vector database, specifying the number of records to be returned (in this case, five). We then load the OpenAI model wrapper and create the RAG chain. Setting chain_type="stuff" tells the pipeline to insert the retrieved documents in full into the prompt.
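Something like the following (a sketch, not the article’s exact code):

```python
# Return the five most similar rows for each query
retriever = vector_db.as_retriever(search_kwargs={"k": 5})

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# chain_type="stuff" inserts the retrieved documents verbatim into the prompt
rag_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
)
```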

Finally, we can send a query to the RAG pipeline. In this example, we ask for three Titanic passengers with at least two siblings or spouses aboard.
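For example (the exact phrasing of the question is up to you):

```python
query = (
    "Give me the name, age, and ticket class of three passengers "
    "who had at least two siblings or spouses aboard."
)
answer = rag_chain.invoke({"query": query})
print(answer["result"])
```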

A possible output would be:

1) Goodwin, Master. Sidney Leonard (Age 1, Class 3)

2) Andersson, Miss. Sigrid Elisabeth (Age 11, Class 3)

3) Rice, Master. Eric (Age 7, Class 3)

Use case 2: Code generation for operations with the entire dataset

Although RAG allows us to find and filter specific data, it fails when the goal is to extract global metrics or statistics that require access to the entire dataset. One way to address this limitation is by using the LLM to generate code that will extract the desired information.

This is usually achieved with prompt templates that include a predefined set of instructions, which will be completed with a small dataset sample (e.g., five rows), letting the LLM know the data structure. These instructions “configure” the LLM for its code generation purpose, and the user query will be translated into the respective code.

You can see below an example of a prompt template, where {first_5_rows} would be replaced by the first five observations of the dataset, and {input} would be replaced by the user query. 
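A template along these lines would work (the wording is illustrative, not LangChain’s built-in prompt):

```python
PROMPT_TEMPLATE = """
You are working with a pandas DataFrame in Python called `df`.
These are the first five rows of the dataset:

{first_5_rows}

Write and execute Python code on `df` to answer the following question:

{input}
"""
```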

Creating Pandas queries with LangChain

The LangChain library offers several predefined functions for different Python data types. For example, it has the create_pandas_dataframe_agent function that implements the described behavior for Pandas DataFrames. You can find the underlying prompt templates on GitHub.

The example below uses this function to extract a complex statistic based on the entire dataset. Bear in mind that this approach has a security risk: the code returned by the LLM is executed automatically to obtain the expected result. Therefore, if, for some reason, malicious code is returned, this may put your machine at risk. To mitigate this issue, follow the LangChain security guidelines and run your code with the minimum required permissions for the application.

Hands-on example: Analyzing a Pandas DataFrame with LLM-generated code

We start by importing Pandas and the required LangChain modules: the OpenAI wrapper and the create_pandas_dataframe_agent function.
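A sketch of those imports (create_pandas_dataframe_agent lives in the langchain-experimental package):

```python
import pandas as pd

from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_pandas_dataframe_agent
```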

We then load the Titanic dataset as a Pandas DataFrame from the CSV file, and initialize the OpenAI model. Afterward, we use create_pandas_dataframe_agent to create the agent that will operate the LLM so that it generates the required code.
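Roughly as follows (the titanic.csv path is an assumption):

```python
# Load the Titanic dataset into a DataFrame
df = pd.read_csv("titanic.csv")

llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)

# Agent that asks the LLM to write pandas code and then runs it on df
agent = create_pandas_dataframe_agent(
    llm,
    df,
    agent_type="tool-calling",
    allow_dangerous_code=True,  # executes LLM-generated code; see the security note above
    verbose=True,               # log the generated code and intermediate steps
)
```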

Setting agent_type="tool-calling" tells the agent to use OpenAI’s tool-calling interface, and allow_dangerous_code=True allows the automatic execution of the returned code.

Finally, we can send the query to the agent. In this example, we want to get the mean age per ticket class of the people aboard the Titanic.
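For instance, reusing the agent created above:

```python
response = agent.invoke(
    {"input": "What is the mean age per ticket class of the people aboard the Titanic?"}
)
print(response["output"])
```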

We can see from the logs that the LLM generates the correct code to extract this information from a Pandas DataFrame. The output is:

The mean age per ticket class is:

– Ticket Class 1: 38.23 years

– Ticket Class 2: 29.88 years

– Ticket Class 3: 25.14 years

Use case 3: Synthetic structured data generation

When working with structured datasets, it is common to need more data with the same characteristics: we may need to augment training data for a machine learning model or generate anonymized data to protect sensitive information.

A common solution to address this need is to generate synthetic data with the same characteristics as the original dataset. LLMs can be applied to generate high-quality synthetic structured data without requiring any pre-training, which is a major advantage compared to previous methods like generative adversarial networks (GANs).

Using LLMs to generate synthetic data

One way to allow LLMs to excel at a task or in a domain for which they were not trained is to provide the necessary context within the prompt. Let’s say I want to ask an LLM to write a summary about me. Without any context, it will not be able to do it because it hasn’t learned any information about me in training. However, if I include a few sentences with details about myself in the prompt, the model will use that information to write a better summary.

Leveraging this idea, an easy solution to perform synthetic data generation with LLMs is to provide information about the data characteristics as part of the prompt. We can, for example, provide descriptive statistics as context for the model to use. We could include the entire original dataset in the prompt, assuming that the context window is big enough to fit all tokens and that we accept the performance and monetary costs. In practice, sending all data becomes unfeasible due to these limitations. Therefore, including summarized information is a more common approach. 

In the following example, we include a Pandas-generated table with the descriptive statistics of the numeric columns in the prompt and ask for ten new synthetic data points.

Hands-on example: Generating synthetic data points with GPT-3.5 Turbo

We start by importing Pandas and the OpenAI library:
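A sketch of those imports (using the v1 client of the OpenAI Python library):

```python
import pandas as pd
from openai import OpenAI
```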

We then load the dataset CSV file and save the descriptive statistics table into a string variable.
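Something along these lines (the titanic.csv path is an assumption):

```python
df = pd.read_csv("titanic.csv")

# Descriptive statistics of the numeric columns, rendered as a plain-text table
stats_table = df.describe().to_string()
```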

Finally, we instantiate the OpenAI client, send a system prompt that includes the descriptive statistics table, and ask for ten synthetic examples of people aboard the Titanic.
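A sketch of that call (the system prompt wording is illustrative, and an OPENAI_API_KEY environment variable is expected):

```python
client = OpenAI()

system_prompt = (
    "You generate synthetic passenger records for the Titanic dataset. "
    "The numeric columns have the following descriptive statistics:\n"
    f"{stats_table}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": "Generate 10 synthetic data observations of people aboard the Titanic.",
        },
    ],
)
print(response.choices[0].message.content)
```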

A possible output would be:

Sure! Here are 10 synthetic data observations generated based on the provided descriptive statistics:

[Table: 10 synthetic data observations generated based on the provided descriptive statistics]

Conclusion

As we saw through the use cases presented in the article, LLMs can do great things with structured data: they can find and filter specific data of interest, replacing the need for SQL queries; they can generate code capable of extracting meaningful statistics from the entire dataset; and they can be used to generate synthetic data with the same characteristics as the original data.

All the presented use cases exemplify common tasks that data scientists and analysts perform daily. Analyzing and understanding the data they are working with is probably the most time-consuming task these professionals face. As one of these practitioners, I find it much easier to select the data that I need through a simple English question than to write a complex SQL query. The same can be said about developing code to extract statistics and insights from a dataset or to generate new similar data. We’ve seen in this article that LLMs can perform these tasks quite well, achieving the desired results with minimal configuration and few libraries.

The future looks bright for the use of LLMs for structured data, with a lot of interesting challenges yet to be addressed. For example, LLMs sometimes generate inaccurate results, which may be hard to detect, particularly when considering the natural stochasticity of the model. In the presented use cases, this can be an issue: if I ask for specific criteria to be met, I need to be sure that the LLM complies. Strategies like RAG mitigate this behavior, but there is still work to be done.
