Conventional data analysis workflows struggle to keep pace with the growing volume and complexity of modern data sources: they are often time-consuming, error-prone, and dependent on specialized expertise. Automating these processes can significantly reduce the time and resources needed for data analysis.
Retrieval-augmented generation (RAG)-enabled large language models (LLMs) offer a practical solution to these challenges. By combining natural language processing (NLP) with the ability to retrieve relevant information from external sources and integrate it into a response, RAG-enabled LLMs can access and process vast amounts of structured, semi-structured, and unstructured data. Organizations can use them to automate tasks such as summarizing data trends, identifying correlations and patterns, and detecting anomalies, significantly reducing the time and effort required for data analysis while improving accuracy and efficiency.
Here, you’ll explore how RAG-enabled LLMs can automate data analysis and transform the way we work with data.
Prerequisites
- An understanding of Python, data analysis, and how large language models work.
- An OpenAI account and API key for accessing their LLMs
- You can find the data required to follow along on the GitHub repository.
- Development Environment: Create and activate a virtual environment with Python in your workspace directory. Then, proceed to install the required packages.
python -m venv .venv && source .venv/bin/activate
pip3 install pypdf python-dotenv pandas beautifulsoup4 requests llama-index==0.9.48
What are RAG-Enabled LLMs?
Large language models (LLMs) have become robust tools for processing natural language: they can fluently comprehend, produce, and modify human language, and their context windows continue to grow.
The architecture of LLMs is a crucial factor in their versatility. Because data from external sources can be integrated seamlessly, an LLM can serve as a foundation to build on.
This connection to external data is what enables RAG-enabled LLMs, a prevalent paradigm for bringing external data into an LLM's responses.
Retrieval-Augmented Generation (RAG)
A primary motivation for RAG is that LLMs do not have access to all the data you might care about, such as private data or data generated after the LLM was trained.
RAG combines the strengths of both generative and retrieval-based models. Its architecture consists of multiple components, including Query transformations, Routing, Query construction, Indexing, Retrieval, and Generation. The focus here will be on the last three:
- Indexing: This involves creating an index or a searchable database of a collection of related data or documents after chunking and embedding them. The indexing mechanism enhances the model’s retrieval of accurate information.
- Retrieval: Unlike traditional LLMs, RAG incorporates a retrieval mechanism responsible for retrieving relevant information from external knowledge sources, such as text databases or pre-indexed corpora. This mechanism serves as additional context for the generative component.
- Generation: This component functions similarly to conventional LLMs, generating text based on input prompts and internal knowledge.
This architecture sets RAG-enabled LLMs apart, making them a powerful tool for data analysis.
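To make these pieces concrete, here is a deliberately simplified, framework-free sketch of the retrieve-then-augment flow. It uses naive word overlap in place of real embeddings, and it stops at building the augmented prompt that would be sent to the LLM:

def score(doc: str, query: str) -> float:
    ## toy relevance score: word overlap; real systems use vector similarity
    doc_words = set(doc.lower().split())
    query_words = set(query.lower().split())
    return len(doc_words & query_words) / (len(query_words) or 1)

def rag_answer(query: str, documents: list[str], top_k: int = 2) -> str:
    ## retrieval: rank the documents by relevance to the query
    ranked = sorted(documents, key=lambda d: score(d, query), reverse=True)
    context = "\n".join(ranked[:top_k])
    ## generation: in a real pipeline, this augmented prompt is sent to an LLM
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

Real systems replace the overlap score with embedding similarity and hand the final prompt to a generative model, but the three stages (index, retrieve, generate) are exactly these.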
Overcoming Limitations of Conventional LLMs
A common issue with conventional LLMs is their limited access to data, limited knowledge representation, and hallucinations, which result in unrelated or irrelevant responses. By integrating external knowledge, RAG-enabled LLMs overcome these difficulties and give the model access to more recent information beyond its pre-training data, allowing for more coherent, relevant, and factually accurate outputs.
Role of Vector Databases in RAG Systems
Vector databases play a crucial role in RAG systems by facilitating efficient retrieval of relevant information. These databases organize textual information into high-dimensional vector representations, enabling fast and accurate similarity searches. Vector representations capture semantic relationships between documents, allowing RAG models to retrieve contextually relevant information for a given query prompt.
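As a toy illustration of the store-and-search pattern (not any particular database's API), a vector store can be reduced to two operations: add a vector with its payload, and return the payloads nearest to a query vector.

import numpy as np

class ToyVectorStore:
    ## minimal in-memory vector store: add vectors, query by cosine similarity
    def __init__(self):
        self.vectors: list[np.ndarray] = []
        self.payloads: list[str] = []

    def add(self, vector: np.ndarray, payload: str) -> None:
        ## store unit-normalized vectors (assumes non-zero input)
        self.vectors.append(vector / np.linalg.norm(vector))
        self.payloads.append(payload)

    def query(self, vector: np.ndarray, top_k: int = 3) -> list[str]:
        ## dot product of unit vectors equals cosine similarity
        q = vector / np.linalg.norm(vector)
        scores = np.array([v @ q for v in self.vectors])
        best = np.argsort(scores)[::-1][:top_k]
        return [self.payloads[i] for i in best]

Production vector databases add persistence, approximate nearest-neighbor indexes, and metadata filtering on top of this core idea.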
Applications of RAG-Enabled LLMs in Data Analysis
You’ll see how to integrate external data with OpenAI’s LLMs to implement RAG-enabled data analysis in practice. You’ll work with structured (JSON file containing countries’ GDP for 2023), semi-structured (HTML containing info about a country’s economy), and unstructured (PDF containing info about a country’s economy) data.
In your project directory, create the following files and directories:
.
├── data
│ ├── Economy_of_China.html
│ ├── Economy_of_Nigeria.pdf
│ ├── gdp_2023.json
└── .env
└── main.py
1 directory, 5 files
The main.py file is the project’s entry point, the data folder holds project-related data, and the .env file holds the API key (OPENAI_API_KEY=<key>).
Working With Structured Data
Structured data comes in a predefined manner and adheres to a specific schema, making it straightforward to analyze. Here, you’ll work with JSON data loaded into a pandas dataframe.
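The exact schema is defined by gdp_2023.json in the repository; judging from the fields the code and prompts below rely on (country, region, year, and GDP), each record looks something like this (values are illustrative):

[
  {"country": "United States", "region": "Americas", "year": "2023", "gdp": 26854599},
  {"country": "China", "region": "Asia", "year": "2023", "gdp": 17794782},
  ...
]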
Add the following code to the main.py file.
## import the required libraries (1)
import os
import pandas as pd
from llama_index.query_engine import PandasQueryEngine
from llama_index.tools import QueryEngineTool, ToolMetadata
from llama_index.agent import ReActAgent
from llama_index.llms import OpenAI
from dotenv import load_dotenv

## load the .env file so the OpenAI API key is available (2)
load_dotenv()

## construct the file path (3)
gdp_file_path = os.path.join("data", "gdp_2023.json")

## load the data into a dataframe (4)
gdp_df = pd.read_json(
    gdp_file_path,
    orient="records",
    dtype={"gdp": int},
    convert_dates=["year"],
)

## create a query engine over the dataframe (5)
gdp_query_engine = PandasQueryEngine(df=gdp_df, verbose=True)

## list of query engine tools (6)
tools = [
    QueryEngineTool(
        query_engine=gdp_query_engine,
        metadata=ToolMetadata(
            name="countries_gdp_data",
            description="This provides data regarding the GDP of nations.",
        ),
    ),
]
The code sets up a query engine for the GDP data loaded into a pandas dataframe and defines a tool for querying the data.
Here, we use the PandasQueryEngine as a wrapper interface over the gdp_df dataframe to create a query engine (gdp_query_engine). This gives you an interface for asking questions about the GDP data in natural language. With the right query engine, you can create a similar generic interface for any data you want to interact with.
Using a Python list (tools), we define a list of QueryEngineTool instances, which represent tools for querying the external data. QueryEngineTool wraps gdp_query_engine with metadata. You can register more than one query engine tool; the agent accesses the list and picks the right tool for a task.
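Under the hood, PandasQueryEngine asks the LLM to translate the natural-language question into a pandas expression and then evaluates it against the dataframe. For a question like “Which region has the highest GDP?”, the generated expression might look something like this (illustrative; the exact code depends on the model and the dataframe’s columns):

gdp_df.groupby("region")["gdp"].sum().idxmax()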
Add the following code to the main.py file below the existing code.
## load the llm and create the agent (7)
llm = OpenAI(model="gpt-4")
agent = ReActAgent.from_tools(tools, llm=llm, verbose=True)

## take input prompts in a loop (8)
while (prompt := input("Enter a prompt (q to quit): ")) != "q":
    result = agent.query(prompt)
    print(result)
The code creates an agent with a GPT-4 language model and enters a loop that prompts the user for input.
You can use any Large Language Model of your choice by importing the appropriate model class and passing the model’s name.
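For example, switching to a cheaper OpenAI model is a one-line change:

## use gpt-3.5-turbo instead of gpt-4
llm = OpenAI(model="gpt-3.5-turbo")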
Using the ReActAgent class, you create an agent that can access and interact with the list of query engine tools. The from_tools method initializes the agent with the list of tools and the LLM instance.
Execute the main.py file. As a prompt, input “Which region has the highest GDP?”.
Note how the agent uses the tool we provided to answer the question, working through the necessary intermediate steps to arrive at a correct answer.
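With verbose=True, the agent prints its reasoning loop to the console. Paraphrased (the exact wording and formatting vary by version and run), it looks roughly like this:

Thought: I need to use a tool to help me answer the question.
Action: countries_gdp_data
Action Input: {"input": "Which region has the highest GDP?"}
Observation: <result of the pandas query>
Thought: I can answer without using any more tools.
Answer: <the agent's final answer>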
Using a Prompt Template
You can use a prompt template to provide additional information on processing prompts and queries, provide context, specify the task, guide the response, reduce ambiguity, and improve accuracy. If one is not provided, the default prompt template is used.
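Query engines expose their current templates through the same prompt-mixin interface as update_prompts (used below), so you can inspect the default before replacing it; roughly:

## print the prompt templates currently used by the query engine
for name, template in gdp_query_engine.get_prompts().items():
    print(name)
    print(template.get_template())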
Create a prompts.py file in your project directory with the following code.
from llama_index import PromptTemplate
instruction_str = """\
1. Convert the query to executable Python code using Pandas.
2. The final line of code should be a Python expression that can be called with the `eval()` function.
3. The code should represent a solution to the query.
4. PRINT ONLY THE EXPRESSION.
5. Do not quote the expression."""
new_prompt = PromptTemplate(
    """\
You are working with a pandas dataframe in Python.
The name of the dataframe is `gdp_df`.
This is the result of `print(gdp_df.head())`:
{df_str}

Follow these instructions:
{instruction_str}
Query: {query_str}

Expression: """
)
You can embed instructions, prompts, and queries within a prompt template, optimizing both the input to the agent and its responses.
Update the initialization of gdp_query_engine in the main.py file to include an instruction_str:
from prompts import new_prompt, instruction_str
...
...
## create query engine (5)
gdp_query_engine = PandasQueryEngine(
    df=gdp_df, verbose=True, instruction_str=instruction_str
)
gdp_query_engine.update_prompts({"pandas_prompt": new_prompt})
This way, you’ll have provided additional instructions and updated the prompt template.
Execute the main.py file. As a prompt, input “Which country has the highest GDP, and to which region does it belong?”.
With this, you can customize your prompt templates to fit your data analysis needs and requirements.
Working With Semi-Structured and Unstructured Data
Semi-structured data is information that doesn’t necessarily reside in a database but still has some organizational properties that make it easier to analyze. Unstructured data presents a significant challenge in data analysis due to its lack of a predefined data model.
Here, you’ll see how to build an agent to analyze semi-structured data (HTML containing information about a country’s economy) and unstructured data (PDF containing information about a country’s economy) using natural language processing (NLP) techniques, embedding, vectorization, and indexing.
Create a readers.py in your project directory with the following code.
import os
from dotenv import load_dotenv
from llama_index import StorageContext, VectorStoreIndex, load_index_from_storage
from llama_index.readers import PDFReader, HTMLTagReader

load_dotenv()

## build a vector index from the documents and persist it to disk
def create_and_persist_index(data, index_name):
    print("generating index...", index_name)
    index = VectorStoreIndex.from_documents(data, show_progress=True)
    index.storage_context.persist(persist_dir=index_name)
    return index

## load a previously persisted index from disk
def load_existing_index(index_name):
    storage_context = StorageContext.from_defaults(persist_dir=index_name)
    return load_index_from_storage(storage_context)

## reuse an existing index if present; otherwise create and persist one
def get_index(data, index_name):
    if not os.path.exists(index_name):
        return create_and_persist_index(data, index_name)
    else:
        return load_existing_index(index_name)

html_path = os.path.join("data", "Economy_of_China.html")
china_html = HTMLTagReader(tag="main").load_data(file=html_path)
china_index = get_index(china_html, "china")
china_engine = china_index.as_query_engine()

pdf_path = os.path.join("data", "Economy_of_Nigeria.pdf")
nigeria_pdf = PDFReader().load_data(file=pdf_path)
nigeria_index = get_index(nigeria_pdf, "nigeria")
nigeria_engine = nigeria_index.as_query_engine()
The code creates and manages a vector-based store and search index for the HTML and PDF documents. A vector store index is a data structure and retrieval system that uses vector representations of data to perform searches. It relies on embedding data into high-dimensional vectors that capture the semantic essence of the content, allowing for more nuanced and refined searching.
- create_and_persist_index creates a new index from the provided documents using VectorStoreIndex.from_documents, persists it to the specified directory (persist_dir=index_name), and returns the index.
- load_existing_index builds a storage context object (storage_context) from the specified directory (persist_dir=index_name) using StorageContext.from_defaults, loads the index from the storage context using load_index_from_storage, and returns it.
- get_index checks whether an index already exists by looking for a directory with the name index_name. It loads and returns the index if it exists or creates and persists one if it doesn't.
- As a final step, the documents are processed. The file paths to the documents are defined, the documents are read using the appropriate readers (HTMLTagReader and PDFReader), and get_index is called to either create a new index or load an existing one. Lastly, each index is converted to a query engine, which can be tuned as shown below. Llama Hub provides different kinds of readers to suit your needs.
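By default, as_query_engine() retrieves a small, fixed number of the most similar chunks for each query. If answers seem to miss relevant context, one knob worth knowing is similarity_top_k; a minimal tweak, assuming the defaults otherwise:

## retrieve more chunks per query for broader context
china_engine = china_index.as_query_engine(similarity_top_k=5)
nigeria_engine = nigeria_index.as_query_engine(similarity_top_k=5)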
The list of query engine tools in main.py needs to be updated to include the china_engine and nigeria_engine query engines.
Add the following code to the main.py file.
from readers import china_engine, nigeria_engine
...
...
tools = [
    ...
    ...
    QueryEngineTool(
        query_engine=nigeria_engine,
        metadata=ToolMetadata(
            name="nigeria_economy_data",
            description="This gives information about the economy of the Federal Republic of Nigeria.",
        ),
    ),
    QueryEngineTool(
        query_engine=china_engine,
        metadata=ToolMetadata(
            name="china_economy_data",
            description="This gives information about the economy of the People's Republic of China.",
        ),
    ),
]
Execute the main.py file. The first run takes a few minutes to generate the embeddings and vector indexes for the documents and store them in the specified folders; subsequent runs reuse the persisted indexes unless they are deleted.
As a prompt, input “What is the difference between the economy of China and Nigeria?”.
Again, the agent uses the tools provided to guide its reasoning and answer the question.
Enhancing the Capabilities of the Agent
LlamaIndex provides a FunctionTool class that allows you to wrap a Python function and turn it into a reusable tool or engine.
Let’s implement utility functions that download images from the web and take notes for the user.
Create a utils.py in your project directory with the following code.
import os
import requests
from llama_index.tools import FunctionTool

def download_image(url: str):
    ## derive the file name from the URL and save the image under data/
    file_name = url.split("/")[-1]
    response = requests.get(url)
    response.raise_for_status()
    file_path = os.path.join("data", file_name)
    with open(file_path, "wb") as file:
        file.write(response.content)
    return file_path

def save_note_to_file(note: str):
    ## append the note to data/notes.txt ("a" mode creates the file if needed)
    note_file_path = os.path.join("data", "notes.txt")
    with open(note_file_path, "a") as file:
        file.write(note + "\n")
    return note

download_engine = FunctionTool.from_defaults(
    fn=download_image,
    name="download_image",
    description="This tool allows the user to download images at a specified URL",
)

note_engine = FunctionTool.from_defaults(
    fn=save_note_to_file,
    name="make_note",
    description="This tool allows the user to save a text note to a file.",
)
The list of tools in main.py needs to be updated to include the download_engine and note_engine tools.
from utils import download_engine, note_engine
...
...
tools = [
    download_engine,
    note_engine,
    ...
    ...
    ...
]
Execute the main.py file and prompt the agent to save a note for you. A notes.txt file containing the agent's notes is created in the data folder.
Similar to taking notes, the agent can download an image from a given URL. After such a prompt, you should see the downloaded image file (WxHixX4.jpg in this example) created in the data folder.
This way, you can create utility functions (tools) that can perform various tasks, such as data processing, calculations, or interactions with external systems.
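For example, a simple calculation tool, a hypothetical addition that is not part of the project code, follows the same FunctionTool pattern:

from llama_index.tools import FunctionTool

def compute_growth_rate(old_value: float, new_value: float) -> float:
    ## percentage change between two values (hypothetical helper)
    return (new_value - old_value) / old_value * 100

growth_engine = FunctionTool.from_defaults(
    fn=compute_growth_rate,
    name="compute_growth_rate",
    description="Computes the percentage growth between an old and a new value.",
)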
Conclusion
Here, we’ve covered the concept of RAG-enabled LLMs for data analysis. RAG-enabled LLMs mark a new frontier in leveraging the power of language models for data-driven tasks. By blending LLMs’ natural language processing skills with retrieval methods, RAG models provide a new way to automate data analysis. This combination enables companies to manage datasets of different structures more effectively, uncover valuable insights with less human involvement, and make data analysis available to users of varying skill levels.