For more ideas on how to improve the performance of your RAG pipeline and make it production-ready, continue reading here:
This section covers the required packages and API keys you need to follow along with this article.
Required Packages
This article will guide you through implementing both a naive and an advanced RAG pipeline using LlamaIndex in Python.
pip install llama-index
In this article, we will be using LlamaIndex v0.10. If you are upgrading from an older LlamaIndex version, you need to run the following commands to install and run LlamaIndex properly:
pip uninstall llama-index
pip install llama-index --upgrade --no-cache-dir --force-reinstall
LlamaIndex offers an option to store vector embeddings locally in JSON files for persistent storage, which is great for quickly prototyping an idea. However, we will use a vector database for persistent storage since advanced RAG techniques aim for production-ready applications.
Since we will need metadata storage and hybrid search capabilities in addition to storing the vector embeddings, we will use the open source vector database Weaviate (v3.26.2), which supports these features.
pip install weaviate-client llama-index-vector-stores-weaviate
API Keys
We will be using Weaviate Embedded, which you can use for free without registering for an API key. However, this tutorial uses an embedding model and LLM from OpenAI, for which you will need an OpenAI API key. To obtain one, you need an OpenAI account and then “Create new secret key” under API keys.
Next, create a local .env file in your root directory and define your API keys in it:
OPENAI_API_KEY=""
Afterwards, you can load your API keys with the following code:
# !pip install python-dotenv
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
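Optionally, you can add a quick sanity check that the key was actually picked up from the .env file (a minimal sketch; the assertion message is arbitrary):
# Optional sanity check: confirm the key is available as an environment variable
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY not found - check your .env file"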
This section discusses how to implement a naive RAG pipeline using LlamaIndex. You can find the complete naive RAG pipeline in this Jupyter Notebook. For the implementation using LangChain, you can continue in this article (naive RAG pipeline using LangChain).
Step 1: Define the embedding model and LLM
First, you can define an embedding model and LLM in a global settings object. Doing this means you don’t have to specify the models explicitly in the code again.
- Embedding model: used to generate vector embeddings for the document chunks and the query.
- LLM: used to generate an answer based on the user query and the relevant context.
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding()
Step 2: Load data
Next, you will create a local directory named data in your root directory and download some example data from the LlamaIndex GitHub repository (MIT license).
!mkdir -p 'data'
!wget '' -O 'data/paul_graham_essay.txt'
Afterward, you can load the data for further processing:
from llama_index.core import SimpleDirectoryReader

# Load data
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()
Step 3: Chunk documents into nodes
As the whole document is too large to fit into the context window of the LLM, you will need to partition it into smaller text chunks, which are called Nodes in LlamaIndex. You can parse the loaded documents into nodes using the SimpleNodeParser with a defined chunk size of 1024.
from llama_index.core.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)
# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)
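As a quick optional check, you can inspect how many nodes were created and what the first one looks like (a minimal sketch assuming the standard get_content() accessor on nodes):
# Optional: inspect the chunking result
print(f"Created {len(nodes)} nodes")
print(nodes[0].get_content()[:200])  # first 200 characters of the first chunk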
Step 4: Construct index
Next, you’ll construct the index that stores all of the external knowledge in Weaviate, an open source vector database.
First, you will need to connect to a Weaviate instance. In this case, we are using Weaviate Embedded, which allows you to experiment in notebooks for free without an API key. For a production-ready solution, deploying Weaviate yourself, e.g., via Docker, or using a managed service is recommended.
import weaviate

# Connect to your Weaviate instance
client = weaviate.Client(
    embedded_options=weaviate.embedded.EmbeddedOptions(),
)
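Before building the index, you can optionally verify that the embedded instance is up and running (a quick sanity check using the v3 client's is_ready() method):
# Optional: check that the embedded Weaviate instance is reachable
print(client.is_ready())  # should print True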
Next, you will build a VectorStoreIndex from the Weaviate client to store your data in and interact with.
from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore

index_name = "MyExternalContext"

# Construct vector store
vector_store = WeaviateVectorStore(
    weaviate_client=client,
    index_name=index_name,
)

# Set up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Set up the index
# build a VectorStoreIndex that takes care of chunking documents
# and encoding chunks to embeddings for future retrieval
index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
)
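As an aside, if you preferred the local JSON storage option mentioned in the setup section over Weaviate, building, persisting, and reloading the index would look roughly like this (a minimal sketch; local_index and ./storage are arbitrary names and are not used in the rest of this walkthrough):
from llama_index.core import load_index_from_storage

# Build an index with LlamaIndex's default local storage instead of Weaviate
local_index = VectorStoreIndex(nodes)

# Persist it to JSON files on disk
local_index.storage_context.persist(persist_dir="./storage")

# Later: reload it from the same directory
local_index = load_index_from_storage(
    StorageContext.from_defaults(persist_dir="./storage")
)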
Step 5: Set up query engine
Lastly, you will set up the index as the query engine.
# The QueryEngine class is provided with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()
Step 6: Run a naive RAG query on your data
Now, you can run a naive RAG query on your data, as shown below:
# Run your naive RAG query
response = query_engine.query(
"What happened at Interleaf?"
)
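To inspect the result, you can print the generated answer together with the retrieved source chunks (a minimal sketch; LlamaIndex response objects expose response and source_nodes attributes):
# Print the generated answer
print(response.response)

# Print the retrieved context chunks with their similarity scores
for source_node in response.source_nodes:
    print(source_node.score, source_node.get_content()[:100])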
In this section, we will cover some simple adjustments you can make to turn the above naive RAG pipeline into an advanced one. This walkthrough will cover the following selection of advanced RAG techniques:
- Sentence window retrieval
- Hybrid search
- Reranking
As we will only cover the modifications here, you can find the complete end-to-end advanced RAG pipeline in this Jupyter Notebook.
For the sentence window retrieval technique, you need to make two adjustments: First, you must adjust how you store and post-process your data. Instead of the SimpleNodeParser, we will use the SentenceWindowNodeParser.
from llama_index.core.node_parser import SentenceWindowNodeParser

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)
The SentenceWindowNodeParser does two things (see the sketch after this list):
- It separates the document into single sentences, which will be embedded.
- For each sentence, it creates a context window. If you specify a window_size = 3, the resulting window will be three sentences long, starting at the sentence before the embedded sentence and spanning the sentence after. The window will be stored as metadata.
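Here is a minimal sketch of what that looks like in practice, reusing the documents loaded in Step 2 and the metadata keys defined above (the node index 10 is an arbitrary pick):
# Illustrative: parse the documents and compare a single sentence to its window
window_nodes = node_parser.get_nodes_from_documents(documents)

sample = window_nodes[10]
print(sample.text)                # the single sentence that gets embedded
print(sample.metadata["window"])  # the surrounding sentences stored as metadata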
During retrieval, the sentence that most closely matches the query is returned. After retrieval, you need to replace the sentence with the full window from the metadata by defining a MetadataReplacementPostProcessor and using it in the list of node_postprocessors.
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# The target key defaults to `window` to match the node_parser's default
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

...

query_engine = index.as_query_engine(
    node_postprocessors=[postproc],
)
Implementing a hybrid search in LlamaIndex is as easy as two parameter changes to the query_engine, provided the underlying vector database supports hybrid search queries. The alpha parameter specifies the weighting between vector search and keyword-based search, where alpha=0 means pure keyword-based search and alpha=1 means pure vector search.
query_engine = index.as_query_engine(
    ...,
    vector_store_query_mode="hybrid",
    alpha=0.5,
    ...
)
Adding a reranker to your advanced RAG pipeline only takes three easy steps:
- First, define a reranker model. Here, we are using BAAI/bge-reranker-base from Hugging Face.
- In the query engine, add the reranker model to the list of node_postprocessors.
- Increase the similarity_top_k in the query engine to retrieve more context passages, which can be reduced to top_n after reranking.
# !pip install torch sentence-transformers
from llama_index.core.postprocessor import SentenceTransformerRerank

# Define reranker model
rerank = SentenceTransformerRerank(
    top_n=2,
    model="BAAI/bge-reranker-base",
)

...

# Add reranker to query engine
query_engine = index.as_query_engine(
    similarity_top_k=6,
    ...,
    node_postprocessors=[rerank],
    ...,
)
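With all three adjustments in place, you can query the advanced pipeline exactly as in Step 6 of the naive version:
# Run your advanced RAG query
response = query_engine.query(
    "What happened at Interleaf?"
)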