Advanced Retrieval-Augmented Generation: From Theory to LlamaIndex Implementation
What’s Advanced RAG
Pre-retrieval optimization
Retrieval optimization
Post-retrieval optimization
Prerequisites
Implementing Naive RAG with LlamaIndex
Implementing Advanced RAG with LlamaIndex
Indexing optimization example: Sentence window retrieval
Retrieval optimization example: Hybrid search
Post-retrieval optimization example: Re-ranking
Summary
Enjoyed This Story?
Disclaimer
References

For more ideas on how to improve the performance of your RAG pipeline to make it production-ready, continue reading here:

This section discusses the required packages and API keys needed to follow along in this article.

Required Packages

This article will guide you through implementing a naive and an advanced RAG pipeline using LlamaIndex in Python.

pip install llama-index

In this article, we will be using LlamaIndex v0.10. If you are upgrading from an older LlamaIndex version, you need to run the following commands to install and run LlamaIndex properly:

pip uninstall llama-index
pip install llama-index --upgrade --no-cache-dir --force-reinstall
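
If you want to double-check which version ended up installed, one option (a small sanity check not part of the original instructions) is to query the package metadata with Python's standard library:

# Print the installed llama-index version via importlib.metadata
from importlib.metadata import version

print(version("llama-index"))  # expected to start with 0.10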

LlamaIndex offers an option to store vector embeddings locally in JSON files for persistent storage, which is great for quickly prototyping an idea. However, we will use a vector database for persistent storage since advanced RAG techniques aim for production-ready applications.

Since we will need metadata storage and hybrid search capabilities in addition to storing the vector embeddings, we will use the open-source vector database Weaviate (v3.26.2), which supports these features.

pip install weaviate-client llama-index-vector-stores-weaviate

API Keys

We will be using Weaviate Embedded, which you can use for free without registering for an API key. However, this tutorial uses an embedding model and LLM from OpenAI, for which you will need an OpenAI API key. To obtain one, you need an OpenAI account and then “Create new secret key” under API keys.

Next, create a local .env file in your root directory and define your API keys in it:

OPENAI_API_KEY=""

Afterwards, you can load your API keys with the following code:

# !pip install python-dotenv
import os
from dotenv import load_dotenv, find_dotenv

load_dotenv(find_dotenv())
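
If the environment variable does not get picked up, the OpenAI calls later on will fail, so it can be worth confirming that the key was loaded. The following check is a small addition on top of the article's code and only prints whether the variable exists, never the key itself:

# Confirm the API key was loaded into the environment (prints True/False)
print("OPENAI_API_KEY" in os.environ)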

This section discusses how to implement a naive RAG pipeline using LlamaIndex. You can find the complete naive RAG pipeline in this Jupyter Notebook. For the implementation using LangChain, you can continue in this article (naive RAG pipeline using LangChain).

Step 1: Define the embedding model and LLM

First, you can define an embedding model and LLM in a global settings object. Doing this means you don't have to specify the models explicitly in the code again.

  • Embedding model: used to generate vector embeddings for the document chunks and the query.
  • LLM: used to generate an answer based on the user query and the relevant context.

from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from llama_index.core.settings import Settings

Settings.llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
Settings.embed_model = OpenAIEmbedding()

Step 2: Load data

Next, you will create a local directory named data in your root directory and download some example data from the LlamaIndex GitHub repository (MIT license).

!mkdir -p 'data'
!wget '' -O 'data/paul_graham_essay.txt'

Afterwards, you can load the data for further processing:

from llama_index.core import SimpleDirectoryReader

# Load data
documents = SimpleDirectoryReader(
    input_files=["./data/paul_graham_essay.txt"]
).load_data()

Step 3: Chunk documents into nodes

As the whole document is too large to fit into the context window of the LLM, you will need to partition it into smaller text chunks, which are called Nodes in LlamaIndex. You can parse the loaded documents into nodes using the SimpleNodeParser with a defined chunk size of 1024.

from llama_index.core.node_parser import SimpleNodeParser

node_parser = SimpleNodeParser.from_defaults(chunk_size=1024)

# Extract nodes from documents
nodes = node_parser.get_nodes_from_documents(documents)

Step 4: Construct index

Next, you will build the index that stores all the external knowledge in Weaviate, an open-source vector database.

First, you will need to connect to a Weaviate instance. In this case, we are using Weaviate Embedded, which allows you to experiment in notebooks for free without an API key. For a production-ready solution, deploying Weaviate yourself, e.g., via Docker, or using a managed service is recommended.

import weaviate

# Connect to your Weaviate instance
client = weaviate.Client(
    embedded_options=weaviate.embedded.EmbeddedOptions(),
)

Next, you will build a VectorStoreIndex from the Weaviate client to store your data in and interact with.

from llama_index.core import VectorStoreIndex, StorageContext
from llama_index.vector_stores.weaviate import WeaviateVectorStore

index_name = "MyExternalContext"

# Construct the vector store
vector_store = WeaviateVectorStore(
    weaviate_client=client,
    index_name=index_name,
)

# Set up the storage for the embeddings
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Set up the index
# Build a VectorStoreIndex that takes care of chunking documents
# and encoding the chunks to embeddings for future retrieval
index = VectorStoreIndex(
    nodes,
    storage_context=storage_context,
)

Step 5: Setup query engine

Lastly, you will set up the index as the query engine.

# The QueryEngine class is provided with the generator
# and facilitates the retrieval and generation steps
query_engine = index.as_query_engine()

Step 6: Run a naive RAG query in your data

Now, you can run a naive RAG query on your data, as shown below:

# Run your naive RAG query
response = query_engine.query(
    "What happened at Interleaf?"
)
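
To inspect the result, you can print the generated answer and look at the retrieved source nodes attached to the response object. This is a minimal sketch; the exact metadata keys available depend on the reader used to load the documents:

# Print the generated answer
print(str(response))

# Print the similarity score and metadata of each retrieved chunk
for source_node in response.source_nodes:
    print(source_node.score, source_node.node.metadata)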

In this section, we will cover some simple adjustments you can make to turn the naive RAG pipeline above into an advanced one. This walkthrough will cover the following selection of advanced RAG techniques:

  • Indexing optimization: Sentence window retrieval
  • Retrieval optimization: Hybrid search
  • Post-retrieval optimization: Re-ranking

As we will only cover the modifications here, you can find the full end-to-end advanced RAG pipeline in this Jupyter Notebook.

For the sentence window retrieval technique, you need to make two adjustments: First, you must adjust how you store and post-process your data. Instead of the SimpleNodeParser, we will use the SentenceWindowNodeParser.

from llama_index.core.node_parser import SentenceWindowNodeParser

# Create the sentence window node parser with default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

The SentenceWindowNodeParser does two things:

  1. It separates the document into single sentences, which will be embedded.
  2. For each sentence, it creates a context window. If you specify a window_size = 3, the resulting window will be three sentences long, starting at the sentence before the embedded sentence and spanning the sentence after it. The window will be stored as metadata (see the sketch after this list).
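
To see the effect, you can parse the documents loaded earlier with this node parser and compare a node's embedded text to the window stored in its metadata. This is a minimal sketch that simply picks an arbitrary node for illustration:

# Parse the loaded documents into single-sentence nodes with window metadata
sentence_nodes = node_parser.get_nodes_from_documents(documents)

# The node's text is a single sentence ...
print(sentence_nodes[10].text)

# ... while the surrounding context window is stored under the "window" key
print(sentence_nodes[10].metadata["window"])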

During retrieval, the sentence that most closely matches the query is returned. After retrieval, you need to replace the sentence with the full window from the metadata by defining a MetadataReplacementPostProcessor and using it in the list of node_postprocessors.

from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# The target key defaults to `window` to match the node_parser's default
postproc = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

...

query_engine = index.as_query_engine(
    node_postprocessors=[postproc],
)

Implementing a hybrid search in LlamaIndex is as simple as two parameter changes to the query_engine, provided the underlying vector database supports hybrid search queries. The alpha parameter specifies the weighting between vector search and keyword-based search, where alpha=0 means pure keyword-based search and alpha=1 means pure vector search.

query_engine = index.as_query_engine(
    ...,
    vector_store_query_mode="hybrid",
    alpha=0.5,
    ...
)
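
For reference, once the placeholders are filled in, a complete call could look like the following sketch, which reuses the index and the example query from the naive pipeline above:

# Hybrid query engine: alpha=0.5 weights keyword-based and vector search equally
query_engine = index.as_query_engine(
    vector_store_query_mode="hybrid",
    alpha=0.5,
)

response = query_engine.query(
    "What happened at Interleaf?"
)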

Adding a reranker to your advanced RAG pipeline only takes three easy steps:

  1. First, define a reranker model. Here, we are using BAAI/bge-reranker-base from Hugging Face.
  2. In the query engine, add the reranker model to the list of node_postprocessors.
  3. Increase the similarity_top_k in the query engine to retrieve more context passages, which will be reduced to top_n after reranking.

# !pip install torch sentence-transformers
from llama_index.core.postprocessor import SentenceTransformerRerank

# Define reranker model
rerank = SentenceTransformerRerank(
    top_n=2,
    model="BAAI/bge-reranker-base"
)

...

# Add reranker to query engine
query_engine = index.as_query_engine(
    similarity_top_k=6,
    ...,
    node_postprocessors=[rerank],
    ...,
)
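
Putting the three techniques together, the final advanced query engine could look like the following sketch. It combines the hybrid search parameters with the sentence window post-processor and the reranker defined above; running the window replacement before the reranker (so the reranker scores the expanded windows) is an assumption, not something prescribed above:

# Advanced RAG query engine combining sentence window retrieval,
# hybrid search, and re-ranking (post-processor order is an assumption)
query_engine = index.as_query_engine(
    similarity_top_k=6,
    vector_store_query_mode="hybrid",
    alpha=0.5,
    node_postprocessors=[postproc, rerank],
)

response = query_engine.query(
    "What happened at Interleaf?"
)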
