RAGAs (Retrieval-Augmented Generation Assessment) is a framework (GitHub, Docs) that provides you with the necessary ingredients to evaluate your RAG pipeline on a component level.
Evaluation Data
What’s interesting about RAGAs is that it started out as a framework for “reference-free” evaluation [1]. That means, instead of having to rely on human-annotated ground truth labels in the evaluation dataset, RAGAs leverages LLMs under the hood to conduct the evaluations.
To evaluate the RAG pipeline, RAGAs expects the following information:
- question: The user query that is the input of the RAG pipeline. The input.
- answer: The generated answer from the RAG pipeline. The output.
- contexts: The contexts retrieved from the external knowledge source used to answer the question.
- ground_truths: The ground truth answer to the question. This is the only human-annotated information. It is only required for the metric context_recall (see Evaluation Metrics).
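To make this structure concrete, here is a minimal sketch of what a single evaluation record could look like; the field names are the ones RAGAs expects, while the values are invented toy data rather than the example used later in this article.
# Minimal sketch of one evaluation record (toy values for illustration only).
# Note that contexts and ground_truths are lists: a question can have several
# retrieved chunks and, in principle, several reference answers.
sample_record = {
    "question": "When was the first Super Bowl played?",
    "answer": "The first Super Bowl was played on January 15, 1967.",
    "contexts": [
        "The First AFL-NFL World Championship Game was played on January 15, 1967."
    ],
    "ground_truths": ["The first Super Bowl was played on January 15, 1967."],
}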
Leveraging LLMs for reference-free evaluation is an active research topic. While using as little human-annotated data as possible makes it a cheaper and faster evaluation method, there is still some discussion about its shortcomings, such as bias [3]. However, some papers have already shown promising results [4]. For detailed information, see the “Related Work” section of the RAGAs paper [1].
Note that the framework has expanded to provide metrics and paradigms that require ground truth labels (e.g., context_recall and answer_correctness; see Evaluation Metrics).
Moreover, the framework provides you with tooling for automatic test data generation.
Evaluation Metrics
RAGAs provides you with a few metrics to evaluate a RAG pipeline component-wise as well as end-to-end.
On a component level, RAGAs provides you with metrics to evaluate the retrieval component (context_precision and context_recall) and the generative component (faithfulness and answer_relevancy) separately [2]:
- Context precision measures the signal-to-noise ratio of the retrieved context. This metric is computed using the question and the contexts.
- Context recall measures whether all the relevant information required to answer the question was retrieved. This metric is computed based on the ground_truth (this is the only metric in the framework that relies on human-annotated ground truth labels) and the contexts.
- Faithfulness measures the factual accuracy of the generated answer. The number of correct statements from the given contexts is divided by the total number of statements in the generated answer (see the sketch after this list). This metric uses the question, the contexts, and the answer.
- Answer relevancy measures how relevant the generated answer is to the question. This metric is computed using the question and the answer. For example, the answer “France is in western Europe.” to the question “Where is France and what is its capital?” would achieve a low answer relevancy because it only answers half of the question.
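To make the faithfulness computation explicit, the description above can be written as the following ratio; this is a restatement of the text, not the exact notation from the RAGAs paper:
\[
\text{faithfulness} = \frac{\lvert\, \text{statements in the answer supported by the retrieved context} \,\rvert}{\lvert\, \text{statements in the answer} \,\rvert}
\]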
All metrics are scaled to the range [0, 1], with higher values indicating better performance.
RAGAs also provides you with metrics to evaluate the RAG pipeline end-to-end, such as answer semantic similarity and answer correctness. This article focuses on the component-level metrics.
This section uses RAGAs to evaluate a minimal vanilla RAG pipeline, to show you how to use RAGAs and to give you an intuition about its evaluation metrics.
Prerequisites
Make sure you have installed the required Python packages:
- langchain, openai, and weaviate-client for the RAG pipeline
- ragas for evaluating the RAG pipeline
#!pip install langchain openai weaviate-client ragas
Additionally, define your relevant environment variables in a .env file in your root directory. To obtain an OpenAI API key, you need an OpenAI account and then “Create new secret key” under API keys.
OPENAI_API_KEY=""
Setting up the RAG application
Before you can evaluate your RAG application, you need to set it up. We’ll use a vanilla RAG pipeline. We’ll keep this section short since we will use the same setup described in detail in the following article.
First, you must prepare the data by loading and chunking the documents.
import requests
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter

url = "https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/modules/state_of_the_union.txt"
res = requests.get(url)
with open("state_of_the_union.txt", "w") as f:
f.write(res.text)
# Load the data
loader = TextLoader('./state_of_the_union.txt')
documents = loader.load()
# Chunk the data
text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = text_splitter.split_documents(documents)
Next, generate the vector embeddings for each chunk with the OpenAI embedding model and store them in the vector database.
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Weaviate
import weaviate
from weaviate.embedded import EmbeddedOptions
from dotenv import load_dotenv, find_dotenv

# Load OpenAI API key from .env file
load_dotenv(find_dotenv())
# Setup vector database
client = weaviate.Client(
embedded_options = EmbeddedOptions()
)
# Populate vector database
vectorstore = Weaviate.from_documents(
client = client,
documents = chunks,
embedding = OpenAIEmbeddings(),
by_text = False
)
# Define vectorstore as retriever to enable semantic search
retriever = vectorstore.as_retriever()
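Optionally, you can check that semantic search returns sensible chunks before building the full chain. This quick check uses retriever.get_relevant_documents(), the same call the evaluation code later in this section uses, with one of the sample questions used later in this article:
# Optional check: retrieve the chunks most similar to a sample question
sample_docs = retriever.get_relevant_documents("What did the president say about Justice Breyer?")
print(f"Retrieved {len(sample_docs)} chunks")
print(sample_docs[0].page_content[:200])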
Finally, set up a prompt template and the OpenAI LLM, and combine them with the retriever component into a RAG pipeline.
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate
from langchain.schema.runnable import RunnablePassthrough
from langchain.schema.output_parser import StrOutputParser

# Define LLM
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)
# Define prompt template
template = """You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to answer the question.
If you don't know the answer, just say that you don't know.
Use two sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
Answer:
"""
prompt = ChatPromptTemplate.from_template(template)
# Setup RAG pipeline
rag_chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
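As a quick sanity check (not required for the evaluation), you can invoke the finished chain on a single question, for example one of the sample questions used in the next section:
# Quick sanity check: run the RAG pipeline on a single question
print(rag_chain.invoke("What did the president say about Justice Breyer?"))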
Preparing the Evaluation Data
As RAGAs aims to be a reference-free evaluation framework, the required preparations of the evaluation dataset are minimal. You will need to prepare question and ground_truths pairs, from which you can prepare the remaining information through inference as follows:
from datasets import Dataset

questions = ["What did the president say about Justice Breyer?",
"What did the president say about Intel's CEO?",
"What did the president say about gun violence?",
]
ground_truths = [["The president said that Justice Breyer has dedicated his life to serve the country and thanked him for his service."],
["The president said that Pat Gelsinger is ready to increase Intel's investment to $100 billion."],
["The president asked Congress to pass proven measures to reduce gun violence."]]
answers = []
contexts = []
# Inference
for query in questions:
answers.append(rag_chain.invoke(query))
contexts.append([docs.page_content for docs in retriever.get_relevant_documents(query)])
# To dict
data = {
    "question": questions,
    "answer": answers,
    "contexts": contexts,
    "ground_truths": ground_truths
}
# Convert dict to dataset
dataset = Dataset.from_dict(data)
If you are not interested in the context_recall metric, you don’t need to provide the ground_truths information. In this case, all you need to prepare are the questions.
Evaluating the RAG application
First, import all the metrics you want to use from ragas.metrics. Then, use the evaluate() function and simply pass in the relevant metrics and the prepared dataset.
from ragas import evaluate
from ragas.metrics import (
faithfulness,
answer_relevancy,
context_recall,
context_precision,
)

result = evaluate(
dataset = dataset,
metrics=[
context_precision,
context_recall,
faithfulness,
answer_relevancy,
],
)
df = result.to_pandas()
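If you prefer to inspect the scores programmatically rather than looking at the full DataFrame, a minimal sketch could look like the following; it assumes the metric names appear as columns in the DataFrame returned by result.to_pandas():
# Print the per-question scores (column names assumed to match the metric names)
metric_columns = ["context_precision", "context_recall", "faithfulness", "answer_relevancy"]
for _, row in df.iterrows():
    print(row["question"])
    for metric in metric_columns:
        print(f"  {metric}: {row[metric]:.2f}")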
Below, you can see the resulting RAGAs scores for the examples: