LLM+RAG-Based Query Answering
RAG Overview
Constructing the RAG Input, or Chunking Text
Example RAG Query
Embedding Vectors
Vector Databases
Chunked Data and Embeddings
Search for Similar Query Embeddings vs Chunk Embeddings
Constructing the Context
Generating the Answer
Visual Embedding Check
More Advanced Context Selection
Extending the RAG Query
Evaluating Results

How to do poorly on Kaggle, and learn about RAG+LLM from it

Towards Data Science

23 min read

Dec 25, 2023

Image generated with ChatGPT+/DALL-E3, asking for an illustrative image for an article about RAG.

Retrieval Augmented Generation (RAG) seems to be quite popular these days. Riding the wave of Large Language Models (LLMs), it is one of the popular techniques for getting LLMs to perform better on specific tasks, such as question answering on in-house documents. Some time ago, I took part in a Kaggle competition that allowed me to try it out and learn a bit more than I would have from random experiments on my own. Here are a few learnings from that and from the follow-up experiments I did while writing this article.

All images, unless otherwise noted, are by the author. Generated with the help of ChatGPT+/DALL-E3 (where noted), or taken from my personal Jupyter notebooks.

RAG has two main parts, retrieval and generation. In the first part, retrieval is used to fetch (chunks of) documents related to the query of interest. In the second part, generation uses those fetched chunks as added input, called context, to the answer generation model. This added context is intended to give the generator more up-to-date, and hopefully better, information to base its generated answer on than just its base training data.

LLMs have a maximum context or sequence window length they can handle, and the generated input context for RAG needs to be short enough to fit into this sequence window. We want to fit as much relevant information into this context as possible, so getting the best “chunks” of text from the potential input documents is important. These chunks should optimally be the most relevant ones for generating the correct answer to the question posed to the RAG system.

As a first step, the input text is typically chunked into smaller pieces. A basic pre-processing step in RAG is converting these chunks into embeddings using a specific embedding model. A typical sequence window for an embedding model is 512 tokens, which also makes a practical target for chunk size. Once the documents are chunked and encoded into embeddings, a similarity search using the embeddings can be performed to build the context for generating the answer.

I have found LangChain to provide useful tools for input loading and chunking. For example, chunking a document with LangChain (in this case, using the tokenizer for the Flan-T5-Large model) is as simple as:

from transformers import AutoTokenizer
from langchain.text_splitter import RecursiveCharacterTextSplitter

# This is the Flan-T5-Large model I used for the Kaggle competition
llm = "/mystuff/llm/flan-t5-large/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(llm, local_files_only=True)
# chunk_size is counted in tokens of the given tokenizer.
# chunk_overlap=2 is an assumed value: the default overlap is larger
# than this tiny chunk_size, which the splitter would reject.
text_splitter = RecursiveCharacterTextSplitter.from_huggingface_tokenizer(
    tokenizer, chunk_size=12, chunk_overlap=2,
    separators=["\n\n", "\n", ". "])
section_text = ("Hello. This is some text to split. With a few "
                "uncharacteristic words to chunk, expecting 2 chunks.")
texts = text_splitter.split_text(section_text)

This produces the following two chunks:

['Hello. This is some text to split',
'. With a few uncharacteristic words to chunk, expecting 2 chunks.']

In the above code, chunk_size 12 tells LangChain to aim for a maximum of 12 tokens per chunk. Depending on the text structure, this may not always be 100% exact. However, in my experience it generally works well. Something to keep in mind is the difference between tokens and words. Here is an example of tokenizing the above section_text:

section_text = ("Hello. This is some text to split. With a few "
                "uncharacteristic words to chunk, expecting 2 chunks.")
encoded_text = tokenizer(section_text)
tokens = tokenizer.convert_ids_to_tokens(encoded_text['input_ids'])

Resulting output tokens:

['▁Hello', '.', '▁This', '▁is', '▁some', '▁text', '▁to', '▁split', '.', 
'▁With', '▁', 'a', '▁few', '▁un', 'character', 'istic', '▁words',
'▁to', '▁chunk', ',', '▁expecting', '▁2', '▁chunk', 's', '.', '</s>']

Most words in the section_text form a token on their own, as they are common words in texts. However, for special forms of words, or domain words, this can be a bit more complicated. For example, here the word “uncharacteristic” becomes three tokens [“un”, “character”, “istic”]. This is because the model tokenizer knows those 3 partial sub-words but not the entire word (“uncharacteristic”). Each model comes with its own tokenizer to match these rules in input and model training.

In chunking, the RecursiveCharacterTextSplitter from LangChain used in the above code counts these tokens, and looks for the given separators to split the text into chunks as requested. Trials with different chunk sizes may be useful. In my Kaggle experiment I started with the maximum size for the embedding model, which was 512 tokens. I then proceeded to try chunk sizes of 256, 128, and 64 tokens.
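
To make the idea of counting tokens against a chunk size more concrete, here is a minimal sketch of greedy, separator-based chunking. It is not LangChain's actual implementation. The `count_tokens` helper is a hypothetical stand-in that counts whitespace-separated words; a real setup would count tokens with the model's own tokenizer.

```python
def count_tokens(text):
    # Hypothetical stand-in: whitespace "tokens". A real setup would use
    # something like len(tokenizer(text)["input_ids"]) instead.
    return len(text.split())


def chunk_text(text, chunk_size, separator=". "):
    """Greedily pack separator-delimited pieces into chunks of at most
    chunk_size tokens (a single oversized piece becomes its own chunk)."""
    pieces = [p for p in text.split(separator) if p]
    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if current and count_tokens(candidate) > chunk_size:
            # Adding this piece would exceed the budget: close the chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "One two three. Four five six. Seven eight nine ten"
# Try a couple of chunk sizes, as in the 512/256/128/64 trials above.
for size in (8, 4):
    print(size, chunk_text(text, size))
```

With a budget of 8 the first two sentences share a chunk, while a budget of 4 puts every sentence in its own chunk, which is the kind of difference worth comparing when trialing chunk sizes.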

The Kaggle competition I mentioned was about multiple-choice question answering based on Wikipedia data. The task was to select the correct answer option from the multiple options for each question. The obvious approach was to use RAG to find the required information from a Wikipedia dump, and use it to generate the correct answer. Here is the first question from the competition data, and its answer options, to illustrate:

Example query and answer options A-E.

The multiple-choice questions were an interesting topic for trying out RAG. But the most common RAG use case is, I believe, answering questions based on source documents. Kind of like a chatbot, but typically question answering over domain-specific or (company) internal documents. I use this basic question answering use case to demonstrate RAG in this article.

As an example RAG question for this article, I needed something the LLM would not know the answer to directly based on its training data alone. I used Wikipedia data, and since it is likely used as part of the training data for LLMs, I needed a question related to something that happened after the model was trained. The model I used for this article was Zephyr 7B beta, trained in early 2023. Finally, I settled on asking about the Google Bard AI chatbot. It has had many developments over the past year, after the Zephyr training date. I also have a decent knowledge of Bard to evaluate the LLM’s answers. Thus I used “what is google bard?” as the example question for this article.

The first phase of retrieval in RAG is based on the embedding vectors, which are really just points in a multidimensional space. They look something like this (only the first 10 values here):

array([-0.45518905, -0.6450379, 0.3097812, -0.4861114 , -0.08480848,
-0.1664767 , 0.1875889, 0.3513346, -0.04495572, 0.12551129],

These embedding vectors can be used to compare the words/sentences, and their relations, against each other. These vectors can be built using embedding models. A nice set of those models, with various stats per model, can be found on the MTEB leaderboard. Using one of those models is as simple as this:

from sentence_transformers import SentenceTransformer, util

embedding_model_path = "/mystuff/llm/bge-small-en"
embedding_model = SentenceTransformer(embedding_model_path, device='cuda')

The model page on HuggingFace typically shows the example code. The above loads the model “bge-small-en” from local disk. Creating the embeddings using this model is just:

query = "what is google bard?"
q_embeddings = embedding_model.encode(query)

In this case, the embedding model is used to encode the given query into an embedding vector. The vector is the same as the example above:

(384,)

array([-0.45518905, -0.6450379, 0.3097812, -0.4861114 , -0.08480848,
-0.1664767 , 0.1875889, 0.3513346, -0.04495572, 0.12551129],

The shape (384,) tells me q_embeddings is a single vector (as opposed to embedding a list of multiple texts at once) of 384 floats. The slice above shows the first 10 values out of those 384. Some models use longer vectors for more accurate relations; others, like this one, use shorter ones (here 384). Again, the MTEB leaderboard has good examples. The small ones require less space and computation, while the larger ones give some improvements in representing the relations between chunks, and sometimes in sequence length.

For my RAG similarity search, I first needed embeddings for the question. This is the q_embeddings above. This needed to be compared against the embedding vectors of all the searched articles (or their chunks), in this case all the chunked Wikipedia articles. To build embeddings for all of those:

article_embeddings = embedding_model.encode(article_chunks)

Here article_chunks is a list of all chunks for all articles from the English Wikipedia dump. This way they can be batch-encoded.

Implementing similarity search over a large set of documents / document chunks is not too complicated at a basic level. A common way is to calculate cosine similarity between the query and document vectors, and sort accordingly. However, at large scale, this sometimes gets a bit complicated to manage. Vector databases are tools that make this management and search easier / more efficient at scale.
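
As a rough illustration of the basic approach, the following sketch computes cosine similarity between a query vector and a handful of chunk vectors, and sorts by the score. It uses only the Python standard library and made-up 3-dimensional vectors; the function names are hypothetical, and real code would typically use NumPy or a vector database on the full embeddings.

```python
import math


def cosine_similarity(a, b):
    # dot(a, b) / (|a| * |b|); assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine_similarity(query_vec, v))
              for i, v in enumerate(chunk_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy 3-dimensional "embeddings", made up for illustration.
query_vec = [1.0, 0.0, 0.0]
chunk_vecs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
print(top_k(query_vec, chunk_vecs, k=2))
```

Note that the vector `[2.0, 0.0, 0.0]` scores highest even though it is longer than the query: cosine similarity only looks at direction, not magnitude.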

For example, Weaviate is a vector database that was used in StackOverflow’s AI-based search. In its latest versions, it can also be used in an embedded mode, which should have made it usable even in a Kaggle notebook. It is also used in some DeepLearning.AI LLM short courses, so it at least seems somewhat popular. Of course, there are many others and it is good to make comparisons, as this field also evolves fast.

In my trials, I used FAISS from Facebook/Meta research as the vector database. FAISS is more of a library than a client-server database, and was thus simple to use in a Kaggle notebook. And it worked quite nicely.

Once the chunking and embedding of all the articles was done, I built a Pandas DataFrame with all the relevant information. Here is an example with the first 5 chunks of the Wikipedia dump I used, for a document titled Anarchism:

First 5 chunks from the first article in the Wikipedia dump I used.

Each row in this table (a Pandas DataFrame) contains data for a single chunk after the chunking process. It has 5 columns:

  • chunk_id: allows me to map chunk embeddings to the chunk text later.
  • doc_id: allows mapping the chunks back to their document.
  • doc_title: for trialing approaches such as adding the doc title to each chunk.
  • chunk_title: article subsection title for the chunk, same purpose as doc_title
  • chunk: the actual chunk text

Here are the embeddings for the first five Anarchism chunks, in the same order as the DataFrame above:

[[ 0.042624 -0.131264 -0.266858 ... -0.329627 0.178211 0.248001]
[-0.120318 -0.110153 -0.059611 ... -0.297150 -0.043165 0.558150]
[ 0.116761 -0.066759 -0.498548 ... -0.330301 0.019448 0.326484]
[-0.517585 0.183634 0.186501 ... 0.134235 -0.033262 0.498731]
[-0.245819 -0.189427 0.159848 ... -0.077107 -0.111901 0.483461]]

Each row is only partially shown here, but this illustrates the idea.

Earlier I encoded the query vector for the query “what is google bard?”, followed by encoding all the article chunks. With these two sets of embeddings, the first part of the RAG search is simple: finding the documents “semantically” closest to the query. In practice, this means just calculating a measure such as cosine similarity between the query embedding vector and all the chunk vectors, and sorting by the similarity score.

Here are the top 10 “semantically” closest chunks to the q_embeddings:

Top 10 chunks sorted by their cosine similarity with the query.

Each row in this table (DataFrame) represents a chunk. The sim_score here is the calculated cosine similarity score, and the rows are sorted from highest cosine similarity to lowest. The table shows the 10 highest sim_score rows.

A pure embeddings-based similarity search is very fast and cheap in terms of computation. However, it is not quite as accurate as some other approaches. Re-ranking is a term used to describe the process of using another, more computationally expensive model to sort this initial list of top documents more accurately. This model is usually too expensive to run against all documents and chunks, but running it on the set of top chunks after the initial similarity search is much more feasible. Re-ranking helps to get a better list of final chunks to build the input context for the generation part of RAG.

The same MTEB leaderboard that hosts metrics for the embedding models also has re-ranking scores for many models. In this case I used the bge-reranker-base model for re-ranking:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

rerank_model_path = "/mystuff/llm/bge-reranker-base"
rerank_tokenizer = AutoTokenizer.from_pretrained(rerank_model_path)
rerank_model = AutoModelForSequenceClassification.from_pretrained(rerank_model_path)
rerank_model.eval()  # inference only

def calculate_rerank_scores(pairs):
    # Score each (question, chunk) pair; a higher score means more relevant
    with torch.no_grad():
        # max_length=512 is an assumed value for the truncation limit
        inputs = rerank_tokenizer(pairs, padding=True, truncation=True,
                                  return_tensors='pt', max_length=512)
        scores = rerank_model(**inputs, return_dict=True).logits.view(-1, ).float()
    return scores

question = questions[idx]
pairs = [(question, chunk) for chunk in doc_chunks_all[idx]]
rerank_scores = calculate_rerank_scores(pairs)
# convert the score tensor to a NumPy array before storing it in the DataFrame
df["rerank_score"] = rerank_scores.cpu().numpy()

After adding rerank_score to the chunk DataFrame, and sorting by it:

Top 10 chunks sorted by their re-rank score with the query.

Comparing the two tables above (first sorted by sim_score, now by rerank_score), there are some clear differences. Sorting by the plain similarity score (sim_score) from embeddings, the Tenor page is the 5th most similar chunk. Since Tenor appears to be a GIF search engine hosted by Google, I guess it makes some sense to see its embeddings close to the query “what is google bard?”. But it has nothing really to do with Bard itself, except that Tenor is a Google product in a similar domain.

However, after sorting by the rerank_score, the results make much more sense. Tenor is gone from the top 10, and only the last two chunks from the top 10 list appear to be unrelated. These are about the names “Bard” and “Bård”. This is possibly because the best source of information on Google Bard appears to be the page on Google Bard, which in the above tables is the document with id 6026776. After that, I guess RAG runs out of good article matches and goes a bit off-road (Bård). This is also seen in the negative re-rank scores for those last two rows/chunks of the table.

Typically there would likely be many relevant documents and chunks across those documents, not just the 1 document and 8 chunks as above. But in this case this limitation helps illustrate the difference between basic embeddings-based similarity search and re-ranking, and how re-ranking can positively affect the end result.

What do we do once we have collected the top chunks for RAG input? We need to build the context for the generator model from these chunks. At its simplest, this is just a concatenation of the selected top chunks into a longer text sequence. The maximum length of this sequence is constrained by the model used. As I used the Zephyr 7B model, I used 4096 tokens as the maximum length. The Zephyr page gives this as a flexible sequence limit (with a sliding attention window). A longer context seems better, but it appears this is not always the case. Better to try it.
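
The concatenation under a length limit can be sketched as follows. This is a simplified illustration rather than the code used for the experiments: `build_context` and `word_count` are hypothetical names, the budget counts whitespace-separated words instead of real model tokens, and the separators between chunks are not counted against the budget.

```python
def build_context(ranked_chunks, max_tokens, count_tokens):
    """Concatenate chunks in rank order, stopping before the first chunk
    that would push the total over the token budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        n = count_tokens(chunk)
        if used + n > max_tokens:
            break
        selected.append(chunk)
        used += n
    return "\n\n".join(selected)


def word_count(text):
    # Hypothetical stand-in for a real tokenizer-based token count.
    return len(text.split())


# Made-up chunks, already sorted from most to least relevant.
ranked_chunks = [
    "Bard is a chatbot.",
    "It was launched in 2023.",
    "Tenor is a GIF search engine.",
]
context = build_context(ranked_chunks, max_tokens=10, count_tokens=word_count)
print(context)
```

Stopping at the first chunk that does not fit keeps the highest ranked chunks together; an alternative design would skip the oversized chunk and keep trying smaller ones further down the list.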

Here is the base code I used to generate the answer with this context:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

llm_answer_path = "/mystuff/llm/zephyr-7b-beta"
torch_device = "cuda:0"
tokenizer = AutoTokenizer.from_pretrained(llm_answer_path,
                                          local_files_only=True)
llm_answer = AutoModelForCausalLM.from_pretrained(
    llm_answer_path, device_map=torch_device, local_files_only=True)

# assuming here that "question" holds the question text and
# that "context" contains the pre-built context
query = ("answer the following question, "
         "based on your knowledge and the provided context. "
         "Keep the answer concise.\n\nquestion:" + question +
         "\n\ncontext:" + context)

# move the input ids to the same device as the model
input_ids = tokenizer.encode(query + "\n\nANSWER:",
                             return_tensors='pt').to(torch_device)
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)

As noted, in this case the context was just a concatenation of the top ranked chunks.

For comparison, first let's try what the model answers without any added context, i.e. based on its training data alone:

query = "what is google bard?"
# move the input ids to the same device as the model
input_ids = tokenizer.encode(query + "\n\nANSWER:",
                             return_tensors='pt').to(torch_device)
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)

This gives (one of many runs, slight variations but generally similar):

Google Bard is an experimental, AI-based language model developed by
Google's sister company, DeepMind. Its primary use is to generate
human-like text responses to prompts, which can help in tasks such as
content creation, idea generation, and text summarization. Bard is
trained on a vast amount of textual data and can provide highly
relevant and contextually accurate responses, making it a useful tool
in various applications where text generation is required. However, as
an experimental feature, Bard's accuracy and effectiveness may still be
improving, and it is not yet publicly available for use.

Generally accurate, but missing much of the latest developments. In comparison, let's try providing the generated context along with the question:

question = "what is google bard?"
query = ("answer the following question, "
         "based on your knowledge and the provided context. "
         "Keep the answer concise.\n\n"
         "question:" + question + "\n\ncontext:" + context)
# move the input ids to the same device as the model
input_ids = tokenizer.encode(query + "\n\nANSWER:",
                             return_tensors='pt').to(torch_device)
greedy_output = llm_answer.generate(input_ids, max_new_tokens=1024)
answer = tokenizer.decode(greedy_output[0], skip_special_tokens=True)

The following is an example answer with the top sim_score sorted chunks as context (includes the Tenor and Bård page chunks):

Bard is a conversational generative artificial intelligence (AI)
chatbot developed by Google based initially on the LaMDA family of
large language models (LLMs) and later PaLM. It was launched in a
limited capacity in March 2023 to mixed reviews, and expanded to other
countries in May, following the launch of Microsoft's Bing Chat. Google
researcher Jacob Devlin resigned from the company after alleging Bard
had surreptitiously leveraged data from ChatGPT. Tenor is an online GIF
search engine and database owned by Google, available on Android, iOS,
and macOS. Its main product is the GIF Keyboard, which is integrated
into numerous apps and messaging services. Bård is a Norwegian
masculine given name of Old Norse origin. It is sometimes used as a
surname. It may refer to several people. Tenor and Bård are not related.

This is not a very good answer, since it starts talking about completely unrelated topics, here Tenor and Bård. This is partly because in this case the Tenor chunk is included in the context, and the chunk order is also generally less optimal as it is not re-ranked.

In comparison, with rerank_score sorted context chunks (better chunk ordering and Tenor gone):

Google Bard is a conversational generative artificial intelligence (AI)
chatbot, based initially on the LaMDA family of large language models
(LLMs) and later on PaLM, developed by Google to compete with OpenAI's
ChatGPT. It was rolled out in a limited capacity in March 2023 and
expanded to more countries in May, prompting a mixed reception from
critics, who raised safety and ethical concerns about its accuracy and
usefulness. Google has promised that Bard will be tightly integrated
with other Google AI services, leading to claims that a
new AI-powered version of the Google Assistant, dubbed "Assistant with
Bard", is being prepared for launch. Google has also stressed that Bard
is still in its early stages and being constantly refined, with plans
to upgrade it with new personalization and productivity features, while
stressing that it remains distinct from Google Search.

Now the unrelated topics are gone and the answer in general is better and more to the point.

This highlights that it is not only important to find proper context to give to the model, but also to trim out the unrelated context. At least in this case, the Zephyr model was not able to directly identify which part of the context was relevant, but rather seems to have summarized all of it. I cannot really fault the model, as I gave it that context and asked it to use it.

Looking at the re-rank scores for the chunks, a general filtering approach based on metrics such as negative re-rank scores would also have solved this issue in the above case, as the “bad” chunks in this case have a negative re-rank score.
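
Such a filter could look something like the sketch below. The function name, the threshold of zero, the chunk labels, and the scores are all made up for illustration; a suitable cut-off would need to be checked against real re-rank scores for the model in use.

```python
def filter_chunks(scored_chunks, min_score=0.0):
    """Keep (chunk, rerank_score) pairs scoring above min_score,
    sorted from the highest score to the lowest."""
    kept = [(chunk, score) for chunk, score in scored_chunks
            if score > min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Made-up chunk labels and re-rank scores, for illustration only.
scored_chunks = [
    ("Bard intro", 6.2),
    ("Bard (name)", -1.3),
    ("Bard history", 3.8),
    ("Bård (name)", -2.7),
]
kept = filter_chunks(scored_chunks)
print(kept)
```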

Something to note is that Google released a new and much improved Gemini family of models for Bard around the time I was writing this article. It is not mentioned in the generated answers here, since the Wikipedia dumps are generated with a slight delay. So as one might imagine, it is important to try to have up-to-date information in the context, and to keep it relevant and focused.

Embeddings are a great tool, but sometimes it is a bit difficult to really grasp how they work, and what is happening with the similarity search. A basic approach is to plot the embeddings against each other to get some insight into their relations.

Building such a visualization is quite simple with PCA and visualization libraries. It involves mapping the embedding vectors to 2 or 3 dimensions, and plotting the results. Here I map from those 384 dimensions to 2, and plot the result:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA

# embed the first chunk of each article, and append the query embedding
fp_embeddings = embedding_model.encode(first_chunks)
q_embeddings_reshaped = q_embeddings.reshape(1, -1)
combined_embeddings = np.concatenate((fp_embeddings, q_embeddings_reshaped))

# project the 384-dimensional embeddings to 2 dimensions
X = combined_embeddings
pca = PCA(n_components=2).fit(X)
X_pca = pca.transform(X)

df_embedded_pca = pd.DataFrame(X_pca, columns=["x", "y"])
# text is short version of chunk text (plot title)
df_embedded_pca["text"] = titles
# row_type = article or query per each embedding
df_embedded_pca["row_type"] = row_types

sns.scatterplot(x="x", y="y", hue="row_type",
                palette={"article": "blue", "query": "red"},
                data=df_embedded_pca, #legend="full",
                alpha=0.8, s=100)
# label each point with its title
for i in range(df_embedded_pca.shape[0]):
    plt.annotate(df_embedded_pca["text"].iloc[i],
                 (df_embedded_pca["x"].iloc[i], df_embedded_pca["y"].iloc[i]),
                 fontsize=20)
# Change the font size for x and y axis ticks
plt.xticks(fontsize=16)
plt.yticks(fontsize=16)
# Change the font size for x and y axis labels
plt.xlabel('X', fontsize=16)
plt.ylabel('Y', fontsize=16)

For the top 10 articles in the “what is google bard?” query, this gives the following visualization:

PCA-based 2D plot of query embeddings vs article 1st chunk embeddings.

In this plot, the red dot is the embedding for the query “what is google bard?”. The blue dots are the closest Wikipedia article matches, according to sim_score.

The Bard article is clearly the closest one to the query, while the rest are a bit further off. The Tenor article appears to be about second closest, while the Bård one is a bit further away, possibly due to the loss of information in mapping from 384 dimensions to 2. Due to this, the visualization is not perfectly accurate, but it is helpful for a quick human overview.

The following figure illustrates an actual error finding from my Kaggle code using a similar PCA plot. Looking for a bit of insight, I tried a simple question about the first article in the Wikipedia dump (“Anarchism”), with the query “what is the definition of anarchism?”. The following is what the PCA visualization looked like for the closest articles; the marked outliers are perhaps the most interesting part:

My fail shown in a PCA-based 2D plot of Kaggle embeddings for selected top documents.

The red dot in the bottom left corner is again the query. The cluster of blue dots next to it are all related articles about anarchism. And then there are the two outlier dots on the top right. I removed the titles from the plot to keep it readable. The two outlier articles seemed to have nothing to do with the query when I looked at them.

Why is this? As I indexed the articles with various chunk sizes of 512, 256, 128, and 64, I had some issues in processing all the articles for the 256 chunk size, and restarted the chunking in the middle. This resulted in some differences between the indices of some of those embeddings and the chunk texts I had stored. After noticing these strange-looking results, I re-calculated the embeddings with the 256 token chunk size, compared the results against size 512, and noted this difference. Too bad the competition was done by that time 🙂

In the above I discussed chunking the documents and using similarity search + re-ranking as a method to find relevant chunks and build a context for the question answering. I found that sometimes it is also useful to consider how the initial documents to chunk are selected, rather than just the chunks themselves.

As example methods, the advanced RAG course on DeepLearning.AI presents two approaches: sentence windowing and hierarchical chunk merging. In summary, this looks at nearby chunks and, if multiple are ranked high by their scores, takes them as a single large chunk. The “hierarchy” comes from considering larger and larger chunk combinations for joint relevance. The aim is a more cohesive context, rather than randomly ordered small chunks, giving the generator LLM better input to work with.

As a simple example of this, here is the re-ranked set of top chunks for my above Bard example:

Top 10 chunks for my Bard example, sorted by rerank_score.

The leftmost column here is the index of the chunk. In my generation, I just took the top chunks in this sorted order as in the table. If we wanted to make the context a bit more coherent, we could sort the final selected chunks by their order within a document. If there is a small piece missing between highly ranked chunks, adding the missing one (e.g., here chunk id 7) could help fill gaps, similar to the hierarchical merging. This could be something to try as a final step for final gains.
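
A minimal sketch of this ordering and gap-filling idea is shown below. It is something to try rather than something tested in the experiments here: `order_and_fill` is a hypothetical helper, it assumes that chunk ids within one document are consecutive integers, and the ids in the example are made up.

```python
def order_and_fill(selected_ids, max_gap=1):
    """Sort chunk ids into document order and add the ids missing from
    gaps of at most max_gap chunks between selected neighbours.

    Assumes chunk ids within one document are consecutive integers.
    """
    ordered = sorted(set(selected_ids))
    result = []
    for prev, cur in zip(ordered, ordered[1:]):
        result.append(prev)
        missing = cur - prev - 1
        if 0 < missing <= max_gap:
            # Small gap: pull in the chunks between the two neighbours.
            result.extend(range(prev + 1, cur))
    if ordered:
        result.append(ordered[-1])
    return result


# Made-up chunk ids in re-rank order; id 7 is missing between 6 and 8.
print(order_and_fill([8, 5, 6, 12]))
```

Here the single missing chunk between ids 6 and 8 gets added, while the larger gap between 8 and 12 is left alone so that the context does not fill up with low-ranked text.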

In my Kaggle experiments, I performed initial document selection based on the first chunk only. This was partly due to Kaggle’s resource limits, but it appeared to have some other advantages as well. Typically, an article’s beginning acts as a summary (introduction or abstract). Initial chunk selection from such ranked articles may help select chunks with more relevant overall context.

This is visible in my Bard example above, where both the rerank_score and sim_score are highest for the first chunk of the best article. To try to improve this, I also tried using a larger chunk size for this initial document selection, to include more of the introduction for better relevance. I then chunked the top selected documents with smaller chunk sizes to experiment with how good the context is with each size.

While I could not run the initial search on all chunks of all documents on Kaggle due to resource limitations, I tried it outside of Kaggle. In these trials, I noticed that sometimes single chunks of unrelated articles get ranked high, while in reality being misleading for the answer generation. For example, an actor biography in a related movie. Initial document relevance selection may help avoid this. Unfortunately, I did not have time to study this further with different configurations, and good re-ranking may already help.

Finally, repeating the same information in multiple chunks in the context is not very useful. Top ranking of the chunks does not guarantee that they best complement each other, or give the best chunk diversity. For example, LangChain has a special chunk selector for Maximal Marginal Relevance. It does this by penalizing new chunks by how close they are to the already added chunks.
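
The idea of penalizing chunks that are close to already selected ones can be sketched in plain Python as below. This is a simplified illustration of the general MMR idea and not LangChain's implementation; the function names, the `lambda_mult` weighting, and the toy 2-dimensional vectors are assumptions made for the example.

```python
import math


def cosine(a, b):
    # Cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mmr_select(query_vec, chunk_vecs, k, lambda_mult=0.5):
    """Pick k chunk indices, trading off relevance to the query against
    similarity to the chunks that were already picked."""
    selected = []
    candidates = list(range(len(chunk_vecs)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            relevance = cosine(query_vec, chunk_vecs[i])
            # Penalty: similarity to the closest already selected chunk.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy 2-dimensional vectors: chunks 0 and 1 are near-duplicates.
query_vec = [1.0, 0.0]
chunk_vecs = [[1.0, 0.1], [1.0, 0.11], [0.6, -0.8]]
print(mmr_select(query_vec, chunk_vecs, k=2))
```

With `lambda_mult=1.0` the penalty disappears and the selection falls back to plain similarity ranking, which would pick both near-duplicate chunks.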

I used a very simple question / query for my RAG example here (“what is google bard?”), and simple is good for illustrating the basic RAG concept. This is a pretty short query input considering that the embedding model I used had a 512 token maximum sequence length. If I encode this query into tokens using the tokenizer for the embedding model (bge-small-en), I get the following tokens:

['[CLS]', 'what', 'is', 'google', 'bard', '?', '[SEP]']

This amounts to a total of 7 tokens. With a maximum sequence length of 512, this leaves plenty of room if I want to use a longer query sentence. Sometimes this can be useful, especially if the information we want to retrieve is not such a simple query, or if the domain is more complex. For a very small query, the semantic search may not work best, as also noted in Stack Overflow's AI journey posting.

For example, the Kaggle competition had a set of questions, each with 5 answer options to choose from. I initially tried RAG with just the question as the input for the embedding model. The search results were not too great, so I tried again with the question + all the answer options as the query. This produced much better results.

As an example, the first question in the training dataset of the competition:

Which of the following statements accurately describes the impact of
Modified Newtonian Dynamics (MOND) on the observed "missing baryonic mass"
discrepancy in galaxy clusters?

This is 32 tokens for the bge-small-en model. So about 480 are still left to fit into the maximum 512 token sequence length.

Here is the first question together with the 5 answer options given for it:

Example query and answer options A-E. Concatenating all these texts formed the query.

Concatenating the question and the given options into one RAG query gives this a length of 235 tokens, with still more than 50% of the embedding model sequence length left. In my case, this approach produced much better results, both from manual inspection and for the competition score. Thus, experimenting with different ways to make the RAG query itself more expressive is worth a try.
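
The concatenation itself can be as simple as the sketch below. The helper name and the example question and options are made up for illustration (they are not from the competition data), and joining with a single space is just one possible choice.

```python
def build_rag_query(question, options=None):
    """Join the question and any answer options into one retrieval query."""
    parts = [question.strip()]
    if options:
        parts.extend(option.strip() for option in options)
    return " ".join(parts)


# Made-up question and options, for illustration only.
question = "Which planet is known as the Red Planet?"
options = ["Venus", "Mars", "Jupiter"]
rag_query = build_rag_query(question, options)
print(rag_query)
```

The resulting string would then be passed to the embedding model in place of the bare question, after checking that its token count still fits the model's maximum sequence length.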

Finally, there’s the subject of hallucinations, where the model produces text that is inaccurate or fabricated. The Tenor example from my sim_score sorting is one type of an example, even when the generator did base it on the actual given context. So higher keep the context good I assume :).

To address hallucinations, the chatbots from the big AI companies (Google Bard, ChatGPT, Bing Chat) all provide means to link parts of their generated answers to verifiable sources. Bard has a specific “G” button that performs a Google search and highlights parts of the generated answer that match the search results. Too bad we do not always have a world-class search engine for our data to help.

Bing Chat has a similar approach, highlighting parts of the answer and adding a reference to the source websites. ChatGPT has a slightly different approach; I had to explicitly ask it to verify its answer and update it with the latest developments, telling it to use its browser tool. After this, it did a web search and linked to specific websites as sources. The source quality seemed to vary quite a bit, as in any web search. Of course, for internal documents this type of web search is not possible. However, linking to the source should always be possible, even internally.

I also asked Bard, ChatGPT+, and Bing for ideas on detecting hallucinations. The results included an LLM hallucination rating index, including RAG hallucination. When tuning LLM’s, it may also help to set the temperature parameter to zero, so that the LLM generates deterministic, most probable output tokens.
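To illustrate why a temperature of zero gives deterministic output, here is a toy sketch of temperature-based token sampling. The function `sample_token` and the logits are invented for illustration, and treating zero as plain greedy selection is an assumption about how the parameter is commonly handled; real LLM APIs may implement it differently and are not always perfectly deterministic in practice.

```python
import math
import random
from typing import Optional, Sequence


def sample_token(
    logits: Sequence[float],
    temperature: float,
    rng: Optional[random.Random] = None,
) -> int:
    """Return the index of the chosen token for the given logits."""
    if temperature <= 0.0:
        # Temperature zero: greedy decoding, always the most probable token.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Dividing by the temperature flattens (T > 1) or sharpens (T < 1) the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.0, 2.0]  # made-up scores for three candidate tokens
greedy = [sample_token(logits, 0.0) for _ in range(5)]
sampled = [sample_token(logits, 1.5, random.Random(seed)) for seed in range(5)]
```

With temperature zero every call picks the same token, while a higher temperature lets the less probable tokens through some of the time.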

Finally, as this is a very common problem, there seem to be various approaches being built to address this challenge a bit better. For instance, specific LLM’s to help detect hallucinations seem to be a promising area. I did not have time to try them, but they are certainly relevant in bigger projects.

Besides implementing a working RAG solution, it is also nice to be able to tell something about how well it works. In the Kaggle competition this was quite simple. I just ran the solution to try to answer the given questions in the training dataset, comparing to the correct answers given in the training data. Or I submitted the model for scoring on the Kaggle competition test set. The better the answer score, the better one could call the RAG solution, even if there was more to the score.
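The comparison against known correct answers can be sketched as below. The data is made up, and both function names are hypothetical. The ranked variant, a mean-average-precision-at-k style score, is only an assumption of what scoring a ranked list of guesses could look like; it is not a restatement of the official competition metric.

```python
from typing import Sequence


def accuracy(predicted: Sequence[str], correct: Sequence[str]) -> float:
    """Fraction of questions where the top prediction equals the correct option."""
    hits = sum(p == c for p, c in zip(predicted, correct))
    return hits / len(correct)


def map_at_k(ranked: Sequence[Sequence[str]], correct: Sequence[str], k: int = 3) -> float:
    """Mean average precision at k, with exactly one correct answer per question."""
    total = 0.0
    for guesses, answer in zip(ranked, correct):
        for rank, guess in enumerate(guesses[:k], start=1):
            if guess == answer:
                total += 1.0 / rank  # later ranks earn less credit
                break
    return total / len(correct)


# Made-up answer key and ranked predictions for three questions.
correct = ["A", "C", "E"]
ranked = [["A", "B", "C"], ["B", "C", "A"], ["A", "B", "C"]]
top1 = [guesses[0] for guesses in ranked]

top1_accuracy = accuracy(top1, correct)
ranked_score = map_at_k(ranked, correct, k=3)
```

Plain accuracy only rewards the top guess, while the ranked score also gives partial credit when the correct option appears second or third.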

In many cases, a suitable evaluation dataset for domain-specific RAG may not be available. For this scenario, one might want to start with some generic NLP evaluation datasets, such as this list. Tools such as LangChain also come with support for auto-generating questions and answers, and evaluating them. In this case, an LLM is used to create example questions and answers for a given set of documents, and another LLM is used to evaluate whether the RAG can provide the correct answer to these questions. This is perhaps better explained in this tutorial on RAG evaluation with LangChain.

While the generic solutions are likely good to start with, in a real project I would try to collect a real dataset of questions and answers from the domain experts and the intended users of the RAG solution. As the LLM is typically expected to generate a natural language response, this can vary a lot while still being correct. For this reason, evaluating whether the answer was correct is not as straightforward as a regular expression or similar pattern matching. Here, I find the idea of using another LLM to evaluate whether the given response matches a reference response a very useful tool. These models can deal with the text variation much better.
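A rough sketch of the LLM-as-evaluator idea follows. Everything here is hypothetical: the prompt wording, the function names, and especially `fake_llm`, which is a keyword-matching stand-in so the example runs offline. In a real setup the `llm` argument would wrap a call to an actual model.

```python
from typing import Callable


def build_judge_prompt(question: str, reference: str, candidate: str) -> str:
    """Assemble a grading prompt for the evaluator model."""
    return (
        "You are grading an answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )


def judge_answer(
    question: str,
    reference: str,
    candidate: str,
    llm: Callable[[str], str],
) -> bool:
    """Ask the evaluator model for a verdict and parse it into a boolean."""
    verdict = llm(build_judge_prompt(question, reference, candidate))
    return verdict.strip().upper().startswith("CORRECT")


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; it only checks for one keyword."""
    candidate = prompt.split("Candidate answer: ")[1].split("\n")[0]
    return "CORRECT" if "mercury" in candidate.lower() else "INCORRECT"


question = "Which planet is closest to the Sun?"  # made-up example
good = judge_answer(question, "Mercury", "It is Mercury.", fake_llm)
bad = judge_answer(question, "Mercury", "I believe it is Venus.", fake_llm)
```

Constraining the evaluator to a one-word verdict keeps the parsing trivial, at the cost of losing any explanation of why an answer was rejected.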

RAG is a very nice tool, and is quite a popular topic these days with the high interest in LLM’s in general. While RAG and embeddings have been around for a while, the latest powerful LLM’s and their fast evolution have perhaps made them more interesting for many advanced use cases. I expect the field to keep evolving at pace, and it is sometimes a bit difficult to keep up to date on everything. For this, summaries such as reviews on RAG developments can provide pointers to at least keep the main developments in sight.

The RAG approach in general is quite simple: find a set of chunks of text similar to the given query, concatenate them into a context, and ask the LLM for an answer. However, as I tried to show here, there can be various issues to consider in how to make this work well and efficiently for different needs: from good context retrieval, to ranking and selecting the best results, and finally being able to link the results back to actual source documents, as well as evaluating the resulting query contexts and answers. And as the Stack Overflow people noted, sometimes the more traditional lexical or hybrid search is very useful as well, even if semantic search is cool.
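As a compact recap of the basic loop (find similar chunks, concatenate them into a context, ask the LLM), here is a toy sketch in plain Python. The three-dimensional vectors are invented stand-ins for real embeddings, the function names are hypothetical, and the final LLM call is left out; a real setup would use an embedding model and a vector database as discussed earlier.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_chunks(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 2,
) -> List[int]:
    """Indices of the k chunks most similar to the query, best first."""
    order = sorted(
        range(len(chunk_vecs)),
        key=lambda i: cosine(query_vec, chunk_vecs[i]),
        reverse=True,
    )
    return order[:k]


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Concatenate the selected chunks into a context and append the question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Made-up chunks with toy "embeddings".
chunks = [
    "Mercury is the planet closest to the Sun.",
    "Bananas are yellow.",
    "Venus is the second planet from the Sun.",
]
chunk_vecs = [[0.9, 0.1, 0.0], [0.0, 0.1, 0.9], [0.8, 0.3, 0.1]]
query_vec = [1.0, 0.2, 0.0]

best = top_k_chunks(query_vec, chunk_vecs, k=2)
prompt = build_prompt("Which planet is closest to the Sun?", [chunks[i] for i in best])
```

The resulting `prompt` is what would be sent to the generator model. Keeping the chunk indices around, as `best` does here, is also what makes it possible to link an answer back to its source documents.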

That’s all for today. RAG on…

ChatGPT+/DALL-E3 vision of what it means to RAG on..

