When preparing data for embedding and retrieval in a RAG system, splitting the text into appropriately sized chunks is crucial. This process is guided by two main considerations: model constraints and retrieval effectiveness.
Model Constraints
Embedding models have a maximum token length for input; anything beyond this limit is truncated. Pay attention to your chosen model's limit and make sure each data chunk does not exceed this maximum token length.
Multilingual models in particular often have shorter sequence limits than their English counterparts. For example, the widely used paraphrase-multilingual-MiniLM-L12-v2 model has a maximum context window of just 128 tokens.
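Before indexing, it is worth flagging any chunk that would be silently truncated. The sketch below uses a crude whitespace token count as a stand-in; in practice you would count tokens with the tokenizer of your chosen embedding model (e.g. via transformers' AutoTokenizer, as in the splitter example later), since each model counts tokens differently.

```python
# Flag chunks that exceed the embedding model's token limit before indexing.
MAX_TOKENS = 128  # e.g. the limit of paraphrase-multilingual-MiniLM-L12-v2

def count_tokens(text: str) -> int:
    # Whitespace proxy only; real subword tokenizers usually yield MORE tokens,
    # so treat this as an optimistic lower bound.
    return len(text.split())

def overlong_chunks(chunks, max_tokens=MAX_TOKENS):
    # Return the indices of chunks that would be truncated by the model.
    return [i for i, chunk in enumerate(chunks) if count_tokens(chunk) > max_tokens]

chunks = ["short chunk", "word " * 200]
print(overlong_chunks(chunks))  # -> [1]
```

Anything this check reports should be re-split before embedding, since the truncated tail would never be retrievable.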
Also, consider the text length the model was trained on: some models technically accept longer inputs but were trained on shorter chunks, which can hurt performance on longer texts. One such example is the multi-qa-base model from SBERT.
Retrieval effectiveness
While chunking data to the model's maximum length seems logical, it does not always produce the most effective retrieval results. Larger chunks offer more context for the LLM but can obscure key details, making it harder to retrieve precise matches. Conversely, smaller chunks can improve match accuracy but may lack the context needed for complete answers. Hybrid approaches strike a balance: they use smaller chunks for search but include the surrounding context at query time.
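The hybrid idea can be sketched in a few lines of plain Python. This is an illustrative toy, not a specific library's API (LangChain's ParentDocumentRetriever implements the same pattern with a real vector store): small chunks are matched against the query, but the enclosing parent section is what gets returned. The lexical-overlap scorer here stands in for a real embedding similarity search.

```python
def build_index(sections, small_size=20):
    """Split each parent section into small word-window chunks.
    Returns (small_chunks, parent_of), where parent_of[i] is the index
    of the section the i-th small chunk came from."""
    small_chunks, parent_of = [], []
    for sec_idx, section in enumerate(sections):
        words = section.split()
        for start in range(0, len(words), small_size):
            small_chunks.append(" ".join(words[start:start + small_size]))
            parent_of.append(sec_idx)
    return small_chunks, parent_of

def retrieve(query, small_chunks, parent_of, sections):
    """Match the query against SMALL chunks, return the PARENT section."""
    def score(chunk):
        # Toy scorer: word overlap; a real system would use vector similarity.
        return len(set(query.lower().split()) & set(chunk.lower().split()))
    best = max(range(len(small_chunks)), key=lambda i: score(small_chunks[i]))
    return sections[parent_of[best]]
```

The search granularity (small chunks) and the answer granularity (parent sections) are decoupled, which is the whole point of the hybrid approach.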
While there isn't a definitive answer regarding chunk size, the considerations remain the same whether you're working on multilingual or English projects. I recommend reading further on the subject in resources such as Evaluating the Ideal Chunk Size for a RAG System using LlamaIndex or Building RAG-based LLM Applications for Production.
Text splitting: methods for splitting text
Text can be split using various methods, which mainly fall into two categories: rule-based (relying on character analysis) and machine-learning-based models. ML approaches, from simple NLTK and spaCy tokenizers to advanced transformer models, often depend on language-specific training, primarily in English. Although simple models like NLTK and spaCy support multiple languages, they mainly handle sentence splitting, not semantic sectioning.
Since ML-based sentence splitters currently work poorly for many non-English languages and are compute-intensive, I recommend starting with a simple rule-based splitter. If you've preserved the relevant syntactic structure from the original data and formatted it appropriately, the result will be of good quality.
A common and effective method is a recursive character text splitter, like those used in LangChain or LlamaIndex, which shortens sections by finding the nearest split character in a prioritized sequence (e.g., \n\n, \n, ., ?, !).
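Conceptually, the splitter walks the separator list in priority order and falls back to the next separator only when the current one is absent. The snippet below is a plain-Python illustration of just that priority rule, not LangChain's actual implementation:

```python
def first_usable_separator(text, separators=("\n\n", "\n", ". ", "? ", "! ")):
    # Return the highest-priority separator that actually occurs in the text.
    for sep in separators:
        if sep in text:
            return sep
    return None

doc = "Section one.\n\nSection two is a bit longer.\nIt has two lines."
print(first_usable_separator(doc))               # -> "\n\n"
print(first_usable_separator("One line. Two."))  # -> ". "
```

The real splitter applies this recursively: it splits on the chosen separator, then re-splits any piece that is still over the chunk-size limit using the next separator down the list.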
Taking the formatted text from the previous section, an example of using LangChain's recursive character splitter would look like:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-base-v2")

def token_length_function(text_input):
    return len(tokenizer.encode(text_input, add_special_tokens=False))

text_splitter = RecursiveCharacterTextSplitter(
    # Set a small chunk size, just to show.
    chunk_size = 128,
    chunk_overlap = 0,
    length_function = token_length_function,
    separators = ["\n\n", "\n", ". ", "? ", "! "]
)

split_texts = text_splitter.split_text(
    formatted_document['Boosting RAG: Picking the Best Embedding & Reranker models']
)
It's important to note here that the tokenizer should be that of the embedding model you intend to use, since different models 'count' words differently. The function will now, in prioritized order, split any text longer than 128 tokens first by the \n\n we introduced at the end of sections and, if that is not possible, by the end of paragraphs delimited by \n, and so on. The first 3 chunks will be:
Token of text: 111 UPDATE: The pooling method for the Jina AI embeddings has been adjusted to use mean pooling, and the results have been updated accordingly. Notably, the JinaAI-v2-base-en with bge-reranker-large now exhibits a Hit Rate of 0.938202 and an MRR (Mean Reciprocal Rank) of 0.868539, and with CohereRerank exhibits a Hit Rate of 0.932584 and an MRR of 0.873689.
-----------
Token of text: 112
When building a Retrieval Augmented Generation (RAG) pipeline, one key component is the Retriever. We have a variety of embedding models to choose from, including OpenAI, CohereAI, and open-source sentence transformers. Additionally, there are several rerankers available from CohereAI and sentence transformers.
But with all these options, how do we determine the best mix for top-notch retrieval performance? How do we know which embedding model fits our data best? Or which reranker boosts our results the most?
-----------
Token of text: 54
In this blog post, we'll use the Retrieval Evaluation module from LlamaIndex to swiftly determine the best combination of embedding and reranker models. Let's dive in!
Let’s first start with understanding the metrics available in Retrieval Evaluation
Now that we have successfully split the text in a semantically meaningful way, we can move on to the final part: embedding these chunks for storage.
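That final step can be sketched as follows. The `embed` function here is a deterministic stand-in, not a real model call; in practice you would replace it with your embedding model's encode call (for instance, a SentenceTransformer loaded with intfloat/e5-base-v2, matching the tokenizer used in the splitter above), and the in-memory list stands in for a vector store.

```python
import hashlib

def embed(text: str) -> list[float]:
    # Deterministic toy vector derived from a hash; replace with a real
    # embedding model's encode() in production.
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 255 for b in digest[:8]]

def index_chunks(chunks, source):
    # Store each chunk with its vector and an id tying it back to its source
    # document, so retrieved vectors can be mapped to the original text.
    store = []
    for i, chunk in enumerate(chunks):
        store.append({
            "id": f"{source}-{i}",
            "text": chunk,
            "vector": embed(chunk),
        })
    return store

store = index_chunks(["chunk one", "chunk two"], source="boosting-rag")
print(len(store))  # -> 2
```

Keeping the chunk text alongside its vector matters: at query time the vector finds the match, but it is the stored text that gets handed to the LLM.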