A Guide on 12 Tuning Strategies for Production-Ready RAG Applications

How to improve the performance of your Retrieval-Augmented Generation (RAG) pipeline with these “hyperparameters” and tuning strategies

Towards Data Science
Tuning Strategies for Retrieval-Augmented Generation Applications

Data Science is an experimental science. It starts with the “No Free Lunch Theorem,” which states that there is no one-size-fits-all algorithm that works best for every problem. And it results in data scientists using experiment tracking systems to help them tune the hyperparameters of their Machine Learning (ML) projects to achieve the best performance.

This article looks at a Retrieval-Augmented Generation (RAG) pipeline through the eyes of a data scientist. It discusses potential “hyperparameters” you can experiment with to improve your RAG pipeline’s performance. Similar to experimentation in Deep Learning, where, e.g., data augmentation techniques are not a hyperparameter but a knob you can tune and experiment with, this article will also cover different strategies you can apply, which are not per se hyperparameters.

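This article covers the following “hyperparameters” sorted by their relevant stage. In the ingestion stage of a RAG pipeline, you can achieve performance improvements through:

  • Data cleansing
  • Chunking
  • Embedding models
  • Metadata
  • Multi-indexing
  • Indexing algorithms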

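And in the inferencing stage (retrieval and generation), you can tune:

  • Query transformations
  • Retrieval parameters
  • Advanced retrieval strategies
  • Re-ranking models
  • LLMs
  • Prompt engineering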

Note that this article covers text use cases of RAG. For multimodal RAG applications, different considerations may apply.

Ingestion Stage

The ingestion stage is a preparation step for building a RAG pipeline, similar to the data cleaning and preprocessing steps in an ML pipeline. Usually, the ingestion stage consists of the following steps:

  1. Collect data
  2. Chunk data
  3. Generate vector embeddings of chunks
  4. Store vector embeddings and chunks in a vector database

[Figure: Ingestion stage of a RAG pipeline. Documents are first chunked, then the chunks are embedded, and the embeddings are stored in the vector database.]

This section discusses impactful techniques and hyperparameters that you can apply and tune to improve the relevance of the retrieved contexts in the inferencing stage.

Data cleansing

Like any Data Science pipeline, the quality of your data heavily impacts the outcome of your RAG pipeline [8, 9]. Before moving on to any of the following steps, ensure that your data meets the following criteria:

  • Clean: Apply at least some basic data cleaning techniques commonly used in Natural Language Processing, such as making sure all special characters are encoded correctly.
  • Correct: Make sure your information is consistent and factually accurate to avoid conflicting information confusing your LLM.


Chunking

Chunking your documents is an essential preparation step for your external knowledge source in a RAG pipeline that can impact the performance [1, 8, 9]. It is a technique to generate logically coherent snippets of information, usually by breaking up long documents into smaller sections (but it can also combine smaller snippets into coherent paragraphs).

One consideration you need to make is the choice of the chunking technique. For example, in LangChain, different text splitters split up documents by different logics, such as by characters, tokens, etc. This depends on the type of data you have. For example, you will need to use different chunking techniques if your input data is code vs. if it is a Markdown file.

The ideal length of your chunk (chunk_size) depends on your use case: If your use case is question answering, you may need shorter, specific chunks, but if your use case is summarization, you may need longer chunks. Additionally, if a chunk is too short, it might not contain enough context. On the other hand, if a chunk is too long, it might contain too much irrelevant information.

Additionally, you will need to think about a “rolling window” between chunks (overlap) to introduce some additional context.
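
To make the interplay of chunk_size and overlap concrete, here is a minimal sketch of a character-based splitter in plain Python. It is not LangChain's implementation; the function name chunk_text and the default values are assumptions chosen for illustration, and a real splitter would usually respect sentence or token boundaries.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters (the "rolling window").
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

Increasing overlap adds context to each chunk at the cost of more chunks to embed and store, so both values are worth tracking as experiment parameters.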

Embedding models

Embedding models are at the core of your retrieval. The quality of your embeddings heavily impacts your retrieval results [1, 4]. Usually, the higher the dimensionality of the generated embeddings, the higher the precision of your embeddings.

For an idea of what alternative embedding models are available, you can look at the Massive Text Embedding Benchmark (MTEB) Leaderboard, which covers 164 text embedding models (at the time of this writing).

While you can use general-purpose embedding models out-of-the-box, it may make sense to fine-tune your embedding model to your specific use case in some cases to avoid out-of-domain issues later on [9]. According to experiments conducted by LlamaIndex, fine-tuning your embedding model can lead to a 5–10% performance increase in retrieval evaluation metrics [2].

Note that you cannot fine-tune all embedding models (e.g., OpenAI’s text-embedding-ada-002 can’t be fine-tuned at the moment).


Metadata

When you store vector embeddings in a vector database, some vector databases let you store them together with metadata (or data that is not vectorized). Annotating vector embeddings with metadata can be helpful for additional post-processing of the search results, such as metadata filtering [1, 3, 8, 9]. For example, you could add metadata, such as the date, chapter, or subchapter reference.
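
As an illustration of metadata filtering as a post-processing step, the following sketch keeps only the search results whose metadata matches the given conditions. The Record class and the filter_by_metadata helper are hypothetical names used for this example; vector databases typically expose their own filter syntax instead.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A retrieved chunk with its similarity score and non-vectorized metadata."""
    text: str
    score: float
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(results: list[Record], **conditions) -> list[Record]:
    """Keep results whose metadata equals every given key/value condition."""
    return [
        r for r in results
        if all(r.metadata.get(key) == value for key, value in conditions.items())
    ]


if __name__ == "__main__":
    results = [
        Record("Intro to chunking", 0.91, {"chapter": 1, "date": "2023-12-01"}),
        Record("Embedding models", 0.88, {"chapter": 2, "date": "2023-12-01"}),
    ]
    print([r.text for r in filter_by_metadata(results, chapter=2)])
```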


Multi-indexing

If the metadata is not sufficient to provide additional information to separate different types of context logically, you may want to experiment with multiple indexes [1, 9]. For example, you can use different indexes for different types of documents. Note that you will have to incorporate some index routing at retrieval time [1, 9]. If you are interested in a deeper dive into metadata and separate collections, you might want to learn more about the concept of native multi-tenancy.

Indexing algorithms

To enable lightning-fast similarity search at scale, vector databases and vector indexing libraries use an Approximate Nearest Neighbor (ANN) search instead of a k-nearest neighbor (kNN) search. As the name suggests, ANN algorithms approximate the nearest neighbors and thus can be less precise than a kNN algorithm.

There are different ANN algorithms you could experiment with, such as Facebook Faiss (clustering), Spotify Annoy (trees), Google ScaNN (vector compression), and HNSWLIB (proximity graphs). Also, many of these ANN algorithms have some parameters you could tune, such as ef, efConstruction, and maxConnections for HNSW [1].

Additionally, you can enable vector compression for these indexing algorithms. Analogous to ANN algorithms, you will lose some precision with vector compression. However, depending on the choice of the vector compression algorithm and its tuning, you can optimize this as well.

In practice, however, these parameters are already tuned by research teams of vector databases and vector indexing libraries during benchmarking experiments and not by developers of RAG systems. Still, if you want to experiment with these parameters to squeeze out the last bits of performance, I recommend this article as a starting point:
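
One common way to quantify how much precision an ANN index or a compression setting costs is to compare its results against an exact kNN search and compute recall@k. The sketch below shows this with a brute-force baseline in plain Python; the function names are assumptions, and the approximate result IDs would come from whichever index you are tuning.

```python
import math


def exact_knn(query, vectors, k):
    """Brute-force kNN: indices of the k vectors closest to query (Euclidean)."""
    order = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return order[:k]


def recall_at_k(approx_ids, exact_ids):
    """Fraction of the true nearest neighbors that the approximate search returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


if __name__ == "__main__":
    vectors = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.0, 2.0)]
    truth = exact_knn((0.0, 0.0), vectors, k=2)
    approx = [0, 3]  # placeholder for IDs returned by an ANN index
    print(truth, recall_at_k(approx, truth))
```

Tracking recall@k next to query latency while varying parameters such as ef or efConstruction gives a simple picture of the precision/speed trade-off.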

Inferencing Stage (Retrieval & Generation)

The main components of the RAG pipeline are the retrieval and the generative components. This section mainly discusses strategies to improve the retrieval (query transformations, retrieval parameters, advanced retrieval strategies, and re-ranking models), as this is the more impactful component of the two. But it also briefly touches on some strategies to improve the generation (LLM and prompt engineering).

[Figure: Inference stage of a RAG pipeline (standard RAG schema).]

Query transformations

Since the search query to retrieve additional context in a RAG pipeline is also embedded into the vector space, its phrasing can also impact the search results. Thus, if your search query doesn’t result in satisfactory search results, you can experiment with various query transformation techniques [5, 8, 9], such as:

  • Rephrasing: Use an LLM to rephrase the query and try again.
  • Hypothetical Document Embeddings (HyDE): Use an LLM to generate a hypothetical response to the search query and use both for retrieval.
  • Sub-queries: Break down longer queries into multiple shorter queries.
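
The three transformations above can be sketched as thin wrappers around an LLM call. In this sketch, llm is a hypothetical callable that maps a prompt string to a completion string (any client can be wrapped this way), and the prompt wordings are assumptions for illustration, not recommended templates.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def rephrase(query: str, llm: LLM) -> str:
    """Ask the LLM for an alternative phrasing of the search query."""
    return llm(f"Rephrase the following search query:\n{query}").strip()


def hyde(query: str, llm: LLM) -> list[str]:
    """Return the original query plus a hypothetical answer; both go to retrieval."""
    hypothetical = llm(f"Write a short passage that answers the question:\n{query}")
    return [query, hypothetical.strip()]


def sub_queries(query: str, llm: LLM) -> list[str]:
    """Ask the LLM to split a long query into shorter ones, one per line."""
    raw = llm(f"Break the following query into shorter sub-queries, one per line:\n{query}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


if __name__ == "__main__":
    def fake_llm(prompt: str) -> str:
        # Stand-in for a real model so the example runs offline.
        return "What is chunking?\nWhat is overlap?"

    print(sub_queries("What are chunking and overlap?", fake_llm))
```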

Retrieval parameters

Retrieval is an essential component of the RAG pipeline. The first consideration is whether semantic search will be sufficient for your use case or if you want to experiment with hybrid search.

In the latter case, you need to experiment with weighting the aggregation of sparse and dense retrieval methods in hybrid search [1, 4, 9]. Thus, tuning the parameter alpha, which controls the weighting between semantic (alpha = 1) and keyword-based search (alpha = 0), will become necessary.

Also, the number of search results to retrieve will play an essential role. The number of retrieved contexts will impact the length of the used context window (see Prompt Engineering). Also, if you are using a re-ranking model, you need to consider how many contexts to input to the model (see Re-ranking models).

Note that while the similarity measure used for semantic search is a parameter you can change, you should not experiment with it but instead set it according to the embedding model you use (e.g., text-embedding-ada-002 supports cosine similarity, while multi-qa-MiniLM-l6-cos-v1 supports cosine similarity, dot product, and Euclidean distance).
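
To show what alpha does, here is a minimal sketch that fuses dense (semantic) and sparse (keyword) scores after min-max normalization. This is only one possible fusion scheme, and the function names are assumptions; vector databases may combine the two result sets differently, for example based on ranks.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1] so dense and sparse scores become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 1.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(dense: dict[str, float], sparse: dict[str, float], alpha: float = 0.5):
    """Fuse scores: alpha = 1 is pure semantic search, alpha = 0 pure keyword search."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    dense_n, sparse_n = min_max(dense), min_max(sparse)
    docs = set(dense_n) | set(sparse_n)
    fused = {
        doc: alpha * dense_n.get(doc, 0.0) + (1 - alpha) * sparse_n.get(doc, 0.0)
        for doc in docs
    }
    # Best-scoring documents first.
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    dense = {"doc_a": 0.9, "doc_b": 0.1}   # e.g. cosine similarities
    sparse = {"doc_a": 1.0, "doc_b": 5.0}  # e.g. BM25 scores
    for alpha in (0.0, 0.5, 1.0):
        print(alpha, hybrid_scores(dense, sparse, alpha))
```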

Advanced retrieval strategies

This section could technically be its own article. For this overview, we will keep this as concise as possible. For an in-depth explanation of the following techniques, I recommend this DeepLearning.AI course:

The underlying idea of this section is that the chunks for retrieval shouldn’t necessarily be the same chunks used for the generation. Ideally, you would embed smaller chunks for retrieval (see Chunking) but retrieve bigger contexts [7].

  • Sentence-window retrieval: Do not just retrieve the relevant sentence, but also the window of sentences before and after the retrieved one.
  • Auto-merging retrieval: The documents are organized in a tree-like structure. At query time, separate but related, smaller chunks can be consolidated into a larger context.
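
The sentence-window idea from the list above can be sketched in a few lines: retrieval matches a single sentence, but the context handed to the LLM includes its neighbors. The function name and the window_size parameter are assumptions for illustration, and the sketch assumes the document has already been split into sentences.

```python
def sentence_window(sentences: list[str], hit_index: int, window_size: int = 1) -> str:
    """Return the matched sentence together with window_size neighbors on each side."""
    if not 0 <= hit_index < len(sentences):
        raise IndexError("hit_index is outside the sentence list")
    start = max(0, hit_index - window_size)
    end = min(len(sentences), hit_index + window_size + 1)
    return " ".join(sentences[start:end])


if __name__ == "__main__":
    doc = ["RAG has two stages.", "First comes ingestion.", "Then comes inference.", "Both can be tuned."]
    # Suppose retrieval matched the sentence at index 2.
    print(sentence_window(doc, hit_index=2, window_size=1))
```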

Re-ranking models

While semantic search retrieves context based on its semantic similarity to the search query, “most similar” doesn’t necessarily mean “most relevant”. Re-ranking models, such as Cohere’s Rerank model, can help eliminate irrelevant search results by computing a score for the relevance of the query for each retrieved context [1, 9].

If you are using a re-ranker model, you may need to re-tune the number of search results for the input of the re-ranker and how many of the re-ranked results you want to feed into the LLM.

As with the embedding models, you may want to experiment with fine-tuning the re-ranker to your specific use case.
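
The two numbers to tune around a re-ranker can be made explicit in a small two-stage sketch: top_k candidates go into the re-ranker, and top_n re-ranked results go on to the LLM. The relevance argument is a hypothetical scoring callable standing in for a real re-ranking model, and the parameter names are assumptions.

```python
from typing import Callable


def retrieve_and_rerank(
    query: str,
    candidates: list[str],
    relevance: Callable[[str, str], float],
    top_k: int = 20,
    top_n: int = 3,
) -> list[str]:
    """Re-rank the top_k most similar candidates and keep the top_n most relevant.

    candidates must already be ordered by similarity, best first.
    """
    shortlist = candidates[:top_k]  # input size of the re-ranker
    reranked = sorted(shortlist, key=lambda c: relevance(query, c), reverse=True)
    return reranked[:top_n]  # number of contexts passed to the LLM


if __name__ == "__main__":
    def word_overlap(query: str, context: str) -> float:
        # Toy relevance score used only so the example runs without a model.
        return float(len(set(query.split()) & set(context.split())))

    hits = ["cats sleep", "dogs bark loudly", "dogs bark"]
    print(retrieve_and_rerank("dogs bark", hits, word_overlap, top_k=2, top_n=1))
```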


LLMs

The LLM is the core component for generating the response. Similarly to the embedding models, there is a wide range of LLMs you can choose from depending on your requirements, such as open vs. proprietary models, inferencing costs, context length, etc. [1]

As with the embedding models or re-ranking models, you may want to experiment with fine-tuning the LLM to your specific use case to incorporate specific wording or tone of voice.

Prompt engineering

The way you phrase or engineer your prompt will significantly impact the LLM’s completion [1, 8, 9].

Please base your answer only on the search results and nothing else!
Very important! Your answer MUST be grounded in the search results provided.
Please explain why your answer is grounded in the search results!

Additionally, using few-shot examples in your prompt can improve the quality of the completions.

As mentioned in Retrieval parameters, the number of contexts fed into the prompt is a parameter you should experiment with [1]. While the performance of your RAG pipeline can improve with increasing relevant context, you can also run into a “Lost in the Middle” [6] effect where relevant context is not recognized as such by the LLM if it is placed in the middle of many contexts.
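
A minimal sketch of how these knobs come together in a prompt template is shown below: a grounding instruction, optional few-shot examples, and a cap on the number of contexts. The function build_prompt, its parameters, and the layout of the prompt are assumptions for illustration, not a recommended template.

```python
def build_prompt(question, contexts, examples=(), max_contexts=3):
    """Assemble a RAG prompt from an instruction, few-shot examples, and contexts.

    contexts is expected to be ordered by relevance, best first.
    examples is an iterable of (question, answer) pairs.
    """
    lines = ["Please base your answer only on the search results and nothing else!", ""]
    for example_question, example_answer in examples:
        lines += [f"Example question: {example_question}",
                  f"Example answer: {example_answer}", ""]
    lines.append("Search results:")
    # Cap the number of contexts to limit prompt length.
    for i, context in enumerate(contexts[:max_contexts], start=1):
        lines.append(f"[{i}] {context}")
    lines += ["", f"Question: {question}", "Answer:"]
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt(
        "What is chunk overlap?",
        ["Overlap is a rolling window between chunks.", "Chunks are embedded."],
        examples=[("What is RAG?", "Retrieval-Augmented Generation.")],
        max_contexts=1,
    ))
```

Varying max_contexts while evaluating answer quality is one simple way to check whether additional contexts still help or start to get lost.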

Summary

As more and more developers gain experience with prototyping RAG pipelines, it becomes more important to discuss strategies to bring RAG pipelines to production-ready performance. This article discussed different “hyperparameters” and other knobs you can tune in a RAG pipeline according to the relevant stages:

This article covered the following strategies in the ingestion stage:

  • Data cleansing: Ensure data is clean and correct.
  • Chunking: Choice of chunking technique, chunk size (chunk_size), and chunk overlap (overlap).
  • Embedding models: Choice of the embedding model, incl. dimensionality, and whether to fine-tune it.
  • Metadata: Whether to use metadata and choice of metadata.
  • Multi-indexing: Decide whether to use multiple indexes for different data collections.
  • Indexing algorithms: The choice of ANN and vector compression algorithms and their parameters can be tuned but usually is not tuned by practitioners.

And the following strategies in the inferencing stage (retrieval and generation):

  • Query transformations: Experiment with rephrasing, HyDE, or sub-queries.
  • Retrieval parameters: Choice of search technique (alpha if you have hybrid search enabled) and the number of retrieved search results.
  • Advanced retrieval strategies: Whether to use advanced retrieval strategies, such as sentence-window or auto-merging retrieval.
  • Re-ranking models: Whether to use a re-ranking model, choice of re-ranking model, number of search results to input into the re-ranking model, and whether to fine-tune the re-ranking model.
  • LLMs: Choice of LLM and whether to fine-tune it.
  • Prompt engineering: Experiment with different phrasing and few-shot examples.

