
Application-oriented methods from current research

This article explores methods to improve the truthfulness of Retrieval Augmented Generation (RAG) application outputs, focusing on mitigating issues like hallucinations and reliance on pre-trained knowledge. I examine the causes of untruthful results, review methods for assessing truthfulness, and propose solutions to improve accuracy. The study emphasises the importance of groundedness and completeness in RAG outputs, recommending fine-tuning Large Language Models (LLMs) and employing element-aware summarisation to ensure factual accuracy. Moreover, it discusses the use of scalable evaluation metrics, such as the Learnable Evaluation Metric for Text Simplification (LENS) and Chain of Thought-based (CoT) evaluations, for real-time output verification. The article highlights the need to balance the benefits of increased truthfulness against potential costs and performance impacts, suggesting a selective approach to method implementation based on application needs.
A widely used Large Language Model (LLM) architecture that can provide insight into application outputs and reduce hallucinations is Retrieval Augmented Generation (RAG). RAG is a technique to expand LLM memory by combining parametric memory (i.e. LLM pre-trained) with non-parametric (i.e. retrieved document) memories. To do this, the most relevant documents are retrieved from a vector database and, together with the user query and a customised prompt, passed to an LLM, which generates a response (see Figure 1). For further details, see Lewis et al. (2021).
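To make this pipeline concrete, the sketch below shows the retrieval and generation steps in Python. Note that `embed`, `vector_db.search`, and `llm.generate` are hypothetical placeholders standing in for an embedding model, a vector database client, and an LLM client; they are not a specific library's API.

```python
# Minimal RAG sketch: retrieve the top-k documents, then generate a
# grounded answer. `embed`, `vector_db`, and `llm` are hypothetical
# placeholders for an embedding model, vector store, and LLM client.

def rag_answer(query: str, embed, vector_db, llm, k: int = 3) -> str:
    # 1. Retrieval step: find the k most relevant documents for the query.
    docs = vector_db.search(embed(query), top_k=k)
    context = "\n\n".join(doc.text for doc in docs)

    # 2. Generation step: pass context + query with a customised prompt.
    prompt = (
        "Answer the question using ONLY the context below. "
        "Cite the source document for each claim.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return llm.generate(prompt)
```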
A real-world application could, for example, connect an LLM to a database of medical guideline documents. Medical practitioners could replace manual look-up by asking natural language questions, using RAG as a “search engine”. The application would answer the user’s question and reference the source guideline. If the answer is based on parametric memory, e.g. answering from guidelines contained in the pre-training data but not in the connected database, or if the LLM hallucinates, this could have drastic implications.
Firstly, if medical practitioners check answers against the referenced guidelines and find discrepancies, they could lose trust in the application’s answers, leading to less usage. Secondly, and more worryingly, if not every answer is verified, an answer could falsely be assumed to be based on the queried medical guidelines, directly affecting the patient’s treatment. This highlights the relevance of the truthfulness of outputs in RAG applications.
In this article assessing RAG, truth is defined as being firmly grounded in the factual knowledge of the retrieved document. To investigate this issue, one General Research Question (GRQ) and three Specific Research Questions (SRQs) are derived.
GRQ: How can the truthfulness of RAG outputs be improved?
SRQ 1: What causes untruthful results to be generated by RAG applications?
SRQ 2: How can truthfulness be evaluated?
SRQ 3: What methods can be used to increase truthfulness?
To answer the GRQ, the SRQs are analysed sequentially on the basis of a literature review. The aim is to identify methods that can be implemented for use cases such as the above example from the medical field. Ultimately, two categories of solution methods will be recommended for further evaluation and customisation.
As previously defined, a truthful answer needs to be firmly grounded in the factual knowledge of the retrieved document. One metric for this is factual consistency, measuring whether the summary contains untruthful or misleading facts that are not supported by the source text (Liu et al., 2023). It is used as a critical evaluation metric in multiple benchmarks (Kim et al., 2023; Fabbri et al., 2021; Deutsch & Roth, 2022; Wang et al., 2023; Wu et al., 2023). In the context of RAG, this is often referred to as groundedness (Levonian et al., 2023). Furthermore, to take the usefulness of a truthful answer into account, its completeness is also of relevance. The following paragraphs give insight into the reasons behind untruthful RAG results. These concern the generation step in Figure 1, which summarises the retrieved documents with respect to the user query.
Firstly, the groundedness of a RAG application is impacted if the LLM answer is based on parametric memory rather than the factual knowledge of the retrieved document. This can, for example, occur if the answer comes from pre-trained knowledge or is caused by hallucinations. Hallucinations remain a fundamental problem of LLMs (Bang et al., 2023; Ji et al., 2023; Zhang & Gao, 2023), from which even powerful LLMs suffer (Liu et al., 2023). By definition, low groundedness leads to untruthful RAG results.
Secondly, completeness describes whether an LLM’s answer lacks factual knowledge from the documents. This can be due to the low summarisation capability of an LLM or to missing domain knowledge needed to interpret the factual knowledge (T. Zhang et al., 2023). The output could still be highly grounded; nevertheless, an answer might be incomplete with respect to the documents, leading to an incorrect user perception of the content of the database. In addition, if factual knowledge from the document is missing, the LLM may be encouraged to compensate by answering from its own parametric memory, raising the issue mentioned above.
Having established the key causes of untruthful outputs, it is necessary to first measure and quantify these errors before a solution can be pursued. Therefore, the following section covers measurement methods for the aforementioned sources of untruthful RAG outputs.
Having elaborated on groundedness and completeness and their origins, this section guides through their measurement methods. I will begin with the widely known general-purpose methods and proceed by highlighting recent trends. TruLens’s Feedback Functions plot serves here as a valuable reference for scalability and meaningfulness (see Figure 2).
In natural language generation evaluation, traditional metrics like ROUGE (Lin, 2004) and BLEU (Papineni et al., 2002) are widely used but tend to show a discrepancy from human assessments (Liu et al., 2023). Moreover, medium-sized language models (MLMs) have demonstrated results superior to traditional evaluation metrics, but can be replaced by LLMs in many areas (X. Zhang & Gao, 2023). Lastly, another well-known evaluation method is the human evaluation of generated text, which has obvious drawbacks of scale and cost (Fabbri et al., 2021). Due to the downsides of these methods (see Figure 2), they are not considered further in this article.
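For illustration, these traditional metrics are cheap to compute at scale; the short example below uses Hugging Face’s `evaluate` library (assuming it and the underlying metric packages are installed). Their reliance on n-gram overlap with a reference text is also exactly what explains the discrepancy from human judgement:

```python
# Computing ROUGE and BLEU with Hugging Face's `evaluate` library.
# Both metrics score n-gram overlap with a reference, which is why a
# factually correct paraphrase can still receive a low score.
import evaluate

rouge = evaluate.load("rouge")
bleu = evaluate.load("bleu")

predictions = ["The guideline recommends a daily dose of 50 mg."]
references = [["A daily dose of 50 mg is recommended by the guideline."]]

print(rouge.compute(predictions=predictions, references=references))
print(bleu.compute(predictions=predictions, references=references))
```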
Concerning recent trends, evaluation metrics have developed alongside the rise in popularity of LLMs. One such development is LLM evaluation, in which another LLM assesses the generated text through Chain of Thought (CoT) reasoning (Liu et al., 2023). Through bespoke prompting strategies, areas of focus like groundedness and completeness can be emphasised and numerically scored (Kim et al., 2023). For this method, it has been shown that a larger model size is beneficial for summarisation evaluation (Liu et al., 2023). Furthermore, this evaluation can also be based on references or collected ground truth, comparing generated text and reference text (Wu et al., 2023). For open-ended tasks with no single correct answer, however, LLM-based evaluation outperforms reference-based metrics in terms of correlation with human quality judgements. Furthermore, ground-truth collection can be costly. Therefore, reference- or ground-truth-based metrics are outside the scope of this assessment (Liu et al., 2023; Feedback Functions — TruLens, n.d.).
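The following is a minimal reference-free evaluation sketch in the spirit of G-Eval (Liu et al., 2023): an evaluator LLM reasons step by step over the answer’s claims and returns a numeric groundedness score. The prompt wording, the model choice, and the use of an OpenAI-style client are my own assumptions, not the exact setup from the paper.

```python
# Reference-free LLM evaluation sketch in the spirit of G-Eval:
# an evaluator LLM reasons step by step (CoT) and emits a 1-5 score.
# The prompt is an illustrative simplification, not the G-Eval prompt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EVAL_PROMPT = """You are grading a RAG answer for groundedness.
Source documents:
{context}

Generated answer:
{answer}

Think step by step: list each claim in the answer and check whether it
is supported by the source documents. Then output a final line
'Score: X', where X is an integer from 1 (ungrounded) to 5 (fully grounded)."""

def groundedness_score(context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4",  # larger models have been shown to evaluate better
        messages=[{"role": "user",
                   "content": EVAL_PROMPT.format(context=context, answer=answer)}],
        temperature=0,
    )
    text = response.choices[0].message.content
    # Assumes the model followed the 'Score: X' output format.
    return int(text.rsplit("Score:", 1)[-1].strip().split()[0].rstrip("."))
```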
Concluding with a noteworthy recent development, the Learnable Evaluation Metric for Text Simplification (LENS), stated to be “the first supervised automatic metric for text simplification evaluation” by Maddela et al. (2023), has demonstrated promising outcomes in recent benchmarks. It is recognised for its effectiveness in identifying hallucinations (Kew et al., 2023). In terms of scalability and meaningfulness, it is expected to be slightly more scalable, due to lower cost, and slightly less meaningful than LLM evaluations, placing LENS near LLM Evals in the top right corner of Figure 2. However, further assessment would be required to verify these claims. This concludes the evaluation methods in scope; the next section focuses on methods for their application.
Having established the relevance of truthfulness in RAG applications in section 1, the causes of untruthful output with SRQ 1, and its evaluation with SRQ 2, this section focuses on SRQ 3, detailing specific recommended methods for improving groundedness and completeness to increase truthful responses. These methods can be categorised into two groups: improvements in the generation of output and validation of output.
To improve the generation step of the RAG application, this article highlights two methods. These are visualised in Figure 3, with the simplified RAG architecture referenced on the left. The first method is fine-tuning the generation LLM. Instruction tuning, rather than model size, is critical to an LLM’s zero-shot summarisation capability; thus, state-of-the-art instruction-tuned LLMs can produce summaries on par with those written by freelance writers (T. Zhang et al., 2023). The second method focuses on element-aware summarisation. With CoT prompting, as presented in SumCoT, LLMs can generate summaries step by step, emphasising the factual entities of the source text (Wang et al., 2023). Specifically, in an additional step, factual elements are extracted from the relevant documents and made available to the LLM, together with the context, for the summarisation (see Figure 3 and the sketch below). Both methods have shown promising results for improving the groundedness and completeness of LLM-generated summaries.
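A minimal sketch of this element-aware, two-step prompting idea follows (after SumCoT by Wang et al., 2023, with prompts simplified by me and `llm.generate` again a hypothetical placeholder for an LLM completion call):

```python
# Two-step element-aware summarisation sketch following the SumCoT idea
# (Wang et al., 2023). Prompts are simplified illustrations.

def element_aware_summary(llm, documents: str, query: str) -> str:
    # Step 1: extract the factual elements (entities, dates, numbers,
    # events) from the retrieved documents.
    elements = llm.generate(
        "Extract the core factual elements (entities, dates, numbers, "
        f"events) from the following documents:\n\n{documents}"
    )

    # Step 2: summarise with both the documents and the extracted
    # elements in context, so the answer stays anchored to the facts.
    return llm.generate(
        f"Documents:\n{documents}\n\nFactual elements:\n{elements}\n\n"
        "Using ONLY the documents and the factual elements above, "
        f"answer the question step by step: {query}"
    )
```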
In the validation of RAG outputs, LLM-generated summaries are evaluated for groundedness and completeness. This can be done by CoT-prompting an LLM to aggregate a groundedness and completeness score. Figure 4 depicts an example CoT prompt, which can be forwarded to an LLM of larger model size for completion. Additionally, this step can be replaced or augmented by using supervised metrics like LENS. Finally, the generated evaluation is compared against a threshold: outputs that are not grounded or are incomplete can be modified, flagged to the user, or potentially rejected, as sketched below.
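Combined with a scoring function like the one sketched earlier, the validation step reduces to a simple gate. The thresholds below are arbitrary placeholders that would need tuning for the concrete application:

```python
# Validation gate sketch: accept, flag, or reject a RAG answer based on
# groundedness/completeness scores (1-5). Thresholds are placeholders.

GROUNDEDNESS_MIN = 4
COMPLETENESS_MIN = 3

def validate(answer: str, groundedness: int, completeness: int) -> str:
    if groundedness < GROUNDEDNESS_MIN:
        # Ungrounded content is the riskier failure mode: reject outright.
        return "Answer withheld: it could not be verified against the sources."
    if completeness < COMPLETENESS_MIN:
        # Grounded but incomplete: deliver, but flag the limitation.
        return answer + "\n\nNote: this answer may not cover all relevant sources."
    return answer
```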
Before adapting these methods to RAG applications, it should be considered that run-time evaluation and fine-tuning of the generation model lead to additional costs. Moreover, the evaluation step will affect the application’s answering speed. Lastly, receiving no answer due to output rejections and raised truthfulness concerns might confuse application users. Consequently, it is critical to evaluate these methods with respect to the field of application, the functionality of the application, and the users’ expectations, leading to a customised approach to increasing the truthfulness of RAG application outputs.
Unless otherwise noted, all images are by the author.
Bang, Y., Cahyawijaya, S., Lee, N., Dai, W., Su, D., Wilie, B., Lovenia, H., Ji, Z., Yu, T., Chung, W., Do, Q. V., Xu, Y., & Fung, P. (2023). A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity (arXiv:2302.04023). arXiv. https://doi.org/10.48550/arXiv.2302.04023
Deutsch, D., & Roth, D. (2022). Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics (arXiv:2204.10206). arXiv. https://doi.org/10.48550/arXiv.2204.10206
Fabbri, A. R., Kryściński, W., McCann, B., Xiong, C., Socher, R., & Radev, D. (2021). SummEval: Re-evaluating Summarization Evaluation (arXiv:2007.12626). arXiv. https://doi.org/10.48550/arXiv.2007.12626
Feedback Functions — TruLens. (n.d.). Retrieved February 11, 2024, from https://www.trulens.org/trulens_eval/core_concepts_feedback_functions/#feedback-functions
Ji, Z., Lee, N., Frieske, R., Yu, T., Su, D., Xu, Y., Ishii, E., Bang, Y., Dai, W., Madotto, A., & Fung, P. (2023). Survey of Hallucination in Natural Language Generation. ACM Computing Surveys, 55(12), 1–38. https://doi.org/10.1145/3571730
Kew, T., Chi, A., Vásquez-Rodríguez, L., Agrawal, S., Aumiller, D., Alva-Manchego, F., & Shardlow, M. (2023). BLESS: Benchmarking Large Language Models on Sentence Simplification (arXiv:2310.15773). arXiv. https://doi.org/10.48550/arXiv.2310.15773
Kim, J., Park, S., Jeong, K., Lee, S., Han, S. H., Lee, J., & Kang, P. (2023). Which is better? Exploring Prompting Strategy For LLM-based Metrics (arXiv:2311.03754). arXiv. https://doi.org/10.48550/arXiv.2311.03754
Levonian, Z., Li, C., Zhu, W., Gade, A., Henkel, O., Postle, M.-E., & Xing, W. (2023). Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference (arXiv:2310.03184). arXiv. https://doi.org/10.48550/arXiv.2310.03184
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W., Rocktäschel, T., Riedel, S., & Kiela, D. (2021). Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (arXiv:2005.11401). arXiv. https://doi.org/10.48550/arXiv.2005.11401
Lin, C.-Y. (2004). ROUGE: A Package for Automatic Evaluation of Summaries. Text Summarization Branches Out, 74–81. https://aclanthology.org/W04-1013
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., & Zhu, C. (2023). G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment (arXiv:2303.16634). arXiv. https://doi.org/10.48550/arXiv.2303.16634
Maddela, M., Dou, Y., Heineman, D., & Xu, W. (2023). LENS: A Learnable Evaluation Metric for Text Simplification (arXiv:2212.09739). arXiv. https://doi.org/10.48550/arXiv.2212.09739
Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: A Method for Automatic Evaluation of Machine Translation. In P. Isabelle, E. Charniak, & D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318). Association for Computational Linguistics. https://doi.org/10.3115/1073083.1073135
Wang, Y., Zhang, Z., & Wang, R. (2023). Element-aware Summarization with Large Language Models: Expert-aligned Evaluation and Chain-of-Thought Method (arXiv:2305.13412). arXiv. https://doi.org/10.48550/arXiv.2305.13412
Wu, N., Gong, M., Shou, L., Liang, S., & Jiang, D. (2023). Large Language Models are Diverse Role-Players for Summarization Evaluation (arXiv:2303.15078). arXiv. https://doi.org/10.48550/arXiv.2303.15078
Zhang, T., Ladhak, F., Durmus, E., Liang, P., McKeown, K., & Hashimoto, T. B. (2023). Benchmarking Large Language Models for News Summarization (arXiv:2301.13848). arXiv. https://doi.org/10.48550/arXiv.2301.13848
Zhang, X., & Gao, W. (2023). Towards LLM-based Fact Verification on News Claims with a Hierarchical Step-by-Step Prompting Method (arXiv:2310.00305). arXiv. https://doi.org/10.48550/arXiv.2310.00305