Retrieval-augmented language models often retrieve only short chunks from a corpus, which limits how much of a document's overall context they can use. Retrieval augmentation is meant to help models adapt to changes in world state and incorporate long-tail knowledge, but existing approaches have shortcomings. The one tackled here is that most existing methods retrieve only a few short, contiguous text chunks, which limits their ability to represent and leverage large-scale discourse structure. This is especially relevant for thematic questions that require integrating knowledge from multiple parts of a text, such as understanding an entire book.
Recent developments in Large Language Models (LLMs) demonstrate their effectiveness as standalone knowledge stores that encode facts in their parameters. Fine-tuning on downstream tasks further improves their performance. Nonetheless, updating LLMs with evolving world knowledge remains a challenge. An alternative approach is to index text in an information retrieval system and present the retrieved information to the LLM, giving it current, domain-specific knowledge. Existing retrieval-augmented methods, however, are limited to retrieving short, contiguous text chunks. This hinders the representation of large-scale discourse structure, which is crucial for thematic questions and for a comprehensive understanding of long texts such as those in the NarrativeQA dataset.
Researchers from Stanford University propose RAPTOR, a novel indexing and retrieval system designed to address the limitations of existing methods. RAPTOR uses a tree structure to capture both the high-level and low-level details of a text. It clusters text chunks, generates summaries for the clusters, and constructs a tree from the bottom up. This structure makes it possible to load text at different levels of abstraction into an LLM's context, so that questions at different levels can be answered efficiently and effectively. The key contribution is the use of text summarization for retrieval augmentation, which enhances context representation across different scales, as demonstrated in experiments on collections of long documents.
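The bottom-up construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `Node`, `group_consecutive`, `join_summary`, and `build_tree` are hypothetical names, the grouping function is a stand-in for real clustering, and the summarizer is a stand-in for an LLM-generated summary.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    text: str
    level: int  # 0 = original chunk, higher = more abstract summary
    children: List["Node"] = field(default_factory=list)


def group_consecutive(nodes: List[Node], size: int = 2) -> List[List[Node]]:
    # Stand-in for clustering (assumption): groups neighbouring nodes in order.
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def join_summary(texts: List[str]) -> str:
    # Stand-in for an LLM summarizer (assumption): keeps the first three
    # words of each text and joins them.
    return " / ".join(" ".join(t.split()[:3]) for t in texts)


def build_tree(
    chunks: List[str],
    cluster: Callable[[List[Node]], List[List[Node]]] = group_consecutive,
    summarize: Callable[[List[str]], str] = join_summary,
) -> Node:
    """Cluster the current layer, summarize each cluster, repeat until one root."""
    if not chunks:
        raise ValueError("need at least one chunk")
    layer = [Node(text=c, level=0) for c in chunks]
    level = 0
    while len(layer) > 1:
        level += 1
        groups = cluster(layer)
        if len(groups) >= len(layer):
            # Clustering made no progress; merge everything so the loop ends.
            groups = [layer]
        layer = [
            Node(text=summarize([n.text for n in g]), level=level, children=list(g))
            for g in groups
        ]
    return layer[0]


def count_leaves(node: Node) -> int:
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children)


root = build_tree(["a b c d", "e f", "g h i", "j", "k l"])
print(root.level, count_leaves(root))
```

Passing `cluster` and `summarize` as parameters keeps the tree-building loop independent of the particular clustering algorithm or summarization model that is plugged in.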
RAPTOR addresses issues of semantic depth and connection in reading by constructing a recursive tree structure that captures both broad thematic comprehension and granular details. The method segments the retrieval corpus into chunks, embeds them using SBERT, and groups them with a soft clustering algorithm based on Gaussian Mixture Models (GMMs), with Uniform Manifold Approximation and Projection (UMAP) used to reduce the dimensionality of the embeddings. The resulting tree can be queried efficiently either by tree traversal or by a collapsed tree approach, enabling retrieval of relevant information at different levels of specificity.
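The two querying strategies can be contrasted with a small sketch. It makes several assumptions that are not from the article: embeddings are hand-written two-dimensional toy vectors rather than SBERT outputs, relevance is plain cosine similarity, and the names `Node`, `collapsed_tree_search`, and `tree_traversal_search` are hypothetical. The collapsed variant ranks every node in the tree at once, while the traversal variant starts at the root and descends through the best-matching children layer by layer.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class Node:
    text: str
    embedding: Sequence[float]  # toy vector; a real system would use a text encoder
    children: List["Node"] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flatten(root: Node) -> List[Node]:
    # Collect every node in the tree, summaries and leaf chunks alike.
    out, stack = [], [root]
    while stack:
        node = stack.pop()
        out.append(node)
        stack.extend(node.children)
    return out


def collapsed_tree_search(root: Node, query: Sequence[float], k: int = 2) -> List[Node]:
    # Rank all nodes together, regardless of their level in the tree.
    ranked = sorted(flatten(root), key=lambda n: cosine(n.embedding, query), reverse=True)
    return ranked[:k]


def tree_traversal_search(root: Node, query: Sequence[float], k: int = 1) -> List[Node]:
    # Keep the top-k nodes of each layer, then continue with their children.
    selected, frontier = [], [root]
    while frontier:
        best = sorted(frontier, key=lambda n: cosine(n.embedding, query), reverse=True)[:k]
        selected.extend(best)
        frontier = [child for node in best for child in node.children]
    return selected


leaf_a = Node("chunk about the hero's childhood", [1.0, 0.0])
leaf_b = Node("chunk about the final battle", [0.0, 1.0])
root = Node("summary of the whole story", [0.7, 0.7], [leaf_a, leaf_b])

query = [0.9, 0.1]
print([n.text for n in collapsed_tree_search(root, query, k=2)])
print([n.text for n in tree_traversal_search(root, query, k=1)])
```

In this toy setup both strategies return the same two nodes in a different order; on a larger tree the collapsed search is free to mix levels, whereas traversal always returns a fixed number of nodes per layer.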
RAPTOR outperforms baseline methods across three question-answering datasets: NarrativeQA, QASPER, and QuALITY. Controlled comparisons using UnifiedQA 3B as the reader show that RAPTOR is consistently superior to BM25 and DPR. Paired with GPT-4, RAPTOR achieves state-of-the-art results on the QASPER and QuALITY datasets, showing its effectiveness in handling thematic and multi-hop queries. The contribution of the tree structure is also validated, demonstrating the importance of upper-level nodes in capturing a broader understanding of the text and improving retrieval.
In conclusion, Stanford University researchers introduce RAPTOR, a novel tree-based retrieval system that augments the knowledge of large language models with contextual information at different levels of abstraction. RAPTOR constructs a hierarchical tree structure through recursive clustering and summarization, facilitating the effective synthesis of information from diverse sections of the retrieval corpus. Controlled experiments show RAPTOR's superiority over traditional methods, establishing new benchmarks on several question-answering tasks. Overall, RAPTOR is a promising approach for advancing the capabilities of language models through enhanced contextual retrieval.
Check out the Paper. All credit for this research goes to the researchers of this project.
Asjad is an intern consultant at Marktechpost. He is pursuing a B.Tech in mechanical engineering at the Indian Institute of Technology, Kharagpur. Asjad is a machine learning and deep learning enthusiast who is always researching the applications of machine learning in healthcare.