Large language models (LLMs) have greatly advanced the state of the art in various understanding and generation tasks, revolutionizing natural language processing. Most LLMs benefit from self-supervised training over huge corpora, gathering information from a fixed-size local context and displaying emergent abilities such as zero-shot prompting, in-context learning, and Chain-of-Thought (CoT) reasoning. However, the input length limit of present LLMs prevents them from generalizing to real-world applications, such as long-horizon planning, where the ability to handle long-form material beyond a fixed-size context window is crucial.
The most straightforward answer to the length limit is simply scaling up the input context length. GPT-3, for instance, raises the input length from GPT-2's 1k tokens to 2k tokens to better capture long-range dependencies. In-context dense attention is nevertheless severely constrained by the quadratic computational complexity of Transformer self-attention, and this approach usually requires computationally expensive training from scratch. Another recent line of research, which still mostly requires training from scratch, focuses on in-context sparse attention to avoid the quadratic cost of self-attention.
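To see why dense attention becomes prohibitive at longer contexts, here is a quick back-of-the-envelope sketch (our own arithmetic, not from the paper) of how the per-head attention score matrix grows with context length:

```python
# Dense self-attention builds an L x L score matrix per head, so compute and memory
# grow quadratically with the context length L (illustrative arithmetic only).
def score_entries(length: int) -> int:
    return length * length

for length in (1_024, 2_048, 65_536):
    print(f"context {length:>6,} tokens -> {score_entries(length):>13,} score entries per head")

# context  1,024 tokens ->     1,048,576 score entries per head
# context  2,048 tokens ->     4,194,304 score entries per head
# context 65,536 tokens -> 4,294,967,296 score entries per head
```

Going from a 2k to a 64k-scale context multiplies the dense attention cost by roughly a thousand, which is why long-context methods look for alternatives to full dense attention.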
The Memorizing Transformer (MemTRM) is a well-known study along this line: it approximates in-context sparse attention via dense attention over both in-context tokens and memorized tokens retrieved from a non-differentiable memory for Transformers. By scaling the resulting language model to handle up to 65k tokens, MemTRM delivers significant perplexity gains when modeling long books or papers. However, MemTRM's coupled memory design, which uses a single model both for encoding memory and for fusing it into language modeling, introduces the memory staleness problem during training. In other words, as the model parameters are updated, earlier representations cached in memory can drift away from the distribution of those produced by the most recent model, reducing the usefulness of the memory augmentation.
In this paper, authors from UCSB and Microsoft Research propose the LONGMEM framework, which lets a language model cache long-form prior context or knowledge in a non-differentiable memory bank and exploit it through a decoupled memory module, thereby addressing the memory staleness problem. To achieve decoupled memory, they design a novel residual side network (SideNet). A frozen backbone LLM extracts the paired attention keys and values from the previous context into the memory bank. In the SideNet's memory-augmented layer, the attention queries of the current input are used to retrieve the cached keys and values of earlier contexts, and the retrieved memory augmentations are then fused into the learned hidden states via a joint attention process.
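To make the memory-augmented layer more concrete, here is a minimal single-head PyTorch sketch of retrieving cached key/value pairs and fusing them with local attention via joint attention. The function name, shapes, and per-token top-k retrieval are our own illustrative simplifications, not the authors' released implementation:

```python
import torch
import torch.nn.functional as F

def memory_augmented_attention(q, local_k, local_v, mem_k, mem_v, top_k=8):
    """Illustrative single-head sketch of joint attention over local context and
    retrieved memory. Shapes: q/local_k/local_v are (n, d); mem_k/mem_v are (m, d)."""
    # 1) Retrieve the top-k most relevant cached key/value pairs for each query token.
    mem_scores = q @ mem_k.T                                 # (n, m)
    top_scores, idx = mem_scores.topk(top_k, dim=-1)         # (n, top_k)
    retrieved_v = mem_v[idx]                                 # (n, top_k, d)

    # 2) Joint attention: each query attends over local keys and retrieved memory together.
    local_scores = q @ local_k.T                             # (n, n)
    joint = torch.cat([local_scores, top_scores], dim=-1) / q.shape[-1] ** 0.5
    probs = F.softmax(joint, dim=-1)

    local_out = probs[:, : local_k.shape[0]] @ local_v                         # (n, d)
    mem_out = torch.einsum("nk,nkd->nd", probs[:, local_k.shape[0]:], retrieved_v)
    return local_out + mem_out

# Tiny usage example with random tensors standing in for the cached memory bank.
d = 32
q, lk, lv = torch.randn(6, d), torch.randn(6, d), torch.randn(6, d)
mk, mv = torch.randn(100, d), torch.randn(100, d)
print(memory_augmented_attention(q, lk, lv, mk, mv).shape)   # torch.Size([6, 32])
```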
Newly introduced cross-network residual connections between the SideNet and the frozen backbone LLM enable better knowledge transfer from the pretrained backbone. By continually training the residual SideNet to retrieve and fuse memory-augmented long context, the pretrained LLM can be adapted to exploit long-contextual memory. Their decoupled memory design offers two primary benefits. First, the decoupled frozen backbone LLM and SideNet in the proposed architecture separate memory retrieval and fusion from the encoding of prior inputs into memory.
This effectively resolves the memory staleness issue: the backbone LLM serves only as the long-context knowledge encoder, while the residual SideNet serves as the memory retriever and reader. Second, directly adapting the LLM with memory augmentations is computationally inefficient and suffers from catastrophic forgetting. Because the backbone LLM stays frozen throughout the efficient memory-augmented adaptation stage, LONGMEM can tap into previously learned knowledge while also avoiding catastrophic forgetting. Depending on the downstream task, LONGMEM can load different kinds of long-form text and knowledge into the memory bank.
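The training pattern implied by this decoupling can be pictured with a short, runnable toy sketch. The modules below are our own stand-ins (a tiny embedding stack for the "frozen backbone" and a single linear layer for the "SideNet"), not the LONGMEM code; the point is only that the backbone encodes without receiving gradients while the side network is the sole trainable part:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dim, vocab = 32, 100

# Toy stand-ins (our own simplification, not the paper's components).
backbone = nn.Sequential(nn.Embedding(vocab, dim), nn.Linear(dim, dim))  # "frozen backbone LLM"
sidenet = nn.Linear(dim, vocab)                                          # "trainable SideNet"

for p in backbone.parameters():
    p.requires_grad = False          # freeze: the backbone only encodes and is never updated

optimizer = torch.optim.AdamW(sidenet.parameters(), lr=1e-3)

tokens = torch.randint(0, vocab, (8,))
targets = torch.randint(0, vocab, (8,))

with torch.no_grad():
    # Frozen encoding; in LONGMEM the backbone's attention key/value pairs
    # would also be cached into the memory bank at this step.
    hidden = backbone(tokens)

logits = sidenet(hidden)             # only the side network adapts to the (memory-augmented) context
loss = nn.functional.cross_entropy(logits, targets)
loss.backward()
optimizer.step()

print(all(p.grad is None for p in backbone.parameters()))   # True: backbone untouched
```

In the actual framework, the SideNet would additionally fuse the frozen backbone's hidden states through cross-network residual connections and read from the memory bank via its memory-augmented layer.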
They focus on two illustrative cases: memory-augmented in-context learning with thousands of task-relevant demonstration examples, and language modeling with full-length book contexts. They evaluate how well the proposed LONGMEM performs on several long-text language modeling tasks and on memory-augmented in-context learning for language understanding. Experimental results show that their model consistently surpasses strong baselines in both long-text modeling and in-context learning. Their approach substantially improves long-context language modeling, lowering perplexity by 1.38 to 1.62 over various length splits of the Gutenberg-2022 corpus.
Remarkably, their model greatly outperforms current strong x-former baselines, achieving state-of-the-art performance of 40.5% identification accuracy on ChapterBreak, a challenging long-context modeling benchmark. Lastly, compared with MemTRM and baselines without memory augmentation, LONGMEM shows strong in-context learning gains on common NLU tasks.
Check out the Paper and GitHub link. Don't forget to join our 24k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.