
This AI Paper Demonstrates How Decoder-Only Transformers Mimic Infinite Multi-State Recurrent Neural Networks (RNNs) and Introduces TOVA for Enhanced Efficiency


Transformers have taken over from recurrent neural networks (RNNs) as the dominant architecture for natural language processing (NLP). Conceptually, transformers stand out because they directly access every token in a sequence, unlike RNNs, which depend on maintaining a recurring state that summarizes past inputs. Decoder-only models have emerged as the prominent variant within the transformer family. These decoders typically produce output auto-regressively, meaning the generation of each token depends on the key and value computations of the preceding tokens.
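To make the "recurring state" connection concrete, here is a minimal sketch (not the authors' code) of single-head autoregressive decoding with a key-value cache: every step appends the new token's key and value, and the next token attends over the entire cache, so the state grows without bound.

```python
import numpy as np

def attend(query, keys, values):
    # Single-head scaled dot-product attention over the cached keys/values.
    scores = keys @ query / np.sqrt(query.shape[-1])   # shape (t,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over all cached tokens
    return weights @ values                            # context vector, shape (d,)

d = 16
k_cache, v_cache = [], []          # the "multi-state": one (key, value) pair per past token
for step in range(10):             # toy autoregressive loop
    x = np.random.randn(d)         # stand-in for the current token's hidden state
    k_cache.append(x)              # the state grows by one entry every step ...
    v_cache.append(x)
    context = attend(x, np.stack(k_cache), np.stack(v_cache))
print(len(k_cache))                # ... so it is unbounded in general
```

The list of cached (key, value) pairs is the "state" that the paper reinterprets; in a vanilla transformer it grows by one entry per generated token.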

Researchers from The Hebrew University of Jerusalem and FAIR, AI at Meta, have demonstrated that the auto-regressive nature of transformers aligns with the fundamental principle of RNNs, which is preserving a state from one step to the next. They formally redefine decoder-only transformers as multi-state RNNs (MSRNN), a generalized version of traditional RNNs. This redefinition highlights that, as the number of previous tokens grows during decoding, transformers correspond to MSRNNs with an infinite number of states. The researchers further show that transformers can be compressed into finite MSRNNs by limiting the number of tokens processed at each step. They introduce TOVA, a compression policy for MSRNNs that selects which tokens to retain based solely on their attention scores. TOVA is evaluated on four long-range tasks.
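The sketch below shows a TOVA-style eviction step under simplifying assumptions (a single attention head and toy NumPy vectors; the paper works per layer and aggregates scores across heads): once the cache exceeds a fixed budget, the token with the lowest attention weight from the current query is dropped.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def tova_step(k_cache, v_cache, query, new_k, new_v, max_states):
    """One decoding step under a TOVA-style policy (single head, toy version):
    append the new token's (key, value), attend from the current query, and if
    the cache now exceeds max_states, evict the least-attended token."""
    k_cache.append(new_k)
    v_cache.append(new_v)
    keys, values = np.stack(k_cache), np.stack(v_cache)
    weights = softmax(keys @ query / np.sqrt(query.shape[-1]))
    context = weights @ values
    if len(k_cache) > max_states:
        drop = int(np.argmin(weights))   # lowest attention score leaves the multi-state
        del k_cache[drop]
        del v_cache[drop]
    return context

# Usage: the cache never holds more than max_states entries, i.e. a finite MSRNN.
d, max_states = 16, 4
k_cache, v_cache = [], []
for _ in range(20):
    x = np.random.randn(d)
    tova_step(k_cache, v_cache, query=x, new_k=x, new_v=x, max_states=max_states)
print(len(k_cache))   # 4
```

Because the cache size is capped while the pretrained weights stay untouched, this kind of policy turns the decoder into a finite MSRNN at inference time only.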

https://arxiv.org/abs/2401.06104

The study compares transformers and RNNs, demonstrating that decoder-only transformers can be conceptualized as infinite multi-state RNNs, and that pretrained transformers can be converted into finite multi-state RNNs by fixing the size of their hidden state. For language modeling, it reports perplexity on the PG-19 test set. For long-range understanding, it uses test sets from the ZeroSCROLLS benchmark, covering long-range summarization and long-range question answering, with the QASPER dataset used for question answering over long texts. Generated stories are evaluated using GPT-4 as a judge.
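For reference, perplexity is simply the exponential of the average negative log-likelihood per token; the snippet below shows the metric definition only, not the authors' evaluation harness.

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token,
    where each entry is log p(x_t | x_<t) under the language model."""
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Example: if every token is assigned probability 0.25, perplexity is exactly 4.
print(perplexity([math.log(0.25)] * 8))   # 4.0
```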


The researchers implement different MSRNN policies, such as First In First Out (FIFO), by modifying the attention mask, so the evaluation closely parallels the standard language modeling setup. They use GPT-4 to judge the generated texts, comparing the output of the TOVA policy with that of the topline model.
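As an illustration of a mask-based policy, the sketch below builds a sliding-window (FIFO-style) attention mask in NumPy; the function name and shapes are ours, not taken from the paper's code.

```python
import numpy as np

def fifo_window_mask(seq_len, max_states):
    """Boolean attention mask for a FIFO (sliding-window) policy: position i
    may attend only to the most recent max_states positions j, i.e.
    i - max_states < j <= i."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - max_states)

print(fifo_window_mask(5, 2).astype(int))
# [[1 0 0 0 0]
#  [1 1 0 0 0]
#  [0 1 1 0 0]
#  [0 0 1 1 0]
#  [0 0 0 1 1]]
```

Expressing a policy as a mask lets all positions be evaluated in one forward pass, which keeps the finite-MSRNN evaluation as cheap as ordinary language modeling.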


The study demonstrates that transformer decoder LLMs behave as finite MSRNNs even though they are trained as infinite MSRNNs. The proposed TOVA policy performs consistently better than other policies on long-range tasks with smaller cache sizes, across all multi-state sizes and models. The experiments show that using TOVA with a quarter or even one-eighth of the full context yields results within one point of the topline model on language modeling tasks. The study also reports a significant reduction in LLM cache size, up to 88%, which lowers memory consumption during inference. Owing to computational constraints, the researchers approximate the infinite MSRNN with a sequence length of 4,096 tokens in the extrapolation experiments.
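To see where a figure like 88% comes from, a back-of-the-envelope calculation helps: keeping one-eighth of a 4,096-token context removes 87.5% of the key-value cache. The model dimensions below (32 layers, 32 heads, head size 128, fp16) are illustrative assumptions, roughly a 7B-parameter model, not numbers taken from the paper.

```python
def kv_cache_bytes(n_tokens, n_layers=32, n_heads=32, head_dim=128, bytes_per_elem=2):
    # Per token, the cache stores one key and one value vector in every layer and head.
    return n_tokens * 2 * n_layers * n_heads * head_dim * bytes_per_elem

full = kv_cache_bytes(4096)        # the finite approximation of the infinite MSRNN
tova = kv_cache_bytes(4096 // 8)   # TOVA keeping one-eighth of the context
print(f"full: {full / 2**20:.0f} MiB, TOVA: {tova / 2**20:.0f} MiB, "
      f"saved: {1 - tova / full:.1%}")
# full: 2048 MiB, TOVA: 256 MiB, saved: 87.5%
```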

To summarize, the researchers have redefined decoder-only transformers as multi-state RNNs with an infinite multi-state size. Limiting the number of token representations that a transformer can handle at each step is equivalent to compressing it from an infinite to a finite MSRNN. TOVA, a simple compression policy that selects which tokens to keep using their attention scores, is found to outperform existing compression policies and to perform comparably to the infinite MSRNN model at a much smaller multi-state size. Although not trained as such, transformers often function as finite MSRNNs in practice. These findings provide insight into the inner workings of transformers and their connection to RNNs, and they have practical value in reducing the LLM cache size by up to 88%.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our 36k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and LinkedIn Group.

If you like our work, you will love our newsletter.

Don’t forget to join our Telegram Channel


Sana Hassan, a consulting intern at Marktechpost and a dual-degree student at IIT Madras, is passionate about applying technology and AI to address real-world challenges. With a keen interest in solving practical problems, he brings a fresh perspective to the intersection of AI and real-life solutions.


