The Transformer architecture, which has recently become popular, has taken over as the standard approach for Natural Language Processing (NLP) tasks, particularly Machine Translation (MT). This architecture has displayed impressive scaling properties: adding more model parameters leads to better performance on a wide range of NLP tasks. Numerous studies have validated this observation. Although Transformers excel in terms of scalability, there is a parallel effort to make these models more practical and deployable in the real world. This entails addressing issues with latency, memory use, and disk space.
Researchers have been actively investigating methods to address these issues, including component pruning, parameter sharing, and dimensionality reduction. The widely used Transformer architecture comprises several essential components, two of the most important being the Feed Forward Network (FFN) and Attention.
- Attention: The Attention mechanism allows the model to capture relationships and dependencies between words in a sentence, regardless of their positions. It helps the model determine which portions of the input text are most relevant to the word it is currently processing. Understanding the context and connections between words in a sentence relies on this.
- Feed Forward Network (FFN): The FFN is responsible for non-linearly transforming each input token independently. It adds complexity and expressiveness to the model’s understanding of each word by applying the same mathematical operations to the representation of every word.
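To make the contrast between the two components concrete, here is a minimal sketch in plain Python. It is a simplification under stated assumptions, not the implementation used in any particular model: the attention function uses identity query, key, and value projections and a single head (real Transformers learn these projections and use several heads), and the weights `W1`, `B1`, `W2`, `B2` are small hand-picked values chosen purely for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(tokens):
    """Scaled dot-product self-attention with identity projections (assumption).

    Each output vector is a weighted average of all token vectors, so every
    position can draw on every other position.
    """
    d = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
                  for key in tokens]
        weights = softmax(scores)
        outputs.append([sum(w * value[i] for w, value in zip(weights, tokens))
                        for i in range(d)])
    return outputs


def ffn(token, w1, b1, w2, b2):
    """Position-wise FFN: linear -> ReLU -> linear, applied to a single token."""
    hidden = [max(0.0, sum(w * x for w, x in zip(row, token)) + b)
              for row, b in zip(w1, b1)]
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(w2, b2)]


# Hypothetical weights: model dimension 2, hidden dimension 3.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
B1 = [0.0, 0.0, 0.0]
W2 = [[1.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
B2 = [0.0, 0.0]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [5.0, -3.0]]  # only the last token differs

# The FFN sees one token at a time, so the first token's output is unaffected.
print(ffn(tokens[0], W1, B1, W2, B2) == ffn(changed[0], W1, B1, W2, B2))
# Attention mixes information across positions, so the first token's output moves.
print(attention(tokens)[0] == attention(changed)[0])
```

The first comparison holds and the second does not: changing a distant token leaves the FFN output for the first token untouched but alters its attention output.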
In recent research, a team of researchers investigated the role of the FFN within the Transformer architecture. They found that the FFN exhibits a high degree of redundancy, even though it is a large component of the model and accounts for a significant share of its parameters. As a result, they were able to reduce the model’s parameter count without significantly compromising accuracy. They achieved this by removing the FFN from the decoder layers and instead using a single shared FFN across the encoder layers.
- Decoder Layers: Each encoder and decoder layer in a standard Transformer model has its own FFN. The researchers removed the FFN from the decoder layers.
- Encoder Layers: They used a single FFN shared by all the encoder layers rather than having an individual FFN for each encoder layer.
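The layout described above can be sketched structurally as follows. This is only an illustration of the sharing pattern, not the researchers’ code: the `FeedForward` and `Layer` classes, the builder functions, and the sizes (6 layers, model dimension 512, hidden dimension 2048) are all hypothetical, and attention sublayers are left out to keep the focus on where the FFNs live.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FeedForward:
    d_model: int  # input and output width
    d_ff: int     # hidden width


@dataclass
class Layer:
    kind: str                    # "encoder" or "decoder"
    ffn: Optional[FeedForward]   # None means the layer has no FFN sublayer


def build_baseline(n_layers: int, d_model: int, d_ff: int) -> List[Layer]:
    """Standard layout: every encoder and decoder layer owns its own FFN."""
    encoder = [Layer("encoder", FeedForward(d_model, d_ff)) for _ in range(n_layers)]
    decoder = [Layer("decoder", FeedForward(d_model, d_ff)) for _ in range(n_layers)]
    return encoder + decoder


def build_shared(n_layers: int, d_model: int, d_ff: int) -> List[Layer]:
    """One FFN object reused by all encoder layers; no FFN in any decoder layer."""
    shared = FeedForward(d_model, d_ff)
    encoder = [Layer("encoder", shared) for _ in range(n_layers)]
    decoder = [Layer("decoder", None) for _ in range(n_layers)]
    return encoder + decoder


def distinct_ffns(layers: List[Layer]) -> int:
    """Count FFN modules by object identity, so a shared module counts once."""
    return len({id(layer.ffn) for layer in layers if layer.ffn is not None})


baseline = build_baseline(n_layers=6, d_model=512, d_ff=2048)
shared = build_shared(n_layers=6, d_model=512, d_ff=2048)

print(distinct_ffns(baseline))  # 12: one FFN per encoder and decoder layer
print(distinct_ffns(shared))    # 1: a single FFN serves the whole encoder
```

Because every encoder layer holds a reference to the same object, any weights stored in it would be stored, and trained, only once.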
The researchers have shared the benefits of this approach, which are as follows.
- Parameter Reduction: They drastically decreased the number of parameters in the model by removing and sharing the FFN components.
- Modest Accuracy Loss: The model’s accuracy decreased by only a modest amount despite the removal of a large number of its parameters. This shows that the encoder’s multiple FFNs and the decoder’s FFNs have a degree of functional redundancy.
- Scaling Back: They expanded the hidden dimension of the shared FFN to restore the architecture to its original size while maintaining or even improving the performance of the model. Compared with the original large-scale Transformer model, this resulted in considerable improvements in accuracy and in model processing speed, i.e., latency.
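A rough back-of-the-envelope calculation shows how the parameter reduction and the subsequent widening fit together. The sketch below rests on assumptions that do not come from the paper: the sizes (6 encoder and 6 decoder layers, model dimension 512, FFN hidden dimension 2048) are illustrative, only attention and FFN weights are counted, and embeddings and layer norms are ignored. The helper names are hypothetical.

```python
def ffn_params(d_model, d_ff):
    """Parameters in one FFN: two linear layers with biases, d_model -> d_ff -> d_model."""
    return (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)


def attention_params(d_model):
    """Parameters in one attention block: Q, K, V, and output projections with biases."""
    return 4 * (d_model * d_model + d_model)


def total_params(d_model, d_ff, n_layers, n_enc_ffn, n_dec_ffn):
    """Attention plus FFN parameters for n_layers encoder and n_layers decoder layers.

    Each encoder layer has one attention block; each decoder layer has two
    (self-attention and cross-attention).
    """
    attention = 3 * n_layers * attention_params(d_model)
    ffn = (n_enc_ffn + n_dec_ffn) * ffn_params(d_model, d_ff)
    return attention + ffn


def widest_d_ff(d_model, n_layers, budget):
    """Largest hidden size for one shared FFN that keeps the model within `budget`."""
    ffn_budget = budget - 3 * n_layers * attention_params(d_model)
    # ffn_params(d_model, d_ff) == d_ff * (2 * d_model + 1) + d_model
    return (ffn_budget - d_model) // (2 * d_model + 1)


D_MODEL, D_FF, N_LAYERS = 512, 2048, 6  # illustrative sizes, not taken from the paper

baseline = total_params(D_MODEL, D_FF, N_LAYERS, n_enc_ffn=N_LAYERS, n_dec_ffn=N_LAYERS)
shared = total_params(D_MODEL, D_FF, N_LAYERS, n_enc_ffn=1, n_dec_ffn=0)
wide_d_ff = widest_d_ff(D_MODEL, N_LAYERS, budget=baseline)
wide = total_params(D_MODEL, wide_d_ff, N_LAYERS, n_enc_ffn=1, n_dec_ffn=0)

print(f"baseline (12 FFNs):      {baseline:,} parameters")
print(f"one shared encoder FFN:  {shared:,} parameters")
print(f"shared FFN, d_ff={wide_d_ff}: {wide:,} parameters")
```

Under these assumptions, sharing and removal drop eleven of the twelve FFNs, and widening the one remaining FFN to roughly twelve times its hidden size brings the total back to the baseline budget. The sketch says nothing about accuracy or latency, which the researchers measured empirically.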
In conclusion, this research shows that the Feed Forward Network in the Transformer design, especially in the decoder layers, can be streamlined and shared without significantly affecting model performance. This not only reduces the model’s computational load but also improves its efficiency and applicability for diverse NLP applications.
Check out the Paper. All credit for this research goes to the researchers on this project.
Tanya Malhotra is a final-year undergraduate at the University of Petroleum & Energy Studies, Dehradun, pursuing a BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with strong analytical and critical thinking skills, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.