Google AI Introduces MaMMUT: A Simple Architecture for Joint Learning of MultiModal Tasks

Vision-language foundation models are built on the premise that a single pre-training can be adapted to a wide range of downstream tasks. There are two widely used but distinct training scenarios:

  • Contrastive learning, in the style of CLIP: the model is trained to predict whether image-text pairs correctly match, effectively building aligned visual and textual representations for the corresponding image and text inputs. It enables image-text and text-image retrieval tasks, such as choosing the image that best matches a given description.
  • Next-token prediction: the model learns to generate text by predicting the most probable next token in a sequence. It supports text-generative tasks such as Image Captioning and Visual Question Answering (VQA), unlike contrastive learning.
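The contrastive objective above can be sketched as a symmetric cross-entropy over a batch's image-text similarity matrix. The following is a minimal NumPy illustration of the general CLIP-style recipe, not code from the paper; the temperature value is an arbitrary choice for the example:

```python
import numpy as np

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss over a batch of image-text pairs."""
    # L2-normalize so the dot product is cosine similarity
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (batch, batch); matching pairs on the diagonal
    idx = np.arange(len(logits))

    def cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)                  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[idx, idx].mean()                         # diagonal = correct pair

    # Average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

Because matching pairs sit on the diagonal of the similarity matrix, the loss falls as the image and text embeddings of the same pair align.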

While both methods have shown promising results, models pre-trained only with a contrastive objective tend to perform poorly on text-generation tasks, and vice versa. It is also common for complex or inefficient approaches to be used when adapting to new tasks.

To train jointly for these competing objectives and to lay the groundwork for various vision-language tasks either directly or via simple adaptation, a recent Google study presents MaMMUT, a simple architecture for joint learning of multimodal tasks. MaMMUT is a compact multimodal model with only 2B parameters, and it can be trained to achieve contrastive, text-generative, and localization-aware objectives. Its simple design, with only one image encoder and one text decoder, makes it easy to reuse the two components independently.

The proposed model comprises a single visual encoder and a single text decoder linked via cross-attention, and it trains concurrently on contrastive and text-generative types of losses. Previous work either does not address image-text retrieval tasks or applies some losses only to select parts of the model. Jointly training contrastive losses and text-generative, captioning-like losses is necessary to enable multimodal tasks and to fully use the decoder-only model.

Decoder-only models in language learning show a substantial performance gain with a smaller model size (nearly half the parameters). One of the main obstacles to using them in multimodal settings is reconciling contrastive learning (which relies on an unconditional sequence-level representation) with captioning (which optimizes the likelihood of a token conditioned on the preceding tokens). The researchers propose a two-pass technique to jointly learn these incompatible text representations within the decoder.
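The captioning side of this trade-off is a standard next-token cross-entropy: the logits at each position are scored against the token that actually comes next. A minimal NumPy sketch, illustrative only and not the paper's code:

```python
import numpy as np

def captioning_loss(logits, targets):
    """Next-token cross-entropy: likelihood of each token given the tokens before it.

    logits:  (seq_len, vocab_size) scores for the next token at each position
    targets: (seq_len,) indices of the actual next tokens
    """
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Mean negative log-likelihood of the correct next token at each position
    return -log_probs[np.arange(len(targets)), targets].mean()
```

With uniform logits the loss equals log(vocab_size), the cost of a uniform guess; training pushes it below that baseline.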

The first pass learns the caption-generation task using cross-attention and causal masking, so that the text features can attend to the image features and make sequential token predictions. On the second pass, cross-attention and causal masking are turned off to learn the contrastive task: the image features remain hidden from the text features, while the text features can attend bidirectionally to all text tokens at once. Both tasks, which were previously difficult to reconcile, can now be handled by the same decoder thanks to the two-pass technique. Even though the model architecture is quite simple, it can serve as a basis for a variety of multimodal tasks.
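The mechanics of the two passes come down to which attention connections are enabled. A schematic sketch of that idea as described above (an illustration, not the authors' code; the function name and return shape are invented for the example):

```python
import numpy as np

def two_pass_config(seq_len, generative_pass):
    """Return (text_attention_mask, use_cross_attention) for one decoder pass.

    True in the mask means 'may attend'.
    """
    if generative_pass:
        # Pass 1 (captioning): causal mask — token i sees only tokens <= i,
        # and cross-attention lets the text attend to image features.
        text_mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
        use_cross_attention = True
    else:
        # Pass 2 (contrastive): bidirectional mask — every token sees every token,
        # and cross-attention is off, so image features stay hidden from the text.
        text_mask = np.ones((seq_len, seq_len), dtype=bool)
        use_cross_attention = False
    return text_mask, use_cross_attention
```

The same decoder weights run both configurations; only the masks and the cross-attention switch differ between the passes.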

Because the architecture is trained for several separate tasks, it can be easily integrated into many applications, including image-text and text-image retrieval, visual question answering, and captioning. The researchers use sparse video tubes for lightweight adaptation, accessing spatiotemporal information directly from video. Transferring the model to Open-Vocabulary Detection additionally requires training an object-detection head to predict bounding boxes.

Despite its compact design, MaMMUT provides superior or competitive results in various areas, including image-text and text-image retrieval, video question answering (VideoQA), video captioning, open-vocabulary detection, and VQA. The team highlights that their model outperforms much larger models such as Flamingo, which is tailored to image+video pre-training and is already pre-trained on image-text and video-text data.

Check out the Paper and the Google blog post for more details.




Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast and has a keen interest in the application of artificial intelligence in various fields. She is passionate about exploring new advancements in technologies and their real-life applications.

