Meet ONE-PEACE: A General Representation Model Towards Unlimited Modalities



Representation models have attracted much attention in computer vision, speech, natural language processing, and other fields. After learning from vast amounts of data, representation models generalize well to a wide variety of downstream tasks. Demand for them is also growing with the rapid rise of large language models (LLMs): representation models have recently proven fundamental to enabling LLMs to understand, perceive, and interact with other modalities such as vision. Because different modalities have distinct properties, previous research has mostly focused on uni-modal representation models, each with its own architecture and pretraining tasks.

Recent efforts in vision-language and audio-language learning have shown promising results thanks to unified architectures and effective pretraining tasks. Nonetheless, universal models that cover language, audio, and visual modalities together remain largely unexplored. Despite producing outstanding results, uni-modal representation models cannot efficiently exploit multi-modal data such as image-text and audio-text pairs, which makes applying them to multi-modal tasks difficult. Some prior work uses a single masked prediction task with the Multiway Transformer to pretrain on text and image modalities.

The scalability of that approach to other modalities, such as audio, is constrained because the masked prediction task requires a pretrained CLIP model to discretize image input. Other work offers a broad pretraining approach that can be applied to language, audio, and visual modalities without external models (like CLIP), but it has yet to be extended to multi-modal data. In this study, the researchers investigate a scalable way to build a general representation model that can accommodate any number of modalities. They propose the following requirements for such a model: 1. The model architecture should be flexible enough to handle multiple modalities and multi-modal interaction. 2. Pretraining tasks should promote both alignment across modalities and information extraction within each modality. 3. Pretraining tasks should be general and straightforward, so that they can be applied to various modalities.


Motivated by these requirements, researchers from DAMO Academy and Huazhong University of Science and Technology propose ONE-PEACE, a model with 4B parameters that can align and integrate representations across vision, audio, and language modalities. The architecture of ONE-PEACE comprises several modality adapters and a modality fusion encoder. Each modality has an adapter that transforms raw inputs into feature sequences. The modality fusion encoder processes these feature sequences with a Transformer architecture. Each Transformer block contains a shared self-attention layer and several modality-specific Feed-Forward Networks (FFNs). The self-attention layer uses the attention mechanism to enable interaction between multi-modal features, while the modality FFNs aid information extraction within each modality.
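The block structure described above can be illustrated with a toy numpy sketch: one self-attention layer shared by all modalities, plus a separate FFN per modality. This is an illustrative assumption of the general pattern only, with toy dimensions and random weights, not the actual 4B-parameter implementation; all names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy hidden size

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class SharedAttentionBlock:
    """One Transformer block: a self-attention layer shared by all
    modalities, plus one feed-forward network (FFN) per modality."""
    def __init__(self, d, modalities):
        # attention projections shared across modalities
        self.Wq = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wk = rng.standard_normal((d, d)) / np.sqrt(d)
        self.Wv = rng.standard_normal((d, d)) / np.sqrt(d)
        # one two-layer FFN per modality
        self.ffn = {m: (rng.standard_normal((d, 2 * d)) / np.sqrt(d),
                        rng.standard_normal((2 * d, d)) / np.sqrt(2 * d))
                    for m in modalities}

    def __call__(self, seqs):
        # seqs: dict of modality name -> (seq_len, d) feature sequence
        x = np.concatenate(list(seqs.values()), axis=0)  # joint sequence
        q, k, v = x @ self.Wq, x @ self.Wk, x @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(d)) @ v  # cross-modal interaction
        x = x + attn                              # residual connection
        out, start = {}, 0
        for m, s in seqs.items():
            h = x[start:start + len(s)]
            W1, W2 = self.ffn[m]
            out[m] = h + np.maximum(h @ W1, 0) @ W2  # modality-specific FFN
            start += len(s)
        return out

block = SharedAttentionBlock(d, ["vision", "audio", "language"])
feats = {"vision": rng.standard_normal((4, d)),
         "audio": rng.standard_normal((3, d)),
         "language": rng.standard_normal((5, d))}
out = block(feats)
```

The key design point is visible here: all modalities attend to each other in one joint sequence, but each modality's tokens are then refined by its own FFN.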

This architecture’s clear division of labor makes it easy to add new modalities: doing so merely requires adding new adapters and FFNs. The authors design two modality-independent pretraining tasks for ONE-PEACE. The first is cross-modal contrastive learning, which combines vision-language contrastive learning and audio-language contrastive learning to align the semantic spaces of the three modalities of vision, audio, and language. The second is intra-modal denoising contrastive learning, which can be regarded as a combination of masked prediction and contrastive learning: a contrastive loss is computed between the fine-grained masked features and the visible features, such as image patches, language tokens, or audio waveform features.
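The cross-modal contrastive objective follows the standard symmetric InfoNCE pattern: matching pairs (e.g., an image and its caption) are pulled together while all other pairs in the batch act as negatives. A minimal numpy sketch of that loss, with an illustrative temperature value and without the paper's exact hyperparameters:

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """Symmetric contrastive loss between two batches of embeddings
    (e.g., image features `a` and text features `b`). Row i of `a`
    matches row i of `b`; every other row is a negative."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)  # L2-normalize
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature  # cosine similarities, scaled

    def cross_entropy(l):
        # cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        idx = np.arange(len(l))
        return -logp[idx, idx].mean()

    # average over both retrieval directions (a->b and b->a)
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 16))
loss_aligned = info_nce(x, x)        # perfectly matched pairs: low loss
loss_shuffled = info_nce(x, x[::-1]) # mismatched pairs: high loss
```

Minimizing this loss for image-text and audio-text pairs is what aligns the three semantic spaces into one; the denoising variant applies a similar contrastive loss between masked and visible features within a single modality.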

Thanks to this scaling-friendly model design and these pretraining tasks, ONE-PEACE can in principle be extended to unlimited modalities. Together, the two tasks improve the model’s fine-tuning performance while preserving its cross-modal retrieval capability, and because they are universal across modalities, they eliminate the need for modality-specific designs. The authors conduct in-depth experiments on tasks in various modalities, including vision, audio, vision-language, and audio-language tasks. ONE-PEACE achieves industry-leading results on uni-modal and multi-modal tasks without initializing from vision- or language-pretrained models. The code is publicly available on GitHub.

Check out the Paper and GitHub. Don’t forget to join our 21k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.


