Friendship Ended with Single Modality – Now Multi-Modality is My Best Friend: CoDi is an AI Model that Can Achieve Any-to-Any Generation via Composable Diffusion

Generative AI is a term we hear almost every day now. I don’t even remember how many papers I’ve read and summarized about generative AI here. They’re impressive, what they do seems unreal and magical, and they can be used in many applications. We can generate images, videos, audio, and more just by using text prompts.

The remarkable progress made in generative AI models in recent years has enabled use cases that were deemed unimaginable not so long ago. It began with text-to-image models, and once it became clear that they produced incredibly good results, the demand for AI models capable of handling multiple modalities increased.

Recently, demand has been surging for models that can take any combination of input modalities (e.g., text + audio) and generate various combinations of output modalities (e.g., video + audio). Several models have been proposed to tackle this, but they have limitations regarding real-world applications involving multiple modalities that coexist and interact.


While it is possible to chain together modality-specific generative models in a multi-step process, the generation power of each step remains inherently limited, leading to a cumbersome and slow approach. Moreover, independently generated unimodal streams may lack consistency and alignment when combined, making post-processing synchronization difficult.

Training a model to handle any combination of input modalities and flexibly generate any combination of outputs presents significant computational and data requirements. The number of possible input-output combinations scales exponentially, while aligned training data for many combinations of modalities is scarce or non-existent.

Let us meet CoDi, which is proposed to tackle this challenge. CoDi is a novel neural architecture that enables the simultaneous processing and generation of arbitrary combinations of modalities.

CoDi proposes aligning multiple modalities in both the input conditioning and generation diffusion steps. Moreover, it introduces a “Bridging Alignment” strategy for contrastive learning, enabling it to efficiently model the exponential number of input-output combinations with a linear number of training objectives. A minimal sketch of this idea is shown below.
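To give a rough feel for how bridging alignment keeps the number of objectives linear, here is a minimal PyTorch-style sketch, assuming text is used as the bridging modality and every other modality’s encoder is contrastively aligned to it. The encoder objects, batch keys, and function names are hypothetical placeholders, not CoDi’s actual API.

```python
# Sketch of "Bridging Alignment": instead of aligning every pair of
# modalities (combinatorial), each modality encoder is contrastively
# aligned only to a single bridging modality (text), so the number of
# alignment objectives grows linearly with the number of modalities.
# All encoders and batch keys below are hypothetical placeholders.

import torch
import torch.nn.functional as F

def contrastive_loss(a, b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                    # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

def bridging_alignment_step(text_encoder, other_encoders, batch):
    """One training step: align each non-text modality to the text 'bridge'."""
    text_emb = text_encoder(batch["text"])              # shared anchor space
    loss = 0.0
    for name, enc in other_encoders.items():            # e.g. {"image": ..., "audio": ...}
        loss = loss + contrastive_loss(enc(batch[name]), text_emb)
    return loss                                          # linear in the number of modalities
```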

The key innovation of CoDi lies in its ability to handle any-to-any generation by leveraging a combination of latent diffusion models (LDMs), multimodal conditioning mechanisms, and cross-attention modules. By training separate LDMs for each modality and projecting input modalities into a shared feature space, CoDi can generate any modality or combination of modalities without direct training for such settings, as the sketch below illustrates.
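The following structural sketch shows, at a very high level, how per-modality LDMs plus a shared conditioning space could compose into any-to-any generation. It is not the official implementation; `encoders`, `ldms`, and the `sample` method are hypothetical stand-ins for the paper’s components.

```python
# Rough structural sketch of any-to-any generation: any subset of input
# modalities is encoded into a shared space, and any subset of per-modality
# latent diffusion models is conditioned on the fused features.
# `encoders`, `ldms`, and `.sample(...)` are hypothetical placeholders.

import torch

def any_to_any_generate(inputs, output_modalities, encoders, ldms, steps=50):
    """
    inputs: dict like {"text": ..., "audio": ...} (any subset of modalities)
    output_modalities: list like ["image", "video"] (any subset)
    encoders: modality -> module projecting raw input into the shared space
    ldms: modality -> pretrained latent diffusion model for that modality
    """
    # 1. Project every provided input into the shared conditioning space.
    cond_features = [encoders[m](x) for m, x in inputs.items()]
    # 2. Fuse them (a simple average here, standing in for feature interpolation).
    condition = torch.stack(cond_features).mean(dim=0)
    # 3. Run each requested output LDM, conditioned on the fused features
    #    via its cross-attention layers.
    outputs = {}
    for m in output_modalities:
        outputs[m] = ldms[m].sample(condition=condition, num_steps=steps)
    return outputs
```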

The development of CoDi requires comprehensive model design and training on diverse data resources. First, the training starts with a latent diffusion model (LDM) for each modality, such as text, image, video, and audio. These models can be trained independently in parallel, ensuring exceptional single-modality generation quality using modality-specific training data. For conditional cross-modality generation, where, for example, images are generated using audio + language prompts, the input modalities are projected into a shared feature space, and the output LDM attends to the combination of input features, as in the sketch below. This multimodal conditioning mechanism prepares the diffusion model to handle any modality or combination of modalities without direct training for such settings.
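Here is a minimal sketch of what this first training stage could look like for one modality’s LDM: a standard noise-prediction objective, with the denoiser cross-attending to prompt features projected into the shared space. The `unet` and `scheduler` objects are hypothetical stand-ins, not the paper’s actual classes.

```python
# Sketch of stage-1 training: each modality has its own latent diffusion
# model trained with the usual noise-prediction loss, while its denoiser
# cross-attends to conditioning features from the shared space (e.g., a
# fused text + audio embedding). `unet` and `scheduler` are stand-ins.

import torch
import torch.nn.functional as F

def ldm_training_step(unet, scheduler, latents, cond_features):
    """
    unet: denoiser for one modality (image, audio, video, or text)
    latents: clean latents for that modality from its frozen autoencoder
    cond_features: prompt features projected into the shared space
    """
    noise = torch.randn_like(latents)
    t = torch.randint(0, scheduler.num_train_timesteps,
                      (latents.size(0),), device=latents.device)
    noisy = scheduler.add_noise(latents, noise, t)
    # The denoiser attends to shared-space features via cross-attention,
    # so it does not need to know which modality the prompt came from.
    pred = unet(noisy, t, encoder_hidden_states=cond_features)
    return F.mse_loss(pred, noise)
```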

In the second stage of training, CoDi handles many-to-many generation strategies involving the simultaneous generation of arbitrary combinations of output modalities. This is achieved by adding a cross-attention module to each diffuser and an environment encoder to project the latent variables of the different LDMs into a shared latent space, sketched below. This seamless generation capability allows CoDi to generate any combination of modalities without training on all possible generation combinations, reducing the number of training objectives from exponential to linear.
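A conceptual sketch of one joint denoising step under these assumptions: each diffuser projects its latent through an environment encoder into a shared latent space, and every diffuser cross-attends to the other streams’ projected latents so the generated outputs stay aligned. The `env_encoders`, `unets`, `scheduler`, and the `cross_modal_context` argument are hypothetical placeholders for illustration only.

```python
# Conceptual sketch of stage-2 joint generation: per-modality latents are
# projected into a shared latent space, and each denoiser attends both to
# the input condition and to the other generated streams.
# All objects and keyword arguments below are hypothetical placeholders.

import torch

@torch.no_grad()
def joint_denoise_step(latents, t, condition, unets, env_encoders, scheduler):
    """
    latents: dict modality -> current noisy latent (e.g. {"video": ..., "audio": ...})
    condition: fused input features from the shared conditioning space
    """
    # Project every stream's latent into the shared latent space.
    shared = {m: env_encoders[m](z) for m, z in latents.items()}
    new_latents = {}
    for m, z in latents.items():
        # Context = projected latents of all *other* modalities being generated.
        others = [v for k, v in shared.items() if k != m]
        context = torch.cat(others, dim=1) if others else None
        # Each denoiser attends to the condition and to the other streams
        # through its cross-attention modules (placeholder keyword argument).
        noise_pred = unets[m](z, t, encoder_hidden_states=condition,
                              cross_modal_context=context)
        new_latents[m] = scheduler.step(noise_pred, t, z)
    return new_latents
```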


Check Out The Paper, Code, and Project. Don’t forget to join our 24k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com




Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He received his Ph.D. degree in 2023 from the University of Klagenfurt, Austria, with his dissertation titled “Video Coding Enhancements for HTTP Adaptive Streaming Using Machine Learning.” His research interests include deep learning, computer vision, video encoding, and multimedia networking.


