Moving Images with No Effort: Text2Video-Zero is an AI Model That Converts Text-to-Image Models to Zero-Shot Video Generators

We’ve witnessed the rise of generative AI models within the last couple of months. They went from generating low-resolution, face-like images to high-resolution, photo-realistic images quite quickly. It is now possible to obtain unique, photo-realistic images simply by describing what we wish to see. Perhaps even more impressive is the fact that we can also use diffusion models to generate videos for us.

The key contributor to generative AI is the diffusion model. Diffusion models take a text prompt and generate an output that matches that description. They do this by progressively transforming a set of random numbers into an image or video, adding more detail at each step until the output resembles the description. These models learn from datasets with hundreds of thousands of samples, so they can generate new visuals that look similar to the ones they have seen before. However, the dataset itself can sometimes be the key problem.
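For intuition, here is a minimal sketch of how such a pre-trained text-to-image diffusion model is typically prompted, using the Hugging Face diffusers library (the checkpoint name and parameters are illustrative, not tied to this work):

```python
# Minimal text-to-image sketch with a pre-trained diffusion model (illustrative).
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The pipeline starts from random latent noise and progressively denoises it
# until the result matches the text description.
image = pipe("a photo of an astronaut riding a horse on the moon").images[0]
image.save("astronaut.png")
```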

It is almost never feasible to train a diffusion model for video generation from scratch. Such models require extremely large datasets and the hardware to match. Building such datasets is only possible for a handful of institutions around the world, as collecting and accessing that much data is out of reach for most people due to the cost. So we have to go with existing models and try to make them work for our use case.


Even if you somehow manage to prepare a text-video dataset with hundreds of thousands, if not billions, of pairs, you still need to find a way to obtain the hardware power required to feed those large-scale models. Therefore, the high cost of video diffusion models makes it difficult for many users to customize these technologies for their own needs.

What if there were a way to bypass this requirement? Could we reduce the cost of training video diffusion models? Time to meet Text2Video-Zero.

Text2Video-Zero is a zero-shot text-to-video generative model, meaning it does not require any training to be customized. It takes pre-trained text-to-image models and converts them into a temporally consistent video generation model. In the end, a video simply displays a sequence of images in rapid succession to simulate motion, so using an image model consecutively to generate the frames sounds like a simple solution.

However, we cannot just run an image generation model hundreds of times and combine the outputs at the end. That would not work because there is no way to ensure the model draws the same objects every time. We need a way to enforce temporal consistency in the model.

To implement temporal consistency, Text2Video-Zero uses two lightweight modifications.  

First, it enriches the latent vectors of the generated frames with motion information to keep the global scene and the background consistent over time. This is done by adding motion information to the latent vectors instead of just sampling them randomly. However, these latent vectors do not impose enough constraints to pin down specific colors, shapes, or identities, which leads to temporal inconsistencies, particularly for the foreground object. Therefore, a second modification is required to tackle this issue.
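Conceptually, the first modification can be pictured as warping the first frame’s latent code by a global translation that grows with the frame index. The sketch below is a simplified illustration of that idea, not the paper’s exact procedure, and every function name and parameter here is made up for the example:

```python
# Simplified sketch: derive per-frame latents from the first frame's latent by
# applying a global translation that grows linearly with the frame index.
import torch
import torch.nn.functional as F

def motion_enriched_latents(first_latent, num_frames, direction=(1.0, 1.0), strength=0.02):
    """first_latent: (C, H, W) latent of frame 1; returns (num_frames, C, H, W)."""
    c, h, w = first_latent.shape
    latents = [first_latent]
    for k in range(1, num_frames):
        # Translation offset for frame k (in normalized coordinates).
        dx, dy = strength * k * direction[0], strength * k * direction[1]
        theta = torch.tensor([[1.0, 0.0, -dx], [0.0, 1.0, -dy]],
                             dtype=first_latent.dtype, device=first_latent.device)
        grid = F.affine_grid(theta.unsqueeze(0), (1, c, h, w), align_corners=False)
        warped = F.grid_sample(first_latent.unsqueeze(0), grid,
                               padding_mode="reflection", align_corners=False)
        latents.append(warped.squeeze(0))
    return torch.stack(latents)
```

Because every frame’s latent is derived from the same starting point, the global scene and background evolve smoothly instead of changing at random from frame to frame.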

The second modification concerns the attention mechanism. To leverage the power of cross-frame attention while exploiting a pre-trained diffusion model without retraining, each self-attention layer is replaced with cross-frame attention, and the attention of every frame is focused on the first frame. This helps Text2Video-Zero preserve the context, appearance, and identity of the foreground object throughout the entire sequence.
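In other words, each frame’s queries attend to the keys and values computed from the first frame rather than from the frame itself. A minimal, illustrative sketch of that idea (not the repository’s actual implementation) might look like this:

```python
# Cross-frame attention sketch: every frame attends to the first frame's
# keys and values, anchoring object appearance to frame 1.
import torch

def cross_frame_attention(q, k, v):
    """q, k, v: (frames, tokens, dim); replaces per-frame self-attention."""
    k_first = k[:1].expand_as(k)  # keys of frame 1, broadcast to all frames
    v_first = v[:1].expand_as(v)  # values of frame 1, broadcast to all frames
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax(q @ k_first.transpose(-2, -1) * scale, dim=-1)
    return attn @ v_first
```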

Experiments show that these modifications lead to high-quality, time-consistent video generation, even though the method does not require training on large-scale video data. Moreover, it is not limited to text-to-video synthesis; it is also applicable to conditional and specialized video generation, as well as video editing guided by textual instructions.
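For readers who want to experiment, Text2Video-Zero has also been integrated into the Hugging Face diffusers library as TextToVideoZeroPipeline; the snippet below is a minimal usage sketch based on that integration (checkpoint name and parameters are illustrative), and the project’s GitHub remains the authoritative reference:

```python
# Minimal usage sketch, assuming the TextToVideoZeroPipeline integration in diffusers.
import torch
import imageio
from diffusers import TextToVideoZeroPipeline

pipe = TextToVideoZeroPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

frames = pipe(prompt="a panda playing guitar on Times Square").images
frames = [(f * 255).astype("uint8") for f in frames]  # frames come back as float arrays in [0, 1]
imageio.mimsave("panda.mp4", frames, fps=4)
```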


Check out the Paper and GitHub. Don’t forget to join our 19k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com.



Ekrem Çetinkaya received his B.Sc. in 2018 and M.Sc. in 2019 from Ozyegin University, Istanbul, Türkiye. He wrote his M.Sc. thesis about image denoising using deep convolutional networks. He is currently pursuing a Ph.D. degree at the University of Klagenfurt, Austria, and working as a researcher on the ATHENA project. His research interests include deep learning, computer vision, and multimedia networking.


