
Baidu AI Researchers Introduce VideoGen: A Recent Text-to-Video Generation Approach That Can Generate High-Definition Video With High Frame Fidelity


Text-to-image (T2I) generation systems like DALL-E 2, Imagen, CogView, Latent Diffusion, and others have come a long way recently. In contrast, text-to-video (T2V) generation remains a challenging problem because it requires both high-quality visual content and temporally smooth, realistic motion that corresponds to the text. In addition, large-scale datasets of text-video pairs are hard to come by.

A recent study by Baidu Inc. introduces VideoGen, a method for generating a high-quality, temporally smooth video from a textual description. To help guide T2V generation, the researchers first generate a high-quality reference image using a T2I model. They then use a cascaded latent video diffusion module that produces a sequence of high-resolution, smooth latent representations conditioned on the reference image and the text description. When necessary, they also employ a flow-based approach to temporally upsample the latent representation sequence. Finally, the team trained a video decoder to convert the sequence of latent representations into an actual video.
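To make the data flow concrete, below is a minimal, runnable sketch of the four stages described above. Every function here is a placeholder stub with hypothetical names and shapes, not the authors' released code; it only illustrates how the reference image, latent diffusion, temporal upsampling, and decoder fit together.

```python
# Hypothetical sketch of the VideoGen generation flow. All names, shapes,
# and implementations below are illustrative stubs, not the authors' code.
import torch

def t2i_reference(prompt: str) -> torch.Tensor:
    """Stub for the pretrained text-to-image model: returns one RGB frame."""
    return torch.randn(3, 256, 256)

def cascaded_latent_video_diffusion(ref_image, prompt, num_frames=16):
    """Stub for the cascaded latent video diffusion module: returns a
    sequence of latents conditioned on the reference image and the text."""
    return torch.randn(num_frames, 4, 32, 32)

def flow_based_temporal_upsample(latents, factor=2):
    """Stub for flow-based temporal upsampling: naive frame duplication
    stands in for the motion-aware interpolation used in the paper."""
    t = latents.shape[0]
    return latents.repeat_interleave(factor, dim=0)[: (t - 1) * factor + 1]

def video_decoder(latents):
    """Stub for the video decoder: maps latents to pixel frames and,
    notably, takes no text input."""
    return torch.randn(latents.shape[0], 3, 256, 256)

def generate(prompt: str) -> torch.Tensor:
    ref = t2i_reference(prompt)                             # step 1: reference image
    latents = cascaded_latent_video_diffusion(ref, prompt)  # step 2: latent sequence
    latents = flow_based_temporal_upsample(latents)         # step 3: optional upsampling
    return video_decoder(latents)                           # step 4: decode to video

video = generate("a corgi surfing a wave at sunset")
print(video.shape)  # (frames, channels, height, width)
```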

Generating a reference image with the help of a T2I model has two distinct benefits:

  1. The visual quality of the resulting video improves. The proposed method takes advantage of the T2I model to draw on the much larger dataset of image-text pairs, which is more diverse and information-rich than available video-text datasets. Compared with Imagen Video, which uses image-text pairs for joint training, this method is more efficient during the training phase.
  2. A cascaded latent video diffusion model can be guided by a reference image, allowing it to focus on learning video dynamics rather than visual content. The team believes this is an additional advantage over methods that only use the T2I model parameters.

The team also notes that their video decoder does not require a textual description to produce a video from the latent representation sequence. This allows them to train the video decoder on a larger data pool that includes both video-text pairs and unlabeled (unpaired) videos. As a result, the method improves the smoothness and realism of the generated video's motion, thanks to the additional high-quality video data.
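Because the decoder is conditioned only on latents, it can in principle be trained on any video, captioned or not. The sketch below illustrates that idea assuming a simple reconstruction-style objective; the toy encoder/decoder modules and the plain MSE loss are hypothetical simplifications, and the paper's actual training objective may differ.

```python
# Hypothetical text-free decoder training step: a reconstruction objective on
# raw video clips, with no captions required. Modules and loss are illustrative
# simplifications, not the authors' implementation.
import torch
import torch.nn as nn

latent_dim, frame_size = 4, 64

# Toy per-frame encoder/decoder pair standing in for the real video autoencoder.
encoder = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)           # frames -> latents
decoder = nn.ConvTranspose2d(latent_dim, 3, kernel_size=8, stride=8)  # latents -> frames
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

def training_step(video_clip: torch.Tensor) -> float:
    """One step on an *unlabeled* clip of shape (frames, 3, H, W)."""
    with torch.no_grad():
        latents = encoder(video_clip)  # no text enters the pipeline
    recon = decoder(latents)
    loss = nn.functional.mse_loss(recon, video_clip)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Works on any video source, paired or unpaired.
clip = torch.randn(16, 3, frame_size, frame_size)
print(training_step(clip))
```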

As the findings suggest, VideoGen represents a significant improvement over previous text-to-video generation methods in terms of both qualitative and quantitative evaluation.


Check out the Paper and Project. All credit for this research goes to the researchers on this project.




Dhanshree Shenwai is a Computer Science Engineer with solid experience at FinTech companies covering the Finance, Cards & Payments, and Banking domains, along with a keen interest in applications of AI. She is passionate about exploring new technologies and advancements in today's evolving world that make everyone's life easier.


