Deep generative modeling has emerged as a powerful approach for producing high-quality images in recent years. In particular, advances in diffusion and autoregressive models have enabled the generation of stunning, photo-realistic images conditioned on a text prompt. Despite their remarkable performance, these models suffer from a major limitation: slow sampling. A large neural network must be evaluated 50-1000 times to generate a single image, since every step of the generative process applies the same function. This inefficiency matters in real-world deployments and can be a hurdle for the widespread adoption of these models.
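To make the cost concrete, here is a minimal sketch of why the step count dominates sampling time. The names `denoise_step` and `sample` are illustrative, not from the paper, and the toy update stands in for a full network evaluation:

```python
# Toy illustration of iterative sampling: the same denoising network
# (a cheap stand-in here) is evaluated once per step, so the cost of
# generating one image scales linearly with the number of steps T.
calls = 0

def denoise_step(x, t):
    """Stand-in for one evaluation of a large denoising network."""
    global calls
    calls += 1
    return 0.9 * x  # toy update; a real model would predict noise

def sample(x_init, num_steps=50):
    x = x_init
    for t in reversed(range(num_steps)):
        x = denoise_step(x, t)  # one full network evaluation per step
    return x

result = sample(1.0, num_steps=50)
print(calls)  # 50 network evaluations for a single sample
```

With 50-1000 such evaluations per image, any reduction in per-step cost or step count translates directly into faster generation.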
One popular technique in this field is the deep variational autoencoder (VAE), which combines deep neural networks with probabilistic modeling to learn latent representations of the data. These representations can then be used to generate new images that resemble the original data while exhibiting unique variations. Deep VAEs have enabled remarkable progress in image generation.
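The mechanism by which a VAE produces such variations is the standard reparameterization trick: the encoder outputs a mean and variance, and a noisy latent sample is drawn from that distribution before decoding. A minimal sketch (the toy values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var):
    """Standard VAE reparameterization: z = mu + sigma * eps, eps ~ N(0, I).
    Sampling this way lets gradients flow through the stochastic latent."""
    eps = rng.normal(size=np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps

# Each draw of z is a slightly different latent; decoding different
# draws yields variations on the same underlying image content.
mu = np.zeros(8)            # encoder mean for one input (toy values)
log_var = np.full(8, -2.0)  # encoder log-variance (toy values)
z = reparameterize(mu, log_var)
print(z.shape)  # (8,)
```

Decoding different draws of `z` is what produces images that are similar to the data but not identical to it.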
Nonetheless, hierarchical VAEs have yet to produce high-quality images on large, diverse datasets, which is especially surprising given that their hierarchical generation process appears well suited to image synthesis. In contrast, autoregressive models have shown greater success, even though their inductive bias amounts to generating images in a simple raster-scan order. The authors of the paper discussed in this article therefore examined the factors behind autoregressive models' success and transferred them to VAEs.
For example, a key to the success of autoregressive models lies in training on a sequence of compressed image tokens rather than on raw pixel values. This lets them focus on learning the relationships between image semantics while disregarding imperceptible image details. By the same logic, existing pixel-space hierarchical VAEs may spend their capacity primarily on fine-grained features, limiting their ability to capture the underlying composition of image concepts.
Based on these considerations, the work revisits deep VAEs by leveraging the latent space of a deterministic autoencoder (DAE).
This approach comprises two stages: first training a DAE to reconstruct images from low-dimensional latents, and then training a VAE to build a generative model over those latents.
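The two-stage pipeline can be sketched as follows. The linear maps below are stand-ins for the trained encoder/decoder networks, and all names and dimensions are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1: a deterministic autoencoder (DAE) maps images to compact latents.
# Toy linear maps stand in for the trained encoder/decoder networks.
IMG_DIM, LATENT_DIM = 32 * 32 * 3, 64
enc_W = rng.normal(scale=0.01, size=(LATENT_DIM, IMG_DIM))
dec_W = rng.normal(scale=0.01, size=(IMG_DIM, LATENT_DIM))

def dae_encode(x):
    return enc_W @ x  # image -> low-dimensional latent

def dae_decode(z):
    return dec_W @ z  # latent -> reconstructed image

# Stage 2: the VAE is trained on DAE latents rather than raw pixels,
# so its objective sees only the compact, structure-bearing code.
def vae_training_batch(images):
    latents = np.stack([dae_encode(x) for x in images])
    return latents  # this batch would feed the VAE's training objective

images = [rng.normal(size=IMG_DIM) for _ in range(8)]
batch = vae_training_batch(images)
print(batch.shape)  # (8, 64): far smaller than the (8, 3072) pixel batch
```

The point of the sketch is the interface: once stage 1 is trained, the VAE in stage 2 never touches pixels, only the 64-dimensional latents.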
Training the VAE on low-dimensional latents instead of pixel space yields two critical advantages: a more robust and a lighter training process. First, the compressed latent code is far smaller than its RGB representation, yet it preserves almost all of the image's perceptual information. A shorter code is advantageous because it emphasizes global features, which comprise only a few bits; and since imperceptible details are discarded, the VAE can concentrate entirely on image structure. Second, the reduced dimensionality of the latent variable lowers computational costs and enables training larger models with the same resources.
Moreover, large-scale diffusion and autoregressive models use classifier-free guidance to improve image fidelity. The purpose of this technique is to trade diversity for sample quality, since weak likelihood-based models tend to generate samples that do not align with the data distribution. The guidance mechanism steers samples toward regions that better match a desired label by comparing the conditional and unconditional likelihood functions. For this reason, the authors extend the classifier-free guidance concept to deep VAEs.
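In diffusion models, classifier-free guidance combines the conditional and unconditional predictions by extrapolating from one toward the other; the paper adapts this idea to VAEs, whose exact formulation may differ. A minimal sketch of the standard combination rule (the function name is illustrative):

```python
import numpy as np

def cfg_combine(pred_uncond, pred_cond, w):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance weight w.
    w = 0 -> unconditional; w = 1 -> conditional; w > 1 -> stronger
    conditioning at the cost of sample diversity."""
    return pred_uncond + w * (pred_cond - pred_uncond)

pred_u = np.array([0.1, -0.2])  # toy unconditional prediction
pred_c = np.array([0.3, 0.0])   # toy conditional prediction
print(cfg_combine(pred_u, pred_c, 2.0))  # [0.5 0.2]
```

Setting `w > 1` pushes samples further toward the conditional mode, which is exactly the diversity-for-fidelity trade-off described above.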
A comparison of the results between the proposed method and state-of-the-art approaches is depicted below.
This was a summary of a novel lightweight deep VAE architecture for image generation.
If you are interested and want to learn more about this framework, you can find links to the paper and the project page.
Check out the Paper.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.