
Over the past few years, diffusion models have achieved massive success and recognition for image and video generation tasks. Video diffusion models, in particular, have been gaining significant attention for their ability to produce videos with high coherence as well as high fidelity. These models generate high-quality videos through an iterative denoising process that gradually transforms high-dimensional Gaussian noise into real data.
Stable Diffusion is one of the most representative models for image generation tasks, relying on a Variational AutoEncoder (VAE) to map between real images and down-sampled latent features. Operating in this latent space allows the model to reduce generation costs, while the cross-attention mechanism in its architecture enables text-conditioned image generation. More recently, the Stable Diffusion framework has served as the foundation for several plug-and-play adapters that achieve more advanced and effective image or video generation. However, the iterative generative process employed by the majority of video diffusion models makes generation time-consuming and relatively costly, limiting its applications.
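To make the iterative latent-space denoising concrete, here is a minimal, hedged sketch of a DDPM-style reverse process. The placeholder modules below stand in for the trained U-Net and VAE decoder and are our own simplification, not the actual Stable Diffusion implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of DDPM-style iterative denoising in a latent space.
# `noise_pred` and `vae_decode` are placeholders, not the real Stable Diffusion modules.

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

noise_pred = nn.Conv2d(4, 4, 3, padding=1)           # stand-in for the trained U-Net
vae_decode = nn.ConvTranspose2d(4, 3, 8, stride=8)    # stand-in for the VAE decoder

@torch.no_grad()
def sample(latent_shape=(1, 4, 64, 64)):
    x = torch.randn(latent_shape)                     # start from Gaussian noise in latent space
    for t in reversed(range(T)):                      # iterative denoising, one step per timestep
        eps = noise_pred(x)                           # predicted noise at timestep t
        x = (x - (1 - alphas[t]) / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:                                     # re-inject noise except at the final step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return vae_decode(x)                              # map the clean latent back to pixel space

image = sample()  # shape (1, 3, 512, 512)
```

The point of the sketch is the loop over T timesteps: this is exactly the cost that consistency-based methods such as AnimateLCM aim to compress into a handful of steps.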
In this article, we will discuss AnimateLCM, a diffusion model with adapters aimed at generating high-fidelity videos with minimal steps and computational costs. The AnimateLCM framework is inspired by the Consistency Model, which accelerates sampling with minimal steps by distilling pre-trained image diffusion models, and by its successful extension, the Latent Consistency Model (LCM), which facilitates conditional image generation. Instead of conducting consistency learning directly on a raw video dataset, the AnimateLCM framework proposes a decoupled consistency learning strategy. This strategy decouples the distillation of motion generation priors and image generation priors, allowing the model to enhance the visual quality of the generated content and improve training efficiency at the same time. In addition, AnimateLCM proposes training adapters from scratch or adapting existing adapters to its distilled video consistency model. This facilitates the combination of plug-and-play adapters from the Stable Diffusion family to achieve different functions without harming sampling speed.
This article aims to cover the AnimateLCM framework in depth. We explore the mechanism, the methodology, and the architecture of the framework, along with its comparison with state-of-the-art image and video generation frameworks. So, let's get started.
Diffusion models have been the go-to framework for image and video generation tasks owing to their efficiency and capabilities on generative tasks. The majority of diffusion models rely on an iterative denoising process that gradually transforms high-dimensional Gaussian noise into real data. Although this approach delivers satisfactory results, the iterative process and the number of sampling iterations slow down generation and add to the computational requirements of diffusion models, which are much slower than other generative frameworks such as Generative Adversarial Networks (GANs). In the past few years, Consistency Models (CMs) have been proposed as an alternative to iterative diffusion models to speed up generation while keeping computational requirements in check.
The highlight of consistency models is that they learn consistency mappings that maintain the self-consistency of trajectories introduced by pre-trained diffusion models. This training process allows them to generate high-quality images in a minimal number of steps and eliminates the need for computation-intensive iterations. Moreover, the Latent Consistency Model (LCM), built on top of the Stable Diffusion framework, can be integrated into the web user interface with existing adapters to achieve a host of additional functionalities such as real-time image-to-image translation. In comparison, although existing video diffusion models deliver acceptable results, progress still needs to be made in the field of video sampling acceleration, which is of great significance owing to the high computational cost of video generation.
That leads us to AnimateLCM, a high-fidelity video generation framework that needs only a minimal number of steps for video generation tasks. Following the Latent Consistency Model, the AnimateLCM framework treats the reverse diffusion process as solving a classifier-free guidance (CFG) augmented probability flow ODE, and trains the model to predict the solution of such probability flows directly in latent space. However, instead of conducting consistency learning on raw video data directly, which requires heavy training and computational resources and often results in poor quality, the AnimateLCM framework proposes a decoupled consistency learning strategy that decouples the consistency distillation of motion generation and image generation priors.
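For reference, the classifier-free guidance augmentation mentioned above combines a conditional and an unconditional noise prediction into a single guided prediction. The snippet below is an illustrative sketch of that standard formula (the function name and default scale are ours); it is the quantity whose probability-flow solution the consistency model learns to predict directly.

```python
def cfg_noise_prediction(eps_cond, eps_uncond, guidance_scale=7.5):
    """Classifier-free guidance: push the prediction away from the unconditional one.

    eps_cond   : noise predicted with the text condition
    eps_uncond : noise predicted with an empty prompt
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```

Because a distilled consistency model bakes this guided prediction into its own weights, it no longer needs the two forward passes per step that CFG normally requires at inference time.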
The AnimateLCM framework first conducts consistency distillation to adapt the base image diffusion model into an image consistency model, and then applies 3D inflation to both the image consistency and image diffusion models to accommodate 3D features. Finally, the AnimateLCM framework obtains the video consistency model by conducting consistency distillation on video data. Moreover, to alleviate potential feature corruption during this process, the AnimateLCM framework also proposes an initialization strategy. Since the AnimateLCM framework is built on top of the Stable Diffusion framework, it can replace the spatial weights of its trained video consistency model with publicly available personalized image diffusion weights to achieve personalized generation results.
Furthermore, to train specific adapters from scratch or to better fit publicly available adapters, the AnimateLCM framework proposes an efficient acceleration strategy for adapters that does not require training specific teacher models.
The contributions of the AnimateLCM framework can be summarized as follows: the proposed framework aims to achieve fast, high-quality, high-fidelity video generation, and to do so, it proposes a decoupled distillation strategy that decouples the motion and image generation priors, resulting in better generation quality and improved training efficiency.
AnimateLCM: Methodology and Architecture
At its core, the AnimateLCM framework draws heavy inspiration from diffusion models and sampling acceleration strategies. Diffusion models, also referred to as score-based generative models, have demonstrated remarkable image generation capabilities. Under the guidance of the score direction, the iterative sampling strategy implemented by diffusion models gradually denoises the noise-corrupted data. The effectiveness of diffusion models is one of the main reasons why the majority of video diffusion models build on them, training added temporal layers on top of an image backbone. On the other hand, sampling acceleration strategies help tackle the slow generation speeds of diffusion models: distillation-based acceleration methods tune the original diffusion weights with a refined architecture or scheduler to improve generation speed.
Moving along, the AnimateLCM framework is built on top of the Stable Diffusion model, which allows it to apply the relevant notions. The model treats the discrete forward diffusion process as a continuous-time Variance Preserving SDE. Moreover, the Stable Diffusion model is an extension of the Denoising Diffusion Probabilistic Model (DDPM), in which a training data point is perturbed gradually by a discrete Markov chain whose perturbation kernel lets the distribution of noisy data at each timestep follow a closed-form Gaussian distribution.
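For reference, the perturbation kernel of this discrete Markov chain has the standard DDPM form (written here in the usual notation with noise schedule β_t; the symbols are ours rather than quoted from the paper):

```latex
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t)\,\mathbf{I}\right),
\qquad \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s)
```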
To achieve high-fidelity video generation with a minimal number of steps, the AnimateLCM framework tames Stable Diffusion-based video models to follow the self-consistency property. The overall training structure of the AnimateLCM framework consists of a decoupled consistency learning strategy for effective consistency learning, together with teacher-free adaptation.
Transition from Diffusion Models to Consistency Models
The AnimateLCM framework introduces its own adaptation of the Diffusion Model (DM) to the Consistency Model (CM), following the design of the Latent Consistency Model (LCM). It is worth noting that while Stable Diffusion models typically predict the noise added to the samples, consistency models instead aim to predict the solution of the PF-ODE trajectory directly. Moreover, Stable Diffusion models generally need to employ a classifier-free guidance technique to generate high-quality images. The AnimateLCM framework therefore employs a classifier-free guidance augmented ODE solver to sample adjacent pairs on the same trajectories, resulting in better efficiency and enhanced quality. Furthermore, prior work has indicated that generation quality and training efficiency are heavily influenced by the number of discrete points in the trajectory: a smaller number of discrete points accelerates training, whereas a larger number of discrete points results in less bias during training.
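As a reminder of the definitions involved (standard consistency-model notation, not quoted from the AnimateLCM paper), the consistency function is parameterized with skip and output scalings so that it satisfies the boundary condition at the smallest timestep ε, and it is trained to map any two points on the same PF-ODE trajectory to the same solution:

```latex
f_\theta(x_t, t) = c_{\text{skip}}(t)\, x_t + c_{\text{out}}(t)\, F_\theta(x_t, t),
\qquad c_{\text{skip}}(\epsilon) = 1,\ c_{\text{out}}(\epsilon) = 0,
```

```latex
f_\theta(x_t, t) = f_\theta(x_{t'}, t') \quad \text{for all } t,\, t' \text{ on the same PF-ODE trajectory}.
```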
Decoupled Consistency Learning
For the process of consistency distillation, it has been observed that the data used for training heavily influences the quality of the final generations of the consistency models. However, the major issue with currently available public datasets is that they often consist of watermarked or low-quality footage and can contain overly brief or ambiguous captions. Moreover, training the model directly on high-resolution videos is computationally expensive and time-consuming, making it infeasible for the majority of researchers.
Given the availability of filtered high-quality datasets, the AnimateLCM framework proposes to decouple the distillation of the motion priors and the image generation priors. To be more specific, the AnimateLCM framework first distills the Stable Diffusion model into an image consistency model on filtered high-quality image-text datasets with higher resolution. The framework trains lightweight LoRA weights on the layers of the Stable Diffusion model while keeping the original Stable Diffusion weights frozen. Once tuned, the LoRA weights work as a versatile acceleration module and have demonstrated compatibility with other personalized models in the Stable Diffusion community. For inference, the AnimateLCM framework merges the LoRA weights with the original weights without hurting inference speed.

After obtaining the consistency model at the level of image generation, the framework freezes the Stable Diffusion weights and the LoRA weights on top of them. It then inflates the 2D convolution kernels into pseudo-3D kernels to train consistency models for video generation, and adds temporal layers with zero initialization and block-level residual connections. This setup ensures that the output of the model is not affected when training first begins. Under the guidance of open-sourced video diffusion models, the AnimateLCM framework then trains the temporal layers extended from the Stable Diffusion model.
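The inflation and zero-initialization steps can be illustrated with a short PyTorch sketch. This is a hedged approximation under our own naming, not AnimateLCM's released code: a pretrained 2D convolution is wrapped as a pseudo-3D one that acts on each frame independently, and a residual temporal layer is added whose zero-initialized weights make it an identity mapping at the start of training.

```python
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d) -> nn.Conv3d:
    """Wrap a pretrained 2D kernel as a (1, kH, kW) 3D kernel that acts per frame."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(1, *conv2d.kernel_size),
        stride=(1, *conv2d.stride),
        padding=(0, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.copy_(conv2d.weight.unsqueeze(2))  # add a singleton temporal dimension
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

class ZeroInitTemporalBlock(nn.Module):
    """Residual temporal layer whose output is zero at initialization,
    so the freshly inflated video model initially behaves like the image model."""
    def __init__(self, channels: int):
        super().__init__()
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))
        nn.init.zeros_(self.temporal.weight)
        nn.init.zeros_(self.temporal.bias)

    def forward(self, x):            # x: (batch, channels, frames, height, width)
        return x + self.temporal(x)  # identity mapping until the temporal weights are trained
```

Zero-initializing the temporal path is what guarantees that the video model's outputs match the frozen image consistency model at the first training step, which is the property the paragraph above describes.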
It is important to acknowledge that while the spatial LoRA weights are designed to expedite the sampling process without taking temporal modeling into consideration, and the temporal modules are trained through standard diffusion techniques, their direct integration tends to corrupt the representation at the onset of training. This presents significant challenges in merging them effectively and efficiently with minimal conflict. Through empirical research, the AnimateLCM framework has identified a successful initialization approach that not only utilizes the consistency priors from the spatial LoRA weights but also mitigates the adverse effects of their direct combination.
At the onset of consistency training, the pre-trained spatial LoRA weights are inserted only into the online consistency model, sparing the target consistency model from the insertion. This strategy ensures that the target model, which serves as the teacher for the online model, does not generate faulty predictions that could detrimentally affect the online model's learning process. Throughout training, the LoRA weights are progressively incorporated into the target consistency model via an exponential moving average (EMA) process, reaching the desired weight balance after several iterations.
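A minimal sketch of this schedule, under assumed names rather than the paper's actual implementation, might look like the following: the target model's parameters track the online model (and hence its gradually merged LoRA contribution) through an exponential moving average.

```python
import torch

@torch.no_grad()
def ema_update(target_model, online_model, decay=0.95):
    """Move the target consistency model toward the online model.

    Because the LoRA weights initially live only in the online model, each EMA step
    blends a little more of their contribution into the target model.
    """
    for t_param, o_param in zip(target_model.parameters(), online_model.parameters()):
        t_param.mul_(decay).add_(o_param, alpha=1.0 - decay)
```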
Teacher-Free Adaptation
Stable Diffusion models and plug-and-play adapters often go hand in hand. However, it has been observed that even though plug-and-play adapters work to some extent, they tend to lose control over details, even when the majority of these adapters are trained with image diffusion models. To counter this issue, the AnimateLCM framework opts for teacher-free adaptation, a simple yet effective strategy that either accommodates existing adapters for better compatibility or trains adapters from the ground up. This approach allows the AnimateLCM framework to achieve controllable video generation and image-to-video generation in a minimal number of steps without requiring teacher models.
AnimateLCM: Experiments and Results
The AnimateLCM framework employs Stable Diffusion v1-5 as the base model and implements the DDIM ODE solver for training purposes. The framework also uses Stable Diffusion v1-5 with open-sourced motion weights as the teacher video diffusion model, with experiments conducted on the WebVid2M dataset without any additional or augmented data. Moreover, the framework employs the TikTok dataset with brief BLIP-captioned textual prompts for controllable video generation.
Qualitative Results
The following figure demonstrates the results of the four-step generation method implemented by the AnimateLCM framework in text-to-video generation, image-to-video generation, and controllable video generation.
As can be observed, the results for each task are satisfactory, with the generated outputs demonstrating the ability of the AnimateLCM framework to follow the consistency property even across varying inference steps, maintaining similar motion and style.
Quantitative Results
The following figure illustrates the quantitative results comparing the AnimateLCM framework with the state-of-the-art DDIM and DPM++ methods.
As can be observed, the AnimateLCM framework outperforms the existing methods by a significant margin, especially in the low-step regime of 1 to 4 steps. Moreover, the AnimateLCM metrics in this comparison are evaluated without classifier-free guidance (CFG), which allows the framework to save nearly 50% of the inference time and peak inference memory cost. Furthermore, to further validate its performance, the spatial weights within the AnimateLCM framework are replaced with those of a publicly available personalized realistic model that strikes a good balance between fidelity and diversity, which helps boost performance further.
Final Thoughts
In this article, we have discussed AnimateLCM, a diffusion model with adapters that aims to generate high-fidelity videos with minimal steps and computational cost. The AnimateLCM framework is inspired by the Consistency Model, which accelerates sampling with minimal steps by distilling pre-trained image diffusion models, and by its successful extension, the Latent Consistency Model (LCM), which facilitates conditional image generation. Instead of conducting consistency learning on the raw video dataset directly, the AnimateLCM framework proposes a decoupled consistency learning strategy that decouples the distillation of motion generation priors and image generation priors, allowing the model to enhance the visual quality of the generated content and improve training efficiency at the same time.