The generation of hyper-realistic human images from user-defined conditions, such as text and pose, is valuable for various applications, including image animation and virtual try-on. Numerous efforts have been made to explore the task of controllable human image generation. Early methods either relied on variational auto-encoders (VAEs) in a reconstruction manner or improved realism through generative adversarial networks (GANs). Although some of these methods produced high-quality images, challenges like unstable training and limited model capacity confined them to small datasets with low diversity.
The recent emergence of diffusion models (DMs) has introduced a new paradigm for realistic synthesis, making them the predominant architecture in generative AI. Nonetheless, exemplar text-to-image (T2I) models like Stable Diffusion and DALL·E 2 still struggle to create human images with coherent anatomy, such as arms, legs, and natural poses. The primary challenge lies in the non-rigid deformations of the human form, which require structural information that is difficult to convey through text prompts alone.
Recent works, such as ControlNet and T2I-Adapter, have attempted to enable structural control for image generation by introducing a learnable branch that modulates pre-trained DMs, like Stable Diffusion, in a plug-and-play manner. However, these approaches suffer from feature discrepancies between the main and auxiliary branches, leading to inconsistency between the control signals (e.g., pose maps) and the generated images. To address this, HumanSD proposes feeding the body skeleton directly into the diffusion U-Net through channel-wise concatenation. However, this method is confined to generating artistic-style images with limited diversity. Moreover, human content is synthesized with pose control only, neglecting other crucial structural information such as depth maps and surface-normal maps.
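To make the channel-wise concatenation idea concrete, here is a minimal PyTorch sketch (with illustrative shapes and names, not taken from the HumanSD code) of stacking a pose map with the noisy latent along the channel axis before the first U-Net convolution:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: the input convolution is widened so that the noisy
# latent (4 channels for Stable Diffusion) and a pose map (3 channels here)
# can be concatenated channel-wise and processed by a single branch.
LATENT_C, POSE_C = 4, 3

class PoseConditionedInput(nn.Module):
    def __init__(self, out_channels: int = 320):
        super().__init__()
        self.conv_in = nn.Conv2d(LATENT_C + POSE_C, out_channels,
                                 kernel_size=3, padding=1)

    def forward(self, noisy_latent: torch.Tensor, pose_map: torch.Tensor):
        # noisy_latent: (B, 4, H, W); pose_map: (B, 3, H, W), resized to
        # the latent resolution beforehand.
        x = torch.cat([noisy_latent, pose_map], dim=1)
        return self.conv_in(x)
```

Because the pose enters through the same input layer as the latent, there is no separate auxiliary branch whose features could drift from the main one.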
The work reported in this article proposes a unified framework, HyperHuman, to generate in-the-wild human images with high realism and diverse layouts. Its overview is illustrated in the figure below.
The key insight is recognizing the inherently structural nature of human images across multiple granularities, from coarse-level body skeletons to fine-grained spatial geometry. Capturing such correlations between explicit appearance and latent structure in a single model is essential for generating coherent and natural human images. The paper establishes a large-scale human-centric dataset called HumanVerse, containing 340 million in-the-wild human images with comprehensive annotations. Based on this dataset, two modules are designed for hyper-realistic controllable human image generation: the Latent Structural Diffusion Model and the Structure-Guided Refiner. The former augments the pre-trained diffusion backbone to simultaneously denoise RGB, depth, and normal components, ensuring spatial alignment between the denoised textures and structures.
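A minimal sketch of this joint denoising idea (an assumed structure for illustration, not the released HyperHuman code) is shown below: one network receives the noisy RGB latent, depth, and surface-normal maps stacked along the channel axis and predicts the noise for all three at once, so the outputs stay spatially aligned by construction. Timestep embeddings and the full U-Net are omitted for brevity.

```python
import torch
import torch.nn as nn

class JointStructuralDenoiser(nn.Module):
    """Toy stand-in for a diffusion backbone with widened input/output
    layers that denoises RGB, depth, and normal branches together."""

    def __init__(self, rgb_c=4, depth_c=1, normal_c=3, hidden=320):
        super().__init__()
        self.splits = [rgb_c, depth_c, normal_c]
        in_c = sum(self.splits)
        self.net = nn.Sequential(
            nn.Conv2d(in_c, hidden, 3, padding=1),
            nn.SiLU(),
            nn.Conv2d(hidden, in_c, 3, padding=1),
        )

    def forward(self, noisy_rgb, noisy_depth, noisy_normal):
        x = torch.cat([noisy_rgb, noisy_depth, noisy_normal], dim=1)
        eps = self.net(x)  # predicted noise for all branches at once
        return torch.split(eps, self.splits, dim=1)
```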
Thanks to this design, the modeling of image appearance, spatial relationships, and geometry occurs collaboratively within a unified network, where each branch complements the others with both structural awareness and textural richness. An enhanced noise schedule eliminates low-frequency information leakage, which is crucial for depth and surface-normal maps whose values are nearly uniform within local regions. Using the same timestep for each branch enhances learning and facilitates feature fusion. With the spatially-aligned structure maps, the Structure-Guided Refiner composes the predicted conditions for detailed, high-resolution image generation. Moreover, a robust conditioning scheme is designed to mitigate the error accumulation inherent in the two-stage generation pipeline.
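The shared-timestep training can be sketched as follows (a DDPM-style illustration under stated assumptions, not the authors' code; the enhanced noise schedule itself is elided and would replace the standard `alphas_cumprod` table). One timestep is sampled per example and applied to all three branches before computing a joint denoising loss, using the `JointStructuralDenoiser` sketched above:

```python
import torch
import torch.nn.functional as F

def training_step(model, rgb, depth, normal, alphas_cumprod):
    b = rgb.shape[0]
    # One timestep t per example, shared across RGB, depth, and normal.
    t = torch.randint(0, len(alphas_cumprod), (b,), device=rgb.device)
    a = alphas_cumprod[t].view(b, 1, 1, 1)  # cumulative signal level at t
    targets = (rgb, depth, normal)
    noise = [torch.randn_like(x) for x in targets]
    # Same noise level for every branch keeps their features comparable.
    noisy = [a.sqrt() * x + (1 - a).sqrt() * n
             for x, n in zip(targets, noise)]
    preds = model(*noisy)
    return sum(F.mse_loss(p, n) for p, n in zip(preds, noise))
```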
A comparison with state-of-the-art techniques is reported below.
The first 4×4 grid of each row contains the input skeleton and the jointly denoised normal, depth, and coarse RGB (512×512) computed by HyperHuman.
This was a summary of HyperHuman, a novel AI framework for generating in-the-wild human images with high realism and diverse layouts. If you are interested and want to learn more about it, please feel free to refer to the links cited below.
Check out the Paper and Project. All credit for this research goes to the researchers of this project. Also, don't forget to join our 33k+ ML SubReddit, 41k+ Facebook Community, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more.
If you like our work, you will love our newsletter.
Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working in the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.