
Meta and UC Berkeley Researchers Present Audio2Photoreal: An Artificial Intelligence Framework for Generating Full-Bodied Photorealistic Avatars that Gesture in Line with Conversational Dynamics


Avatar technology has become ubiquitous on platforms like Snapchat, Instagram, and video games, enhancing user engagement by replicating human actions and emotions. However, the quest for a more immersive experience led researchers from Meta and BAIR to introduce “Audio2Photoreal,” a groundbreaking method for synthesizing photorealistic avatars capable of natural conversation.

Imagine engaging in a telepresent conversation with a friend represented by a photorealistic 3D model, dynamically expressing emotions aligned with their speech. The challenge lies in overcoming the limitations of non-textured meshes, which fail to capture subtle nuances like eye gaze or smirking, leading to a robotic and uncanny interaction (see Figure 1). The research aims to bridge this gap, presenting a method for generating photorealistic avatars based on the speech audio of a dyadic conversation.


The approach involves synthesizing diverse high-frequency gestures and expressive facial movements synchronized with speech. Leveraging both an autoregressive VQ-based method and a diffusion model for body and hands, the researchers achieve a balance between frame rate and motion detail. The result is a system that renders photorealistic avatars capable of conveying intricate facial, body, and hand motions in real time.
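To make the audio-conditioned diffusion idea concrete, here is a minimal sketch of a denoiser that predicts a pose sequence from noisy poses plus time-aligned audio features. The module names, feature sizes, and transformer backbone are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class AudioConditionedDenoiser(nn.Module):
    """Toy audio-conditioned denoiser for pose sequences (illustrative only)."""
    def __init__(self, pose_dim=104, audio_dim=128, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.pose_proj = nn.Linear(pose_dim, hidden)
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden)
        )
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.out = nn.Linear(hidden, pose_dim)

    def forward(self, noisy_pose, audio_feat, t):
        # noisy_pose: (B, T, pose_dim); audio_feat: (B, T, audio_dim); t: (B,) diffusion step
        h = self.pose_proj(noisy_pose) + self.audio_proj(audio_feat)
        h = h + self.time_embed(t.float().view(-1, 1, 1) / 1000.0)  # broadcast over time
        return self.out(self.backbone(h))  # denoised pose sequence


model = AudioConditionedDenoiser()
poses = torch.randn(2, 90, 104)   # e.g. 3 s of body/hand pose features at 30 fps
audio = torch.randn(2, 90, 128)   # time-aligned audio features
t = torch.randint(0, 1000, (2,))  # diffusion timesteps
pred = model(poses, audio, t)     # (2, 90, 104)
```

Conditioning every frame on the audio stream is what lets the sampler produce gestures that stay synchronized with speech while the noise injects variety across samples.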

To support this research, the team introduces a novel multi-view conversational dataset, providing a photorealistic reconstruction of non-scripted, long-form conversations. Unlike previous datasets focused on upper body or facial motion, this dataset captures the dynamics of interpersonal conversations, offering a more comprehensive understanding of conversational gestures.
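To give a sense of what such a dyadic capture might look like in practice, below is a hypothetical per-clip sample layout; the field names and shapes are illustrative placeholders, not the released dataset's actual schema.

```python
import numpy as np

T = 300  # e.g. 10 s of capture at 30 fps (illustrative)
clip = {
    "audio_feat": np.zeros((T, 128), dtype=np.float32),   # per-frame audio features for both speakers
    "face_params": np.zeros((T, 256), dtype=np.float32),  # facial expression codes
    "body_pose": np.zeros((T, 104), dtype=np.float32),    # body joint parameters
    "hand_pose": np.zeros((T, 90), dtype=np.float32),     # articulated hand parameters
    "speaker_id": 0,                                       # which conversation partner is rendered
}
```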


The system employs a two-model approach for face and body motion synthesis, each addressing the unique dynamics of these components. The face motion model, a diffusion model conditioned on input audio and lip vertices, focuses on generating speech-consistent facial detail. In contrast, the body motion model uses an autoregressive audio-conditioned transformer to predict coarse guide poses at 1 fps, which are later refined by the diffusion model into diverse yet plausible body motions.
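A rough sketch of how such a two-stage body pipeline could be wired together is shown below; `guide_model` and `denoiser` are assumed callables with hypothetical signatures, and the noise schedule is omitted, so this is an outline of the idea rather than the authors' released implementation.

```python
import torch

def synthesize_body_motion(audio_feat, guide_model, denoiser, fps_out=30, steps=50):
    """Two-stage sketch: coarse guide poses at ~1 fps, then diffusion refinement."""
    # Stage 1: an autoregressive transformer predicts one coarse pose per second.
    guide_poses = guide_model(audio_feat)                    # (B, T_sec, pose_dim)

    # Upsample the coarse guides to the target frame rate to condition the diffusion model.
    guide_dense = torch.repeat_interleave(guide_poses, fps_out, dim=1)  # (B, T_sec * fps_out, pose_dim)

    # Stage 2: the diffusion model in-fills high-frequency motion around the guides.
    x = torch.randn_like(guide_dense)
    for step in reversed(range(steps)):
        t = torch.full((x.shape[0],), step, dtype=torch.long)
        x = denoiser(x, audio_feat, guide_dense, t)          # one denoising update (schedule omitted)
    return x                                                  # (B, T_sec * fps_out, pose_dim)
```

The split lets the slow autoregressive stage commit to the overall gesture structure while the diffusion stage supplies the high-frequency detail at the full frame rate.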


The evaluation demonstrates the model’s effectiveness in generating realistic and diverse conversational motion, outperforming various baselines. Photorealism proves crucial in capturing subtle nuances, as highlighted in perceptual evaluations. The quantitative results showcase the method’s ability to balance realism and diversity, surpassing prior work in terms of motion quality.

While the model excels at generating compelling and plausible gestures, it operates on short-range audio, limiting its capacity for long-range language understanding. Moreover, the ethical considerations of consent are addressed by rendering only consenting participants in the dataset.


In conclusion, “Audio2Photoreal” represents a significant leap in synthesizing conversational avatars, offering a more immersive and realistic experience. The research not only introduces a novel dataset and methodology but also opens avenues for exploring ethical considerations in photorealistic motion synthesis.


Check out the Paper and Project. All credit for this research goes to the researchers of this project.



Vineet Kumar is a consulting intern at MarktechPost. He is currently pursuing his BS from the Indian Institute of Technology (IIT), Kanpur. He is a Machine Learning enthusiast who is passionate about research and the latest advancements in Deep Learning, Computer Vision, and related fields.


