UC Berkeley And MIT Researchers Propose A Policy Gradient Algorithm Called Denoising Diffusion Policy Optimization (DDPO) That Can Optimize A Diffusion Model For Downstream Tasks Using Only A Black-Box Reward Function
Researchers have made notable strides in training diffusion models with reinforcement learning (RL) to improve prompt-image alignment and optimize a variety of objectives. Their method, denoising diffusion policy optimization (DDPO), treats the denoising process as a multi-step decision-making problem, which makes it possible to fine-tune Stable Diffusion on difficult downstream objectives.
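
To make that framing concrete, here is a minimal sketch of a REINFORCE-style surrogate loss in PyTorch, where each denoising step is one action in a T-step trajectory and the whole trajectory shares a single terminal reward. The function name, tensor shapes, and batch-normalization baseline are our illustrative assumptions, not the paper's actual API.

```python
import torch

def ddpo_policy_gradient_loss(log_probs: torch.Tensor,
                              rewards: torch.Tensor) -> torch.Tensor:
    """Illustrative score-function (REINFORCE) surrogate loss for DDPO.

    log_probs: (batch, T) log-likelihood of each denoising transition
               x_t -> x_{t-1} under the current diffusion model.
    rewards:   (batch,) black-box reward r(x_0, prompt) on the final image.
    """
    # Normalize rewards across the batch as a simple variance-reduction baseline.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Every step in a trajectory shares the terminal reward; maximizing the
    # reward-weighted log-likelihood is written here as a loss to be minimized.
    return -(advantages[:, None] * log_probs).sum(dim=1).mean()
```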

By training diffusion models directly on RL-based objectives, the researchers demonstrate significant improvements in prompt-image alignment and in optimizing objectives that are difficult to express through traditional prompting. DDPO is a class of policy gradient algorithms designed for this purpose. To improve prompt-image alignment, the research team incorporates feedback from a large vision-language model known as LLaVA. Through RL training, they achieve remarkable progress in aligning prompts with generated images. Notably, the models shift toward a more cartoon-like style, potentially influenced by the prevalence of such representations in the pretraining data.
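
A hedged sketch of how such a vision-language reward might be wired up: the VLM describes the generated image, and a text-similarity score between that description and the original prompt serves as the reward. Here `describe_image` and `text_similarity` are hypothetical stand-ins for a LLaVA wrapper and a similarity model, not real library calls.

```python
from typing import Callable, List

import torch
from PIL import Image

@torch.no_grad()
def vlm_alignment_reward(images: List[Image.Image],
                         prompts: List[str],
                         describe_image: Callable[[Image.Image], str],
                         text_similarity: Callable[[str, str], float]) -> torch.Tensor:
    """Score prompt-image alignment with a black-box vision-language model."""
    rewards = []
    for image, prompt in zip(images, prompts):
        # Ask the VLM what the image depicts (e.g., via a LLaVA wrapper).
        caption = describe_image(image)
        # Reward = similarity between the VLM's caption and the prompt.
        rewards.append(text_similarity(caption, prompt))
    return torch.tensor(rewards)
```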

The results obtained using DDPO for various reward functions are promising. Evaluations on objectives such as compressibility, incompressibility, and aesthetic quality show notable improvements over the base model. The researchers also highlight the generalization of the RL-trained models, which extends to unseen animals, everyday objects, and novel combinations of activities and objects. While RL training brings substantial benefits, the researchers note the challenge of over-optimization: fine-tuning on learned reward functions can lead to models exploiting the reward in ways that are not useful, often destroying meaningful image content.
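
For intuition, a compressibility reward can be as simple as the negative size of the image after JPEG encoding, with incompressibility as its negation. The sketch below, with our own function names and quality setting, illustrates both directions under that assumption.

```python
import io

from PIL import Image

def jpeg_compressibility_reward(image: Image.Image, quality: int = 95) -> float:
    """Reward images that compress well: negative JPEG size in kilobytes."""
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return -buffer.tell() / 1024.0  # closer to zero = more compressible

def jpeg_incompressibility_reward(image: Image.Image, quality: int = 95) -> float:
    """The opposite objective: reward images that resist JPEG compression."""
    return -jpeg_compressibility_reward(image, quality)
```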

Moreover, the researchers observe that the LLaVA model is susceptible to typographic attacks: RL-trained models can generate text loosely resembling the correct number of animals, fooling LLaVA in prompt-based alignment scenarios.

In summary, DDPO and RL training for diffusion models represent significant progress in improving prompt-image alignment and optimizing diverse objectives. The results showcase advances in compressibility, incompressibility, and aesthetic quality. However, challenges such as reward over-optimization and vulnerabilities in prompt-based alignment methods warrant further investigation. These findings open new opportunities for research and development in diffusion models, particularly for image generation and completion tasks.


Check out the Paper, Project, and GitHub link.



Niharika

Niharika is a technical consulting intern at Marktechpost. She is a third-year undergraduate pursuing her B.Tech at the Indian Institute of Technology (IIT), Kharagpur. She is a highly enthusiastic individual with a keen interest in machine learning, data science, and AI, and an avid reader of the latest developments in these fields.

