Designing a reward function by hand is time-consuming and can lead to unintended consequences. This is a major roadblock in developing general-purpose decision-making agents based on reinforcement learning (RL).
Previous video-based learning methods reward agents whose current observations most resemble those of experts. Because rewards are conditioned solely on the current observation, they cannot capture meaningful behavior over time, and the adversarial training schemes they rely on are prone to mode collapse, which hinders generalization.
U.C. Berkeley researchers have developed a novel method for extracting rewards from video prediction models, called Video Prediction Rewards for reinforcement learning (VIPER). VIPER can learn reward functions from raw videos and generalize to unseen domains.
First, VIPER trains a video prediction model on expert-generated videos. The video prediction model is then used to train a reinforcement learning agent to maximize the log-likelihood of its trajectories, so that the divergence between the distribution of the agent's trajectories and that of the video model is minimized. By using the video model's likelihoods directly as a reward signal, the agent can be trained to match the video model's trajectory distribution. Unlike observation-level rewards, those provided by video models quantify the temporal consistency of behavior. Evaluating likelihoods is also far faster than performing video model rollouts, which allows shorter training times and more interactions with the environment.
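To make this concrete, below is a minimal sketch of how a VIPER-style reward could be computed, assuming a pretrained autoregressive video prediction model that can score the log-likelihood of the agent's next observation given the frames seen so far. The class and method names (`VideoPredictionModel`, `log_prob`) are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of a VIPER-style reward, assuming a video prediction
# model pretrained on expert videos. `video_model.log_prob(context, frame)`
# is a hypothetical interface returning the log-likelihood of `frame`
# given the preceding frames in `context`.

class ViperReward:
    def __init__(self, video_model, exploration_weight=0.0):
        self.video_model = video_model      # pretrained on expert videos
        self.beta = exploration_weight      # weight for an optional exploration bonus
        self.context = []                   # frames observed so far in this episode

    def reset(self, first_frame):
        self.context = [first_frame]

    def __call__(self, next_frame, exploration_bonus=0.0):
        # Reward = log-likelihood of the agent's next observation under the
        # expert video model; temporally consistent, expert-like behavior
        # scores highly, unlike purely observation-level similarity rewards.
        reward = self.video_model.log_prob(self.context, next_frame)
        self.context.append(next_frame)
        # Optionally mix in an exploration bonus, as the likelihood reward
        # alone may not encourage sufficient coverage of the environment.
        return reward + self.beta * exploration_bonus
```

Because the reward is just a likelihood evaluation per environment step, it avoids the cost of rolling the video model forward, which is what enables the faster training described above.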
Across 15 DMC tasks, 6 RLBench tasks, and 7 Atari tasks, the team conducts a thorough study and demonstrates that VIPER can achieve expert-level control without using task rewards. According to the findings, VIPER-trained RL agents beat adversarial imitation learning across the board. Since VIPER is integrated into the environment, it is agnostic to the choice of RL agent. The video models already generalize to arm/task combinations not encountered during training, even in the small-dataset regime.
The researchers believe that large, pre-trained conditional video models will make more flexible reward functions possible. Building on recent breakthroughs in generative modeling, they consider their work to provide the community with a foundation for scalable reward specification from unlabeled videos.
Check out the Paper and Project for more details.
Tanushree Shenwai is a consulting intern at MarktechPost. She is currently pursuing her B.Tech at the Indian Institute of Technology (IIT), Bhubaneswar. She is a Data Science enthusiast with a keen interest in the application of artificial intelligence across various fields, and is passionate about exploring new advancements in technology and their real-life applications.