
To teach an AI agent a new task, like how to open a kitchen cabinet, researchers often use reinforcement learning, a trial-and-error process in which the agent is rewarded for taking actions that bring it closer to the goal.
In many cases, a human expert must carefully design a reward function, an incentive mechanism that gives the agent motivation to explore. The expert must iteratively update that reward function as the agent explores and tries different actions. This can be time-consuming, inefficient, and difficult to scale up, especially when the task is complex and involves many steps.
Researchers from MIT, Harvard University, and the University of Washington have developed a new reinforcement learning approach that doesn’t rely on an expertly designed reward function. Instead, it leverages crowdsourced feedback, gathered from many nonexpert users, to guide the agent as it learns to reach its goal.
While some other methods also attempt to use nonexpert feedback, this new approach enables the AI agent to learn more quickly, despite the fact that data crowdsourced from users are often full of errors. These noisy data can cause other methods to fail.
In addition, this new approach allows feedback to be gathered asynchronously, so nonexpert users around the world can contribute to teaching the agent.
“One of the most time-consuming and challenging parts of designing a robotic agent today is engineering the reward function. Today, reward functions are designed by expert researchers, a paradigm that is not scalable if we want to teach our robots many different tasks. Our work proposes a way to scale robot learning by crowdsourcing the design of the reward function and by making it possible for nonexperts to provide useful feedback,” says Pulkit Agrawal, an assistant professor in the MIT Department of Electrical Engineering and Computer Science (EECS) who leads the Improbable AI Lab in the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL).
In the future, this method could help a robot learn to perform specific tasks in a user’s home quickly, without the owner needing to show the robot physical examples of each task. The robot could explore on its own, with crowdsourced nonexpert feedback guiding its exploration.
“In our method, the reward function guides the agent to what it should explore, instead of telling it exactly what it should do to complete the task. So, even if the human supervision is somewhat inaccurate and noisy, the agent is still able to explore, which helps it learn much better,” explains lead author Marcel Torne ’23, a research assistant in the Improbable AI Lab.
Torne is joined on the paper by his MIT advisor, Agrawal; senior author Abhishek Gupta, assistant professor at the University of Washington; as well as others at the University of Washington and MIT. The research will be presented at the Conference on Neural Information Processing Systems next month.
Noisy feedback
One way to gather user feedback for reinforcement learning is to show a user two photos of states achieved by the agent, and then ask that user which state is closer to a goal. For instance, perhaps a robot’s goal is to open a kitchen cabinet. One image might show that the robot opened the cabinet, while the second might show that it opened the microwave. A user would pick the photo of the “better” state.
Some previous approaches try to use this crowdsourced, binary feedback to optimize a reward function that the agent would use to learn the task. However, because nonexperts are likely to make mistakes, the reward function can become very noisy, so the agent might get stuck and never reach its goal.
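As a rough illustration of what such earlier approaches do, the sketch below fits a reward model to binary comparisons with a Bradley-Terry loss, a common recipe in preference-based reinforcement learning. The names (RewardNet, fit_reward_from_preferences) and the PyTorch setup are illustrative assumptions, not the specific systems the researchers compared against.

```python
# Minimal sketch of preference-based reward learning (hypothetical names, not HuGE itself).
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Maps a state vector to a scalar reward estimate."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state).squeeze(-1)

def fit_reward_from_preferences(reward_net, pairs, labels, epochs=100, lr=1e-3):
    """pairs: (N, 2, state_dim) float tensor of state pairs shown to annotators.
    labels: (N,) tensor, 1.0 if the first state was judged closer to the goal, else 0.0.
    Bradley-Terry model: P(first preferred) = sigmoid(r(s1) - r(s2))."""
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr)
    loss_fn = nn.BCEWithLogitsLoss()
    for _ in range(epochs):
        logits = reward_net(pairs[:, 0]) - reward_net(pairs[:, 1])
        loss = loss_fn(logits, labels.float())
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_net
```

If the labels are noisy, the learned reward inherits that noise, and an agent trained to maximize it will chase those mistakes, which is the failure mode the researchers describe next.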
“Basically, the agent would take the reward function too seriously. It would try to match the reward function perfectly. So, instead of directly optimizing over the reward function, we just use it to tell the robot which areas it should be exploring,” Torne says.
He and his collaborators decoupled the process into two separate parts, each directed by its own algorithm. They call their new reinforcement learning method HuGE (Human Guided Exploration).
On one side, a goal selector algorithm is continually updated with crowdsourced human feedback. The feedback is not used as a reward function, but rather to guide the agent’s exploration. In a sense, the nonexpert users drop breadcrumbs that incrementally lead the agent toward its goal.
On the other side, the agent explores on its own, in a self-supervised manner guided by the goal selector. It collects images or videos of the actions it tries, which are then sent to humans and used to update the goal selector.
This narrows down the area the agent must explore, leading it to more promising regions that are closer to its goal. But if there is no feedback, or if feedback takes a while to arrive, the agent will keep learning on its own, albeit more slowly. This allows feedback to be gathered infrequently and asynchronously.
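The sketch below illustrates that decoupling under strong simplifying assumptions: a toy two-dimensional point environment stands in for the robot, and goal_score stands in for a goal selector trained from crowdsourced comparisons. None of these names come from the HuGE codebase; the point is only that exploration keeps running whether or not new feedback has arrived.

```python
# Toy sketch of a feedback-guided goal selector plus a self-supervised exploration loop.
import numpy as np

class PointEnv:
    """Toy stand-in for the robot: a point that steps toward a commanded goal."""
    def __init__(self):
        self.state = np.zeros(2)

    def step_toward(self, goal, step_size=0.1):
        direction = goal - self.state
        norm = np.linalg.norm(direction)
        if norm > 1e-8:
            self.state = self.state + step_size * direction / norm
        return self.state.copy()

def goal_score(state, weights=np.array([1.0, 1.0])):
    """Stand-in for a model fit from human comparisons: higher means annotators
    judged states like this one to be closer to the true goal."""
    return float(weights @ state)

def select_exploration_goal(visited_states):
    """Goal selector: sample a previously reached state, favoring the ones the
    feedback model rates highly, so exploration is steered rather than rewarded."""
    scores = np.array([goal_score(s) for s in visited_states])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    return visited_states[np.random.choice(len(visited_states), p=probs)]

env = PointEnv()
visited = [env.state.copy()]
feedback_queue = []  # human comparisons arrive here asynchronously, possibly rarely

for episode in range(50):
    if feedback_queue:
        # In the real method, queued comparisons would refit the goal selector here.
        feedback_queue.clear()
    # Exploration keeps running whether or not new feedback has arrived.
    goal = select_exploration_goal(visited)
    for _ in range(10):
        noisy_goal = goal + np.random.normal(scale=0.2, size=2)
        visited.append(env.step_toward(noisy_goal))
```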
“The exploration loop can keep going autonomously, because it is just going to explore and learn new things. And then when you get some better signal, it will explore in more concrete ways. You can just keep them turning at their own pace,” adds Torne.
And because the feedback only gently guides the agent’s behavior, the agent will eventually learn to complete the task even if users provide incorrect answers.
Faster learning
The researchers tested this method on a number of simulated and real-world tasks. In simulation, they used HuGE to effectively learn tasks with long sequences of actions, such as stacking blocks in a particular order or navigating a large maze.
In real-world tests, they used HuGE to train robotic arms to draw the letter “U” and to pick and place objects. For these tests, they crowdsourced data from 109 nonexpert users in 13 different countries spanning three continents.
In both real-world and simulated experiments, HuGE helped agents learn to achieve the goal faster than other methods.
The researchers also found that data crowdsourced from nonexperts yielded better performance than synthetic data, which were produced and labeled by the researchers. For nonexpert users, labeling 30 images or videos took fewer than two minutes.
“This makes it very promising in terms of being able to scale up this method,” Torne adds.
In a related paper, which the researchers presented at the recent Conference on Robot Learning, they enhanced HuGE so an AI agent can learn to perform the task, and then autonomously reset the environment to continue learning. For instance, if the agent learns to open a cabinet, the method also guides the agent to close the cabinet.
“Now we can have it learn completely autonomously without needing human resets,” he says.
The researchers also emphasize that, in this and other learning approaches, it is critical to ensure that AI agents are aligned with human values.
In the future, they want to continue refining HuGE so the agent can learn from other forms of communication, such as natural language and physical interactions with the robot. They are also interested in applying this method to teach multiple agents at once.
This research is funded, in part, by the MIT-IBM Watson AI Lab.