Someone learning to play tennis might hire a teacher to help them learn faster. Because this teacher is (hopefully) a great tennis player, there are times when trying to exactly mimic the teacher won’t help the student learn. Perhaps the teacher leaps high into the air to deftly return a volley. The student, unable to copy that, might instead try a few other moves on her own until she has mastered the skills she needs to return volleys.
Computer scientists can also use “teacher” systems to train another machine to complete a task. But just like with human learning, the student machine faces a dilemma of knowing when to follow the teacher and when to explore on its own. To this end, researchers from MIT and Technion, the Israel Institute of Technology, have developed an algorithm that automatically and independently determines when the student should mimic the teacher (known as imitation learning) and when it should instead learn through trial and error (known as reinforcement learning).
Their dynamic approach allows the student to diverge from copying the teacher when the teacher is either too good or not good enough, but then return to following the teacher at a later point in the training process if doing so would achieve better results and faster learning.
When the researchers tested this approach in simulations, they found that their combination of trial-and-error learning and imitation learning enabled students to learn tasks more effectively than methods that used just one type of learning.
This method could help researchers improve the training process for machines that will be deployed in uncertain real-world situations, like a robot being trained to navigate inside a building it has never seen before.
“This combination of learning by trial-and-error and following a teacher is very powerful. It gives our algorithm the ability to solve very difficult tasks that cannot be solved by using either technique individually,” says Idan Shenfeld, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this technique.
Shenfeld wrote the paper with coauthors Zhang-Wei Hong, an EECS graduate student; Aviv Tamar, assistant professor of electrical engineering and computer science at Technion; and senior author Pulkit Agrawal, director of Improbable AI Lab and an assistant professor in the Computer Science and Artificial Intelligence Laboratory. The research will be presented at the International Conference on Machine Learning.
Striking a balance
Many existing methods that seek to strike a balance between imitation learning and reinforcement learning do so through brute-force trial and error. Researchers pick a weighted combination of the two learning methods, run the entire training procedure, and then repeat the process until they find the optimal balance. This is inefficient and often so computationally expensive it isn’t even feasible.
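The brute-force search described above can be sketched in a few lines. This is only an illustration of why the approach is costly, not code from the paper: `toy_training_run` is a hypothetical stand-in that returns a made-up score instead of actually training a student, and the candidate weights are arbitrary.

```python
def toy_training_run(imitation_weight):
    # Hypothetical stand-in for one complete training run with a FIXED
    # imitation weight. A real run would train a student for hours or days;
    # here we simply return an invented score that peaks near 0.3.
    return 1.0 - (imitation_weight - 0.3) ** 2


def brute_force_weight_search(train_fn, candidate_weights):
    """Try every fixed weight and keep the one with the best final score."""
    scores = {}
    for weight in candidate_weights:
        # One full training run per candidate: this loop is the expensive part.
        scores[weight] = train_fn(weight)
    best_weight = max(scores, key=scores.get)
    return best_weight, scores


best_weight, scores = brute_force_weight_search(
    toy_training_run, [0.0, 0.25, 0.5, 0.75, 1.0]
)
print("best fixed weight:", best_weight)
print("full training runs needed:", len(scores))
```

The cost grows with every extra candidate weight, and the chosen weight stays fixed for the whole run, so it cannot react if the teacher becomes more or less useful partway through training.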
“We want algorithms that are principled, involve tuning of as few knobs as possible, and achieve high performance; these principles have driven our research,” says Agrawal.
To achieve this, the team approached the problem differently than prior work. Their solution involves training two students: one with a weighted combination of reinforcement learning and imitation learning, and a second that can only use reinforcement learning to learn the same task.
The main idea is to automatically and dynamically adjust the weighting of the reinforcement and imitation learning objectives of the first student. Here is where the second student comes into play. The researchers’ algorithm continually compares the two students. If the one using the teacher is doing better, the algorithm puts more weight on imitation learning to train the student, but if the one using only trial and error is starting to get better results, it will focus more on learning from reinforcement learning.
By dynamically determining which method achieves better results, the algorithm is adaptive and can pick the best technique throughout the training process. Thanks to this innovation, it is able to teach students more effectively than other methods that aren’t adaptive, Shenfeld says.
“One of the main challenges in developing this algorithm was that it took us some time to realize that we should not train the two students independently. It became clear that we needed to connect the agents to make them share information, and then find the right way to technically ground this intuition,” Shenfeld says.
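The comparison between the two students can be pictured with a small sketch. It is a simplified illustration under stated assumptions, not the researchers’ actual update rule: the function names, the `step_size` parameter, and the clamping of the weight to the range 0 to 1 are all hypothetical choices made for this example.

```python
def update_imitation_weight(weight, guided_return, rl_only_return, step_size=0.1):
    """Shift the imitation weight toward whichever student is doing better.

    guided_return:  recent performance of the student that uses the teacher.
    rl_only_return: recent performance of the student that uses only
                    trial and error.
    (Hypothetical rule for illustration; the paper's rule may differ.)
    """
    gap = guided_return - rl_only_return
    # Positive gap: the teacher is helping, so lean more on imitation.
    # Negative gap: trial and error is winning, so lean more on reinforcement.
    new_weight = weight + step_size * gap
    return min(1.0, max(0.0, new_weight))  # keep the weight in [0, 1]


def combined_loss(rl_loss, imitation_loss, weight):
    """Blend the two objectives of the first student with the current weight."""
    return (1.0 - weight) * rl_loss + weight * imitation_loss


weight = 0.5
# Teacher-guided student is ahead, so the weight rises.
weight = update_imitation_weight(weight, guided_return=1.0, rl_only_return=0.0)
# Later the trial-and-error student pulls ahead, so the weight falls again.
weight = update_imitation_weight(weight, guided_return=0.2, rl_only_return=2.2)
print(round(weight, 3))
```

Because the weight is recomputed throughout training rather than fixed in advance, a single run can drift away from the teacher and come back to it later.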
Solving tough problems
To test their approach, the researchers set up many simulated teacher-student training experiments, such as navigating through a maze of lava to reach the other corner of a grid. In this case, the teacher has a map of the entire grid while the student can only see a patch in front of it. Their algorithm achieved an almost perfect success rate across all testing environments, and was much faster than other methods.
To give their algorithm an even harder test, they set up a simulation involving a robotic hand with touch sensors but no vision, which must reorient a pen to the correct pose. The teacher had access to the actual orientation of the pen, while the student could only use touch sensors to determine the pen’s orientation.
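The gap between what the teacher and the student can see in the lava maze can be made concrete with a toy grid. Everything here is an assumption made for illustration: the layout, the symbols (`S` start, `G` goal, `L` lava, `.` free, `#` outside the grid), and the square window centered on the agent, which simplifies the “patch in front of it” described above.

```python
# Hypothetical 5x5 lava maze; the experiments' real environments are not shown here.
GRID = [
    "S..L.",
    ".L.L.",
    ".L...",
    ".LLL.",
    "....G",
]


def teacher_observation(grid):
    """The teacher sees the entire map."""
    return [list(row) for row in grid]


def student_observation(grid, row, col, radius=1, pad="#"):
    """The student sees only a small window around its own position."""
    patch = []
    for r in range(row - radius, row + radius + 1):
        line = []
        for c in range(col - radius, col + radius + 1):
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else pad)
        patch.append(line)
    return patch


for line in student_observation(GRID, 0, 0):
    print("".join(line))
```

From the start cell the student cannot see the goal or most of the lava, so an action that is obviously right to the teacher may be impossible for the student to justify from its own observation. That is the situation in which copying the teacher exactly stops being useful.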
Their method outperformed others that used either only imitation learning or only reinforcement learning.
Reorienting objects is one of many manipulation tasks that a future home robot would need to perform, a vision that the Improbable AI lab is working toward, Agrawal adds.
Teacher-student learning has successfully been applied to train robots to perform complex object manipulation and locomotion in simulation and then transfer the learned skills into the real world. In these methods, the teacher has privileged information accessible from the simulation that the student won’t have when it is deployed in the real world. For example, the teacher will know the detailed map of a building that the student robot is being trained to navigate using only images captured by its camera.
“Current methods for student-teacher learning in robotics don’t account for the inability of the student to mimic the teacher and thus are performance-limited. The new method paves a path for building superior robots,” says Agrawal.
Apart from better robots, the researchers believe their algorithm has the potential to improve performance in diverse applications where imitation or reinforcement learning is being used. For example, large language models such as GPT-4 are very good at accomplishing a wide range of tasks, so perhaps one could use the large model as a teacher to train a smaller, student model to be even “better” at one particular task. Another exciting direction is to investigate the similarities and differences between machines and humans learning from their respective teachers. Such analysis might help improve the learning experience, the researchers say.
“What’s interesting about [this method] compared to related methods is how robust it seems to various parameter choices, and the variety of domains it shows promising results in,” says Abhishek Gupta, an assistant professor at the University of Washington, who was not involved with this work. “While the current set of results are largely in simulation, I am very excited about the future possibilities of applying this work to problems involving memory and reasoning with different modalities such as tactile sensing.”
“This work presents an interesting approach to reuse prior computational work in reinforcement learning. Particularly, their proposed method can leverage suboptimal teacher policies as a guide while avoiding careful hyperparameter schedules required by prior methods for balancing the objectives of mimicking the teacher versus optimizing the task reward,” adds Rishabh Agarwal, a senior research scientist at Google Brain, who was also not involved in this research. “Hopefully, this work would make reincarnating reinforcement learning with learned policies less cumbersome.”
This research was supported, in part, by the MIT-IBM Watson AI Lab, Hyundai Motor Company, the DARPA Machine Common Sense Program, and the Office of Naval Research.