
Regardless of the industry in which they are deployed, artificial intelligence (AI) and machine learning (ML) technologies have always aimed to improve people's quality of life. One of the most important applications of AI in recent years has been designing agents that can accomplish decision-making tasks across various domains. For instance, large language models like GPT-3 and PaLM and vision models like CLIP and Flamingo have proven exceptionally good at zero-shot learning in their respective fields. However, there is a major drawback associated with training such agents: environmental diversity. In simple terms, training on different tasks or environments requires different state spaces, which can impede learning, knowledge transfer, and the generalization ability of models across domains. Furthermore, for reinforcement learning (RL) tasks, designing reward functions for specific tasks across environments is difficult.
Working on this problem, a team from Google Research investigated whether such models could be used to build more general-purpose agents. For their research, the team focused on text-guided video generation, where the desired goal, expressed as text, is fed to a planner that synthesizes a sequence of frames representing the intended plan of action, after which control actions are extracted from the generated video. The Google team thus proposed a Universal Policy (UniPi) that addresses the challenges of environmental diversity and reward specification in their recent paper, "Learning Universal Policies via Text-Guided Video Generation." UniPi uses text as a universal interface for task descriptions and video as a universal interface for communicating action and observation behavior across different situations. Specifically, the team designed a video generator as a planner that takes the current image frame and a text prompt stating the current goal as input and generates a trajectory in the form of an image sequence or video. The generated video is then fed into an inverse dynamics model that extracts the underlying actions being executed. This approach stands out because it leverages the universal nature of language and video to generalize to novel goals and tasks across diverse environments.
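At a high level, this pipeline can be summarized in a few lines of code. The following is a minimal sketch under assumed interfaces, not the paper's actual implementation; the `planner`, `inverse_dynamics`, and `env` objects are hypothetical stand-ins.

```python
import torch

class UniPiAgent:
    """Text goal -> video plan -> actions. All interfaces here are assumptions."""

    def __init__(self, planner, inverse_dynamics):
        self.planner = planner                    # text + first frame -> video
        self.inverse_dynamics = inverse_dynamics  # frames -> low-level actions

    @torch.no_grad()
    def act(self, first_frame, goal_text, env):
        # 1. Plan: synthesize a video of the agent achieving the text goal.
        video = self.planner.generate(first_frame, goal_text)      # (T, C, H, W)
        # 2. Extract the actions implied by consecutive frame pairs.
        actions = self.inverse_dynamics.predict(video, goal_text)  # (T-1, A)
        # 3. Execute the extracted actions with closed-loop control.
        for action in actions:
            observation = env.step(action)
        return observation
```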
Over the past few years, significant progress has been achieved in text-guided image synthesis, yielding models with an exceptional capability to generate sophisticated images. This further motivated the team to frame decision-making as a video generation task. The UniPi approach proposed by the Google researchers consists of four components: trajectory consistency through tiling, hierarchical planning, flexible behavior modulation, and task-specific action adaptation, which are described in detail below.
1. Trajectory consistency through tiling:
Existing text-to-video models often produce videos in which the underlying environment state changes substantially over time. However, ensuring the environment remains consistent across all timestamps is critical for building an accurate trajectory planner. To enforce environment consistency in conditional video synthesis, the researchers additionally provide the observed image while denoising each frame of the synthesized video. To retain the underlying environment state across time, UniPi directly concatenates each noisy intermediate frame with the conditioned observed image across sampling steps, as sketched below.
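To make the tiling idea concrete, here is a rough sketch of a diffusion-style sampling loop in which the observed image is tiled across time and concatenated with every noisy frame at every denoising step. The `denoiser` call signature and shapes are illustrative assumptions, not the paper's code.

```python
import torch

def sample_consistent_video(denoiser, obs_image, num_frames=16, num_steps=50):
    # Start from pure Gaussian noise for every frame of the plan.
    video = torch.randn(num_frames, *obs_image.shape)    # (T, C, H, W)
    for t in reversed(range(num_steps)):
        # Tile the observed first frame across time and concatenate it
        # channel-wise with each noisy frame, so the denoiser sees the
        # fixed environment state at every sampling step.
        cond = obs_image.unsqueeze(0).expand_as(video)   # (T, C, H, W)
        model_input = torch.cat([video, cond], dim=1)    # (T, 2C, H, W)
        video = denoiser(model_input, t)                 # less-noisy frames
    return video
```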
2. Hierarchical planning:
When planning in complex environments over long horizons, it is difficult to generate all of the necessary actions at once. Planning methods overcome this by exploiting a natural hierarchy: they create rough plans in a coarser space and refine them into more detailed ones. Similarly, in the video generation process, UniPi first creates videos at a coarse level demonstrating the desired agent behavior, then refines them to make them more realistic by filling in missing frames and smoothing the motion. This is done through a hierarchy of steps, with each step improving video quality until the desired level of detail is reached, as in the sketch below.
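The following is a sketch of this coarse-to-fine scheme; both model interfaces are hypothetical stand-ins for a keyframe generator and a temporal super-resolution (frame interpolation) model.

```python
def hierarchical_plan(keyframe_model, interp_model, obs_image, goal_text,
                      refinement_levels=2):
    # Coarse level: a short, low-frame-rate video of keyframes
    # demonstrating the desired agent behavior.
    video = keyframe_model.generate(obs_image, goal_text)   # (T, C, H, W)
    # Each refinement level fills in the frames between neighboring
    # keyframes, roughly doubling temporal resolution and smoothing motion.
    for _ in range(refinement_levels):
        video = interp_model.interpolate(video, goal_text)  # (~2T, C, H, W)
    return video
```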
3. Flexible behavior modulation:
When planning a sequence of actions toward a given goal, external constraints can easily be incorporated to modify the generated plan. This can be done by composing a probabilistic prior that reflects the desired constraints on the properties of the plan. The prior can be specified by a learned classifier, or by a Dirac delta distribution on a particular image to guide the plan toward specific states, and UniPi is naturally compatible with this approach, as sketched below. To train the text-conditioned video generation model, the researchers employed the video diffusion algorithm, conditioning it on pre-trained language features encoded by the Text-To-Text Transfer Transformer (T5).
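One common way to compose such a prior with a diffusion sampler is classifier guidance: the gradient of the constraint's log-probability nudges each denoising step toward plans the prior favors. The sketch below is illustrative and omits the noise-schedule scaling factor; the `denoiser` and `constraint` interfaces are assumptions.

```python
import torch

def guided_denoise_step(denoiser, constraint, video, t, goal_text, scale=1.0):
    # Text-conditioned denoising direction for the current noisy plan.
    eps = denoiser(video, t, goal_text)
    # Gradient of the constraint's log-probability w.r.t. the plan,
    # e.g. a learned classifier scoring desired plan properties.
    with torch.enable_grad():
        v = video.detach().requires_grad_(True)
        log_p = constraint.log_prob(v).sum()
        grad = torch.autograd.grad(log_p, v)[0]
    # Shift the denoising direction toward plans the prior favors.
    return eps - scale * grad
```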
4. Task-specific action adaptation:
A small inverse dynamics model is trained to translate video frames into the low-level control actions that connect them. This model is separate from the planner and can be trained on a smaller dataset generated by a simulator. Given the synthesized image frames and the text description of the current goal, the inverse dynamics model predicts the sequence of actions for the future steps, which an agent then executes using closed-loop control. A minimal training sketch follows.
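Below is a minimal training loop for such an inverse dynamics model, assuming a simulator dataset of (frame, next frame, action) triples with continuous actions; text conditioning is omitted for brevity, and the architecture and loss are illustrative assumptions rather than the paper's.

```python
import torch
import torch.nn as nn

def train_inverse_dynamics(model, dataloader, epochs=10, lr=1e-4):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # regression onto continuous control actions
    for _ in range(epochs):
        for frame, next_frame, action in dataloader:
            # Predict the action that transitions frame -> next_frame.
            pred = model(torch.cat([frame, next_frame], dim=1))  # (B, A)
            loss = loss_fn(pred, action)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```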
To summarize, the researchers from Google have made a strong contribution by demonstrating the value of text-based video generation for representing policies capable of combinatorial generalization, multi-task learning, and real-world transfer. The researchers evaluated their approach on a variety of novel language-based tasks and found that UniPi generalizes well to both seen and unseen combinations of language prompts, compared to baselines such as Transformer BC, Trajectory Transformer, and Diffuser. These encouraging findings highlight the potential of generative models and the vast data available as useful resources for creating versatile decision-making systems.
Check out the Paper and Google Blog. Don't forget to join our 19k+ ML SubReddit, Discord Channel, and Email Newsletter, where we share the latest AI research news, cool AI projects, and more. If you have any questions regarding the above article or if we missed anything, feel free to email us at Asif@marktechpost.com
Khushboo Gupta is a consulting intern at MarktechPost. She is currently pursuing her B.Tech from the Indian Institute of Technology (IIT), Goa. She is passionate about the fields of Machine Learning, Natural Language Processing, and Web Development. She enjoys learning more about the technical field by participating in several challenges.