EUREKA: Human-Level Reward Design via Coding Large Language Models
EUREKA: An Introduction

With the advancements Large Language Models have made in recent years, it is unsurprising that these LLM frameworks excel as semantic planners for sequential high-level decision-making tasks. Nevertheless, developers still find it difficult to harness the full potential of LLM frameworks for learning complex low-level manipulation tasks. Despite their capabilities, today's Large Language Models require considerable domain and subject expertise even to learn simple skills or to construct effective textual prompts, creating a major gap between their performance and human-level dexterity.

To bridge this gap, developers from Nvidia, Caltech, UPenn, and other institutions have introduced EUREKA, an LLM-powered human-level reward design algorithm. EUREKA aims to harness various capabilities of LLM frameworks, including code-writing, in-context improvement, and zero-shot content generation, to perform unprecedented optimization of reward code. This reward code, combined with reinforcement learning, enables agents to learn complex skills and perform manipulation tasks.

In this article, we will examine the EUREKA framework from a development perspective, exploring its architecture, how it works, and the results it achieves in generating reward functions, which, as the developers claim, outperform those written by humans. We will also delve into how the EUREKA framework paves the way for a new approach to RLHF (Reinforcement Learning from Human Feedback) by enabling gradient-free in-context learning. Let's start.

Today, cutting-edge LLM frameworks like GPT-3 and GPT-4 deliver outstanding results when serving as semantic planners for sequential high-level decision-making tasks, but developers are still searching for ways to improve their performance when it comes to learning low-level manipulation tasks such as dexterous pen spinning. Developers have also observed that reinforcement learning can achieve strong results in dexterous manipulation and other domains, provided the reward functions are carefully constructed by human designers and are capable of providing learning signals for favorable behaviors. In contrast to real-world reinforcement learning tasks with sparse rewards, which make it difficult for the model to learn, shaping these rewards supplies the necessary incremental learning signals. However, reward functions, despite their importance, are extremely difficult to design, and sub-optimal designs often result in unintended behaviors.

To tackle these challenges and maximize the effectiveness of these reward functions, EUREKA, or Evolution-driven Universal REward Kit for Agent, aims to make the following contributions.

  1. Achieve human-level performance in designing reward functions.
  2. Effectively solve manipulation tasks without manual reward engineering.
  3. Generate more human-aligned and more performant reward functions by introducing a new gradient-free in-context learning approach as an alternative to the traditional RLHF (Reinforcement Learning from Human Feedback) method.

There are three key algorithmic design choices the developers have made to enhance EUREKA's generality: environment as context, evolutionary search, and reward reflection. First, the EUREKA framework takes the environment source code as context to generate executable reward functions in a zero-shot setting. Next, the framework performs an evolutionary search to substantially improve the quality of its rewards, proposing batches of reward candidates in every iteration and refining the ones it finds most promising. In the third and final stage, the framework uses reward reflection to make the in-context improvement of rewards more effective, ultimately enabling targeted and automated reward editing based on a textual summary of reward quality derived from policy training statistics. The following figure gives a brief overview of how the EUREKA framework works; in the upcoming sections, we discuss the architecture and workings in greater detail.

EUREKA: Model Architecture and Problem Setting

The primary aim of reward shaping is to return a shaped or curated reward function for a ground-truth reward function that may be difficult to optimize directly, such as a sparse reward. Moreover, designers can only access these ground-truth reward functions through queries, which is why the EUREKA framework opts for reward generation, a program synthesis setting based on the RDP, or Reward Design Problem.

The Reward Design Problem, or RDP, is a tuple that contains a world model with a state space, an action space, and a transition function, together with a space of reward functions. A learning algorithm optimizes a reward by producing a policy for the resulting MDP, or Markov Decision Process, and a fitness function produces a scalar evaluation of any policy, which can only be accessed through policy queries. The primary goal of the RDP is to output a reward function such that the resulting policy achieves the maximum fitness score. In EUREKA's problem setting, the developers specify every component of the Reward Design Problem in code. Given a string that specifies the details of the task, the objective of the reward generation problem is then to generate reward function code that maximizes the fitness score.
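
For reference, the RDP and the reward generation objective can be written compactly as follows (notation paraphrased here, not copied verbatim from the paper):

```latex
% Reward Design Problem (notation paraphrased)
P = \langle M, \mathcal{R}, A_M, F \rangle, \qquad M = (S, A, T)
% \mathcal{R}: space of reward functions
% A_M(\cdot): \mathcal{R} \to \Pi  -- learning algorithm mapping a reward R to a policy
% F: \Pi \to \mathbb{R}            -- fitness function, accessible only via policy queries
% Reward generation: given a task string l and the environment code, output
% reward code R that maximizes F(A_M(R)).
```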

Moving along, at its core, there are three fundamental algorithmic components in the EUREKA framework: environment as context (generating executable rewards in a zero-shot setting), evolutionary search (iteratively proposing and refining reward candidates), and reward reflection (enabling fine-grained improvement of rewards). The pseudocode for the algorithm is illustrated in the following image.
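
To make the loop concrete, here is a minimal, illustrative Python sketch of the outer EUREKA loop described above. The helper callables (`build_prompt`, `query_llm`, `train_policy`, `fitness`, `reward_reflection`) and the default iteration and sample counts are placeholders for this sketch, not EUREKA's actual API.

```python
# Illustrative sketch of the EUREKA outer loop; all helpers are supplied by the
# caller, so nothing here refers to the official implementation.

def eureka_loop(build_prompt, query_llm, train_policy, fitness,
                reward_reflection, env_source_code, task_description,
                iterations=5, num_samples=16):
    best_code, best_score = None, float("-inf")
    feedback = ""  # reward reflection text from the previous iteration

    for _ in range(iterations):
        prompt = build_prompt(env_source_code, task_description, feedback)
        # Sample several i.i.d. candidate reward functions from the coding LLM.
        candidates = [query_llm(prompt) for _ in range(num_samples)]

        scored = []
        for code in candidates:
            try:
                policy, train_stats = train_policy(code)  # full RL training run
                scored.append((fitness(policy), code, train_stats))
            except Exception:
                continue  # discard reward code that fails to execute

        if not scored:
            continue  # every sample was buggy; retry with the same feedback
        score, code, stats = max(scored, key=lambda item: item[0])
        if score > best_score:
            best_score, best_code = score, code

        # Textual summary of per-component reward values during training,
        # fed back into the next prompt for in-context reward mutation.
        feedback = reward_reflection(code, stats)

    return best_code, best_score
```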

Environment as Context

Current LLM-based approaches need environment specifications as inputs for designing rewards, whereas the EUREKA framework proposes to feed the raw environment source code, without any reward code, directly as context, allowing the LLM to take the world model as context. This approach has two major advantages. First, LLM frameworks for coding are trained on native code written in existing programming languages like C, C++, Python, Java, and more, which is the fundamental reason why they are better at producing code when allowed to compose it directly in the syntax and style they were originally trained on. Second, the environment source code usually reveals the environment semantically and exposes the variables that are suitable for use when constructing a reward function for the specified task. On the basis of these insights, the EUREKA framework instructs the LLM to return executable Python code directly, with the help of only formatting tips and generic reward design hints.
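
As a purely hypothetical illustration (not an actual EUREKA output), a generated reward function for a simple reaching task, built from environment variables such as `fingertip_pos` and `target_pos` exposed by the source code, might look like this:

```python
import torch

def compute_reward(fingertip_pos: torch.Tensor, target_pos: torch.Tensor):
    # Hypothetical generated reward: dense shaping via a negative exponential of
    # the fingertip-to-target distance, plus a sparse success bonus.
    distance = torch.norm(fingertip_pos - target_pos, dim=-1)
    distance_reward = torch.exp(-5.0 * distance)
    success_bonus = (distance < 0.05).float()
    total_reward = distance_reward + success_bonus
    # Components are returned individually so their scalar values can be tracked
    # at policy checkpoints (used later by reward reflection).
    return total_reward, {
        "distance_reward": distance_reward,
        "success_bonus": success_bonus,
    }
```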

Evolutionary Search

The inclusion of evolutionary search in the EUREKA framework aims to provide a natural solution to the sub-optimality challenges and execution errors mentioned earlier. In each iteration, the framework samples several independent outputs from the Large Language Model, and provided the generations are all i.i.d., the probability that every reward function in an iteration is buggy decreases exponentially as the number of samples per iteration increases.
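
The intuition behind the exponential claim is simple: if a single sample is buggy with probability p, then for K i.i.d. samples per iteration

```latex
\Pr\big(\text{all } K \text{ samples are buggy}\big) = p^{K},
```

so even a modest batch size makes it very likely that at least one candidate reward function is executable.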

In the next step, the EUREKA framework uses the executable reward functions from the previous iteration to perform an in-context reward mutation, proposing a new and improved reward function on the basis of textual feedback. By combining the in-context improvement and instruction-following capabilities of Large Language Models, EUREKA is able to specify the mutation operator as a text prompt that suggests how to use the textual summary of policy training to modify existing reward code.

Reward Reflection

To ground the in-context reward mutations, it is essential to evaluate the quality of the generated rewards and, more importantly, to put that evaluation into words. A simple strategy would be to provide only the numerical scores as the reward evaluation, but when the task fitness function serves as a holistic ground-truth metric, it lacks credit assignment and cannot explain why a reward function works or why it fails. To provide a more targeted and nuanced reward diagnosis, the framework instead uses automated feedback that summarizes the policy training dynamics in text. Moreover, the reward functions EUREKA generates are required to expose their components individually, allowing the framework to track the scalar values of every reward component at policy checkpoints throughout the training phase.
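
A minimal sketch of how such a textual summary could be assembled from the tracked component values is shown below; the exact format EUREKA uses may differ.

```python
def summarize_training(component_history, fitness_history, checkpoints=10):
    """Assemble a textual reward reflection from scalar values recorded at
    policy checkpoints during training. component_history maps each reward
    component name to its list of recorded values; fitness_history holds the
    task fitness at the same checkpoints. The format here is illustrative."""
    lines = []
    for name, values in component_history.items():
        step = max(len(values) // checkpoints, 1)
        sampled = [round(v, 2) for v in values[::step]]
        lines.append(f"{name}: {sampled}, max: {max(values):.2f}, "
                     f"mean: {sum(values) / len(values):.2f}")
    step = max(len(fitness_history) // checkpoints, 1)
    lines.append(f"task_fitness: {[round(v, 2) for v in fitness_history[::step]]}")
    return "\n".join(lines)
```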

Although the reward reflection procedure followed by the EUREKA framework is simple to construct, it is important because of the algorithm-dependent nature of reward optimization. That is, the effectiveness of a reward function is directly influenced by the choice of reinforcement learning algorithm, and even with the same optimizer, a change in hyperparameters can make the same reward perform very differently. By detailing how well the individual reward components are being optimized, reward reflection lets the EUREKA framework edit rewards more effectively and selectively, and synthesize reward functions that are in better synergy with the reinforcement learning algorithm.

Training and Baselines

There are two major components of the EUREKA training and evaluation setup: policy learning and reward evaluation metrics.

Policy Learning

The final reward function for every individual task is optimized with the same reinforcement learning algorithm, using the same set of hyperparameters that were tuned to make the human-engineered rewards work well.

Reward Evaluation Metrics

Because the task metric varies in scale and semantic meaning from task to task, the EUREKA framework reports the human normalized score, a metric that provides a holistic measure of how the framework performs against expert human-engineered rewards with respect to the ground-truth metric.
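
Concretely, the human normalized score is computed along the lines of

```latex
\text{human normalized score} = \frac{\text{method} - \text{sparse}}{\lvert \text{human} - \text{sparse} \rvert},
```

so a score of 1 or above indicates that the generated reward matches or beats the human-engineered reward on the ground-truth task metric, while 0 corresponds to the sparse baseline.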

Moving along, there are three primary baselines: L2R, Human, and Sparse. 

L2R

L2R is a dual-stage Large Language Model prompting solution that generates templated rewards. First, an LLM fills in a natural language template for an environment and task specified in natural language, and then a second LLM converts this "motion description" into code that computes a reward by calling a set of manually written reward API primitives.
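
Conceptually, the L2R pipeline can be sketched as two chained LLM calls. The function name, prompt wording, and generic `llm` callable below are illustrative stand-ins, not L2R's actual API.

```python
def l2r_reward(task_description, env_description, llm):
    # Stage 1: the LLM fills in a natural-language template describing the
    # desired motion for the given environment and task.
    motion_description = llm(
        f"Fill in the motion-description template for the task "
        f"'{task_description}' in the environment '{env_description}'."
    )
    # Stage 2: a second LLM call converts the motion description into reward
    # code that only calls a fixed set of hand-written reward API primitives.
    reward_code = llm(
        "Convert the following motion description into reward code using only "
        f"the predefined reward primitives:\n{motion_description}"
    )
    return reward_code
```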

Human

The Human baseline consists of the original reward functions written by reinforcement learning researchers, and thus represents the outcome of expert human reward engineering.

Sparse

The Sparse baseline is identical to the fitness functions, which are also used to evaluate the quality of the rewards the framework generates.

Results and Outcomes

To analyze the performance of the EUREKA framework, we evaluate it on several fronts, including its performance against human rewards, its improvement over time, its ability to generate novel rewards, its support for targeted improvement, and its ability to work with human feedback.

EUREKA Outperforms Human Rewards

The following figure illustrates the aggregate results over the different benchmarks, and as can be clearly observed, the EUREKA framework either outperforms or performs on par with human-level rewards on both the Dexterity and Isaac tasks. In comparison, the L2R baseline delivers similar performance on low-dimensional tasks, but on high-dimensional tasks the performance gap is quite substantial.

Consistently Improving Over Time

One of the key highlights of the EUREKA framework is its ability to consistently improve and enhance its performance over time with each iteration, and the results are demonstrated in the figure below.

As can be clearly seen, the framework consistently generates better rewards with each iteration, and it eventually improves upon and surpasses the performance of human rewards, thanks to its in-context evolutionary reward search approach.

Generating Novel Rewards

The novelty of EUREKA's rewards can be assessed by calculating the correlation between human and EUREKA rewards across the entire set of Isaac tasks. These correlations are then plotted on a scatter plot against the human normalized scores, with each point representing an individual EUREKA reward for a single task. As can be clearly seen, the EUREKA framework predominantly generates weakly correlated reward functions that outperform the human reward functions.
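
One simple way to compute such a correlation for a single task, assuming both reward functions are evaluated on the same rollout states, is sketched below (the paper's exact procedure may differ):

```python
import numpy as np

def reward_correlation(eureka_rewards, human_rewards):
    """Pearson correlation between EUREKA and human reward values evaluated on
    the same rollout states. Values near zero indicate a novel, weakly
    correlated reward; values near one indicate essentially the same reward."""
    return float(np.corrcoef(eureka_rewards, human_rewards)[0, 1])
```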

Enabling Targeted Improvement

To evaluate the importance of reward reflection in the reward feedback, the developers evaluated an ablation: a EUREKA variant without reward reflection, which reduces the feedback prompt to consist only of snapshot values. When running the Isaac tasks, the developers observed that without reward reflection, the EUREKA framework suffered a drop of about 29% in the average normalized score.

Working with Human Feedback

To readily incorporate a wide range of inputs and generate more human-aligned and performant reward functions, the EUREKA framework, in addition to automated reward design, also introduces a new gradient-free in-context learning approach to Reinforcement Learning from Human Feedback, and there were two significant observations.

  1. EUREKA can benefit from and improve upon human reward functions.
  2. Using human feedback for reward reflection induces more aligned behavior.

The above figure demonstrates that the EUREKA framework achieves a considerable boost in performance and efficiency when using human reward initialization, regardless of the quality of the human rewards, suggesting that the quality of the base rewards does not have a significant impact on the framework's in-context reward improvement abilities.

The above figure illustrates how the EUREKA framework can not only induce more human-aligned policies, but also modify rewards by incorporating human feedback.

Final Thoughts

In this article, we have discussed EUREKA, an LLM-powered human-level reward design algorithm that attempts to harness various capabilities of LLM frameworks, including code-writing, in-context improvement, and zero-shot content generation, to perform unprecedented optimization of reward code. The reward code, together with reinforcement learning, can then be used by agents to learn complex skills or perform manipulation tasks. Without human intervention or task-specific prompt engineering, the framework delivers human-level reward generation capabilities on a wide range of tasks, and its major strength lies in learning complex tasks with a curriculum learning approach.

Overall, the substantial performance and versatility of the EUREKA framework indicate that combining evolutionary algorithms with large language models may lead to a scalable and general approach to reward design, an insight that might be applicable to other open-ended search problems.
