Reinforcement Learning, Part 2: Policy Evaluation and Improvement
Introduction
About this text
Solving the Bellman equation
Policy evaluation
Policy improvement
Policy iteration
Value iteration
Generalized policy iteration
Conclusion
Resources

From data to decisions: maximizing rewards with policy improvement methods for optimal strategies


Reinforcement learning is a field of machine learning that introduces the concept of an agent which must learn optimal strategies in complex environments. The agent learns from its actions, which yield rewards depending on the environment’s state. Reinforcement learning is a difficult topic and differs significantly from other areas of machine learning. That is why it should only be used when a given problem cannot be solved otherwise.

The incredible flexibility of reinforcement learning lies in the fact that the same algorithms can be used to make the agent adapt to completely different, unknown, and complex conditions.

Note. To fully understand the ideas included in this article, it is highly recommended to be familiar with the main concepts of reinforcement learning introduced in the first part of this series.

In Part 1, we introduced the main concepts of reinforcement learning: the framework, policies and value functions. The Bellman equation, which recursively establishes the relationship between value functions, is the backbone of modern algorithms. In this article we will see its power by learning how it can be used to find optimal policies.

This article is based on Chapter 4 of the book “Reinforcement Learning: An Introduction” written by Richard S. Sutton and Andrew G. Barto. I highly appreciate the efforts of the authors who contributed to the publication of this book.

Solving the Bellman equation

Let us imagine that we perfectly know the dynamics of an environment that comprises |S| states, and that action probabilities are given by a policy π. Given that, we can solve the Bellman equation for the V-function of this environment, which in fact represents a system of linear equations with |S| variables (in the case of the Q-function there would be |S| x |A| equations).
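For reference, the Bellman equation for the state-value function under a policy π, as introduced in Part 1, can be written as:

```latex
v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr]
```

Writing this equation for every state s gives exactly the linear system described above.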

The solution of that system of equations corresponds to the v-values of every state (or the q-values of every (state, action) pair).
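As a minimal illustrative sketch (not the method used to produce the figures below), such a system can be solved directly with NumPy once the policy-induced transition matrix P and expected rewards r are assembled, using the matrix form v = r + γPv. The numbers below are made up and do not correspond to the diagram:

```python
import numpy as np

# Hypothetical example with 4 non-terminal states (the terminal state is
# omitted, since its v-value is always 0). P[i, j] is the probability of
# moving from state i to state j under the policy; r[i] is the expected
# immediate reward in state i. Missing probability mass goes to the terminal state.
P = np.array([
    [0.0, 0.6, 0.3, 0.1],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.0, 0.0],   # transitions to the terminal state with probability 1
    [0.0, 0.0, 0.0, 0.0],
])
r = np.array([1.0, 2.0, 3.0, 4.0])
gamma = 1.0                 # undiscounted, episodic task

# Bellman equation in matrix form: v = r + gamma * P v  =>  (I - gamma * P) v = r
v = np.linalg.solve(np.eye(len(r)) - gamma * P, r)
print(v)
```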

Example

Let us have a look at a simple example of an environment with 5 states where T is a terminal state. Numbers in blue represent transition probabilities while numbers in red represent rewards received by the agent. We will also assume that the action chosen by the agent in state A (represented by the horizontal arrow with probability p = 0.6) leads to either C or D with different probabilities (p = 0.8 and p = 0.2).

Transition diagram for the example. Numbers in blue denote transition probabilities between states and numbers in red define the respective rewards.

Since the environment comprises |S| = 5 states, to find all v-values we will have to solve a system consisting of 5 Bellman equations:

System of Bellman equations for the V-function.

Since T is a terminal state, its v-value is always 0, so technically we only have to solve 4 equations.

Solution of the system of equations.

Solving the analogous system for the Q-function would be harder because we would need to solve an equation for every (state, action) pair.

Policy evaluation

Solving a linear system of equations directly, as shown in the example above, is one possible way to obtain the true v-values. However, given the cubic complexity O(n³), where n = |S|, it is not optimal, especially when the number of states |S| is large. Instead, we can apply the iterative policy evaluation algorithm:

  1. Randomly initialise the v-values of all environment states (except for terminal states, whose v-values must be equal to 0).
  2. Iteratively update all non-terminal states by using the Bellman equation.
  3. Repeat step 2 until the difference between the previous and current v-values is sufficiently small (≤ θ).
Policy evaluation pseudocode. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
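A minimal Python sketch of this procedure, assuming the dynamics are exposed as a function p(s, a) returning (probability, next_state, reward) triples and the policy as pi(s) returning a dictionary of action probabilities; both interfaces are my own assumptions, not the book’s code:

```python
def policy_evaluation(states, actions, p, pi, gamma=1.0, theta=1e-6):
    """Iteratively estimate v_pi using in-place expected updates.

    p(s, a) -> list of (prob, next_state, reward) tuples   (assumed interface)
    pi(s)   -> dict mapping action -> probability          (assumed interface)
    Terminal states are assumed to be absent from `states` (their value is 0).
    """
    V = {s: 0.0 for s in states}           # step 1: initialise v-values
    while True:
        delta = 0.0
        for s in states:                   # step 2: one sweep of expected updates
            new_v = sum(
                pi(s).get(a, 0.0) * sum(prob * (r + gamma * V.get(s2, 0.0))
                                        for prob, s2, r in p(s, a))
                for a in actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:                  # step 3: stop when updates become tiny
            break
    return V
```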

If the number of states |S| is finite, then it is possible to prove mathematically that the iterative estimates obtained by the policy evaluation algorithm under a given policy π ultimately converge to the true v-values!

A single update of the v-value of a state s ∈ S is called an expected update. The logic behind this name is that the update procedure considers the rewards of all possible successor states of s, not only a single one.

A complete iteration of updates over all states is called a sweep.

Note. An analogous iterative algorithm can be applied to the calculation of Q-functions as well.

To appreciate how powerful this algorithm is, let us highlight it once more:

Policy evaluation allows us to iteratively find the V-function under a given policy π.

Update variations

The update equation in the policy evaluation algorithm can be implemented in two ways:

  • By using two arrays: new values are computed sequentially from the unchanged old values, which are kept in a separate array.
  • By using one array: computed values are overwritten immediately. As a result, later updates during the same sweep use the overwritten new values.

In practice, overwriting v-values is the preferable way to perform updates, because the new information is used for other updates as soon as it becomes available, in contrast to the two-array method. As a consequence, v-values tend to converge faster.
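For comparison, a single sweep of the two-array variant might look like this (same assumed interfaces as in the sketch above):

```python
def sweep_two_arrays(states, actions, p, pi, V, gamma=1.0):
    """One sweep of the two-array variant: all new values are computed
    from the frozen old values in V and returned as a separate dictionary."""
    V_new = {}
    for s in states:
        V_new[s] = sum(
            pi(s).get(a, 0.0) * sum(prob * (r + gamma * V.get(s2, 0.0))
                                    for prob, s2, r in p(s, a))
            for a in actions
        )
    return V_new
```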

The algorithm does not impose rules on the order in which states should be updated during every sweep; however, the order can have a significant influence on the convergence rate.

Example

Description

To further understand how the policy evaluation algorithm works in practice, let us have a look at Example 4.1 from Sutton’s and Barto’s book. We are given an environment in the form of a 4 x 4 grid where at every step the agent equiprobably (p = 0.25) makes a single step in one of the 4 directions (up, right, down, left).

The agent starts at a random maze cell and can go in one of 4 directions, receiving the reward R = -1 at every step. A4 and D1 are terminal states. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

If the agent is positioned at the edge of the maze and chooses to move in the direction of the surrounding wall, then its position stays the same. For example, if the agent is positioned at D3 and chooses to go to the right, it will stay at D3 at the next step.

Every move to any cell results in a reward of R = -1, except for moves into the two terminal states positioned at A4 and D1, whose rewards are R = 0. The ultimate goal is to calculate the V-function for the given equiprobable policy.

Algorithm

Let us initialize all v-values to 0. Then we run several iterations of the policy evaluation algorithm:

The V-function on different policy evaluation iterations. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

At some point, there will be no changes between the v-values of consecutive iterations. That means that the algorithm has converged to the true V-values. For the maze, the V-function under the equiprobable policy is shown at the right of the last diagram row.
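A small self-contained sketch of this gridworld evaluation (the cell indexing and helper names are my own assumptions; the terminal corners correspond to A4 and D1 in the figure):

```python
import numpy as np

# 4 x 4 gridworld from Example 4.1: terminal states in two opposite corners,
# reward -1 for every move, equiprobable random policy (p = 0.25 per direction).
N = 4
TERMINALS = {(0, 0), (N - 1, N - 1)}
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(cell, move):
    """Moving into the surrounding wall leaves the agent in the same cell."""
    r, c = cell[0] + move[0], cell[1] + move[1]
    return (r, c) if 0 <= r < N and 0 <= c < N else cell

V = np.zeros((N, N))
theta = 1e-4
while True:
    delta = 0.0
    for cell in [(r, c) for r in range(N) for c in range(N) if (r, c) not in TERMINALS]:
        new_v = sum(0.25 * (-1 + V[step(cell, m)]) for m in MOVES)
        delta = max(delta, abs(new_v - V[cell]))
        V[cell] = new_v
    if delta < theta:
        break

print(np.round(V))   # converges to the values in the figure: 0, -14, -20, -22, ...
```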

Interpretation

Let us say an agent acting according to the random policy starts from the cell C2, whose v-value is -18. By the V-function definition, -18 is the total cumulative reward the agent is expected to receive by the end of the episode. Since every move in the maze adds -1 to the reward, we can interpret the v-value of -18 as the agent needing, on average, 18 steps to reach the terminal state.

Policy improvement

At first sight it might sound surprising, but V- and Q-functions can be used to find optimal policies. To understand this, let us return to the maze example where we have calculated the V-function for the starting random policy.

As an illustration, let us take the cell B3. Under our random policy, the agent can go in 4 directions with equal probabilities from that state. The possible expected rewards it can receive are -14, -20, -20 and -14. Suppose that we had the option to change the policy for that state. To maximise the expected reward, would it not be logical to always go from B3 to either A3 or B4, i.e. to the neighbouring cell with the maximum expected reward (-14 in our case)?

Optimal actions from the cell B3 lead to either A3 or B4, where the expected reward reaches its maximum.

This idea makes sense because being positioned at A3 or B4 gives the agent the opportunity to finish the maze in only one step. As a result, we can include that transition rule for B3 to derive a new policy. However, is it always optimal to make such transitions to maximise the expected reward?

Indeed, greedily choosing in every state the action whose expected return is maximal among the possible alternatives leads to a better policy.

To continue our example, let us perform the same procedure for all maze states:

Converged V-function and its corresponding greedy policy from the example. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

As a consequence, we have derived a new policy that is better than the old one. By the way, our findings can be generalized to other problems as well by the policy improvement theorem, which plays a crucial role in reinforcement learning.
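In code, this greedy policy derivation could be sketched as follows, reusing the assumed p(s, a) interface from the policy evaluation sketch above:

```python
def greedy_policy(states, actions, p, V, gamma=1.0):
    """Derive a deterministic policy that, in every state, picks the action
    maximising the one-step look-ahead value (the policy improvement step)."""
    policy = {}
    for s in states:
        q_values = {
            a: sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in p(s, a))
            for a in actions
        }
        policy[s] = max(q_values, key=q_values.get)
    return policy
```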

Policy improvement theorem

Formulation

The formulation from Sutton’s and Barto’s book concisely describes the theorem:

Let π and π’ be any pair of deterministic policies such that, for all s ∈ S,

Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

Then the policy π’ must be as good as, or better than, π. That is, it must obtain a greater or equal expected return from all states s ∈ S:

Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
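In symbols, the two statements of the theorem read:

```latex
q_\pi\bigl(s, \pi'(s)\bigr) \;\ge\; v_\pi(s) \quad \text{for all } s \in \mathcal{S}
\;\;\Longrightarrow\;\;
v_{\pi'}(s) \;\ge\; v_\pi(s) \quad \text{for all } s \in \mathcal{S}.
```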

Logic

To understand the theorem’s formulation, let us assume that we have access to the V- and Q-functions of a given environment evaluated under a policy π. For that environment, we will create another policy π’. This policy will be exactly the same as π, with the only difference that for every state it selects actions that result in either the same or greater returns. Then the theorem guarantees that the V-function under policy π’ will be at least as good as the one under policy π.

With the policy improvement theorem, we can always derive better policies by greedily selecting, in every state, the action that leads to the maximum expected return under the current value function.

Policy iteration

Given any starting policy π, we can compute its V-function. This V-function can then be used to improve the policy to π’. With the policy π’, we can calculate its V’-function. This procedure can be repeated multiple times to iteratively produce better policies and value functions.

In the limit, for a finite number of states, this algorithm, called policy iteration, converges to the optimal policy and the optimal value function.

The iterative alternation between policy evaluation (E) and policy improvement (I). Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
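Combining the evaluation and improvement sketches above, the whole policy iteration loop could look like this (again a sketch under the same assumed interfaces, not the book’s implementation):

```python
def policy_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    """Alternate policy evaluation (E) and greedy policy improvement (I)
    until the policy no longer changes."""
    policy = {s: actions[0] for s in states}           # arbitrary initial policy
    while True:
        # E: evaluate the current deterministic policy
        pi = lambda s: {policy[s]: 1.0}
        V = policy_evaluation(states, actions, p, pi, gamma, theta)
        # I: improve it greedily with respect to V
        new_policy = greedy_policy(states, actions, p, V, gamma)
        if new_policy == policy:                       # policy is stable => optimal
            return new_policy, V
        policy = new_policy
```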

If we applied the policy iteration algorithm to the maze example, the optimal V-function and policy would look like this:

The optimal V-function and policy for the maze example. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

In this setting, with the obtained optimal V-function, we can easily estimate the number of steps required to reach the terminal state when following the optimal strategy.

What is so interesting about this example is the fact that we would only need two policy iterations to obtain these values from scratch (we can notice that the optimal policy in the image is exactly the same as the one we obtained before when we greedily updated the policy with respect to its V-function). In some situations, the policy iteration algorithm requires only a few iterations to converge.

An example of the optimal V-function and policy for a more complex maze environment.

Value iteration

Though the original policy iteration algorithm can be used to find optimal policies, it can be slow, mainly because of the multiple sweeps performed during the policy evaluation steps. Furthermore, full convergence to the exact V-function might require a lot of sweeps.

In addition, it is sometimes not necessary to obtain the exact v-values to yield a better policy. The previous example demonstrates this perfectly: instead of performing many sweeps, we could have done only k = 3 sweeps and then built a policy based on the obtained approximation of the V-function. This policy would have been exactly the same as the one computed after full V-function convergence.

V-function and policy evaluations on the first three iterations. We can see that starting from the third iteration, the policy does not change. This example demonstrates that in some cases it is not necessary to run all iterations of policy iteration. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

In general, is it possible to stop the policy evaluation algorithm at some point? It turns out that yes! Moreover, only a single sweep needs to be performed during every policy evaluation step, and the result will still converge to the optimal policy. The described algorithm is called value iteration.

We are not going to study the proof of this algorithm. However, we can notice that policy evaluation and policy improvement are two very similar processes: both of them use the Bellman equation, except that policy improvement takes the max operation over actions to yield a better action.

By iteratively performing a single sweep of policy evaluation and a single sweep of policy improvement, we converge to the optimum faster. In practice, we can stop the algorithm once the difference between successive V-functions becomes insignificant.
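A minimal value iteration sketch under the same assumed p(s, a) interface, with greedy_policy from the earlier sketch used to extract the final policy:

```python
def value_iteration(states, actions, p, gamma=1.0, theta=1e-6):
    """Combine evaluation and improvement in a single sweep: every update
    takes the max over actions instead of averaging over the policy."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = max(
                sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in p(s, a))
                for a in actions
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < theta:
            break
    return V, greedy_policy(states, actions, p, V, gamma)
```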

Asynchronous value iteration

In some situations, performing even a single full sweep during every step of value iteration can be problematic, especially when the number of states |S| is large. To overcome this, asynchronous versions of the algorithm can be used: instead of systematically updating all states during a whole sweep, only a subset of state values is updated in place, in whatever order. Moreover, some states can be updated multiple times before other states are updated.

However, at some point all of the states will have to be updated to make sure the algorithm converges. According to the theory, all of the states must be updated an infinite number of times in total to guarantee convergence, but in practice this requirement is usually relaxed since we are not always interested in obtaining a 100% optimal policy.

There exist different implementations of asynchronous value iteration. In real problems, they make it possible to trade off efficiently between the algorithm’s speed and accuracy.

One of the simplest asynchronous versions is to update only a single state during every policy evaluation step.
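One possible sketch of such a single-state update, where the state to refresh is picked at random (the sampling strategy here is my own assumption; any ordering that eventually covers all states works):

```python
import random

def async_value_update(states, actions, p, V, gamma=1.0):
    """Update a single, randomly chosen state in place (asynchronous update).
    Repeatedly calling this so that every state keeps being visited
    eventually converges as well."""
    s = random.choice(states)
    V[s] = max(
        sum(prob * (r + gamma * V.get(s2, 0.0)) for prob, s2, r in p(s, a))
        for a in actions
    )
    return V
```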

Generalized policy iteration

We have looked at the policy iteration algorithm. Its idea extends to a broader concept in reinforcement learning called generalized policy iteration (GPI).

GPI consists of finding the optimal policy through the alternation of policy evaluation and policy improvement processes.

Almost all reinforcement learning algorithms can be described as instances of GPI.

Sutton and Barto provide a simplified geometric figure that intuitively explains how GPI works. Let us imagine a 2D plane where every point represents a combination of a value function and a policy. We then draw two lines:

  • The first line contains the points corresponding to the different V-functions of an environment.
  • The second line represents the set of policies that are greedy with respect to the respective V-functions.
Geometric visualisation of policy improvement towards the optimality point. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

Every time we compute a greedy policy for the current V-function, we move closer to the policy line while moving away from the V-function line. That is logical because for the newly computed policy, the previous V-function no longer applies. On the other hand, every time we perform policy evaluation, we move towards the projection of the point onto the V-function line and thus move farther from the policy line: for the newly estimated V-function, the current policy is no longer optimal. The whole process is then repeated.

As these two processes alternate, both the current V-function and the policy gradually improve, and at some point they must reach the point of optimality, which corresponds to the intersection of the V-function and policy lines.

Conclusion

In this article, we have walked through the main ideas behind policy evaluation and policy improvement. The beauty of these two algorithms is their ability to interact with each other to reach the optimal state. This approach only works in perfectly known environments where the agent’s transition probabilities are given for all states and actions. Despite this constraint, many other reinforcement learning algorithms use the GPI method as a fundamental building block for finding optimal policies.

For environments with numerous states, several heuristics can be applied to speed up convergence, one of which involves asynchronous updates during the policy evaluation step. Since the majority of reinforcement learning algorithms require a lot of computational resources, this technique becomes very useful and allows accuracy to be traded efficiently for gains in speed.

All images unless otherwise noted are by the author.
