In the section *Off-policy Monte Carlo Control* of the book *Reinforcement Learning: An Introduction, 2nd Edition* (page 112), the author left us with an interesting exercise: using the weighted importance sampling off-policy Monte Carlo method to find the fastest way to drive on both tracks. This exercise is comprehensive: it asks us to think about and build almost every component of a reinforcement learning task, such as the environment, agent, reward, actions, termination conditions, and the algorithm itself. Solving this exercise is fun and helps us build a solid understanding of the interaction between algorithm and environment, the importance of a correct episodic task definition, and the way the value initialization affects the training outcome. Through this post, I hope to share my understanding of and solution to this exercise with everyone interested in reinforcement learning.

As mentioned above, this exercise asks us to find a policy that makes a race car drive from the starting line to the finishing line as fast as possible without running into gravel or off the track. After carefully reading the exercise description, I listed some key points that are essential to complete this task:

- **Map representation**: maps in this context are actually 2D matrices with (row_index, column_index) as coordinates. The value of each cell represents the state of that cell; for example, we can use 0 for gravel, 1 for the track surface, 0.4 for the starting region, and 0.8 for the finishing line. Any row or column index outside the matrix is regarded as out of bounds.
- **Car representation**: we can directly use the matrix's coordinates to represent the car's position.
- **Velocity and control**: the velocity space is discrete and consists of horizontal and vertical speeds that can be represented as a tuple (row_speed, col_speed). The speed limit on both axes is (-5, 5), and the speed is incremented by +1, 0, or -1 on each axis at each step; therefore, there are a total of 9 possible actions at each step. Note that both speeds can't be zero at the same time except on the starting line, and the vertical velocity, or row speed, can't be positive, as we don't want our car to drive back toward the starting line.
- **Reward and episode**: the reward for every step before crossing the finishing line is -1. When the car runs off the track, it is reset to one of the starting cells. The episode ends **ONLY** when the car successfully crosses the finishing line.
- **Starting states**: we randomly pick a starting cell for the car from the starting line; the car's initial speed is (0, 0), following the exercise's description.
- **Zero-acceleration challenge**: the author proposes a small *zero-acceleration challenge*: at each time step, with probability 0.1, the action takes no effect and the car keeps its previous speed. We can implement this challenge in training instead of adding the feature to the environment.

The solution to the exercise is split into two posts; in this post, we'll focus on building the racetrack environment. The file structure of this exercise is as follows:

```
|-- race_track_env
|   |-- maps
|   |   |-- build_tracks.py       // this file is used to generate track maps
|   |   |-- track_a.npy           // track a data
|   |   |-- track_b.npy           // track b data
|   |-- race_track.py             // race track environment
|-- exercise_5_12_racetrack.py    // the solution to this exercise
```

And the libraries used in this implementation are as follows:

```
python==3.9.16
numpy==1.24.3
matplotlib==3.7.1
pygame==2.5.0
```

We can represent track maps as 2D matrices with different values indicating track states. I want to stay faithful to the exercise, so I'm trying to build the same maps shown in the book by assigning matrix values manually. The maps will be saved as separate *.npy* files so that the environment can read them during training instead of generating them at runtime.
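To illustrate this encoding, here is a minimal sketch of how such a map could be built and saved; the layout below is a toy example of my own, not the book's actual tracks:

```python
import numpy as np

# Cell encoding used throughout this exercise:
# 0.0 = gravel, 1.0 = track surface, 0.4 = starting line, 0.8 = finishing line.
GRAVEL, TRACK, START, FINISH = 0.0, 1.0, 0.4, 0.8

def build_demo_track(n_rows=10, n_cols=8):
    """Build a tiny illustrative track (not the book's layout).

    The starting line sits on the bottom row and the finishing line on the
    right edge of the top rows; everything else is track surface here.
    """
    track = np.full((n_rows, n_cols), TRACK)
    track[-1, :] = START      # bottom row: starting line
    track[:3, -1] = FINISH    # right edge of the top rows: finishing line
    return track

if __name__ == "__main__":
    track = build_demo_track()
    np.save("demo_track.npy", track)  # saved so the environment can load it later
```

The real `build_tracks.py` assigns the gravel cells manually to reproduce the book's figure, but the value encoding and the `np.save`/`np.load` round trip are the same.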

The two maps look as follows; the light cells are gravel, the dark cells are track surface, the green-ish cells represent the finishing line, and the light-blue-ish cells represent the starting line.

With the track maps ready, we now proceed to create a gym-like environment with which the algorithm can interact. Gymnasium, previously OpenAI Gym and now maintained by the *Farama Foundation*, provides a simple interface for testing RL algorithms; we'll follow the *gymnasium* interface to create the racetrack environment. Our environment will include the following components/features:

- **env.nS**: the shape of the observation space, which is (num_rows, num_cols, row_speed, col_speed) in this example. The number of rows and columns varies between maps, but the speed space is consistent across tracks. For the row speed, as we don't want the car to drive back toward the starting line, the row speed observations consist of [-4, -3, -2, -1, 0] (negative values because we expect the car to go upwards in the map), five values in total; the column speed has no such limit, so the observations range from -4 to 4, nine values in total. Therefore, the shape of the observation space in the racetrack example is (num_rows, num_cols, 5, 9).
- **env.nA**: the number of possible actions. There are 9 possible actions in our implementation; we'll also create a dictionary in the environment class that maps each integer action to a (row_speed, col_speed) tuple to control the agent.
- **env.reset()**: the function that takes the car back to one of the starting cells when the episode finishes or the car runs off the track.
- **env.step(action)**: the step function enables the algorithm to interact with the environment by taking an action and returning a tuple of (next_state, reward, is_terminated, is_truncated).
- State-checking functions: two private functions that check whether the car has left the track or crossed the finishing line.
- Rendering functions: a rendering function is essential for a custom environment from my point of view; I always check whether the environment has been built properly by rendering the game space and the agent's behavior, which is more straightforward than just staring at logging information.

With these features in mind, we can start building the racetrack environment.
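As a rough sketch of such an environment under the assumptions above: the class name and the details below are simplified stand-ins for the real `race_track.py` (which also handles path crossing on the finishing line, the rendering functions, and pygame), but the interface matches the components listed:

```python
import numpy as np

class RaceTrack:
    """Minimal gym-style racetrack sketch (illustrative, not the full implementation).

    Observations are (row, col, row_speed, col_speed); the 9 integer actions
    map to (row_acc, col_acc) increments of -1/0/+1 on each axis.
    """

    def __init__(self, track):
        self.track = track                 # 2D map: 0 gravel, 1 track, 0.4 start, 0.8 finish
        self.nS = (*track.shape, 5, 9)     # observation-space shape
        self.nA = 9
        # Map integer actions to (row_acc, col_acc) tuples.
        self.actions = {i: (i // 3 - 1, i % 3 - 1) for i in range(9)}
        self.start_cells = [(int(r), int(c)) for r, c in zip(*np.where(track == 0.4))]
        self.reset()

    def reset(self):
        # Put the car on a random starting cell with zero speed.
        row, col = self.start_cells[np.random.randint(len(self.start_cells))]
        self.state = (row, col, 0, 0)
        return self.state

    def step(self, action):
        row, col, rs, cs = self.state
        drs, dcs = self.actions[action]
        # Row speed stays in [-4, 0] (the car only moves up), col speed in [-4, 4].
        rs = int(np.clip(rs + drs, -4, 0))
        cs = int(np.clip(cs + dcs, -4, 4))
        if rs == 0 and cs == 0 and self.track[row, col] != 0.4:
            rs = -1  # simplification: forbid stalling anywhere off the starting line
        row, col = row + rs, col + cs
        # Off the map or onto gravel: send the car back to a starting cell.
        if not (0 <= row < self.track.shape[0] and 0 <= col < self.track.shape[1]) \
                or self.track[row, col] == 0.0:
            self.state = self.reset()
            return self.state, -1.0, False, False
        terminated = bool(self.track[row, col] == 0.8)  # reached the finishing line
        self.state = (row, col, rs, cs)
        return self.state, -1.0, terminated, False
```

Note that this sketch only checks the landing cell; the real environment also checks whether the projected path crosses the finishing line or the boundary between cells.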

So far, with everything needed for the racetrack environment ready, we can test/verify the environment setup.

First, we render the gameplay to check whether every component is working smoothly:

Then we turn off the render option to let the environment run in the background and check whether the trajectories terminate properly:

`// Track a`

| Observation | Reward | Terminated | Total reward |
|---|---|---|---|
| (1, 16, 0, 3) | -1 | True | -4984 |
| (2, 16, 0, 2) | -1 | True | -23376 |
| (3, 16, 0, 3) | -1 | True | -14101 |
| (1, 16, 0, 3) | -1 | True | -46467 |

`// Track b`

| Observation | Reward | Terminated | Total reward |
|---|---|---|---|
| (1, 31, -2, 2) | -1 | True | -3567 |
| (0, 31, -4, 4) | -1 | True | -682 |
| (2, 31, -2, 1) | -1 | True | -1387 |
| (3, 31, -1, 3) | -1 | True | -2336 |

With the environment ready, we can now implement the off-policy MC control algorithm with weighted importance sampling, as shown below:

Monte Carlo methods solve the reinforcement learning problem by averaging sample returns. In training, the method generates a trajectory based on a given policy and learns from the tail of each episode. The difference between on-policy and off-policy learning is that on-policy methods use one policy for both decision making and improvement, whereas off-policy methods use separate policies for generating data and for policy improvement. According to the author of the book, almost all off-policy methods utilize importance sampling to estimate expected values under one distribution given samples from another. [2]

The next several sections will explain the tricks of the soft policy and weighted importance sampling appearing in the algorithm above, and how essential a proper value initialization is to this task.

The off-policy approach of this algorithm uses two policies:

- **Target policy**: the policy being learned about. In this algorithm, the target policy is greedy and deterministic, which means the probability of the greedy action *A* at time *t* is 1.0, with all other actions' probabilities equal to 0.0. The target policy follows Generalized Policy Iteration (GPI), improving after every step.
- **Behavior policy**: the policy used to generate behavior. The behavior policy in this algorithm is a *soft policy*, meaning that all actions in a given state have a non-zero probability. We'll use the ε-greedy policy as our behavior policy here.

In a soft policy, the probability of an action is:
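For an ε-greedy behavior policy over the 9 actions, these probabilities take the standard form (written here for reference):

```latex
b(a \mid s) =
\begin{cases}
1 - \varepsilon + \dfrac{\varepsilon}{|\mathcal{A}(s)|}, & \text{if } a = \arg\max_{a'} Q(s, a') \\[4pt]
\dfrac{\varepsilon}{|\mathcal{A}(s)|}, & \text{otherwise}
\end{cases}
```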

We should also return this probability when generating actions, for importance sampling purposes. The code for the behavior policy looks as follows:
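A minimal sketch of such a behavior policy (the function name and signature are my own; the real code indexes `Q` by the environment's state tuple):

```python
import numpy as np

def behavior_policy(state, Q, epsilon=0.1, n_actions=9):
    """epsilon-greedy behavior policy that also returns the action's probability.

    Returning b(a|s) alongside the chosen action is what lets the caller
    form the importance-sampling weight later. `Q` maps a state to an
    array of 9 action values.
    """
    greedy = int(np.argmax(Q[state]))
    probs = np.full(n_actions, epsilon / n_actions)
    probs[greedy] += 1.0 - epsilon          # greedy action gets the extra mass
    action = int(np.random.choice(n_actions, p=probs))
    return action, probs[action]
```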

We estimate the relative probability of the trajectory under the target policy versus under the behavior policy, and the equation for this probability is:
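This is the importance-sampling ratio from the book: the product of per-step probability ratios from time $t$ to the end of the episode $T$:

```latex
\rho_{t:T-1} \doteq \prod_{k=t}^{T-1} \frac{\pi(A_k \mid S_k)}{b(A_k \mid S_k)}
```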

The value function for weighted importance sampling is:
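That is, the returns are averaged with the ratios $\rho$ as weights, where $\mathcal{T}(s)$ is the set of time steps in which state $s$ is visited:

```latex
V(s) \doteq \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} \, G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}
```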

We can reframe the equation to fit the episodic task in the form of incremental updates:
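The book derives this incremental rule for $V$; the algorithm applies the same form to the action values $Q$, with $W_n$ the importance weight of the $n$-th return and $C_n$ the cumulative sum of weights ($C_0 = 0$):

```latex
Q_{n+1} \doteq Q_n + \frac{W_n}{C_n}\left(G_n - Q_n\right), \qquad C_{n+1} \doteq C_n + W_n
```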

There are plenty of excellent examples of how to derive the above equation, so we won't spend time deriving it here again. But I do want to explain the interesting trick in the last several lines of the algorithm:

The trick relies on the fact that the target policy here is deterministic. If the action taken at a time step differs from the one the target policy would choose, then the probability of that action w.r.t. the target policy is 0.0; thus, all the earlier steps of the trajectory no longer update the state-action value function, and the inner loop exits. On the contrary, if the action at that time step is the same as the target policy's, then the probability is 1.0, and the action-value update continues.
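A minimal sketch of this backward pass (the data structures and names are my own; the real function is folded into the full training loop):

```python
import numpy as np

def update_from_episode(trajectory, Q, C, gamma=1.0):
    """Backward pass of off-policy MC control with weighted importance sampling.

    `trajectory` is a list of (state, action, reward, behavior_prob) tuples;
    `Q` and `C` map a state to arrays of action values / cumulative weights.
    Because the target policy is deterministic and greedy, the weight W is
    divided by b(a|s) only, and the loop breaks as soon as the taken action
    differs from the greedy one (its target-policy probability would be 0).
    """
    G, W = 0.0, 1.0
    for state, action, reward, b_prob in reversed(trajectory):
        G = gamma * G + reward
        C[state][action] += W
        Q[state][action] += (W / C[state][action]) * (G - Q[state][action])
        if action != int(np.argmax(Q[state])):
            break           # pi(a|s) = 0: earlier time steps get zero weight
        W /= b_prob         # pi(a|s) = 1 for the greedy action
    return Q, C
```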

Proper value initialization plays a vital role in successfully solving this exercise. Let's first take a look at the rewards on both tracks from a random policy:

`// Track a`

| Observation | Reward | Terminated | Total reward |
|---|---|---|---|
| (1, 16, 0, 3) | -1 | True | -4984 |
| (2, 16, 0, 2) | -1 | True | -23376 |
| (3, 16, 0, 3) | -1 | True | -14101 |
| (1, 16, 0, 3) | -1 | True | -46467 |

`// Track b`

| Observation | Reward | Terminated | Total reward |
|---|---|---|---|
| (1, 31, -2, 2) | -1 | True | -3567 |
| (0, 31, -4, 4) | -1 | True | -682 |
| (2, 31, -2, 1) | -1 | True | -1387 |
| (3, 31, -1, 3) | -1 | True | -2336 |

The reward at every time step before crossing the finishing line is -1. At the early stage of training, the returns are large negative values; if we initialize a state-action value to 0, or to random values from a normal distribution with, say, a mean of 0 and a variance of 1, the values are then far too optimistic, and it can take a very long time for the program to find an optimal policy.

After several tests, I found that a normal distribution with a mean of -500 and a variance of 1 gives effective values for faster convergence; you are encouraged to try other values and can certainly find a better initial value than this.
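As a one-line sketch of this pessimistic initialization (the map size below is an assumption for illustration):

```python
import numpy as np

# Pessimistic initialization: mean -500, standard deviation 1.
# The array shape follows the observation space (num_rows, num_cols,
# 5 row speeds, 9 col speeds) plus the 9 actions.
num_rows, num_cols = 30, 32  # assumed map size; varies per track
Q = np.random.normal(loc=-500.0, scale=1.0,
                     size=(num_rows, num_cols, 5, 9, 9))
```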

With all the thoughts above in mind, we can program the off-policy Monte Carlo control function and use it to solve the racetrack exercise. We'll also add the zero-acceleration challenge mentioned in the exercise description.

Then we train the policies for 1,000,000 episodes, without and with the zero-acceleration challenge, on both tracks, and evaluate them with the following code:

By plotting the training record, we find that without the zero-acceleration challenge the policy converges around the 10,000th episode on both tracks; adding zero acceleration makes convergence slower and unstable before the 100,000th episode, but it still converges afterward.

Finally, we can ask the agents to play the game with the trained policies, and we also plot the possible trajectories to further check the correctness of the policies.

Sample trajectories for track a:

Sample trajectories for track b:

So far, we've solved the racetrack exercise. This implementation may still have some problems, and you're very welcome to point them out and discuss better solutions in the comments. Thanks for reading!

[1] Midjourney Terms of Service: https://docs.midjourney.com/docs/terms-of-service

[2] Sutton, Richard S., and Andrew G. Barto. *Reinforcement learning: An introduction*. MIT Press, 2018.

My GitHub repo of this exercise: [Link]; please check the “exercise section”.