Reinforcement Learning: Introduction and Primary Concepts
Introduction
About this article
Reinforcement learning framework
Reward types
Kinds of tasks
Policies and value functions
Bellman equation
Optimal policy
Conclusion
Resources

Making the first step into the world of reinforcement learning


Reinforcement learning is a special domain in machine learning that differs considerably from the classic methods used in supervised or unsupervised learning.

The ultimate objective is to develop a so-called agent that can perform optimal actions in an environment. From the start, the agent usually performs very poorly, but over time it adapts its strategy through trial and error by interacting with the environment.

The beauty of reinforcement learning is that the same algorithms can be used to make the agent adapt to completely different, unknown, and complex environments.

Reinforcement learning has a wide range of applications and is mostly used when a given problem cannot be solved by classic methods:

  • Games. Existing approaches can design optimal game strategies and outperform humans. The most well-known examples are chess and Go.
  • Robotics. Advanced algorithms can be incorporated into robots to help them move, carry objects, or complete routine tasks at home.
  • Autopilot. Reinforcement learning methods can be developed to autonomously drive cars, control helicopters, or pilot drones.
Some of the reinforcement learning applications

Though reinforcement learning is a very exciting and unique area, it is still one of the most sophisticated topics in machine learning. In addition, it is absolutely critical to understand all of its basic terminology and concepts from the start.

For these reasons, this article introduces only the key theoretical concepts and ideas that will help the reader advance further in reinforcement learning.

Moreover, this article is based on the third chapter of the famous book “Reinforcement Learning” by Richard S. Sutton and Andrew G. Barto, which I would highly recommend to everyone interested in delving deeper.

Apart from that, the book contains practical exercises, whose solutions can be found in this repository.

First of all, let us understand the reinforcement learning framework, which contains several important terms:

  • Agent represents an object whose goal is to learn a strategy to optimize a certain process;
  • Environment acts as the world in which the agent is situated and consists of a set of different states;
  • At each timestamp, the agent can perform an action in the environment that changes the environment’s state to a new one. In addition, the agent receives feedback indicating how good or bad the chosen action was. This feedback is called a reward and is represented in numerical form.
  • By using feedback from different states and actions, the agent gradually learns the optimal strategy to maximize the total reward over time. A minimal interaction loop is sketched in the code below.
Reinforcement learning framework. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
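To make this loop concrete, here is a minimal Python sketch of the agent-environment interaction described above. The ToyEnvironment class, its step method, and the random action choice are hypothetical placeholders invented for illustration; they are not part of the original article or any specific library.

```python
import random

# Hypothetical toy environment: states are the integers 0..4, state 4 is terminal.
# Moving right (+1) yields a reward of 1, moving left (-1) yields 0.
class ToyEnvironment:
    def __init__(self):
        self.state = 0

    def step(self, action):
        self.state = max(0, min(4, self.state + action))
        reward = 1 if action == 1 else 0
        done = self.state == 4
        return self.state, reward, done

env = ToyEnvironment()
total_reward = 0
done = False
while not done:
    action = random.choice([-1, 1])          # the agent picks an action (randomly, no learning yet)
    state, reward, done = env.step(action)   # the environment returns the new state and a reward
    total_reward += reward                   # the feedback accumulates into the total reward

print("Episode finished with total reward:", total_reward)
```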

In many cases, given the current state and the agent’s action in that state, the transition to a new state can result in different rewards (rather than a single one), each occurring with its own probability.

The formula below accounts for this fact by summing over all possible next states and the rewards that correspond to them.

For a given state and action, the sum of the probabilities of transitioning to every possible next state s’ with reward r is equal to 1. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
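Since the equation itself appears only as an image in the original article, here is a reconstruction in standard Sutton and Barto notation, where p(s', r | s, a) denotes the probability of transitioning to state s' with reward r after taking action a in state s:

```latex
\sum_{s' \in S} \sum_{r \in R} p(s', r \mid s, a) = 1
\quad \text{for all } s \in S,\; a \in A(s)
```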

To make things clearer, we are going to use the prime symbol to designate a variable at its next step. For example, if s represents the agent’s current state, then s’ will refer to the agent’s next state.

To formally define the total reward in the long run, we need to introduce the notion of the “cumulative reward” (also called “return”), which can take several forms.

Simple formulation

Let us denote by Rᵢ the reward received by the agent at timestamp i. Then the cumulative reward can be defined as the sum of rewards received between the next timestamp and the final timestamp T:

Cumulative reward. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
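Reconstructed in standard Sutton and Barto notation (the original formula is shown as an image), the cumulative reward, or return, at timestamp t reads:

```latex
G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots + R_T
```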

Discounted cumulative reward

Most of the time, the discounted cumulative reward is used. It represents the same reward system as before, except that every individual reward in the sum is now multiplied by an exponentially decaying discount coefficient.

Discounted cumulative reward. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

The parameter γ (also sometimes denoted as α) in the formula above is called the discount rate and can take a value between 0 and 1. The introduction of the discounted reward ensures that the agent prioritizes actions that result in more short-term rewards. Finally, the discounted cumulative reward can be expressed in recursive form:

Recursive equation for discounted cumulative reward. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
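Reconstructed in the same notation, the discounted return and its recursive form read as follows (γ is the discount rate):

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1},
\qquad
G_t = R_{t+1} + \gamma G_{t+1}
```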

Episodic tasks

In some cases, the interaction between the agent and the environment can consist of a set of independent episodes. In this scenario, every episode starts independently of the others, and its initial state is sampled from a distribution of starting states.

For instance, imagine that we want the agent to learn the optimal strategy to play a game. To do that, we will run a set of independent games in each of which a robot can either win or lose. The rewards received in every episode will gradually influence the strategy that the robot will use in the following games.

Episodes are also known as trials.

Continuing tasks

At the same time, not all types of tasks are episodic: some of them can be continuing, meaning that they do not have a terminal state. In such cases, it is not always possible to define the cumulative return because the number of timestamps is infinite.

Policy

A policy π is a mapping from every possible state s ∈ S to the probabilities of performing each possible action a ∈ A from that state s.

If an agent follows a policy π, then the agent’s probability p(a | s) of performing action a from state s is equal to π(a | s).

By definition, any policy can be represented in the form of a table of size |S| x |A|.

Let us take a look at the example of the maze game below. The agent is initially located at the A1 cell. At each step, the agent has to move either horizontally or vertically (not diagonally) to an adjacent cell. The game ends when the agent reaches the terminal state located at C1. The cell A3 contains a large reward that the agent can collect if it steps on it. The cells B1 and C3 are maze walls that cannot be reached.

Maze example with 7 possible states (the cells B1 and C3 are maze walls). The game starts with the agent placed at A1 and ends when the agent reaches C1. The cell A3 contains a large reward.

One of the simplest policies that can be used is the random policy: at each state, the agent moves to any allowed cell with equal probability. The corresponding policy for this strategy is illustrated in the figure above; a possible tabular representation is sketched in the code below.
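Here is a minimal Python sketch of how such a random policy could be stored as a table. The cell names and walls follow the maze description above; the neighbours helper and the exact grid adjacency rules are my own assumptions for illustration purposes.

```python
import itertools

# Assumed grid layout: columns A-C, rows 1-3; B1 and C3 are walls, C1 is terminal.
COLS, ROWS = "ABC", (1, 2, 3)
WALLS, TERMINAL = {"B1", "C3"}, "C1"

def neighbours(cell):
    """Allowed horizontal/vertical moves from a cell (walls and off-grid cells excluded)."""
    col, row = cell[0], int(cell[1])
    candidates = [
        (chr(ord(col) - 1), row), (chr(ord(col) + 1), row),  # left, right
        (col, row - 1), (col, row + 1),                      # down, up
    ]
    return [f"{c}{r}" for c, r in candidates
            if c in COLS and r in ROWS and f"{c}{r}" not in WALLS]

# Random policy: from every reachable non-terminal state, move to any
# allowed adjacent cell with equal probability.
states = [f"{c}{r}" for c, r in itertools.product(COLS, ROWS) if f"{c}{r}" not in WALLS]
policy = {s: {cell: 1 / len(neighbours(s)) for cell in neighbours(s)}
          for s in states if s != TERMINAL}

print(policy["A1"])  # {'A2': 1.0}, since B1 is a wall
```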

The demonstrated maze is also an example of an episodic task. After reaching the terminal state and obtaining a certain reward, a new independent game can be initialized.

Apart from policies, in reinforcement learning it is common to use the notion of value functions, which describe how good or bad (in terms of the expected reward) it is for the agent to be in a given state or to take a certain action given the current state.

State-value function

The state-value function v(s) (or simply V-function) is a mapping from every environment state to the expected cumulative reward the agent would receive if it were initially placed in that state and then followed a certain policy π.

The V-function can be represented as a 1-dimensional table of size |S|.

The V-function outputs the expected reward for an input state s under the condition that the agent follows the policy π. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
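In standard Sutton and Barto notation (reconstructed here, since the formula appears as an image in the original article), this definition reads:

```latex
v_{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]
```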

To better understand the definition of the V-function, let us refer to the previous example. We can see that cells located in the neighbourhood of A3 (which are A2, A3 and B3) have higher V-values than those located farther from it (like A1, B2 and C2). This makes sense because, being located near the large reward at A3, the agent has a higher chance of collecting it.

V-function example. Every game state corresponds to a cumulative reward the agent would receive if it were initially placed in it.

The V-value of terminal states is equal to 0.

Action-value function

Action-value functions are similar in concept to state-value functions. However, they also take into account the possible action the agent can take under a given policy.

The action-value function q(s, a) (or simply Q-function) is a mapping from each environment state s ∈ S and each possible agent action a ∈ A to the expected reward the agent would receive if it were initially placed in that state, took that action, and then followed a certain policy π.

The Q-function can be represented in the form of a table of size |S| x |A|.

Q-function definition. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
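Reconstructed in the same notation, the Q-function definition reads:

```latex
q_{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s,\, A_t = a \right]
```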
Q-function example. For every pair (state, action), the Q-function outputs the corresponding expected reward.

The difference between state-value and action-value functions lies only in the fact that the action-value function takes additional information about the action the agent is going to take in the current state. The state-value function only considers the current state and does not take into account the agent’s next action.

Both V- and Q-functions are learned from the agent’s experience.

A subtlety about V- and Q-values

Why is q(s, a) ≠ v(s’), i.e. why is the expected reward of an agent being in state s and taking the action a leading to the next state s’ not equal to the expected reward of an agent being in state s’?

This question may appear logical at first. Indeed, let us take the agent from the example above that is at cell B2 and assume that it then makes a transition to B3. From the Q-table we can see that the expected reward q(B2, “up”) = 5.6. At the same time, the V-table shows the expected reward v(B3) = 6.2. While 5.6 is relatively close to 6.2, both values are still not equal. So the ultimate question is: why is q(B2, “up”) ≠ v(B3)?

The answer to this question lies in the fact that, even if choosing an action in the current state s deterministically leads to the next state s’, the reward received by the agent from that transition is taken into account by the Q-function but not by the V-function. In other words, if the current timestamp is t, then the expected reward q(s, a) will consider the discounted sum starting from step t: Rₜ + γRₜ₊₁ + … . The expected reward corresponding to v(s’) will not have the term Rₜ in its sum: Rₜ₊₁ + γRₜ₊₂ + … .

It’s price moreover noting that sometimes an motion a taken at some state s can result in multiple possible next states. The straightforward maze example above doesn’t show this idea. But we are able to add the next condition, as an illustration, to the agent’s actions: when the agent chooses a direction to maneuver within the maze, there’s a 20% probability that the sunshine in the brand new cell shall be turned off and since of that the agent will ultimately move by 90° relatively to that direction.

This concept demonstrates how the same action taken in the same state can lead to different next states. As a consequence, the rewards received by the agent for the same action and state can differ. That is another aspect that contributes to the inequality between q(s, a) and v(s’).
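Both aspects are captured by the standard relationship between the two functions, written here in Sutton and Barto notation as a clarifying addition: the Q-value contains the immediate reward r and averages over all possible next states s’, neither of which appears in v(s’) alone.

```latex
q_{\pi}(s, a) = \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma v_{\pi}(s') \right]
```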

The Bellman equation is one of the fundamental equations in reinforcement learning! In simple words, it recursively relates the state / action function values at the current and next timestamps.

V-function

By using the definition of the expected value, we can rewrite the expression for the state-value function in terms of the V-function of the next step:

Bellman equation for the V-function. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

What this equality states is simply the fact that the v-value of the current state s equals the expected value of the sum of the reward received by the agent from transitioning from state s to the next state s’ and the discounted v-value of that next state s’.
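For reference, the Bellman equation for the V-function in standard Sutton and Barto notation (a reconstruction of the image referenced above) is:

```latex
v_{\pi}(s) = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma v_{\pi}(s') \right]
```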

In their book, Richard S. Sutton and Andrew G. Barto use so-called “backup diagrams” that make it easier to understand the flow of value functions and capture the logic behind the probability multiplications that take place in the Bellman equation. The one used for the V-function is shown below.

Backup diagram for the V-function. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto

The Bellman equation plays an important role in computing and approximating the V-function.

Q-function

Similarly to V-functions, we can derive the Bellman equation for Q-functions.

Bellman equation for the Q-function. Source: Reinforcement Learning. Exercise Solutions | GitHub repository (@LyWangPX).
Backup diagram for the Q-function. Image adapted by the author. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
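For reference, the Bellman equation for the Q-function can be written as follows (reconstructed in the same notation, since the original formula is an image):

```latex
q_{\pi}(s, a) = \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma \sum_{a'} \pi(a' \mid s')\, q_{\pi}(s', a') \right]
```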

Let us define a comparison operation between different policies.

A policy π₁ is said to be better than or equal to a policy π₂ if the expected reward of π₁ is greater than or equal to the expected reward of π₂ for all states s ∈ S.

A policy π⁎ is said to be optimal if it is better than or equal to any other policy.

Every optimal policy also has the optimal V⁎- and Q⁎-functions associated with it.

Bellman optimality equation

We can rewrite the Bellman equations for optimal policies. In reality, they look very similar to the normal Bellman equations we saw before, except that the policy term π(a|s) is removed and the max function is added to deterministically obtain the maximum reward by choosing the best action a from the current state s.

Bellman optimality equation for the V-function. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
Bellman optimality equation for the Q-function. Source: Reinforcement Learning. An Introduction. Second Edition | Richard S. Sutton and Andrew G. Barto
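Reconstructed in standard Sutton and Barto notation, the two Bellman optimality equations read:

```latex
v_{*}(s) = \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma v_{*}(s') \right]

q_{*}(s, a) = \sum_{s',\, r} p(s', r \mid s, a)\left[ r + \gamma \max_{a'} q_{*}(s', a') \right]
```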

These equations can be solved mathematically for every state. If either the optimal V⁎-function or the optimal Q⁎-function is found, the optimal policy π⁎ can also be easily derived: it always greedily selects the actions that maximise the expected reward, as sketched in the code below.
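As a small illustration of this greedy step, here is a Python sketch that extracts a policy from a Q-table. The q_star dictionary, its state names, and its values are hypothetical and do not correspond to the maze tables above.

```python
# Hypothetical optimal Q-table: q_star[state][action] -> expected return.
q_star = {
    "s1": {"up": 3.2, "right": 1.1},
    "s2": {"up": 6.5, "down": 2.4, "right": 4.0},
}

# Greedy policy extraction: in every state, pick the action with the highest Q-value.
def greedy_policy(q_table):
    return {state: max(actions, key=actions.get) for state, actions in q_table.items()}

pi_star = greedy_policy(q_star)
print(pi_star)  # {'s1': 'up', 's2': 'up'}
```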

Unfortunately, it is very hard in practice to solve the Bellman equations mathematically, because the number of states in most problems is usually huge. For this reason, reinforcement learning methods are used that can approximately compute optimal policies with far fewer computations and lower memory requirements.

In this article, we have discussed how agents learn through experience via the trial and error approach. The simplicity of the introduced reinforcement learning framework generalizes well to many problems, yet provides a flexible way to use the notions of policy and value functions. At the same time, the ultimate objective of an algorithm is to calculate the optimal V- and Q-functions maximizing the expected reward.

Many of the existing algorithms try to approximate the optimal policy function. While the exact solution is almost impossible to obtain in real-world problems due to memory and computation constraints, approximation methods work very well in most situations.

All images unless otherwise noted are by the author.
