Some of the most important breakthroughs in artificial intelligence are inspired by nature, and the RL paradigm is no exception. This straightforward yet powerful concept is closest to how we humans learn and might be seen as an essential element of what we would expect from an artificial general intelligence: learning through trial and error. This approach to learning teaches us about cause and effect, and the way our actions affect our surroundings. It teaches us how our behaviour can either harm or benefit us, and enables us to develop strategies to achieve our long-term goals.

## What’s RL?

The RL paradigm is a robust and versatile machine learning approach that allows decision makers to learn from their interactions with the environment. It draws from a wide range of ideas and methodologies for finding an optimal strategy to maximise a numerical reward. With a long history of connections to other scientific and engineering disciplines, research in RL is well established. However, while there is a wealth of academic success, practical applications of RL in industry remain rare. The most famous examples of RL in action are computers achieving super-human performance on games such as chess and Go, as well as on titles like Atari and Starcraft. Recently, however, we have seen a growing number of industries adopt RL methods.

## How is it used today?

Despite the low level of industrial adoption of RL, there are some exciting applications in fields such as:

- Health: Dynamic treatment regime; automated diagnosis; drug discovery
- Finance: Trading; dynamic pricing; risk management
- Transportation: Adaptive traffic control; autonomous driving
- Recommendation: web search; news recommendation; product recommendation
- Natural Language Processing: text summarization; question answering; machine translation; dialog generation

A good way to gain an understanding of RL use cases is to consider an example problem. Let us imagine we are trying to help our friend learn to play a musical instrument. Each morning, our friend tells us how motivated they feel and how much they learned during yesterday’s practice, and asks us how they should proceed. For reasons unknown to us, our friend has a limited set of studying choices: taking a day off, practicing for one hour, or practicing for three hours.

After observing our friend’s progress, we have noticed a few interesting characteristics:

- It seems that the progress our friend is making is directly correlated with the number of hours they practice.
- Consistent practice sessions make our friend progress faster.
- Our friend doesn’t do well with long practice sessions. Whenever we instructed them to study for three hours, the next day they felt drained and unmotivated to continue.

From our observations, we have created a graph modeling their learning progress using state machine notation.

Let us discuss our findings again based on our model:

- Our friend has three distinct emotional states: neutral, motivated, and demotivated.
- On any given day, they can choose to practice for zero, one, or three hours, except when they are feeling demotivated, in which case studying for zero hours (that is, not studying) is their only available option.
- Our friend’s mood is predictive: in the neutral state, practicing for one hour will make them feel motivated the next day, practicing for three hours will leave them feeling demotivated, and not practicing at all will keep them in a neutral state. Conversely, in the motivated state, one hour of practice will maintain our friend’s motivation, while three hours of practice will demotivate them and no practice at all will leave them feeling neutral. Lastly, in the demotivated state, our friend will refrain from studying altogether, resulting in them feeling neutral the following day.
- Their progress is heavily influenced by their mood and the amount of practice they put in: the more motivated they are and the more hours they dedicate to practice, the faster they will learn and grow.

Why did we structure our findings like this? Because it helps us model our problem using a mathematical framework called finite *Markov decision processes* (MDPs). This approach helps us gain a better understanding of the problem and how best to address it.

## Markov Decision Processes

Finite MDPs provide a useful framework to model RL problems, allowing us to abstract away from the specifics of a given problem and formulate it in a way that can be solved with RL algorithms. In doing so, we are able to transfer learnings from one problem to another, instead of having to theorise about each problem individually. This helps us simplify the process of solving complex RL problems. Formally, a finite MDP is a control process defined by a four-tuple:

The four-tuple (*S*, *A*, *P*, *R*) defines four distinct components, each of which describes a specific aspect of the system. *S* and *A* define the sets of *states* and *actions* respectively, while *P* denotes the *transition function* and *R* the *reward function*. In our example, we define our friend’s mood as our set of states *S* and their practice decisions as our set of actions *A*. The transition function *P*, visualised by the arrows in the graph, shows us how our friend’s mood will change depending on the amount of studying they do. Finally, the reward function *R* is used to measure the progress our friend has made, which is influenced by their mood and the practice decisions they make.
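To make this concrete, here is a minimal sketch of the four-tuple as plain Python data structures. The state and action names follow the example above; note that the knowledge gained from three-hour sessions is an illustrative assumption, since the text only pins down the zero-hour and one-hour rewards.

```python
# S: the set of states (our friend's moods)
S = ["neutral", "motivated", "demotivated"]

# A: the actions available in each state (hours of practice)
A = {
    "neutral": [0, 1, 3],
    "motivated": [0, 1, 3],
    "demotivated": [0],  # skipping practice is the only option
}

# P: the transition function, here deterministic: (state, action) -> next state
P = {
    ("neutral", 0): "neutral",
    ("neutral", 1): "motivated",
    ("neutral", 3): "demotivated",
    ("motivated", 0): "neutral",
    ("motivated", 1): "motivated",
    ("motivated", 3): "demotivated",
    ("demotivated", 0): "neutral",
}

# R: the reward function, (state, action) -> units of knowledge gained.
# The three-hour values are assumptions for illustration only.
R = {
    ("neutral", 0): 0,
    ("neutral", 1): 1,
    ("neutral", 3): 2,    # assumed
    ("motivated", 0): 0,
    ("motivated", 1): 2,
    ("motivated", 3): 3,  # assumed
    ("demotivated", 0): 0,
}
```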

## Policies and value functions

Given the MDP, we can now develop strategies for our friend. Drawing on the wisdom of our favourite cooking podcast, we are reminded that to master the art of cooking one must develop a routine of practicing a little each day. Inspired by this idea, we develop a strategy for our friend that advocates a consistent practice schedule: practice for one hour each day. In RL theory, strategies are known as *policies* or *policy functions*, and are defined as mappings from the set of states to the probabilities of each possible action in that state. Formally, a policy *π* is a probability distribution over actions *a* given state *s*.
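In the notation of Sutton and Barto, this definition would read (a reconstruction, since the original formula is not rendered here):

```latex
\pi(a \mid s) \doteq \Pr\{A_t = a \mid S_t = s\}
```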

To adhere to the “practice a little each day” mantra, we establish a policy with a 100% probability of practicing for one hour in both the neutral and motivated states. In the demotivated state, however, we skip practice 100% of the time, as it is the only available action. This example demonstrates that policies can be deterministic: instead of returning a full probability distribution over available actions, they return a degenerate distribution in which a single action is taken exclusively.
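As a minimal sketch, this deterministic policy can be written as a mapping from states to degenerate action distributions (state and action names as in the example above):

```python
# Deterministic "practice one hour a day" policy: each state maps to a
# degenerate probability distribution over actions (hours of practice).
policy = {
    "neutral":     {1: 1.0},  # always practice one hour
    "motivated":   {1: 1.0},  # always practice one hour
    "demotivated": {0: 1.0},  # skipping practice is the only option
}

def act(state):
    """Return the action with the highest probability in `state`."""
    distribution = policy[state]
    return max(distribution, key=distribution.get)

print(act("neutral"))  # 1
```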

As much as we trust our favourite cooking podcast, we would like to know how well our friend is doing by following our strategy. In RL lingo, we speak of evaluating our policy, which is done using the *value function*. To get a first impression, let us calculate how much knowledge our friend gains by following our strategy for ten days. Assuming they start the practice feeling neutral, they will gain one unit of knowledge on the first day and two units of knowledge every day thereafter, resulting in a total of 19 units. Conversely, if our friend had already been motivated on the first day, they would have gained 20 units of knowledge, and if they had started out feeling demotivated, they would have gained only 17 units.
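We can sanity-check these numbers with a short simulation. The transition and reward tables below are reconstructed from the description above, restricted to the actions our policy actually takes:

```python
# Dynamics under the "one hour a day" policy, reconstructed from the example.
transition = {
    ("neutral", 1): "motivated",
    ("motivated", 1): "motivated",
    ("demotivated", 0): "neutral",
}
reward = {
    ("neutral", 1): 1,    # one unit of knowledge
    ("motivated", 1): 2,  # two units of knowledge
    ("demotivated", 0): 0,
}
policy = {"neutral": 1, "motivated": 1, "demotivated": 0}

def knowledge_after(start, days=10):
    """Total knowledge gained over `days` days starting in mood `start`."""
    state, total = start, 0
    for _ in range(days):
        action = policy[state]
        total += reward[(state, action)]
        state = transition[(state, action)]
    return total

print(knowledge_after("neutral"))      # 19
print(knowledge_after("motivated"))    # 20
print(knowledge_after("demotivated"))  # 17
```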

While this calculation may seem a little arbitrary at first, there are actually a few things we can learn from it. Firstly, we intuitively found a way to assign our policy a numerical value. Secondly, we observe that this value depends on the mood our friend starts in. With that said, let us have a look at the formal definition of value functions. The value function *v* of state *s* is defined as the expected *discounted return* an agent receives starting in state *s* and following policy *π* thereafter. We refer to *v* as the *state-value function* for policy *π*, defined as the expected value *E* of the discounted return *G* when starting in state *s*:
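Written out (a reconstruction in Sutton & Barto notation, since the formula image is not reproduced here):

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\left[ G_t \mid S_t = s \right]
```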

As it turns out, our first approach is in fact not far off the actual definition. The only difference is that we based our calculations on the sum of knowledge gains over a fixed number of days, as opposed to the more general expected discounted return *G*. In RL theory, the discounted return is defined as the sum of discounted future rewards:
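In standard notation, this sum reads:

```latex
G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
```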

Here, *R* denotes the reward at timestep *t*, multiplied by the *discount rate*, denoted by a lowercase gamma. The discount rate lies in the interval from zero to one and determines how much value we assign to future rewards. To better understand the effect of the discount rate on the sum of rewards, let us consider the special cases of setting gamma to zero or to one. By setting gamma to zero, we consider only immediate rewards and disregard any future rewards, meaning the discounted return would simply equal the reward *R* at timestep *t+1*. Conversely, when gamma is set to one, we assign all future rewards their full value, so the discounted return would equal the sum of all future rewards.
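The two special cases can be written side by side:

```latex
\gamma = 0:\quad G_t = R_{t+1}
\qquad\qquad
\gamma = 1:\quad G_t = \sum_{k=0}^{\infty} R_{t+k+1}
```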

Equipped with the concepts of value functions and discounted returns, we can now properly evaluate our policy. Firstly, we need to decide on a suitable discount rate for our example. We must discard zero as a candidate, as it would not account for the long-term value of knowledge generation we are interested in. A discount rate of one should also be avoided, as our example does not have a natural notion of a final state; thus, any policy that includes regular practice of the instrument, no matter how ineffective, would yield an infinite amount of knowledge given enough time. Hence, choosing a discount rate of one would make us indifferent between having our friend practice every day or once a year. After rejecting the special cases of zero and one, we have to choose a suitable discount rate between the two. The smaller the discount rate, the less value is assigned to future rewards, and vice versa. For our example, we set the discount rate to 0.9 and calculate the discounted returns for each of our friend’s moods. Let us start with the motivated state. Instead of considering only the next ten days, we calculate the sum of all discounted future rewards, resulting in 20 units of knowledge. The calculation is as follows¹:
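Since the motivated state yields two units of knowledge every day under our policy, the discounted return is a geometric series (a reconstruction of the calculation referenced here):

```latex
v_\pi(\text{motivated}) = \sum_{k=0}^{\infty} \gamma^k \cdot 2
                        = \frac{2}{1 - 0.9} = 20
```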

Note that by introducing a discount rate smaller than one, the sum of an infinite number of future rewards remains finite. The next state we want to analyse is the neutral state. In this state, our agent chooses to practice for one hour, gaining one unit of knowledge, and then transitions to the motivated state. This simplifies the calculation tremendously, as we already know the value of the motivated state.
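Concretely, this works out to:

```latex
v_\pi(\text{neutral}) = 1 + \gamma \, v_\pi(\text{motivated}) = 1 + 0.9 \cdot 20 = 19
```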

As a final step, we can also calculate the value function of the demotivated state. The process is analogous to the neutral state, resulting in a value of a little over 17.
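Presumably the calculation behind this value is:

```latex
v_\pi(\text{demotivated}) = 0 + \gamma \, v_\pi(\text{neutral}) = 0 + 0.9 \cdot 19 = 17.1
```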

By examining the state-value function of our policy in all states, we can deduce that the motivated state is the most rewarding, which is why we should instruct our friend to reach it as quickly as possible and remain there.
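These values can also be recovered numerically with iterative policy evaluation, sweeping the Bellman expectation update until it converges. The per-state rewards and transitions under our "one hour a day" policy are reconstructed from the example:

```python
GAMMA = 0.9  # discount rate chosen in the text

# Successor state and reward for each state under the one-hour policy.
transition = {"neutral": "motivated", "motivated": "motivated", "demotivated": "neutral"}
reward = {"neutral": 1, "motivated": 2, "demotivated": 0}

def evaluate_policy(tol=1e-10):
    """Iterate v(s) <- r(s) + gamma * v(s') until the largest change is below tol."""
    v = {s: 0.0 for s in transition}
    while True:
        delta = 0.0
        for s in v:
            new = reward[s] + GAMMA * v[transition[s]]
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

values = evaluate_policy()
print({s: round(val, 1) for s, val in values.items()})
# {'neutral': 19.0, 'motivated': 20.0, 'demotivated': 17.1}
```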

I encourage you to try developing alternative policy functions and evaluating them using the state-value function. While some of them may be more successful in the short term, they will not generate as many units of knowledge in the long term as the strategy proposed here². If you want to dig deeper into the mathematics behind MDPs, policies, and value functions, I highly recommend “Reinforcement Learning — An Introduction” by Richard S. Sutton and Andrew G. Barto. Alternatively, I suggest checking out the “RL Course by David Silver” on YouTube.

What if our friend was not into music, but instead asked us to help them build a self-driving car, or our supervisor instructed our team to develop an improved recommender system? Unfortunately, discovering the optimal policy for our example will not help us much with other RL problems. Therefore, we need to devise algorithms that are capable of solving any finite MDP².

In the following blog posts you will explore how to apply various RL algorithms to practical examples. We will start with tabular solution methods, which are the simplest form of RL algorithms and are suitable for solving MDPs with small state and action spaces, such as the one in our example. We will then delve into deep learning to tackle more intricate RL problems with arbitrarily large state and action spaces, where tabular methods are no longer feasible. These approximate solution methods will be the focus of the second part of this course. Finally, to conclude the course, we will cover some of the most recent papers in the field of RL, providing a comprehensive evaluation of each one, along with practical examples to illustrate the theory.