In conclusion, the Q-learning agent converged to a sub-optimal strategy, as mentioned previously. Furthermore, a portion of the environment remains unexplored by the Q-function, which prevents the agent from finding the new optimal path when the purple portal appears after the 100th episode.
These performance limitations can be attributed to the relatively low number of training steps (400), which limits the agent's interactions with the environment and the exploration induced by the ε-greedy policy.
Planning, an integral part of model-based reinforcement learning methods, is particularly useful for improving sample efficiency and the estimation of action values. Dyna-Q and Dyna-Q+ are good examples of TD algorithms that incorporate planning steps.
The Dyna-Q algorithm (Dynamic Q-learning) is a combination of model-based RL and TD learning.
Model-based RL algorithms rely on a model of the environment to incorporate planning as their primary way of updating value estimates. In contrast, model-free algorithms rely on direct learning.
“A model of the environment is anything that an agent can use to predict how the environment will respond to its actions” — Reinforcement Learning: An Introduction.
Within the scope of this article, the model can be seen as an approximation of the transition dynamics p(s', r|s, a). Here, p returns a single next-state and reward pair given the current state-action pair.
In environments where p is stochastic, we distinguish between distribution models and sample models: the former returns a distribution over the next states and rewards, while the latter returns a single pair sampled from the estimated distribution.
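To make this concrete, here is a minimal sketch of a tabular sample model for a deterministic environment. The class name TabularModel and its methods are illustrative assumptions, not the article's implementation:

```python
import random


class TabularModel:
    """Minimal tabular sample model for a deterministic environment.

    Storing the last observed (reward, next_state) for each (state, action)
    is enough to act as a sample model of p(s', r | s, a).
    """

    def __init__(self):
        self.transitions = {}  # (state, action) -> (reward, next_state)

    def update(self, state, action, reward, next_state):
        # Model training: record what the real environment returned.
        self.transitions[(state, action)] = (reward, next_state)

    def sample(self):
        # Planning: pick a previously observed state-action pair at random
        # and return the simulated experience (s, a, r, s').
        (state, action), (reward, next_state) = random.choice(
            list(self.transitions.items())
        )
        return state, action, reward, next_state
```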
Models are especially useful for simulating episodes, and therefore for training the agent by replacing real-world interactions with planning steps, i.e. interactions with the simulated environment.
Agents implementing the Dyna-Q algorithm belong to the class of planning agents, agents that combine direct reinforcement learning and model learning. They use direct interactions with the environment to update their value function (as in Q-learning) and also to learn a model of the environment. After each direct interaction, they can additionally perform planning steps to update their value function using simulated interactions.
A quick chess example
Imagine playing a game of chess. After each move, your opponent's response lets you assess the quality of your move. This is analogous to receiving a positive or negative reward, which allows you to “update” your strategy. If your move results in a blunder, you probably wouldn't make it again, given the same configuration of the board. So far, this is comparable to direct reinforcement learning.
Now let's add planning to the mix. Imagine that after each of your moves, while the opponent is thinking, you mentally go back over your previous moves to reassess their quality. You might find weaknesses you overlooked at first sight or discover that specific moves were better than you thought. These thoughts can also help you update your strategy. This is exactly what planning is about: updating the value function without interacting with the real environment, but rather with a model of said environment.
Dyna-Q therefore includes some additional steps compared to Q-learning (a code sketch of the full update follows the list below):
- After each direct update of the Q-values, the model stores the state-action pair as well as the observed reward and next state. This step is called model training.
- After model training, Dyna-Q performs n planning steps:
  - A random state-action pair is selected from the model buffer (i.e. a state-action pair that was observed during direct interactions).
  - The model generates the simulated reward and next state.
  - The value function is updated using the simulated observations (s, a, r, s').
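Putting these steps together, here is a minimal sketch of a single Dyna-Q update that reuses the TabularModel sketched above. The function name dyna_q_update, the q_values dictionary, and the hyperparameter values are assumptions for illustration, not the article's exact implementation:

```python
from collections import defaultdict


def dyna_q_update(q_values, model, state, action, reward, next_state,
                  actions, alpha=0.1, gamma=0.9, n_planning_steps=10):
    """One Dyna-Q step: direct RL update, model training, then n planning steps."""

    def q_learning_update(s, a, r, s_next):
        # Standard Q-learning target: r + gamma * max_a' Q(s', a')
        best_next = max(q_values[(s_next, a_next)] for a_next in actions)
        q_values[(s, a)] += alpha * (r + gamma * best_next - q_values[(s, a)])

    # (1) Direct RL: update Q from the real transition
    q_learning_update(state, action, reward, next_state)

    # (2) Model training: store the observed transition
    model.update(state, action, reward, next_state)

    # (3) Planning: update Q from n simulated transitions
    for _ in range(n_planning_steps):
        s, a, r, s_next = model.sample()
        q_learning_update(s, a, r, s_next)


# Hypothetical usage with integer states and 4 grid-world actions:
# q_values = defaultdict(float)
# model = TabularModel()
# dyna_q_update(q_values, model, state=0, action=1, reward=-1.0,
#               next_state=4, actions=range(4))
```

With n_planning_steps set to 0, this reduces to plain Q-learning; increasing n lets the agent extract more value updates from each real interaction, which is exactly the sample-efficiency benefit of planning mentioned above.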