Simulate games and predict the outcomes.
Isn’t it amazing that everything you need to excel at a perfect information game is there for everyone to see in the rules of the game?
Unfortunately, for mere mortals like me, reading the rules of a new game is only a tiny fraction of the journey to learning to play a complex game. Most of the time is spent playing, ideally against a player of comparable strength (or a better player who is patient enough to help us expose our weaknesses). Losing often and hopefully winning sometimes provides the psychological punishments and rewards that steer us towards playing incrementally better.
Perhaps, in a not-too-distant future, a language model will read the rules of a complex game such as chess and, right from the start, play at the highest possible level. In the meantime, I propose a more modest challenge: learning by self-play.
In this project, we’ll train an agent to learn to play two-player perfect information games by observing the outcomes of matches played by previous versions of itself. The agent will approximate a value (the expected game result) for any game state. As an additional challenge, our agent won’t be allowed to maintain a lookup table of the state space, since this approach would not be manageable for complex games.
The game
The game that we’re going to discuss is SumTo100. The goal of the game is to reach a sum of 100 by adding numbers between 1 and 10. Here are the rules, followed by a small code sketch that encodes them:
1. Initialize sum = 0.
2. Choose a first player. The two players take turns.
3. While sum < 100:
   - The player chooses a number between 1 and 10 inclusively. The chosen number gets added to the sum without exceeding 100.
   - If sum < 100, the other player plays (i.e., we return to the top of point 3).
4. The player that added the last number (reaching 100) wins.
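For concreteness, here is a minimal sketch of what a game authority for SumTo100 could look like. The class and method names are illustrative choices, not necessarily those used in the repository.

```python
class SumTo100:
    """A minimal game authority for SumTo100: the state is the current sum (0 to 100)."""

    TARGET = 100
    MAX_MOVE = 10

    def initial_state(self) -> int:
        return 0

    def legal_moves(self, state: int) -> list:
        # A player may add 1 to 10, without exceeding 100.
        return list(range(1, min(self.MAX_MOVE, self.TARGET - state) + 1))

    def next_state(self, state: int, move: int) -> int:
        return state + move

    def is_over(self, state: int) -> bool:
        return state >= self.TARGET


if __name__ == "__main__":
    game = SumTo100()
    print(game.legal_moves(0))   # [1, 2, ..., 10]
    print(game.legal_moves(95))  # [1, 2, 3, 4, 5]
    print(game.is_over(100))     # True
```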
Starting with such a simple game has many benefits:
- The state space has only 101 possible values.
- The states can be plotted on a 1D grid. This peculiarity will allow us to represent the state value function learned by the agent as a 1D bar graph.
- The optimal strategy is known:
– Reach a sum of 11n + 1, where n ∈ {0, 1, 2, …, 9}
We can visualize the state values of the optimal strategy:
The game state is the sum after an agent has completed its turn. A value of 1.0 means that the agent is sure to win (or has won), while a value of -1.0 means that the agent is sure to lose (assuming the opponent plays optimally). An intermediate value represents the estimated return. For instance, a state value of 0.2 means a slightly positive state, while a state value of -0.8 represents a probable loss.
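As a sanity check, the optimal state values of Figure 1 can be recomputed by backward induction over the 101 states. A minimal sketch, using the same state convention (the function name is mine, not the repository’s):

```python
def optimal_state_values(target=100, max_move=10):
    """Value of each sum s for the agent that has just reached s, under optimal play."""
    values = [0.0] * (target + 1)
    values[target] = 1.0  # reaching 100 is a win for the agent that just played
    for s in range(target - 1, -1, -1):
        # The opponent now picks the move that is best for them, i.e. worst for us.
        opponent_reachable = range(s + 1, min(s + max_move, target) + 1)
        values[s] = -max(values[next_sum] for next_sum in opponent_reachable)
    return values


if __name__ == "__main__":
    values = optimal_state_values()
    print([s for s, v in enumerate(values) if v > 0])
    # [1, 12, 23, 34, 45, 56, 67, 78, 89, 100], i.e. the sums 11n + 1
```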
If you want to dive into the code, the script that performs the whole training procedure is learn_sumTo100.sh, in this repository. Otherwise, bear with me as we go through a high level description of how our agent learns by self-play.
Generation of games played by random players
We want our agent to learn from games played by previous versions of itself, but in the first iteration, since the agent has not learned anything yet, we’ll have to simulate games played by random players. At each turn, the players will get the list of legal moves from the game authority (the class that encodes the game rules), given the current game state. The random players will select a move randomly from this list.
Figure 2 is an example of a game played by two random players:
In this case, the second player won the game by reaching a sum of 100.
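Here is a minimal sketch of how such a random game could be generated, recording the sums reached after each player’s turns (the rules are inlined to keep the snippet self-contained; the names are illustrative):

```python
import random


def play_random_game(target=100, max_move=10):
    """Play one game between two random players and record the reached sums."""
    state = 0
    reached_sums = {1: [], 2: []}  # sums reached after each player's turns
    current_player = 1
    while state < target:
        legal_moves = range(1, min(max_move, target - state) + 1)
        state += random.choice(legal_moves)
        reached_sums[current_player].append(state)
        if state < target:
            current_player = 2 if current_player == 1 else 1
    return reached_sums, current_player  # current_player reached 100 and wins


if __name__ == "__main__":
    reached_sums, winner = play_random_game()
    print(reached_sums)
    print(f"Player {winner} wins")
```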
We’ll implement an agent that has access to a neural network that takes a game state (after the agent has played) as input and outputs the expected return of this game. For any given state (before the agent has played), the agent gets the list of legal actions and their corresponding candidate states (we only consider games with deterministic transitions).
Figure 3 shows the interactions between the agent, the opponent (whose move selection mechanism is unknown), and the game authority:
In this setting, the agent relies on its regression neural network to predict the expected return of game states. The better the neural network can predict which candidate move yields the highest return, the better the agent will play.
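Concretely, the agent’s move selection could look like the following sketch: evaluate each candidate state with the value network and keep the move with the highest predicted return. I assume a PyTorch-style network fed with the sum scaled to [0, 1]; the exact state encoding in the repository may differ.

```python
import torch


def choose_greedy_move(value_network, state, legal_moves):
    """Pick the legal move whose candidate state has the highest predicted return."""
    best_move, best_value = None, float("-inf")
    for move in legal_moves:
        candidate_state = state + move  # deterministic transition
        features = torch.tensor([[candidate_state / 100.0]], dtype=torch.float32)
        with torch.no_grad():
            predicted_return = value_network(features).item()
        if predicted_return > best_value:
            best_move, best_value = move, predicted_return
    return best_move
```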
Our list of randomly played matches will provide us with the dataset for our first training pass. Taking the example game from Figure 2, we want to punish the moves made by player 1, since its behaviour led to a loss. The state resulting from the last action gets a value of -1.0, since it allowed the opponent to win. The other states get negative values discounted by a factor of γᵈ, where d is the distance with respect to the last state reached by the agent. γ (gamma) is the discount factor, a number ∈ [0, 1], that expresses the uncertainty in the evolution of a game: we don’t want to punish early decisions as hard as the last decisions. Figure 4 shows the state values associated with the decisions made by player 1:
The random games generate states with their target expected return. For instance, reaching a sum of 97 has a target expected return of -1.0, and a sum of 73 has a target expected return of -γ³. Half the states take the point of view of player 1, and the other half take the point of view of player 2 (although it doesn’t matter in the case of the game SumTo100). When a game ends with a win for the agent, the corresponding states get similarly discounted positive values.
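The conversion from one player’s sequence of reached sums into (state, target return) pairs could look like this sketch. The outcome is +1.0 for a win and -1.0 for a loss from that player’s point of view; the value of γ is a training hyperparameter, and the one used here is only an example.

```python
def build_training_pairs(reached_sums, outcome, gamma=0.9):
    """Assign discounted target returns to the sums reached by one player."""
    # gamma: discount factor in [0, 1]; 0.9 is an example value, not the repository's.
    last_index = len(reached_sums) - 1
    pairs = []
    for index, state in enumerate(reached_sums):
        distance = last_index - index  # d: distance from the last reached state
        pairs.append((state, outcome * gamma ** distance))
    return pairs


if __name__ == "__main__":
    # A hypothetical losing player that reached the sums 53, 73, 84, 91, 97:
    print(build_training_pairs([53, 73, 84, 91, 97], outcome=-1.0))
    # The sum 97 gets -1.0 and the sum 73 gets -gamma**3, as in the example above.
```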
Training an agent to predict the return of games
We now have all we need to start our training: a neural network (we’ll use a two-layer perceptron) and a dataset of (state, expected return) pairs. Let’s see how the loss on the predicted expected return evolves:
We shouldn’t be surprised that the neural network doesn’t show much predictive power over the outcome of games played by random players.
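For reference, a training pass like the one above could be sketched as follows in PyTorch. The two-layer perceptron, its hidden size, and the optimization settings are illustrative choices, not the repository’s exact ones.

```python
import torch
import torch.nn as nn


def train_value_network(pairs, hidden_size=32, epochs=200, learning_rate=1e-3):
    """Fit a two-layer perceptron on (state, target expected return) pairs."""
    states = torch.tensor([[s / 100.0] for s, _ in pairs], dtype=torch.float32)
    targets = torch.tensor([[t] for _, t in pairs], dtype=torch.float32)

    model = nn.Sequential(
        nn.Linear(1, hidden_size),
        nn.ReLU(),
        nn.Linear(hidden_size, 1),
        nn.Tanh(),  # predicted returns live in [-1, 1]
    )
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    loss_function = nn.MSELoss()

    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_function(model(states), targets)
        loss.backward()
        optimizer.step()
    return model
```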
Did the neural network learn anything at all?
Fortunately, since the states can be represented as a 1D grid of numbers between 0 and 100, we can plot the predicted returns of the neural network after the first training round and compare them with the optimal state values of Figure 1:
As it turns out, through the chaos of random games, the neural network learned two things:
- If you can reach a sum of 100, do it. That’s good to know, considering it’s the goal of the game.
- If you reach a sum of 99, you’re sure to lose. Indeed, in this situation, the opponent has only one legal action, and that action yields a loss for the agent.
The neural network essentially learned how to finish the game.
To learn to play a little better, we must rebuild the dataset by simulating games played between copies of the agent, each using its freshly trained neural network. To avoid generating identical games, the players play somewhat randomly. An approach that works well is selecting moves with the epsilon-greedy algorithm, using ε = 0.5 for each player’s first move, then ε = 0.1 for the rest of the game.
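A sketch of that epsilon-greedy move selection, reusing the candidate-state evaluation idea from earlier (ε values as described above; the state encoding is again an assumption):

```python
import random

import torch


def choose_epsilon_greedy_move(value_network, state, legal_moves, is_first_move):
    """With probability epsilon, play a random legal move; otherwise play the greedy move."""
    epsilon = 0.5 if is_first_move else 0.1
    if random.random() < epsilon:
        return random.choice(legal_moves)
    candidates = torch.tensor(
        [[(state + move) / 100.0] for move in legal_moves], dtype=torch.float32
    )
    with torch.no_grad():
        predicted_returns = value_network(candidates).squeeze(1)
    return legal_moves[int(torch.argmax(predicted_returns))]
```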
Repeating the training loop with better and better players
Since both players now know that they must reach 100, reaching a sum between 90 and 99 should be punished, since the opponent would jump on the opportunity to win the match. This phenomenon is visible in the predicted state values after the second round of training:
We see a pattern emerging. The first training round informs the neural network about the last action; the second training round informs it about the penultimate action, and so on. We need to repeat the cycle of game generation and training on prediction at least as many times as there are actions in a game.
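Putting the pieces together, the whole procedure boils down to an outer loop like the following self-contained sketch, which condenses the earlier snippets. The round count, γ, ε schedule, network size, and number of games per round are illustrative, not the repository’s settings.

```python
import random

import torch
import torch.nn as nn

TARGET, MAX_MOVE, GAMMA = 100, 10, 0.9


def pick_move(network, state, is_first_move):
    legal_moves = list(range(1, min(MAX_MOVE, TARGET - state) + 1))
    epsilon = 0.5 if is_first_move else 0.1
    if network is None or random.random() < epsilon:
        return random.choice(legal_moves)  # random play in round 0, exploration later
    candidates = torch.tensor([[(state + m) / TARGET] for m in legal_moves], dtype=torch.float32)
    with torch.no_grad():
        return legal_moves[int(network(candidates).argmax())]


def play_one_game(network):
    """Play one self-play game and return (state, discounted target return) pairs."""
    state, current, reached, first_move = 0, 0, ([], []), [True, True]
    while state < TARGET:
        state += pick_move(network, state, first_move[current])
        first_move[current] = False
        reached[current].append(state)
        if state < TARGET:
            current = 1 - current  # the other player plays
    pairs = []
    for player in (0, 1):  # 'current' is the winner
        outcome = 1.0 if player == current else -1.0
        last = len(reached[player]) - 1
        pairs += [(s, outcome * GAMMA ** (last - i)) for i, s in enumerate(reached[player])]
    return pairs


def train(pairs, epochs=200):
    """Fit a fresh two-layer perceptron on the (state, target return) pairs."""
    states = torch.tensor([[s / TARGET] for s, _ in pairs], dtype=torch.float32)
    targets = torch.tensor([[t] for _, t in pairs], dtype=torch.float32)
    model = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Tanh())
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(states), targets)
        loss.backward()
        optimizer.step()
    return model


if __name__ == "__main__":
    network = None
    for round_index in range(25):  # at least as many rounds as actions in a game
        dataset = [pair for _ in range(1000) for pair in play_one_game(network)]
        network = train(dataset)
```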
The following animation shows the evolution of the predicted state values after 25 training rounds:
The envelope of the predicted returns decays exponentially as we go from the end towards the beginning of the game. Is this a problem?
Two factors contribute to this phenomenon:
- γ directly damps the target expected returns as we move away from the end of the game.
- The epsilon-greedy algorithm injects randomness in the player behaviours, making the outcomes harder to predict. There is an incentive to predict a value near zero to protect against cases of extremely high losses. Nonetheless, the randomness is desirable because we don’t want the neural network to learn a single line of play. We want the neural network to witness blunders and unexpected good moves, both from the agent and the opponent.
In practice, this isn’t a problem because, in any situation, we compare values among the legal moves in a given state, and those share comparable scales, at least for the game SumTo100. The scale of the values doesn’t matter when we choose the greedy move.
We challenged ourselves to create an agent that can learn to master a two-player perfect information game with deterministic transitions from one state to the next, given an action. No hand-coded strategies nor tactics were allowed: everything had to be learned by self-play.
We could solve the simple game of SumTo100 by running multiple rounds of pitting copies of the agent against each other and training a regression neural network to predict the expected return of the generated games.
The insight gained prepares us well for the next rung on the ladder of game complexity, but that will be for my next post! 😊
Thank you for your time.