DeepMind's AlphaGo Zero:- Explanation
Originally Written on:- January 23rd, 2019.
AlphaGo Zero
AlphaGo was the first computer program to beat a human professional Go player
without handicap on a full 19 x 19 board. It went on to beat the world
champion Lee Sedol 4 games to 1, Ke Jie (the number
one ranked player in the world at the time) and many other top-ranked players.
Google DeepMind has since published the final iteration of AlphaGo,
called AlphaGo Zero, which learned the game entirely through self-play.
In Go, the playing pieces are called "stones". One player uses
the white stones and the other, black. The players take turns placing the
stones on the vacant intersections ("points") of a board. Once placed
on the board, stones may not be moved, but stones are removed from the board if
"captured". Capture happens when a stone or group of stones is
surrounded by opposing stones on all orthogonally-adjacent points. The game
proceeds until neither player wishes to make another move. When a game
concludes, the winner is determined by counting each player's surrounded
territory along with captured stones and komi (points added to
the score of the player with the white stones as compensation for playing
second). Games may also be terminated by resignation.
The standard Go board has a 19×19 grid of lines, containing 361 points.
Beginners often play on smaller 9×9 and 13×13 boards.
Go is actually a discrete deterministic game with
perfect information.
Discrete-> the board has a finite number of points and play proceeds in turns, so there are only finitely many states and moves.
Deterministic-> there is no element of chance; every move has exactly one outcome.
Perfect information-> both players can see the entire state of the board at all times.
Now, tic-tac-toe also has the same properties if you think about it.
Suppose we had an AI that plays tic-tac-toe. To play well, the AI needs to
choose the move that leaves its opponent with the lowest possible score.
In tic-tac-toe there are only three possible outcomes:
1) win, 2) lose and 3) draw. Our algorithm would steer towards states that end
in a win or a draw for the player it is evaluating. For that, we can use a depth-first
search (DFS) over the game tree, i.e. the minimax algorithm, in order to determine the optimal move to play.
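As a rough illustration, here is a minimal Python sketch of such a DFS/minimax search for tic-tac-toe (the board representation and helper names are just one possible choice, not anything standard):

```python
# Minimal minimax (DFS over the full game tree) for tic-tac-toe.
# A board is a tuple of 9 cells, each 'X', 'O' or None.

def winner(board):
    lines = [(0,1,2),(3,4,5),(6,7,8),(0,3,6),(1,4,7),(2,5,8),(0,4,8),(2,4,6)]
    for a, b, c in lines:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Return (best_score, best_move) for `player`: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == player else -1), None
    moves = [i for i, cell in enumerate(board) if cell is None]
    if not moves:
        return 0, None  # board full: draw
    opponent = 'O' if player == 'X' else 'X'
    best_score, best_move = -2, None
    for m in moves:
        child = board[:m] + (player,) + board[m+1:]
        opp_score, _ = minimax(child, opponent)   # opponent replies optimally
        if -opp_score > best_score:               # our score is the negation of theirs
            best_score, best_move = -opp_score, m
    return best_score, best_move

print(minimax((None,) * 9, 'X'))  # (0, 0): with perfect play tic-tac-toe is a draw
```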
The question here is: can we apply the same strategy to Go?
Since Go has far more possibilities (Go is often said to have
more possible states than there are atoms in the observable universe), a
brute-force strategy will not help. Covering all possible states is far too
computationally expensive, and it simply is not feasible with the
computational power we currently have at our disposal.
We need to come up with a better strategy.
1) How about we choose a selection strategy, which is
applied recursively until an unknown position is reached?
2) Then, one simulated game
is played from that position.
3) After playing that simulated game,
a new node is added to the tree. (Expansion)
4) The result of the above three steps
is backpropagated up the tree.
5) We repeat steps 1) to 4) a
number of times.
This is the basic idea of Monte Carlo
Tree Search (MCTS). It is the go-to algorithm for playing
deterministic games, building game-playing bots and so on.
For example, we can fix a number of simulations and then run MCTS,
choosing random moves on the playout boards (simulated boards reached after a
sequence of moves), and score each state by how often it leads to a win. Suppose
player A wins from state “x” more frequently than from any other state. Then x will be
considered the more winnable state. Below is the pseudocode.
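The sketch below is written as Python rather than pure pseudocode; the `game` object stands for an assumed interface (legal_moves, next_state, to_move, is_terminal, result), so treat it as an illustration of the idea rather than a ready-made implementation.

```python
import math
import random

class Node:
    """One node of the search tree. `player` is the player who made the move
    leading to this node; `score` accumulates playout results from that player's view."""
    def __init__(self, state, untried_moves, player, parent=None, move=None):
        self.state, self.player, self.parent, self.move = state, player, parent, move
        self.children, self.untried_moves = [], list(untried_moves)
        self.visits, self.score = 0, 0.0

def mcts(game, root_state, n_simulations=1000, k=1.4):
    root = Node(root_state, game.legal_moves(root_state), player=None)
    for _ in range(n_simulations):
        node = root
        # 1) Selection: descend the tree with UCT until a node with untried moves is reached.
        while not node.untried_moves and node.children:
            node = max(node.children, key=lambda c: c.score / c.visits
                       + k * math.sqrt(math.log(node.visits) / c.visits))
        # 2)/3) Expansion + simulation: add one untried move to the tree...
        if node.untried_moves:
            move = node.untried_moves.pop()
            mover = game.to_move(node.state)
            state = game.next_state(node.state, move)
            child = Node(state, game.legal_moves(state), player=mover, parent=node, move=move)
            node.children.append(child)
            node = child
        # ...then play a random game from the new position until it ends.
        state = node.state
        while not game.is_terminal(state):
            state = game.next_state(state, random.choice(game.legal_moves(state)))
        # 4) Backpropagation: credit each node from the view of the player who moved into it.
        while node is not None:
            node.visits += 1
            if node.player is not None:
                node.score += game.result(state, node.player)  # e.g. 1 win, 0.5 draw, 0 loss
            node = node.parent
    # The recommended move is the most-visited child of the root.
    return max(root.children, key=lambda c: c.visits).move
```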
Fundamentally, MCTS performs the kind of lookahead search that we would
imagine a human master would perform if given enough time: it intelligently
guesses which variations (sequences of future moves) are most promising,
simulates those variations, evaluates how good they actually are, and updates
its assessments of its current best moves accordingly.
Randomness
Now, coming to the question of randomness.
The question here is: while performing MCTS,
do we just look ahead and wander randomly through states, without
knowing whether a given state is actually winnable or not?
We can use a heuristic solver to help us select from the playout
boards: the heuristic solver estimates the value of each move,
and the moves within the playouts are chosen based on those estimates.
For this reason we have the Upper Confidence bound applied to Trees (UCT).
The formula for balancing exploitation and exploration in games is:
Uc = Sc / Nc + K * sqrt(ln(Np) / Nc)
where
Sc = the aggregate score of move c in all simulations thus far,
Nc = the number of times move c has been played in all simulations thus far,
Np = the total number of games simulated from the parent position,
K = an adjustable constant representing the amount of exploration to allow.
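Translated directly into code, the formula might look like this (a minimal sketch; the function name and the example numbers are purely illustrative):

```python
import math

def uct_score(S_c, N_c, N_p, K=1.4):
    """Upper-confidence score for move c.

    S_c: aggregate score of move c over all simulations so far.
    N_c: number of times move c has been played so far.
    N_p: total number of games simulated from the parent position.
    K:   adjustable exploration constant (larger K explores more).
    """
    exploitation = S_c / N_c                            # average result of move c
    exploration = K * math.sqrt(math.log(N_p) / N_c)    # bonus for rarely tried moves
    return exploitation + exploration

# Example: a move tried 10 times with 7 wins, out of 100 simulations from the parent.
print(uct_score(S_c=7, N_c=10, N_p=100))  # roughly 0.7 + 0.95 = 1.65
```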
UCT learns over time. The playouts become less random and
more heavily guided as a result. As more data accumulates, the algorithm
becomes skewed towards the good outcomes rather than the bad ones,
creating a self-reinforcing cycle.
But here is an interesting question.
What if two games reach similar states, arrived at through different
sequences of moves but presenting the same position to both players? Will the AI need to
remember every state it has seen and store them all in a hash table?
Why not instead feed frames of the game board, along with the heuristic results,
into a convolutional neural network?
DeepMind did just that.
Neural Network for AlphaGo Zero
We would want our neural network to learn which moves are likely
to lead to wins. So, as before, our agent, using its MCTS-improved
evaluations and the current state of its neural network, could play games
against itself, winning some and losing others.
This data, generated purely via lookahead and self-play, is what
DeepMind used to train AlphaGo Zero. More specifically:
The neural network was trained to play moves that reflected the improved
evaluations from performing the “lookahead” search.
The neural network was adjusted so that it was more likely to play moves
similar to those that led to wins and less likely to play moves similar to
those that led to losses during the self-play games.
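A minimal sketch of the corresponding training objective, written in PyTorch-style Python (the function and tensor names are assumptions for illustration; the combination of a cross-entropy term towards the MCTS move probabilities and a squared-error term towards the game outcome follows the AlphaGo Zero paper):

```python
import torch.nn.functional as F

def self_play_loss(policy_logits, value, mcts_probs, outcome):
    """policy_logits: (batch, moves) raw move scores from the policy head.
    value:       (batch, 1) predicted outcome for the current player, in [-1, 1].
    mcts_probs:  (batch, moves) improved move probabilities from the lookahead search.
    outcome:     (batch, 1) actual self-play result for that player (+1 win, -1 loss).
    """
    # Pull the policy head towards the moves the lookahead search preferred.
    policy_loss = -(mcts_probs * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    # Pull the value head towards the actual outcome of the self-play game.
    value_loss = F.mse_loss(value, outcome)
    return policy_loss + value_loss  # an L2 weight penalty is typically added via the optimizer
```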
Much was made of the fact that no games between humans were used to
train AlphaGo Zero, and this “trick” was the reason why: for a given state of a
Go agent, it can always be made smarter by performing MCTS-based lookahead and
using the results of that lookahead to improve the agent. This is how AlphaGo
Zero was able to continuously improve, from when it was an amateur all the way
up to when it was better than the best human players.
AlphaGo Zero’s neural network architecture was a “two-headed”
architecture. Its first 20 layers or so were layer “blocks” of a type often
seen in modern neural net architectures. These layers were followed by two
“heads”: one head that took the output of the first 20 layers and produced
probabilities of the Go agent making certain moves, and another that took the
output of the first 20 layers and outputted a probability of the current player
winning.
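As a rough illustration of that two-headed layout (the channel counts, block counts and layer sizes below are simplified assumptions, not the exact AlphaGo Zero configuration):

```python
import torch.nn as nn

class TwoHeadedNet(nn.Module):
    """A shared convolutional trunk followed by a policy head and a value head."""
    def __init__(self, board_size=19, in_planes=17, channels=64, n_layers=6):
        super().__init__()
        layers = [nn.Conv2d(in_planes, channels, 3, padding=1), nn.ReLU()]
        for _ in range(n_layers):  # simplified trunk; AlphaGo Zero used ~20 residual blocks
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        self.trunk = nn.Sequential(*layers)
        # Policy head: a score for every point on the board, plus one for passing.
        self.policy_head = nn.Sequential(
            nn.Conv2d(channels, 2, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size + 1))
        # Value head: a single number in [-1, 1], the current player's estimated chance of winning.
        self.value_head = nn.Sequential(
            nn.Conv2d(channels, 1, 1), nn.ReLU(), nn.Flatten(),
            nn.Linear(board_size * board_size, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        features = self.trunk(x)  # shared representation of the board
        return self.policy_head(features), self.value_head(features)
```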
AlphaGo Zero used a more “cutting edge” neural network architecture than
AlphaGo. Specifically, they used a “residual” neural network architecture
instead of a purely “convolutional” architecture.
This improved the network's efficiency. Each of these two neural network-related
tricks on its own (switching from a convolutional to a residual architecture, and using
a single "two-headed" neural network architecture instead of two separate
networks) would have delivered roughly half of the increase in playing
strength that was achieved when both were combined.
Diagram comparing residual to convolutional architectures, from the original “ResNet” paper.
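The difference between the two approaches is easiest to see in code. Below is a hedged sketch of a plain convolutional block versus a residual block (simplified: the real ResNet and AlphaGo Zero blocks also use batch normalisation):

```python
import torch.nn as nn

class PlainBlock(nn.Module):
    """Plain convolutional block: output = f(x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        return self.relu(self.conv2(self.relu(self.conv1(x))))

class ResidualBlock(nn.Module):
    """Residual block: output = f(x) + x, so the block only learns a correction to its input."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # skip connection: add the input back in
```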
All in all, it took DeepMind 40 days to beat the world's best Go player.
It took 64 GPUs and 19 CPUs to do it.
Image courtesy: KDnuggets
AlphaGo Zero constitutes a critical step towards solving the challenges
of the human society. If similar techniques can be applied to other structured
problems, such as protein folding, reducing energy consumption or searching for
revolutionary new materials, the resulting breakthroughs have the potential to
positively impact society.
Credits and More Resources:-