DeepMind's AlphaGo Zero: An Explanation

Originally written on: January 23rd, 2019.



AlphaGo Zero




AlphaGo was the first computer program to beat a human professional Go player without a handicap on a full-sized 19 x 19 board. Later versions went on to beat the world champion Lee Sedol 4 games to 1, as well as Ke Jie (the number one ranked player in the world at the time) and many other top-ranked players. Google DeepMind then published the final iteration of the system, AlphaGo Zero, which learned entirely from self-play and surpassed all of the earlier versions.

In Go, the playing pieces are called "stones". One player uses the white stones and the other, black. The players take turns placing the stones on the vacant intersections ("points") of a board. Once placed on the board, stones may not be moved, but stones are removed from the board if "captured". Capture happens when a stone or group of stones is surrounded by opposing stones on all orthogonally-adjacent points. The game proceeds until neither player wishes to make another move. When a game concludes, the winner is determined by counting each player's surrounded territory along with captured stones and komi (points added to the score of the player with the white stones as compensation for playing second). Games may also be terminated by resignation.
The standard Go board has a 19×19 grid of lines, containing 361 points. Beginners often play on smaller 9×9 and 13×13 boards.

Go is a discrete, deterministic game with perfect information.

Discrete -> play proceeds in distinct turns, with a finite set of possible moves at each turn.
Deterministic -> there is no element of chance; a given move from a given position always produces the same result.
Perfect information -> both players can see the entire state of the game (the whole board) at all times.


Now, tic-tac-toe has the same properties, if you think about it. Suppose we had an AI that plays tic-tac-toe. To play well, it needs to choose the move that leaves its opponent with the lowest possible score. Tic-tac-toe has three possible outcomes:
1) win, 2) lose and 3) draw. Our algorithm would steer towards states that end in a win or a draw for it. Because the game tree is so small, we can search it exhaustively with a depth-first search (DFS), in the style of minimax, to determine the optimal move to play.
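
As a rough illustration, here is a minimal depth-first minimax search for tic-tac-toe (the board representation and scoring convention are assumptions made for this example):

```python
# Minimal minimax sketch for tic-tac-toe. The board representation (a list of
# 9 cells holding 'X', 'O' or None) is just an assumption for this example.
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

def winner(board):
    for a, b, c in WIN_LINES:
        if board[a] is not None and board[a] == board[b] == board[c]:
            return board[a]
    return None

def minimax(board, player):
    """Best achievable score for `player` to move: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return 1 if w == player else -1
    if all(cell is not None for cell in board):
        return 0                                      # board full: draw
    opponent = 'O' if player == 'X' else 'X'
    scores = []
    for i, cell in enumerate(board):
        if cell is None:
            board[i] = player
            scores.append(-minimax(board, opponent))  # opponent's best is our worst
            board[i] = None
    return max(scores)

# Example: an empty board is a guaranteed draw with perfect play on both sides.
print(minimax([None] * 9, 'X'))  # 0
```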





The question here is: can we apply the same strategy to Go?


Since Go has vastly more possibilities (Go is often said to have more possible board configurations than there are atoms in the observable universe), a brute-force strategy will not help. It is far too computationally expensive, and covering all possible states is simply not realistic with the computational power we have at our disposal.
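
A quick back-of-the-envelope check makes the scale clear (a crude upper bound that ignores which configurations are actually legal under the rules):

```python
# Crude upper bound on 19x19 Go configurations: each of the 361 points can be
# empty, black or white (legality of positions is ignored here).
configs = 3 ** 361
print(len(str(configs)))   # 173 digits, i.e. roughly 10^172, versus ~10^80 atoms
```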

We need to come up with a better strategy.

1) A selection strategy is applied recursively, starting from the current position, until a position not yet in the tree is reached. (Selection)
2) From that position, one simulated game is played out to the end. (Simulation)
3) After the simulated game, a new node for that position is added to the tree. (Expansion)
4) The result of the above steps is backpropagated up the tree, updating every node along the path. (Backpropagation)
5) Steps 1) to 4) are repeated a large number of times.

This is the basic idea of Monte Carlo Tree Search (MCTS). It is the go-to algorithm for playing deterministic games, building game bots, and so on.




For example: we can fix a number of simulations, run MCTS by choosing random moves on the playout boards (the simulated boards after each move), and record which states lead to wins. Suppose player A wins from state "x" more frequently than from any other state; then x is considered the most winnable state. Below is a rough pseudocode sketch.
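
(This is a minimal sketch in Python, not DeepMind's code; the state interface it assumes, with legal_moves(), play(), is_over() and result(), is purely illustrative.)

```python
import math
import random

# Minimal MCTS sketch with random playouts. The state interface assumed here
# (legal_moves(), play(), is_over(), result()) is hypothetical.
class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.visits, self.score = [], 0, 0.0

def uct(parent, child, k=1.4):
    if child.visits == 0:
        return float("inf")          # always try unvisited moves first
    return (child.score / child.visits
            + k * math.sqrt(math.log(parent.visits) / child.visits))

def select(node):
    # Selection: descend the tree greedily by UCT until a leaf is reached.
    while node.children:
        node = max(node.children, key=lambda c: uct(node, c))
    return node

def expand(node):
    # Expansion: add one child node per legal move of the leaf position.
    for move in node.state.legal_moves():
        node.children.append(Node(node.state.play(move), parent=node, move=move))
    return random.choice(node.children) if node.children else node

def simulate(state):
    # Simulation: play random moves until the game ends; result() is assumed to
    # return +1 / 0 / -1 from the point of view of the player who just moved.
    while not state.is_over():
        state = state.play(random.choice(state.legal_moves()))
    return state.result()

def backpropagate(node, result):
    # Backpropagation: update every node on the path back to the root,
    # flipping the sign at each level because the players alternate.
    while node is not None:
        node.visits += 1
        node.score += result
        result = -result
        node = node.parent

def mcts(root_state, n_simulations=1000):
    root = Node(root_state)
    for _ in range(n_simulations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf.state))
    # Return the move that was explored the most, i.e. looked most winnable.
    return max(root.children, key=lambda c: c.visits).move
```

Each pass through the loop follows steps 1) to 4) above: select a promising leaf, expand it, play out a simulated game, and backpropagate the result.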






Fundamentally, MCTS performs the kind of lookahead search that we would imagine a human master would perform if given enough time: it intelligently guesses which variations (sequences of future moves) are most promising, simulates those variations, evaluates how good they actually are, and updates its assessments of its current best moves accordingly.



Randomness

Now, coming to the question of randomness.
The question here is: while performing MCTS, do we look ahead and then wander through states completely at random, without any sense of whether a given state is actually winnable or not?
We can use a heuristic to help us select from the playout boards: estimate the value of each move, and choose moves within the playouts based on that estimate.

For this reason we have UCT (Upper Confidence bounds applied to Trees).

The formula for balancing exploitation and exploration in games is called UCT:

Uc = Sc / Nc + K * sqrt( ln(Np) / Nc )

Where Sc = the aggregate score earned by move c across all simulations thus far.
      Nc = the number of times move c has been played across all simulations thus far.
      Np = the total number of simulated games played from the current (parent) position.
      K = an adjustable constant controlling how much exploration to allow.
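
In code, the UCT score for a candidate move c could be computed like this (a small sketch using the symbols defined above; the example numbers are made up):

```python
import math

def uct_value(Sc, Nc, Np, K=1.4):
    # Sc: aggregate score of move c, Nc: times move c has been played,
    # Np: total simulations from the parent, K: exploration constant.
    if Nc == 0:
        return float("inf")          # unexplored moves are tried first
    return Sc / Nc + K * math.sqrt(math.log(Np) / Nc)

# A well-explored move with a 60% win rate vs. a barely explored move:
print(uct_value(Sc=60, Nc=100, Np=400))  # ~0.94, dominated by the win rate
print(uct_value(Sc=3,  Nc=5,   Np=400))  # ~2.13, dominated by the exploration bonus
```

The second term shrinks as a move is visited more often, so heavily explored moves end up being judged almost entirely on their win rate.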

UCT learns over time. The random playouts become less random and more heavily guided as data accumulates: the algorithm skews towards good outcomes rather than bad ones, creating a self-reinforcing learning cycle.
But here is an interesting question.
What if two games reach a similar state, arrived at through different sequences of moves but identical for both players? Would the AI need to remember every state it has seen and store them all in a hash table?
Why not instead take frames of the game board and feed them into a convolutional neural network?
DeepMind did just that.


Neural Network for AlphaGo Zero

We want our neural network to learn which moves are likely to lead to wins. As before, our agent, using its MCTS-improved evaluations and the current state of its neural network, could play games against itself, winning some and losing others.
This data, generated purely via lookahead and self-play, is what DeepMind used to train AlphaGo Zero. More specifically:

The neural network was trained to play moves that reflected the improved evaluations from performing the “lookahead” search.
The neural network was adjusted so that it was more likely to play moves similar to those that led to wins and less likely to play moves similar to those that led to losses during the self-play games.
Much was made of the fact that no games between humans were used to train AlphaGo Zero, and this "trick" was the reason why: for a given state of a Go agent, it can always be made smarter by performing MCTS-based lookahead and using the results of that lookahead to improve the agent. This is how AlphaGo Zero was able to continuously improve, from when it was an amateur all the way up to when it was better than the best human players.
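
Concretely, each self-play position provides two training targets: the MCTS visit-count distribution for the move probabilities, and the game's eventual outcome for the win estimate. Here is a simplified sketch of that combined loss in PyTorch-style Python (the tensor shapes and names are assumptions for illustration, not DeepMind's code):

```python
import torch
import torch.nn.functional as F

def alphazero_loss(policy_logits, mcts_probs, value_pred, outcome):
    # policy_logits: network move logits, shape (batch, 362)   [361 points + pass]
    # mcts_probs:    move probabilities from the MCTS lookahead, same shape
    # value_pred:    network win estimate in [-1, 1], shape (batch, 1)
    # outcome:       actual self-play result, +1 win / -1 loss, shape (batch, 1)
    policy_loss = -(mcts_probs * F.log_softmax(policy_logits, dim=1)).sum(dim=1).mean()
    value_loss = F.mse_loss(value_pred, outcome)
    # (The paper also adds an L2 weight-regularisation term.)
    return policy_loss + value_loss
```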

AlphaGo Zero’s neural network architecture was a “two-headed” architecture. Its first 20 layers or so were layer “blocks” of a type often seen in modern neural net architectures. These layers were followed by two “heads”: one head that took the output of the first 20 layers and produced probabilities of the Go agent making certain moves, and another that took the output of the first 20 layers and outputted a probability of the current player winning.
AlphaGo Zero used a more “cutting edge” neural network architecture than AlphaGo. Specifically, they used a “residual” neural network architecture instead of a purely “convolutional” architecture.
This improved efficiency. Each of these two neural network tricks, switching from a convolutional to a residual architecture and using the "two-headed" architecture instead of separate neural networks, would on its own have produced about half of the increase in playing strength that was achieved when both were combined.
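
As a toy illustration of those two ideas together, a miniature "two-headed" residual network might look like the following in PyTorch (much shallower than the roughly 20-block network described above; the layer sizes are placeholders chosen for readability):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    # Two conv layers whose output is added back onto the input ("skip connection").
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)               # the residual part

class TwoHeadedNet(nn.Module):
    # Shared residual trunk feeding a policy head and a value head.
    def __init__(self, in_planes=17, channels=64, blocks=4, points=361):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_planes, channels, 3, padding=1),
                                  nn.BatchNorm2d(channels), nn.ReLU())
        self.trunk = nn.Sequential(*[ResidualBlock(channels) for _ in range(blocks)])
        self.policy_head = nn.Sequential(nn.Flatten(),
                                         nn.Linear(channels * points, points + 1))
        self.value_head = nn.Sequential(nn.Flatten(),
                                        nn.Linear(channels * points, 1), nn.Tanh())

    def forward(self, x):                        # x: (batch, in_planes, 19, 19)
        features = self.trunk(self.stem(x))
        return self.policy_head(features), self.value_head(features)

# Example forward pass: output shapes are (1, 362) move logits and (1, 1) win estimate.
logits, value = TwoHeadedNet()(torch.zeros(1, 17, 19, 19))
```

The skip connection in ResidualBlock is what makes the network "residual" rather than purely convolutional, and the two heads share the whole trunk, which is what makes a single network cheaper than training separate policy and value networks.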


Diagram comparing residual to convolutional architectures, from the original “ResNet” paper

All in all, it took DeepMind 40 days to beat the world's best Go player. It took 64 GPUs and 19 CPUs to do it.


Image courtesy: Kdnuggets

AlphaGo Zero constitutes a critical step towards tackling some of humanity's grand challenges. If similar techniques can be applied to other structured problems, such as protein folding, reducing energy consumption or searching for revolutionary new materials, the resulting breakthroughs have the potential to positively impact society.


Credits and more resources:
Read the paper
Read the accompanying Nature News and Views article








