How does AlphaGo work?

Edward Tsang 2016.03.20; revised 2016.03.21

AlphaGo is a computer program that recently beat a top professional human player, Lee Sedol. To find the next move, AlphaGo plays millions of games from the current board situation by itself. To do so, AlphaGo needs to do two things:

  1. shortlist sensible moves; and
  2. evaluate how good a board situation is.
AlphaGo uses (a) expert knowledge, plus (b) machine learning, a branch of Artificial Intelligence, to perform these tasks.

Note: "AI beats human in Go" makes attractive headline. But one must not belittle the contributions to AlphaGo by human intelligence and computer hardware.


A non-technical explanation

AlphaGo, Google Deepmind's computer program for the board game Go, beat professional player Lee Sedol 9-dan 4-1 in a five-game match from 9th to 15th March 2016. This is impressive because, before AlphaGo, no computer program had beaten a professional human player in Go.

In Go, a game is formed by a sequence of alternating moves by the two players. To find the next move, AlphaGo plays millions of games from the current board situation by itself. It then chooses the move that leads to the highest score (a rough sketch of this idea is given after the list below). To do so, AlphaGo needs to do two things:

  1. shortlist sensible moves; and
  2. evaluate how good a board situation is.
These two tasks are explained below.
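To make this concrete, here is a minimal sketch, in Python, of the simulate-and-pick idea: play many games after each candidate move and keep the move with the best winning rate. The helpers legal_moves, play and simulate_to_end are hypothetical placeholders; AlphaGo's actual search is far more sophisticated than this.

  def choose_move(board, legal_moves, play, simulate_to_end, games_per_move=1000):
      """Pick the move whose simulated games score best for the player to move.

      legal_moves(board) lists candidate moves, play(board, move) returns the
      resulting board, and simulate_to_end(board) plays one game out by itself,
      returning 1 for a win and 0 for a loss.  All three are hypothetical helpers.
      """
      best_move, best_score = None, -1.0
      for move in legal_moves(board):
          wins = sum(simulate_to_end(play(board, move)) for _ in range(games_per_move))
          score = wins / games_per_move        # estimated winning rate after this move
          if score > best_score:
              best_move, best_score = move, score
      return best_move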

Task 1: moves shortlisting

In Task 1, AlphaGo shortlists moves. If AlphaGo were to consider all legal moves on the board, it would not be able to play many games, because of the large number of combinations available. If it shortlists too few moves, it may miss good ones. AlphaGo uses (a) expert input and (b) machine learning to learn which moves to consider. Experts suggest what "features" make a move valuable. For example, the positions around the last two moves tend to be worth considering, and a position that kills a group tends to be worth considering too. Exactly how these features should be combined is the job of machine learning. This includes (i) learning from 30 million positions played by experts; and (ii) self-play.
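As an illustration of how expert-suggested features and learned weights might be combined, here is a minimal sketch in Python. The feature names, the weights and the linear-plus-softmax form are invented for this example; AlphaGo's real move shortlisting combines far richer features through a deep neural network.

  import math

  # Hypothetical expert-suggested features of a candidate move; in a real
  # program these would be computed from the board and the move.
  def move_features(board, move):
      return {
          "near_last_two_moves": 1.0,   # move is close to either of the last two moves
          "captures_group":      0.0,   # move kills an opposing group
          "extends_own_group":   1.0,   # move strengthens one of our own groups
      }

  # Weights of this kind are what machine learning tunes from expert games
  # and self-play; the numbers here are made up.
  weights = {"near_last_two_moves": 0.8, "captures_group": 1.5, "extends_own_group": 0.4}

  def shortlist(board, candidate_moves, top_k=10):
      """Score every candidate move and keep only the top_k most promising ones."""
      def score(move):
          f = move_features(board, move)
          return sum(weights[name] * value for name, value in f.items())
      # A softmax turns raw scores into a probability of considering each move.
      scores = [score(m) for m in candidate_moves]
      z = sum(math.exp(s) for s in scores)
      ranked = sorted(((m, math.exp(s) / z) for m, s in zip(candidate_moves, scores)),
                      key=lambda pair: pair[1], reverse=True)
      return ranked[:top_k]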

Task 2: board evaluation

Task 2 is to score a board situation. If a game has been played to the end, then AlphaGo knows which side is better. However, if the game sequence does not end with one side winning, AlphaGo has to estimate how favourable the position is for either side. If the estimation is grossly inaccurate, then the simulation is futile or even misleading.

It is unclear in the publication exactly how AlphaGo evaluates a board situation. Roughly speaking, experts identify factors that they consider important for evaluating a board position. For example, how much territory is occupied by either side? How many pieces have been captured by either side? AlphaGo uses the move generator (which is part of Task 1) to play many games by itself, in order to learn how those factors should be combined.
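As an illustration only, an evaluation of this kind could combine such factors linearly. The factor names and weights below are invented; the publication does not say that AlphaGo's evaluation takes this simple form.

  # Hypothetical evaluation factors suggested by experts; the weights are what
  # self-play learning would tune.  The linear form is only an illustration.
  def evaluate(board_factors, weights):
      """Return a score > 0 if the position looks better for Black, < 0 for White."""
      return sum(weights[name] * value for name, value in board_factors.items())

  weights = {"territory_difference": 1.0, "captures_difference": 0.5, "groups_in_danger": -0.8}

  position = {"territory_difference": 6.0,   # Black leads by six points of territory
              "captures_difference": -2.0,   # White has captured two more stones
              "groups_in_danger": 1.0}       # one Black group is under attack

  print(evaluate(position, weights))         # 1.0*6.0 + 0.5*(-2.0) - 0.8*1.0 = 4.2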

Where does AlphaGo's power come from?

"Artificial intelligence beats human beings" is an exciting headline, hence machine learning gets most of the publicity. In reality, AlphaGo's power comes from a combination of the following sources:

  1. Human knowledge:
    Human knowledge is channelled into AlphaGo through (i) the 30 million positions used for training; and (ii) the features that the move-generation and board-evaluation networks use as input;
  2. Machine learning:
    Through neural networks and simulation;
  3. Computer processing power:
    It is worth noting that the AlphaGo software today, unaltered, will be substantially more powerful in three years' time when Deepmind updates the processors;
  4. Intelligence from the Google Deepmind team, who put the above together:
    Other teams may attempt to put the above together, but they would not necessarily be able to build a program as strong as AlphaGo.


A bit of technical information

For Task 1, AlphaGo generates moves using an artificial neural network, which is called a “Policy Network”. The connections in this network and the weights associated with them determine how the input features are combined with each other. The weights on the connections were learned by (a) supervised learning, followed by (b) reinforcement learning through self-play.
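As a very small sketch of the supervised-learning step, assuming NumPy and a single-layer softmax in place of AlphaGo's deep convolutional Policy Network (the feature size and training data below are invented):

  import numpy as np

  rng = np.random.default_rng(0)
  n_features, n_moves = 50, 19 * 19            # toy feature size; one output per board point

  W = np.zeros((n_features, n_moves))          # the weights that learning adjusts

  def policy(features, W):
      """Probability of each move, given features describing the board."""
      logits = features @ W
      exp = np.exp(logits - logits.max())
      return exp / exp.sum()

  # Fake training data standing in for the 30 million expert positions.
  positions = rng.normal(size=(1000, n_features))
  expert_moves = rng.integers(0, n_moves, size=1000)

  lr = 0.05
  for features, move in zip(positions, expert_moves):
      probs = policy(features, W)
      grad = np.outer(features, probs)         # gradient of the cross-entropy loss...
      grad[:, move] -= features                # ...pushed towards the expert's move
      W -= lr * grad

Reinforcement learning then adjusts weights of the same kind further, using the outcomes of self-play games, rather than expert moves, as the training signal.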

For Task 2, AlphaGo uses a “Valuation Network”, another neural network, to assess the fitness of a given board situation. The weights on its connections were learned through simulation using the Policy Network from Task 1.
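A minimal sketch of the value-learning idea, under the same assumptions (NumPy, a toy linear-plus-tanh model in place of the real Valuation Network, and made-up data standing in for self-play positions and outcomes):

  import numpy as np

  rng = np.random.default_rng(1)
  n_features = 50

  boards = rng.normal(size=(5000, n_features))      # features of sampled positions
  outcomes = rng.choice([-1.0, 1.0], size=5000)     # +1 if Black won the game, -1 if White won

  v = np.zeros(n_features)                          # the weights of the value estimate
  lr = 0.01
  for x, z in zip(boards, outcomes):
      prediction = np.tanh(x @ v)                   # estimated value, in [-1, 1]
      error = prediction - z
      v -= lr * error * (1 - prediction ** 2) * x   # gradient step on the squared error

  def value(board_features, v):
      """Estimate, in [-1, 1], how favourable the position is for Black."""
      return np.tanh(board_features @ v)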

More on the Technical Side

Policy Network representation:

Valuation Network representation:

Machine learning:

  1. Use supervised learning to train a "SL policy network"
    The input to this network is a board situation and the output is a move. Supervised training used 30,000,000 board positions to train the weights.
  2. Use Reinforcement Learning (RL) to generate a RL policy network
    Games are played between programs that use the SL policy network. The outcomes are used to improve the weights in the SL policy network.
  3. Use the RL network to train a valuation function
    Given a board, the valuation function rates its fitness. This is the key to success. Monte Carlo Tree Search is used to train the weights of a neural network, which outputs the fitness of a given board situation. A simplified sketch of how such a tree search chooses which move to explore is given after this list.
  4. Hardware
    • "The final version of AlphaGo used 40 search threads, 48 CPUs and 8 GPUs" (page 4); i.e. 56 processors were used;
    • A distributed version used "40 search threads, 1,202 CPUs and 176 GPUs"; i.e. 1,378 processors were used. This is the version that played Lee Sedol.
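The publication also describes how the tree search combines the two networks when deciding which move to examine next. As a simplified sketch only (not AlphaGo's full algorithm), a selection rule of this kind prefers moves with a high estimated value, a high policy probability, and few visits so far:

  import math

  def select_move(moves, prior, q_value, visit_count, c_puct=1.0):
      """Pick the move to explore next: exploit high value, explore high prior / few visits.

      prior[m] is the Policy Network's probability for move m, q_value[m] the average
      value of simulations through m, and visit_count[m] how often m has been explored.
      """
      total_visits = sum(visit_count[m] for m in moves)
      def ucb(m):
          exploration = c_puct * prior[m] * math.sqrt(total_visits) / (1 + visit_count[m])
          return q_value[m] + exploration
      return max(moves, key=ucb)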

Actual play [this is the author's interpretation based on the literature]:

[End]


This summary is mainly based on the AlphaGo publication. The author welcomes corrections.