AlphaGo is a computer program that recently beat a top professional human player, Lee Sedol. It uses (a) expert knowledge and (b) machine learning, a branch of Artificial Intelligence.

Note: "AI beats human in Go" makes an attractive headline, but one must not belittle the contributions of human intelligence and computer hardware to AlphaGo.
In Go, a game is formed by a sequence of alternating moves by two players. To find the next move, AlphaGo plays millions of games from the current board situation by itself, then chooses the move that leads to the highest score. To do this, AlphaGo needs to do two things:
Task 1 is to shortlist candidate moves. If AlphaGo were to consider every legal move on the board, it would not be able to play many games, because of the enormous number of combinations; if it shortlists too few moves, it may miss good ones. AlphaGo uses (a) expert input and (b) machine learning to decide which moves to consider. Experts suggest which "features" make a move valuable. For example, positions around the last two moves tend to be worth considering, and a position that kills a group tends to be worth considering too. Exactly how these features should be combined is the job of machine learning, which includes (i) learning from 30 million positions played by experts, and (ii) self-play.
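How the features of a candidate move might be "combined" can be illustrated with a minimal linear model in Python. The feature names and weights below are invented for illustration; AlphaGo's real model is a deep neural network, not a hand-coded table:

```python
import math

# Hypothetical hand-picked features of a candidate move, each valued 0 or 1.
# Machine learning's job is to find the weights; these numbers are made up.
WEIGHTS = {"near_last_two_moves": 1.5, "kills_a_group": 2.0, "on_edge": -0.5}

def move_score(features):
    """Linear combination of a move's features with learned weights."""
    return sum(WEIGHTS[name] * value for name, value in features.items())

def shortlist(moves, k=2):
    """Keep only the k highest-scoring moves, softmax-normalised to probabilities."""
    scored = sorted(moves.items(), key=lambda mv: move_score(mv[1]), reverse=True)[:k]
    total = sum(math.exp(move_score(f)) for _, f in scored)
    return {m: math.exp(move_score(f)) / total for m, f in scored}
```

A move near the last two moves that also kills a group scores 1.5 + 2.0 and dominates the shortlist, while an edge move is penalised; only the surviving moves would be explored by simulation.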
Task 2 is to score a board situation. If a game has been played to the end, AlphaGo knows which side won. However, if the game sequence does not end with one side winning, AlphaGo has to estimate how favourable the situation is for either side. If this estimate is grossly inaccurate, the simulation is futile or even misleading.
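The two tasks serve a single simulation loop: from the current position, play out games for each shortlisted move and keep the move whose playouts score best. A minimal sketch, assuming hypothetical game-interface functions (`legal_moves`, `play`, `playout`):

```python
def choose_move(position, legal_moves, play, playout, n_games=1000):
    """Pick the move whose simulated games score best.

    legal_moves(position) -> shortlist of candidate moves (Task 1)
    play(position, move)  -> resulting position
    playout(position)     -> score of one simulated game from there (Task 2)
    """
    best_move, best_score = None, float("-inf")
    for move in legal_moves(position):
        next_pos = play(position, move)
        # Average score over many self-played games starting from this move.
        score = sum(playout(next_pos) for _ in range(n_games)) / n_games
        if score > best_score:
            best_move, best_score = move, score
    return best_move
```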
It is unclear from the publication exactly how AlphaGo evaluates a board situation. Roughly speaking, experts identify factors that they consider important for evaluating a board position: for example, how much territory is occupied by either side? How many pieces have been captured by either side? AlphaGo uses the move generator (part of Task 1) to play many games by itself, in order to learn how those factors should be combined.
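Learning "how those factors should be combined" can be sketched as a regression problem: each self-played game yields the factor values plus the eventual outcome, and gradient descent fits the weights. The two factors and all the numbers below are invented for illustration:

```python
import math

# Hypothetical factors of a board situation: (territory lead, capture lead),
# each labelled with the eventual outcome of a self-played game (1 = win, 0 = loss).
DATA = [((10, 3), 1), ((8, 1), 1), ((1, 2), 1),
        ((-5, 0), 0), ((-2, -4), 0), ((-1, -1), 0)]

def evaluate(factors, w, b):
    """Combine the factors with learned weights; squash to a win probability."""
    z = sum(wi * fi for wi, fi in zip(w, factors)) + b
    return 1.0 / (1.0 + math.exp(-z))

def learn_weights(data, lr=0.1, epochs=500):
    """Fit the factor weights by logistic regression (gradient descent)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for factors, outcome in data:
            err = evaluate(factors, w, b) - outcome
            w = [wi - lr * err * fi for wi, fi in zip(w, factors)]
            b -= lr * err
    return w, b
```

After training, a large territory lead evaluates close to 1 (a likely win) and a large deficit close to 0; a grossly wrong fit here is exactly what would make the simulations misleading.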
"Artificial intelligence beats human beings" makes an exciting headline, hence machine learning gets most of the publicity. In reality, AlphaGo's power comes from a combination of the following sources:
For Task 1, AlphaGo generates moves using an artificial neural network called the “Policy Network”. The connections in this network, and the weights associated with them, determine how the input features are combined. The weights were learned by (a) supervised learning from expert games, followed by (b) reinforcement learning through self-play.
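A single-layer stand-in for the Policy Network, with one supervised-learning update, might look like this in Python (the real network is deep, and the per-move feature vectors here are placeholders):

```python
import math

def softmax(zs):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(zs)
    es = [math.exp(z - m) for z in zs]
    s = sum(es)
    return [e / s for e in es]

def policy(weights, features_per_move):
    """One-layer stand-in for the Policy Network: each candidate move gets a
    score (weights . features), and softmax turns scores into move probabilities."""
    return softmax([sum(w * f for w, f in zip(weights, fs)) for fs in features_per_move])

def supervised_step(weights, features_per_move, expert_move, lr=0.5):
    """One supervised-learning update: shift weight toward the expert's move."""
    probs = policy(weights, features_per_move)
    new_w = list(weights)
    for i, fs in enumerate(features_per_move):
        # Gradient of log-likelihood of the expert's choice under softmax.
        grad = (1.0 if i == expert_move else 0.0) - probs[i]
        new_w = [w + lr * grad * f for w, f in zip(new_w, fs)]
    return new_w
```

Repeating `supervised_step` over expert positions drives the probability of the expert's move up; the reinforcement-learning phase would instead reward moves that led to wins in self-play.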
For Task 2, AlphaGo uses a “Valuation Network” to estimate how favourable a given board situation is. It is also a neural network; the weights on its connections were learned through self-play simulation using the Policy Network from Task 1.
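The data-generation step for this training — use the Policy Network to self-play and record who won — can be sketched as follows, with hypothetical stand-ins for the game interface:

```python
def self_play_dataset(initial_position, policy_move, play, game_over, winner, n_games=100):
    """Generate (position, outcome) training pairs for the value model by self-play.

    policy_move(position) -> move chosen by the policy network
    play(position, move)  -> next position
    game_over(position)   -> True when the game has ended
    winner(position)      -> outcome of a finished game (e.g. +1 or -1)
    """
    dataset = []
    for _ in range(n_games):
        pos, history = initial_position, []
        while not game_over(pos):
            history.append(pos)
            pos = play(pos, policy_move(pos))
        outcome = winner(pos)
        # Label every position visited in this game with the final outcome.
        dataset.extend((p, outcome) for p in history)
    return dataset
```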
Policy Network representation:
Valuation Network representation:
Machine learning:
Actual play [this is the author's interpretation based on the literature]:
[End]