There has been quite a bit of discussion about whether Alphago can learn from the games it plays against Lee Sedol. I think not. At least, not directly.
The heart of the program is the ”policy network” a convolutional neural network (CNN) that was designed for image processing. CNNs return a probability that a given image belongs to each of a predefined set of classifications, like ”cat”, ”horse”, etc. CNNs work astonishingly well, but have the weakness that they can only be used with a fix size image to estimate a fixed set of classifications.
The policy network views go positions as 19í—19 images and returns probabilities that human players would make one of 361 possible moves. This probability drives with the Monte Carlo tree search for good moves that has been used for some time in go computers.The policy network is trained on 30 million positions (or moves) initially.
CNN (aka ”deep learning”) behavior is pretty well understood. The number of samples required for learning depends on the complexity of the model. A model of this complexity probably requires tes of thousands of example positions before it changes much.
The number of samples required to train any machine learning program depends on the complexity of the strategy, not on the number of possible positions. For example, Gomoku ("five in a row", also called goban) on a 19í—19 board would take many fewer examples to train than go would, even though the number of possible positions is also very large.
Another point: Any machine learning algorithm will eventually hit a training limit, after which it wonâ€™t be able to improve itself by more training. After that, a new algorithm based on a new model of game play would be required to improve the play. It is interesting that the Alphago team seems to be actively seeking ideas in this area. Maybe that is because they are starting to hit a limit, but maybe it's just because they are looking into the future.
So Alphago probably canâ€™t improve its play measurably by playing any single player five times, no matter how strong. That would be ”overfitting”. The team will be learning from the comments of the pro players and modifying the program to improve it instead.
Interesting tidbit: Alphago said the chances of a human playing move 37 in game 2 was 1 in 10,000. So the policy network doesnâ€™t decide everything.