Google’s DeepMind group recently demonstrated a Go-playing application, AlphaGo, that defeated Lee Sedol, the world’s leading Go player. (In a five-game match, AlphaGo won the first three and the fifth games.) This is the first time a professional 9-dan Go player has been beaten by a Go-playing program. Go is an ancient East Asian board game played on a 19x19 grid. The players alternate placing black or white “stones” on the points created by the intersecting lines. The goal of the game is to end up controlling the most space on the board. Play is defined by a very precise set of rules.
When IBM’s Deep Blue beat world chess champion Garry Kasparov in 1997, AI experts immediately began to think about how they could build a computer that could play and defeat a human Go player (see Figure 1), since Go was the only perfect information game of strategy that everyone acknowledged was more difficult than chess.1 The first move of a chess game offers 20 possibilities. The first move in a Go game can involve placing a “black” stone on any one of 361 positions. The second player, who plays “white” stones, then responds by placing a stone on any one of the 360 remaining positions. A typical chess game lasts around 80 moves, while a typical Go game lasts around 150 moves.
Figure 1 — Playing Go.
Both games have explicit moves and rules that, theoretically, would allow an analyst to create a branching diagram to explore all logical possibilities. In both cases, however, the combinations are so vast that an exhaustive analysis is impossible. The search space for chess is generally said to be around 10^47 positions, while the search space for Go is generally held to be around 10^170, a number far greater than the number of atoms in the observable universe. (Even if you could do the analysis, I’ve often joked, you couldn’t get enough atoms of ink to print out the answer.)
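For a rough sense of the arithmetic behind those numbers, the size of a game tree can be approximated as b^d, where b is the average number of legal moves available on each turn (the branching factor) and d is the length of a typical game. The figures below use commonly cited rough averages for each game rather than anything from the AlphaGo work itself:

```latex
% Approximate game-tree sizes from a commonly cited average branching
% factor (b) and typical game length (d); both are rough estimates.
\text{chess: } b^{d} \approx 35^{80} \approx 10^{123}
\qquad\qquad
\text{Go: } b^{d} \approx 250^{150} \approx 10^{360}
```

These game-tree estimates are larger still than the search-space figures quoted above, which is why neither game can be settled by brute-force enumeration.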
Perhaps the most important difference between chess and Go is that it’s easy to know who’s winning a chess game. The conclusion of the game is obvious: it occurs when one player’s king is checkmated. Pieces are given values, and one can roughly evaluate a given series of moves by determining who captures the most enemy pieces. During most Go play, at least among masters, it is hard to tell who is winning. Moreover, during play an entire board can change radically if a player, while developing a pattern surrounding an area, makes a single mistake. In such a case, the opponent may quickly reverse the pattern, and the player seeking to surround may find himself surrounded. Nor does a Go game end at any specific point; it ends only when both players agree there are no more useful moves. Then the players “rearrange” the board, removing and replacing stones until it’s possible to count the spaces (territory) each player controls.
Ultimately, IBM’s Deep Blue depended on brute-force look-ahead search (minimax with alpha-beta pruning) when it played chess. In essence, at any point when it had to move, Deep Blue considered all the possible moves it might reasonably make. Then, one at a time, it would assume it made a given move, consider each possible response its opponent might make, then each possible subsequent move it might make, and so forth. Each of these projected paths was evaluated by tallying the pieces captured along the way. As Deep Blue became more powerful, it was able to calculate five, seven, and then even more moves ahead. (Deep Blue was able to project and evaluate 200 million positions per second.) In the end, the program always chose the path that gained it the most points, and as soon as the human player responded, Deep Blue did the whole thing all over again.
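The sketch below shows the core of that kind of brute-force look-ahead: a depth-limited minimax search with alpha-beta pruning. The GameState interface (legal_moves, play, evaluate, is_over) is hypothetical and stands in for whatever board representation a real engine uses; Deep Blue’s actual implementation was vastly more elaborate and ran largely in specialized hardware.

```python
# A minimal sketch of Deep Blue-style look-ahead: depth-limited minimax
# with alpha-beta pruning. The game-state methods are assumptions.
import math

def alphabeta(state, depth, alpha=-math.inf, beta=math.inf, maximizing=True):
    """Return the best achievable score from `state`, searching `depth` plies."""
    if depth == 0 or state.is_over():
        return state.evaluate()          # e.g., material balance in chess
    if maximizing:
        best = -math.inf
        for move in state.legal_moves():
            best = max(best, alphabeta(state.play(move), depth - 1, alpha, beta, False))
            alpha = max(alpha, best)
            if alpha >= beta:            # opponent would avoid this branch; prune it
                break
        return best
    else:
        best = math.inf
        for move in state.legal_moves():
            best = min(best, alphabeta(state.play(move), depth - 1, alpha, beta, True))
            beta = min(beta, best)
            if alpha >= beta:
                break
        return best

def choose_move(state, depth=5):
    """Pick the move whose subtree scores best for the side to move."""
    return max(state.legal_moves(),
               key=lambda m: alphabeta(state.play(m), depth - 1, maximizing=False))
```

The pruning step is what let programs of that era look so many moves ahead: whole branches are discarded as soon as it is clear the opponent would never allow them.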
When AI developers initially considered trying to develop a program to play Go, they used a “look forward” approach similar to Deep Blue’s. There are, in fact, several commercial products on the market, running on PCs or online, that depend on a version of Monte Carlo tree search. They are good enough to challenge amateurs but offer no serious challenge to a human Go expert. In 2014, a developer of one of the popular Go software programs estimated that it would be at least 10 years before a machine could hope to beat a professional Go player.
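Monte Carlo methods sidestep the need for a hand-built evaluation function by playing many random games (“rollouts”) from a candidate position and averaging the results. The sketch below shows the simplest, “flat” form of the idea; full Monte Carlo tree search adds a growing search tree and an exploration rule (such as UCT) on top of these rollouts. The GoState object and its legal_moves, play, is_over, and winner methods are assumptions, not any particular program’s API.

```python
import random

def rollout(state, player):
    """Play random moves to the end of the game; return 1 if `player` wins, else 0."""
    while not state.is_over():
        state = state.play(random.choice(state.legal_moves()))
    return 1 if state.winner() == player else 0

def monte_carlo_move(state, player, playouts_per_move=200):
    """Score each candidate move by its average rollout result and pick the best."""
    def win_rate(move):
        results = [rollout(state.play(move), player) for _ in range(playouts_per_move)]
        return sum(results) / len(results)
    return max(state.legal_moves(), key=win_rate)
```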
To avoid some of the confusion that surrounded the victory of IBM’s Deep Blue over Kasparov in 1997, when no one was quite sure how Deep Blue worked, Google published a detailed description of AlphaGo in one of the leading science journals, Nature, shortly after its initial victory in October 2015, when it defeated European Go champion Fan Hui. This article made it possible to examine AlphaGo’s architecture and development in more detail than is possible for most similarly impressive commercial applications.
So how does AlphaGo work? First, the core of AlphaGo was not developed as a software package to play Go. The basic neural net architecture used in AlphaGo was initially developed to play Atari video games. The Atari-playing program was designed to “look” at computer screens (matrices of pixels) and respond to them. When DeepMind subsequently decided to tackle the Go-playing problem, it repurposed that software package. The input that AlphaGo uses is a detailed 19x19 matrix of a Go board with all the stones that have been played on it. The key point, however, is that the underlying AlphaGo platform is based on a generic software package designed to learn to play games; it is not a specially developed Go-playing program.
Second, AlphaGo uses Monte Carlo tree search, as most earlier Go-playing programs do, but it relies far less on raw search; AlphaGo largely depends on two deep neural nets. The first is termed a policy network and is trained to predict the next move for a given board state. Once an opponent makes a move, the policy network considers and chooses the moves that AlphaGo might make. It can also consider how the opponent might respond to each of those moves, suggest the move AlphaGo would make after that, and so on. This is used to guide the tree search, focusing it on the most promising nodes. The network’s predictions are based on what AlphaGo has learned from historic human play and from reinforcement learning, when it learned what an opponent might do in the millions of games it played against a variation of itself. In fact, there are two variations on this policy network. One version is more exhaustive and careful in its analysis. A second, fast version is built to race ahead from a given position and play out the entire game to determine whether the move(s) under consideration result in a win. This keeps the program focused on its ultimate goal, winning the game, rather than on maximizing its gains over a limited set of moves. The two versions are used together: the first proposes several desirable moves, and the second quickly checks whether a chosen move leads to a win. The policy network is a 13-layer network.
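As a very rough illustration, a policy network maps some encoding of the board to a probability for each of the 361 points. The PyTorch sketch below is far shallower than AlphaGo’s 13-layer network, and its layer and channel sizes are invented; it is only meant to show the shape of the idea.

```python
import torch
import torch.nn as nn

class TinyPolicyNet(nn.Module):
    """Toy policy network: board feature planes in, move probabilities out.
    Layer count and channel sizes are illustrative, not AlphaGo's."""
    def __init__(self, in_planes=3, channels=32, board_size=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(channels * board_size * board_size, board_size * board_size)

    def forward(self, board_planes):                  # (batch, planes, 19, 19)
        x = self.trunk(board_planes)
        logits = self.head(x.flatten(start_dim=1))    # one logit per board point
        return torch.softmax(logits, dim=-1)          # probability of playing each point
```

The convolutional layers let the network pick up local stone patterns anywhere on the board, which is one reason this style of network suits a game played on a uniform grid.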
The second neural network, called a value network, is trained to evaluate a board state and determine the likelihood that it will lead to a win. This plays a role rather like the random playouts used in Monte Carlo tree search, or the evaluation function used in alpha-beta search in chess. In other words, the value network has solved the problem of identifying what a “win” looks like in Go and, working backward, can identify board states more likely to lead to wins. In essence, AlphaGo has evaluated hundreds of thousands of actual Go games and has developed an idea of which patterns of board states precede a win in those games. Working together, the policy network proposes likely moves and the value network evaluates them, so that the specific move most likely to result in a win can be chosen. Note that the policy and value networks are not focused on massive tree search (although AlphaGo uses some when necessary) but on identifying the specific move to make next. Thus, in practice, AlphaGo evaluates thousands of times fewer positions than Deep Blue did in its chess match against Kasparov, selecting the positions it does evaluate more carefully (using the policy network) and evaluating them more intelligently (using the value network).
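Continuing the toy sketch above, a value network maps a board encoding to a single estimated win probability, and the two networks can then be combined to pick a move. In the sketch below, the value network stands in for the full search that the real system performs, and the encode_after_move helper is hypothetical, as are all the layer sizes.

```python
import torch
import torch.nn as nn

class TinyValueNet(nn.Module):
    """Toy value network: board feature planes in, estimated win probability out."""
    def __init__(self, in_planes=3, channels=32, board_size=19):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(in_planes, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Linear(channels * board_size * board_size, 1)

    def forward(self, board_planes):
        x = self.trunk(board_planes).flatten(start_dim=1)
        return torch.sigmoid(self.head(x))            # P(current player wins)

def select_move(policy_net, value_net, board_planes, encode_after_move, top_k=10):
    """Let the policy network propose a handful of moves, then pick the one whose
    resulting position the value network likes best. (The real AlphaGo blends the
    value estimate with rollout results inside a Monte Carlo tree search.)"""
    probs = policy_net(board_planes)[0]               # move probabilities for one board
    candidates = torch.topk(probs, top_k).indices.tolist()
    return max(candidates,
               key=lambda mv: value_net(encode_after_move(board_planes, mv)).item())
```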
AlphaGo is managed by a single master machine that coordinates the various networks and rollouts and adjusts its use of resources to the time allowed at various points in a game. As already noted, the basic unit being evaluated by AlphaGo is the entire Go board. The input for the neural networks is a representation of the entire 19x19 Go board with all the black and white stones in place. In effect, AlphaGo “looks” at the actual board and state of play and then treats that complete pattern as one unit. A winning game is a board with hundreds of stones in place; the unit that preceded the winning board was a board with all the final stones save one, and so forth. A few years ago, no computer would have been able to handle the amount of data that AlphaGo manipulates to “consider” board states. (In a similar way, much of IBM Watson’s usefulness lies in its ability to ask questions and provide answers in human language. This natural language facility isn’t really part of the core reasoning going on in Watson, but it adds a huge amount of utility to the overall application. Likewise, the ability of AlphaGo to work from representations of actual Go boards with their stones in place adds an immense amount of utility to AlphaGo when it’s presented as a Go-playing application.)
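One simple way to turn a whole board into network input is to stack a few binary “planes,” one per kind of information. The sketch below uses just three planes; AlphaGo’s real input stacks many more (encoding liberties, capture counts, move history, and so on), so treat this purely as an illustration of the encoding idea.

```python
import numpy as np

def encode_board(board, player):
    """Encode a 19x19 board as three binary feature planes: the current player's
    stones, the opponent's stones, and empty points.
    `board` is a 19x19 array of 0 (empty), 1 (black), 2 (white); `player` is 1 or 2."""
    board = np.asarray(board)
    own      = (board == player).astype(np.float32)
    opponent = ((board != player) & (board != 0)).astype(np.float32)
    empty    = (board == 0).astype(np.float32)
    return np.stack([own, opponent, empty])           # shape: (3, 19, 19)
```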
One more thing to note: AlphaGo examined hundreds of thousands of Go games as it learned to identify likely next moves and the board states that lead to a win. A few decades ago, it would have been impossible to obtain detailed examples of good Go games. The games played in major tournaments have always been recorded, but most Go games were not documented. All that changed with the rise of the Internet and the Web. Today, many Go players play against Go software in the cloud, and their moves are automatically captured. Similarly, many players exchange moves online, and many sites document games. Just as business and government organizations now have huge databases that they mine for patterns, today’s Go players can draw on huge databases of Go games, and the team that developed AlphaGo drew on these databases when it initially trained AlphaGo on actual examples (i.e., supervised learning).
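In code terms, that initial supervised phase amounts to repeatedly nudging a policy network toward the move a strong human actually played in each recorded position. The sketch below assumes the toy policy network above (which outputs move probabilities), a standard PyTorch optimizer supplied by the caller, and a data pipeline, not shown, that yields batches of encoded positions with the recorded expert move for each.

```python
import torch
import torch.nn.functional as F

def supervised_step(policy_net, optimizer, board_planes, expert_moves):
    """One supervised-learning step on a batch of recorded human positions.
    `board_planes` is a batch of encoded boards; `expert_moves` holds the
    index (0..360) of the move the human actually played in each position."""
    probs = policy_net(board_planes)                       # (batch, 361) move probabilities
    loss = F.nll_loss(torch.log(probs + 1e-9), expert_moves)   # penalize wrong predictions
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```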
One key to grasping AlphaGo and other applications based on deep neural networks is to understand the role of reinforcement learning. When we developed expert systems in the late 1980s and a system failed to make a correct prediction, according to our human expert, the developers and the expert spent days or even weeks poring over the hundreds of rules in the system to see where it went wrong. Then rules were changed and tests were run to see if the specific rule changes solved the problem. Making even a small change in a large expert system was a very labor-intensive and time-consuming job.
As mentioned, AlphaGo defeated the leading European Go master in October 2015. In March 2016, it played the world’s top Go player. Predictably, he studied AlphaGo’s October games to learn how AlphaGo plays. Unfortunately for him, AlphaGo had played millions of additional games against a version of itself since October and had significantly increased its ability to judge which board states lead to victory. Unlike the expert system development team that was forced to figure out how its system failed and then make a specific improvement, the AlphaGo team simply put AlphaGo in learning mode and set it to playing games against a version of itself. After each game, AlphaGo adjusted the connection weights of its networks, reinforcing the choices that led to wins, to develop better approximations of the patterns that lead to victory. (Every so often, the version of AlphaGo it was playing against would be updated to be as strong as the current version, making subsequent games more challenging and progress even more rapid.) AlphaGo is capable of playing a million Go games a day against itself when in reinforcement learning mode.
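The core of that self-play loop can be sketched as a simple policy-gradient (REINFORCE-style) update: after a finished game, raise the probability of the moves that were played if the game was won and lower it if it was lost. This is a simplification of AlphaGo’s actual training procedure, and the self-play machinery that produces the recorded positions, moves, and outcome is assumed rather than shown.

```python
import torch

def reinforce_step(policy_net, optimizer, game_positions, game_moves, won):
    """One REINFORCE-style update after a finished self-play game.
    `game_positions` is a batch of encoded boards from the game, `game_moves`
    the moves actually chosen, and `won` is +1.0 for a win, -1.0 for a loss."""
    probs = policy_net(game_positions)                         # (N, 361)
    chosen = probs.gather(1, game_moves.unsqueeze(1))          # prob of each move played
    log_probs = torch.log(chosen + 1e-9)
    loss = -(won * log_probs).mean()                           # gradient ascent on expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the update needs only the game outcome, no human has to diagnose which individual decision was at fault, which is precisely the contrast with expert system maintenance drawn above.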
As impressive as AlphaGo’s 2015 victory over Hui was, it paled by comparison with AlphaGo’s win over Lee Sedol in 2016. Hui, while a very good player, was ranked only a 2-dan professional (he was considered roughly the 633rd best professional player in the world), while Lee was ranked a 9-dan professional and was widely considered the strongest active player in the world. After examining the games that AlphaGo played against Hui, experts were confident that Sedol could easily defeat AlphaGo. (They informally ranked AlphaGo a 5-dan player.) In fact, when the match with Sedol took place, five months after the match with Hui, everyone was amazed at how much better AlphaGo had become. What the professional Go players failed to realize was that, in the course of those five months, AlphaGo had played millions of games with itself, constantly improving its play. It was as if a human expert had managed to accumulate several additional lifetimes of experience between the October 2015 and March 2016 matches.
One 9-dan commentator at the AlphaGo-Lee Sedol match explained to an audience that he had brought copies of the October games with him and had intended to analyze them as part of his commentary to show some of AlphaGo’s weaknesses. Once he began to watch the first game in the March match with Sedol, however, he realized that the October game printouts were useless because AlphaGo was playing an entirely different and much better game. Sedol, after losing the second game, said that he was in shock and was impressed that AlphaGo played a near-perfect game.
AlphaGo was trained to maximize the probability that it would win the game. Thus, as one AlphaGo team member explained, “If AlphaGo must choose between a scenario where it will win by 20 points with 80% probability and another where it will win by 1 ½ points with 99% probability, it will choose the latter.” This explains AlphaGo’s very aggressive middle-game play combined with its rather conservative play during the endgame. It may also explain the difficulties Sedol seemed to have when he reached the endgame and found many of the moves he wanted to make already precluded. In its most recent match, AlphaGo used 1,920 processors and a further 280 GPUs, specialized chips capable of performing simple calculations in staggering quantities.
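The quoted trade-off is easy to make concrete. In the toy comparison below, each candidate is a hypothetical (label, win probability, expected margin) triple; an objective that cares only about the probability of winning picks the narrow-but-safe option every time.

```python
def pick_by_win_probability(candidates):
    """Choose the candidate with the highest win probability, ignoring margin.
    Each candidate is a hypothetical (label, win_probability, expected_margin) triple."""
    return max(candidates, key=lambda c: c[1])

# The 80%-chance, 20-point win loses out to the 99%-chance, 1.5-point win:
example = [("aggressive", 0.80, 20.0), ("safe", 0.99, 1.5)]
print(pick_by_win_probability(example))   # -> ('safe', 0.99, 1.5)
```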
In summary, AlphaGo is a major breakthrough in cognitive computing: it is a software application that can successfully play the hardest strategy game people play and beat human experts at it. It represents a major advance over the chess- and Go-playing software considered state-of-the-art only two years ago. This advance was achieved by creating an architecture that combined existing techniques (e.g., Monte Carlo tree search) with two different neural networks. It also incorporated a visual pattern analysis front end that allowed the application to interact in a much more “human” manner with its opponents. The breakthrough cannot be assigned to any single technical development; even deep neural networks have been in use for well over a decade. Instead, it came from developers learning to combine different techniques and to draw on large databases of games to solve the challenge of Go.
If any techniques deserve special attention, they are deep learning, with the various techniques that allow the construction of networks with many layers of hidden processing elements, and reinforcement learning, which the AlphaGo team used so successfully to let AlphaGo play itself over and over, constantly learning to play a little better. AlphaGo’s ability to become significantly better in five months impressed everyone involved in the match. Presumably, AlphaGo could continue to play itself and be much better, a few months from now, than it is today. That might not mean much for Go itself, since AlphaGo has already defeated the best human player, but it might mean quite a lot once AlphaGo’s technology is focused on designing a better solar panel or recommending cancer treatments. The software platform on which AlphaGo was built is a generic package of learning networks and tools that can, and presumably will, be used to solve other problems.
Endnote
1. A perfect information game is a game that has well-defined rules and in which all the elements and moves are known to all players at all times. In theory, it is always possible to analyze such a game to determine whether there is a set of moves that will guarantee a win. This can't be done in practice, however, if the number of possible move sequences is so great that it exceeds the ability of our fastest computers to do the analysis. Both chess and Go fall into this latter category: their game trees are so vast that complete enumeration is impossible. (In complexity-theoretic terms, generalized versions of both games are believed to be even harder than the classic NP-complete problems.)
[For more from the author on this topic, see "The Nature of Cognitive Computing, Part I."]