One-Page AlphaGo -- Understand AlphaGo's Core Algorithm in 10 Minutes

A popular-science piece with no mathematical formulas; the PDF version fits on exactly one page, and a GitHub link is provided. A little background in deep learning is needed to follow this article.

This review summarizes the 2016 breakthrough in the game of Go [1].

Solving a game of perfect information can usually be boiled down to searching for the optimal value in a tree containing roughly b^d possible sequences of moves, where in chess b ≈ 35, d ≈ 80 and in Go b ≈ 250, d ≈ 150 (on the order of 10^360 sequences). Tackling Go with exhaustive search or simple heuristics is therefore infeasible. However, there are some effective strategies we can apply: (1) reduce the breadth of search by sampling actions from a policy p(a|s); (2) reduce the depth of search by position evaluation; (3) combine policy and value with Monte Carlo tree search (MCTS). The general steps are:

  • Train a supervised learning (SL) policy network p_σ directly from human moves, using a 13-layer convolutional neural network. The input is a 48 × 19 × 19 stack of feature planes (for example, the stone-color component occupies 3 × 19 × 19 of it), and the output is a probability distribution over all legal moves produced by a softmax layer. Move-prediction accuracy is 55.7%. A minimal sketch of such a network is given after this list.

  • Train a fast rollout policy p_π that can rapidly sample actions during rollouts. It uses a linear softmax of small pattern features. Its accuracy is only 24.2%, but it takes just 2 μs to select a move, compared to 3 ms for p_σ (see the rollout-policy sketch after this list).

  • Train a reinforcement learning (RL) policy network p_ρ to improve the SL policy network by optimizing for the outcome of the game. This adjusts the policy network towards winning the game instead of maximizing prediction accuracy. p_ρ has essentially the same structure as p_σ, and its weights are initialized with the same values, ρ = σ. Games are played between the current policy network p_ρ and a randomly selected previous iteration of the policy network (randomizing the opponent combats overfitting). A sketch of the policy-gradient update follows this list.

  • Train a value network v_θ to predict the winner of games played by the RL policy network against itself. This network has an architecture similar to the policy network, but with one extra input feature plane (the current player's color) and with the output changed to a single prediction (a regression trained with a mean-squared-error loss). Predicting game outcomes from complete games easily leads to overfitting, since successive positions are strongly correlated, differing by just one stone. So newly generated self-play data from the RL policy network is used: 30 million distinct positions, each sampled from a separate game. A value-network sketch is given below.

  • Combine the SL policy network, the value network and the fast rollout policy with MCTS. A standard MCTS procedure contains four steps: selection, expansion, evaluation and backup. To make it easier to understand, we only roughly describe how a state is scored during simulation (refer to the original paper for the exact formulation if you are interested):

state score = value network output + fast rollout policy outcome + SL policy network output

The move with the highest state score is selected. The value network output and the fast rollout outcome are evaluation signals computed at the leaf node (note that obtaining the fast rollout outcome requires playing the rollout to a terminal position); the SL policy network output is the prior probability of the action at the current state, acting as a bonus in the selection score. This bonus is decayed by the number of visits to encourage exploration. Note that the RL policy network is only used to help derive the value network and is not used directly in MCTS. A simplified sketch of this selection rule is given below.
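The sketches below are rough Python/PyTorch illustrations of the steps above, under assumed layer sizes and hyper-parameters rather than the paper's exact configuration. First, a minimal 13-layer convolutional policy network taking the 48 × 19 × 19 feature planes and producing a distribution over the 361 board points (legality masking omitted):

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Illustrative 13-layer convolutional policy network (p_sigma)."""
    def __init__(self, in_planes=48, width=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, width, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):                              # eleven hidden 3x3 conv layers
            layers += [nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(width, 1, kernel_size=1)]   # 13th conv: one logit per board point
        self.net = nn.Sequential(*layers)

    def forward(self, x):                          # x: (batch, 48, 19, 19)
        logits = self.net(x).flatten(1)            # (batch, 361); legality masking omitted
        return torch.log_softmax(logits, dim=1)    # log-probabilities over board points

# one supervised step: maximize the log-probability of the human expert move
policy = PolicyNet()
optimizer = torch.optim.SGD(policy.parameters(), lr=0.003)
states = torch.randn(8, 48, 19, 19)          # dummy feature planes
expert_moves = torch.randint(0, 361, (8,))   # dummy expert move indices
loss = nn.functional.nll_loss(policy(states), expert_moves)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```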
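The fast rollout policy p_π can be pictured as nothing more than a linear softmax over sparse, hand-crafted local pattern features; the feature count and toy indices below are assumptions for illustration:

```python
import numpy as np

class RolloutPolicy:
    """Illustrative linear-softmax rollout policy (p_pi) over sparse pattern features."""
    def __init__(self, n_features=100_000, seed=0):
        self.weights = np.zeros(n_features)          # one weight per local pattern feature
        self.rng = np.random.default_rng(seed)

    def sample_move(self, features_per_move):
        # features_per_move: one list of active pattern-feature indices per legal move
        scores = np.array([self.weights[idx].sum() for idx in features_per_move])
        probs = np.exp(scores - scores.max())        # linear softmax over legal moves
        probs /= probs.sum()
        return self.rng.choice(len(features_per_move), p=probs)

# toy usage: three legal moves, each firing a few hand-crafted pattern features
rollout_policy = RolloutPolicy()
chosen = rollout_policy.sample_move([[3, 17, 512], [42], [8, 99]])
```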
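The RL improvement step can be sketched as REINFORCE-style self-play: play a game against a randomly chosen earlier snapshot, then scale the log-probabilities of the moves actually played by the final outcome z (+1 for a win, -1 for a loss). `play_game` here is a dummy stand-in for real self-play, and the block reuses the PolicyNet class from the sketch above:

```python
import copy
import random
import torch

def reinforce_step(policy, opponent_pool, play_game, optimizer):
    """One illustrative policy-gradient update for p_rho."""
    opponent = random.choice(opponent_pool)         # random past iteration combats overfitting
    states, moves, z = play_game(policy, opponent)  # z = +1 if `policy` won the game, else -1
    log_probs = policy(states)                      # (T, 361) log-probabilities (PolicyNet above)
    chosen_logp = log_probs.gather(1, moves.unsqueeze(1)).squeeze(1)
    loss = -(z * chosen_logp).mean()                # push towards winning, not prediction accuracy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    opponent_pool.append(copy.deepcopy(policy))     # the paper grows the pool only periodically

# dummy self-play stand-in: random positions, random moves, random game outcome
def play_game(policy, opponent):
    T = 30
    return torch.randn(T, 48, 19, 19), torch.randint(0, 361, (T,)), random.choice([1.0, -1.0])

rl_policy = PolicyNet()                             # initialized with the SL weights (rho = sigma)
opponent_pool = [copy.deepcopy(rl_policy)]
optimizer = torch.optim.SGD(rl_policy.parameters(), lr=0.001)
reinforce_step(rl_policy, opponent_pool, play_game, optimizer)
```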
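For the value network v_θ, a (heavily shortened) convolutional trunk ends in a single tanh-bounded scalar trained with mean-squared error against the self-play outcome; sampling one position per game is what keeps strongly correlated successive positions out of the training set. The layer sizes are again assumptions:

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Illustrative value network (v_theta): extra color plane in, one scalar out."""
    def __init__(self, in_planes=49, width=192):     # 48 planes + current-player color
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_planes, width, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(width, width, kernel_size=3, padding=1), nn.ReLU(),  # trunk shortened here
            nn.Conv2d(width, 1, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh(),            # single prediction in [-1, 1]
        )

    def forward(self, x):                            # x: (batch, 49, 19, 19)
        return self.net(x).squeeze(1)

# regression on (position, outcome) pairs, one position sampled per self-play game
value_net = ValueNet()
optimizer = torch.optim.SGD(value_net.parameters(), lr=0.003)
positions = torch.randn(8, 49, 19, 19)                           # dummy self-play positions
outcomes = torch.tensor([1., -1., 1., 1., -1., -1., 1., -1.])    # final game results
loss = nn.functional.mse_loss(value_net(positions), outcomes)    # mean-squared-error regression
optimizer.zero_grad()
loss.backward()
optimizer.step()
```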
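Finally, the selection rule can be rendered as a small PUCT-style sketch: each child's score is its mean backed-up evaluation (value network mixed with rollout outcomes) plus a prior bonus from the SL policy network that shrinks as the child's visit count grows. `c_puct` and the `Node` fields are illustrative assumptions, not the paper's exact formulation:

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    prior: float                       # SL policy network probability of the move leading here
    visits: int = 0
    value_sum: float = 0.0             # sum of backed-up leaf evaluations
    children: dict = field(default_factory=dict)

def select_action(node, c_puct=5.0):
    """Pick the child maximizing Q + u, a simplified form of AlphaGo's selection rule."""
    total_visits = sum(child.visits for child in node.children.values()) or 1
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        q = child.value_sum / child.visits if child.visits else 0.0              # exploitation term
        u = c_puct * child.prior * math.sqrt(total_visits) / (1 + child.visits)  # prior bonus, decays with visits
        if q + u > best_score:
            best_action, best_score = action, q + u
    return best_action

def backup(path, leaf_value):
    """Propagate a leaf evaluation back up the visited path."""
    for node in path:
        node.visits += 1
        node.value_sum += leaf_value

# toy usage: a root with three expanded children carrying SL-policy priors
root = Node(prior=1.0, children={0: Node(0.5), 1: Node(0.3), 2: Node(0.2)})
action = select_action(root)
# leaf evaluation = mix of the value network's output (say 0.1) and a fast-rollout win (+1)
backup([root, root.children[action]], leaf_value=0.5 * 0.1 + 0.5 * 1.0)
```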

That"s it, the AlphaGo program that beats human!

Reference

[1] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–489, 2016.
