Is there a problem with the Q-Learning algorithm in Figure 16.13 of Prof. Zhou Zhihua's "Machine Learning"?

01-30

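The Q-Learning algorithm in Figure 16.13 on page 388 of Prof. Zhou Zhihua's "Machine Learning" seems to have a problem. First, a is never given an initial value. Second, line 8 should not contain a=a' either; otherwise, wouldn't every iteration simply follow the path with the highest current Q-value? Could anyone answer? Many thanks!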

Update, 2017.4.1:

Prof. Zhou replied and confirmed your idea.

-----------------------------------------------------------------------------------------------------

You have a point. It is more reasonable to set $a=\pi^{\epsilon}\left(x\right)$ at the start of each iteration. For reference:

Source: https://groups.google.com/forum/#!topic/rl-list/4Efnr0gXhAU

You can also look at how the algorithm is written in Sutton's Reinforcement Learning: An Introduction (draft for the second edition) and in Szepesvári's Algorithms for Reinforcement Learning.

P140, Reinforcement Learning: An Introduction (draft for the second edition), Richard S. Sutton and Andrew G. Barto

P57, Algorithms for Reinforcement Learning (updated May 18, 2013), Csaba Szepesvári

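Another difference is that Prof. Zhou restricts the behavior policy to the $\epsilon$-greedy policy, whereas the other two do not. I see that Prof. Zhou cites the 1992 paper by Watkins and Dayan, which proves that Q-learning converges with probability 1 to the optimal action values. My guess is that Watkins may have started out with an $\epsilon$-greedy policy in the 1989 or 1992 paper? Either way, this point makes little difference.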
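To make the correction concrete, here is a minimal Python sketch of tabular Q-learning in which the action is drawn from the $\epsilon$-greedy policy at the start of every iteration and only the state is carried over to the next one. The 5-state chain environment, the helper names `epsilon_greedy`, `step` and `q_learning`, and all hyperparameter values are my own assumptions for illustration; none of them come from the book or from the references above.

```python
import random

def epsilon_greedy(Q, x, actions, eps, rng):
    """pi^eps(x): a uniformly random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.choice(actions)
    best = max(Q[(x, a)] for a in actions)
    # Break ties at random so the all-zero initial table does not favor one action.
    return rng.choice([a for a in actions if Q[(x, a)] == best])

def step(x, a):
    """Hypothetical chain 0..4: action 1 moves right, action 0 moves left.
    Reaching state 4 yields reward 1; every other transition yields 0."""
    x_next = min(x + 1, 4) if a == 1 else max(x - 1, 0)
    r = 1.0 if x_next == 4 else 0.0
    return r, x_next

def q_learning(n_steps=20000, alpha=0.1, gamma=0.9, eps=0.1, seed=0):
    rng = random.Random(seed)
    states, actions = range(5), [0, 1]
    Q = {(x, a): 0.0 for x in states for a in actions}
    x = 0
    for _ in range(n_steps):
        # The action is chosen fresh from pi^eps(x) in every iteration.
        a = epsilon_greedy(Q, x, actions, eps, rng)
        r, x_next = step(x, a)
        # The update bootstraps from the greedy value at x_next (off-policy target).
        target = r + gamma * max(Q[(x_next, b)] for b in actions)
        Q[(x, a)] += alpha * (target - Q[(x, a)])
        # Only the state is carried over; state 4 is treated as terminal,
        # so the agent restarts at state 0 and Q at state 4 stays 0.
        x = 0 if x_next == 4 else x_next
    return Q

if __name__ == "__main__":
    Q = q_learning()
    for x in range(4):
        print(x, "right" if Q[(x, 1)] > Q[(x, 0)] else "left")
```

If the loop instead reused the greedy action computed for the update as the next action to execute, the $\epsilon$ in the behavior policy would never take effect after the first step, which is the concern raised in the question.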

Page 388, see the figure below: