Exclusive | Teaching a Neural Network to Play a Game with Q-Learning (Source Code Included)

04-29

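We previously covered teaching an AI to play a simple game with the Q-learning algorithm. This post is more complex because it introduces an additional dimension. To get the most out of it, I recommend reading the previous post first (https://www.practicalai.io/teaching-ai-play-simple-game-using-q-learning/).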

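The complete source code for this example is available on GitHub (https://github.com/daugaard/q-learning-simple-game/tree/neuralnetwork). Note that the neural network version of the reinforcement learning algorithm lives in the neuralnetwork branch.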

The game

Our game is a simple "catch the cheese" game in which the player P must move to catch the cheese C while avoiding falling into the pit O.

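The player P scores one point for each cheese found and loses one point for each fall into the pit. The game ends when the player reaches either 5 points or -5 points.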

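As mentioned above, we are extending the original game with a new dimension: the player can now move up, down, left, and right. The accompanying GIF shows a player playing the new game.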

Reinforcement learning with a neural network

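In the previous post, we built the AI by using the Q-learning algorithm to obtain a Q table. The algorithm uses the Q table to look up the best next action for the current state (for details on how Q-learning works, see this post: https://www.practicalai.io/teaching-ai-play-simple-game-using-q-learning#q-learning-algorithm). That works well for a simple game, but as the game grows more complex, so does the Q table, because it must hold a q value for every possible action A in every possible game state S.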
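To make that growth concrete, here is a small sketch that counts Q-table entries. The map sizes and the helper name `q_table_entries` are hypothetical and not taken from the game's source, and the state is assumed to be only the player's position.

```ruby
# Number of q values a Q table must hold: one per (state, action) pair.
# Assumes the state is only the player's cell on the map (illustrative).
def q_table_entries(map_size_x, map_size_y, num_actions)
  map_size_x * map_size_y * num_actions
end

# Hypothetical map sizes, not taken from the game's source.
one_dimensional = q_table_entries(12, 1, 2)   # a row of 12 cells, left/right
two_dimensional = q_table_entries(12, 12, 4)  # a 12x12 grid, four directions

puts one_dimensional
puts two_dimensional
```

Adding a dimension multiplies both the number of states and the number of actions, which is why the table grows so quickly.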

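One alternative is to replace the Q table lookup with a neural network. The network takes the state S and the action A as input and outputs the q value, that is, the expected reward for taking action A in state S.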

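With the neural network in place, we can determine which action A to take in state S. Our AI runs the network once for each action and picks the action with the highest network output, which maximizes the AI's expected reward.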
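The selection rule can be sketched in a few lines. This is a minimal illustration, not the author's code: `best_action` and `toy_q` are hypothetical names, and the lambda stands in for running a trained network once per action.

```ruby
# Returns the action whose q value estimate is highest for the given state.
# q_function is any callable taking (state, action) and returning a number;
# here it stands in for running the neural network once per action.
def best_action(state, actions, q_function)
  actions.max_by { |action| q_function.call(state, action) }
end

# Hypothetical stand-in for a trained network: it prefers moving right.
toy_q = lambda do |_state, action|
  { left: -0.4, right: 0.7, up: 0.1, down: 0.0 }.fetch(action)
end

puts best_action([0, 0], [:left, :right, :up, :down], toy_q)
```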

To train the neural network, we follow an approach similar to the original Q-learning algorithm, with a few adjustments specific to the neural network:

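STEP 1: Initialize the neural network with arbitrary values.
STEP 2: While playing the game, run the following loop.

STEP 2.a: Generate a random number between 0 and 1.

If the number is larger than the threshold e, select a random action; otherwise, run the neural network for the current state combined with each possible action and select the action with the highest predicted reward.

STEP 2.b: Take the action selected in step 2.a.
STEP 2.c: Observe the reward r.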
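STEP 2.d: Train the neural network using the reward r and the following formula:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$

Here s' is the new state reached after taking action a, a' ranges over the possible actions, and γ is the discount factor.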

This process yields an AI whose neural network is trained online, that is, the network is trained on each data point as soon as it becomes available.
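Step 2.d can be sketched as a small function that computes the value the network is trained towards. The name `q_target` and the sample numbers are made up for illustration; the 0.9 default mirrors the discount used in the player class later in this post.

```ruby
# Target q value for one observed transition:
#   reward + discount * (best q value reachable from the next state)
# next_q_values holds the network's estimates for every action in the
# next state (an assumption of this sketch: they are passed in as an array).
def q_target(reward, next_q_values, discount = 0.9)
  reward + discount * next_q_values.max
end

# Made-up numbers: nothing happened (-0.1) and the best next estimate is 0.5.
target = q_target(-0.1, [0.2, 0.5, -0.3, 0.0])
puts target.round(2)
```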

Catastrophic interference and experience replay

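The online training approach described above is prone to catastrophic interference, which occurs when a neural network abruptly forgets what it previously learned while learning new information.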

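For example, in the game, moving left sometimes leads to cheese, while at other times moving left drops you into the pit. Catastrophic interference can make the network forget the previously learned "moving left leads into the pit", which makes it hard for the network to find a good solution to the game.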

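We address catastrophic interference with a technique called experience replay. We give the AI a replay memory of size R, and in each iteration we randomly draw a batch of B stored experiences (states and actions) from the replay memory to train the network. This way, the network is continually trained on fresh batches sampled from across its experience rather than only on one recent stretch of samples, which counters catastrophic interference.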
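A replay memory of this kind can be sketched as a small ring buffer. The `ReplayMemory` class below is a hypothetical illustration and not the author's implementation, which keeps the memory as a plain array inside the player class.

```ruby
# Fixed-size replay memory that overwrites its oldest entry when full.
class ReplayMemory
  def initialize(capacity)
    @capacity = capacity
    @entries = []
    @pointer = 0
  end

  # Store one experience, e.g. a hash with state, action, reward and new state.
  def add(experience)
    @entries[@pointer] = experience
    @pointer = (@pointer + 1) % @capacity
  end

  def full?
    @entries.length == @capacity
  end

  def size
    @entries.length
  end

  # Draw batch_size distinct experiences at random.
  def sample(batch_size)
    @entries.sample(batch_size)
  end
end

memory = ReplayMemory.new(3)
5.times { |i| memory.add({ reward: i }) }
puts memory.size
puts memory.full?
puts memory.sample(2).length
```

Because the batch is drawn at random from the whole buffer, older and newer experiences are mixed in every training step.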

Our Q-learning algorithm now looks like this:

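STEP 1: Initialize the neural network with arbitrary values.
STEP 2: While playing the game, run the following loop.

STEP 2.a: Generate a random number between 0 and 1.

If the number is larger than the threshold e, select a random action; otherwise, run the neural network for the current state combined with each possible action and select the action with the highest predicted reward.

STEP 2.b: Take the action selected in step 2.a.
STEP 2.c: Observe the reward r.
STEP 2.d: Add the current state, action, reward, and new state to the replay memory (if the memory is full, overwrite the oldest entry).
STEP 2.e: If the replay memory is full, draw a batch of size B.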

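For each example in the batch, compute the target q value using the following formula:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$$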

Train the neural network on the batch's target q values and input states.

Implementing the neural network AI

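With the algorithm defined, we can start implementing our AI player. The game takes an instance of a player class as the player object. The player class must implement a get_input function, which is called once per iteration of the game loop and returns the direction the player moves in.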

Here is an example of a human player class:

require 'io/console'

class Player
  attr_accessor :y, :x

  def initialize
    @x = 0
    @y = 0
  end

  # Read a single keypress and map it to a move direction
  def get_input
    input = STDIN.getch
    if input == 'a'
      return :left
    elsif input == 'd'
      return :right
    elsif input == 'w'
      return :up
    elsif input == 's'
      return :down
    elsif input == 'q'
      exit
    end
    return :nothing
  end
end

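For the neural network AI player, we need to implement a new player class whose get_input function uses the algorithm outlined above to decide on the action.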

First we need the ruby-fann gem, which provides Ruby bindings for FANN (Fast Artificial Neural Network, a neural network implementation written in C).

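Next, we define a constructor that sets up the player attributes and the parameters the algorithm needs. Our example uses a replay memory of size 500 and a training batch of size 400.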

require 'ruby-fann'

class QLearningPlayer
  attr_accessor :y, :x, :game

  def initialize
    @x = 0
    @y = 0
    @actions = [:left, :right, :up, :down]
    @first_run = true
    @discount = 0.9                   # discount factor for future rewards
    @epsilon = 0.1                    # initial threshold for greedy actions
    @max_epsilon = 0.9                # upper bound for the threshold
    @epsilon_increase_factor = 800.0  # the threshold grows by runs / this factor
    @replay_memory_size = 500
    @replay_memory_pointer = 0
    @replay_memory = []
    @replay_batch_size = 400
    @runs = 0
    @r = Random.new
  end

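Note the parameters that support a dynamic e value. e is the threshold used in step 2.a to choose between a random action and the best-known one. If e is low, we pick a random action with high probability instead of the action with the highest reward. Our e is dynamic: it starts at a very low value to encourage exploration and grows with each iteration until it reaches its maximum.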
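The schedule implied by these parameters can be sketched as a standalone function. `effective_epsilon` is a hypothetical helper name; its defaults copy the constructor values above, and the run counts in the loop are arbitrary.

```ruby
# Effective threshold after a given number of runs: start at epsilon and
# grow by runs / increase_factor until max_epsilon is reached.
def effective_epsilon(runs, epsilon: 0.1, max_epsilon: 0.9, increase_factor: 800.0)
  [epsilon + runs / increase_factor, max_epsilon].min
end

# Arbitrary run counts, chosen only to show the growth and the cap.
[0, 200, 400, 640, 2000].each do |runs|
  puts "#{runs} runs -> #{effective_epsilon(runs).round(3)}"
end
```

Since a random action is taken whenever the drawn number exceeds the threshold, the threshold is also the probability of acting greedily: roughly 10% at the start and 90% once the cap is reached.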

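Next we set up a function that initializes the neural network. The input size of the network is the number of cells in the x-y map plus the number of available actions. We use one hidden layer with as many neurons as the input layer and a single output node (the q value). We also set the learning rate to 0.2 and change the activation function to symmetric sigmoid to support negative values.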

  def initialize_q_neural_network
    # Setup model
    # Input is the size of the map + number of actions
    # Output size is one
    @q_nn_model = RubyFann::Standard.new(
      num_inputs: @game.map_size_x*@game.map_size_y + @actions.length,
      hidden_neurons: [ (@game.map_size_x*@game.map_size_y+@actions.length) ],
      num_outputs: 1 )
    @q_nn_model.set_learning_rate(0.2)
    @q_nn_model.set_activation_function_hidden(:sigmoid_symmetric)
    @q_nn_model.set_activation_function_output(:sigmoid_symmetric)
  end

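Now it is time to implement the get_input function. We first pause briefly so that humans can follow the AI player, and increment the attribute that tracks the number of runs. We then check whether this is the first run and, if so, initialize the neural network (step 1).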

  def get_input
    # Pause to make sure humans can follow along
    # Increase pause with the number of runs
    sleep 0.05 + 0.01*(@runs/400.0)
    @runs += 1
    if @first_run
      # If this is first run initialize the Q-neural network
      initialize_q_neural_network
      @first_run = false
    else

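If this is not the first run, we evaluate what happened on the last action and compute the corresponding reward (step 2.c). The reward is 1 if the game score increased, -1 if it decreased, and -0.1 if nothing happened. Giving a small negative reward when nothing happens encourages the algorithm to head straight for the cheese.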

      # If this is not the first run
      # Evaluate what happened on last action and calculate reward
      r = 0 # default is 0
      if !@game.new_game and @old_score < @game.score
        r = 1 # reward is 1 if our score increased
      elsif !@game.new_game and @old_score > @game.score
        r = -1 # reward is -1 if our score decreased
      elsif !@game.new_game
        r = -0.1 # small penalty if nothing happened
      end
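The same reward rule can be written as a standalone function, which makes it easy to check in isolation. `reward_for` is a hypothetical name, and passing the scores in as arguments is an assumption of this sketch; the author's code reads them from the game object.

```ruby
# Reward for the previous action, derived from the change in score.
# new_game is true right after a reset, when the scores cannot be compared.
def reward_for(old_score, new_score, new_game)
  return 0 if new_game
  return 1 if new_score > old_score
  return -1 if new_score < old_score
  -0.1
end

puts reward_for(2, 3, false)
puts reward_for(2, 1, false)
puts reward_for(2, 2, false)
```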

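Next we capture the current state of the game and put it in the replay memory together with the reward and the previous state. The captured state serves as the input vector of the neural network: we encode the player's current position by setting a 1 at the corresponding index of the vector (step 2.d).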

      # Capture current state
      # Set input to network map_size_x * map_size_y + actions length vector with a 1 on the player position
      input_state = Array.new(@game.map_size_x*@game.map_size_y + @actions.length, 0)
      input_state[@x + (@game.map_size_x*@y)] = 1
      # Add reward, old_state and input state to memory
      @replay_memory[@replay_memory_pointer] = {reward: r, old_input_state: @old_input_state, input_state: input_state}
      # Increment memory pointer, wrapping to 0 after it reaches @replay_memory_size
      @replay_memory_pointer = (@replay_memory_pointer<@replay_memory_size) ? @replay_memory_pointer+1 : 0
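The encoding can be tried out on its own with a small helper. `encode_input` and the 3x2 map are hypothetical and used only for illustration; the index arithmetic follows the snippet above.

```ruby
# Build the network input: one slot per map cell plus one slot per action.
# The player's cell is set to 1, and optionally the chosen action's slot.
def encode_input(x, y, map_size_x, map_size_y, num_actions, action_index = nil)
  cells = map_size_x * map_size_y
  input = Array.new(cells + num_actions, 0)
  input[x + map_size_x * y] = 1
  input[cells + action_index] = 1 unless action_index.nil?
  input
end

# Hypothetical 3x2 map, player at x=1, y=1, action index 2 (:up).
p encode_input(1, 1, 3, 2, 4, 2)
# => [0, 0, 0, 0, 1, 0, 0, 0, 1, 0]
```

The first six slots describe the map and the last four describe the action, so the same vector layout serves both for storing a state and for querying the network about one specific action.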

We then check whether the memory is full. If it is, we draw a random batch, compute the updated q values, and train the network (step 2.e).

      # If replay memory is full train network on a batch of states from the memory
      # (the memory counts as full once it holds more than @replay_memory_size entries)
      if @replay_memory.length > @replay_memory_size
        # Randomly sample a batch of actions from the memory and train network with these actions
        @batch = @replay_memory.sample(@replay_batch_size)
        training_x_data = []
        training_y_data = []
        # For each batch calculate new q_value based on current network and reward
        @batch.each do |m|
          # To get entire q table row of the current state run the network once for every possible action
          q_table_row = []
          @actions.length.times do |a|
            # Create neural network input vector for this action
            input_state_action = m[:input_state].clone
            # Set a 1 in the action location of the input vector
            input_state_action[(@game.map_size_x*@game.map_size_y) + a] = 1
            # Run the network for this action and get q table row entry
            q_table_row[a] = @q_nn_model.run(input_state_action).first
          end
          # Update the q value
          updated_q_value = m[:reward] + @discount * q_table_row.max
          # Add to training set
          training_x_data.push(m[:old_input_state])
          training_y_data.push([updated_q_value])
        end
        # Train network with batch
        train = RubyFann::TrainData.new( :inputs=> training_x_data, :desired_outputs=>training_y_data )
        @q_nn_model.train_on_data(train, 1, 1, 0.01)
      end
    end

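With the network updated, we decide what to do next. We first capture the current game state in a network input vector, then compute the e value based on the current run count. A higher e means a higher probability of picking the action with the highest reward rather than a random action.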

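Next, we either select a random action or run the neural network for the current state S with each action A and pick the action based on the network output.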

    # Capture current state and score
    # Set input to network map_size_x * map_size_y + actions length vector with a 1 on the player position
    input_state = Array.new(@game.map_size_x*@game.map_size_y + @actions.length, 0)
    input_state[@x + (@game.map_size_x*@y)] = 1
    # Choose action based on Q value estimates for state
    # If a random number is higher than epsilon we take a random action
    # We will slowly increase @epsilon based on runs to a maximum of @max_epsilon - this encourages early exploration
    epsilon_run_factor = (@runs/@epsilon_increase_factor) > (@max_epsilon-@epsilon) ? (@max_epsilon-@epsilon) : (@runs/@epsilon_increase_factor)
    if @r.rand > (@epsilon + epsilon_run_factor)
      # Select random action
      @action_taken_index = @r.rand(@actions.length)
    else
      # To get the entire q table row of the current state run the network once for every possible action
      q_table_row = []
      @actions.length.times do |a|
        # Create neural network input vector for this action
        input_state_action = input_state.clone
        # Set a 1 in the action location of the input vector
        input_state_action[(@game.map_size_x*@game.map_size_y) + a] = 1
        # Run the network for this action and get q table row entry
        q_table_row[a] = @q_nn_model.run(input_state_action).first
      end
      # Select action with highest possible reward
      @action_taken_index = q_table_row.each_with_index.max[1]
    end
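Stripped of the network calls, the choice between exploring and exploiting reduces to the following sketch. `choose_action_index` is a hypothetical helper, and the q values are passed in as a plain array rather than computed by the network.

```ruby
# Pick an action index: random when the drawn number exceeds the threshold,
# otherwise the index of the highest q value estimate.
def choose_action_index(q_values, threshold, rng = Random.new)
  if rng.rand > threshold
    rng.rand(q_values.length)
  else
    q_values.each_with_index.max[1]
  end
end

# Made-up q value estimates for the four actions.
q_values = [0.1, 0.8, -0.2, 0.3]
puts choose_action_index(q_values, 0.9)
```

With a threshold of 0.9 this prints 1 (the index of the largest estimate) about nine times out of ten, and a uniformly random index otherwise.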

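Finally, we store the current score in the old score variable and the current state in the old state variable, and return the action for the game to execute (step 2.b).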

    # Save current state, score and q table row
    @old_score = @game.score
    # Set action taken in input state before storing it
    input_state[(@game.map_size_x*@game.map_size_y) + @action_taken_index] = 1
    @old_input_state = input_state
    # Take action
    return @actions[@action_taken_index]
  end

The complete code can be found here:

https://github.com/daugaard/q-learning-simple-game/blob/55748d5e821b34a531dba4d9c4b2683038db6b3d/q_learning_player.rb

Letting the AI play

Let's run the code with the AI player and see how it performs.

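At first the AI wanders all over the place. This is due to the dynamic e value and the fact that we do not start training the network until the replay memory is full, so early on the actions are effectively random. Towards the end of run 1 and run 2, however, the AI has learned to avoid the pit and heads straight for the cheese.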

A more general approach

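This post showed how to train a neural network with symmetric sigmoid activation to play a simple game by encoding the game state and action as the network's input vector and using a measure of the reward as the network's output. Building the network this way requires knowledge of the game, which is a limitation if we want to build a more general AI.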

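A more general approach is to replace the encoded game state used as input with the RGB values of the rendered game screen. Researchers at DeepMind discuss this approach in detail in the paper "Playing Atari with Deep Reinforcement Learning". They successfully trained Q-learning, with a neural network in place of the Q table, to play Space Invaders, Pong, Q*bert, and other Atari 2600 games.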

Original title: Teaching a NeuralNetwork to play a game using Q-learning

Author: Soren D

Translator: 楊金鴻

Original link:

https://www.practicalai.io/teaching-a-neural-network-to-play-a-game-with-q-learning/