實現屬於自己的TensorFlow(三) - 反向傳播與梯度下降實現

02-13

前言

上一篇介紹了損失函數對計算圖中節點梯度計算的方法和實現以及反向傳播演算法的原理，但是並沒有給出如何實現損失函數從後向前計算梯度。本文主要總結在可以計算每個節點梯度的基礎上如何實現反向傳播計算變數的梯度以及通過反向傳播演算法得到的梯度實現梯度下降演算法優化來優化節點中的參數。最後我們使用實現的simpleflow完成一個線性回歸模型的優化。

附上simpleflow代碼:

PytLab/simpleflowgithub.com

正文

反向傳播計算梯度的實現

再拿出上篇中用於解釋反向傳播演算法的計算圖:

而且上篇中針對圖中的每個節點中輸出對輸入的梯度計算進行了實現，現在我們實現需要從後向前反向搜索與損失節點相聯繫的節點進行反向傳播計算梯度。通過上面我們知道若我們需要計算其他節點關於 $g$ 的梯度我們需要以損失節點為起點對計算圖進行廣度優先搜索，在搜索的過程中我們使用上篇中實現的針對每個節點的梯度計算便可以一邊遍歷一邊計算計算節點對遍歷節點的梯度了。我們可以使用一個字典將節點與梯度進行保存。

這裡需要注意的一點是，我們在上一篇針對每個節點計算梯度是損失值 $Loss$ 對該節點輸入的梯度，用公式表示就是針對一個 $f$ 節點的操作 $y=f(x,y)$ , 我們計算的梯度為 $frac{partial Loss}{partial x}$ 和 $frac{partial Loss}{partial y}$ , 如下圖:

而在廣度優先搜索的時候我們訪問一個節點時候計算的梯度是損失值 $Loss$ 對此節點的輸出 $z$ 的梯度, 也就是當我們遍歷訪問到節點 $y=f(x,y)$ 的時候計算的梯度為 $frac{partial Loss}{partial f}$ 或者 $frac{partial Loss}{partial z}$ , 其中 $g$ 為操作節點的輸出節點, 如下圖. 若當前訪問的節點有多個輸出，我們就計算損失函數 $Loss$ 對多個輸出分別的梯度再求和。

區別了上面的梯度，我們就可以使用Python來實現計算圖的廣度優先搜索了, 廣度優先搜索實現這裡不多做贅述了, 就是使用一個先進先出(FIFO)隊列控制遍歷順序，一個集合對象存儲已訪問的節點防止重複訪問，然後遍歷的時候計算梯度並將梯度放到grad_table中。廣度優先搜索細節可以參考Breadth-first search

梯度計算實現如下(https://github.com/PytLab/simpleflow):

def compute_gradients(target_op): Backpropagation implementation computing gradient of target operation wrt all the other connected nodes. :param target_op: The target operation whose gradient wrt other nodes would be computed. :type target_op: Any operation type. :return grad_table: A table containing node objects and gradients. :type grad_table: dict. # A dict containing a mapping between node and gradient value of target_op wrt the nodes output. # NOTE: It is the gradient wrt the nodes OUTPUT NOT input. grad_table = {} # The gradient wrt target_op itself is 1. grad_table[target_op] = np.ones_like(target_op.output_value) # Perform a breadth-first search staring from the target_op in graph. # Queue for node traverasl. queue = Queue() queue.put(target_op) # Set for visited nodes. visited = set() visited.add(target_op) while not queue.empty(): node = queue.get() # Compute gradient wrt the nodes output. if node != target_op: grads_wrt_node_output = [] for output_node in node.output_nodes: # Retrieve the gradient wrt output_nodes OUTPUT. grad_wrt_output_node_output = grad_table[output_node] # Compute the gradient wrt current nodes output. grad_wrt_node_output = output_node.compute_gradient(grad_wrt_output_node_output) if len(output_node.input_nodes) > 1: input_node_index = output_node.input_nodes.index(node) grads_wrt_node_output.append(grad_wrt_node_output[input_node_index]) else: grads_wrt_node_output.append(grad_wrt_node_output) # Sum all gradients wrt nodes output. tot_grad_wrt_node_output = sum(grads_wrt_node_output) grad_table[node] = tot_grad_wrt_node_output # Put adjecent nodes to queue. if hasattr(node, input_nodes): for input_node in node.input_nodes: if input_node not in visited: visited.add(input_node) queue.put(input_node) return grad_table

梯度下降優化

梯度(最速下降)法

上面我們實現了損失函數對其他節點梯度的計算，得到梯度的目的是為了能夠優化我們的參數，本部分我們實現一個梯度下降優化器來完成參數優化的目的。關於梯度下降優化的過程其實很簡單，它基於以下的觀察:

對於實值函數 $F(x)$ 在 $a$ 處可微且有定義，則函數 $F(x)$ 在 $a$ 點沿著梯度方向的反方向 $- abla F(a)$ 下降最快。

梯度下降法就是以梯度的反方向作為每輪迭代的搜索方向然後根據設定的步長對局部最優值進行搜索。而步長就是我們的學習率。

迭代公式

$x^{(k+1)} = x^{(k)} + lambda_{k}d^{(k)}$

搜索方向: $d^{(k)} = - abla f(x^{(k)})$ , 也稱為最速下降方向
搜索步長: $lambda_{k}$ , 可以固定位某值也可以變化如指數衰減或者通過一維搜索取最優步長等

梯度演算法步驟

取初始點 $x^{(1)} in mathbb{R}^n$ , 允許誤差 $epsilon > 0$ , 令 $k = 1$ ;
計算搜索方向 $d^{(k)} = - abla f(x^{(k)})$ ;
若 $lVert d^{(k)} Vert le epsilon$ , 則停止計算, $x^{(k)}$ 為所求極值點; 否則, 獲取步長 $lambda_k$ ;
令 $x^{(k+1)} = x^{(k) + lambda_{k}d^{(k)}}$ , 令 $k:=k+1$ 轉到第2步

對於一個二元函數 $x^{(k)}$ 的移動軌跡類似下圖:

對於一個一元函數的局部極小值附近 $x^{(k)}$ 的移動軌跡類似下圖:

上面的流程中的終止條件是梯度收斂，在訓練神經網路的時候我們需要根據自己的需要終止迭代，例如根據分類的準確度，均方誤差或者迭代步數等.

使用梯度下降優化計算圖中的參數

上面介紹了梯度下降演算法通用的演算法過程描述，這裡將其應用在我們計算圖中參數的優化。過程基本是一致的，只是我們現在的 $x^{(k)}$ 變成了計算圖中的參數（或者神經網路中的權重）值 $W$ 了:

隨機初始化變數 $W$ 和 $b$ ;
根據 $W$ 和 $b$ 計算損失函數值 $Loss$ , 若滿足終止條件，則終止迭代
計算損失函數 $Loss$ ,對 $W$ 和 $b$ 的梯度值 $G_W$ , $G_b$ ;
沿著梯度反方向更新 $W = W - rate imes G_W$ 和b=b?rate×Gbb=b?rate×Gb;
返回到2

實現梯度下降優化器

好了，知道了演算法的過程，我們需要給我們的simpleflow添加一個優化器，他的作用是根據當前圖中節點的值計算圖中所有可訓練的trainable的變數節點中變數的梯度，並沿著梯度方向進行一步更新。

為了模仿tensorflow的介面，我們也定義一個GradientDescentOptimizer並在裡面定義一個minimize方法來返回一個優化操作使其可以在session中執行, 代碼如下:

class GradientDescentOptimizer(object): Optimizer that implements the gradient descent algorithm. def __init__(self, learning_rate): Construct a new gradient descent optimizer :param learning_rate: learning rate of optimizier. :type learning_rate: float self.learning_rate = learning_rate def minimize(self, loss): Generate an gradient descent optimization operation for loss. :param loss: The loss operation to be optimized. :type loss: Object of `Operation` learning_rate = self.learning_rate class MinimizationOperation(Operation): def compute_output(self): # Get gradient table. grad_table = compute_gradients(loss) # Iterate all trainable variables in graph. for var in DEFAULT_GRAPH.trainable_variables: if var in grad_table: grad = grad_table[var] # Update its output value. var.output_value -= learning_rate*grad return MinimizationOperation()

使用simpleflow進行線性回歸

綜合前兩篇以及本篇的內容，我們實現的計算圖已經可以進行簡單的模型訓練了，本部分就以一個簡單的線性模型為例子來看看成果。

生成數據

這裡我們先基於 $y=3x$ 生成加入一些雜訊數據用於回歸.

import numpy as npinput_x = np.linspace(-1, 1, 100)input_y = input_x*3 + np.random.randn(input_x.shape[0])*0.5

構建計算圖

下面我們來構建線性回歸的計算圖，線性回歸的公式很簡單而且只需要兩個參數ww和bb，我們根據公式 $y=wx+b$ 來構建計算圖:

import simpleflow as sf# Placeholder for training datax = sf.placeholder()y_ = sf.placeholder()# 權重參數, 因為還未實現隨機初始, 目前使用1.0來初始化w = sf.Variable([[1.0]], name=weight)# 閾值b = sf.Variable(0.0, name=threshold)# 模型預測值y = x*w + b

損失函數

這裡我們使用平方差來作為損失函數:

loss = sf.reduct_sum(sf.square(y - y_))

創建梯度下降優化器來訓練模型

train_op = sf.GradientDescentOptimizer(learning_rate=0.005).minimize(loss)# 訓練feed_dict = {x: input_x, y_: input_y}with sf.Session() as sess: for step in range(20): loss_value = sess.run(loss, feed_dict=feed_dict) mse = loss_value/len(input_x) # 優化一步 print(step: {}, loss: {}, mse: {}.format(step, loss_value, mse)) sess.run(train_op, feed_dict) # 訓練後的參數值 w_value = sess.run(w, feed_dict=feed_dict) b_value = sess.run(b, feed_dict=feed_dict) print(w: {}, b: {}.format(w_value, b_value))

輸出:

step: 0, loss: 147.31229301332186, mse: 1.4731229301332187step: 1, loss: 75.2862864776536, mse: 0.7528628647765361step: 2, loss: 44.009064387253, mse: 0.44009064387253step: 3, loss: 30.387486500361295, mse: 0.30387486500361294step: 4, loss: 24.455137917985002, mse: 0.24455137917985004step: 5, loss: 21.871534168474543, mse: 0.21871534168474543step: 6, loss: 20.746346017138585, mse: 0.20746346017138584step: 7, loss: 20.256314070038826, mse: 0.20256314070038825step: 8, loss: 20.042899710055327, mse: 0.20042899710055326step: 9, loss: 19.949955384044085, mse: 0.19949955384044085step: 10, loss: 19.90947709692998, mse: 0.19909477096929978step: 11, loss: 19.891848352949488, mse: 0.19891848352949487step: 12, loss: 19.884170838991114, mse: 0.19884170838991114step: 13, loss: 19.880827196321707, mse: 0.19880827196321707step: 14, loss: 19.879371002772437, mse: 0.19879371002772436step: 15, loss: 19.8787368142952, mse: 0.19878736814295198step: 16, loss: 19.878460618163945, mse: 0.19878460618163946step: 17, loss: 19.878340331678682, mse: 0.19878340331678682step: 18, loss: 19.878287945577295, mse: 0.19878287945577294step: 19, loss: 19.878265130847833, mse: 0.19878265130847833w: [[ 2.93373825]], b: 0.04568701269996128

繪製回歸線

w_value = float(w_value)max_x, min_x = np.max(input_x), np.min(input_x)max_y, min_y = w_value*max_x + b_value, w_value*min_x + b_valueplt.plot([max_x, min_x], [max_y, min_y], color=r)plt.scatter(input_x, input_y)plt.show()

總結

本文主要實現了計算圖中的反向傳播演算法計算梯度以及梯度下降法優化參數，最後以一個線性模型為例子使用simpleflow構建的圖來進行優化。後續將會在新增幾個操作節點和梯度的實現來使用simpleflow構建一個多層感知機來解決簡單的回歸和分類問題。

參考

http://www.deepideas.net/deep-learning-from-scratch-iv-gradient-descent-and-backpropagation/
https://en.wikipedia.org/wiki/Gradient_descent
https://en.wikipedia.org/wiki/Breadth-first_search