How does TensorFlow compute derivatives?

01-14

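If symbolic differentiation (as in sympy) had to be carried out every time a sample is fed in, that seems like it would be extremely inefficient. If instead a numerical method is used, like the one in one lesson of Andrew Ng's "Neural Networks and Machine Learning" videos, where each parameter is changed by a small amount such as 0.001, the resulting change in the loss function is measured, and the derivative is obtained from that, it seems like it would be very efficient.
My question is:
Does TensorFlow's actual implementation compute derivatives with the numerical method from Andrew Ng's videos? If so, is this 0.001 the learning rate?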

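Symbolic differentiation isn't hard.

For symbolic differentiation, the key is to record, at each symbolic operation, how that operation affects the derivative, and then apply the chain rule

$\frac{\mathrm d f(g(x))}{\mathrm dx} = \frac{\mathrm d f(g(x))}{\mathrm d(g(x))}\frac{\mathrm d g(x)}{\mathrm dx}$

and that's it.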

Take an example:

$f(x) = x^2$

Obviously the derivative of this function with respect to x is $f'(x) = 2x$.

In TensorFlow it looks like this (in math_grad.py):

```python
@ops.RegisterGradient("Square")
def _SquareGrad(op, grad):
  x = op.inputs[0]
  # Added control dependencies to prevent 2*x from being computed too early.
  with ops.control_dependencies([grad.op]):
    x = math_ops.conj(x)
    return grad * (2.0 * x)
```

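Every function in this file is wrapped with the RegisterGradient decorator, and each of them takes two arguments, op and grad. The same decorator is also used wherever else an op is registered, for example batch

The class documentation of RegisterGradient describes what the decorator does: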

This decorator is only used when defining a new op type. For an op with `m` inputs and `n` outputs, the gradient function is a function that takes the original `Operation` and `n` `Tensor` objects (representing the gradients with respect to each output of the op), and returns `m` `Tensor` objects (representing the partial gradients with respect to each input of the op).

For example, assuming that operations of type `"Sub"` take two inputs `x` and `y`, and return a single output `x - y`, the following gradient function would be registered:

```python
@tf.RegisterGradient("Sub")
def _sub_grad(unused_op, grad):
  return grad, tf.negative(grad)
```

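The first argument, op, is the operation. The second argument, grad, is the gradient accumulated so far with respect to the op's output. When the op computes $g$, this is the factor $\frac{\mathrm d f(g(x))}{\mathrm d(g(x))}$ in the chain rule, and the gradient function multiplies it by the op's own derivative $\frac{\mathrm d g(x)}{\mathrm dx}$. The example in the documentation is the gradient of x - y.

An op may take a single symbol as input or two symbols. In the first case the function returns the gradient with respect to that one symbol; in the second it returns the gradients of the output with respect to both symbols. For $f(x,y) = x - y$, the function above returns the two gradients

$\frac{\mathrm d f(x,y)}{\mathrm dx}$ and $\frac{\mathrm d f(x,y)}{\mathrm dy}$, each multiplied by grad. Clearly the former is 1 and the latter is -1, and combining them with the chain rule is all that is needed.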
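To make the "register a gradient per op, then chain them" idea concrete, here is a minimal pure-Python sketch. The names `GRAD_REGISTRY` and `register_gradient` are made up for illustration and are not TensorFlow's API; the sketch only assumes that each gradient function receives the op's inputs and the upstream gradient, as described above.

```python
# Hypothetical mini-registry; the names here are illustrative, not TensorFlow's API.
GRAD_REGISTRY = {}


def register_gradient(op_type):
    """Decorator that stores a gradient function under an op type name."""
    def decorator(fn):
        GRAD_REGISTRY[op_type] = fn
        return fn
    return decorator


@register_gradient("Square")
def _square_grad(inputs, grad):
    (x,) = inputs
    return (grad * 2.0 * x,)      # d(x^2)/dx = 2x, scaled by the upstream gradient


@register_gradient("Sub")
def _sub_grad(inputs, grad):
    return (grad, -grad)          # d(x-y)/dx = 1, d(x-y)/dy = -1


# Differentiate z = (x - y)^2 at x = 5, y = 2 by walking the ops backwards.
x, y = 5.0, 2.0
u = x - y                         # forward pass: Sub
z = u * u                         # forward pass: Square

dz_dz = 1.0                       # seed: gradient of the output w.r.t. itself
(dz_du,) = GRAD_REGISTRY["Square"]((u,), dz_dz)
dz_dx, dz_dy = GRAD_REGISTRY["Sub"]((x, y), dz_du)
print(dz_dx, dz_dy)
```

Each gradient function only knows its own local derivative; the chain rule shows up as passing one function's result in as the next function's `grad`.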

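Once the chain rule has produced the gradient of the whole graph, a small delta is used to compute a corresponding numerical solution to compare against it.

The code for this is in gradient_checker.py; you can jump to it quickly in your editor.

The function looks roughly like this: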

```python
# Some parameters are omitted here.
def compute_gradient(x, x_shape, y, y_shape, x_init_value=None, delta=1e-3,
                     extra_feed_dict=None):
  dx, dy = _compute_dx_and_dy(x, y, y_shape)
  ret = _compute_gradient(x, x_shape, dx, y, y_shape, dy, x_init_value, delta,
                          extra_feed_dict=extra_feed_dict)
  return ret
```

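The first function, _compute_dx_and_dy, calls gradients in gradients_impl.py, which, according to its docstring, builds the symbolic computation for the relevant partial derivatives: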

Constructs symbolic partial derivatives of sum of `ys` w.r.t. x in `xs`.

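This is the chain rule described earlier.

The second function, _compute_gradient, computes Jacobian matrices and returns a tuple (theoretical solution, numerical solution); delta is used for the numerical Jacobian.

Also, clearly, the derivative of the computational graph only has to be worked out once, when the graph is constructed. After that it can be evaluated by plugging in the corresponding x and y. Even if the graph is changed dynamically later, that just means applying the chain rule again. So when each sample comes in, only x and y differ.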
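The "theoretical versus numerical" comparison can be sketched for a single scalar input as follows. This is only an illustration under assumptions: it uses a central difference, and the helper names are invented here rather than taken from gradient_checker.py.

```python
def numeric_derivative(f, x, delta=1e-3):
    """Central-difference estimate of f'(x); delta is the perturbation size."""
    return (f(x + delta) - f(x - delta)) / (2.0 * delta)


def square(x):
    return x * x


def square_grad(x):
    # The analytic rule that a registered gradient function would encode.
    return 2.0 * x


x0 = 3.0
theoretical = square_grad(x0)                 # from the chain-rule side
numerical = numeric_derivative(square, x0)    # from perturbing the input
max_error = abs(theoretical - numerical)
print(theoretical, numerical, max_error)
```

Here delta plays no role in producing the theoretical value; it only controls the numerical estimate used as a cross-check.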

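@豬了個去's answer is great; I suggest reading it first: 豬了個去: How does TensorFlow compute derivatives?

Machine learning systems generally offer two kinds of interface, imperative and declarative. In the imperative style, the forward computation and the derivative computation of each op are implemented directly, as in the Python code below.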

```python
def square(x):
  return x * x


def square_grad(x):
  return 2 * x


print(square(10))
print(square_grad(10))
```

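In the declarative style, the user only describes the mathematical formula and the system differentiates it automatically. Usage looks roughly like this.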

```python
loss = (y - w * x - b) ** 2

print(loss.forward())
print(loss.grad())
```

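Clearly, TensorFlow provides a declarative programming interface: users do not need to care about the details of differentiation. They only define the model to obtain a loss expression and then use one of the Optimizers implemented in TensorFlow to do the computation. To make the loss decrease, the usual choice is the gradient descent algorithm, which requires the partial derivative of every op in the loss expression, combined through the chain rule. This means TensorFlow itself has to provide a partial-derivative method for each op. And although we write ordinary Python arithmetic operators, TensorFlow overloads them so that they actually create ops such as "Square", which makes it easier for users to build expressions.

So TensorFlow's differentiation works by first providing a mathematical implementation of the derivative of every op, and then using the chain rule to obtain the derivative of the whole expression.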
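As a rough sketch of how operator overloading can turn a plain Python formula into a graph that is differentiated with the chain rule, consider the toy below. The `Node` class, `as_node` and `backward` are hypothetical names invented for this example; they are not TensorFlow or miniflow code.

```python
class Node:
    """A node in a tiny expression graph (hypothetical, for illustration)."""

    def __init__(self, value, parents=()):
        self.value = value        # forward result
        self.parents = parents    # pairs of (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        other = as_node(other)
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __sub__(self, other):
        other = as_node(other)
        return Node(self.value - other.value, ((self, 1.0), (other, -1.0)))

    def __mul__(self, other):
        other = as_node(other)
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))

    def __pow__(self, n):
        return Node(self.value ** n, ((self, n * self.value ** (n - 1)),))


def as_node(v):
    return v if isinstance(v, Node) else Node(float(v))


def backward(out):
    """Propagate d(out)/d(node) to every node using the chain rule."""
    order, seen = [], set()

    def visit(n):
        if id(n) not in seen:
            seen.add(id(n))
            for p, _ in n.parents:
                visit(p)
            order.append(n)       # a node is appended after all its parents

    visit(out)
    out.grad = 1.0
    for n in reversed(order):     # output first, inputs last
        for p, local in n.parents:
            p.grad += n.grad * local


w, b = Node(2.0), Node(1.0)
x, y = 3.0, 10.0
loss = (as_node(y) - w * x - b) ** 2
backward(loss)
print(loss.value, w.grad, b.grad)
```

The user only writes the formula for `loss`; the overloaded operators record each op's local derivative, and `backward` combines them.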

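We know that the derivatives of addition, subtraction, multiplication, division and powers are defined mathematically. For example, the derivative of x ** n is n * x ** (n - 1), and the derivative of a sum of two terms is the sum of their derivatives. Implemented in code, it looks like this.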

```python
class PowerOp(Op):
  def __init__(self, input, power, name="Power"):
    super(PowerOp, self).__init__(name)

    if not isinstance(input, Op):
      self._op = ConstantOp(input)
    else:
      self._op = input

    self._power = power
    self._graph = graph.get_default_graph()
    self._graph.add_to_graph(self)

  def forward(self):
    result = pow(self._op.forward(), self._power)
    return result

  def grad(self, partial_derivative_opname=None):
    if isinstance(self._op, PlaceholderOp) or isinstance(self._op, ConstantOp):
      # op is the constant
      grad = 0
    elif isinstance(self._op, VariableOp):
      # op is the variable
      grad = self._power * pow(self._op.forward(), self._power - 1)
    else:
      # op is other complex operation and use chain rule
      grad = self._power * pow(self._op.forward(), self._power - 1
                               ) * self._op.grad(partial_derivative_opname)

    return grad
```

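Following the same pattern, we can implement the derivative of negation, and of course the sum rule, the product rule, and so on. The code is as follows.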

```python
class AddOp(Op):
  """
  The addition operation which has only two inputs. The input can be
  primitive, ConstantOp, PlaceholderOp, VariableOp or other ops.
  """

  def __init__(self, input1, input2, name="Add"):
    super(AddOp, self).__init__(name)

    if not isinstance(input1, Op):
      self._op1 = ConstantOp(input1)
    else:
      self._op1 = input1

    if not isinstance(input2, Op):
      self._op2 = ConstantOp(input2)
    else:
      self._op2 = input2

    self._graph = graph.get_default_graph()
    self._graph.add_to_graph(self)

  def forward(self):
    result = self._op1.forward() + self._op2.forward()
    return result

  def grad(self, partial_derivative_opname=None):
    result = self._op1.grad(partial_derivative_opname) + self._op2.grad(
        partial_derivative_opname)
    return result
```

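In fact, implementing TensorFlow's automatic differentiation is very simple: implement the full mathematical derivative formula for each op, then implement the product rule, the sum rule, and so on to obtain the result for the whole expression.

The code above comes from the tobegit3hub/miniflow project, a minimal Python implementation of TensorFlow. You are welcome to study the source to better understand how TensorFlow is implemented.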

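This 0.001 is not the learning rate. The learning rate is the coefficient applied to the parameter update after each gradient computation.

Perturbing the inputs and measuring the change in the loss is not more efficient than symbolic differentiation.

Suppose you need the derivative of a loss with respect to a matrix: you have to perturb every element of the matrix to get the derivative for the whole matrix.

Symbolic differentiation is not inefficient. When the computational graph is built, the gradients can, by the chain rule, also become nodes in the graph, and computing the gradients backward should take about as long as computing the loss forward.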
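The cost argument can be illustrated by counting loss evaluations. The sketch below is a toy under assumptions: the loss is a made-up sum of squares over a small matrix, and `numeric_grad` and `analytic_grad` are names invented for this example.

```python
def loss(W):
    """Sum of squares of all entries; a stand-in for a real loss."""
    return sum(v * v for row in W for v in row)


def numeric_grad(f, W, eps=1e-3):
    """Perturb one entry at a time; return the gradient and the number of f calls."""
    calls = 0
    grad = [[0.0] * len(row) for row in W]
    for i, row in enumerate(W):
        for j, _ in enumerate(row):
            original = row[j]
            row[j] = original + eps
            f_plus = f(W)
            row[j] = original - eps
            f_minus = f(W)
            row[j] = original     # restore the entry before moving on
            calls += 2
            grad[i][j] = (f_plus - f_minus) / (2 * eps)
    return grad, calls


def analytic_grad(W):
    """d(sum of w^2)/dw = 2w for every entry, obtained in a single pass."""
    return [[2.0 * v for v in row] for row in W]


W = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
g_num, calls = numeric_grad(loss, W)
g_ana = analytic_grad(W)
print(calls)
```

With a central difference, the number of loss evaluations is twice the number of matrix entries, while the analytic gradient needs one pass regardless of the matrix size.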

For details, see the slides of UW's dlsys course: http://dlsys.cs.washington.edu/pdf/lecture4.pdf

Working through dlsys-course/assignment1 should deepen your understanding.


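It is not the learning rate.

The learning rate appears in the iteration step. When updating a parameter with gradient descent, you take the previous parameter minus learning rate * gradient to get the updated parameter.

This 0.001 is the epsilon used when computing the gradient numerically. When epsilon is small, $f'(x) \approx \frac{f(x+\epsilon)-f(x-\epsilon)}{2\epsilon}$.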
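A small sketch that keeps the two numbers apart, assuming a made-up one-parameter loss: `epsilon` is used only inside the derivative estimate, while `learning_rate` is used only in the update step.

```python
def f(w):
    return (w - 4.0) ** 2             # toy loss with its minimum at w = 4


def central_difference(f, w, epsilon=1e-3):
    # epsilon only controls how the derivative is estimated
    return (f(w + epsilon) - f(w - epsilon)) / (2 * epsilon)


learning_rate = 0.1                   # controls how far each update moves w
w = 0.0
for _ in range(100):
    gradient = central_difference(f, w)
    w = w - learning_rate * gradient  # gradient descent step
print(w)
```

Changing `epsilon` changes the accuracy of `gradient`; changing `learning_rate` changes the step size. Neither can stand in for the other.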

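As I recall from some blog post, TensorFlow uses automatic differentiation, a method that sits between numerical and symbolic differentiation. Numerical differentiation, as is well known, is a disaster in high dimensions and can also be unstable. Symbolic differentiation of the entire computational graph easily leads to expression swell. Automatic differentiation does not differentiate the whole graph symbolically. It applies the symbolic rule on each small branch and then plugs in numbers, and finally combines the branches into the derivative of the whole graph. Please go easy on me if I got this wrong.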
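One way to see "apply the rule on a small piece, then plug in numbers" is with dual numbers, sketched below. Note the assumption: this is forward-mode automatic differentiation, chosen because it fits in a few lines, whereas graph-based libraries such as TensorFlow propagate gradients backward (reverse mode). The `Dual` class is hypothetical.

```python
class Dual:
    """A value paired with its derivative w.r.t. one chosen input."""

    def __init__(self, value, deriv=0.0):
        self.value = value
        self.deriv = deriv

    def __add__(self, other):
        other = to_dual(other)
        # sum rule
        return Dual(self.value + other.value, self.deriv + other.deriv)

    __radd__ = __add__

    def __mul__(self, other):
        other = to_dual(other)
        # product rule, applied to numbers rather than to a growing expression
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

    __rmul__ = __mul__


def to_dual(v):
    return v if isinstance(v, Dual) else Dual(float(v))


def f(x):
    return 3 * x * x * x + 2 * x + 1


out = f(Dual(2.0, 1.0))   # deriv=1.0 marks x as the variable to differentiate
print(out.value, out.deriv)
```

No symbolic expression for the derivative of f is ever built, and no small perturbation is used: every intermediate step carries an exact numeric value and an exact numeric derivative.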

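Today's mainstream neural network libraries all use computational graphs and compute gradients by automatic differentiation.

There are currently three main ways to compute gradients:

Numerical differentiation
Symbolic differentiation
Automatic differentiation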

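Put it this way, an important property of convolutional networks is:

complex analytic gradients are replaced by simple gradient computations inside the network. That is, however complex the function you want to approximate, in a deep network it comes down to a finite set of simple gradient computations (matrix multiplication, convolution, BN, and so on).

Numerical gradients can also stand in for analytic gradients, but the efficiency is painful.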

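This numerical method is hard to put into practice:

First, when you change a parameter by a small amount, how do you choose that amount? With 0.01, for example, the resulting derivative can be quite inaccurate in some cases.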
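The sensitivity to the step size is easy to observe. The sketch below assumes a forward difference on `math.sin`, an arbitrary test function picked for this example, and compares the error for three step sizes.

```python
import math


def forward_difference(f, x, h):
    """Estimate f'(x) by changing x by a small amount h."""
    return (f(x + h) - f(x)) / h


x0 = 1.0
exact = math.cos(x0)                  # true derivative of sin at x0
errors = {}
for h in (1e-2, 1e-5, 1e-15):
    errors[h] = abs(forward_difference(math.sin, x0, h) - exact)
    print(h, errors[h])
```

A step that is too large gives a visibly biased estimate, while a step that is too small is dominated by floating-point rounding, so the best value depends on the function and the point being evaluated.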

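Second, the derivative of each parameter needs its own forward pass, so this is bound to be less efficient than computing derivatives by backpropagation.

Here is my detailed write-up on backpropagation; I hope it helps. https://zhuanlan.zhihu.com/p/29731200

Symbolic differentiation is done once when the computational graph is compiled and can be reused from then on. With numerical differentiation, everything has to be recomputed on every update, which is what makes it slow.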