
tf.stop_gradient() in Geoffrey Hinton's CapsNet ("Dynamic Routing Between Capsules")

tf.stop_gradient():

stop_gradient(input, name=None)

Inside the capsule layer of the naturomics/CapsNet-Tensorflow implementation:

# In forward, u_hat_stopped = u_hat; in backward, no gradient passed back from u_hat_stopped to u_hat
u_hat_stopped = tf.stop_gradient(u_hat, name='stop_gradient')

Here u_hat_stopped is a copy of u_hat, but gradients do not flow from u_hat_stopped back to u_hat.
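This behavior is easy to check on a toy graph. Below is a minimal sketch (my own illustration with made-up scalar values, not code from the CapsNet implementation), showing that tf.stop_gradient() returns its input unchanged in the forward pass while cutting the gradient path:

import tensorflow as tf  # assumes TF 1.x graph mode, as in the rest of this post

x = tf.constant(3.0)
y = tf.stop_gradient(x)        # forward: y has the same value as x
z = x * x + y                  # y is treated as a constant when differentiating

grad_z_x = tf.gradients(z, x)[0]
with tf.Session() as sess:
    # y == 3.0, grad == 2*x == 6.0; the "+ y" term contributes nothing to the gradient
    print(sess.run([y, grad_z_x]))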

The TensorFlow documentation reads as follows:

See the guide: Training > Gradient Computation

Stops gradient computation.

When executed in a graph, this op outputs its input tensor as-is. (That is, the output is just a direct copy of the input.)

When building ops to compute gradients, this op prevents the contribution of its inputs to be taken into account.

(When computing gradients, the input, e.g. u_hat here, is treated as if it did not exist.)

Normally, the gradient generator adds ops to a graph to compute the derivatives of a specified loss by recursively finding out inputs that contributed to its computation. If you insert this op in the graph, its inputs are masked from the gradient generator. They are not taken into account for computing gradients.

This is useful any time you want to compute a value with TensorFlow but need to pretend that the value was a constant. Some examples include:

  • The EM algorithm where the M-step should not involve backpropagation through the output of the E-step.
  • Contrastive divergence training of Boltzmann machines where, when differentiating the energy function, the training must not backpropagate through the graph that generated the samples from the model.
  • Adversarial training, where no backprop should happen through the adversarial example generation process.

Args:

  • input: A Tensor.
  • name: A name for the operation (optional).

Returns:

A Tensor. Has the same type as input.
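As an illustration of the adversarial-training use case listed above, here is a hedged FGSM-style sketch. It is not from the original post or the CapsNet code: model (a hypothetical function returning logits, assumed to reuse its variables across calls), images, labels, and epsilon are all placeholders I introduce for illustration.

logits = model(images)
loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

# Build adversarial examples from the gradient of the loss w.r.t. the inputs ...
grad_wrt_images = tf.gradients(loss, images)[0]
adv_images = images + epsilon * tf.sign(grad_wrt_images)
# ... and cut the graph here so that training does not backprop through the generation process
adv_images = tf.stop_gradient(adv_images)

adv_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=model(adv_images)))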

The code and a detailed explanation follow:

u_hat = tf.matmul(W, input, transpose_a=True)
assert u_hat.get_shape() == [cfg.batch_size, 1152, 10, 16, 1]

# In forward, u_hat_stopped = u_hat; in backward, no gradient passed back from u_hat_stopped to u_hat
u_hat_stopped = tf.stop_gradient(u_hat, name='stop_gradient')

# line 3: for r iterations do
for r_iter in range(cfg.iter_routing):
    with tf.variable_scope('iter_' + str(r_iter)):
        # line 4:
        # => [1, 1152, 10, 1, 1]
        c_IJ = tf.nn.softmax(b_IJ, dim=2)

        # At last iteration, use `u_hat` in order to receive gradients from the following graph
        if r_iter == cfg.iter_routing - 1:
            # line 5:
            # weighting u_hat with c_IJ, element-wise in the last two dims
            # => [batch_size, 1152, 10, 16, 1]
            s_J = tf.multiply(c_IJ, u_hat)
            # then sum in the second dim, resulting in [batch_size, 1, 10, 16, 1]
            s_J = tf.reduce_sum(s_J, axis=1, keep_dims=True)
            assert s_J.get_shape() == [cfg.batch_size, 1, 10, 16, 1]

            # line 6:
            # squash using Eq. 1
            v_J = squash(s_J)
            assert v_J.get_shape() == [cfg.batch_size, 1, 10, 16, 1]
        elif r_iter < cfg.iter_routing - 1:  # Inner iterations, do not apply backpropagation
            s_J = tf.multiply(c_IJ, u_hat_stopped)
            s_J = tf.reduce_sum(s_J, axis=1, keep_dims=True)
            v_J = squash(s_J)

            # line 7:
            # reshape & tile v_J from [batch_size, 1, 10, 16, 1] to [batch_size, 1152, 10, 16, 1]
            # then matmul in the last two dims: [16, 1].T x [16, 1] => [1, 1], reduce mean in the
            # batch_size dim, resulting in [1, 1152, 10, 1, 1]
            v_J_tiled = tf.tile(v_J, [1, 1152, 1, 1, 1])
            u_produce_v = tf.matmul(u_hat_stopped, v_J_tiled, transpose_a=True)
            assert u_produce_v.get_shape() == [cfg.batch_size, 1152, 10, 1, 1]

            # b_IJ += tf.reduce_sum(u_produce_v, axis=0, keep_dims=True)
            b_IJ += u_produce_v

In this routing loop, the u_hat used in the intermediate iterations only serves the routing within the current batch and should not take part in back-propagation. However, TensorFlow executes this Python for loop by unrolling it into one long chain of ops in the graph (with tf.variable_scope('iter_' + str(r_iter)) acting as a namespace to distinguish the variables of different iterations) [my own understanding, not sure it is correct], so we have to make sure the op nodes created in the intermediate iterations do not produce gradients that disturb back-propagation. That is why tf.stop_gradient() is used here to construct u_hat_stopped, a kind of "intermediate variable" copy of u_hat through which no gradient flows.
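To make this concrete, here is a minimal toy sketch of the same pattern (my own illustration with made-up scalars, not CapsNet code): the inner iterations update a coupling-like value c from a gradient-stopped copy, and only the last iteration uses the real tensor, so the gradient ignores how c was computed:

import tensorflow as tf  # TF 1.x graph mode

x = tf.constant(2.0)           # plays the role of u_hat
x_stopped = tf.stop_gradient(x)

c = tf.constant(1.0)           # plays the role of the coupling coefficients
num_iters = 3
for i in range(num_iters):
    if i < num_iters - 1:
        c = c + x_stopped      # inner iterations: refined from the stopped copy, no gradient path to x
    else:
        out = c * x            # last iteration: uses x itself so gradients can flow

grad = tf.gradients(out, x)[0]
with tf.Session() as sess:
    # forward: c = 1 + 2 + 2 = 5, out = 10
    # gradient: d(c*x)/dx = c = 5; the updates to c contribute nothing because they used x_stopped
    print(sess.run([out, grad]))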

The first version of the naturomics/CapsNet-Tensorflow implementation did not use this mechanism and reported the best results with a single routing iteration; the second version uses tf.stop_gradient() and seems to perform best with 3 routing iterations. Combined with the intent of the paper, using tf.stop_gradient() looks like the right choice.
