Getting Started with PyTorch, Part 2: autograd and Variable

08-21

From the column "用pytorch深度懵逼"

This post targets version 0.3, and I will revise it later. The Zhihu draft editor glitched once, so I am publishing it now just in case.

Autograd: Automatic Differentiation

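The autograd package is at the core of neural networks in PyTorch. It provides automatic differentiation for all operations on tensors. It is a define-by-run framework: backpropagation is defined by how your code runs, and every iteration can be different.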
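To make "every iteration can be different" concrete, here is a minimal sketch in which ordinary Python control flow decides the shape of the graph. The helper power_by_loop and its argument n are hypothetical names introduced only for this illustration, and the snippet assumes a release in which Variable can still be imported from torch.autograd.

```python
import torch
from torch.autograd import Variable

def power_by_loop(x, n):
    """Hypothetical helper: multiply by x n times, so the result is x ** (n + 1)."""
    y = x
    for _ in range(n):  # the loop count decides how many multiply nodes the graph gets
        y = y * x
    return y

x = Variable(torch.Tensor([2]), requires_grad=True)

out = power_by_loop(x, 2)            # graph for x ** 3
out.backward()
grad_cube = float(x.grad.data[0])    # 3 * x**2

x.grad.data.zero_()                  # clear the accumulated gradient before the next run

out = power_by_loop(x, 3)            # a different graph, built on the fly: x ** 4
out.backward()
grad_fourth = float(x.grad.data[0])  # 4 * x**3

print(grad_cube, grad_fourth)
```

The same Python function produced two different graphs, and backward() followed whichever one was actually executed.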

Variable

If a tensor is a coin, then a Variable is the wallet: it records how much money is inside and where the money flows.

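A Variable is a wrapper around a tensor: the data attribute stores the tensor data, the grad attribute stores the gradient with respect to that variable, and grad_fn (named creator in earlier releases) records what created the variable.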

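autograd.Variable is the central class of the package. It wraps a Tensor and supports nearly all Tensor operations, plus a few extra attributes. Once the computation is finished, call .backward() to compute the gradients. The raw tensor is available through .data, and the gradient with respect to the variable is accumulated into .grad.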
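The word "accumulated" matters in practice. The sketch below, under the assumption that Variable is importable in your release, runs backward twice on the same input and shows that .grad adds up until it is reset by hand.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([3]), requires_grad=True)

y = x * x                               # y = x^2, so dy/dx = 2x
y.backward()
after_first = float(x.grad.data[0])     # 2 * 3

y = x * x                               # build the graph again for a second pass
y.backward()
after_second = float(x.grad.data[0])    # the new gradient was added to the old one

x.grad.data.zero_()                     # reset by hand
after_reset = float(x.grad.data[0])

print(after_first, after_second, after_reset)
```

This is why training loops usually zero the gradients before each backward pass.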

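Variable and Function are interconnected and build an acyclic graph that encodes the complete history of the computation. Every variable has a .grad_fn attribute that references the Function that created it, such as an add, subtract, multiply or divide. The exception is a variable created by the user: it is a starting node of the graph, so its .grad_fn is None.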
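A small sketch of the difference between user-created variables and computed ones. The exact class names printed for grad_fn depend on the PyTorch release, so no output is shown here.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([2]), requires_grad=True)  # created by the user
y = x * 3                                            # created by a multiply
z = y + 1                                            # created by an add

print(x.grad_fn)             # None: nothing created x
print(y.grad_fn)             # a backward node for the multiply
print(z.grad_fn)             # a backward node for the add
print(x.is_leaf, z.is_leaf)  # x is a leaf of the graph, z is not
```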

Variable and Tensor

The Tensor lives in the Variable's .data attribute, and data is moved between the CPU and the GPU with .cpu() and .cuda().

>> import torch
>> from torch.autograd import Variable
>> a = Variable(torch.Tensor([1]), requires_grad=True).cuda()
>> a
Variable containing:
 1
[torch.cuda.FloatTensor of size 1 (GPU 0)]
>> a.data
 1
[torch.cuda.FloatTensor of size 1 (GPU 0)]
>> a.cpu()
Variable containing:
 1
[torch.FloatTensor of size 1]
>> a.cpu().data
 1
[torch.FloatTensor of size 1]
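One detail of the session above deserves a sketch. As far as I can tell, .cuda() is itself an operation recorded by autograd, so on a GPU machine the result of Variable(...).cuda() is a computed variable rather than a user-created one, and its .grad is not filled in by backward(). A possible workaround is to move the tensor first and wrap it afterwards. The snippet guards every GPU call with torch.cuda.is_available() so that it also runs on a CPU-only machine; the names leaf, moved and back are illustrative.

```python
import torch
from torch.autograd import Variable

use_cuda = torch.cuda.is_available()

# Move the raw tensor first, then wrap it: the Variable is user-created, hence a leaf.
data = torch.Tensor([1])
if use_cuda:
    data = data.cuda()
leaf = Variable(data, requires_grad=True)
print(leaf.is_leaf)

if use_cuda:
    # Wrap first, then move: the GPU copy is the output of an operation.
    moved = Variable(torch.Tensor([1]), requires_grad=True).cuda()
    print(moved.is_leaf, moved.grad_fn)

# .cpu() brings the value back to host memory (a no-op if it is already there).
back = leaf.cpu()
print(back.data)
```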

Automatic differentiation:

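Note that, by design, the grad of an intermediate variable is freed as soon as it has served its purpose in the backward pass. Also, the grad_fn of a starting node is None.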

>> import torch
>> from torch.autograd import Variable
>> # requires_grad=True turns on gradient tracking
>> a = Variable(torch.Tensor([1]), requires_grad=True)
>> b = Variable(torch.Tensor([2]), requires_grad=True)
>> c = Variable(torch.Tensor([3]), requires_grad=True)
>> d = a + b
>> e = d + c
>> e.backward()
>> print(a.grad, b.grad, c.grad)
Variable containing:
 1
[torch.FloatTensor of size 1]
Variable containing:
 1
[torch.FloatTensor of size 1]
Variable containing:
 1
[torch.FloatTensor of size 1]
>> d.grad  # gradients of intermediate variables are not kept, so this is empty
>> a.grad_fn  # the .grad_fn of a starting node is empty
>> e.grad_fn
<AddBackward1 at 0x7f387cf1c588>
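If the gradient of an intermediate variable is actually needed, one option is to ask autograd to keep it. The sketch below assumes that retain_grad() is available in your release; it repeats the example above and keeps the gradient of d.

```python
import torch
from torch.autograd import Variable

a = Variable(torch.Tensor([1]), requires_grad=True)
b = Variable(torch.Tensor([2]), requires_grad=True)
c = Variable(torch.Tensor([3]), requires_grad=True)

d = a + b
d.retain_grad()   # ask autograd not to free d.grad after the backward pass
e = d + c
e.backward()

kept = float(d.grad.data[0])   # de/dd = 1
print(kept)
```

Registering a hook on d, described further down, is another way to get at the same value.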

Excluding subgraphs from backpropagation

Every Variable has two flags, requires_grad and volatile. Both can exclude subgraphs from the gradient computation, which also makes the computation more efficient.

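requires_grad: excludes a specific subgraph. A variable with requires_grad=False takes no part in the backward computation, so no grad is accumulated for it.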

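volatile: inference mode. As soon as a single input in the graph has volatile=True, everything computed from it is excluded from backpropagation, and .backward() is disallowed.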

>> a = Variable(torch.Tensor([1]), requires_grad=False)
>> b = Variable(torch.Tensor([2]), requires_grad=True)
>> c = a + b
>> c.backward()
>> a.grad  # a has requires_grad=False, so no gradient is stored
>> b.grad
Variable containing:
 1
[torch.FloatTensor of size 1]
----------------------------------------
>> a = Variable(torch.Tensor([1]), volatile=True)
>> b = Variable(torch.Tensor([2]), requires_grad=True)
>> c = a + b
>> c.backward()  # one of the inputs is volatile, so backpropagation is not possible
RuntimeError: element 0 of variables tuple is volatile
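The two flags spread through the graph in opposite ways: an output requires grad as soon as one input does, whereas volatile switches tracking off for everything downstream. The sketch below shows the requires_grad side. Since volatile was removed in PyTorch 0.4 in favor of torch.no_grad(), the inference-mode part is written with no_grad and guarded by hasattr so that it is simply skipped on 0.3. The names frozen, tracked and inferred are illustrative.

```python
import torch
from torch.autograd import Variable

a = Variable(torch.Tensor([1]))                      # requires_grad defaults to False
b = Variable(torch.Tensor([2]), requires_grad=True)

frozen = a + a    # no input requires grad, so no history is recorded
tracked = a + b   # one input requires grad, so the output does too
print(frozen.requires_grad, tracked.requires_grad)

# Assumption: PyTorch 0.4 or later, where torch.no_grad() took over the role of volatile.
if hasattr(torch, "no_grad"):
    with torch.no_grad():
        inferred = a + b   # b requires grad, but tracking is switched off in this block
    print(inferred.requires_grad)
```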

Registering hooks:

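A hook on a Variable works like a plug-in: it attaches extra functionality to the main code without modifying it. Think of a hunter whose jacket has a clip for hanging a gun: he can pick whichever gun he wants to carry and get a different hunting result.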

In a Variable this is done with register_hook. Every hook you register is called during backpropagation. For example, you can define a print function that prints the grad value on every backward pass.

Note that register_hook takes a function, and that function must have the following form:

hook(grad) -> Variable or None

In other words, this function has the power to change the gradient value!

Example: a hook that prints grad

import torch
from torch.autograd import Variable

# Define a print function so that every backward pass prints the corresponding grad
def print_grad(grad):
    print(grad)

>> x = Variable(torch.randn([1]), requires_grad=True)
>> y = x + 2
>> z = torch.mean(torch.pow(y, 2))
>> y.register_hook(print_grad)  # attach the print function to variable y
>> z.backward()
Variable containing:
 5.1408
[torch.FloatTensor of size 1]
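Because a hook may return a value, it can also rewrite the gradient that flows further back. The following sketch scales the gradient passing through y by 10; the factor and the variable names are arbitrary choices for illustration.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.Tensor([1, 2, 3]), requires_grad=True)
y = x * 2

# The returned value replaces the gradient that continues backward from y.
handle = y.register_hook(lambda grad: grad * 10)

z = y.sum()
z.backward()

# Without the hook dz/dx would be 2 everywhere; with it, 10 * 2.
scaled = x.grad.data.tolist()
print(scaled)

handle.remove()   # detach the hook once it is no longer needed
```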

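So when would we typically register a hook? A few common cases: when we want to extract the gradients of intermediate layers for visualization, when we want to save the gradients of intermediate variables, or when we want to change gradient values during propagation.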
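For the "save intermediate values" case, a common pattern is a hook that stores what it receives in a dictionary. This is only a sketch: the dictionary grads and the factory save_grad are hypothetical names, not part of PyTorch.

```python
import torch
from torch.autograd import Variable

grads = {}   # hypothetical storage: name -> gradient seen during backward

def save_grad(name):
    """Return a hook that stores the incoming gradient under the given name."""
    def hook(grad):
        grads[name] = grad
    return hook

x = Variable(torch.Tensor([1]), requires_grad=True)
y = x + 2                         # y = 3
z = torch.mean(torch.pow(y, 2))   # z = y^2, so dz/dy = 2 * y

y.register_hook(save_grad("y"))
z.backward()

saved = float(grads["y"].data[0])
print(saved)
```

The hook returns nothing, so the gradient itself is left unchanged.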

Custom Function

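When implementing an idea quickly, creating a custom Operation lets us use PyTorch flexibly. A custom Operation has to subclass autograd.Function to become part of autograd. That way, when the Operation is called, autograd can compute both the result and the gradient and encode the operation's history. Each operation needs to implement three methods: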

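__init__ (optional): if your operation takes arguments that are not Variables, you can pass them to __init__ and use them inside the operation. If the operation needs no extra arguments, you can skip __init__.
forward(): the logic of the operation. It can take any number of arguments, but they must be Variables. It can return either a single Variable or a tuple of Variables.
backward(): the gradient computation. It takes as many arguments as forward() returns, each one being the gradient flowing back into this Operation, and it returns as many values as the operation has inputs. For an input that needs no gradient, it can return None.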
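To illustrate the rule that backward() returns one value per input and may return None, here is a sketch in the @staticmethod style, where a non-tensor argument is passed straight to forward() rather than to __init__. The class name Scale and the argument factor are made up for this example.

```python
import torch
from torch.autograd import Variable

class Scale(torch.autograd.Function):
    """Hypothetical operation: multiply the input by a plain Python number."""

    @staticmethod
    def forward(ctx, x, factor):
        ctx.factor = factor   # non-tensor state can be stored directly on ctx
        return x * factor

    @staticmethod
    def backward(ctx, grad_output):
        # Two inputs went into forward, so two values come back:
        # the gradient for x, and None for the non-differentiable factor.
        return grad_output * ctx.factor, None

x = Variable(torch.Tensor([1, 2]), requires_grad=True)
out = Scale.apply(x, 3.0)
out.sum().backward()

grad_x = x.grad.data.tolist()
print(grad_x)
```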

Example:

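Note: the official definitions generally use @staticmethod and call the function through custom_function.apply.

You can also leave out @staticmethod, in which case the call goes through custom_function() instead.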

import torch
from torch.autograd import Variable

class custom_add(torch.autograd.Function):
    """
    We can implement a custom function by subclassing torch.autograd.Function.
    The forward and backward passes operate on tensors.
    """

    @staticmethod
    def forward(ctx, input, input2):
        """
        In the forward pass we receive tensors as input and return a tensor as output.
        ctx is an object for stashing what the backward pass will need; anything
        stored with ctx.save_for_backward can be used during backpropagation.
        """
        ctx.save_for_backward(input, input2)
        output = input + input2
        return output

    @staticmethod
    def backward(ctx, grad_output):
        """
        In the backward pass we receive a tensor holding the gradient of the loss
        (grad_output) and compute the gradients to return for the inputs.
        """
        input1, input2 = ctx.saved_tensors  # the data stored in forward by save_for_backward
        grad_input = grad_output.clone()
        return grad_input, grad_input  # forward takes two inputs, so two grads are returned

new_add = custom_add.apply
a = Variable(torch.Tensor([1]), requires_grad=True)
b = Variable(torch.Tensor([2]), requires_grad=True)
c = new_add(a, b)
>> c
Variable containing:
 3
[torch.FloatTensor of size 1]
>> c.backward()
>> a.grad
Variable containing:
 1
[torch.FloatTensor of size 1]
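The addition above saves its inputs but never needs them, because the gradient of a sum does not depend on the operands. A multiplication does, so the following sketch shows the saved values being used. The class name CustomMul is made up for this example. I have only checked the code against releases where Variable and Tensor are merged; on 0.3 you may need ctx.saved_variables instead of ctx.saved_tensors inside backward().

```python
import torch
from torch.autograd import Variable

class CustomMul(torch.autograd.Function):
    """Hypothetical element-wise product with a hand-written backward pass."""

    @staticmethod
    def forward(ctx, x, y):
        ctx.save_for_backward(x, y)   # both operands are needed for the gradient
        return x * y

    @staticmethod
    def backward(ctx, grad_output):
        x, y = ctx.saved_tensors
        # d(x*y)/dx = y and d(x*y)/dy = x, each scaled by the incoming gradient
        return grad_output * y, grad_output * x

a = Variable(torch.Tensor([2]), requires_grad=True)
b = Variable(torch.Tensor([3]), requires_grad=True)
c = CustomMul.apply(a, b)
c.backward()

grad_a = float(a.grad.data[0])
grad_b = float(b.grad.data[0])
print(grad_a, grad_b)
```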

Checking a custom Function:

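After finishing your function, you will probably want to know whether its logic and its backward pass are written correctly. You can confirm this by comparing against the result of finite differences with a small step, which is what gradcheck does.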

from torch.autograd import gradcheck

# gradcheck takes a tuple of tensors as input, checks whether the gradients
# evaluated with these tensors are close enough to numerical
# approximations, and returns True if they all satisfy this condition.
input = (Variable(torch.Tensor([2]).double(), requires_grad=True),
         Variable(torch.Tensor([3]).double(), requires_grad=True),)
test = gradcheck(custom_add.apply, input, eps=1e-6, atol=1e-4)
>> print(test)
True
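A check is more convincing when it can also fail. The sketch below runs gradcheck on a correct square function and on a deliberately broken one. Both classes are hypothetical, and the code assumes that your release of gradcheck accepts raise_exception=False so that a mismatch is reported as False rather than as an exception.

```python
import torch
from torch.autograd import Variable, gradcheck

class Square(torch.autograd.Function):
    """Hypothetical x^2 with a correct backward pass."""

    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * x

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * 2 * x

class BadSquare(Square):
    """Same forward pass, but the factor 2 is missing from the gradient."""

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        return grad_output * x   # deliberately wrong

# Double precision keeps the finite-difference estimate accurate enough.
x = Variable(torch.Tensor([2]).double(), requires_grad=True)

good = gradcheck(Square.apply, (x,), eps=1e-6, atol=1e-4, raise_exception=False)
bad = gradcheck(BadSquare.apply, (x,), eps=1e-6, atol=1e-4, raise_exception=False)
print(good, bad)
```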

Profiling tools:

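If you want to see how much time your operations take, the autograd profiler gives insight into the cost of each operation on the GPU and the CPU: use profile for the CPU, and emit_nvtx for nvprof-based profiling.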

>>> x = Variable(torch.randn(1, 1), requires_grad=True)
>>> with torch.autograd.profiler.profile() as prof:
...     y = x ** 2
...     y.backward()
>>> # NOTE: some columns were removed for brevity
... print(prof)
-------------------------------------  ---------------  ---------------
Name                                          CPU time        CUDA time
-------------------------------------  ---------------  ---------------
PowConstant                                  142.036us          0.000us
N5torch8autograd9GraphRootE                   63.524us          0.000us
PowConstantBackward                          184.228us          0.000us
MulConstant                                   50.288us          0.000us
PowConstant                                   28.439us          0.000us
Mul                                           20.154us          0.000us
N5torch8autograd14AccumulateGradE             13.790us          0.000us
N5torch8autograd5CloneE                        4.088us          0.000us
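When the same operation shows up many times, as PowConstant does above, a per-name summary can be easier to read than the raw event list. The sketch below assumes that the profiler object in your release offers key_averages() and that its result has a table() method; timings vary from run to run, so no output is shown.

```python
import torch
from torch.autograd import Variable

x = Variable(torch.randn(64, 64), requires_grad=True)

# Record CPU timings for both the forward and the backward pass.
with torch.autograd.profiler.profile() as prof:
    y = (x ** 2).sum()
    y.backward()

# Assumption: key_averages() groups the recorded events by operation name.
summary = prof.key_averages().table()
print(summary)
```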