小白的tensorflow+CUDA編程踩坑記錄

08-14

4 人贊了文章

初學CUDA，記錄下自己踩過的坑。

A. 編譯GPU版本的tensorflow OpKernel

參考網頁https://www.tensorflow.org/extend/adding_an_op。步驟如下：

在一個.cpp文件中寫這個Op類的介面，這個自定義的類應當繼承tensorflow::OpKernel基類，並重寫其Compute方法；
在cpp文件中寫Register代碼，可以選擇用CPU或者GPU實現，最後編譯的時候把以下參數傳到REGISTER_KERNEL_BUILDER中。

typedef Eigen::ThreadPoolDevice CPUDevice;typedef Eigen::GpuDevice GPUDevice;

具體的Op計算過程只在cpp文件中留出函數介面，不具體實現；
在.cu文件中寫cuda代碼，實現並行化的計算；
（optional）重複以上過程，實現該Op所對應的gradient計算過程，同樣封裝成一個tensorflow::OpKernel的子類；
用g++和nvcc編譯C++和CUDA代碼，生成動態鏈接.so文件；
在Python中調用動態鏈接庫並進行進一步的封裝，用Python中的@tf.RegisterGradient做decorator把gradient的實現綁定到Op上；

Note：

為了編譯方便，REGISTER_KERNEL_BUILDER的代碼最好寫在條件編譯中；
tensorflow的OpKernel註冊時的名字必須嚴格按照大駝峰法命名，對應Python中變成小寫，下劃線分隔。比如說：

REGISTER_KERNEL_BUILDER( Name("SparseGraphConvolution").Device(DEVICE_GPU).TypeConstraint<float>("T"), GraphAdjacencyGeneratorOp<GPUDevice, float>);

到了Python中的名字就叫做sparse_graph_convolution

用OP_REQUIRES檢查輸入合法性，用OP_REQUIRES_OK分配新的內存空間。在多GPU環境下，後者比cuda中的cudaMalloc效率更高，因為不會造成多GPU之間的資源競爭。所以在多GPU環境下應該儘可能不用cudaMalloc函數，最好也不要在CUDA kernel裡面寫new和delete；
可以仿照STL的設計思路，設計這樣一個類，然後傳到tensorflow::OpKernel子類的template參數中去：

struct GpuAlloc { GpuAlloc(OpKernelContext *context) { this->context = context;} void alloc(void**data, int bytes) { int num_elems = std::ceil(bytes / 8.0); // Have to keep the tensors until end of op to avoid memory crash. tmp_tensors.push_back(Tensor()); Tensor &tmp = tmp_tensors.back(); TensorShape shape({num_elems}); OP_REQUIRES_OK(context, context->allocate_temp(DataTypeToEnum<double>::value, shape, &tmp)); double *buf = &(tmp.flat<double>()(0)); *data = (void*)buf; } OpKernelContext *context; std::vector<Tensor> tmp_tensors;};

原則上，確實應該儘可能地多用CUDA中的share的memory，在一些情況下，shared memory的讀取速度甚至和寄存器的速度差不多。問題在於，神經網路中每一層的輸出節點數量用戶是可以隨便定義的，如果輸出節點數量太大，shared memory就不夠用，如果每個block中用到的shared memory數量超過了maxSharedMemoryPerBlock的值，kernel就不會啟動。我暫時還沒有搞清楚谷歌在tensorflow中是如何處理這個問題的；

B. CUDA中的kernel封裝問題

這個問題起源於我的一次自作聰明：每次傳給kernel的參數，無論實際維度是多少，總是一個一級指針，直接對這個一級指針操作很容易不小心粗心寫錯，那麼可不可以對這個指針做一個簡單的封裝，讓它稍微好用一點呢？

於是我寫了這麼一段簡單的代碼來測試這個想法：

template <typename T> class QueryBase {public: QueryBase(T* _data): data(_data) {} __device__ virtual T& operator()(int, int) = 0;protected: __device__ QueryBase(const QueryBase<T>&); T* data;};template <typename T> class ForwardQuery : public QueryBase<T> {public: ForwardQuery(T* _data, int _size) : QueryBase<T>(_data), size(_size) {} __device__ T& operator()(int a, int b) { return this->data[a*size+b]; }protected: int size;};template<typename T>__global__ void testTemplate(QueryBase<T>& in, QueryBase<T>& out) { out(threadIdx.x， threadIdx.y) = in(threadIdx.x, threadIdx.y) + 1.0; __syncthreads();}

思路就是在host上建一個類，這個類的構造函數是__host__函數，其中保存一個指向device內存的指針，把它的operator()寫成__device__函數，最後把這個類傳進kernel的參數列表中。這樣，在kernel中就可以方便地用operator()解引用矩陣元素了。

結果：可以通過編譯，但運行結果不對。即使我小心翼翼地設計這個類，把它寫成線程安全的，結果依然是錯誤的。原因至今還不清楚。這說明寫CUDA kernel的時候還是老老實實用下標比較好。

C. Trouble shooting for nvcc

Bug1：找不到op.h——沒有包含tensorflow的include路徑，參考網頁https://www.tensorflow.org/extend/adding_an_op，加上這部分可以解決；

?-I/home/xxx/anaconda3/lib/python3.5/site-packages/tensorflow/include -I/home/xxx/anaconda3/lib/python3.5/site-packages/te/nsync/public–D_GLIBCXX11_ABI=0

Bug2：version GLIBCXX_3.4.21 not found——參考網頁https://blog.csdn.net/amor_tila/article/details/77976964，這個錯誤常見於新裝的ubuntu系統，因為新ubuntu系統中的libgcc是老版本的，其他網站一般就教你用硬鏈接ln一個假的libgcc出來，但個人覺得最好的辦法還是自己conda install libgcc裝一個新的；
Bug3：找不到nsynccv.h——解決方案詳見https://blog.csdn.net/qq_27637315/article/details/79114633
Bug4：Undefined symbol: _ZTIN10tensorflow8OpKernelE，詳見網站https://blog.csdn.net/qq_17827079/article/details/79709674
Bug5：undefined symbol: _ZN10tensorflow8internal21CheckOpMessageBuilder9NewStringEv：參考網頁https://github.com/tensorflow/tensorflow/issues/9137。需要注意不光是nvcc編譯的時候要加-D_GLIBCXX_USE_CXX11_ABI=0，最後一步用g++聯合編譯生成動態鏈接庫的時候，也要加這一條命令；
g++編譯時如果用了比-O2更高級別的編譯器優化，可能會有意料之外的bug。

D. C++版本CUDA代碼示例

網上C風格的CUDA學習資源很多，很多人也提倡用C風格的代碼，因為效率更高，編譯器的坑也少一些。之所以要寫C++的代碼，主要還是因為C++的template寫起來可以很輕鬆的控制OpKernel中的變數類型，可以同時支持很多不同類型參數的計算。下面的示例代碼把tensorflow官網上的那個AddOneOp改寫成C++風格，功能是把輸入的前K個數字加1：

// cuda_op_kernel.cu.cc#if GOOGLE_CUDA #define EIGEN_USE_GPU#include "third_party/eigen3/unsupported/Eigen/CXX11/Tensor"#include "tensorflow/core/framework/op.h"#include "tensorflow/core/framework/op_kernel.h"#include "tensorflow/core/framework/register_types.h"#include "tensorflow/core/framework/shape_inference.h"#include "tensorflow/core/framework/common_shape_fns.h"using namespace tensorflow;typedef Eigen::ThreadPoolDevice CPUDevice;typedef Eigen::GpuDevice GPUDevice;template <typename T>__global__ void AddOneKernel(const T* in, const int N, T* out) { int idx = blockIdx.x * N + threadIdx.x; out[idx] = in[idx] + 1;}template <typename T>void AddOneKernelLauncher(const T* in, const int batch_size, const int N, const int K, T* out) { AddOneKernel<T><<<batch_size, K>>>(in, N, out); cudaDeviceSynchronize();}template <typename Device, typename T>class AddOneOp : public OpKernel {public: explicit AddOneOp(OpKernelConstruction* context) : OpKernel(context) { OP_REQUIRES_OK(context, context->GetAttr("K", &K_)); } void Compute(OpKernelContext* context) override { const Tensor& input_tensor = context->input(0); auto input = input_tensor.flat<T>(); const int batch_size = input_tensor.shape().dim_size(0); const int N = input.size() / batch_size; Tensor* output_tensor = NULL; OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(), &output_tensor)); auto output = output_tensor->flat<T>(); OP_REQUIRES(context, K_>0 && K_ <= N, ::tensorflow::errors::InvalidArgument("Invalid K value")); AddOneKernelLauncher<T>(input.data(), batch_size, N, K_, output.data()); }private: int K_;};REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_GPU).TypeConstraint<int>("T"), AddOneOp<GPUDevice, int>);REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_GPU).TypeConstraint<float>("T"), AddOneOp<GPUDevice, float>);REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_GPU).TypeConstraint<double>("T"), AddOneOp<GPUDevice, double>);#endif

然後是C++代碼：

// cuda_op_kernel.cc#include "tensorflow/core/framework/op.h" #include "tensorflow/core/framework/op_kernel.h" using namespace tensorflow;typedef Eigen::ThreadPoolDevice CPUDevice;typedef Eigen::GpuDevice GPUDevice;REGISTER_OP("AddOne").Attr("T: {int32, float, double}").Input("input: T").Output("output: T").Attr("K: int").Doc(R"doc( Adds 1 to all elements of the tensor. output: A Tensor. out_{i<K} = in_{i<K} + 1)doc");template<typename T>void AddOneKernelLauncher(const T* in, const int batch_size, const int N, const int K, T* out);template <typename Device, typename T>class AddOneOp : public OpKernel {public: explicit AddOneOp(OpKernelConstruction* context) : OpKernel(context) { OP_REQUIRES_OK(context, context->GetAttr("K", &K_)); } void Compute(OpKernelContext* context) override { const Tensor& input_tensor = context->input(0); auto input = input_tensor.flat<T>(); const int batch_size = input_tensor.shape().dim_size(0); const int N = input.size() / batch_size; Tensor* output_tensor = NULL; OP_REQUIRES_OK(context, context->allocate_output(0, input_tensor.shape(), &output_tensor)); auto output = output_tensor->flat<T>(); OP_REQUIRES(context, K_>0 && K_<=N, ::tensorflow::errors::InvalidArgument("Invalid K value")); AddOneKernelLauncher<T>(input.data(), batch_size, N, K_, output.data()); }private: int K_;};#ifndef GOOGLE_CUDAREGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU).TypeConstraint<int>("T"), AddOneOp<CPUDevice, int>);REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU).TypeConstraint<float>("T"), AddOneOp<CPUDevice, float>);REGISTER_KERNEL_BUILDER(Name("AddOne").Device(DEVICE_CPU).TypeConstraint<double>("T"), AddOneOp<CPUDevice, double>);#endif

最後是bash腳本（踩了一堆坑，寫的有點啰嗦了）：

TF_CFLAGS=( $(python -c import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()[: -1]))) )TF_LFLAGS=( $(python -c import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))) )TF_IFLAGS=( $(python -c import tensorflow as tf; print(tf.sysconfig.get_include())) )g++ -std=c++11 -shared cuda_op_kernel.cc -o cuda_op_kernel.so -fPIC ${TF_CFLAGS[@]} -D_GLIBCXX_USE_CXX11_ABI=0 ${TF_LFLAGS[@]} -O2nvcc -std=c++11 -c -o cuda_op_kernel.cu.o cuda_op_kernel.cu.cc ${TF_CFLAGS[@]} -D_GLIBCXX_USE_CXX11_ABI=0 ${TF_LFLAGS[@]} -D GOOGLE_CUDA=1 -x cu -Xcompiler -fPICg++ -std=c++11 -shared -o cuda_op_kernel.so cuda_op_kernel.cc cuda_op_kernel.cu.o -D_GLIBCXX_USE_CXX11_ABI=0 -I$TF_IFLAGS -I$TF_IFLAGS/external/nsync/public -L/usr/local/cuda-9.1/lib64 ${TF_LFLAGS[@]} -fPIC -lcudart