How should I read the MXNet source code?

12-28

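I have recently been reading the MXNet source (dmlc/mxnet on GitHub). My first impression is that MXNet has the potential to become an excellent deep learning framework: it is flexible, fast, and fairly easy to extend with new features. If I understand correctly, MXNet uses a parameter server to get multi-machine, multi-GPU parallelism directly, and uses mshadow so that one piece of code runs on both CPU and GPU; I have not yet figured out what symbol is for. Overall I think MXNet is a great set of deep learning tools, so I plan to dig into the code. However, dmlc's original goal seems to have been a simple, easy-to-use Python interface, and there is very little documentation on the code architecture and implementation details. Could the experts on Zhihu explain how the MXNet code is structured, what each module does, and ideally the key techniques behind it?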

Similar questions:
How should I read the parameter server code? - Machine Learning
How can I learn the TensorFlow code efficiently? - Study Methods

Let me help answer this. The design documents are here; I hope they are useful to everyone:

http://mxnet.readthedocs.org/en/latest/developer-guide/index.html

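As for the efficiency concern raised in an answer above: tested on a GoogLeNet + BN setting, MXNet is 50% faster than cxxnet on both a single GPU and multiple GPUs, and uses about half the GPU memory.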

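Read it on your knees.

I spent the whole day today reading it on my knees.

My basic impressions:
Whoa, what is this idiom? Let me google it.
Whoa, what is this one now? I don't even know what to search for!

By comparison, the code I write is basically plain C...

Experts, please explain.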

The code is very modern

It uses many C++11 features. When I started reading it, I realized I had fallen behind.

---

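Very memory-efficient

I ran distributed GoogLeNet + BN with batch size 144, and it used less than 4 GB of memory.

Asynchrony works well

It saturated the network bandwidth.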

----

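Update #2015.12.15
Symbolic computation

Symbolic computation in MXNet means building a network with mxnet::symbol, for example GoogLeNet, Inception-BN, or an LSTM. It feels just like writing a Caffe network definition file. Once the network is defined, the model is ready to use.

If you are familiar with Caffe's network definition structure, I guarantee you will understand how MXNet defines networks with symbol within a few minutes.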

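Imperative computation

Imperative computation in MXNet means using mshadow to carry out the concrete numeric work, such as matrix operations.

Unlike TF (TensorFlow), which uses Eigen (a computation library) to compute on tensors, MXNet uses mshadow, an expression template library.

Expression templates are a very useful C++ technique that makes the following possible:

Avoiding the creation of intermediate variables
Lazy evaluation: the whole expression graph is known before anything runs, so it can be optimized first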

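Expression templates

Here is my understanding: this approach is much like building an AST (abstract syntax tree) as described in compiler theory. Once the tree exists, it can be evaluated recursively.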
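To make the AST analogy concrete, here is a minimal sketch in Python. It is only an analogy: mshadow does this at compile time with C++ templates, whereas this toy builds the tree at run time. The class names `Expr`, `Vec`, and `BinaryExpr` are made up for illustration and are not mshadow's.

```python
class Expr:
    """Base class: operators build tree nodes instead of computing."""

    def __add__(self, other):
        return BinaryExpr(self, other, lambda a, b: a + b)

    def __mul__(self, other):
        return BinaryExpr(self, other, lambda a, b: a * b)


class Vec(Expr):
    """Leaf node that owns actual data."""

    def __init__(self, data):
        self.data = list(data)

    def __len__(self):
        return len(self.data)

    def eval_at(self, i):
        return self.data[i]

    def assign(self, expr):
        # One fused loop: each element is computed by walking the tree
        # recursively, so no temporary vector is built for sub-expressions.
        self.data = [expr.eval_at(i) for i in range(len(expr))]
        return self


class BinaryExpr(Expr):
    """Interior node: remembers its operands and the operation."""

    def __init__(self, lhs, rhs, op):
        self.lhs, self.rhs, self.op = lhs, rhs, op

    def __len__(self):
        return len(self.lhs)

    def eval_at(self, i):
        return self.op(self.lhs.eval_at(i), self.rhs.eval_at(i))


a, b, c = Vec([1, 2, 3]), Vec([4, 5, 6]), Vec([7, 8, 9])
tree = a + b * c                    # nothing is computed yet, only a tree
out = Vec([0, 0, 0]).assign(tree)   # evaluation happens here, in one pass
print(out.data)                     # [29, 42, 57]
```

The point of the design is that `b * c` never becomes a vector of its own; it only exists as a node that is consulted element by element during the final assignment.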

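I skimmed the code; it is similar to most existing deep learning systems.

Storage system - Parameter server

PS is simple and well known, but implementing it with ASP, SSP, and so on takes a well-thought-out design. You can't write two separate sets of code for two modes, right? Argh!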
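One common way to avoid a separate code path per consistency mode is to express them all through a single staleness bound: 0 behaves like fully synchronous training, a finite bound gives SSP, and an infinite bound gives ASP. The sketch below illustrates that idea only; `StalenessClock` is a hypothetical name, and this is not claimed to be how MXNet's parameter server is actually implemented.

```python
import math


class StalenessClock:
    """Tracks per-worker iteration clocks under one staleness bound."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers
        # 0: synchronous (BSP), finite: SSP, math.inf: fully asynchronous
        self.staleness = staleness

    def can_advance(self, worker):
        # A worker may start another iteration only if it is at most
        # `staleness` iterations ahead of the slowest worker.
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def tick(self, worker):
        if not self.can_advance(worker):
            return False  # the caller should wait for slower workers
        self.clocks[worker] += 1
        return True


bsp = StalenessClock(num_workers=2, staleness=0)
print(bsp.tick(0), bsp.tick(0))    # True False: worker 0 must wait

ssp = StalenessClock(num_workers=2, staleness=2)
print([ssp.tick(0) for _ in range(4)])   # [True, True, True, False]

asp = StalenessClock(num_workers=2, staleness=math.inf)
print(all(asp.tick(0) for _ in range(100)))   # True: never blocked
```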

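Matrix computation system - Mshadow

Mshadow is mainly a wrapper around matrix operations. Since computation now has to run on both CPU and GPU, Mshadow uses templates to generalize over this heterogeneous computing. To use SSE and MKL we need aligned memory, and Mshadow implements all of this well.

Computational Graph - Symbol: to optimize the computation graph, wait until the graph is fully built and then call Apply Pass to run optimizations such as operator fusion, Memory Planning, InferShape, and PlaceDevice. PlaceDevice is implemented very bluntly: devices are specified by hand. InferShape is mainly there for usability: users no longer have to state the shape of every layer explicitly. Operator fusion merges small operators into a larger one, which often helps numerical stability. The softmax cross-entropy layer is a very good example.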
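To show why fusing softmax with cross entropy helps numerically, here is a small sketch that uses the standard log-sum-exp trick. It is a generic illustration in plain Python, not a transcription of MXNet's operator; `naive_loss` and `fused_loss` are names chosen for this example.

```python
import math


def naive_loss(logits, label):
    # Two separate "operators": softmax first, then cross entropy.
    exps = [math.exp(z) for z in logits]    # overflows for large logits
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[label])


def fused_loss(logits, label):
    # One fused operator: loss = logsumexp(z) - z[label].
    # Subtracting the maximum keeps every exponent <= 0.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[label]


# On well-scaled inputs both versions agree.
print(abs(naive_loss([1.0, 2.0, 3.0], 2) - fused_loss([1.0, 2.0, 3.0], 2)) < 1e-12)

# On large logits only the fused version survives.
print(fused_loss([1000.0, 0.0], 0))         # 0.0
try:
    naive_loss([1000.0, 0.0], 0)
except OverflowError as err:
    print("naive version failed:", err)
```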

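Computational Graph - Operator: the details of each node's forward and backward computation, which have to be combined with Mshadow to support both GPU and CPU.

One more thing about the computation graph: MXNet's computational graph builds the forward graph and the backward graph together, but at execution time it still runs the forward graph first and then runs the backward graph in reverse topological order. The details don't really matter. Some of the implementation seems to differ slightly from the documentation, but the idea is there.

NNVM is mainly a design philosophy, similar to TensorFlow (no offense): it treats deep learning as a language translation process and decouples the constraints of the front-end language from those of the concrete back-end implementation.

I have listed some technical details below. Thanks again to the open source community and the experts for sharing so generously.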

MXNET Analysis

MXNET

Programming API
Gradient Calculation (Differentiation API)
Computational Graph Optimization and Execution
Runtime Parallel Scheduling
GPU Kernels, Optimized Device Code
Accelerators and Hardware

Programming API

MXNet exposes its C++ core through a C API, and the Python package binds to that C API with ctypes.

Gradient Calculation (Differentiation API)

Computational Graph by Christopher Olah

Key points:

Forward-Mode differentiation:

Backward-Mode differentiation: we use it for the backpropagation algorithm, because we only care about the derivative of the loss function with respect to each node.

Details: http://dlsys.cs.washington.edu/pdf/lecture4.pdf
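As a rough illustration of backward-mode differentiation on a computation graph, the sketch below records each node's inputs together with the local derivative, then sweeps once from the output back to the inputs in reverse topological order. The `Node` class and helper names are invented for this example and support only `+` and `*`.

```python
class Node:
    """A value in the graph plus the local gradients toward its inputs."""

    def __init__(self, value, inputs=()):
        self.value = value
        self.inputs = inputs    # tuple of (input_node, local_gradient)
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Node(self.value * other.value,
                    ((self, other.value), (other, self.value)))


def topo_order(output):
    """Return nodes so that every node appears after its inputs."""
    order, seen = [], set()

    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for inp, _ in node.inputs:
            visit(inp)
        order.append(node)

    visit(output)
    return order


def backward(output):
    """One reverse sweep yields d(output)/d(node) for every node."""
    output.grad = 1.0
    for node in reversed(topo_order(output)):
        for inp, local_grad in node.inputs:
            inp.grad += node.grad * local_grad


# e = (a + b) * (b + 1) with a = 2, b = 1
a, b, one = Node(2.0), Node(1.0), Node(1.0)
e = (a + b) * (b + one)
backward(e)
print(e.value, a.grad, b.grad)   # 6.0 2.0 5.0
```

A single backward sweep gives the derivative of the output with respect to every node, which is why this mode suits a scalar loss with many parameters.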

Difference between computation graph and traditional backpropagation algorithm:

Computational Graph Optimization and Execution

Memory Planning
PlaceDevice
InferShape
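To give a feel for what a shape inference pass does, here is a toy version for a simple chain of layers. Everything in it is an assumption made for illustration: the `(name, kind, attrs)` layer encoding, the `infer_shapes` function, and the two supported layer kinds. The real pass works on a general graph and knows many more operators.

```python
def infer_shapes(layers, input_shape):
    """Propagate shapes through a chain of (name, kind, attrs) layers."""
    shapes = {"data": input_shape}
    current = input_shape
    for name, kind, attrs in layers:
        if kind == "fully_connected":
            # Flatten everything except the batch dimension.
            in_dim = 1
            for d in current[1:]:
                in_dim *= d
            hidden = attrs["num_hidden"]
            shapes[name + "_weight"] = (hidden, in_dim)
            shapes[name + "_bias"] = (hidden,)
            current = (current[0], hidden)
        elif kind == "activation":
            pass  # elementwise: output shape equals input shape
        else:
            raise ValueError("unknown layer kind: " + kind)
        shapes[name + "_output"] = current
    return shapes


layers = [
    ("fc1", "fully_connected", {"num_hidden": 128}),
    ("relu1", "activation", {}),
    ("fc2", "fully_connected", {"num_hidden": 10}),
]
shapes = infer_shapes(layers, (32, 1, 28, 28))
print(shapes["fc1_weight"], shapes["fc2_weight"], shapes["fc2_output"])
# (128, 784) (10, 128) (32, 10)
```

The user supplies only the input shape and `num_hidden`; every weight, bias, and output shape follows from those.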

Execute the computational graph: https://github.com/dmlc/mxnet/blob/986b736b816018b96e9d1e2c358bb7665b80944d/src/executor/graph_executor.cc#L51

http://dlsys.cs.washington.edu/pdf/lecture7.pdf

Runtime Parallel Scheduling

http://dlsys.cs.washington.edu/pdf/lecture9.pdf

GPU Kernels, Optimized Device Code

Decouple the hardware-related optimization from the computational graph: http://dlsys.cs.washington.edu/pdf/lecture8.pdf

Accelerators and Hardware

Mshadow

General operation: https://github.com/dmlc/mshadow/tree/master/guide

Cutting-edge techniques: https://github.com/dmlc/mshadow/tree/master/guide/exp-template

https://en.wikipedia.org/wiki/Expression_templates

NNVM

http://dlsys.cs.washington.edu/pdf/lecture16.pdf

NNVM operator

https://github.com/dmlc/nnvm/blob/master/include/nnvm/op.h

Connect the front-end to the back-end

https://github.com/dmlc/mxnet/tree/master/src/c_api

https://github.com/dmlc/mxnet/blob/master/src/c_api/c_api_symbolic.cc#L93

https://github.com/dmlc/nnvm/blob/master/src/core/symbolic.cc#L509

Training with Multiple GPUs Using Model Parallelism: http://mxnet.io/how_to/model_parallel_lstm.html

Examples

https://github.com/yuruofeifei/assignment1/blob/master/autodiff_test.py

https://github.com/yuruofeifei/assignment1/blob/master/autodiff.py

MShadow

CUDA

http://developer.download.nvidia.com/CUDA/training/StreamsAndConcurrencyWebinar.pdf

Stream: A sequence of operations that execute in issue-order on the GPU

Stride: this is used to deal with pitch allocation in GPU or SSE (align x dimension to 64bit) for efficiency

Pitch memory: aligned location for GPU

Requirements: Data used by concurrent operations should be independent

#pragma unroll

https://stackoverflow.com/questions/22278631/what-does-pragma-unroll-do-exactly-does-it-affect-the-number-of-threads

cudaGetDeviceCount ( int* count ) Returns the number of devices with compute capability greater than or equal to 1.0.

__host__ cudaError_t cudaSetDevice ( int device ) Set device to be used for GPU executions.

__host__ cudaError_t cudaGetDeviceProperties ( cudaDeviceProp* prop, int device ) Returns information about the compute-device.

__host__ cudaError_t cudaMallocPitch ( void** devPtr, size_t* pitch, size_t width, size_t height ) Allocates pitched memory on the device. Allocates at least width (in bytes) * height bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The function may pad the allocation to ensure that corresponding pointers in any given row will continue to meet the alignment requirements for coalescing as the address is updated from row to row. The pitch returned in *pitch by cudaMallocPitch() is the width in bytes of the allocation. The intended usage of pitch is as a separate parameter of the allocation, used to compute addresses within the 2D array. Given the row and column of an array element of type T, the address is computed as:

T* pElement = (T*)((char*)BaseAddress + Row * pitch) + Column;
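The pitched addressing rule can be checked with plain arithmetic: element (row, column) sits `row * pitch` bytes plus `column * sizeof(T)` bytes past the base pointer. The sketch below only models that arithmetic in Python. The 512-byte alignment is an assumed value for illustration; the real pitch is whatever cudaMallocPitch returns on a given device.

```python
def round_up(nbytes, alignment):
    """Smallest multiple of `alignment` that is >= nbytes."""
    return (nbytes + alignment - 1) // alignment * alignment


def pitched_offset(row, column, pitch, elem_size):
    """Byte offset of element (row, column) in a pitched 2D allocation."""
    return row * pitch + column * elem_size


width_elems, elem_size = 100, 4      # one row holds 100 float32 values
# Assumed alignment, for illustration only.
pitch = round_up(width_elems * elem_size, 512)

print(pitch)                                    # 512, not 400
print(pitched_offset(2, 3, pitch, elem_size))   # 2 * 512 + 3 * 4 = 1036
```

Each row is padded from 400 to 512 bytes, so row starts stay aligned at the cost of some unused bytes per row.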

cudaStreamCreate: http://on-demand.gputechconf.com/gtc/2014/presentations/S4158-cuda-streams-best-practices-common-pitfalls.pdf

CuDNN has handle:

Blas has handle:

cublasHandle_t handle;
cublasCreate(&handle);

CuDNN handle:

cudnnStatus_t err = cudnnCreate(&dnn_handle_);
CHECK_EQ(err, CUDNN_STATUS_SUCCESS) << cudnnGetErrorString(err);
err = cudnnSetStream(dnn_handle_, stream_);

cudaStreamCreate

This section describes the stream management functions of the CUDA runtime application programming interface.

Doxygen format: http://www.doxygen.org

typedef typename std::conditional<std::is_floating_point<DType>::value, DType, double>::type FType;
typedef typename std::conditional<std::is_integral<DType>::value, std::uniform_int_distribution<DType>, std::uniform_real_distribution<FType>>::type GType;

std::gamma_distribution<> dist_gamma(alpha, beta);

Packet

Allocate an aligned space with num_line * lspace cells: https://github.com/dmlc/mshadow/blob/master/mshadow/packet-inl.h#L59

Reference

MXNet System Architecture

http://mxnet.io/architecture/overview.html

Deep Learning Programming Style

http://mxnet.io/architecture/program_model.html

Dependency Engine for Deep Learning

http://mxnet.io/architecture/note_engine.html

Optimizing Memory Consumption in Deep Learning

http://mxnet.io/architecture/note_memory.html

Tinyflow .. To be con

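I have been working on reducing Caffe's GPU memory usage recently and plan to read the MXNet source. Saving a spot here; hopefully I'll remember to come back and add more.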

MXNet code-reading QQ group: 103907192

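I don't recommend reading the MXNet code.

If you do, this seems to be the only resource online:
mxnet - mydear_11000's column - Blog Channel - CSDN.NET

MXNet's documentation looks rich but is not systematic at all, and the code comments look plentiful but are mostly filler.
Also, project management across all the dmlc projects is poor, really poor.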

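Whether the source is readable doesn't matter to me, since my level isn't high and I can't follow it anyway. What I want to say is: could the documentation be more detailed? Also, I hear it was written by Chinese developers, so could fellow Chinese speakers get a little bonus, namely Chinese documentation? The reason is simple: I can read some English, but it is not our mother tongue, so comprehension is much slower. In an age where time is money, that reading efficiency is really not appealing.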

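Getting into MXNet (1) - Overview and Understand Symbol API

The MXNet source is written in C++; the column analyzes the Symbol part of the source in detail. Analyses of the other APIs will be added over time.

It's written in English because my TOEFL writing score was terrible : (

Suggestions, technical questions, and English grammar corrections are all welcome.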

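Honestly, all the answers here are of little use. MXNet's entry barrier feels somewhat higher than Caffe's, and the way many modules are designed and depend on each other has two problems:
(1) Many people contributed to the implementation, so the code is less readable than Caffe's.
(2) Many of the Python layer examples follow essentially the same pattern, yet are hard to read in places.
My main question: how can an experienced Caffe user best move over and become proficient in MXNet? It turns out not to be as easy as I imagined; MXNet's implementation is considerably more complex.

Experts, please advise!!!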

Saving a spot. I'll study TF first, then tackle MXNet.

Also learning this; saving a spot for now!