[One Paper a Week][ISCA] In-Datacenter Performance Analysis of a Tensor Processing Unit

drive.google.com/file/d

This is the most complete TPU report released so far; it is going to ISCA in June. All the comparisons are against the K80, though, so how much the advantage shrinks against a P100 is unknown. The architecture also doesn't look much different from the TPU details disclosed earlier, so presumably Google already has something newer and better, which is why they are willing to share the old design.

The TPU's deterministic execution model is a better match to the 99th-percentile response-time requirement of our NN applications than are the time-varying optimizations of CPUs and GPUs (caches, out-of-order execution, multithreading, multiprocessing, prefetching, ...) that help average throughput more than guaranteed latency.

This point is already raised in the abstract: ads and search requests come with latency limits, because users get impatient if they wait too long. Running the model as fast as possible within a bounded time therefore matters more to companies like this than raw throughput.

Despite low utilization for some applications, the TPU is on average about 15X - 30X faster than its contemporary GPU or CPU, with TOPS/Watt about 30X - 80X higher. Moreover, using the GPU's GDDR5 memory in the TPU would triple achieved TOPS and raise TOPS/Watt to nearly 70X the GPU and 200X the CPU.

Faster while drawing less power.

Introduction to Neural Networks

Presumably what Google Search/Ads runs in production is MLPs, most likely the architecture from [1606.07792] Wide & Deep Learning for Recommender Systems. That would explain why the first part of the paper is all about MLPs, with CNNs/LSTMs coming second.

● As a result of latency limits, the K80 GPU is under-utilized for inference, and is just a little faster than the Haswell CPU.
● Despite having a much smaller and lower power chip, the TPU has 25 times as many MACs and 3.5 times as much on-chip memory as the K80 GPU.

Latency limits were also brought up when the new Volta was announced. MAC here stands for a multiply-accumulate unit, probably similar in spirit to the 4x4x4 matrix-multiply-accumulate (tensor core) units that Volta just introduced; a toy sketch follows below.
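For readers unfamiliar with the term, here is a minimal Python sketch (not from the paper) of what a single multiply-accumulate step and a Volta-style 4x4x4 matrix multiply-accumulate look like; all shapes and values are illustrative.

```python
import numpy as np

# A single MAC (multiply-accumulate): acc <- acc + a * b.
# The TPU's matrix unit is essentially a large grid of such units.
acc = 0.0
a, b = 3.0, 2.0
acc += a * b

# Volta-style 4x4x4 matrix multiply-accumulate: D = A @ B + C,
# where A, B, C, D are all 4x4 (purely illustrative data here).
A = np.random.rand(4, 4)
B = np.random.rand(4, 4)
C = np.random.rand(4, 4)
D = A @ B + C
print(D.shape)  # (4, 4)
```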

The performance/Watt of the TPU is 30X - 80X that of contemporary products; the revised TPU with K80 memory would be 70X - 200X better.

Anyone looking at this kind of headroom must be laughing out loud.

While most architects have been accelerating CNNs, they represent just 5% of our datacenter workload.

This sentence doesn't give the MLP and RNN percentages, though; it looks like RNNs still account for a large share.

TPU Origin, Architecture, and Implementation

Rather than be tightly integrated with a CPU, to reduce the chances of delaying deployment, the TPU was designed to be a coprocessor on the PCIe I/O bus, allowing it to plug into existing servers just as a GPU does. Moreover, to simplify hardware design and debugging, the host server sends TPU instructions for it to execute rather than fetching them itself. Hence, the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU.

The point made here, that the TPU is closer to an FPU, is probably the core reason the TPU beats the GPU in inference speed.

The matrix unit holds one 64KiB tile of weights plus one for double-buffering (to hide the 256 cycles it takes to shift a tile in). This unit is designed for dense matrices. Sparse architectural support was omitted for time-to-deploy reasons. Sparsity will have high priority in future designs. The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory (for inference, weights are read-only; 8 GiB supports many simultaneously active models). The weight FIFO is four tiles deep. The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit. A programmable DMA controller transfers data to or from CPU Host memory and the Unified Buffer.

The TPU's memory management is much easier to understand than Nvidia's GPU scheme: essentially a weight fetcher plus a unified buffer. The 8 GiB of off-chip DDR3 is indeed larger than what current GPUs carry, but since the goal is determinism rather than memory bandwidth, the whole arrangement makes a lot of sense. The figure even labels the memory bandwidth between every stage, which is a really nice touch.
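To make the double-buffering idea above concrete, here is a minimal Python sketch assuming weight tiles stream through two slots so the next tile can be loaded while the current one is in use; `load_tile` and `matmul_on_tile` are made-up placeholders, not the real hardware interface.

```python
import numpy as np

TILE = 256  # the matrix unit holds one 256x256 weight tile (64 KiB of 8-bit weights)

def load_tile(weight_dram, i):
    # Placeholder for shifting tile i in from the Weight FIFO (~256 cycles in hardware).
    return weight_dram[i]

def matmul_on_tile(activations, tile):
    # Placeholder for the matrix unit consuming one weight tile.
    return activations @ tile

# Illustrative Weight Memory: a list of 256x256 tiles for one layer.
weight_dram = [np.random.rand(TILE, TILE).astype(np.float32) for _ in range(4)]
activations = np.random.rand(8, TILE).astype(np.float32)  # B = 8 input rows

# Two-slot (double) buffering: while the matrix unit computes on `current`,
# the next tile is being shifted in.  This sequential sketch only shows the
# buffer rotation; in hardware the load and the matmul overlap in time.
current = load_tile(weight_dram, 0)
outputs = []
for i in range(len(weight_dram)):
    nxt = load_tile(weight_dram, i + 1) if i + 1 < len(weight_dram) else None
    outputs.append(matmul_on_tile(activations, current))
    current = nxt
```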

As instructions are sent over the relatively slow PCIe bus, TPU instructions follow the CISC tradition, including a repeat field. The average clock cycles per instruction (CPI) of these CISC instructions is typically 10 to 20. It has about a dozen instructions overall, but these five are the key ones:
1. Read_Host_Memory reads data from the CPU host memory into the Unified Buffer (UB).
2. Read_Weights reads weights from Weight Memory into the Weight FIFO as input to the Matrix Unit.
3. MatrixMultiply/Convolve causes the Matrix Unit to perform a matrix multiply or a convolution from the Unified Buffer into the Accumulators. A matrix operation takes a variable-sized B*256 input, multiplies it by a 256x256 constant weight input, and produces a B*256 output, taking B pipelined cycles to complete.
4. Activate performs the nonlinear function of the artificial neuron, with options for ReLU, Sigmoid, and so on. Its inputs are the Accumulators, and its output is the Unified Buffer. It can also perform the pooling operations needed for convolutions using the dedicated hardware on the die, as it is connected to nonlinear function logic.
5. Write_Host_Memory writes data from the Unified Buffer into the CPU host memory.

This spells out that the matrix multiply is essentially a B*256 input against a 256x256 weight tile, so the constraints are fairly tight; models much smaller than that probably won't benefit much (see the sketch below).
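As a sketch of that shape constraint (the layer sizes here are made up, and the zero-padding scheme is my assumption rather than a detail from the paper): a layer much narrower than 256 still occupies a full 256x256 weight tile, so the matrix unit does the same amount of work either way.

```python
import numpy as np

B, TILE = 32, 256          # batch of 32, 256-wide matrix unit
in_dim, out_dim = 100, 80  # a small layer, well under the tile size

x = np.random.rand(B, in_dim).astype(np.float32)
w = np.random.rand(in_dim, out_dim).astype(np.float32)

# Pad activations and weights up to the 256x256 tile the matrix unit expects.
x_pad = np.zeros((B, TILE), dtype=np.float32)
x_pad[:, :in_dim] = x
w_pad = np.zeros((TILE, TILE), dtype=np.float32)
w_pad[:in_dim, :out_dim] = w

y_pad = x_pad @ w_pad   # the B*256 times 256x256 operation, B pipelined cycles
y = y_pad[:, :out_dim]  # only a 100x80 slice of the tile was actually useful

np.testing.assert_allclose(y, x @ w, rtol=1e-4, atol=1e-5)
```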

The philosophy of the TPU microarchitecture is to keep the matrix unit busy. It uses a 4-stage pipeline for these CISC instructions, where each instruction executes in a separate stage. The plan was to hide the execution of the other instructions by overlapping their execution with the MatrixMultiply instruction. Toward that end, the Read_Weights instruction follows the decoupled-access/execute philosophy [Smi82], in that it can complete after sending its address but before the weight is fetched from Weight Memory. The matrix unit will stall if the input activation or weight data is not ready.

Once again the emphasis is on how the TPU optimizes data movement; a toy sketch of the decoupled-access/execute style follows.
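Here is a minimal Python sketch of the decoupled-access/execute idea, using a thread pool as a stand-in for the in-flight weight fetch; `read_weights` and its delay are invented for illustration, not part of the TPU's actual interface.

```python
import concurrent.futures
import time

def read_weights(tile_id):
    # Placeholder for the Weight Memory fetch; returns the tile after some delay.
    time.sleep(0.01)
    return f"tile-{tile_id}"

# Decoupled access/execute: issue the weight fetch early (the instruction
# "completes" once the address is sent), let other instructions overlap with
# it, and only stall when the matrix unit actually needs the data.
with concurrent.futures.ThreadPoolExecutor() as pool:
    pending = pool.submit(read_weights, 0)  # address sent, fetch in flight
    # ... other work overlaps here (e.g. Read_Host_Memory) ...
    tile = pending.result()                 # the consumer stalls here if not ready
print(tile)
```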

The next long stretch covers exactly how much performance separates the TPU/GPU/CPU, where the roofline model comes from, and how far each chip falls short of its roofline. I'm skipping all of that; read it yourself if you're interested.

Cost-Performance, TCO, and Performance/Watt

On energy, the TPU is about 17x better than Haswell and 14x better than the K80. This section mainly explains why Google built this ASIC in the first place: it is no longer a question of money, but of power.

Energy Proportionality

This section mainly explains that the TPU's architecture does not save much power at low workloads.

Evaluation of Alternative TPU Designs

Because the TPU architecture is simple, many of its components can be swapped out, so the TPU team tried changing each one in turn to see what effect it has.

The main takeaway: increasing memory bandwidth is the most effective lever for TPU performance. In the same spirit, Nvidia's newly announced NVLink is also mainly about increasing bandwidth. Another interesting result is that a 512*512 array is not much faster than the 256*256 one, because not every network's dimensions are a multiple of the tile size; after padding, the 512*512 unit can end up doing more work, not less (the worked example below makes this concrete).
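A small worked example of that padding effect, using a hypothetical 600-wide layer that is not from the paper: on a 256-wide unit it needs ceil(600/256) = 3 tiles (768 padded columns), while on a 512-wide unit it needs ceil(600/512) = 2 tiles (1024 padded columns), so the larger array wastes more of its width.

```python
import math

def padded_width(layer_width, tile):
    # Columns actually processed once the layer is padded to whole tiles.
    return math.ceil(layer_width / tile) * tile

layer = 600  # hypothetical layer width, purely for illustration
for tile in (256, 512):
    w = padded_width(layer, tile)
    print(f"tile {tile}: {w} padded columns, {layer / w:.0%} useful")
# tile 256: 768 padded columns, 78% useful
# tile 512: 1024 padded columns, 59% useful
```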

Discussion

At this point the authors switch to more free-form remarks.

- For NNs in the datacenter, the tighter speed requirement is response time, not throughput. In other words, slightly slower training is fine; just spend more money on GPUs. Response time is where the real optimization effort has to go.

- The K80 is not much better than the CPU on response time. Volta addresses this; let's see how much slower than the TPU Volta ends up being.

- The current TPU architecture does little to optimize CNNs. Fair enough: CNNs don't have much to do with making money, so the TPU naturally doesn't bother.

- The K80 barely gets faster even with boost mode on (is there a grudge here? is it really okay to dunk on it like this?).

- Google previously tried optimizing software on existing architectures (AVX) as well, but no matter what, it ran slower than the TPU.

- The TPU has many NN-related performance counters, so it would be easy for them to help researchers optimize for speed.

The paper also cites plenty of related work; worth a look if you're interested.

Overall the paper is pretty aggressive about picking fights, but Google's own people have to talk this way: building your own ASIC costs at least a few hundred million dollars, so a small company would simply buy Nvidia GPUs. For all the talk about speed, what Google probably cares about most is energy and response time, although at low workloads the power draw isn't actually that low either.

