Does the Architecture Debate over AI Chips Really Matter?

02-05

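In all my years of working on chips, I have never seen anything attract as much attention as today's AI chips. At the start of the year Google published details of the TPU, and Nvidia immediately responded with a dismissive post. With GTC 2017 and the announcement of TPU2, the argument over whether the GPU or the TPU is better has flared up again. Is arguing about AI chip architecture really that meaningful?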

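Let's start with a figure from Prof. Shaaban's computer architecture course at RIT. In it he classifies GPUs and DSPs as application-specific processors (ASPs). The TPU would sit roughly in the configurable hardware/co-processor category.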

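In fact, the architectures people argue about so fiercely today (CPU, GPU, DSP, TPU, FPGA) are nothing new; all of them have been studied in computer architecture for many years. Each was usually invented for a specific purpose (the one with no specific purpose is the GPP, or what we usually call the CPU, which must be able to run any program). Comparisons among them should be made in an appropriate coordinate system, for example programmability/flexibility versus performance. Their positions in that space make it clear that every architecture is the result of trade-offs: whatever you gain, you pay for somewhere else.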

So for AI applications, can we say that the TPU is better than the GPU, or the other way around? Not really.

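Now look at another figure, from the 2014 report "Strategies in Optimizing Market Positions for Semiconductor Vendors Based on IP Leverage" by International Business Strategies, Inc. (IBS). It shows how design cost (excluding manufacturing cost) and its breakdown change in mainstream chip design as process nodes advance. From it we can read the share of investment that each task takes in a chip project. Clearly, the largest investments in a chip project today go to software, verification, and validation, while architecture design accounts for only a small part.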

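So when we design an AI chip, what ultimately decides success or failure? Is it merely that you chose an Nvidia-GPU-style hardware architecture or a Google-TPU-style systolic array? Or that you invented a new architecture? Of course not. The choice of architecture should serve the success of the whole chip project, and it is the result of trading off many factors. Is Nvidia's huge success in deep learning due to its underlying hardware architecture, or to its mature hardware and software ecosystem? The latter, of course.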

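Google's chip design capability falls far short of Nvidia's, which is why it chose to start with inference, the comparatively simple task. The Google TPU paper also states explicitly that the project schedule was tight and that many optimizations had to be dropped (see the post "Google TPU Revealed").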

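On the other hand, is the barrier to architecture design really that high? As I said, computer architecture is a mature field; innovation is hard, and building something nobody else can build is basically impossible. Nvidia's latest GPU adds Tensor Cores (see the post "Nvidia Volta: Architecture Highlights"), and its Xavier SoC for autonomous driving includes a dedicated hardware accelerator, the DLA (Deep Learning Accelerator; see the post "Starting from Nvidia's Open-Source Deep Learning Accelerator"). To support training as well (the first-generation TPU supported only inference), Google's TPU2 adds floating-point support. The details are not yet public, but one can guess that its architecture has changed considerably from the simple systolic array of the first-generation TPU (see the post "Systolic Arrays: Reborn Thanks to Google TPU"). Evidently, while the war of words goes on, both sides are also borrowing each other's strengths and putting them into practice quickly.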
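For readers unfamiliar with the "simple systolic array" mentioned above, here is a minimal Python sketch of the general idea: a grid of processing elements (PEs) that only talk to their neighbours, with operands pumped in from the edges one step per cycle. It is an assumption-laden illustration, not a model of the TPU's actual datapath: it uses an output-stationary arrangement (each PE keeps one output element) chosen for simplicity, and the function name `systolic_matmul` is hypothetical.

```python
def systolic_matmul(A, B):
    """Multiply A (M x K) by B (K x N) on a simulated M x N systolic array.

    Output-stationary sketch: PE (i, j) accumulates C[i][j]. Rows of A enter
    from the left edge, columns of B from the top edge, each skewed in time
    so that matching operands meet in the right PE.
    """
    M, K, N = len(A), len(B), len(B[0])
    acc = [[0] * N for _ in range(M)]    # one accumulator per PE
    a_reg = [[0] * N for _ in range(M)]  # operands travelling left -> right
    b_reg = [[0] * N for _ in range(M)]  # operands travelling top -> bottom
    cycles = M + N + K - 2               # time for the last operands to reach the far corner

    for t in range(cycles):
        # Shift: every PE takes the operand its left / upper neighbour held last cycle.
        for i in range(M):
            for j in range(N - 1, 0, -1):
                a_reg[i][j] = a_reg[i][j - 1]
        for j in range(N):
            for i in range(M - 1, 0, -1):
                b_reg[i][j] = b_reg[i - 1][j]
        # Inject skewed inputs at the edges: row i is delayed i cycles, column j is delayed j cycles.
        for i in range(M):
            k = t - i
            a_reg[i][0] = A[i][k] if 0 <= k < K else 0
        for j in range(N):
            k = t - j
            b_reg[0][j] = B[k][j] if 0 <= k < K else 0
        # Compute: every PE performs one multiply-accumulate on its local operands.
        for i in range(M):
            for j in range(N):
                acc[i][j] += a_reg[i][j] * b_reg[i][j]

    return acc, cycles


C, cycles = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(C, cycles)  # [[19, 22], [43, 50]] 4
```

The point of the sketch is how little control logic is involved: no PE fetches from memory or decodes instructions, which is what makes the structure dense and efficient for matrix math, and also what makes it inflexible for anything else.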

The article "Under The Hood Of Google's TPU2 Machine Learning Clusters" analyzes the currently available information about Google's TPU2 in great depth. It includes this passage:

“This tight coupling of TPU2 accelerators to processors is much different than the 4:1 to 6:1 ratios typical for GPU accelerators in deep learning training tasks. The low 2:1 ratio suggests that Google kept the design philosophy used in the original TPU: “the TPU is closer in spirit to an FPU (floating-point unit) coprocessor than it is to a GPU.” The processor is still doing a lot of work in Google’s TPU2 architecture, but it is offloading all its matrix math to the TPU2.”

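In other words, although the TPU2 is designed specifically for deep learning, it relies more heavily than a GPU does on a general-purpose processor such as a CPU. This is one of Google's trade-offs.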

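The article also shows that Google's experience in the data center lets it compensate for its weaker chip design capability with extensive board-level and system-level design optimization.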

It concludes:

“There is not enough information yet about Google’s TPU2 stamp behavior to reliably compare it to merchant accelerator products like Nvidia’s new “Volta” generation. The architectures are simply too different to compare without benchmarking both architectures on the same task. Comparing peak FP16 performance is like comparing the performance of two PCs with different processor, memory, storage, and graphics options based solely on the frequency of the processor.
That said, we believe the real contest is not at the chip level. The challenge is scaling out compute accelerators to exascale proportions. Nvidia is taking its first steps with NVLink and pursuing greater accelerator independence from the processor. Nvidia is growing its software infrastructure and workload base up from single GPUs to clusters of GPUs.
Google chose to scale out its original TPU as a coprocessor directly linked to a processor. The TPU2 can also scale out as a direct 2:1 accelerator for processors. However, the TPU2 hyper-mesh programming model doesn’t appear to have a workload that can scale well. Yet. Google is looking for third-party help to find workloads that scale with TPU2 architecture.”
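The quoted warning about comparing peak FP16 numbers can be made concrete with a roofline-style estimate, in which attainable throughput is the smaller of the compute peak and memory bandwidth times arithmetic intensity. The sketch below is only an illustration under stated assumptions: the two chips, their numbers, and the names `attainable_tflops`, `CHIPS`, and `rank` are all hypothetical and are not TPU2 or Volta specifications.

```python
def attainable_tflops(peak_tflops, mem_bw_tb_per_s, flops_per_byte):
    """Roofline bound: throughput is capped by compute or by memory traffic, whichever is lower.

    TB/s multiplied by FLOPs/byte gives TFLOP/s, so both arguments of min() share a unit.
    """
    return min(peak_tflops, mem_bw_tb_per_s * flops_per_byte)


# Hypothetical accelerators (illustrative numbers only, not real product specs).
CHIPS = {
    "chip_x": {"peak_tflops": 100.0, "mem_bw_tb_per_s": 0.5},  # higher peak, less bandwidth
    "chip_y": {"peak_tflops": 50.0, "mem_bw_tb_per_s": 1.0},   # lower peak, more bandwidth
}


def rank(flops_per_byte):
    """Return chip names ordered from fastest to slowest at a given arithmetic intensity."""
    return sorted(
        CHIPS,
        key=lambda name: attainable_tflops(flops_per_byte=flops_per_byte, **CHIPS[name]),
        reverse=True,
    )


print(rank(20))   # memory-bound workload: ['chip_y', 'chip_x']
print(rank(400))  # compute-bound workload: ['chip_x', 'chip_y']
```

With these made-up numbers the chip with twice the peak throughput loses on the memory-bound workload, so the ranking depends on the task, which is why the article asks for benchmarks of both architectures on the same workload.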

Alongside the TPU2 unveiling, Google announced the TensorFlow Research Cloud (TFRC). For growing TPU2's applications and ecosystem, this is the most critical piece.

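For an AI chip project, the whole hardware and software ecosystem matters far more than the design of the underlying hardware architecture. In the end, what wins is giving users a solution that is easy to use.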

T.S.

Follow my WeChat official account: StarryHeavensAbove