ML + System = ?

I previously posted a figure showing the virtuous cycle between chips and machine learning. Recently, with the first SysML Conference and Google's attempts to apply machine learning to computer system design, we can see the interaction between ML and systems growing ever closer. In the future, ML should form an even broader virtuous cycle with the entire computer system, not just chips, but hardware and software at every level.


The SysML conference was initiated by Google, Amazon, and Facebook, and this year was its first edition. Jeff Dean was one of the main organizers, and the core of his keynote was the combination of ML and systems. Most of the earlier content was similar to his previous talks, so here we focus on the new material. Let's start with his conclusion.

This is much like the virtuous cycle between chips and ML described above. First, dedicated ML hardware is still in its "infancy"; as faster systems appear and are deployed more widely, we can expect breakthroughs in more fields. At the same time, bringing learning mechanisms into the core of computing systems lets those systems be optimized better. We have already discussed the former at length, so here we focus on the latter: introducing "learning" into the optimization of computing systems.

Jeff Dean is very optimistic about this. In his view, "anywhere we are using heuristics to make a decision is a good candidate for applying machine learning."

A heuristic technique (/hjʊəˈrɪstɪk/; Ancient Greek: εὑρίσκω, "find" or "discover"), often called simply a heuristic, is any approach to problem solving, learning, or discovery that employs a practical method not guaranteed to be optimal or perfect, but sufficient for the immediate goals. Where finding an optimal solution is impossible or impractical, heuristic methods can be used to speed up the process of finding a satisfactory solution. Heuristics can be mental shortcuts that ease the cognitive load of making a decision. - Wikipedia

There are many concrete examples, touching every part of computer systems. The examples he gave include compilers, network optimization, operating system design, job scheduling systems, and even ASIC design. To succeed in these applications, two key points matter:

The first point is about finding a metric that can be expressed numerically; for reinforcement learning this means a clear, accurate reward. The second point, for reinforcement learning, is whether an accurate environment can be obtained; for supervised learning it is whether training and test data can be obtained conveniently. If it is not obvious why these two points determine whether RL is feasible at all, see "Deep Reinforcement Learning Doesn't Work Yet", the widely shared post by a Google Brain engineer. The good news is that for optimizing computing systems these two requirements seem relatively easy to satisfy. For example, when optimizing device placement, runtime is a very clear reward, and it can be obtained simply by running the computation on the actual system. This is also why that post specifically singles out Google's device placement work as a relative success: "I know there's some neat work optimizing device placement for large Tensorflow graphs (Mirhoseini et al, ICML 2017)."

So far, Google has made several concrete attempts in this direction, with results reported in the papers below. Jeff Dean's keynote covered the first two of them.

"The Case for Learned Index Structures" (arxiv.org/abs/1712.0120)

Indexes are models: a B-Tree-Index can be seen as a model to map a key to the position of a record within a sorted array, a Hash-Index as a model to map a key to a position of a record within an unsorted array, and a BitMap-Index as a model to indicate if a data record exists or not. In this exploratory research paper, we start from this premise and posit that all existing index structures can be replaced with other types of models, including deep-learning models, which we term learned indexes. The key idea is that a model can learn the sort order or structure of lookup keys and use this signal to effectively predict the position or existence of records. We theoretically analyze under which conditions learned indexes outperform traditional index structures and describe the main challenges in designing learned index structures. Our initial results show, that by using neural nets we are able to outperform cache-optimized B-Trees by up to 70% in speed while saving an order-of-magnitude in memory over several real-world data sets. More importantly though, we believe that the idea of replacing core components of a data management system through learned models has far reaching implications for future systems designs and that this work just provides a glimpse of what might be possible.
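To make the "index as model" idea concrete, here is a minimal sketch under simplifying assumptions: fit a simple model to the key-to-position mapping of a sorted array, remember the worst-case prediction error seen during fitting, and confine the final lookup to that window. The `LearnedIndex` class and the linear fit below are my own illustrative stand-ins, not the paper's actual (neural-network) implementation.

```python
import bisect
import numpy as np

class LearnedIndex:
    """Toy learned index: predict a key's position in a sorted array,
    then fall back to a bounded local search. Illustrative only."""

    def __init__(self, keys):
        self.keys = np.sort(np.asarray(keys, dtype=np.float64))
        positions = np.arange(len(self.keys))
        # Model the cumulative distribution of the keys; a linear fit
        # stands in for the learned model described in the paper.
        self.slope, self.intercept = np.polyfit(self.keys, positions, 1)
        preds = self.slope * self.keys + self.intercept
        # Remember the worst-case prediction error to bound the search window.
        self.max_err = int(np.ceil(np.max(np.abs(preds - positions))))

    def lookup(self, key):
        pred = int(self.slope * key + self.intercept)
        lo = max(pred - self.max_err, 0)
        hi = min(pred + self.max_err + 1, len(self.keys))
        # A bounded binary search replaces a full B-Tree traversal.
        i = bisect.bisect_left(self.keys, key, lo, hi)
        if i < len(self.keys) and self.keys[i] == key:
            return i
        return None

# Usage: keys drawn from a roughly uniform distribution.
idx = LearnedIndex(np.random.uniform(0, 1e6, 100_000))
print(idx.lookup(idx.keys[1234]))  # -> 1234
```

The error bound is what keeps the scheme correct even when the model is imperfect: the model only narrows the search range, it never decides membership on its own.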

"Device Placement Optimization with Reinforcement Learning" (arxiv.org/abs/1706.0497)

The past few years have witnessed a growth in size and computational requirements for training and inference with neural networks. Currently, a common approach to address these requirements is to use a heterogeneous distributed environment with a mixture of hardware devices such as CPUs and GPUs. Importantly, the decision of placing parts of the neural models on devices is often made by human experts based on simple heuristics and intuitions. In this paper, we propose a method which learns to optimize device placement for TensorFlow computational graphs. Key to our method is the use of a sequence-to-sequence model to predict which subsets of operations in a TensorFlow graph should run on which of the available devices. The execution time of the predicted placements is then used as the reward signal to optimize the parameters of the sequence-to-sequence model. Our main result is that on Inception-V3 for ImageNet classification, and on RNN LSTM, for language modeling and neural machine translation, our model finds non-trivial device placements that outperform hand-crafted heuristics and traditional algorithmic methods.
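This abstract also illustrates the two key points nicely: the reward is simply the measured execution time, and the environment is the real system running the graph. Below is a heavily simplified sketch of that loop; the random-search "policy" and the toy `measure_runtime` cost model are placeholders I made up to keep the example self-contained, standing in for the paper's sequence-to-sequence policy and real TensorFlow runs.

```python
import random

DEVICES = ["/gpu:0", "/gpu:1", "/cpu:0"]     # hypothetical device set
OPS = ["conv1", "conv2", "lstm", "softmax"]  # hypothetical graph ops

def measure_runtime(placement):
    """Placeholder for executing the graph with this placement and timing it;
    in the paper this is an actual TensorFlow run on real hardware."""
    # Toy cost model: pretend spreading ops across devices is faster.
    devices_used = len(set(placement.values()))
    return 1.0 / devices_used + random.uniform(0.0, 0.05)

def sample_placement():
    """Stand-in for the learned policy: here, just a random proposal."""
    return {op: random.choice(DEVICES) for op in OPS}

best, best_runtime = None, float("inf")
for step in range(200):
    placement = sample_placement()
    runtime = measure_runtime(placement)  # environment feedback
    reward = -runtime                     # clear numeric reward signal
    # A real implementation would update the policy with this reward
    # (e.g., REINFORCE); here we simply keep the best placement seen.
    if runtime < best_runtime:
        best, best_runtime = placement, runtime

print(best, best_runtime)
```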

The third is very recent work. Treating address prediction in prefetching as a "next-word or character prediction" problem from natural language processing is quite illuminating.

"Learning Memory Access Patterns" (arxiv.org/abs/1803.0232)

The explosion in workload complexity and the recent slow-down in Moores law scaling call for new approaches towards efficient computing. Researchers are now beginning to use recent advances in machine learning in software optimizations, augmenting or replacing traditional heuristics and data structures. However, the space of machine learning for computer hardware architecture is only lightly explored. In this paper, we demonstrate the potential of deep learning to address the von Neumann bottleneck of memory performance. We focus on the critical problem of learning memory access patterns, with the goal of constructing accurate and efficient memory prefetchers. We relate contemporary prefetching strategies to n-gram models in natural language processing, and show how recurrent neural networks can serve as a drop-in replacement. On a suite of challenging benchmark datasets, we find that neural networks consistently demonstrate superior performance in terms of precision and recall. This work represents the first step towards practical neural-network based prefetching, and opens a wide range of exciting directions for machine learning in computer architecture research.
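The abstract explicitly relates existing prefetchers to n-gram models, so a tiny n-gram-style sketch is a reasonable way to illustrate the framing: treat address deltas as tokens and predict the most likely next delta. The `BigramPrefetcher` below is my own toy illustration; the paper replaces this kind of frequency table with recurrent neural networks.

```python
from collections import Counter, defaultdict

def address_deltas(addresses):
    """Convert a raw address trace into deltas, the 'tokens' of the model."""
    return [b - a for a, b in zip(addresses, addresses[1:])]

class BigramPrefetcher:
    """Toy bigram model over address deltas: given the last delta, predict
    the most frequent next delta, analogous to next-word prediction."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, addresses):
        deltas = address_deltas(addresses)
        for prev, nxt in zip(deltas, deltas[1:]):
            self.counts[prev][nxt] += 1

    def prefetch(self, last_addr, last_delta):
        if not self.counts[last_delta]:
            return None
        predicted_delta, _ = self.counts[last_delta].most_common(1)[0]
        return last_addr + predicted_delta

# Usage on a synthetic trace that alternates stride 8 and stride 64.
trace, addr = [], 0
for i in range(1000):
    addr += 8 if i % 2 == 0 else 64
    trace.append(addr)

p = BigramPrefetcher()
p.train(trace)
print(p.prefetch(last_addr=trace[-1], last_delta=trace[-1] - trace[-2]))
```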


Beyond Google's work, there were many other interesting topics at this SysML conference. Since it combines ML and systems, the range of topics was very broad.

Hardware accelerators:

There was a talk by Vivienne Sze of the Eyeriss team, "Understanding the Limitations of Current Energy-Efficient Design Approaches for Deep Neural Networks". Their "Tutorial on Hardware Architectures for Deep Neural Networks" remains the best survey of deep-neural-network hardware to date. She also gave a preview of Eyeriss v2.

There were also "Efficient Deep Learning Inference on Edge Devices", "Stitch-X: An Accelerator Architecture for Exploiting Unstructured Sparsity in Deep Neural Networks", and "Mobile Machine Learning Hardware at ARM: A Systems-on-Chip (SoC) Perspective".

Optimizing models for particular systems:

"Towards Optimal Winograd Convolution on Manycores", "Blink: A fast NVLink-based collective communication library", "On Scale-out Deep Learning Training for Cloud and HPC"

Systems optimization for ML workloads, and ML applied to systems optimization:

"Learning Graph-based Cluster Scheduling Algorithms", "Representation Learning for Resource Usage Prediction", "Better Caching with Machine Learned Advice", "Towards Interactive Curation & Automatic Tuning of ML Pipelines", "SLAQ: Quality-Driven Scheduling for Distributed Machine Learning", "Distributed Shared Memory for Machine Learning", "Learning Network Size While Training with ShrinkNets"

Benchmarks:

"DAWNBench: An End-to-End Deep Learning Benchmark and Competition", "DNN-Train: Benchmarking and Analyzing DNN Training"


Finally, there is an interesting paper called "In-network Neural Networks". Its basic idea is to use the programmable compute resources already present in today's network equipment to run neural-network applications. This is similar to the idea of accelerating AI applications directly inside network equipment that I mentioned in my earlier article "AI晶元開年". In addition, at the recent MWC, Nokia announced "ReefShark", its 5G base-station chipset, emphasizing its AI compute capability and claiming it will turn operators' networks into the largest AI computing platform. Given the trend toward processing large amounts of data locally, all kinds of nodes along the path from edge devices to the cloud are likely to gain more and more AI processing capability, handling data as close to where it is generated as possible.


You can tell from the conference video that when Jeff Dean said "anywhere we use heuristics is a good place to apply machine learning: compilers, networking, operating systems, even chip design...", his mind must have been rapidly flashing through the possibilities. As he put it, "There are many opportunities for this".

- END-

Welcome to follow my WeChat public account: StarryHeavensAbove

The header image is from the internet; copyright belongs to the original author.

