TensorFlow Blog Translation: TensorFlow 0.8 Released
Original post:
Announcing TensorFlow 0.8 – now with distributed computing support!
Wednesday, April 13, 2016
Posted by Derek Murray, Software Engineer
Google uses machine learning across a wide range of its products. In order to continually improve our models, it's crucial that the training process be as fast as possible. One way to do this is to run TensorFlow across hundreds of machines, which shortens the training process for some models from weeks to hours, and allows us to experiment with models of increasing size and sophistication. Ever since we released TensorFlow as an open-source project, distributed training support has been one of the most requested features. Now the wait is over.
Today, we're excited to release TensorFlow 0.8 with distributed computing support, including everything you need to train distributed models on your own infrastructure. Distributed TensorFlow is powered by the high-performance gRPC library, which supports training on hundreds of machines in parallel. It complements our recent announcement of Google Cloud Machine Learning, which enables you to train and serve your TensorFlow models using the power of the Google Cloud Platform.
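As a rough illustration of what this looks like in code, the sketch below describes a cluster with tf.train.ClusterSpec and starts a gRPC server for one task with tf.train.Server. The host names, job names, and task indices are made up for the example; they are not part of the announcement.

```python
# Minimal sketch of a distributed TensorFlow setup (hypothetical host names).
# Every process in the cluster runs a script like this with its own
# job_name/task_index and joins the same gRPC cluster.
import tensorflow as tf

# Describe the cluster: one parameter-server task and two worker tasks.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Start this process's gRPC server; a "ps" process would just call server.join().
server = tf.train.Server(cluster, job_name="worker", task_index=0)

# Pin shared state to the parameter server and computation to this worker.
with tf.device("/job:ps/task:0"):
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")

with tf.device("/job:worker/task:0"):
    images = tf.placeholder(tf.float32, shape=[None, 784])
    logits = tf.matmul(images, weights)

# Run the graph against the in-process server (TF 0.8-era initializer name).
with tf.Session(server.target) as sess:
    sess.run(tf.initialize_all_variables())
```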
To coincide with the TensorFlow 0.8 release, we have published a distributed trainer for the Inception image classification neural network in the TensorFlow models repository. Using the distributed trainer, we trained the Inception network to 78% accuracy in less than 65 hours using 100 GPUs. Even small clusters—or a couple of machines under your desk—can benefit from distributed TensorFlow, since adding more GPUs improves the overall throughput, and produces accurate results sooner.
TensorFlow can speed up Inception training by a factor of 56, using 100 GPUs.
The distributed trainer also enables you to scale out training using a cluster management system like Kubernetes. Furthermore, once you have trained your model, you can deploy to production and speed up inference using TensorFlow Serving on Kubernetes.
Beyond distributed Inception, the 0.8 release includes new libraries for defining your own distributed models. TensorFlow's distributed architecture permits a great deal of flexibility in defining your model, because every process in the cluster can perform general-purpose computation. Our previous system, DistBelief (like many systems that have followed it), used special "parameter servers" to manage the shared model parameters, where the parameter servers had a simple read/write interface for fetching and updating shared parameters. In TensorFlow, all computation—including parameter management—is represented in the dataflow graph, and the system maps the graph onto heterogeneous devices (like multi-core CPUs, general-purpose GPUs, and mobile processors) in the available processes. To make TensorFlow easier to use, we have included Python libraries that make it easy to write a model that runs on a single process and scales to use multiple replicas for training.
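The sketch below is a hedged illustration of that "write once, scale out" pattern, using tf.train.replica_device_setter to place variables on the parameter-server job automatically. The cluster addresses and the tiny model are invented for the example, and the helper's exact arguments may vary slightly between releases.

```python
# Sketch: the same model code works in one process or across many replicas,
# because replica_device_setter assigns variables to /job:ps and leaves the
# remaining ops on the worker that builds the graph.
import tensorflow as tf

cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Shared parameters: placed on the parameter-server tasks.
    weights = tf.Variable(tf.zeros([784, 10]), name="weights")
    bias = tf.Variable(tf.zeros([10]), name="bias")

    # Per-worker computation: stays on the local worker.
    images = tf.placeholder(tf.float32, shape=[None, 784])
    labels = tf.placeholder(tf.float32, shape=[None, 10])
    predictions = tf.matmul(images, weights) + bias
    loss = tf.reduce_mean(tf.square(predictions - labels))
    train_op = tf.train.GradientDescentOptimizer(0.5).minimize(loss)
```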
This architecture makes it easier to scale a single-process job up to use a cluster, and also to experiment with novel architectures for distributed training. As an example, my colleagues have recently shown that synchronous SGD with backup workers, implemented in the TensorFlow graph, achieves improved time-to-accuracy for image model training.
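A rough sketch of the backup-worker idea is below. Note that tf.train.SyncReplicasOptimizer is an assumption on my part (it appeared in TensorFlow releases after 0.8, and its constructor arguments have changed over time), so treat this as an illustration of the technique rather than the exact implementation used in that work.

```python
# Sketch of synchronous SGD with backup workers: aggregate gradients from 4
# replicas per step while running 5 workers in total, so a single straggler
# cannot stall the synchronous update. SyncReplicasOptimizer is assumed here,
# not part of TensorFlow 0.8.
import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name="global_step")
weight = tf.Variable(3.0)          # toy parameter
loss = tf.square(weight - 1.0)     # toy loss

base_opt = tf.train.GradientDescentOptimizer(0.5)
sync_opt = tf.train.SyncReplicasOptimizer(
    base_opt,
    replicas_to_aggregate=4,   # gradients required before applying an update
    total_num_replicas=5)      # the extra worker acts as a backup

train_op = sync_opt.minimize(loss, global_step=global_step)
```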
The current version of distributed computing support in TensorFlow is just the start. We are continuing to research ways of improving the performance of distributed training—both through engineering and algorithmic improvements—and will share these improvements with the community on GitHub. However, getting to this point would not have been possible without help from the following people:
- TensorFlow training libraries - Jianmin Chen, Matthieu Devin, Sherry Moore and Sergio Guadarrama
- TensorFlow core - Zhifeng Chen, Manjunath Kudlur and Vijay Vasudevan
- Testing - Shanqing Cai
- Inception model architecture - Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Jonathon Shlens and Zbigniew Wojna
- Project management - Amy McDonald Sandjideh
- Engineering leadership - Jeff Dean and Rajat Monga