Distributed TensorFlow Guide

Project repository: tmulc18/Distributed-TensorFlow-Guide

This guide is a collection of distributed training examples (that can act as boilerplate code) and a tutorial of basic distributed TensorFlow. Many of the examples focus on implementing well-known distributed training schemes, such as those available in Distributed Keras, which were discussed in the author's blog post.

Almost all the examples can be run on a single machine with a CPU, and all the examples only use data-parallelism (i.e. between-graph replication).
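For readers new to between-graph replication, the idea is that every worker process builds its own copy of the graph while a parameter-server job hosts the shared variables. Below is a minimal sketch using the classic TF 1.x distributed API; the cluster addresses, task index, and toy model are illustrative assumptions, not code taken from the guide itself.

```python
# Minimal between-graph replication sketch (classic TF 1.x distributed API).
# Cluster layout, task index, and the toy model are assumptions for illustration.
import tensorflow as tf

# Every process receives the same cluster description but a different job/task.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],                       # parameter server(s)
    "worker": ["localhost:2223", "localhost:2224"]  # one worker per replica
})

job_name, task_index = "worker", 0  # normally parsed from flags or environment
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

if job_name == "ps":
    server.join()  # parameter servers only host variables
else:
    # Each worker builds its own graph copy (between-graph replication);
    # replica_device_setter pins variables to the ps job and ops to this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index,
            cluster=cluster)):
        x = tf.placeholder(tf.float32, [None, 1])
        y = tf.placeholder(tf.float32, [None, 1])
        w = tf.get_variable("w", [1, 1])
        b = tf.get_variable("b", [1])
        loss = tf.reduce_mean(tf.square(tf.matmul(x, w) + b - y))
        global_step = tf.train.get_or_create_global_step()
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(
            loss, global_step=global_step)

    # MonitoredTrainingSession handles initialization and recovery; task 0 is chief.
    hooks = [tf.train.StopAtStepHook(last_step=1000)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})
```

Each process is launched separately with its own job_name and task_index, typically passed in via command-line flags.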

The motivation for this guide stems from the current state of distributed deep learning. Deep learning papers typically demonstrate successful new architectures on some benchmark, but rarely show how these models can be trained with 1000x the data, which is usually the requirement in industry. Furthermore, most successful distributed cases use state-of-the-art hardware to brute-force massive effective minibatches in a synchronous fashion across high-bandwidth networks; there has been little research showing the potential of asynchronous training (which is why this guide includes many asynchronous examples). Finally, the lack of documentation for distributed TF was the real reason this project was started. TF is a great tool that prides itself on its scalability, but unfortunately there are few examples that show how to make your model scale with dataset size.
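As a point of reference for the synchronous/asynchronous distinction above: in classic TF 1.x data parallelism, asynchronous updates are the default (each worker applies its gradients to the parameter servers independently), while synchronous aggregation can be obtained by wrapping the optimizer in tf.train.SyncReplicasOptimizer. A minimal sketch, reusing the assumed cluster, loss, and global_step from the previous example:

```python
# Hedged sketch of synchronous data-parallel training with SyncReplicasOptimizer
# (TF 1.x). Assumes the cluster, loss, global_step, and placeholders defined in
# the previous sketch; the replica counts below are illustrative.
opt = tf.train.GradientDescentOptimizer(0.01)

# Asynchronous (default): each worker simply calls opt.minimize(loss) and the
# parameter servers apply the updates as they arrive, without coordination.

# Synchronous: gradients from all replicas are aggregated before a single update.
sync_opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=2,   # wait for gradients from 2 workers per step
    total_num_replicas=2)
train_op = sync_opt.minimize(loss, global_step=global_step)

# Every worker needs the sync hook; the chief additionally runs the init ops.
sync_hook = sync_opt.make_session_run_hook(is_chief=(task_index == 0))
with tf.train.MonitoredTrainingSession(master=server.target,
                                       is_chief=(task_index == 0),
                                       hooks=[sync_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op, feed_dict={x: [[1.0]], y: [[2.0]]})
```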

The aim of this guide is to aid all interested in distributed deep learning, from beginners to researchers.
