Is there a Spark-based parameter server?

01-14

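Parameter servers are used to train large-scale machine learning models. Is there a parameter server framework built on Spark? The two always seem to be mentioned separately; is that because their design principles differ so much? What are the main difficulties in building a PS framework on Spark?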

Thanks to 遵文 and 雲龍 for the invitation to answer.

Unfortunately, after all these years, PS support in Spark itself remains an unsolved problem, and most MLlib algorithms still have serious performance bottlenecks.

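The good news is that this year Tencent open-sourced Angel, an industrial-grade Parameter Server that supports Spark. Angel is a distributed machine learning platform built on the parameter server concept, and it provides the full set of features a parameter server needs, including: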

Model partitioning (Model Partitioner)
Asynchronous control (Sync Controller)
Extension functions (psFunc)

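Glint either lacks these features or implements them fairly simply. That said, Glint is a very nice project: small, with clean code and an elegant design. We studied it and used it as a reference when building Angel's Spark on Angel module, and we would like to express our thanks here.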
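To make the three features listed above a bit more concrete, here is a minimal single-process sketch in Python. It is not Angel's or Glint's API: `ToyParameterServer`, `Shard`, `tick`, `can_proceed`, and `ps_func` are hypothetical names invented for illustration, and the bounded-staleness check is just one assumed way a sync controller might work.

```python
from typing import Callable, Dict, Iterable, List


class Shard:
    """One server's slice of the model: indices [start, end)."""

    def __init__(self, start: int, end: int) -> None:
        self.start = start
        self.end = end
        self.values: List[float] = [0.0] * (end - start)


class ToyParameterServer:
    """Hypothetical in-process stand-in for a cluster of parameter servers."""

    def __init__(self, dim: int, num_servers: int, num_workers: int,
                 staleness: int = 0) -> None:
        # Model partitioning: contiguous ranges of at most `size` indices,
        # so at most `num_servers` shards are created.
        self.size = -(-dim // num_servers)  # ceiling division
        self.shards = [Shard(s, min(s + self.size, dim))
                       for s in range(0, dim, self.size)]
        # Sync control: one logical clock per worker.
        self.clocks = [0] * num_workers
        self.staleness = staleness

    def _locate(self, index: int) -> Shard:
        return self.shards[index // self.size]

    def pull(self, indices: Iterable[int]) -> Dict[int, float]:
        """Fetch only the requested parameters."""
        out = {}
        for i in indices:
            shard = self._locate(i)
            out[i] = shard.values[i - shard.start]
        return out

    def push(self, deltas: Dict[int, float]) -> None:
        """Add each delta to the parameter stored on its owning shard."""
        for i, d in deltas.items():
            shard = self._locate(i)
            shard.values[i - shard.start] += d

    def tick(self, worker: int) -> None:
        """A worker reports that it finished one iteration."""
        self.clocks[worker] += 1

    def can_proceed(self, worker: int) -> bool:
        # Assumed policy: a worker may be at most `staleness` iterations
        # ahead of the slowest worker (staleness=0 behaves like a barrier).
        return self.clocks[worker] - min(self.clocks) <= self.staleness

    def ps_func(self, fn: Callable[[List[float]], float]) -> List[float]:
        """Run `fn` next to the data on every shard; return partial results."""
        return [fn(shard.values) for shard in self.shards]


ps = ToyParameterServer(dim=10, num_servers=3, num_workers=2, staleness=1)
ps.push({0: 1.0, 4: 2.0, 9: 3.0})
print(ps.pull([0, 4, 9]))
print(sum(ps.ps_func(sum)))  # server-side partial sums, combined by the caller
```

The point of the psFunc-style call is that only one number per shard travels back to the caller instead of the whole parameter vector. A real system would add networking, fault tolerance, and richer partitioning schemes, none of which is modeled here.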

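Angel is not limited to Spark: it can run standalone, or it can support Spark through PS-Service. Users need only small code changes and no intrusion into Spark Core to gain PS capability and speed up distributed training of their algorithms. For details, see these three documents on GitHub (https://github.com/Tencent/angel):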

Spark on Angel architecture design
Spark on Angel Quick Start
Spark on Angel programming guide

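Spark on Angel has been tested on real business data inside Tencent: with 230 million samples and 50 million dimensions, the various optimized LR algorithms run through without trouble. (The code is open source; everyone is welcome to evaluate and benchmark it.)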

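The Angel platform is still developing and evolving quickly, and a Python interface and deep learning support are planned. Fellow machine learning practitioners are welcome to try it. If you like it, please give it a Star on GitHub. Thanks :)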

I really don't have time to write a full answer, but after all this time, PS on Spark probably shouldn't be a problem anymore.

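Two or three years ago, while I was at Intel, I implemented distml and open-sourced it on GitHub. It was probably the earliest implementation on Spark. Unfortunately I didn't get much chance to optimize it afterward, and many features are incomplete. I now work on human-machine dialogue and have even less time to maintain it, so my apologies to the few people who are still following the project.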

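After 明風 moved from Alibaba to Tencent, his team open-sourced a machine learning platform this year that appears to support PS. Take a look if you're interested.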

While reading papers recently, I happened to come across a related project: Glint

Glint is a high-performance parameter server compatible with Spark. It provides a high-level API in Scala making it easy for users to build machine learning algorithms with the parameter server as a concurrency model

Someone already proposed this at Spark Summit 2016:

Apache Spark enables applications to efficiently process massive datasets. Its extensive machine learning libraries, such as MLlib, have showcased the power of the Spark system architecture and the elegance of its API. The existing Spark libraries assume that the number of parameters in machine learning models are small enough to fit in the memory of a single machine. This assumption, however, is not compatible with demanding use cases at Yahoo. In order to fit our needs, we developed a set of Spark ML libraries that can handle large models with billions of parameters. To enable models with billions of parameters, we have explored a system architecture that augments Spark driver/executors with Parameter Servers (PS). This provides distributed in-memory model storage and computation, and parameter servers enable Spark-executor-based learners to jointly learn large models efficiently. In this talk, we will illustrate the power of Spark+PS architecture via two algorithms: logistic regression and word2vec. We will elaborate on how Spark+PS has enabled us to achieve significant model size scale-up and speed-up of machine learning. We will also discuss how this approach could be applied to other ML algorithms.

Scaling Machine Learning To Billions Of Parameters
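The Spark+PS arrangement described in the abstract above can be sketched roughly as follows. This is a toy, single-process simulation in plain Python, not the Yahoo implementation and not Spark code: `WeightStore` stands in for the parameter servers, each list in `partitions` stands in for the data held by one Spark executor, and all names and the tiny dataset are made up for illustration. The idea it tries to show is that each learner pulls only the weights for features present in its own data, so the full model never has to fit on one machine.

```python
import math
from typing import Dict, Iterable, List, Tuple

# (sparse feature vector, label in {0, 1}) -- hypothetical data layout
Example = Tuple[Dict[int, float], int]


class WeightStore:
    """Stand-in for the parameter servers: a sparse table of weights."""

    def __init__(self) -> None:
        self.w: Dict[int, float] = {}

    def pull(self, keys: Iterable[int]) -> Dict[int, float]:
        # Unseen features start at weight 0.0.
        return {k: self.w.get(k, 0.0) for k in keys}

    def push(self, grads: Dict[int, float], lr: float) -> None:
        # Apply a gradient-descent update on the server side.
        for k, g in grads.items():
            self.w[k] = self.w.get(k, 0.0) - lr * g


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def worker_step(store: WeightStore, partition: List[Example], lr: float) -> None:
    """One logistic-regression gradient step computed by one 'executor'."""
    keys = {k for x, _ in partition for k in x}
    w = store.pull(keys)  # fetch only the weights this partition touches
    grads = {k: 0.0 for k in keys}
    for x, y in partition:
        p = sigmoid(sum(w[k] * v for k, v in x.items()))
        for k, v in x.items():
            # Gradient of the mean log loss with respect to weight k.
            grads[k] += (p - y) * v / len(partition)
    store.push(grads, lr)


def predict(store: WeightStore, x: Dict[int, float]) -> float:
    w = store.pull(x.keys())
    return sigmoid(sum(w[k] * v for k, v in x.items()))


# Two made-up partitions: feature 0 signals label 1, feature 2 signals label 0.
partitions: List[List[Example]] = [
    [({0: 1.0, 1: 1.0}, 1), ({2: 1.0}, 0)],
    [({0: 1.0, 3: 1.0}, 1), ({2: 1.0, 4: 1.0}, 0)],
]

store = WeightStore()
for _ in range(200):
    for part in partitions:  # in a real cluster these would run in parallel
        worker_step(store, part, lr=0.5)

print(predict(store, {0: 1.0}), predict(store, {2: 1.0}))
```

Here the workers take turns, so there is no staleness to manage; with truly parallel executors the store would also need some synchronization policy.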