Question for the experts: what are the differences and connections between Spark's ML and MLlib packages?


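From a technical standpoint, the two packages target different dataset types: the ML API works on Datasets (a DataFrame is a special case of Dataset, namely Dataset[Row]), while mllib works on RDDs. How do Dataset and RDD differ? Underneath, a Dataset is still backed by RDDs, but it adds a further layer of optimization on top. For example, queries go through SQL-style query optimization, typed Datasets support static type checking so many errors are reported at compile time, and many operations can run faster than hand-written RDD code, although opaque lambdas passed to combinators such as map and foreach benefit less from these optimizations.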

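From a programming standpoint, the way you build a machine learning algorithm differs: ML encourages pipelines. Think of the data as water that flows in at one end of a pipe and out at the other:

Rough idea: DataFrame => Pipeline => a new DataFrame

Pipeline: a data processing flow made of several Transformers and Estimators chained together

Transformer: in: DataFrame => out: DataFrame

Estimator: in: DataFrame => out: Transformer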
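The three contracts above can be sketched in a few lines of plain Python. This is only a toy model, not Spark's API: a "DataFrame" here is a list of dicts, and AddDoubled, MeanCenter and the column names are made up for illustration. What it does mirror is the shape of the idea: an Estimator's fit returns a Transformer, and a Pipeline is itself an Estimator whose fitted result is a Transformer.

```python
# Toy model of the Transformer / Estimator / Pipeline contracts.
# Assumption: a "DataFrame" is just a list of dicts; none of this is Spark code.
from typing import Dict, List

Frame = List[Dict[str, float]]


class Transformer:
    def transform(self, df: Frame) -> Frame:
        raise NotImplementedError


class Estimator:
    def fit(self, df: Frame) -> Transformer:
        raise NotImplementedError


class AddDoubled(Transformer):
    """Stateless Transformer (hypothetical): adds column x2 = 2 * x."""

    def transform(self, df: Frame) -> Frame:
        return [{**row, "x2": 2 * row["x"]} for row in df]


class MeanCenterModel(Transformer):
    """Fitted Transformer: subtracts the mean learned during fit."""

    def __init__(self, mean: float) -> None:
        self.mean = mean

    def transform(self, df: Frame) -> Frame:
        return [{**row, "centered": row["x2"] - self.mean} for row in df]


class MeanCenter(Estimator):
    """Estimator (hypothetical): learns the mean of x2, returns a Transformer."""

    def fit(self, df: Frame) -> Transformer:
        mean = sum(row["x2"] for row in df) / len(df)
        return MeanCenterModel(mean)


class PipelineModel(Transformer):
    """A fitted pipeline: every stage is now a Transformer."""

    def __init__(self, stages: List[Transformer]) -> None:
        self.stages = stages

    def transform(self, df: Frame) -> Frame:
        for stage in self.stages:
            df = stage.transform(df)
        return df


class Pipeline(Estimator):
    """Chains stages; fitting it turns each Estimator into a Transformer."""

    def __init__(self, stages: list) -> None:
        self.stages = stages

    def fit(self, df: Frame) -> Transformer:
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)      # Estimator => Transformer
            fitted.append(stage)
            df = stage.transform(df)       # feed the output to the next stage
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = Pipeline([AddDoubled(), MeanCenter()]).fit(train)
result = model.transform([{"x": 5.0}])
```

Fitting happens once on the training data; the resulting model can then be applied to any new frame with the same columns.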


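The official Spark documentation actually explains this quite clearly; see the Spark ML Programming Guide. As of Spark 1.4, ML is a higher-level library than MLlib. The problem it addresses is how to design a machine learning workflow concisely, rather than how to implement any particular machine learning algorithm. Going forward, the two libraries are to be developed in parallel.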


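The algorithm interfaces in spark.mllib are based on RDDs;

the algorithm interfaces in spark.ml are based on DataFrames.

In practice ml is the recommended choice: because its algorithms are built on DataFrames, they are better suited to building an ML pipeline that covers everything from data cleaning to feature engineering to model training. In addition, mllib is going to be deprecated in the future.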


See the official documentation, MLlib - Spark 1.6.1 Documentation.

It states this very clearly:

It divides into two packages:

  • spark.mllib contains the original API built on top of RDDs.
  • spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.


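ml adds a higher level of abstraction over all the algorithms. If you read the source code you will find that, for many algorithms, mllib simply calls into ml (the direction of delegation varies by algorithm and Spark version), so for those algorithms it makes no difference which one you use. That said, I find mllib a bit more convenient to use, so my suggestion is to just use mllib.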


From MLlib: Main Guide, "Announcement: DataFrame-based API is primary API":

The MLlib RDD-based API is now in maintenance mode.

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

What are the implications?

  • MLlib will still support the RDD-based API in spark.mllib with bug fixes.
  • MLlib will not add new features to the RDD-based API.
  • In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
  • After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
  • The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching to the DataFrame-based API?

  • DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
  • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
  • DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.


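mllib is mainly RDD-based, and its level of abstraction is not high enough. ml mainly abstracts the data processing flow into a pipeline, in which an algorithm is just one component that can be freely swapped for another. This separates the algorithm from the rest of the data processing and gives loose coupling.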
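To make the "swappable component" point concrete, here is a small plain-Python sketch. It is not Spark code, and every name in it (scale, MeanEstimator, MedianEstimator, run_pipeline) is hypothetical. The only assumption is that each algorithm exposes fit and returns a model with predict; the surrounding pipeline code never changes when the algorithm does.

```python
# Toy illustration of loose coupling: the pipeline depends only on the
# fit/predict contract, not on which algorithm is plugged in.
from statistics import mean, median


def scale(rows):
    """Shared preprocessing step, independent of the algorithm used later."""
    return [(x / 10.0, y) for x, y in rows]


class ConstantModel:
    """Fitted model that always predicts the value learned during fit."""

    def __init__(self, value):
        self.value = value

    def predict(self, x):
        return self.value


class MeanEstimator:
    def fit(self, rows):
        return ConstantModel(mean(y for _, y in rows))


class MedianEstimator:
    def fit(self, rows):
        return ConstantModel(median(y for _, y in rows))


def run_pipeline(estimator, train_rows, x_new):
    """Preprocess, fit, predict. Only the estimator argument varies."""
    prepared = scale(train_rows)
    model = estimator.fit(prepared)
    return model.predict(x_new / 10.0)


data = [(10, 1.0), (20, 2.0), (30, 9.0)]
mean_pred = run_pipeline(MeanEstimator(), data, 40)
median_pred = run_pipeline(MedianEstimator(), data, 40)
```

Swapping MeanEstimator for MedianEstimator changes the prediction but not a single line of run_pipeline, which is the kind of decoupling the pipeline abstraction is meant to give.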


