
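Technically, the two target different dataset types: the ML API is built on Dataset (a DataFrame is a special case of Dataset, namely Dataset[Row]), while mllib is built on RDD. How do Dataset and RDD differ? A Dataset is backed by RDDs underneath, but adds a further layer of optimization on top: SQL-like query magic, static typing so that many errors are reported at compile time, better performance for combinators such as map and foreach, and so on.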


Overall idea: DataFrame => Pipeline => a new DataFrame

Pipeline: a data-processing flow built by chaining together a number of Transformers and Estimators

Transformer: in: DataFrame => out: DataFrame

Estimator: in: DataFrame => out: Transformer
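To make the three definitions above concrete, here is a minimal pure-Python sketch of the same shapes. It is a toy model, not Spark's API: a "DataFrame" is just a list of dicts, and AddOne, Center and CenterModel are hypothetical stages invented for illustration. The only things it mirrors from Spark are the ideas that an Estimator's fit returns a Transformer, and that a Pipeline is itself an Estimator whose fitted result is a Transformer.

```python
from dataclasses import dataclass
from typing import Dict, List, Union

Row = Dict[str, float]
Frame = List[Row]  # toy stand-in for a DataFrame (assumption, not Spark's type)


class Transformer:
    """In: Frame => out: Frame."""
    def transform(self, df: Frame) -> Frame:
        raise NotImplementedError


class Estimator:
    """In: Frame => out: Transformer."""
    def fit(self, df: Frame) -> Transformer:
        raise NotImplementedError


@dataclass
class AddOne(Transformer):
    """Hypothetical stateless stage: writes input_col + 1 into output_col."""
    input_col: str
    output_col: str

    def transform(self, df: Frame) -> Frame:
        return [{**row, self.output_col: row[self.input_col] + 1.0} for row in df]


@dataclass
class CenterModel(Transformer):
    """Fitted result of Center: subtracts the mean learned during fit."""
    input_col: str
    output_col: str
    mean: float

    def transform(self, df: Frame) -> Frame:
        return [{**row, self.output_col: row[self.input_col] - self.mean} for row in df]


@dataclass
class Center(Estimator):
    """Hypothetical stage that must see data first: it learns the column mean."""
    input_col: str
    output_col: str

    def fit(self, df: Frame) -> Transformer:
        mean = sum(row[self.input_col] for row in df) / len(df)
        return CenterModel(self.input_col, self.output_col, mean)


@dataclass
class PipelineModel(Transformer):
    """A fitted pipeline: every stage is a Transformer, applied in order."""
    stages: List[Transformer]

    def transform(self, df: Frame) -> Frame:
        for stage in self.stages:
            df = stage.transform(df)
        return df


@dataclass
class Pipeline(Estimator):
    """A chain of Transformers and Estimators; fitting it yields a PipelineModel."""
    stages: List[Union[Transformer, Estimator]]

    def fit(self, df: Frame) -> Transformer:
        fitted: List[Transformer] = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)      # Estimator => Transformer
            df = stage.transform(df)       # feed transformed data to the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
model = Pipeline([AddOne("x", "x1"), Center("x1", "centered")]).fit(train)
result = model.transform([{"x": 2.0}])
```

Fitting the pipeline runs each stage in order on the training data, turning every Estimator into a Transformer along the way, so the fitted model can be applied to a new frame with a single transform call.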

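The official Spark documentation already explains this clearly; see the Spark ML Programming Guide. ML, as of 1.4, is a higher-level library than MLlib. It addresses how to design a machine learning workflow concisely, rather than any specific machine learning algorithm. Going forward, the two libraries will be developed in parallel.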



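In practice ml is recommended: its algorithms, built on DataFrames, are better suited to creating an ML pipeline that covers everything from data cleaning through feature engineering to model training. Moreover, mllib will be deprecated in the future.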

See the official documentation, MLlib - Spark 1.6.1 Documentation:


It divides into two packages:

  • spark.mllib contains the original API built on top of RDDs.
  • spark.ml provides higher-level API built on top of DataFrames for constructing ML pipelines.

Using spark.ml is recommended because with DataFrames the API is more versatile and flexible. But we will keep supporting spark.mllib along with the development of spark.ml. Users should be comfortable using spark.mllib features and expect more features coming. Developers should contribute new algorithms to spark.ml if they fit the ML pipeline concept well, e.g., feature extractors and transformers.


Announcement: DataFrame-based API is primary API (from MLlib: Main Guide)

The MLlib RDD-based API is now in maintenance mode.

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary Machine Learning API for Spark is now the DataFrame-based API in the spark.ml package.

What are the implications?

  • MLlib will still support the RDD-based API in spark.mllib with bug fixes.
  • MLlib will not add new features to the RDD-based API.
  • In the Spark 2.x releases, MLlib will add features to the DataFrames-based API to reach feature parity with the RDD-based API.
  • After reaching feature parity (roughly estimated for Spark 2.2), the RDD-based API will be deprecated.
  • The RDD-based API is expected to be removed in Spark 3.0.

Why is MLlib switching to the DataFrame-based API?

  • DataFrames provide a more user-friendly API than RDDs. The many benefits of DataFrames include Spark Datasources, SQL/DataFrame queries, Tungsten and Catalyst optimizations, and uniform APIs across languages.
  • The DataFrame-based API for MLlib provides a uniform API across ML algorithms and across multiple languages.
  • DataFrames facilitate practical ML Pipelines, particularly feature transformations. See the Pipelines guide for details.

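mllib is mainly RDD-based, and its level of abstraction is not high enough. ml mainly abstracts the data-processing pipeline: an algorithm is just one component of the pipeline and can be swapped freely for another, which separates the algorithm from the rest of the data processing and achieves loose coupling.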
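The loose coupling described above can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Spark code: scale_feature, constant_predictor and fit_pipeline are hypothetical names, and the "algorithms" are deliberately trivial (they always predict the mean or the median of the training labels). The point is that the preprocessing step stays untouched while the algorithm slot is swapped.

```python
from statistics import mean, median
from typing import Callable, Dict, List

Row = Dict[str, float]
Frame = List[Row]                      # toy stand-in for a DataFrame
Stage = Callable[[Frame], Frame]       # a fitted stage: frame in, frame out
Algorithm = Callable[[Frame], Stage]   # estimator-like: fitting returns a stage


def scale_feature(df: Frame) -> Frame:
    """Fixed preprocessing step shared by every variant of the pipeline."""
    return [{**row, "feature": row["raw"] / 10.0} for row in df]


def constant_predictor(summary: Callable[[List[float]], float]) -> Algorithm:
    """Toy algorithm: always predicts a summary of the training labels."""
    def fit(df: Frame) -> Stage:
        value = summary([row["label"] for row in df])
        return lambda frame: [{**row, "prediction": value} for row in frame]
    return fit


def fit_pipeline(algorithm: Algorithm, train: Frame) -> Stage:
    """Preprocess the training data, fit the algorithm, return the composed stage."""
    predict = algorithm(scale_feature(train))
    return lambda frame: predict(scale_feature(frame))


train = [
    {"raw": 10.0, "label": 1.0},
    {"raw": 20.0, "label": 2.0},
    {"raw": 30.0, "label": 9.0},
]
test = [{"raw": 40.0}]

# Same pipeline, different algorithm component: only one argument changes.
by_mean = fit_pipeline(constant_predictor(mean), train)(test)
by_median = fit_pipeline(constant_predictor(median), train)(test)
```

Because the algorithm is just one argument to fit_pipeline, replacing it does not require touching the preprocessing code, which is the separation the answer above attributes to ml pipelines.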


