SparkR Column [5]: Machine Learning

05-24

QQ discussion group: 440125673

Test environment: CentOS 7 + RStudio Server + Hadoop 2.7.4 (pseudo-distributed) + Spark 2.2.0

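1. Introduction to Spark MLlib

MLlib is Spark's machine learning (ML) library. Its goal is to make practical machine learning scalable and easy. At a high level, it provides the following tools:

ML algorithms: common learning algorithms such as classification, regression, clustering, and collaborative filtering;
Featurization: feature extraction, transformation, dimensionality reduction, and selection;
Pipelines: tools for constructing, evaluating, and tuning ML pipelines;
Persistence: saving and loading algorithms, models, and pipelines;
Utilities: linear algebra, statistics, data handling, and so on.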

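MLlib is split into two separate packages: spark.mllib and spark.ml

spark.mllib: the original machine learning API, built on top of RDDs. Since Spark 2.0 this RDD-based API has been in maintenance mode; according to the official guide at the time of writing, it is expected to be deprecated and then removed in Spark 3.0.

For more information, see the official site: http://spark.apache.org/docs/latest/ml-guide.html

spark.ml: the primary machine learning API, built on top of DataFrames and used for constructing and tuning machine learning pipelines.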

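2. Machine Learning in SparkR

MLlib's spark.ml API is organized around the concept of a Pipeline, and since Spark 2.0 this DataFrame-based API has been MLlib's primary API. It standardizes the APIs of machine learning algorithms so that multiple algorithms can more easily be combined into a single Pipeline, or workflow. The Pipeline concept is mostly inspired by the scikit-learn project (a Python machine learning library).

Unfortunately, in order to fit R's programming conventions, Apache Spark does not expose the Pipeline concept in R. This means that many features of spark.ml cannot be used from R. In addition, although R is feature-rich, its functions are not well organized into modules (one of the reasons Python programmers keep criticizing R), and the SparkR package is no exception. Of course, you get used to it.

Note: for machine learning on Spark, PySpark is the recommended choice. If time permits, I will put together material for a PySpark series later.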

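SparkR supports the following machine learning algorithms:

Supervised learning:

Logistic regression (spark.logit)
Random forest (spark.randomForest)
Linear SVM (spark.svmLinear)
Naive Bayes (spark.naiveBayes)
Generalized linear regression (spark.glm)
…

Unsupervised learning:

K-means (spark.kmeans)
Association rules (spark.fpGrowth)
…

For the most part, these functions follow R's own programming conventions, such as the use of a formula. A simple example follows.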
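To make the formula convention concrete, here is a minimal sketch in base R (no Spark required) using the built-in iris data. The SparkR call in the trailing comment is an assumed analogue rather than part of the original example: spark.glm takes the same kind of formula, but on a SparkDataFrame whose column names use underscores instead of dots.

```r
# Base R: fit a Gaussian GLM with a formula on the built-in iris data.
fit <- glm(Sepal.Length ~ Sepal.Width + Petal.Length,
           data = iris, family = gaussian())

# Named coefficient vector: (Intercept), Sepal.Width, Petal.Length.
print(coef(fit))

# Assumed SparkR analogue (needs a running Spark session, so it is not run here):
# df    <- createDataFrame(iris)  # SparkR replaces "." in column names with "_"
# model <- spark.glm(df, Sepal_Length ~ Sepal_Width + Petal_Length, family = "gaussian")
# summary(model)
```

The formula is the only part that carries over unchanged; the data argument becomes a distributed SparkDataFrame instead of a local data.frame.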

#### 1. Read data from Hive ####
library(SparkR)
sparkR.session(enableHiveSupport = TRUE)  # assumes Spark is configured with Hive access
iris_df <- sql("select Sepal_Length,Sepal_Width,Petal_Length,Petal_Width,label from czf_test.iris")
head(iris_df)
#   Sepal_Length Sepal_Width Petal_Length Petal_Width  label
# 1          5.1         3.5          1.4         0.2 setosa
# 2          4.9         3.0          1.4         0.2 setosa
# 3          4.7         3.2          1.3         0.2 setosa
# 4          4.6         3.1          1.5         0.2 setosa
# 5          5.0         3.6          1.4         0.2 setosa
# 6          5.4         3.9          1.7         0.4 setosa

#### 2. Split the data set ####
# 70% training, 30% test; the third argument is the random seed
iris_list <- randomSplit(iris_df, c(0.7, 0.3), 20180126)
train <- iris_list[[1]]
test <- iris_list[[2]]

#### 3. Fit a logistic regression ####
library(magrittr)  # provides %>%; attaching all of dplyr would mask several SparkR functions
# family must be passed as a string
iris_glm <- spark.logit(data = train, formula = label ~ Sepal_Length + Sepal_Width + Petal_Length + Petal_Width, family = "multinomial")
summary(iris_glm)
# $coefficients
#              versicolor virginica  setosa
# (Intercept)    16.77075 -18.49414  1.723393
# Sepal_Length   7.341095  5.143021 -12.48412
# Sepal_Width   -20.28102 -26.23101  46.51202
# Petal_Length   2.465449  10.79543 -13.26088
# Petal_Width    7.878006  22.60734 -30.48535

#### 4. Evaluate the model ####
iris_fit_glm <- iris_glm %>% predict(test) %>% subset(select = c("label", "prediction"))
head(iris_fit_glm)
iris_predict <- as.data.frame(iris_fit_glm)  # collect the predictions into a local data.frame
accuracy <- sum(iris_predict$label == iris_predict$prediction) / nrow(iris_predict)  # share of correct predictions
print(accuracy)  # the accuracy is 1
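A single accuracy figure hides which classes get confused with each other. As a hedged sketch, a confusion matrix can be built from the collected label and prediction columns with base R's table(). The small data frame below is made up for illustration and only stands in for the collected iris_predict data frame.

```r
# Hypothetical stand-in for the collected predictions (values are made up).
demo_predict <- data.frame(
  label      = c("setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"),
  prediction = c("setosa", "setosa", "versicolor", "virginica",  "virginica", "virginica"),
  stringsAsFactors = FALSE
)

# Shared factor levels keep the table square even if a class is never predicted.
classes <- sort(unique(c(demo_predict$label, demo_predict$prediction)))
confusion <- table(
  actual    = factor(demo_predict$label, levels = classes),
  predicted = factor(demo_predict$prediction, levels = classes)
)
print(confusion)

# Accuracy is the share of rows on the diagonal (5 of 6 in this made-up data).
demo_accuracy <- sum(diag(confusion)) / sum(confusion)
print(demo_accuracy)
```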

As you can see, the multiclass logistic regression model built with spark.logit classifies every row of this test split correctly.
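The coefficient table printed by summary() can be turned into class probabilities by hand, which is a useful sanity check on a prediction. The sketch below assumes the standard multinomial (softmax) form, where each class score is the intercept plus the weighted features. The coefficients are copied from the output above and the feature values from the first row of the data; the variable names are chosen for illustration.

```r
# Coefficients copied from the summary() output; one column per class.
coefs <- matrix(
  c( 16.77075,   7.341095, -20.28102,   2.465449,   7.878006,   # versicolor
    -18.49414,   5.143021, -26.23101,  10.79543,   22.60734,    # virginica
      1.723393, -12.48412,  46.51202, -13.26088,  -30.48535),   # setosa
  nrow = 5,
  dimnames = list(
    c("(Intercept)", "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    c("versicolor", "virginica", "setosa")
  )
)

# First row of the data, with a leading 1 for the intercept.
x <- c(1, 5.1, 3.5, 1.4, 0.2)

# Linear score per class, then a numerically stable softmax.
scores <- drop(x %*% coefs)
probs  <- exp(scores - max(scores)) / sum(exp(scores - max(scores)))
print(round(probs, 4))
print(names(which.max(probs)))  # "setosa"
```

The setosa score dominates the other two by a wide margin, which matches the label of that row.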

3. The R4ML Package

3.1 What is R4ML?

R4ML is a scalable, hybrid approach to ML/Stats using R, Apache SystemML, and Apache Spark.

GitHub: https://github.com/SparkTC/r4ml

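R4ML is an open-source R package developed by IBM and available for download from GitHub;
R4ML is built on top of SparkR and Apache SystemML, so it supports the functionality of both;
R4ML acts as a bridge between SparkR and Apache SystemML;
R4ML provides a set of ready-made algorithms;
R4ML lets you create custom ML algorithms;
R4ML is very friendly to R users.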

3.2 Installing R4ML

Step 1: add the path of the SparkR package to the library search path

.libPaths(c(.libPaths(), "/usr/hdp/2.6.0.3-8/spark2/R/lib"))

Step 2: install the dependencies

install.packages(c("uuid", "R6"), repos = "http://cloud.r-project.org")

Step 3: download and install R4ML

download.file("http://169.45.79.58/R4ML_0.8.1.tar.gz", "~/R4ML_0.8.1.tar.gz")
install.packages("~/R4ML_0.8.1.tar.gz", repos = NULL, type = "source") # 0.8.1 was the latest version at the time of writing

Step 4: load the R4ML package

library(R4ML)

3.3 Creating the Spark entry point

r4ml.session(master = "local[1]", sparkHome = "/usr/hdp/2.6.0.3-8/spark2")

3.4 A worked example

# Create an r4ml.frame
iris.df <- as.r4ml.frame(iris)
class(iris.df)
# [1] "r4ml.frame"
# attr(,"package")
# [1] "R4ML"

# Preprocess the data and dummy-code the Species column
pp <- r4ml.ml.preprocess(iris.df, transformPath = "/tmp", dummycodeAttrs = c("Species"))

# Convert the preprocessed data into an r4ml.matrix
iris.mat <- as.r4ml.matrix(pp$data)

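# Declare the column types: four numeric columns and three dummy columns
ml.coltypes(iris.mat) <- c("scale", "scale", "scale", "scale", "nominal", "nominal", "nominal")

# Split the data set: 20% for testing, 80% for training
s <- r4ml.sample(iris.mat, perc = c(0.2, 0.8))
test <- s[[1]]   # test set (20%)
train <- s[[2]]  # training set (80%)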

# Fit a linear regression model
iris.model <- r4ml.lm(Sepal_Length ~ ., data = train, intercept = TRUE, tolerance = 0.0001, iter.max = 100, lambda = 1)

# Model coefficients

# Prediction
preds <- predict(iris.model, test)

# Model evaluation

# Stop the r4ml session
r4ml.session.stop()
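The model evaluation step is left empty in the listing above. One common choice for a regression model is the root-mean-square error (RMSE), optionally together with R-squared. The sketch below assumes the predictions and the true Sepal_Length values have already been collected into plain R numeric vectors. The vectors are made-up values, and how to extract them from preds depends on what R4ML's predict returns, which is not covered here.

```r
# Hypothetical collected values (made up for illustration).
actual    <- c(5.1, 4.9, 6.3, 5.8)
predicted <- c(5.0, 5.1, 6.0, 5.9)

# Root-mean-square error: square root of the mean squared residual.
rmse <- sqrt(mean((actual - predicted)^2))
print(rmse)

# R-squared: 1 minus the residual sum of squares over the total sum of squares.
r_squared <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
print(r_squared)
```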

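4. Conclusion

Compared with the machine learning module in SparkR, the R4ML package is more powerful. Together, SparkR and R4ML cover almost everything we need. The rest is a matter of practice.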