[Machine Learning] H2O.ai: A High-Performance Machine Learning Framework

01-23

This article was contributed by column author @Hunter and edited by me.

Preface

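For parallel computing in R, packages such as parallel can parallelize an existing program once it has been rewritten as apply-style calls or for loops. Today's topic is a different approach: H2O.ai. H2O.ai is an open-source machine learning framework. It is written in Java, supports both multithreaded and multi-machine computation, and can be called from R. The idea is that H2O acts as the back end that performs the computation, while R is only a front end that passes data in and displays the results.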
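For contrast with H2O, here is a minimal sketch of the apply-style parallelism mentioned above, using the parallel package that ships with R. The function slow_square and the choice of two workers are assumptions made purely for illustration.

```r
library(parallel)

# Hypothetical stand-in for an expensive per-element computation.
slow_square <- function(x) {
  Sys.sleep(0.01)
  x^2
}

inputs <- 1:8

# Sequential baseline with lapply().
seq_result <- lapply(inputs, slow_square)

# The same computation spread over two worker processes.
cl <- makeCluster(2)
par_result <- parLapply(cl, inputs, slow_square)
stopCluster(cl)

# Both approaches should return the same values.
identical(seq_result, par_result)
```

With this approach the user has to restructure the code and manage the cluster by hand, which is the work H2O takes over once the data lives in its own format.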

I. Starting H2O from R

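After loading the h2o package, we initialize H2O and set the number of threads to use; for example, nthreads = -1 uses all available cores. Once the data has been converted to H2O's format, subsequent computations are distributed automatically across multiple threads and CPU cores. The algorithms currently available in H2O include GBM, GLM, Distributed Random Forest, Naive Bayes, and Deep Learning. There is also the heavy weapon of data competitions: the ensemble model (model stacking).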

library(h2o)
h2o.init(nthreads = -1)
## Connection successful!
##
## R is connected to the H2O cluster:
##     H2O cluster uptime:         3 hours 52 minutes
##     H2O cluster version:        3.10.0.8
##     H2O cluster version age:    3 months and 23 days !!!
##     H2O cluster name:           H2O_started_from_R_xianda_elh185
##     H2O cluster total nodes:    1
##     H2O cluster total memory:   0.71 GB
##     H2O cluster total cores:    4
##     H2O cluster allowed cores:  4
##     H2O cluster healthy:        TRUE
##     H2O Connection ip:          localhost
##     H2O Connection port:        54321
##     H2O Connection proxy:       NA
##     R Version:                  R version 3.3.2 (2016-10-31)

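II. Data Conversion

1. The as.h2o() function

as.h2o() converts an R data frame into a data format that H2O.ai recognizes; subsequent computations run on the iris.hex dataset. You can also apply common R functions such as head(), names(), and summary() to iris.hex.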

data(iris)
iris.hex <- as.h2o(iris, destination_frame = "iris.hex")
summary(iris.hex)
##  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.1000
##  1st Qu.:5.099   1st Qu.:2.799   1st Qu.:1.596   1st Qu.:0.2992
##  Median :5.798   Median :2.998   Median :4.348   Median :1.3000
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.1993
##  3rd Qu.:6.399   3rd Qu.:3.298   3rd Qu.:5.095   3rd Qu.:1.7992
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.5000
##  Species
##  setosa    :50
##  versicolor:50
##  virginica :50
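The other generic functions mentioned above work the same way as summary(). The following is a small sketch, assuming a local H2O cluster can be started as in Section I; it repeats the conversion so that it runs on its own.

```r
library(h2o)
h2o.init(nthreads = -1)

# Convert the built-in iris data frame into an H2O frame.
iris.hex <- as.h2o(iris, destination_frame = "iris.hex")

class(iris.hex)        # the object is an H2OFrame, not a data.frame
names(iris.hex)        # column names
dim(iris.hex)          # number of rows and columns
head(iris.hex, n = 3)  # first three rows
```

Only a handle to the data lives in the R session; the frame itself is stored in the H2O cluster.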

2. The as.data.frame() function

You can also convert iris.hex back into an ordinary R data frame by calling as.data.frame():

iris.R <- as.data.frame(iris.hex)

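GBM and Ensemble

In data mining competitions we often need to build models and make predictions. For prediction problems, popular competition choices include GBM, XGBoost, and ensembles. Because the H2O version used here does not provide XGBoost, the examples below use GBM and ensembles.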

III. The GBM Algorithm

Call signature:

h2o.gbm(x,y,training_frame,model_id,ntrees,distribution,max_depth,stopping_metric,balance_classes,learn_rate)

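Every model in H2O has a large number of tunable parameters, and the GBM parameters are quite similar to those of XGBoost. Here we fit a model on the iris dataset: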

iris.hex <- as.h2o(iris, destination_frame = "iris.hex")
iris.gbm <- h2o.gbm(y = 5, x = 1:4, training_frame = iris.hex,
                    ntrees = 15, max_depth = 5, min_rows = 2,
                    learn_rate = 0.01, distribution = "multinomial")

For a regression model, use distribution = "gaussian" instead.
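As a rough illustration of the regression case, the sketch below predicts the numeric column Sepal.Length from the remaining iris columns. The choice of response column and the parameter values are assumptions for demonstration only, not a tuned model.

```r
library(h2o)
h2o.init(nthreads = -1)

iris.hex <- as.h2o(iris, destination_frame = "iris.hex")

# Regression: predict Sepal.Length from all other columns.
predictors <- setdiff(names(iris.hex), "Sepal.Length")
iris.reg <- h2o.gbm(x = predictors,
                    y = "Sepal.Length",
                    training_frame = iris.hex,
                    ntrees = 15,
                    max_depth = 5,
                    min_rows = 2,
                    learn_rate = 0.1,
                    distribution = "gaussian")

# Training-set RMSE of the regression model.
h2o.rmse(iris.reg, train = TRUE)
```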

To inspect how well the model fits, use the following statement:

iris.gbm@model$training_metrics
## H2OMultinomialMetrics: gbm
## ** Reported on training data. **
##
## Training Set Metrics:
## =====================
##
## Extract training frame with `h2o.getFrame("iris.hex")`
## MSE: (Extract with `h2o.mse`) 0.3293958
## RMSE: (Extract with `h2o.rmse`) 0.5739301
## Logloss: (Extract with `h2o.logloss`) 0.8533637
## Mean Per-Class Error: 0.01333333
## Confusion Matrix: Extract with `h2o.confusionMatrix(<model>,train = TRUE)`)
## =========================================================================
## Confusion Matrix: vertical: actual; across: predicted
##            setosa versicolor virginica  Error      Rate
## setosa         50          0         0 0.0000 =  0 / 50
## versicolor      0         49         1 0.0200 =  1 / 50
## virginica       0          1        49 0.0200 =  1 / 50
## Totals         50         50        50 0.0133 = 2 / 150
##
## Hit Ratio Table: Extract with `h2o.hit_ratio_table(<model>,train = TRUE)`
## =======================================================================
## Top-3 Hit Ratios:
##   k hit_ratio
## 1 1  0.986667
## 2 2  1.000000
## 3 3  1.000000
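The printed report names accessor functions for the individual metrics. A short sketch of pulling them out programmatically follows; it refits the same iris model so that it runs on its own, and assumes a local H2O cluster is available.

```r
library(h2o)
h2o.init(nthreads = -1)

iris.hex <- as.h2o(iris, destination_frame = "iris.hex")
iris.gbm <- h2o.gbm(y = 5, x = 1:4, training_frame = iris.hex,
                    ntrees = 15, max_depth = 5, min_rows = 2,
                    learn_rate = 0.01, distribution = "multinomial")

# Individual training metrics as plain numbers.
train_mse     <- h2o.mse(iris.gbm, train = TRUE)
train_logloss <- h2o.logloss(iris.gbm, train = TRUE)

# Confusion matrix on the training data (the default when no new data is given).
train_cm <- h2o.confusionMatrix(iris.gbm)

train_mse
train_logloss
train_cm
```

Keeping the metrics as numbers makes it easy to collect them across several parameter settings.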

2. The h2o.predict() function

To make predictions with the model, call:

pred.hex <- h2o.predict(object = iris.gbm, newdata = iris.hex)
pred.hex
##   predict    setosa versicolor virginica
## 1  setosa 0.4297616  0.2851192 0.2851192
## 2  setosa 0.4297616  0.2851192 0.2851192
## 3  setosa 0.4297616  0.2851192 0.2851192
## 4  setosa 0.4297616  0.2851192 0.2851192
## 5  setosa 0.4297616  0.2851192 0.2851192
## 6  setosa 0.4297616  0.2851192 0.2851192
##
## [150 rows x 4 columns]
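The predictions are themselves an H2O frame, so they can be brought back into R with as.data.frame() as in Section II. The sketch below, which assumes a local H2O cluster and refits the same model so that it runs on its own, computes the share of training rows whose predicted class matches the true species.

```r
library(h2o)
h2o.init(nthreads = -1)

iris.hex <- as.h2o(iris, destination_frame = "iris.hex")
iris.gbm <- h2o.gbm(y = 5, x = 1:4, training_frame = iris.hex,
                    ntrees = 15, max_depth = 5, min_rows = 2,
                    learn_rate = 0.01, distribution = "multinomial")

# Predict on the training frame and pull the result back into R.
pred.hex <- h2o.predict(object = iris.gbm, newdata = iris.hex)
pred.R <- as.data.frame(pred.hex)

# Compare as character to avoid mismatched factor levels.
accuracy <- mean(as.character(pred.R$predict) == as.character(iris$Species))
accuracy
```

This is training accuracy only; a held-out frame passed as newdata would give a fairer estimate.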

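3. Comparison with the GBM algorithm in R

Finally, I used a P2P dataset as a test to compare H2O's GBM with the GBM algorithm in R along two dimensions: AUC and running time. The test ran on Windows 10 with 2 threads, varying the learn_rate parameter in H2O and the shrinkage parameter in R while keeping all other parameters fixed. The comparison covers nine runs.

In terms of AUC the two differ little, but H2O takes far less time than R. For a comparison of the efficiency of machine learning models, see an article on GitHub whose author carried out detailed benchmarks [5].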

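IV. Ensemble Models

H2O's ensemble functionality is available through the h2oEnsemble package. The package lets users train an ensemble from any supervised machine learning algorithms. At present, H2O's ensemble supports only regression and binary classification.

The ensemble function is h2o.ensemble(). Its main parameters are shown below, followed by a brief description of each: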

h2o.ensemble(x = x, y = y, training_frame = data, family = family, learner = learner, metalearner = metalearner, cvControl = list(V = 5))

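1. The family parameter

family specifies whether a regression or a classification model is used. For a binary classification problem, set family <- "binomial".

2. The learner parameter

learner specifies the library of base learners that are trained first. The example below uses H2O's default models: GLM, Random Forest, GBM, and Deep Learning (all with default parameter values).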

learner <- c("h2o.glm.wrapper","h2o.randomForest.wrapper", "h2o.gbm.wrapper", "h2o.deeplearning.wrapper")

You can also define a model with your own parameter values and add it to learner. For example:

h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper", "h2o.gbm.1", "h2o.deeplearning.wrapper")

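3. The metalearner parameter

metalearner specifies the model that performs the final combination of the base learners. Here we use a GLM with default parameters: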

metalearner <- "h2o.glm.wrapper"

4. Example

With the main parameters covered, the following R code runs a complete model training:

library(devtools)
install_github("h2oai/h2o-3/h2o-r/ensemble/h2oEnsemble-package")
library(h2oEnsemble)
h2o.init(nthreads = -1)
h2o.removeAll()
train <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_train_5k.csv")
test <- h2o.importFile("https://s3.amazonaws.com/erin-data/higgs/higgs_test_5k.csv")
y <- "response"
x <- setdiff(names(train), y)
family <- "binomial"
# For binary classification, the response column must be converted to a factor
train[,y] <- as.factor(train[,y])
test[,y] <- as.factor(test[,y])
# Custom GBM model parameters
h2o.gbm.1 <- function(..., ntrees = 100, seed = 1) h2o.gbm.wrapper(..., ntrees = ntrees, seed = seed)
learner <- c("h2o.glm.wrapper", "h2o.randomForest.wrapper", "h2o.gbm.1", "h2o.deeplearning.wrapper")
metalearner <- "h2o.glm.wrapper"
fit <- h2o.ensemble(x = x, y = y, training_frame = train, family = family,
                    learner = learner, metalearner = metalearner,
                    cvControl = list(V = 5))

To evaluate the model on the test set, we can use AUC to see how much the ensemble improves on the individual base learners:

perf <- h2o.ensemble_performance(fit, newdata = test)
perf
## Base learner performance, sorted by specified metric:
##                    learner       AUC
## 1          h2o.glm.wrapper 0.6824304
## 4 h2o.deeplearning.wrapper 0.6919192
## 2 h2o.randomForest.wrapper 0.7599636
## 3                h2o.gbm.1 0.7751240
## H2O Ensemble Performance on <newdata>:
## ----------------
## Family: binomial
## Ensemble performance (AUC): 0.779474562705375

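V. References

The above covers only some basic H2O operations. More detailed material on H2O is listed below:

[1] Official documentation for H2O and R
[2] The idea behind H2O model ensembling and how to use it
[3] Combining H2O with Spark (RSparkling)
[4] Hyperparameter optimization for machine learning and deep learning in H2O
[5] A comparison of the accuracy and running time of machine learning algorithms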

Finally

To learn more about R, Python, data science, and machine learning, follow the column: Data Science with R&Python