LightGBM Tuning: A Practical Case Study

0. Case Background and Modeling Approach

1. Background

The data for this case study come from the Kaggle competition "Santander Customer Satisfaction". It is an imbalanced binary classification problem, and the objective is to maximize AUC (the area under the ROC curve). Competition link: Santander Customer Satisfaction | Kaggle. The competition has since ended.

2. Modeling approach

This write-up uses Microsoft's open-source LightGBM algorithm for classification; it runs extremely fast.

1) Read the data;

2) Parallel computation: the lightgbm package parallelizes on its own when the relevant parameters are set, so the doParallel and foreach packages are not needed;

3) Feature selection: use the mlr package to keep the features covering 99% of the cumulative chi-squared score;

4) Tuning: adjust the parameters of lgb.cv step by step, iterating until the result is satisfactory;

5) Predict: build the LightGBM model with the tuned parameter values and output predictions. The AUC of this script's output is 0.833386, above the top score on the Private Leaderboard (0.829072).

3. The LightGBM algorithm

Details on LightGBM can be found at: Microsoft/LightGBM. Since LightGBM does not publish explicit mathematical formulas, they are not covered here; see the GitHub project if needed.

4. Contact and bio

For questions about the algorithm, e-mail: sugs01@outlook.com

Su Gaosheng holds an M.Sc. in Statistics from Southwestern University of Finance and Economics and works at China Telecom, where he is responsible for data analysis and modeling of existing enterprise customers. Research interest: machine learning.

I. Reading the Data

options(java.parameters = "-Xmx8g") ## used during feature selection; must be set before loading packages
library(readr)
lgb_tr1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/train.csv")
lgb_te1 <- read_csv("C:/Users/Administrator/Documents/kaggle/scs_lgb/test.csv")

II. Data Exploration

1. Set up parallel computation

library(dplyr)
library(mlr)
library(parallelMap)
parallelStartSocket(2)

2. A first look at each column

summarizeColumns(lgb_tr1) %>% View()

3. Handle missing values

# impute missing values in integer/numeric columns by their means
imp_tr1 <- impute(
  as.data.frame(lgb_tr1),
  classes = list(
    integer = imputeMean(),
    numeric = imputeMean()
  )
)
imp_te1 <- impute(
  as.data.frame(lgb_te1),
  classes = list(
    integer = imputeMean(),
    numeric = imputeMean()
  )
)

After imputation:

summarizeColumns(imp_tr1$data) %>% View()

4. Check the class ratio in the training data — the classes are imbalanced

table(lgb_tr1$TARGET)
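A quick sketch of the imbalance ratio (base R only); the ~24:1 figure below reappears later as the upper bound of the weight search grid:

class_counts <- table(lgb_tr1$TARGET)
class_counts[1] / class_counts[2] ## roughly 24.27 negatives per positive
prop.table(class_counts)          ## class proportions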

5. Remove constant columns from the data sets

lgb_tr2 <- removeConstantFeatures(imp_tr1$data)
lgb_te2 <- removeConstantFeatures(imp_te1$data)

6. Keep only the columns shared by the training and test sets

tr2_name <- data.frame(tr2_name = colnames(lgb_tr2))
te2_name <- data.frame(te2_name = colnames(lgb_te2))
tr2_name_inner <- tr2_name %>%
  inner_join(te2_name, by = c("tr2_name" = "te2_name"))
TARGET <- data.frame(TARGET = lgb_tr2$TARGET)
# skip the first shared column (the ID), keep the rest
lgb_tr2 <- lgb_tr2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]
lgb_te2 <- lgb_te2[, tr2_name_inner$tr2_name[2:nrow(tr2_name_inner)]]
lgb_tr2 <- cbind(lgb_tr2, TARGET)
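The same column alignment can be written more compactly with base R's intersect(); an equivalent alternative sketch, assuming the identifier column in the raw Kaggle files is named ID:

common_cols <- setdiff(intersect(colnames(lgb_tr2), colnames(lgb_te2)), "ID")
lgb_te2 <- lgb_te2[, common_cols]
lgb_tr2 <- lgb_tr2[, c(common_cols, "TARGET")]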

7. Notes:

1) Because LightGBM is used, the data are not standardized;

2) LightGBM is extremely efficient: on data under 1 GB it runs very fast even without feature selection. Feature selection is done here anyway, to speed things up further;

3) Features are filtered directly, without creating derived variables, because the actual meaning of the features is unknown, making it hard to generate new ones in any principled way.

III. Feature Selection — Chi-Squared Test

library(lightgbm)

1. A trial run to find the best weight value; it will be refined later

grid_search <- expand.grid(
  weight = seq(1, 30, 2)
  ## table(lgb_tr1$TARGET)[1] / table(lgb_tr1$TARGET)[2] = 24.27261,
  ## so weight is searched over [1, 30]
)
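Inside the loop below, each candidate weight w becomes a per-row weight via (TARGET * w + 1) / sum(TARGET * w + 1): before normalization a positive row gets w + 1 and a negative row gets 1, so w directly controls how much extra emphasis the minority class receives. A toy illustration:

y <- c(0, 0, 0, 1) ## toy labels
w <- 4
raw <- y * w + 1   ## negatives -> 1, positives -> 5
raw / sum(raw)     ## normalized weights, as passed to lgb.Dataset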

lgb_rate_1 <- numeric(length = nrow(grid_search))

set.seed(0)

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr2$TARGET * grid_search[i, "weight"] + 1) /
    sum(lgb_tr2$TARGET * grid_search[i, "weight"] + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr2[, 1:300]),
    label = lgb_tr2$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc"
  )
  # cross-validation
  lgb_tr2_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr2_mod$record_evals$valid$auc$eval)
  lgb_rate_1[i] <- cv_auc[length(cv_auc)]
}
library(ggplot2)
grid_search$perf <- lgb_rate_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point()

The plot shows that AUC is not very sensitive to the weight; it peaks at weight = 5.
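The two-line AUC extraction at the end of the tuning loop recurs in every loop below; wrapping it in a small helper is an optional tidy-up (the loops that follow keep the inline version so each stays self-contained):

## returns the final validation AUC recorded by lgb.cv
last_cv_auc <- function(cv_mod) {
  auc_trace <- unlist(cv_mod$record_evals$valid$auc$eval)
  auc_trace[length(auc_trace)]
}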

2. Feature selection

1) Feature selection

lgb_tr2$TARGET <- factor(lgb_tr2$TARGET)
lgb.task <- makeClassifTask(data = lgb_tr2, target = "TARGET")
lgb.task.smote <- oversample(lgb.task, rate = 5)
fv_time <- system.time(
  fv <- generateFilterValuesData(
    lgb.task.smote,
    method = "chi.squared"
    ## information gain or chi-squared both work here; random forest importance
    ## is not recommended — it is far too slow
    ## an IV (information value) filter is also worth trying
    ## feature engineering caps the achievable target metric (AUC here), so the
    ## filter method itself can be treated as a hyperparameter
  )
)
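The socket cluster started earlier is only needed while mlr computes the filter values; lightgbm manages its own threads through num_threads, so once the filter finishes the cluster can be shut down with parallelMap's cleanup call:

parallelStop()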

2) Plot the filter values

# plotFilterValues(fv)
plotFilterValuesGGVIS(fv)

3) Keep features covering 99% of the cumulative chi-squared score (LightGBM is efficient enough that more variables can be kept)

Note: the X in "keep the top X% of chi-squared" can itself be treated as a hyperparameter.

fv_data2 <- fv$data %>%
  arrange(desc(chi.squared)) %>%
  mutate(chi_gain_cul = cumsum(chi.squared) / sum(chi.squared))
fv_data2_filter <- fv_data2 %>% filter(chi_gain_cul <= 0.99)
dim(fv_data2_filter) ## about half of the variables are dropped
fv_feature <- fv_data2_filter$name
lgb_tr3 <- lgb_tr2[, c(fv_feature, "TARGET")]
lgb_te3 <- lgb_te2[, fv_feature]

4) Write out the data

write_csv(lgb_tr3, "C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv")
write_csv(lgb_te3, "C:/users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv")

IV. The Algorithm

## rxImport comes from RevoScaleR (Microsoft R); readr::read_csv works as well
lgb_tr <- rxImport("C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_tr3_chi.csv")
lgb_te <- rxImport("C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb_te3_chi.csv")

## it is better to read lgb_te only at prediction time, to save memory

library(lightgbm)

1. Tuning the weight parameter

grid_search <- expand.grid(
  weight = 1:30
)

perf_weight_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * grid_search[i, "weight"] + 1) /
    sum(lgb_tr$TARGET * grid_search[i, "weight"] + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc"
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    learning_rate = .1,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_weight_1[i] <- cv_auc[length(cv_auc)]
}

library(ggplot2)
grid_search$perf <- perf_weight_1
ggplot(grid_search, aes(x = weight, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at weight = 4 and declines afterwards.

2. Tuning the learning_rate parameter

grid_search <- expand.grid(
  learning_rate = 2 ^ (-(8:1))
)

perf_learning_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_learning_rate_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_learning_rate_1
ggplot(grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at learning_rate = 2^(-5), but the differences across 2^(-(6:3)) are tiny, so learning_rate = .125 (= 2^(-3)) is used to speed up training.

3. Tuning the num_leaves parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = seq(50, 800, 50)
)

perf_num_leaves_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_num_leaves_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_num_leaves_1
ggplot(grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at num_leaves = 650.

4. Tuning the min_data_in_leaf parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  min_data_in_leaf = 2 ^ (1:7)
)

perf_min_data_in_leaf_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    min_data_in_leaf = grid_search[i, "min_data_in_leaf"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_data_in_leaf_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_min_data_in_leaf_1
ggplot(grid_search, aes(x = min_data_in_leaf, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC is insensitive to min_data_in_leaf, so it is left unchanged.

5. Tuning the max_bin parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 2 ^ (5:10)
)

perf_max_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_bin_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_max_bin_1
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at max_bin = 2^10, so max_bin needs further fine-tuning.

6. Fine-tuning the max_bin parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 100 * (6:15)
)

perf_max_bin_2 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_bin_2[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_max_bin_2
ggplot(grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at max_bin = 1000.

7. Tuning the min_data_in_bin parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 2 ^ (1:9)
)

perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_data_in_bin_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_min_data_in_bin_1
ggplot(grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at min_data_in_bin = 8, though the differences are minimal, so no further adjustment is made.

8. Tuning the feature_fraction parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = seq(.5, 1, .02)
)

perf_feature_fraction_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_feature_fraction_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_feature_fraction_1
ggplot(grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at feature_fraction = .62 and is stable over [.60, .62]; it declines from .64 onwards.

9. Tuning the min_sum_hessian parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = seq(0, .02, .001)
)

perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_sum_hessian_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_min_sum_hessian_1
ggplot(grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at min_sum_hessian = .005; values in [.002, .005] are recommended, with a decline beyond .005.

10. Tuning the lambda parameters

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = seq(0, .01, .002),
  lambda_l2 = seq(0, .01, .002)
)

perf_lambda_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_lambda_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)

The plot suggests lambda_l1 = 0 and lambda_l2 = 0.

11. Tuning the drop_rate parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = seq(0, 1, .1)
)
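A caveat on this and the next step: in LightGBM, drop_rate and max_drop are DART parameters and, per the LightGBM documentation, take effect only when boosting = "dart" is set, which the parameter lists here do not do. A DART configuration would look roughly like:

params <- list(
  objective = "binary",
  metric = "auc",
  boosting = "dart", ## required for drop_rate / max_drop to apply
  drop_rate = .2,
  max_drop = 5
)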

perf_drop_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_drop_rate_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at drop_rate = .2, with 0, .2, and .5 all performing well; the variation over [0, 1] is small.

12. Tuning the max_drop parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = seq(1, 10, 2)
)

perf_max_drop_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 4 + 1) / sum(lgb_tr$TARGET * 4 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_drop_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC peaks at max_drop = 5, with little variation over [1, 10].

V. A Second Round of Tuning

1. Tuning the weight parameter

grid_search <- expand.grid(
  learning_rate = .125,
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_weight_2 <- numeric(length = 20)

for (i in 1:20) {
  lgb_weight <- (lgb_tr$TARGET * i + 1) / sum(lgb_tr$TARGET * i + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[1, "learning_rate"],
    num_leaves = grid_search[1, "num_leaves"],
    max_bin = grid_search[1, "max_bin"],
    min_data_in_bin = grid_search[1, "min_data_in_bin"],
    feature_fraction = grid_search[1, "feature_fraction"],
    min_sum_hessian = grid_search[1, "min_sum_hessian"],
    lambda_l1 = grid_search[1, "lambda_l1"],
    lambda_l2 = grid_search[1, "lambda_l2"],
    drop_rate = grid_search[1, "drop_rate"],
    max_drop = grid_search[1, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_weight_2[i] <- cv_auc[length(cv_auc)]
}

library(ggplot2)
ggplot(data.frame(num = 1:length(perf_weight_2), perf = perf_weight_2),
       aes(x = num, y = perf)) +
  geom_point() +
  geom_smooth()

The plot shows AUC stabilizes for weight >= 3 and peaks at weight = 7.

2. Tuning the learning_rate parameter

grid_search <- expand.grid(
  learning_rate = seq(.05, .5, .03),
  num_leaves = 650,
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_learning_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_learning_rate_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_learning_rate_1
ggplot(data = grid_search, aes(x = learning_rate, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC is largest at learning_rate = .11.

3. Tuning the num_leaves parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = seq(100, 800, 50),
  max_bin = 1000,
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_num_leaves_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_num_leaves_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_num_leaves_1
ggplot(data = grid_search, aes(x = num_leaves, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC is largest at num_leaves = 200.

4. Tuning the max_bin parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = seq(100, 1500, 100),
  min_data_in_bin = 8,
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_max_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_bin_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_max_bin_1
ggplot(data = grid_search, aes(x = max_bin, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC is largest at max_bin = 600; 400 and 800 are also acceptable.

5. Tuning the min_data_in_bin parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = seq(5, 50, 5),
  feature_fraction = .62,
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_min_data_in_bin_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_data_in_bin_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_min_data_in_bin_1
ggplot(data = grid_search, aes(x = min_data_in_bin, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC is largest at min_data_in_bin = 45; 25 is also acceptable.

6. Tuning the feature_fraction parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = seq(.5, .9, .02),
  min_sum_hessian = .005,
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_feature_fraction_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_feature_fraction_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_feature_fraction_1
ggplot(data = grid_search, aes(x = feature_fraction, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC is largest at feature_fraction = .54; .56 and .58 also perform well.

7. Tuning the min_sum_hessian parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = seq(.001, .008, .0005),
  lambda_l1 = 0,
  lambda_l2 = 0,
  drop_rate = .2,
  max_drop = 5
)

perf_min_sum_hessian_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_min_sum_hessian_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_min_sum_hessian_1
ggplot(data = grid_search, aes(x = min_sum_hessian, y = perf)) +
  geom_point() +
  geom_smooth()

Conclusion: AUC is largest at min_sum_hessian = .0065; .003 and .0055 are acceptable.

8. Tuning the lambda parameters

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = seq(0, .001, .0002),
  lambda_l2 = seq(0, .001, .0002),
  drop_rate = .2,
  max_drop = 5
)

perf_lambda_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_lambda_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_lambda_1
ggplot(data = grid_search, aes(x = lambda_l1, y = perf)) +
  geom_point() +
  facet_wrap(~ lambda_l2, nrow = 5)

Conclusion: lambda is negatively correlated with AUC overall; take lambda_l1 = .0002 and lambda_l2 = .0004.

9. Tuning the drop_rate parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = .0002,
  lambda_l2 = .0004,
  drop_rate = seq(0, .5, .05),
  max_drop = 5
)

perf_drop_rate_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_drop_rate_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_drop_rate_1
ggplot(data = grid_search, aes(x = drop_rate, y = perf)) +
  geom_point()

Conclusion: AUC is largest at drop_rate = .4; .15 and .25 are acceptable.

10. Tuning the max_drop parameter

grid_search <- expand.grid(
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = .0002,
  lambda_l2 = .0004,
  drop_rate = .4,
  max_drop = seq(1, 29, 2)
)

perf_max_drop_1 <- numeric(length = nrow(grid_search))

for (i in 1:nrow(grid_search)) {
  lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)
  lgb_train <- lgb.Dataset(
    data = data.matrix(lgb_tr[, 1:148]),
    label = lgb_tr$TARGET,
    free_raw_data = FALSE,
    weight = lgb_weight
  )
  # parameters
  params <- list(
    objective = "binary",
    metric = "auc",
    learning_rate = grid_search[i, "learning_rate"],
    num_leaves = grid_search[i, "num_leaves"],
    max_bin = grid_search[i, "max_bin"],
    min_data_in_bin = grid_search[i, "min_data_in_bin"],
    feature_fraction = grid_search[i, "feature_fraction"],
    min_sum_hessian = grid_search[i, "min_sum_hessian"],
    lambda_l1 = grid_search[i, "lambda_l1"],
    lambda_l2 = grid_search[i, "lambda_l2"],
    drop_rate = grid_search[i, "drop_rate"],
    max_drop = grid_search[i, "max_drop"]
  )
  # cross-validation
  lgb_tr_mod <- lgb.cv(
    params,
    data = lgb_train,
    nrounds = 300,
    stratified = TRUE,
    nfold = 10,
    num_threads = 2,
    early_stopping_rounds = 10
  )
  cv_auc <- unlist(lgb_tr_mod$record_evals$valid$auc$eval)
  perf_max_drop_1[i] <- cv_auc[length(cv_auc)]
}

grid_search$perf <- perf_max_drop_1
ggplot(data = grid_search, aes(x = max_drop, y = perf)) +
  geom_point()

Conclusion: AUC is largest at max_drop = 14, though the differences are minimal.

VI. Prediction

1) Weights

lgb_weight <- (lgb_tr$TARGET * 7 + 1) / sum(lgb_tr$TARGET * 7 + 1)

2) Training data set

lgb_train <- lgb.Dataset(
  data = data.matrix(lgb_tr[, 1:148]),
  label = lgb_tr$TARGET,
  free_raw_data = FALSE,
  weight = lgb_weight
)

3) Training

# parameters

params <- list(
  objective = "binary",
  metric = "auc",
  learning_rate = .11,
  num_leaves = 200,
  max_bin = 600,
  min_data_in_bin = 45,
  feature_fraction = .54,
  min_sum_hessian = .0065,
  lambda_l1 = .0002,
  lambda_l2 = .0004,
  drop_rate = .4,
  max_drop = 14
)

# model

lgb_mod <- lightgbm(
  params = params,
  data = lgb_train,
  nrounds = 300,
  early_stopping_rounds = 10,
  num_threads = 2
)

# predict

lgb.pred <- predict(lgb_mod, data.matrix(lgb_te))

4) Results

lgb.pred2 <- matrix(unlist(lgb.pred), ncol = 1)
lgb.pred3 <- data.frame(lgb.pred2)

5) Output

write.csv(lgb.pred3, "C:/Users/Administrator/Documents/kaggle/scs_lgb/lgb.pred1_tr.csv")
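For an actual Kaggle submission, the predictions need the matching test IDs. A hedged sketch, assuming the raw test file's identifier column is named ID (as in the original Santander data) and that lgb_te1 is still in memory:

submission <- data.frame(ID = lgb_te1$ID, TARGET = lgb.pred)
write_csv(submission, "C:/Users/Administrator/Documents/kaggle/scs_lgb/submission.csv")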

Note: some advice for readers still in school:

1. When studying machine-learning algorithms at school, the test data are usually small, so you can try most algorithms and most R functions; for example, to test random forests you can use the randomForest package, and with somewhat more data you can set up parallel execution. But once the data reach the GB scale, even a parallelized randomForest cannot cope and memory overflows; the functions in the professional edition of R (Microsoft R) are recommended instead;

2. School courses focus mainly on theory and the test data are usually fairly clean; real-world data structures tend to be more complex.

