Kaggle: Machine Learning from Disaster, Advanced Study: Breaking into the Top 10%

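Foreword

In the earlier post "Kaggle First Experience: Random Forest Analysis of Machine Learning from Disaster in R", I ran a first analysis of Titanic: Machine Learning from Disaster and finished around the top 26% of the leaderboard (the original post shows the ranking as a screenshot). That analysis followed these steps: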

  • Data cleaning
  • Feature engineering
  • Model design
  • Prediction

This post builds on that work and analyzes the Machine Learning from Disaster dataset again, following the same steps.

Loading the data

    # Load packages
    library(dplyr)
    library(stringr)
    library(ggplot2)
    library(ggthemes)
    library(lattice)
    library(caret)
    library(rpart)

    # Load the data produced by the previous analysis
    data.combined <- read.csv("./Advanced_analyze/Advanced_analyze.csv", header = T, stringsAsFactors = F)

Feature engineering

    # Fine-tune Title: fold rare titles into the common ones
    AD_Title <- data.combined[, "Title"]
    AD_Title[AD_Title %in% c("Dona", "the Countess", "Mme", "Lady")] <- "Mrs"
    # Note: "Miss." (trailing period) does not match the existing "Miss" value,
    # so these rows form a separate category in the table further down.
    # Use "Miss" here to merge them.
    AD_Title[AD_Title %in% c("Ms", "Mlle")] <- "Miss."
    AD_Title[AD_Title %in% c("Jonkheer", "Don", "Col", "Capt", "Major", "Sir", "Dr", "Rev")] <- "Mr"
    data.combined$AD_Title <- AD_Title

    # Visualize the new version of Title (training rows only)
    ggplot(data.combined[1:891, ], aes(x = AD_Title, y = after_stat(count), fill = factor(Survived))) +
      geom_bar(stat = "count", position = "stack") +
      xlab("Title") +
      ylab("Count") +
      ggtitle("How Title impact survivor") +
      scale_fill_discrete(name = "Survived", breaks = c(0, 1), labels = c("Perish", "Survived")) +
      geom_text(stat = "count", aes(label = after_stat(count)), position = position_stack(vjust = 0.5)) +
      theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom")

    # Check that Title and Sex agree
    table(data.combined$Sex, data.combined$AD_Title)
             Master Miss Miss.  Mr Mrs
      female      0  260     4   1 201
      male       61    0     0 782   0

    # Reconcile Title with Sex: a female passenger labelled "Mr" becomes "Mrs"
    indexes <- which(data.combined$AD_Title == "Mr" &
                     data.combined$Sex == "female")

    data.combined$AD_Title[indexes] <- "Mrs"
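The title clean-up above can also be written as a single lookup table, which keeps the mapping in one place. The following is only a minimal sketch on made-up sample values: `titles`, `sex`, `title_map` and `normalize_title` are hypothetical names, not part of the original analysis. It maps Ms and Mlle to "Miss" without the trailing period, on the assumption that the four "Miss." rows in the table were meant to be merged with "Miss".

```r
# Toy values standing in for data.combined$Title and data.combined$Sex (hypothetical sample).
titles <- c("Mr", "Mlle", "Dr", "Lady", "Master", "Ms", "Miss")
sex    <- c("male", "female", "female", "female", "male", "female", "female")

# Named vector: rare title -> common title.
title_map <- c(Dona = "Mrs", `the Countess` = "Mrs", Mme = "Mrs", Lady = "Mrs",
               Ms = "Miss", Mlle = "Miss",
               Jonkheer = "Mr", Don = "Mr", Col = "Mr", Capt = "Mr",
               Major = "Mr", Sir = "Mr", Dr = "Mr", Rev = "Mr")

normalize_title <- function(title, sex) {
  mapped <- unname(title_map[title])           # NA when the title is not in the map
  out <- ifelse(is.na(mapped), title, mapped)  # keep unmapped titles unchanged
  out[out == "Mr" & sex == "female"] <- "Mrs"  # reconcile the title with Sex
  out
}

ad_title <- normalize_title(titles, sex)
print(table(sex, ad_title))
```

Because the female "Dr" is first mapped to "Mr" and then corrected to "Mrs", the sketch reproduces both steps of the original code in one function.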

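The higher the fare paid, the higher the survival rate

    # Density of Fare by survival outcome (training rows only)
    ggplot(data.combined[1:891, ], aes(x = Fare, fill = factor(Survived))) +
      geom_density(alpha = 0.5) +
      labs(title = "How Fare impact survivor", x = "Fare", fill = "Survived")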

Feature extraction

    # Explore new features: number of passengers sharing a ticket, and fare per person

    ticket.party.size <- rep(0, nrow(data.combined))
    avg.fare <- rep(0.0, nrow(data.combined))
    tickets <- unique(data.combined$Ticket)

    for (i in seq_along(tickets)) {
      current.ticket <- tickets[i]
      party.indexes <- which(data.combined$Ticket == current.ticket)
      # Fare is recorded per ticket, so divide it by the number of ticket holders
      current.avg.fare <- data.combined[party.indexes[1], "Fare"] / length(party.indexes)

      for (k in seq_along(party.indexes)) {
        ticket.party.size[party.indexes[k]] <- length(party.indexes)
        avg.fare[party.indexes[k]] <- current.avg.fare
      }
    }

    data.combined$ticket.party.size <- ticket.party.size
    data.combined$avg.fare <- avg.fare

    # Missing values in avg.fare
    data.combined[which(is.na(data.combined$avg.fare)), ]
         PassengerId Survived Pclass               Name  Sex  Age SibSp Parch Ticket Fare Cabin Embarked Title family.size Ticket.first.char
    1044        1044     None      3 Storey, Mr. Thomas male 60.5     0     0   3701   NA     U        S    Mr           1                 3
         cabin.first.char AD_Title ticket.party.size avg.fare
    1044                U       Mr                 1       NA

    # Fortunately only one avg.fare value is missing; fill it with the median
    # among passengers with similar attributes
    indexes <- with(data.combined, which(Pclass == "3" & Title == "Mr" & family.size == 1 &
                                         Ticket != "3701"))
    similar.na.passengers <- data.combined[indexes, ]
    summary(similar.na.passengers$avg.fare)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
      0.000   7.250   7.840   7.717   8.050  10.171

    data.combined[is.na(data.combined$avg.fare), "avg.fare"] <- 7.840

    # Correlation between ticket.party.size and avg.fare
    preproc.data.combined <- data.combined[, c("ticket.party.size", "avg.fare")]
    preProc <- preProcess(preproc.data.combined, method = c("center", "scale"))

    postproc.data.combined <- predict(preProc, preproc.data.combined)

    cor(postproc.data.combined$ticket.party.size, postproc.data.combined$avg.fare)
    [1] 0.09428625

A correlation of about 0.09 means the two new features are only weakly related, so each can contribute separate information to the model.
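The nested loops above can be replaced by grouped operations. The following is a minimal sketch using base R's `ave()` on a made-up data frame: `toy` and its values are hypothetical, and the missing value is filled with the plain median rather than the "similar passengers" median used in the post. It also checks that centering and scaling do not change the Pearson correlation, so the `preProcess` step does not affect the reported coefficient.

```r
# Hypothetical toy data: Fare is the total ticket price, repeated for each holder.
toy <- data.frame(
  Ticket = c("A1", "A1", "B2", "C3", "C3", "C3", "D4"),
  Fare   = c(20, 20, 7.5, 30, 30, 30, NA),
  stringsAsFactors = FALSE
)

# Party size: number of rows sharing each ticket.
toy$ticket.party.size <- ave(seq_along(toy$Ticket), toy$Ticket, FUN = length)

# Per-person fare: first Fare on the ticket divided by the party size.
first.fare <- ave(toy$Fare, toy$Ticket, FUN = function(x) x[1])
toy$avg.fare <- first.fare / toy$ticket.party.size

# Fill the missing avg.fare with the median of the observed values
# (simplification: the post uses the median among similar passengers).
toy$avg.fare[is.na(toy$avg.fare)] <- median(toy$avg.fare, na.rm = TRUE)
print(toy)

# Pearson correlation is unchanged by centering and scaling.
raw    <- cor(toy$ticket.party.size, toy$avg.fare)
scaled <- cor(scale(toy$ticket.party.size)[, 1], scale(toy$avg.fare)[, 1])
print(c(raw = raw, scaled = scaled))
```

On the full dataset the grouped version avoids rescanning the Ticket column once per unique ticket, which is what the loop does with `which()`.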

Model design & prediction

    # Random forest
    library(randomForest)
    library(rattle)   # provides fancyRpartPlot(), used for the tree plot below

    # Store AD_Title as a factor: randomForest can fail on character predictors
    data.combined$AD_Title <- factor(data.combined$AD_Title)

    # Rows 1-891 are the labelled training data (the variable is named test.subset in the original)
    test.subset <- data.combined[1:891, ]
    set.seed(1234)
    forest_unit1 <- randomForest(factor(Survived) ~ Pclass + AD_Title + ticket.party.size + avg.fare,
                                 data = test.subset,
                                 importance = TRUE,
                                 ntree = 1000)
    varImpPlot(forest_unit1)

    forest_unit1
    Call:
     randomForest(formula = factor(Survived) ~ Pclass + AD_Title + ticket.party.size + avg.fare, data = test.subset, importance = TRUE, ntree = 1000)
                   Type of random forest: classification
                         Number of trees: 1000
    No. of variables tried at each split: 2

            OOB estimate of  error rate: 16.16%
    Confusion matrix:
        0   1 class.error
    0 504  45  0.08196721
    1  99 243  0.28947368

    # Rows 892-1309 are the unlabelled Kaggle test set
    validate_subset <- data.combined[892:1309, ]

    # Predict on the test set
    prediction <- predict(forest_unit1, validate_subset)

    # Save the result as a data frame in the format Kaggle requires
    # [two columns: PassengerId and Survived (prediction)]
    solution <- data.frame(PassengerId = validate_subset$PassengerId, Survived = prediction)

    # Write the result to a file
    write.csv(solution, file = "mod_Solution1.csv", row.names = F)

    # Decision tree
    test.subset <- data.combined[1:891, ]
    set.seed(1234)
    Tree_fit1 <- rpart(factor(Survived) ~ Pclass + AD_Title + ticket.party.size + avg.fare,
                       data = test.subset,
                       method = "class")

    validate_subset <- data.combined[892:1309, ]

    # Predict on the test set
    prediction_new_feature <- predict(Tree_fit1, validate_subset, type = "class")

    # Save the result as a data frame in the format Kaggle requires
    # [two columns: PassengerId and Survived (prediction)]
    solution <- data.frame(PassengerId = validate_subset$PassengerId, Survived = prediction_new_feature)

    # Write the result to a file
    write.csv(solution, file = "mod_Solution2.csv", row.names = F)
    fancyRpartPlot(Tree_fit1, sub = "Decision tree")
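The error figures in the random forest printout follow directly from the confusion matrix. As a quick check, the sketch below retypes the four reported counts by hand (the object names `conf`, `class.error` and `oob.error` are mine) and recomputes the per-class and overall out-of-bag error.

```r
# Confusion matrix reported for forest_unit1 (rows: actual class, columns: predicted class).
conf <- matrix(c(504,  45,
                  99, 243),
               nrow = 2, byrow = TRUE,
               dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

class.error <- 1 - diag(conf) / rowSums(conf)   # share of each actual class that is misclassified
oob.error   <- 1 - sum(diag(conf)) / sum(conf)  # overall share misclassified: (45 + 99) / 891

print(round(class.error, 4))
print(round(100 * oob.error, 2))
```

The model misclassifies about 29% of actual survivors but only about 8% of non-survivors, so most of the remaining error comes from survivors predicted to perish.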

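    # Compare the two models: share of test passengers on which their predictions differ
    # (the label says "conditional random forest", but forest_unit1 is a standard randomForest model)
    cat("Difference ratio between Tree and conditional random forest:",
        sum(prediction_new_feature != prediction) / nrow(validate_subset))
    Difference ratio between Tree and conditional random forest: 0.01196172

The two models disagree on only about 1.2% of the 418 test passengers, that is, on 5 of them.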

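Ensemble

    # Combine the two sets of predictions.
    # as.numeric() on a factor returns level codes (1 or 2), so subtracting 2
    # from the sum gives the number of models that predict survival (0, 1 or 2).
    ensemble <- as.numeric(prediction_new_feature) + as.numeric(prediction) - 2
    # Halve and round. R rounds 0.5 to 0, so when the models disagree the result is 0.
    ensemble <- round(ensemble / 2)
    submission <- data.frame(PassengerId = validate_subset$PassengerId, Survived = ensemble)
    # Write the result to a file
    write.csv(submission, file = "mod_Solution3.csv", row.names = F)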
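With only two models, the averaging step has a consequence that is easy to miss: R rounds 0.5 to the nearest even number, so a split vote is resolved as 0. The sketch below illustrates this on made-up predictions (`pred.tree` and `pred.forest` are hypothetical stand-ins for `prediction_new_feature` and `prediction`) and shows an equivalent, more explicit form.

```r
# Hypothetical predictions from two models, as factors with levels "0" and "1".
pred.tree   <- factor(c("0", "1", "1", "0"), levels = c("0", "1"))
pred.forest <- factor(c("0", "1", "0", "1"), levels = c("0", "1"))

# Level codes are 1 and 2, so the sum minus 2 counts the votes for "survived".
votes <- as.numeric(pred.tree) + as.numeric(pred.forest) - 2

# round(0.5) is 0 in R (round half to even), so a split vote becomes 0.
ensemble <- round(votes / 2)
print(ensemble)

# Equivalent, more explicit rule: predict survival only when both models do.
both <- as.integer(pred.tree == "1" & pred.forest == "1")
print(both)
```

In other words, the ensemble in this post is a logical AND of the two models rather than a true majority vote. A third model would be needed for ties to be broken by majority.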

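Once the files are written, you can upload them to Kaggle to see how they rank.

Competition page: Titanic: Machine Learning from Disaster

Summary

After extracting new features and combining them with the original ones, fitting both a random forest and a decision tree, and then "blending" the two sets of predictions, the submission moved the ranking from about the top 26% to about the top 7.9%. The gain in accuracy is small, but the jump in rank is large. Learning works the same way: the road is long and each day's progress is limited, but over time it adds up to something considerable.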

