Kaggle Data Analysis: Titanic

1. Background

On April 10, 1912, the luxury liner Titanic, hailed as a "miracle of industrial history", set out on her maiden voyage from Southampton, England, bound for New York. At 23:40 on April 14, 1912, the Titanic struck an iceberg; the hull gradually broke in two and sank into the Atlantic. Of the 2,224 crew and passengers aboard, more than 1,500 lost their lives. The poignant love story of Rose and Jack, and the line "You jump, I jump", are set against this disaster. In this Kaggle competition (kaggle.com/c/titanic), we analyze the passenger features in the training set (train), use machine learning algorithms to build a model predicting survival (1 = survived, 0 = died), and then predict the survival status of the passengers in the test set (test).

2. Data Import

2.1 Loading Packages

library(ggplot2)  # visualization
library(ggthemes) # theme_few(), used in the plots below
library(dplyr)    # data manipulation
library(mice)     # missing-value imputation
library(caret)    # model training
library(e1071)    # machine learning
library(party)    # decision trees; cforest is used because randomForest
                  # caps factors at 53 levels, which the Family feature
                  # added later would exceed
library(gmodels)  # CrossTable()
library(ROCR)     # ROC curves
library(C50)      # C5.0 decision trees

2.2 Data Preview

# import the data
train <- read.csv("train.csv", stringsAsFactors = FALSE)
test  <- read.csv("test.csv",  stringsAsFactors = FALSE)
str(train)
str(test)

# set the Survived variable aside
train_labels <- train$Survived
train$Survived <- NULL

# combine the two data sets
all <- rbind(train, test)
all$PassengerId <- NULL
str(all)

> str(all)
'data.frame': 1309 obs. of 10 variables:
 $ Pclass  : int 3 1 3 1 3 3 1 3 3 2 ...
 $ Name    : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex     : chr "male" "female" "female" "female" ...
 $ Age     : num 22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp   : int 1 1 0 1 0 0 0 3 0 1 ...
 $ Parch   : int 0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket  : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare    : num 7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin   : chr "" "C85" "" "C123" ...
 $ Embarked: chr "S" "C" "S" "S" ...

As shown above, the combined data set contains 10 variables and 1,309 rows, of which 891 are training data and 418 are test data.

2.3 Variable Descriptions

PassengerId: integer; the passenger's index, of no use for prediction
Survived: integer; survival status, survived (1) or died (0)
Pclass: integer; cabin class, where 1 is the highest class and 3 the lowest. Higher-class cabins sat higher in the ship, making escape easier, and their passengers tended to be society figures
Name: character; the passenger's name. It carries useful information: the title partly reflects sex and social standing, and the surname can link members of one family
Sex: character; the passenger's sex. Women and children were given priority in the evacuation, so this should relate to survival
Age: integer; the passenger's age, a meaningful variable
SibSp: integer; number of siblings/spouses aboard
Parch: integer; number of parents/children aboard
Ticket: character; the ticket number, with no obviously extractable information for now
Fare: numeric; the ticket price
Cabin: character; the cabin number
Embarked: character; the port of embarkation

We must predict the dependent variable, survival status (Survived), from these data; ten independent variables are available.

3. Data Preparation and Initial Exploration

3.1 Data Cleaning

Check the missing-data situation:

sapply(all, function(x) sum(is.na(x)))
sapply(all, function(x) sum(x == "", na.rm = TRUE))

> sapply(all, function(x) sum(is.na(x)))
  Pclass     Name      Sex      Age    SibSp    Parch   Ticket     Fare    Cabin Embarked
       0        0        0      263        0        0        0        1        0        0
> sapply(all, function(x) sum(x == "", na.rm = TRUE))
  Pclass     Name      Sex      Age    SibSp    Parch   Ticket     Fare    Cabin Embarked
       0        0        0        0        0        0        0        0     1014        2

Age, Cabin, Fare, and Embarked contain missing data: Fare is missing 1 value, Embarked 2, Cabin 1,014, and Age a substantial 263. Missing data fall into three categories: missing completely at random, missing at random, and missing not at random. Most treatments assume one of the first two; when few values are missing, the mean, median, or mode can fill them in. When many values are missing, the treatment is more involved and the effect of imputation on the original data must be considered. A blog post on the topic was consulted here.
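Before imputing, it can help to see the joint missingness pattern. Below is a minimal sketch using md.pattern() from the already-loaded mice package; note that the empty strings in Cabin and Embarked must first be recoded to NA for the function to see them:

# recode empty strings as NA in a temporary copy so md.pattern() counts them
miss_check <- all[c("Age", "Fare", "Cabin", "Embarked")]
miss_check[!is.na(miss_check) & miss_check == ""] <- NA
md.pattern(miss_check)  # each row is a missingness pattern; margins give counts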

3.1.1 Filling the Missing Fare Value

First, find the passenger whose Fare is missing:

all[is.na(all$Fare), ]
# returns row 1044

The passenger record shows a third-class passenger who embarked at S. Based on these two features, we examine the fare distribution of passengers sharing them and use it to fill in this passenger's fare.

ggplot(all[all$Pclass == 3 & all$Embarked == "S", ], aes(x = Fare)) +
  geom_density(fill = "#99d6ff", alpha = 0.4) +
  geom_vline(aes(xintercept = median(Fare, na.rm = TRUE)),
             colour = "red", linetype = "dashed", lwd = 1)

Here we fill the gap with the median:

all$Fare[1044] <- median(all[all$Pclass == 3 & all$Embarked == "S", ]$Fare,
                         na.rm = TRUE)
all$Fare[1044]
[1] 8.05

3.1.2 Filling the Missing Embarked Values

In general, passengers in different cabin classes boarded at different ports, so we use cabin class to fill in the port of embarkation:

tapply(all$Embarked, all$Pclass, median, na.rm = TRUE)
  1   2   3
"S" "S" "S"
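Applying median() to a character vector is fragile (it errors on even-length groups), so as a cross-check, a simple count table makes the same point; a sketch:

# ports of embarkation by cabin class; "S" dominates every class,
# and the "" row locates the two passengers with a missing port
table(all$Embarked, all$Pclass)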

We find that in every cabin class most passengers embarked at "S", so we locate the passengers whose port is missing and fill it with "S":

all[all$Embarked == "", ]$Embarked <- "S"

3.1.3 Extracting Information from Name

Name contains a title, such as Mrs or Miss, along with some rarer honorifics, which partly reflect social standing.

head(all$Name)
[1] "Braund, Mr. Owen Harris"
[2] "Cumings, Mrs. John Bradley (Florence Briggs Thayer)"
[3] "Heikkinen, Miss. Laina"
[4] "Futrelle, Mrs. Jacques Heath (Lily May Peel)"
[5] "Allen, Mr. William Henry"
[6] "Moran, Mr. James"

Applying strsplit twice extracts the title:

n <- nrow(all)
all$Title <- rep(NA, n)
for (i in 1:n) {
  # the part after the comma holds the title and given names
  lastname <- strsplit(all$Name[i], ", ")[[1]][2]
  all$Title[i] <- strsplit(lastname, ". ")[[1]][1]
}
all$Title[all$Title %in% c("Ms", "Mlle")] <- "Miss"
all$Title[all$Title %in% c("Jonkheer")] <- "Master"
all$Title[all$Title %in% c("Mme")] <- "Mrs"
all$Title[all$Title %in% c("Capt", "Don", "Major", "Col", "Sir")] <- "Mr"
all$Title[all$Title != "Mr" & all$Title != "Miss" &
          all$Title != "Mrs" & all$Title != "Master"] <- "Rare"

> table(all$Title, all$Sex)
         female male
  Master      0   62
  Miss      264    0
  Mr          0  766
  Mrs       198    0
  Rare        4   15
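The raw titles can also be pulled out in a single vectorized step with a regular expression, as the referenced kernels do; a sketch (run it before the recoding above if you want to compare the raw titles):

# strip everything up to ", " and everything from the first "." onward
title_check <- gsub("(.*, )|(\\..*)", "", all$Name)
table(title_check)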

3.1.4 Counting Cabins per Passenger (TicSize)

Count the number of cabins each passenger holds.

all$TicSize <- NA
for (i in 1:n) {
  if (nchar(all$Cabin[i]) == 0) {
    all$TicSize[i] <- 0
  } else {
    s <- strsplit(all$Cabin[i], " ")
    all$TicSize[i] <- length(s[[1]])
  }
}
table(all$TicSize)

> table(all$TicSize)
   0    1    2    3    4
1014  254   26   10    5
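The loop has a vectorized equivalent using lengths(); a sketch that should reproduce the same counts:

# number of space-separated cabin codes; empty Cabin strings give 0
tic_check <- ifelse(all$Cabin == "", 0, lengths(strsplit(all$Cabin, " ")))
table(tic_check)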

3.1.5 Extracting Deck Information

> head(all$Cabin, 20)
 [1] ""     "C85"  ""     "C123" ""     ""     "E46"  ""     ""     ""     "G6"   "C103"
[13] ""     ""     ""     ""     ""     ""     ""     ""

We can guess that letters such as E, C, and G denote different areas of the ship, and that when the Titanic struck the iceberg, only passengers in certain areas could escape easily. Passengers without cabin information are all relabeled X.

all$Deck <- sapply(all$Cabin, function(x) strsplit(x, NULL)[[1]][1])
all$Deck[is.na(all$Deck)] <- "X"
table(all$Deck)

> table(all$Deck)
   A    B    C    D    E    F    G    T    X
  22   65   94   46   41   21    5    1 1014

3.1.6 Multiple Imputation of Age

Select the variables used for imputation:

var_fac <- c("Pclass", "Sex", "Embarked", "Title", "Deck")
all[var_fac] <- lapply(all[var_fac], function(x) as.factor(x))
mice_all <- all[c("Pclass", "Sex", "Age", "SibSp", "Fare", "Embarked",
                  "Title", "Deck")]

We impute with the mice package. Several methods were tried ("rf" also performed well; "pmm" is used below), and the distribution of Age shows no obvious change before and after imputation.

set.seed(111)
mice_data <- mice(data = mice_all, m = 5, method = "pmm")
mice_output <- complete(mice_data)

par(mfrow = c(1, 2))
hist(all$Age, freq = FALSE, main = "Age: Original Data",
     col = "darkgreen", ylim = c(0, 0.04))
hist(mice_output$Age, freq = FALSE, main = "Age: Output of MICE",
     col = "lightgreen", ylim = c(0, 0.04))

all$Age <- mice_output$Age
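mice also ships a lattice-based density comparison of observed versus imputed values, which is a quicker check than paired histograms; a sketch:

# blue = observed Age, red = imputed Age, one curve per imputation (m = 5)
densityplot(mice_data, ~ Age)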

3.2 Feature Engineering

3.2.1 Sex

From the background of the Titanic evacuation, women had priority, so the Sex variable should clearly help in predicting survival. The data bear this out: most women survived (233/(233+81) = 74.20%), while only a small share of men did (109/(109+468) = 22.85%). The plot shows that women fared markedly better than men.

# re-attach the training labels for the exploratory plots (test rows get NA)
all$Survived <- c(train_labels, rep(NA, 418))

ggplot(all[1:891, ], aes(Sex, fill = factor(Survived))) +
  geom_histogram(stat = "count", position = "dodge") +
  labs(x = "The influence of Sex") +
  theme_few()
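The survival proportions quoted above can be verified directly with a cross-tabulation; a minimal sketch:

# row-wise proportions: survival rate by sex on the training rows
prop.table(table(all$Sex[1:891], all$Survived[1:891]), margin = 1)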

3.2.2 Age

By the "women and children first" rule, minors should have had a better chance of survival. As the plot below shows, survivors do outnumber the dead among passengers under 18, while among young and middle-aged adults deaths far exceed survivals. Age can therefore provisionally be kept as a model feature.

all$Child <- "Child"
all$Child[all$Age > 18] <- "Adult"
table(all$Child, all$Survived)

ggplot(all[1:891, ], aes(Child, fill = factor(Survived))) +
  geom_histogram(stat = "count", position = "dodge") +
  labs(x = "The influence of Age") +
  theme_few()

The data show that minors survived at a higher rate than adults.

3.2.3 Title

ggplot(all[1:891, ], aes(Title, fill = factor(Survived))) +
  geom_histogram(stat = "count", position = "dodge") +
  labs(x = "The influence of Title") +
  theme_few()

Survival is low for Mr and high for Miss and Mrs; of course, this result may overlap with Sex.

3.2.4 Pclass

ggplot(all[1:891, ], aes(Pclass, fill = factor(Survived))) +
  geom_histogram(stat = "count", position = "dodge") +
  labs(x = "The influence of Pclass") +
  theme_few()

The lower the cabin class, the deeper in the hull it lay; higher-class cabins were closer to the deck, making escape easier. Higher survival in the higher classes therefore matches common sense.

3.2.5 Family Size (FamilyScale)

all$FamilyScale <- all$SibSp + all$Parch + 1

ggplot(all[1:891, ], aes(FamilyScale, fill = factor(Survived))) +
  geom_histogram(stat = "count", position = "dodge") +
  labs(x = "The influence of FamilyScale") +
  theme_few()

Passengers travelling alone (family size 1) died at a higher rate; families of 2 to 4 survived best; larger families again fared poorly.
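This three-band pattern suggests discretizing family size, as one of the referenced kernels does. A hedged sketch of an optional extra feature (FamilyBand is a name introduced here and is not used in the models below):

# collapse family size into the three bands visible in the plot
all$FamilyBand <- cut(all$FamilyScale, breaks = c(0, 1, 4, Inf),
                      labels = c("single", "small", "large"))
table(all$FamilyBand[1:891], all$Survived[1:891])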

3.2.6 Cabin Count (TicSize)

ggplot(all[1:891, ], aes(TicSize, fill = factor(Survived))) +
  geom_histogram(stat = "count", position = "dodge") +
  labs(x = "The influence of TicSize") +
  theme_few()

Passengers without a recorded cabin died at a higher rate.

3.2.7 Fare

ggplot(all[1:891, ], aes(Fare, fill = factor(Survived))) +
  geom_histogram()

Passengers who paid lower fares survived at a lower rate.

3.2.8 Embarked

ggplot(all[1:891, ], aes(Embarked, fill = factor(Survived))) +
  geom_histogram(stat = "count") +
  labs(x = "The influence of Embarked")

Passengers who embarked at S survived at a lower rate.

3.2.9 Deck

ggplot(all[1:891, ], aes(Deck, fill = factor(Survived))) +
  geom_histogram(stat = "count") +
  labs(x = "The influence of Deck")

Decks B, D, E, and F show higher survival rates.

3.2.10 Family

all$Surname <- sapply(all$Name, function(x) strsplit(x, ",")[[1]][1])
all$Family <- paste(all$Surname, all$FamilyScale, sep = "-")
all$Family[all$FamilyScale < 2] <- "Rare"
sort(table(all$Family), decreasing = TRUE)

# keep only family categories present in both train and test
for (i in 1:n) {
  if (!(all[i, ]$Family %in% all$Family[1:891] &
        all[i, ]$Family %in% all$Family[892:1309])) {
    all[i, ]$Family <- "Rare"
  }
}

Abroad, two people sharing a surname may belong to the same extended family, and if their family sizes also match, they are very likely one household. In a disaster, the members of one family tend to survive or perish together, so we abstract this family category as a feature for later modelling. Family categories that appear in both train and test are kept; all others are set to Rare.

4. Modelling and Prediction

4.1 Training and Test Sets

train_var_factor <- c("Pclass", "Sex", "Embarked", "Title", "Deck",
                      "Child", "Family")
all[train_var_factor] <- lapply(all[train_var_factor], function(x) as.factor(x))

train_var <- c("Pclass", "Sex", "Fare", "TicSize", "Embarked",
               "Title", "Deck", "Age", "Family", "FamilyScale")
train_data <- cbind(all[train_var][1:891, ], train_labels)
train_labels <- NULL
str(train_data)
test_data <- all[train_var][892:1309, ]

4.2 Modelling

4.2.1 Logistic Regression

fit_logit <- glm(train_labels ~ ., data = train_data,
                 family = binomial(link = "logit"))
ans_logit <- round(fit_logit$fitted.values)

# evaluate the model on the training set
CrossTable(ans_logit, train_data$train_labels)
confusionMatrix(train_data$train_labels, ans_logit, positive = "1")

> CrossTable(ans_logit, train_data$train_labels)

   Cell Contents
|-------------------------|
|                       N |
| Chi-square contribution |
|           N / Row Total |
|           N / Col Total |
|         N / Table Total |
|-------------------------|

Total Observations in Table: 891

             | train_data$train_labels
   ans_logit |         0 |         1 | Row Total |
-------------|-----------|-----------|-----------|
           0 |       494 |        69 |       563 |
             |    62.378 |   100.132 |           |
             |     0.877 |     0.123 |     0.632 |
             |     0.900 |     0.202 |           |
             |     0.554 |     0.077 |           |
-------------|-----------|-----------|-----------|
           1 |        55 |       273 |       328 |
             |   107.069 |   171.874 |           |
             |     0.168 |     0.832 |     0.368 |
             |     0.100 |     0.798 |           |
             |     0.062 |     0.306 |           |
-------------|-----------|-----------|-----------|
Column Total |       549 |       342 |       891 |
             |     0.616 |     0.384 |           |
-------------|-----------|-----------|-----------|

> confusionMatrix(train_data$train_labels, ans_logit, positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 494  55
         1  69 273

               Accuracy : 0.8608
                 95% CI : (0.8363, 0.8829)
    No Information Rate : 0.6319
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.7035
 Mcnemar's Test P-Value : 0.243

            Sensitivity : 0.8323
            Specificity : 0.8774
         Pos Pred Value : 0.7982
         Neg Pred Value : 0.8998
             Prevalence : 0.3681
         Detection Rate : 0.3064
   Detection Prevalence : 0.3838
      Balanced Accuracy : 0.8549

       'Positive' Class : 1

Family levels that appear in only one of the training and test sets were pooled into Rare earlier (3.2.10) so that the variable's levels match across the two sets. Checking the fitted model on the training set first: accuracy is 86.1%, with somewhat better predictions for the deceased than for the survivors.
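Training-set accuracy tends to flatter a model. Since caret is already loaded, a 10-fold cross-validation gives a fairer estimate; a minimal sketch under stated assumptions (Family is dropped because rare factor levels, such as Deck "T", can break individual folds):

# 10-fold cross-validation of the logistic model
cv_data <- train_data
cv_data$train_labels <- factor(cv_data$train_labels,
                               labels = c("Died", "Survived"))
cv_data$Family <- NULL  # rare levels may be absent from a training fold
set.seed(123)
cv_fit <- train(train_labels ~ ., data = cv_data,
                method = "glm", family = binomial,
                trControl = trainControl(method = "cv", number = 10))
cv_fit$results  # cross-validated Accuracy and Kappa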

Plot the ROC curve for the generalized linear model:

pred <- prediction(predictions = fit_logit$fitted.values,
                   labels = train_data$train_labels)
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, main = "ROC curve for Survived Prediction of Titanic",
     col = "blue", lwd = 3)
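The area under this curve can be read from the same prediction object; a short sketch:

# AUC of the logistic model on the training set
performance(pred, measure = "auc")@y.values[[1]]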

4.2.2 Boosted Decision Tree (C5.0)

# C5.0 boosted decision tree
train_data_C50 <- train_data
train_data_C50$train_labels <- factor(train_data_C50$train_labels)

set.seed(100)
fit_C50 <- C5.0(train_labels ~ ., data = train_data_C50, trials = 10)
ans_C50 <- predict(fit_C50, train_data_C50)
CrossTable(ans_C50, train_data_C50$train_labels)
confusionMatrix(train_data$train_labels, ans_C50, positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 502  47
         1  88 254

               Accuracy : 0.8485
                 95% CI : (0.8232, 0.8714)
    No Information Rate : 0.6622
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6723
 Mcnemar's Test P-Value : 0.000576

            Sensitivity : 0.8439
            Specificity : 0.8508
         Pos Pred Value : 0.7427
         Neg Pred Value : 0.9144
             Prevalence : 0.3378
         Detection Rate : 0.2851
   Detection Prevalence : 0.3838
      Balanced Accuracy : 0.8474

       'Positive' Class : 1

Training-set accuracy is only 0.849, somewhat lower than the generalized linear model's.

4.2.3 Random Forest

rf_ctrl configures the forest: at most 500 trees, with 5 randomly sampled features per split.

rf_ctrl <- cforest_control(ntree = 500, mtry = 5)
set.seed(112)
fit_rf <- cforest(train_labels ~ ., data = train_data_C50, controls = rf_ctrl)
ans_rf <- predict(fit_rf, train_data_C50, OOB = TRUE, type = "response")
CrossTable(train_data_C50$train_labels, ans_rf, chisq = FALSE)
confusionMatrix(train_data$train_labels, ans_rf, positive = "1")

Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 506  43
         1  93 249

               Accuracy : 0.8474
                 95% CI : (0.822, 0.8704)
    No Information Rate : 0.6723
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 0.6682
 Mcnemar's Test P-Value : 2.649e-05

            Sensitivity : 0.8527
            Specificity : 0.8447
         Pos Pred Value : 0.7281
         Neg Pred Value : 0.9217
             Prevalence : 0.3277
         Detection Rate : 0.2795
   Detection Prevalence : 0.3838
      Balanced Accuracy : 0.8487

       'Positive' Class : 1

The random forest's training-set accuracy is 0.847, currently the lowest of the three algorithms. Next, we use the three models to predict the test set and submit to see the final scores.
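Before predicting, party's varimp() can show which features the conditional forest leans on; a sketch (permutation importance, which can be slow with 500 trees):

# permutation-based variable importance for the conditional forest
imp <- varimp(fit_rf)
sort(imp, decreasing = TRUE)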

4.3 Prediction and Evaluation

ans_logit <- predict(fit_logit, test_data, type = "response")
ans_C50   <- predict(fit_C50, test_data, type = "class")
ans_rf    <- predict(fit_rf, test_data, OOB = TRUE, type = "response")

solution_logit <- data.frame(PassengerID = test$PassengerId,
                             Survived = round(ans_logit))
solution_C50   <- data.frame(PassengerID = test$PassengerId,
                             Survived = ans_C50)
solution_rf    <- data.frame(PassengerID = test$PassengerId,
                             Survived = ans_rf)

write.csv(solution_logit, file = "logit_mod_solution.csv", row.names = FALSE)
write.csv(solution_C50,   file = "C5.0_mod_solution.csv",  row.names = FALSE)
write.csv(solution_rf,    file = "randomForest_mod_solution.csv", row.names = FALSE)

The logistic regression scores 0.77 on the leaderboard, the boosted C5.0 model 0.801, and the random forest 0.818, for a rank of 278.

The overall approach and workflow draw on three existing Kaggle kernels:

Predictive Analysis of Survival Rate on Titanic

Feature Engineering on the Titanic for 0.81339

Exploring Survival on the Titanic
