預測泰坦尼克號客戶的生存率
一、載入包,檢查數據
二、探索性分析,分析各個預測變數與響應變數之間的關係
1.Age
2.Sex
3.Sex vs Age
4.Pclass vs Sex
5.Pclass vs Sex vs Age
6.Fare vs Pclass
三、數據處理與探索性分析
1.新向量:Title(來自name)
2.新向量:Family Size(來自name,SibSp,Parch)
3.處理 Embarked 缺失值
4.處理Fare 缺失值
5.處理Age 缺失值
6.新向量:Child(來自Age)
7.列出相關矩陣
四、隨即森林預測
一、載入包,檢查數據
> library(ggplot2)n> library(ggthemes)n> library(scales)n> library(dplyr)n> library(corrplot)n> library(plyr)n> library(randomForest)n> full<-bind_rows(train_titanic,test_titanic)n> str(full)nClasses 『tbl_df』, 『tbl』 and data.frame:t1309 obs. of 12 variables:n $ PassengerId: int 1 2 3 4 5 6 7 8 9 10 ...n $ Survived : int 0 1 1 1 0 0 0 0 1 1 ...n $ Pclass : int 3 1 3 1 3 3 1 3 3 2 ...n $ Name : chr "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...n $ Sex : chr "male" "female" "female" "female" ...n $ Age : num 22 38 26 35 35 NA 54 2 27 14 ...n $ SibSp : int 1 1 0 1 0 0 0 3 0 1 ...n $ Parch : int 0 0 0 0 0 0 0 1 2 0 ...n $ Ticket : chr "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...n $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...n $ Cabin : chr NA "C85" NA "C123" ...n $ Embarked : chr "S" "C" "S" "S" ...n> summary(full)n PassengerId Survived Pclass Name Sex n Min. : 1 Min. :0.0000 Min. :1.000 Length:1309 Length:1309 n 1st Qu.: 328 1st Qu.:0.0000 1st Qu.:2.000 Class :character Class :character n Median : 655 Median :0.0000 Median :3.000 Mode :character Mode :character n Mean : 655 Mean :0.3838 Mean :2.295 n 3rd Qu.: 982 3rd Qu.:1.0000 3rd Qu.:3.000 n Max. :1309 Max. :1.0000 Max. :3.000 n NAs :418 n Age SibSp Parch Ticket Fare n Min. : 0.17 Min. :0.0000 Min. :0.000 Length:1309 Min. : 0.000 n 1st Qu.:21.00 1st Qu.:0.0000 1st Qu.:0.000 Class :character 1st Qu.: 7.896 n Median :28.00 Median :0.0000 Median :0.000 Mode :character Median : 14.454 n Mean :29.88 Mean :0.4989 Mean :0.385 Mean : 33.295 n 3rd Qu.:39.00 3rd Qu.:1.0000 3rd Qu.:0.000 3rd Qu.: 31.275 n Max. :80.00 Max. :8.0000 Max. :9.000 Max. :512.329 n NAs :263 NAs :1 n Cabin Embarked n Length:1309 Length:1309 n Class :character Class :character n Mode :character Mode :character n
二、數據探索
Age vs Survived
> ggplot(full[1:891,],aes(Age,fill=factor(Survived)))+n+ geom_histogram(bins = 30)+n+ theme_few()+n+ xlab("Age")+n+ scale_fill_discrete(name="Survived")+n+ ggtitle("Age vs Survived")n
> ggplot(full[1:891,],aes(Sex,fill=factor(Survived)))+n+ geom_bar(stat = "count",position = "dodge")+n+ theme_few()+n+ xlab("Sex")+n+ ylab("Count")+n+ scale_fill_discrete(name="Survived")+n+ ggtitle("Sex vs Survived")n
> ggplot(full[1:891,],aes(Age,fill=factor(Survived)))+n+ geom_histogram(bins = 30)+n+ theme_few()+n+ xlab("Age")+n+ ylab("Count")+n+ facet_grid(.~Sex)+n+ scale_fill_discrete(name="Survived")+n+ theme_few()+n+ ggtitle("Age vs Sex vs Survived")n
> tapply(full[1:891,]$Survived,full[1:891,]$Pclass,mean)n 1 2 3 n0.6296296 0.4728261 0.2423625 nn> ggplot(full[1:891,],aes(Pclass,fill=factor(Survived)))+n+ geom_bar(stat = "count")+n+ theme_few()+n+ xlab("Pclass")+n+ facet_grid(.~Sex)+n+ ylab("Count")+n+ scale_fill_discrete(name="Survived")+n+ ggtitle("Pclass vs Sex vs Survived")n
> ggplot(full[1:891,],aes(x=Age,y=Sex))+n+ geom_jitter(aes(colour=factor(Survived)))+n+ theme_few()+n+ theme(legend.title = element_blank())+n+ facet_wrap(~Pclass)+n+ labs(x="Age",y="Sex",title="Pclass vs Sex vs Age vs Survived")+n+ scale_fill_discrete(name="Survived")+n+ scale_x_continuous(name = "Age",limits = c(0,81)) n
Fare vs Pclassn> ggplot(full[1:891,],aes(x=Fare,y=Pclass))+n+ geom_jitter(aes(colour=factor(Survived)))+n+ theme_few()+n+ theme(legend.title = element_blank())+n+ labs(x="Age",y="Pclass",title="Fare vs Pclass")+n+ scale_fill_discrete(name="Survived")+n+ scale_x_continuous(name="Fare",limits = c(0,270),breaks = c(0,40,80,120,160,200,240,280))n
0、新變數:Title,來自Name
> full$Title <- gsub((.*, )|(..*), , full$Name)n> table(full$Sex, full$Title)n n Capt Col Don Dona Dr Jonkheer Lady Major Master Miss Mlle Mme Mr Mrs Ms Revn female 0 0 0 1 1 0 1 0 0 260 2 1 0 197 2 0n male 1 4 1 0 7 1 0 2 61 0 0 0 757 0 0 8n n Sir the Countessn female 0 1n male 1 0n> officer <- c(Capt, Col, Don, Dr, Major, Rev)n> royalty <- c(Dona, Lady, the Countess,Sir, Jonkheer)n> full$Title[full$Title == Mlle]<- Missn> full$Title[full$Title == Ms]<- Missn> full$Title[full$Title == Mme]<- Mrsn> full$Title[full$Title %in% royalty]<- Royaltyn> full$Title[full$Title %in% officer] <- Officern> full$Surname <- sapply(full$Name,function(x) strsplit(x, split = [,.])[[1]][1])n> ggplot(full[1:891,], aes(Title,fill = factor(Survived))) +n+ geom_bar(stat = "count")+n+ xlab(Title) +n+ ylab("Count") +n+ scale_fill_discrete(name = " Survived") + n+ ggtitle("Title vs Survived")+n+ theme_few()n
1、新變數:家庭大小
> full$Fsize<-full$SibSp+full$Parch+1n> ggplot(full[1:891,],aes(x=Fsize,fill=factor(Survived)))+n+ geom_bar(stat="count",position = "dodge")+n+ scale_x_continuous(breaks = c(1:11))+n+ xlab("Family Size")+n+ ylab("Count")+n+ theme_few()+n+ scale_fill_discrete(name="Survived")+n+ ggtitle("Family Size vs Survived"nnn> tapply(full[1:891,]$Survived,full[1:891,]$Fsize,mean)n 1 2 3 4 5 6 7 n0.3035382 0.5527950 0.5784314 0.7241379 0.2000000 0.1363636 0.3333333 n 8 11n0.0000000 0.0000000 n
> full$FsizeD[full$Fsize==1]<-"Alone"n> full$FsizeD[full$Fsize<5&full$Fsize>1]<-"Small"n> full$FsizeD[full$Fsize>4]<-"Big"n> tapply(full[1:891,]$Survived,full[1:891,]$FsizeD,mean)n Alone Big Small n 0.3035382 0.1612903 0.5787671 nn> mosaicplot(table(full$FsizeD,full$Survived),main = "FsizeD vs Survived",ylab="Survived",xlab="FsizeD",col=hcl(c(50,120)))n
> tapply(full$Embarked,full$Pclass,median,na.rm=T)n 1 2 3 n "S" "S" "S" n> full[c(62,830),"Embarked"]nn> full$Embarked[c(62,830)]<-"S"n> ggplot(full[1:891,],aes(Pclass,fill=factor(Survived)))+n+ geom_bar(stat = "count")+n+ theme_few()+n+ xlab("Pclass")+n+ ylab("Count")+n+ facet_wrap(~Embarked)+n+ scale_fill_discrete(name="Survived")+n+ ggtitle("Embarked vs Pclass vs Survived")n
3、處理Fare缺失值
> ggplot(full[full$Pclass=="3",],aes(x=Fare))+geom_density(fill="lightgrey",alpha=0.4)+n+ geom_vline(aes(xintercept=median(Fare,na.rm = T)),col="darkred",linetype="dashed",lwd=1)+xlab("Fare")+n+ ggtitle("Pclass=3")+n+ ylab("Density")+n+ scale_x_continuous(labels = dollar_format())+n+ theme_few()nn> tapply(full$Fare,full$Pclass,median,na.rm=T)n 1 2 3 n 60.0000 15.0458 8.0500 n> full$Fare[1044]<-median(full[full$Pclass=="3",]$Fare,na.rm = T)n
4、處理Age缺失值
> tapply(full$Age,full$Pclass,median,na.rm=T)n 1 2 3 n39 29 24 n> tapply(full$Age,full$Title,median,na.rm=T)n Master Miss Mr Mrs Officer Royalty n 4 22 29 35 49 39 n> title.age<-aggregate(full$Age,by=list(full$Title),FUN=function(x)median(x,na.rm = T))n> full[is.na(full$Age),"Age"]<-apply(full[is.na(full$Age),],1,function(x)title.age[title.age[,1]==x["Title"],2])n> sum(is.na(full$Age))n[1] 0n
5、新變數 Child 來自 Age
> full$Child[full$Age<18]<-"Child"n> full$Child[full$Age>=18]<-"Adult"n> ggplot(full[1:891,][full[1:891,]$Child=="Child",],aes(Sex,fill=factor(Survived)))+n+ geom_bar(stat = "count")+n+ xlab("Sex")+n+ ylab("Count")+n+ facet_wrap(~Pclass)+n+ scale_fill_discrete(name="Survived")+n+ ggtitle("Child vs Sex vs Pclass vs Survived")+n+ theme_few()nntapply(full[1:891,]$Survived,full[1:891,]$Child, mean)ntable(full$Child, full$Survived)n
> corr_data<-full[1:891,]n> corr_data$Embarked<-revalue(corr_data$Embarked,c("S"=1,"Q"=2,"C"=3))n> corr_data$Sex<-revalue(corr_data$Sex,c("male"=1,"female"=2))n> corr_data$Title <- revalue(corr_data$Title,c("Mr" = 1, "Master" = 2,"Officer" = 3,"Mrs" = 4,"Royalty" = 5,"Miss" = 6))n> corr_data$FsizeD <- revalue(corr_data$FsizeD,c("Small" = 1, "Alone" = 2, "Big" = 3))n> corr_data$Child <- revalue(corr_data$Child,c("Adult" = 1, "Child" = 2))n> corr_data$FsizeD <- as.numeric(corr_data$FsizeD)n> corr_data$Child <- as.numeric(corr_data$Child)n> corr_data$Sex <- as.numeric(corr_data$Sex)n> corr_data$Embarked <- as.numeric(corr_data$Embarked)n> corr_data$Title <- as.numeric(corr_data$Title)n> corr_data$Pclass <- as.numeric(corr_data$Pclass)n> corr_data$Survived <- as.numeric(corr_data$Survived)n> corr_data <-corr_data[,c("Survived", "Pclass", "Sex", "FsizeD", "Fare", "Embarked","Title","Child")]n> str(corr_data)nClasses 『tbl_df』, 『tbl』 and data.frame:t891 obs. of 8 variables:n $ Survived: num 0 1 1 1 0 0 0 0 1 1 ...n $ Pclass : num 3 1 3 1 3 3 1 3 3 2 ...n $ Sex : num 1 2 2 2 1 1 1 1 2 2 ...n $ FsizeD : num 1 1 2 1 2 2 2 3 1 1 ...n $ Fare : num 7.25 71.28 7.92 53.1 8.05 ...n $ Embarked: num 1 3 1 1 1 2 1 1 1 3 ...n $ Title : num 1 4 6 4 1 1 1 2 4 4 ...n $ Child : num 1 1 1 1 1 1 1 2 1 2 ...n> mcorr_data<-cor(corr_data)n> corrplot(mcorr_data,method = "circle")n
> full$Child <- factor(full$Child)n> full$Sex <- factor(full$Sex)n> full$Embarked <- factor(full$Embarked)n> full$Title <- factor(full$Title)n> full$Pclass <- factor(full$Pclass)n> full$FsizeD <- factor(full$FsizeD)n> full1 <- full[,-9]n> full_mod <- full1[,-10]n
四、隨機森林預測
> full$Child <- factor(full$Child)n> full$Sex <- factor(full$Sex)n> full$Embarked <- factor(full$Embarked)n> full$Title <- factor(full$Title)n> full$Pclass <- factor(full$Pclass)n> full$FsizeD <- factor(full$FsizeD)n> full1 <- full[,-9]n> full_mod <- full1[,-10]n> train <- full_mod[1:891,]n> test <- full_mod[892:1309,]n# random forestn> library(randomForest)n> set.seed(123)n> rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + FsizeD + Child, data = train)n> plot(rf_model,ylim=c(0,0.36),main = "RF_MODEL")n> legend("topright",colnames(rf_model$err.rate),col=1:3,fill=1:3)n
> rf_modelnnCall:n randomForest(formula = factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + FsizeD + Child, data = train) n Type of random forest: classificationn Number of trees: 500nNo. of variables tried at each split: 2nn OOB estimate of error rate: 17.17%nConfusion matrix:n 0 1 class.errorn0 503 46 0.08378871n1 107 235 0.31286550n
importance()函數用於計算模型變數的重要性
> importance(rf_model)n MeanDecreaseGininPclass 32.424148nSex 54.414834nFare 49.627249nEmbarked 8.626723nTitle 75.665568nFsizeD 22.359556nChild 5.967436n> varImpPlot(rf_model)n
們提到在構建隨機森林模型時,模型的準確性如何,關鍵在於選擇最優的變數個數(參數mtry),在上面的例子中,我們使用的iris數據集使用了函數中該參數的默認情況(數據集變數個數的二次方根(分類模型)或三分之一(預測模型))。但一般情況下需要進行人為的逐次挑選,以確定最佳的值。
> n<-length(names(train))n> for(i in 1:(n-1)){n+ model<-randomForest(factor(Survived)~Pclass+Sex+Fare+Embarked+Title+FsizeD+Child,data = train,mtry=i)n+ err<-mean(model$err.rate)n+ print(err)n+ }n[1] 0.1867092n[1] 0.1841437n[1] 0.1889661n[1] 0.1947642n[1] 0.1820473n[1] 0.1770447n[1] 0.1756495n[1] 0.1826706n[1] 0.187006n[1] 0.1795339n[1] 0.1858961n[1] 0.1847507n[1] 0.1775304n[1] 0.1780929n
觀察結果,當mtry=7時,模型內平均誤差最小。
> rf_model2<-randomForest(factor(Survived)~Pclass+Sex+Fare+Embarked+Title+FsizeD+Child,data = train,mtry=7)n> rf_model2nnCall:n randomForest(formula = factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + FsizeD + Child, data = train, mtry = 7) n Type of random forest: classificationn Number of trees: 500nNo. of variables tried at each split: 7nn OOB estimate of error rate: 16.61%nConfusion matrix:n 0 1 class.errorn0 485 64 0.1165756n1 84 258 0.2456140nn> rf_modelnnCall:n randomForest(formula = factor(Survived) ~ Pclass + Sex + Fare + Embarked + Title + FsizeD + Child, data = train) n Type of random forest: classificationn Number of trees: 500nNo. of variables tried at each split: 2nn OOB estimate of error rate: 17.17%nConfusion matrix:n 0 1 class.errorn0 503 46 0.08378871n1 107 235 0.31286550n
我們看見 rf_model的 oob error 值降低了。
預測生存率
> rf_pred<-predict(rf_model2,test)n> solution<-data.frame(Survived=rf_pred,PassengerID=test$PassengerId)n
推薦閱讀:
※105年前的今天,泰坦尼克號上的孫中山和11隻犬
※尼可的戰艦打造:工藝戰艦泰坦尼克?
※有關泰坦尼克的真正事實?
※泰坦尼克號一層甲板價格最貴,二層次之,三層最便宜,而獲救人數一層最多,三層最少,如何看待這種問題?
※如果《泰坦尼克號》發生在現代,世界上有哪些救援手段?
TAG:R编程语言 | 「泰坦尼克号」沉没事故 | 数据挖掘 |