人類史上最大惡性海難-TITANIC沉船事故

05-03

我以前和你一樣也是個冒險家，直到我的膝蓋中了一箭。
——《上古捲軸5：天際》

這是我第一次接觸kaggle，做得時候完全沒有頭緒和思路，不知道從哪裡入手，所以找了一個kaggler寫的文章，按照她的思路做了一遍，感謝她，讓我從中學習了很多數據分析的思路。

巧婦難為無米之炊，幹活之前肯定要先把數據都導入RStudio中，其中用到了許多包，我看了一下，用過的也就是ggplot2和dplyr，剩下的具體用到再看。（PS：提前需安裝）

ggthemes:可以理解為ggplot2的擴展包（This package contains extra themes, scales, and geoms, and functions for and related to ggplot2.）。

scales:數據標準化處理（Generic plot scaling methods）

mice:對缺失值進行填補的包（Generates Multivariate Imputations by Chained Equations (MICE)）。

randomForest：對數據進行分類並回歸（Classification and Regression with Random Forest隨機森林分類與回歸）。

# Load packages library(ggplot2) # visualizationlibrary(ggthemes) # visualizationlibrary(scales) # visualizationlibrary(dplyr) # data manipulationlibrary(mice) # imputationlibrary(randomForest) # classification algorithmtrain <- read.csv(C:/Users/Administrator/Desktop/train.csv, stringsAsFactors = F)test <- read.csv(C:/Users/Administrator/Desktop/test.csv, stringsAsFactors = F)full <- bind_rows(train, test) # bind training & test data# check datastr(full)

// read.x() bind_rows()

現在我們得到了一組合併數據，1309條，12個觀察值，分別代表乘客編號，生還情況，階級（1高，2中，3低），姓名，性別，年齡，同胞（兄弟姐妹/配偶），直系（父母/孩子），船票編號，票價，船艙號，乘船港口（乘客上船的港口不一樣）。

構造特徵值（通過這些值分析得出一些我們需要的數據）

1 名字中藏了些什麼？

# Grab title from passenger namesfull$Title <- gsub((.*, )|(\..*), , full$Name)# Show title counts by sextable(full$Sex, full$Title)

// gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,

fixed = FALSE, useBytes = FALSE) 模式的匹配和替換 *

我們從姓名中，提取出這些乘客的姓氏（頭銜），並按照性別進行統計。

#Titles with very low cell counts to be combined to "rare" levelrare_title <- c(Dona, Lady, the Countess,Capt, Col, Don, Dr, Major, Rev, Sir, Jonkheer)# Also reassign mlle, ms, and mme accordinglyfull$Title[full$Title == Mlle] <- Miss full$Title[full$Title == Ms] <- Missfull$Title[full$Title == Mme] <- Mrs full$Title[full$Title %in% rare_title] <- Rare Title# Show title counts by sex againtable(full$Sex, full$Title)

之後根據稀有度進行歸類，分為以下五類：

# Finally, grab surname from passenger namefull$Surname <- sapply(full$Name, function(x) strsplit(x, split = [,.])[[1]][1])cat(paste(We have , nlevels(factor(full$Surname)), unique surnames. I would be interested to infer ethnicity based on surname --- another time.))

// cat(... , file = "", sep = " ", fill = FALSE, labels = NULL,append = FALSE) 連接與列印 *

回答：藏了他們的身份與社會地位。

2 不求同年同月生，但求同年同月死？

這裡我們創建了新的觀察值full$Fsize，關於一個人的家庭規模（包括了同胞和直系）。以及full$Family中包含的家庭規模和姓氏（頭銜），進一步深入每名乘客的內心世界。

# Create a family size variable including the passenger themselvesfull$Fsize <- full$SibSp + full$Parch + 1# Create a family variable full$Family <- paste(full$Surname, full$Fsize, sep=_)# Use ggplot2 to visualize the relationship between family size & survivalggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) + geom_bar(stat=count, position=dodge) + scale_x_continuous(breaks=c(1:11)) + labs(x = Family Size) + theme_few()

// paste (..., sep = " ", collapse = NULL) 連接字元串

可以看出，一個人在船上（船上沒其他家人）生還的可能不大，小家庭（船上有2~4）生還幾率明顯提高，大家庭（超過4）很慘。

# Discretize family sizefull$FsizeD[full$Fsize == 1] <- singletonfull$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- smallfull$FsizeD[full$Fsize > 4] <- large# Show family size by survival using a mosaic plotmosaicplot(table(full$FsizeD, full$Survived), main=Family Size by Survival, shade=TRUE)

// mosaicplot(x, main = deparse(substitute(x)),sub = NULL, xlab = NULL, ylab = NULL,sort = NULL, off = NULL, dir = NULL,color = NULL, shade = FALSE, margin = NULL,cex.axis = 0.66, las = par("las"), border = NULL,type = c("pearson", "deviance", "FT"), ...) 鑲嵌圖 *

為了使觀測更加具有比例化，更直觀，採用鑲嵌圖來顯示。

回答：眾人拾柴火焰高，一人拾柴火不夠，一小群人拾柴剛剛好，一大群人拾柴火太大。小家庭模式更利於生還。

3 客艙藏匿的秘密？

# This variable appears to have a lot of missing valuesfull$Cabin[1:28]

似乎缺失了很多信息，比如船艙號。

# The first character is the deck. For example:strsplit(full$Cabin[2], NULL)[[1]]

// strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) 分解特徵向量的元素 *

其中第一個字母代表的甲板，我們將其拆分。

創建的甲板變數，也為我們確定了乘客位於船的哪些位置，便於發掘出更多有用的信息。

# Create a Deck variable. Get passenger deck A - F:full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))

神秘人？

# Passengers 62 and 830 are missing Embarkmentfull[c(62, 830), Embarked]

乘客62與乘客830缺少上船的地方，天外來客？

cat(paste(We will infer their values for **embarkment** based on present data that we can imagine may be relevant: **passenger class** and **fare**. We see that they paid $, full[c(62, 830), Fare][[1]][1], and $, full[c(62, 830), Fare][[1]][2], respectively and their classes are, full[c(62, 830), Pclass][[1]][1], and, full[c(62, 830), Pclass][[1]][2], . So from where did they embark?))

但是任何事物都有關聯，我們可以根據其頭銜和票價來推斷出這個人來自哪裡。

# Get rid of our missing passenger IDsembark_fare <- full %>% filter(PassengerId != 62 & PassengerId != 830)# Use ggplot2 to visualize embarkment, passenger class, & median fareggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) + geom_boxplot() + geom_hline(aes(yintercept=80), colour=red, linetype=dashed, lwd=2) + scale_y_continuous(labels=dollar_format()) + theme_few()

我們首先去除這兩個特殊的乘客，對剩下資料健全的乘客進行分析。從中可以推斷出這個人是在C上的船（上船地點-票價（因子：頭銜））

# Since their fare was $80 for 1st class, they most likely embarked from Cfull$Embarked[c(62, 830)] <- C#We』re close to fixing the handful of NA values here and there. Passenger on row 1044 has an NA Fare value.# Show row 1044full[1044, ]

而另一位缺少票價信息的乘客1044（PClass=3,Cabin=S），可以確定票價肯定在＄0~＄100。

ggplot(full[full$Pclass == 3 & full$Embarked == S, ], aes(x = Fare)) + geom_density(fill = #99d6ff, alpha=0.4) + geom_vline(aes(xintercept=median(Fare, na.rm=T)), colour=red, linetype=dashed, lwd=1) + scale_x_continuous(labels=dollar_format()) + theme_few()

進一步分析確定1044票價的區間，縮小範圍，因為＄0~＄100的範圍太大，不利於數據分析。

# Replace missing fare value with median fare for class/embarkmentfull$Fare[1044] <- median(full[full$Pclass == 3 & full$Embarked == S, ]$Fare, na.rm = TRUE)

最終得到最有可能的值為＄8.05！！！

現在來讓我們預測一下東西

# Show number of missing Age valuessum(is.na(full$Age))

我們找到缺失年齡的乘客在第263行。

# Make variables factors into factorsfactor_vars <- c(PassengerId,Pclass,Sex,Embarked, Title,Surname,Family,FsizeD)full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))# Set a random seedset.seed(129)# Perform mice imputation, excluding certain less-than-useful variables:mice_mod <- mice(full[, !names(full) %in% c(PassengerId,Name,Ticket,Cabin,Family,Surname,Survived)], method=rf)

之後我們根據因子，設置隨機種子，創建MICE模型，來進行後續的處理分析並預測其年齡。

# Save the complete output mice_output <- complete(mice_mod)#Let』s compare the results we get with the original distribution of passenger ages to ensure that nothing has gone completely awry.# Plot age distributionspar(mfrow=c(1,2))hist(full$Age, freq=F, main=Age: Original Data, col=darkgreen, ylim=c(0,0.04))hist(mice_output$Age, freq=F, main=Age: MICE Output, col=lightgreen, ylim=c(0,0.04))

下面我們來對比一下根據原始數據和進行MICE分析的輸出，可以看出大概一致。

# Replace Age variable from the mice model.full$Age <- mice_output$Age# Show new number of missing Age valuessum(is.na(full$Age))

這裡顯示處理後缺失年齡值的總條數。

下面我們來進一步分析那些依賴年齡的變數。

# First well look at the relationship between age & survivalggplot(full[1:891,], aes(Age, fill = factor(Survived))) + geom_histogram() + # I include Sex since we know (a priori) its a significant predictor facet_grid(.~Sex) + theme_few()

這裡我們輸出一張基於年齡根據性別劃分的生還人數圖。生還的大多是婦女兒童，並且以青中年為主，因為上船的人群大多以青中年為主。

# Create the column child, and indicate whether child or adultfull$Child[full$Age < 18] <- Childfull$Child[full$Age >= 18] <- Adult# Show countstable(full$Child, full$Survived)

這裡以具體數字統計了成年人與兒童的生還情況。

# Adding Mother variablefull$Mother <- Not Motherfull$Mother[full$Sex == female & full$Parch > 0 & full$Age > 18 & full$Title != Miss] <- Mother# Show countstable(full$Mother, full$Survived)

進一步觀察母親這一特殊觀察值。

# Finish by factorizing our two new factor variablesfull$Child <- factor(full$Child)full$Mother <- factor(full$Mother)

這裡將其因子化，作為因子進行最後的統計分析。

md.pattern(full)

再次確定沒有缺失值的數據，顯示缺失數據模式。

成為一個預言家

1 分離最開始的數據train和test

# Split the data back into a train set and a test settrain <- full[1:891,]test <- full[892:1309,]

2 建立預測模型（對test進行）

# Set a random seedset.seed(754)# Build the model (note: not all possible variables are used)rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FsizeD + Child + Mother, data = train)# Show model errorplot(rf_model, ylim=c(0,0.36))legend(topright, colnames(rf_model$err.rate), col=1:3, fill=1:3)

# Get importanceimportance <- importance(rf_model)varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,MeanDecreaseGini],2))# Create a rank variable based on importancerankImportance <- varImportance %>% mutate(Rank = paste0(#,dense_rank(desc(Importance))))# Use ggplot2 to visualize the relative importance of variablesggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) + geom_bar(stat=identity) + geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 4, colour = red) + labs(x = Variables) + coord_flip() + theme_few()

我們將各變數根據重要性得出結論，確定生還情況與變數之間的關係。

接下來我們開始對他們進行「宣判」!

# Predict using the test setprediction <- predict(rf_model, test)# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)# Write the solution to filewrite.csv(solution, file = rf_mod_Solution.csv, row.names = F)

做完這些之後，我特別感嘆，明白了自己和別人之間的差距，需要學習的東西太多了。就像猴子老師說的，貪多必失，慢即是快。

以後的路上，做個永遠的行動派！