標籤:

人類史上最大惡性海難-TITANIC沉船事故

我以前和你一樣也是個冒險家,直到我的膝蓋中了一箭。

——《上古捲軸5:天際》

這是我第一次接觸kaggle,做得時候完全沒有頭緒和思路,不知道從哪裡入手,所以找了一個kaggler寫的文章,按照她的思路做了一遍,感謝她,讓我從中學習了很多數據分析的思路。

巧婦難為無米之炊,幹活之前肯定要先把數據都導入RStudio中,其中用到了許多包,我看了一下,用過的也就是ggplot2和dplyr,剩下的具體用到再看。(PS:提前需安裝)

ggthemes:可以理解為ggplot2的擴展包(This package contains extra themes, scales, and geoms, and functions for and related to ggplot2.)。

scales:數據標準化處理(Generic plot scaling methods)

mice:對缺失值進行填補的包(Generates Multivariate Imputations by Chained Equations (MICE))。

randomForest:對數據進行分類並回歸(Classification and Regression with Random Forest隨機森林分類與回歸)。

# Load packages library(ggplot2) # visualizationlibrary(ggthemes) # visualizationlibrary(scales) # visualizationlibrary(dplyr) # data manipulationlibrary(mice) # imputationlibrary(randomForest) # classification algorithmtrain <- read.csv(C:/Users/Administrator/Desktop/train.csv, stringsAsFactors = F)test <- read.csv(C:/Users/Administrator/Desktop/test.csv, stringsAsFactors = F)full <- bind_rows(train, test) # bind training & test data# check datastr(full)

// read.x() bind_rows()

現在我們得到了一組合併數據,1309條,12個觀察值,分別代表乘客編號,生還情況,階級(1高,2中,3低),姓名,性別,年齡,同胞(兄弟姐妹/配偶),直系(父母/孩子),船票編號,票價,船艙號,乘船港口(乘客上船的港口不一樣)。

構造特徵值(通過這些值分析得出一些我們需要的數據)

1 名字中藏了些什麼?

# Grab title from passenger namesfull$Title <- gsub((.*, )|(\..*), , full$Name)# Show title counts by sextable(full$Sex, full$Title)

// gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,

fixed = FALSE, useBytes = FALSE) 模式的匹配和替換 *

我們從姓名中,提取出這些乘客的姓氏(頭銜),並按照性別進行統計。

#Titles with very low cell counts to be combined to "rare" levelrare_title <- c(Dona, Lady, the Countess,Capt, Col, Don, Dr, Major, Rev, Sir, Jonkheer)# Also reassign mlle, ms, and mme accordinglyfull$Title[full$Title == Mlle] <- Miss full$Title[full$Title == Ms] <- Missfull$Title[full$Title == Mme] <- Mrs full$Title[full$Title %in% rare_title] <- Rare Title# Show title counts by sex againtable(full$Sex, full$Title)

之後根據稀有度進行歸類,分為以下五類:

# Finally, grab surname from passenger namefull$Surname <- sapply(full$Name, function(x) strsplit(x, split = [,.])[[1]][1])cat(paste(We have <b>, nlevels(factor(full$Surname)), </b> unique surnames. I would be interested to infer ethnicity based on surname --- another time.))

// cat(... , file = "", sep = " ", fill = FALSE, labels = NULL,append = FALSE) 連接與列印 *

回答:藏了他們的身份與社會地位。

2 不求同年同月生,但求同年同月死?

這裡我們創建了新的觀察值full$Fsize,關於一個人的家庭規模(包括了同胞和直系)。以及full$Family中包含的家庭規模和姓氏(頭銜),進一步深入每名乘客的內心世界。

# Create a family size variable including the passenger themselvesfull$Fsize <- full$SibSp + full$Parch + 1# Create a family variable full$Family <- paste(full$Surname, full$Fsize, sep=_)# Use ggplot2 to visualize the relationship between family size & survivalggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) + geom_bar(stat=count, position=dodge) + scale_x_continuous(breaks=c(1:11)) + labs(x = Family Size) + theme_few()

// paste (..., sep = " ", collapse = NULL) 連接字元串

可以看出,一個人在船上(船上沒其他家人)生還的可能不大,小家庭(船上有2~4)生還幾率明顯提高,大家庭(超過4)很慘。

# Discretize family sizefull$FsizeD[full$Fsize == 1] <- singletonfull$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- smallfull$FsizeD[full$Fsize > 4] <- large# Show family size by survival using a mosaic plotmosaicplot(table(full$FsizeD, full$Survived), main=Family Size by Survival, shade=TRUE)

// mosaicplot(x, main = deparse(substitute(x)),sub = NULL, xlab = NULL, ylab = NULL,sort = NULL, off = NULL, dir = NULL,color = NULL, shade = FALSE, margin = NULL,cex.axis = 0.66, las = par("las"), border = NULL,type = c("pearson", "deviance", "FT"), ...) 鑲嵌圖 *

為了使觀測更加具有比例化,更直觀,採用鑲嵌圖來顯示。

回答:眾人拾柴火焰高,一人拾柴火不夠,一小群人拾柴剛剛好,一大群人拾柴火太大。小家庭模式更利於生還。

3 客艙藏匿的秘密?

# This variable appears to have a lot of missing valuesfull$Cabin[1:28]

似乎缺失了很多信息,比如船艙號。

# The first character is the deck. For example:strsplit(full$Cabin[2], NULL)[[1]]

// strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) 分解特徵向量的元素 *

其中第一個字母代表的甲板,我們將其拆分。

創建的甲板變數,也為我們確定了乘客位於船的哪些位置,便於發掘出更多有用的信息。

# Create a Deck variable. Get passenger deck A - F:full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))

神秘人?

# Passengers 62 and 830 are missing Embarkmentfull[c(62, 830), Embarked]

乘客62與乘客830缺少上船的地方,天外來客?

cat(paste(We will infer their values for **embarkment** based on present data that we can imagine may be relevant: **passenger class** and **fare**. We see that they paid<b> $, full[c(62, 830), Fare][[1]][1], </b>and<b> $, full[c(62, 830), Fare][[1]][2], </b>respectively and their classes are<b>, full[c(62, 830), Pclass][[1]][1], </b>and<b>, full[c(62, 830), Pclass][[1]][2], </b>. So from where did they embark?))

但是任何事物都有關聯,我們可以根據其頭銜和票價來推斷出這個人來自哪裡。

# Get rid of our missing passenger IDsembark_fare <- full %>% filter(PassengerId != 62 & PassengerId != 830)# Use ggplot2 to visualize embarkment, passenger class, & median fareggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) + geom_boxplot() + geom_hline(aes(yintercept=80), colour=red, linetype=dashed, lwd=2) + scale_y_continuous(labels=dollar_format()) + theme_few()

我們首先去除這兩個特殊的乘客,對剩下資料健全的乘客進行分析。從中可以推斷出這個人是在C上的船(上船地點-票價(因子:頭銜))

# Since their fare was $80 for 1st class, they most likely embarked from Cfull$Embarked[c(62, 830)] <- C#We』re close to fixing the handful of NA values here and there. Passenger on row 1044 has an NA Fare value.# Show row 1044full[1044, ]

而另一位缺少票價信息的乘客1044(PClass=3,Cabin=S),可以確定票價肯定在$0~$100。

ggplot(full[full$Pclass == 3 & full$Embarked == S, ], aes(x = Fare)) + geom_density(fill = #99d6ff, alpha=0.4) + geom_vline(aes(xintercept=median(Fare, na.rm=T)), colour=red, linetype=dashed, lwd=1) + scale_x_continuous(labels=dollar_format()) + theme_few()

進一步分析確定1044票價的區間,縮小範圍,因為$0~$100的範圍太大,不利於數據分析。

# Replace missing fare value with median fare for class/embarkmentfull$Fare[1044] <- median(full[full$Pclass == 3 & full$Embarked == S, ]$Fare, na.rm = TRUE)

最終得到最有可能的值為$8.05!!!

現在來讓我們預測一下東西

# Show number of missing Age valuessum(is.na(full$Age))

我們找到缺失年齡的乘客在第263行。

# Make variables factors into factorsfactor_vars <- c(PassengerId,Pclass,Sex,Embarked, Title,Surname,Family,FsizeD)full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))# Set a random seedset.seed(129)# Perform mice imputation, excluding certain less-than-useful variables:mice_mod <- mice(full[, !names(full) %in% c(PassengerId,Name,Ticket,Cabin,Family,Surname,Survived)], method=rf)

之後我們根據因子,設置隨機種子,創建MICE模型,來進行後續的處理分析並預測其年齡。

# Save the complete output mice_output <- complete(mice_mod)#Let』s compare the results we get with the original distribution of passenger ages to ensure that nothing has gone completely awry.# Plot age distributionspar(mfrow=c(1,2))hist(full$Age, freq=F, main=Age: Original Data, col=darkgreen, ylim=c(0,0.04))hist(mice_output$Age, freq=F, main=Age: MICE Output, col=lightgreen, ylim=c(0,0.04))

下面我們來對比一下根據原始數據和進行MICE分析的輸出,可以看出大概一致。

# Replace Age variable from the mice model.full$Age <- mice_output$Age# Show new number of missing Age valuessum(is.na(full$Age))

這裡顯示處理後缺失年齡值的總條數。

下面我們來進一步分析那些依賴年齡的變數

# First well look at the relationship between age & survivalggplot(full[1:891,], aes(Age, fill = factor(Survived))) + geom_histogram() + # I include Sex since we know (a priori) its a significant predictor facet_grid(.~Sex) + theme_few()

這裡我們輸出一張基於年齡根據性別劃分的生還人數圖。生還的大多是婦女兒童,並且以青中年為主,因為上船的人群大多以青中年為主。

# Create the column child, and indicate whether child or adultfull$Child[full$Age < 18] <- Childfull$Child[full$Age >= 18] <- Adult# Show countstable(full$Child, full$Survived)

這裡以具體數字統計了成年人與兒童的生還情況。

# Adding Mother variablefull$Mother <- Not Motherfull$Mother[full$Sex == female & full$Parch > 0 & full$Age > 18 & full$Title != Miss] <- Mother# Show countstable(full$Mother, full$Survived)

進一步觀察母親這一特殊觀察值。

# Finish by factorizing our two new factor variablesfull$Child <- factor(full$Child)full$Mother <- factor(full$Mother)

這裡將其因子化,作為因子進行最後的統計分析。

md.pattern(full)

再次確定沒有缺失值的數據,顯示缺失數據模式。

成為一個預言家

1 分離最開始的數據train和test

# Split the data back into a train set and a test settrain <- full[1:891,]test <- full[892:1309,]

2 建立預測模型(對test進行)

# Set a random seedset.seed(754)# Build the model (note: not all possible variables are used)rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FsizeD + Child + Mother, data = train)# Show model errorplot(rf_model, ylim=c(0,0.36))legend(topright, colnames(rf_model$err.rate), col=1:3, fill=1:3)

# Get importanceimportance <- importance(rf_model)varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,MeanDecreaseGini],2))# Create a rank variable based on importancerankImportance <- varImportance %>% mutate(Rank = paste0(#,dense_rank(desc(Importance))))# Use ggplot2 to visualize the relative importance of variablesggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) + geom_bar(stat=identity) + geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 4, colour = red) + labs(x = Variables) + coord_flip() + theme_few()

我們將各變數根據重要性得出結論,確定生還情況與變數之間的關係。

接下來我們開始對他們進行「宣判」!

# Predict using the test setprediction <- predict(rf_model, test)# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)# Write the solution to filewrite.csv(solution, file = rf_mod_Solution.csv, row.names = F)

做完這些之後,我特別感嘆,明白了自己和別人之間的差距,需要學習的東西太多了。就像猴子老師說的,貪多必失,慢即是快。

以後的路上,做個永遠的行動派!

推薦閱讀:

「不婚族」、「求三觀」......用數據解讀讓人意外的95後婚戀觀
苟利富貴生死以,豈因數學避趨之
python基礎篇之小白滾躺式入坑
用特徵選擇方法優化模型|python數據挖掘思考筆記(1)
分析競爭力,數字時代的差異化競爭優勢

TAG:數據分析 |