人類史上最大惡性海難-TITANIC沉船事故
我以前和你一樣也是個冒險家,直到我的膝蓋中了一箭。
——《上古捲軸5:天際》
這是我第一次接觸kaggle,做得時候完全沒有頭緒和思路,不知道從哪裡入手,所以找了一個kaggler寫的文章,按照她的思路做了一遍,感謝她,讓我從中學習了很多數據分析的思路。
巧婦難為無米之炊,幹活之前肯定要先把數據都導入RStudio中,其中用到了許多包,我看了一下,用過的也就是ggplot2和dplyr,剩下的具體用到再看。(PS:提前需安裝)
ggthemes:可以理解為ggplot2的擴展包(This package contains extra themes, scales, and geoms, and functions for and related to ggplot2.)。
scales:數據標準化處理(Generic plot scaling methods)
mice:對缺失值進行填補的包(Generates Multivariate Imputations by Chained Equations (MICE))。
randomForest:對數據進行分類並回歸(Classification and Regression with Random Forest隨機森林分類與回歸)。
# Load packages library(ggplot2) # visualizationlibrary(ggthemes) # visualizationlibrary(scales) # visualizationlibrary(dplyr) # data manipulationlibrary(mice) # imputationlibrary(randomForest) # classification algorithmtrain <- read.csv(C:/Users/Administrator/Desktop/train.csv, stringsAsFactors = F)test <- read.csv(C:/Users/Administrator/Desktop/test.csv, stringsAsFactors = F)full <- bind_rows(train, test) # bind training & test data# check datastr(full)
// read.x() bind_rows()
現在我們得到了一組合併數據,1309條,12個觀察值,分別代表乘客編號,生還情況,階級(1高,2中,3低),姓名,性別,年齡,同胞(兄弟姐妹/配偶),直系(父母/孩子),船票編號,票價,船艙號,乘船港口(乘客上船的港口不一樣)。
構造特徵值(通過這些值分析得出一些我們需要的數據)
1 名字中藏了些什麼?
# Grab title from passenger namesfull$Title <- gsub((.*, )|(\..*), , full$Name)# Show title counts by sextable(full$Sex, full$Title)
// gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE,
fixed = FALSE, useBytes = FALSE) 模式的匹配和替換 *
我們從姓名中,提取出這些乘客的姓氏(頭銜),並按照性別進行統計。
#Titles with very low cell counts to be combined to "rare" levelrare_title <- c(Dona, Lady, the Countess,Capt, Col, Don, Dr, Major, Rev, Sir, Jonkheer)# Also reassign mlle, ms, and mme accordinglyfull$Title[full$Title == Mlle] <- Miss full$Title[full$Title == Ms] <- Missfull$Title[full$Title == Mme] <- Mrs full$Title[full$Title %in% rare_title] <- Rare Title# Show title counts by sex againtable(full$Sex, full$Title)
之後根據稀有度進行歸類,分為以下五類:
# Finally, grab surname from passenger namefull$Surname <- sapply(full$Name, function(x) strsplit(x, split = [,.])[[1]][1])cat(paste(We have <b>, nlevels(factor(full$Surname)), </b> unique surnames. I would be interested to infer ethnicity based on surname --- another time.))
// cat(... , file = "", sep = " ", fill = FALSE, labels = NULL,append = FALSE) 連接與列印 *
回答:藏了他們的身份與社會地位。
2 不求同年同月生,但求同年同月死?
這裡我們創建了新的觀察值full$Fsize,關於一個人的家庭規模(包括了同胞和直系)。以及full$Family中包含的家庭規模和姓氏(頭銜),進一步深入每名乘客的內心世界。
# Create a family size variable including the passenger themselvesfull$Fsize <- full$SibSp + full$Parch + 1# Create a family variable full$Family <- paste(full$Surname, full$Fsize, sep=_)# Use ggplot2 to visualize the relationship between family size & survivalggplot(full[1:891,], aes(x = Fsize, fill = factor(Survived))) + geom_bar(stat=count, position=dodge) + scale_x_continuous(breaks=c(1:11)) + labs(x = Family Size) + theme_few()
// paste (..., sep = " ", collapse = NULL) 連接字元串
可以看出,一個人在船上(船上沒其他家人)生還的可能不大,小家庭(船上有2~4)生還幾率明顯提高,大家庭(超過4)很慘。
# Discretize family sizefull$FsizeD[full$Fsize == 1] <- singletonfull$FsizeD[full$Fsize < 5 & full$Fsize > 1] <- smallfull$FsizeD[full$Fsize > 4] <- large# Show family size by survival using a mosaic plotmosaicplot(table(full$FsizeD, full$Survived), main=Family Size by Survival, shade=TRUE)
// mosaicplot(x, main = deparse(substitute(x)),sub = NULL, xlab = NULL, ylab = NULL,sort = NULL, off = NULL, dir = NULL,color = NULL, shade = FALSE, margin = NULL,cex.axis = 0.66, las = par("las"), border = NULL,type = c("pearson", "deviance", "FT"), ...) 鑲嵌圖 *
為了使觀測更加具有比例化,更直觀,採用鑲嵌圖來顯示。
回答:眾人拾柴火焰高,一人拾柴火不夠,一小群人拾柴剛剛好,一大群人拾柴火太大。小家庭模式更利於生還。
3 客艙藏匿的秘密?
# This variable appears to have a lot of missing valuesfull$Cabin[1:28]
似乎缺失了很多信息,比如船艙號。
# The first character is the deck. For example:strsplit(full$Cabin[2], NULL)[[1]]
// strsplit(x, split, fixed = FALSE, perl = FALSE, useBytes = FALSE) 分解特徵向量的元素 *
其中第一個字母代表的甲板,我們將其拆分。
創建的甲板變數,也為我們確定了乘客位於船的哪些位置,便於發掘出更多有用的信息。
# Create a Deck variable. Get passenger deck A - F:full$Deck<-factor(sapply(full$Cabin, function(x) strsplit(x, NULL)[[1]][1]))
神秘人?
# Passengers 62 and 830 are missing Embarkmentfull[c(62, 830), Embarked]
乘客62與乘客830缺少上船的地方,天外來客?
cat(paste(We will infer their values for **embarkment** based on present data that we can imagine may be relevant: **passenger class** and **fare**. We see that they paid<b> $, full[c(62, 830), Fare][[1]][1], </b>and<b> $, full[c(62, 830), Fare][[1]][2], </b>respectively and their classes are<b>, full[c(62, 830), Pclass][[1]][1], </b>and<b>, full[c(62, 830), Pclass][[1]][2], </b>. So from where did they embark?))
但是任何事物都有關聯,我們可以根據其頭銜和票價來推斷出這個人來自哪裡。
# Get rid of our missing passenger IDsembark_fare <- full %>% filter(PassengerId != 62 & PassengerId != 830)# Use ggplot2 to visualize embarkment, passenger class, & median fareggplot(embark_fare, aes(x = Embarked, y = Fare, fill = factor(Pclass))) + geom_boxplot() + geom_hline(aes(yintercept=80), colour=red, linetype=dashed, lwd=2) + scale_y_continuous(labels=dollar_format()) + theme_few()
我們首先去除這兩個特殊的乘客,對剩下資料健全的乘客進行分析。從中可以推斷出這個人是在C上的船(上船地點-票價(因子:頭銜))
# Since their fare was $80 for 1st class, they most likely embarked from Cfull$Embarked[c(62, 830)] <- C#We』re close to fixing the handful of NA values here and there. Passenger on row 1044 has an NA Fare value.# Show row 1044full[1044, ]
而另一位缺少票價信息的乘客1044(PClass=3,Cabin=S),可以確定票價肯定在$0~$100。
ggplot(full[full$Pclass == 3 & full$Embarked == S, ], aes(x = Fare)) + geom_density(fill = #99d6ff, alpha=0.4) + geom_vline(aes(xintercept=median(Fare, na.rm=T)), colour=red, linetype=dashed, lwd=1) + scale_x_continuous(labels=dollar_format()) + theme_few()
進一步分析確定1044票價的區間,縮小範圍,因為$0~$100的範圍太大,不利於數據分析。
# Replace missing fare value with median fare for class/embarkmentfull$Fare[1044] <- median(full[full$Pclass == 3 & full$Embarked == S, ]$Fare, na.rm = TRUE)
最終得到最有可能的值為$8.05!!!
現在來讓我們預測一下東西
# Show number of missing Age valuessum(is.na(full$Age))
我們找到缺失年齡的乘客在第263行。
# Make variables factors into factorsfactor_vars <- c(PassengerId,Pclass,Sex,Embarked, Title,Surname,Family,FsizeD)full[factor_vars] <- lapply(full[factor_vars], function(x) as.factor(x))# Set a random seedset.seed(129)# Perform mice imputation, excluding certain less-than-useful variables:mice_mod <- mice(full[, !names(full) %in% c(PassengerId,Name,Ticket,Cabin,Family,Surname,Survived)], method=rf)
之後我們根據因子,設置隨機種子,創建MICE模型,來進行後續的處理分析並預測其年齡。
# Save the complete output mice_output <- complete(mice_mod)#Let』s compare the results we get with the original distribution of passenger ages to ensure that nothing has gone completely awry.# Plot age distributionspar(mfrow=c(1,2))hist(full$Age, freq=F, main=Age: Original Data, col=darkgreen, ylim=c(0,0.04))hist(mice_output$Age, freq=F, main=Age: MICE Output, col=lightgreen, ylim=c(0,0.04))
下面我們來對比一下根據原始數據和進行MICE分析的輸出,可以看出大概一致。
# Replace Age variable from the mice model.full$Age <- mice_output$Age# Show new number of missing Age valuessum(is.na(full$Age))
這裡顯示處理後缺失年齡值的總條數。
下面我們來進一步分析那些依賴年齡的變數。
# First well look at the relationship between age & survivalggplot(full[1:891,], aes(Age, fill = factor(Survived))) + geom_histogram() + # I include Sex since we know (a priori) its a significant predictor facet_grid(.~Sex) + theme_few()
這裡我們輸出一張基於年齡根據性別劃分的生還人數圖。生還的大多是婦女兒童,並且以青中年為主,因為上船的人群大多以青中年為主。
# Create the column child, and indicate whether child or adultfull$Child[full$Age < 18] <- Childfull$Child[full$Age >= 18] <- Adult# Show countstable(full$Child, full$Survived)
這裡以具體數字統計了成年人與兒童的生還情況。
# Adding Mother variablefull$Mother <- Not Motherfull$Mother[full$Sex == female & full$Parch > 0 & full$Age > 18 & full$Title != Miss] <- Mother# Show countstable(full$Mother, full$Survived)
進一步觀察母親這一特殊觀察值。
# Finish by factorizing our two new factor variablesfull$Child <- factor(full$Child)full$Mother <- factor(full$Mother)
這裡將其因子化,作為因子進行最後的統計分析。
md.pattern(full)
再次確定沒有缺失值的數據,顯示缺失數據模式。
成為一個預言家
1 分離最開始的數據train和test
# Split the data back into a train set and a test settrain <- full[1:891,]test <- full[892:1309,]
2 建立預測模型(對test進行)
# Set a random seedset.seed(754)# Build the model (note: not all possible variables are used)rf_model <- randomForest(factor(Survived) ~ Pclass + Sex + Age + SibSp + Parch + Fare + Embarked + Title + FsizeD + Child + Mother, data = train)# Show model errorplot(rf_model, ylim=c(0,0.36))legend(topright, colnames(rf_model$err.rate), col=1:3, fill=1:3)
# Get importanceimportance <- importance(rf_model)varImportance <- data.frame(Variables = row.names(importance), Importance = round(importance[ ,MeanDecreaseGini],2))# Create a rank variable based on importancerankImportance <- varImportance %>% mutate(Rank = paste0(#,dense_rank(desc(Importance))))# Use ggplot2 to visualize the relative importance of variablesggplot(rankImportance, aes(x = reorder(Variables, Importance), y = Importance, fill = Importance)) + geom_bar(stat=identity) + geom_text(aes(x = Variables, y = 0.5, label = Rank), hjust=0, vjust=0.55, size = 4, colour = red) + labs(x = Variables) + coord_flip() + theme_few()
我們將各變數根據重要性得出結論,確定生還情況與變數之間的關係。
接下來我們開始對他們進行「宣判」!
# Predict using the test setprediction <- predict(rf_model, test)# Save the solution to a dataframe with two columns: PassengerId and Survived (prediction)solution <- data.frame(PassengerID = test$PassengerId, Survived = prediction)# Write the solution to filewrite.csv(solution, file = rf_mod_Solution.csv, row.names = F)
做完這些之後,我特別感嘆,明白了自己和別人之間的差距,需要學習的東西太多了。就像猴子老師說的,貪多必失,慢即是快。
以後的路上,做個永遠的行動派!
推薦閱讀:
※「不婚族」、「求三觀」......用數據解讀讓人意外的95後婚戀觀
※苟利富貴生死以,豈因數學避趨之
※python基礎篇之小白滾躺式入坑
※用特徵選擇方法優化模型|python數據挖掘思考筆記(1)
※分析競爭力,數字時代的差異化競爭優勢
TAG:數據分析 |