Kaggle Titanic Survival Prediction (Top 1.4%): Complete Code Walkthrough

Contents

1 Abstract

2 Importing Packages and Loading Data

3 Data Visualization

4 Data Cleaning

5 Modeling and Tuning

6 Prediction

7 Summary

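1. Abstract

This article walks through my approach to the Kaggle Titanic survival prediction competition, a classification problem, together with the code. It covers exploratory data analysis, feature engineering, missing-value imputation, and model tuning. All code is written in Python.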

2. Importing Packages and Loading Data

%matplotlib inline
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import SelectKBest
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20;
# cross_val_score and GridSearchCV now live in sklearn.model_selection.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, cross_val_score
from sklearn.pipeline import Pipeline, make_pipeline

warnings.filterwarnings('ignore')

train = pd.read_csv('train.csv', dtype={"Age": np.float64})
test = pd.read_csv('test.csv', dtype={"Age": np.float64})
PassengerId = test['PassengerId']
# Stack train and test so that features are engineered consistently on both
all_data = pd.concat([train, test], ignore_index=True)

3. Data Visualization

1) Sex feature: women survived at a far higher rate than men

sns.barplot(x="Sex", y="Survived", data=train, palette='Set3')
print("Percentage of females who survived:%.2f" % (train["Survived"][train["Sex"] == 'female'].value_counts(normalize=True)[1]*100))
print("Percentage of males who survived:%.2f" % (train["Survived"][train["Sex"] == 'male'].value_counts(normalize=True)[1]*100))
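
Because Survived is a 0/1 column, the same percentages can also be read off a grouped mean. The following is a minimal sketch of that idea; the small `sample` frame is made up for illustration and is not the Kaggle data.

```python
import pandas as pd

# Tiny made-up sample (NOT the real Kaggle data), used only for illustration.
sample = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 1],
})

# The mean of a 0/1 column within a group is that group's survival rate.
survival_rate = sample.groupby("Sex")["Survived"].mean() * 100
for sex, rate in survival_rate.items():
    print("Percentage of %s passengers who survived: %.2f" % (sex, rate))
```

On the real data, `train.groupby("Sex")["Survived"].mean()` would give both rates in one call, and the same pattern applies to Pclass, SibSp, or Parch.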

2) Pclass feature: the higher the passenger's class, the higher the survival rate

sns.barplot(x="Pclass", y="Survived", data=train, palette='Set3')
print("Percentage of Pclass = 1 who survived:%.2f" % (train["Survived"][train["Pclass"] == 1].value_counts(normalize=True)[1]*100))
print("Percentage of Pclass = 2 who survived:%.2f" % (train["Survived"][train["Pclass"] == 2].value_counts(normalize=True)[1]*100))
print("Percentage of Pclass = 3 who survived:%.2f" % (train["Survived"][train["Pclass"] == 3].value_counts(normalize=True)[1]*100))

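3) SibSp feature: passengers with a moderate number of siblings and spouses aboard had a higher survival rate

sns.barplot(x="SibSp", y="Survived", data=train, palette='Set3')

4) Parch feature: passengers with a moderate number of parents and children aboard had a higher survival rate

sns.barplot(x="Parch", y="Survived", data=train, palette='Set3')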

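5) Age feature: minors survived at a higher rate than adults

facet = sns.FacetGrid(train, hue="Survived", aspect=2)
# fill= replaces the deprecated shade= argument of kdeplot (seaborn >= 0.11)
facet.map(sns.kdeplot, 'Age', fill=True)
facet.set(xlim=(0, train['Age'].max()))
facet.add_legend()

6) Fare feature: the higher the fare paid, the higher the survival rate

facet = sns.FacetGrid(train, hue="Survived", aspect=2)
facet.map(sns.kdeplot, 'Fare', fill=True)
facet.set(xlim=(0, 200))
facet.add_legend()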

7) Title feature (new): survival rate differs by title

Add a Title feature: extract each passenger's title from the name and group the titles into six categories.

# Name looks like "Surname, Title. Given names": take the text between the comma and the first period
all_data['Title'] = all_data['Name'].apply(lambda x: x.split(',')[1].split('.')[0].strip())
Title_Dict = {}
Title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
Title_Dict.update(dict.fromkeys(['Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
Title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
Title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
Title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
Title_Dict.update(dict.fromkeys(['Master', 'Jonkheer'], 'Master'))
all_data['Title'] = all_data['Title'].map(Title_Dict)
sns.barplot(x="Title", y="Survived", data=all_data, palette='Set3')
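
To see what the extraction does on individual strings, here is a small standalone sketch. The helper name `extract_title`, the shortened mapping `TITLE_GROUPS`, and the sample names are illustrative assumptions written in the dataset's "Surname, Title. Given names" format.

```python
import pandas as pd

def extract_title(name):
    """Return the text between the comma and the first period of a name."""
    return name.split(",")[1].split(".")[0].strip()

# Shortened, illustrative version of the title-to-category mapping.
TITLE_GROUPS = {
    "Mr": "Mr",
    "Miss": "Miss",
    "Mlle": "Miss",
    "Mrs": "Mrs",
    "Dr": "Officer",
    "the Countess": "Royalty",
}

# Illustrative names in the "Surname, Title. Given names" format.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
    "Example, Dr. Hypothetical",
])

titles = names.apply(extract_title)
grouped = titles.map(TITLE_GROUPS)  # titles missing from the mapping become NaN
print(pd.DataFrame({"Title": titles, "Group": grouped}))
```

Since `map` turns any title that is not in the dictionary into NaN, checking `grouped.isna().sum()` after mapping is a cheap way to confirm that every title was covered.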

8) FamilyLabel feature (new): passengers with a family size of 2 to 4 had a higher survival rate

Add a FamilyLabel feature: first compute FamilySize = Parch + SibSp + 1, then bin FamilySize into three classes.

all_data['FamilySize'] = all_data['SibSp'] + all_data['Parch'] + 1
sns.barplot(x="FamilySize", y="Survived", data=all_data, palette='Set3')

Bin FamilySize into three classes by survival rate to form the FamilyLabel feature.

def Fam_label(s):
    if 2 <= s <= 4:
        return 2   # highest survival rate
    elif (4 < s <= 7) or (s == 1):
        return 1   # medium survival rate
    elif s > 7:
        return 0   # lowest survival rate

all_data['FamilyLabel'] = all_data['FamilySize'].apply(Fam_label)
sns.barplot(x="FamilyLabel", y="Survived", data=all_data, palette='Set3')
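
The same three-way binning can be expressed without a Python-level function. The sketch below uses `numpy.select` as a swapped-in, vectorized alternative; it assumes every family size is at least 1, which holds because FamilySize counts the passenger too.

```python
import numpy as np
import pandas as pd

# Illustrative family sizes covering every branch of the binning rule.
family_size = pd.Series([1, 2, 4, 5, 7, 8, 11])

conditions = [
    family_size.between(2, 4),  # inclusive on both ends -> label 2
    family_size > 7,            # very large families    -> label 0
]
# Everything else (size 1, or 5 to 7) falls through to the default label 1.
labels = np.select(conditions, [2, 0], default=1)
print(dict(zip(family_size, labels)))
```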

9) Deck feature (new): survival rate differs by deck

Add a Deck feature: fill missing Cabin values with Unknown, then take the first letter of Cabin as the passenger's deck.

all_data['Cabin'] = all_data['Cabin'].fillna('Unknown')
all_data['Deck'] = all_data['Cabin'].str.get(0)
sns.barplot(x="Deck", y="Survived", data=all_data, palette='Set3')
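
A quick standalone illustration of the two steps, using made-up cabin strings: missing cabins end up on a pseudo-deck "U" (the first letter of "Unknown"), and a passenger with several cabins is assigned the deck of the first one.

```python
import pandas as pd

# Illustrative Cabin values; None stands for a missing cabin.
cabin = pd.Series(["C85", None, "E46", "B57 B59 B63 B66"])

# Fill the gaps first, then keep only the first character of each string.
deck = cabin.fillna("Unknown").str.get(0)
print(deck.tolist())
```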

10) TicketGroup feature (new): passengers whose ticket number is shared by 2 to 4 people had a higher survival rate

Add a TicketGroup feature: count how many passengers share each passenger's ticket number.

Ticket_Count = dict(all_data['Ticket'].value_counts())
all_data['TicketGroup'] = all_data['Ticket'].apply(lambda x: Ticket_Count[x])
sns.barplot(x='TicketGroup', y='Survived', data=all_data, palette='Set3')

Bin TicketGroup into three classes by survival rate.

def Ticket_Label(s):
    if 2 <= s <= 4:
        return 2   # highest survival rate
    elif (4 < s <= 8) or (s == 1):
        return 1   # medium survival rate
    elif s > 8:
        return 0   # lowest survival rate

all_data['TicketGroup'] = all_data['TicketGroup'].apply(Ticket_Label)
sns.barplot(x='TicketGroup', y='Survived', data=all_data, palette='Set3')
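
The ticket count itself can be computed in one step by mapping each ticket onto its `value_counts`. This is a minimal sketch on made-up ticket numbers, shown as an alternative to building the dictionary by hand.

```python
import pandas as pd

# Illustrative ticket numbers: "A1" is shared by two people, "C3" by three.
tickets = pd.DataFrame({"Ticket": ["A1", "A1", "B2", "C3", "C3", "C3"]})

# value_counts() gives passengers per ticket; map() broadcasts it back to each row.
tickets["TicketGroup"] = tickets["Ticket"].map(tickets["Ticket"].value_counts())
print(tickets)
```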

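4. Data Cleaning

1) Filling missing values

Age feature: Age has 263 missing values, which is a lot, so I build a random forest model on the Sex, Title, and Pclass features to fill in the missing ages.

age_df = all_data[['Age', 'Pclass', 'Sex', 'Title']]
age_df = pd.get_dummies(age_df)
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
known_age = age_df[age_df.Age.notnull()].to_numpy(dtype=float)
unknown_age = age_df[age_df.Age.isnull()].to_numpy(dtype=float)
y = known_age[:, 0]    # column 0 is Age, the regression target
X = known_age[:, 1:]   # remaining columns are the predictors
rfr = RandomForestRegressor(random_state=0, n_estimators=100, n_jobs=-1)
rfr.fit(X, y)
predictedAges = rfr.predict(unknown_age[:, 1:])
all_data.loc[all_data.Age.isnull(), 'Age'] = predictedAges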
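
The imputation pattern (fit a regressor on rows where the target is known, predict it where it is missing) can be tried in isolation. The sketch below runs on randomly generated data, not the Titanic set; the frame `toy` and its columns are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in data (NOT the Titanic data).
rng = np.random.default_rng(0)
n = 60
toy = pd.DataFrame({
    "Age": rng.uniform(1, 70, size=n).round(),
    "Pclass": rng.integers(1, 4, size=n),
    "Sex": rng.choice(["male", "female"], size=n),
})
toy.loc[toy.index % 5 == 0, "Age"] = np.nan  # hide every fifth age

# One-hot encode the predictors; the target column stays out of the features.
features = pd.get_dummies(toy[["Pclass", "Sex"]], dtype=float)
missing = toy["Age"].isna()

# Fit on rows with a known age, then predict the hidden ones.
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(features[~missing], toy.loc[~missing, "Age"])
toy.loc[missing, "Age"] = model.predict(features[missing])

print(toy["Age"].isna().sum())  # no missing ages remain
```

A random forest regressor averages training targets, so the imputed ages always fall inside the range of the known ages.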

Embarked feature: Embarked has 2 missing values. Both of these passengers have Pclass 1 and Fare 80. Since the median Fare of passengers with Embarked C and Pclass 1 is 80, I fill the missing values with C.

all_data[all_data['Embarked'].isnull()]

sns.boxplot(x="Embarked", y="Fare", hue="Pclass", data=all_data, palette="Set3")

all_data['Embarked'] = all_data['Embarked'].fillna('C')
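
The comparison read off the box plot can also be done numerically: compute the median fare per (Embarked, Pclass) cell and see which port's first-class median is closest to the fare of the passengers with missing Embarked. The sketch below uses made-up fares chosen for illustration, not the real distribution.

```python
import pandas as pd

# Made-up fares (NOT the real data), picked so that C / Pclass 1 has median 80.
toy = pd.DataFrame({
    "Embarked": ["C", "C", "C", "S", "S", "S", "Q", "Q"],
    "Pclass":   [1, 1, 1, 1, 1, 3, 3, 3],
    "Fare":     [70.0, 80.0, 90.0, 50.0, 54.0, 8.0, 7.5, 7.9],
})

# Median fare for every (port, class) combination that occurs.
median_fare = toy.groupby(["Embarked", "Pclass"])["Fare"].median()

# Among first-class passengers, which port's median is closest to a fare of 80?
first_class = median_fare.xs(1, level="Pclass")
closest_port = (first_class - 80).abs().idxmin()
print(median_fare)
print("Closest port:", closest_port)
```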

Fare feature: Fare has 1 missing value. That passenger has Embarked S and Pclass 3, so I fill the value with the median Fare of passengers with Embarked S and Pclass 3.

all_data[all_data['Fare'].isnull()]

fare = all_data[(all_data['Embarked'] == "S") & (all_data['Pclass'] == 3)].Fare.median()
all_data['Fare'] = all_data['Fare'].fillna(fare)
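
When more than one row is missing, the group median does not have to be looked up by hand. As a hedged sketch of a more general variant, `groupby(...).transform("median")` produces the median of each row's own (Embarked, Pclass) group, which `fillna` can then use row by row; the data below is made up.

```python
import numpy as np
import pandas as pd

# Made-up rows: the third passenger (S, Pclass 3) has no recorded fare.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "S", "C"],
    "Pclass":   [3, 3, 3, 1],
    "Fare":     [7.0, 9.0, np.nan, 80.0],
})

# Median fare of each row's own (Embarked, Pclass) group; NaN values are skipped.
group_median = toy.groupby(["Embarked", "Pclass"])["Fare"].transform("median")
toy["Fare"] = toy["Fare"].fillna(group_median)
print(toy)
```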

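2) Identifying groups

Passengers with the same surname are placed in one group. From the groups with more than one member, I extract the women and children and the adult men separately.

all_data['Surname'] = all_data['Name'].apply(lambda x: x.split(',')[0].strip())
Surname_Count = dict(all_data['Surname'].value_counts())
all_data['FamilyGroup'] = all_data['Surname'].apply(lambda x: Surname_Count[x])
# Women and children (age <= 12) in groups of two or more
Female_Child_Group = all_data.loc[(all_data['FamilyGroup'] >= 2) & ((all_data['Age'] <= 12) | (all_data['Sex'] == 'female'))]
# Adult men (age > 12) in groups of two or more
Male_Adult_Group = all_data.loc[(all_data['FamilyGroup'] >= 2) & (all_data['Age'] > 12) & (all_data['Sex'] == 'male')]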

For the vast majority of women-and-children groups, the mean survival rate is either 1 or 0: the women and children in a group either all survived or all died.

Female_Child = pd.DataFrame(Female_Child_Group.groupby('Surname')['Survived'].mean().value_counts())
Female_Child.columns = ['GroupCount']
Female_Child

sns.barplot(x=Female_Child.index, y=Female_Child["GroupCount"], palette='Set3').set_xlabel('AverageSurvived')

The mean survival rate of the vast majority of adult-male groups is also either 1 or 0.

Male_Adult = pd.DataFrame(Male_Adult_Group.groupby('Surname')['Survived'].mean().value_counts())
Male_Adult.columns = ['GroupCount']
Male_Adult

sns.barplot(x=Male_Adult.index, y=Male_Adult['GroupCount'], palette='Set3').set_xlabel('AverageSurvived')

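As a general rule, women and children have a high survival rate and adult men a low one, so the anomalous groups that break this rule are picked out and handled separately. Women-and-children groups with a survival rate of 0 are labelled dead groups, and adult-male groups with a survival rate of 1 are labelled survived groups. The assumption is that women and children in a dead group are unlikely to have survived, and that adult men in a survived group are likely to have survived.

Female_Child_Group = Female_Child_Group.groupby('Surname')['Survived'].mean()
Dead_List = set(Female_Child_Group[Female_Child_Group.apply(lambda x: x == 0)].index)
print(Dead_List)
Male_Adult_List = Male_Adult_Group.groupby('Surname')['Survived'].mean()
Survived_List = set(Male_Adult_List[Male_Adult_List.apply(lambda x: x == 1)].index)
print(Survived_List)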
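
The whole group-identification logic can be followed end to end on a handful of rows. In the sketch below the surnames, ages, and outcomes are invented; NaN in Survived plays the role of a test-set passenger whose outcome is unknown and is ignored by the group mean.

```python
import numpy as np
import pandas as pd

# Invented passengers; NaN in Survived marks a test-set row with unknown outcome.
toy = pd.DataFrame({
    "Surname":  ["Alpha", "Alpha", "Alpha", "Beta", "Beta", "Gamma", "Gamma", "Delta"],
    "Sex":      ["female", "male", "female", "male", "male", "female", "male", "male"],
    "Age":      [30, 8, 35, 40, 45, 28, 33, 50],
    "Survived": [0, 0, np.nan, 1, np.nan, 1, 0, 1],
})
toy["FamilyGroup"] = toy["Surname"].map(toy["Surname"].value_counts())

in_group = toy["FamilyGroup"] >= 2
women_children = toy[in_group & ((toy["Age"] <= 12) | (toy["Sex"] == "female"))]
adult_men = toy[in_group & (toy["Age"] > 12) & (toy["Sex"] == "male")]

# Mean survival per surname, computed over the known outcomes only.
wc_rate = women_children.groupby("Surname")["Survived"].mean()
men_rate = adult_men.groupby("Surname")["Survived"].mean()

dead_list = set(wc_rate[wc_rate == 0].index)        # all known women/children died
survived_list = set(men_rate[men_rate == 1].index)  # all known adult men survived
print(dead_list, survived_list)
```

Here the unknown-outcome woman named Alpha belongs to a dead group and the unknown-outcome man named Beta to a survived group; Delta is ignored because the group has only one member.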

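So that samples in these two kinds of anomalous groups are classified correctly, the Age, Title, and Sex of test-set samples in those groups are overwritten as a penalty.

train = all_data.loc[all_data['Survived'].notnull()]
# .copy() so the assignments below modify an independent frame, not a slice of all_data
test = all_data.loc[all_data['Survived'].isnull()].copy()
# Members of dead groups are made to look like older adult men
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Sex'] = 'male'
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Age'] = 60
test.loc[(test['Surname'].apply(lambda x: x in Dead_List)), 'Title'] = 'Mr'
# Members of survived groups are made to look like young girls
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Sex'] = 'female'
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Age'] = 5
test.loc[(test['Surname'].apply(lambda x: x in Survived_List)), 'Title'] = 'Miss'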

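3) Feature transformation

Select the features, convert them to numeric variables, and split the data back into training and test sets.

all_data = pd.concat([train, test])
all_data = all_data[['Survived', 'Pclass', 'Sex', 'Age', 'Fare', 'Embarked', 'Title', 'FamilyLabel', 'Deck', 'TicketGroup']]
all_data = pd.get_dummies(all_data)
train = all_data[all_data['Survived'].notnull()]
test = all_data[all_data['Survived'].isnull()].drop('Survived', axis=1)
# DataFrame.as_matrix() was removed in pandas 1.0; to_numpy() replaces it
X = train.to_numpy(dtype=float)[:, 1:]   # all columns after Survived
y = train.to_numpy(dtype=float)[:, 0]    # Survived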
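
One reason to one-hot encode the combined frame before splitting is that the training and test sets then share exactly the same columns, even when a category occurs in only one of them. A minimal sketch with invented rows:

```python
import numpy as np
import pandas as pd

# Invented rows: port "Q" appears only in the test part.
train_part = pd.DataFrame({"Survived": [1.0, 0.0], "Embarked": ["S", "C"]})
test_part = pd.DataFrame({"Survived": [np.nan], "Embarked": ["Q"]})

# Encode once on the combined frame, then split on whether the label is known.
combined = pd.concat([train_part, test_part], ignore_index=True)
encoded = pd.get_dummies(combined, dtype=float)

train_enc = encoded[encoded["Survived"].notnull()]
test_enc = encoded[encoded["Survived"].isnull()].drop("Survived", axis=1)

X = train_enc.drop("Survived", axis=1).to_numpy()
y = train_enc["Survived"].to_numpy()
print(list(test_enc.columns), X.shape)
```

Encoding the two parts separately would have produced an Embarked_Q column only in the test set, and the model would then see a different number of features at prediction time.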

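5. Modeling and Tuning

1) Parameter tuning

Grid search picks the best parameters automatically. The best parameters my grid search actually found were n_estimators = 28, max_depth = 6. However, after I changed them to n_estimators = 26, max_depth = 6, following another Kernel, both the cross-validation score and the Kaggle score improved slightly.

pipe = Pipeline([('select', SelectKBest(k=20)),
                 ('classify', RandomForestClassifier(random_state=10, max_features='sqrt'))])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_test = {'classify__n_estimators': list(range(20, 50, 2)),
              'classify__max_depth': list(range(3, 60, 3))}
gsearch = GridSearchCV(estimator=pipe, param_grid=param_test, scoring='roc_auc', cv=10)
gsearch.fit(X, y)
print(gsearch.best_params_, gsearch.best_score_)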
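
For readers who want to try the search mechanics without the Titanic files, here is a scaled-down sketch on synthetic data from `make_classification`. The data, the smaller grid, and k=8 are assumptions chosen so the example runs quickly; only the structure mirrors the search above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (NOT the Titanic features).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=12, n_informative=5, random_state=0
)

pipe = Pipeline([
    ("select", SelectKBest(k=8)),
    ("classify", RandomForestClassifier(random_state=10, max_features="sqrt")),
])

# Step parameters are addressed as "<step name>__<parameter name>".
param_grid = {
    "classify__n_estimators": [10, 20],
    "classify__max_depth": [3, 6],
}

search = GridSearchCV(pipe, param_grid=param_grid, scoring="roc_auc", cv=3)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Putting the feature selector inside the pipeline means it is refit on each training fold, so the held-out fold never influences which features are kept.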

2) Training the model

select = SelectKBest(k=20)
clf = RandomForestClassifier(random_state=10, warm_start=True, n_estimators=26, max_depth=6, max_features='sqrt')
pipeline = make_pipeline(select, clf)
pipeline.fit(X, y)

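3) Cross-validation

# sklearn.cross_validation was removed in scikit-learn 0.20; cross_val_score now lives in sklearn.model_selection
from sklearn.model_selection import cross_val_score
cv_score = cross_val_score(pipeline, X, y, cv=10)
print("CV Score : Mean - %.7g | Std - %.7g " % (np.mean(cv_score), np.std(cv_score)))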
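
For a classifier, passing an integer such as cv=10 makes scikit-learn use stratified folds, so each fold keeps roughly the same class balance as the full data. The sketch below spells this out with an explicit `StratifiedKFold` on synthetic data; the data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in data (NOT the Titanic features).
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

clf = RandomForestClassifier(n_estimators=26, max_depth=6, random_state=10)
folds = StratifiedKFold(n_splits=10)  # what an integer cv=10 resolves to for classifiers

# One accuracy score per fold; the spread shows how stable the estimate is.
scores = cross_val_score(clf, X_demo, y_demo, cv=folds)
print("CV Score : Mean - %.7g | Std - %.7g" % (np.mean(scores), np.std(scores)))
```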

6. Prediction

predictions = pipeline.predict(test)
submission = pd.DataFrame({"PassengerId": PassengerId, "Survived": predictions.astype(np.int32)})
submission.to_csv("submission.csv", index=False)

Submitted to Kaggle, this scores an accuracy of 83.732% and ranks 140/9653 = 1.45%.

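7. Summary

That is my approach to the Titanic survival prediction competition, along with the complete code. If you have questions, feel free to discuss them with me in the comments, and please point out any shortcomings :)

Thanks for reading this far. If this helped you, please leave a like; as a fellow newcomer I would really appreciate a little feedback ^_^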

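References:

1. RandomForest, Leaderboard 0.8134 | Kaggle

2. Machine Learning (2): How to Reach the Top 2% on Kaggle (機器學習(二) 如何做到Kaggle排名前2%)

3. Kaggle Titanic Data Analysis (Top 1.6%): Lessons Learned (Kaggle Titanic Data Analysis(Top 1.6%)經驗分享)

4. Kaggle Titanic Survival Prediction: A Painstakingly Detailed Walkthrough (Kaggle Titanic 生存預測 -- 詳細流程吐血梳理)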
