Kaggle Titanic Survival Model — A 250-Feature Ensemble, Ranked in the Top 8%

@猴子, requesting a ticket to the third level.

This article draws on project code and algorithms shared by other users in Kaggle's Kernels section.


1. Data Overview

The Titanic survival prediction competition provides two datasets: train.csv and test.csv, the training set and the test set respectively.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

train_data = pd.read_csv('I://model/titanic/train.csv')
test_data = pd.read_csv('I://model/titanic/test.csv')
train_data.info()
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB

Proportion of survivors:

train_data['Survived'].value_counts().plot.pie(autopct='%1.1f%%')

[Figure: pie chart of overall survival proportions]

2. Analyzing Relationships in the Data

(1) Sex and survival

train_data.groupby(['Sex','Survived'])['Survived'].count()

Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

train_data[['Sex','Survived']].groupby(['Sex']).mean().plot.bar()

Survival rate by sex

(2) Passenger class and survival

train_data[['Pclass','Survived']].groupby(['Pclass']).mean().plot.bar(color=['r','g','b'])

Survival rate by passenger class

train_data[['Sex','Pclass','Survived']].groupby(['Pclass','Sex']).mean().plot.bar()

Survival rate by sex within each passenger class

train_data.groupby(['Sex','Pclass','Survived'])['Survived'].count()

Sex     Pclass  Survived
female  1       0             3
                1            91
        2       0             6
                1            70
        3       0            72
                1            72
male    1       0            77
                1            45
        2       0            91
                1            17
        3       0           300
                1            47
Name: Survived, dtype: int64

The figure and table above show clearly that, although the evacuation of the Titanic broadly followed the women-first rule, survival still differed by class, and some first-class men used their social standing to force their way onto the lifeboats. White Star Line chairman J. Bruce Ismay, for example (who had vetoed the proposal to carry 48 lifeboats, reasoning that fewer would suffice), abandoned his passengers, his crew, and his ship, jumping into collapsible lifeboat C (which carried 39 passengers) at the last moment.

(3) Age and survival

f, ax = plt.subplots(1, 2, figsize=(18, 8))
sns.violinplot("Pclass", "Age", hue="Survived", data=train_data, split=True, ax=ax[0])
ax[0].set_title('Pclass and Age vs Survived')
ax[0].set_yticks(range(0, 110, 10))
sns.violinplot("Sex", "Age", hue="Survived", data=train_data, split=True, ax=ax[1])
ax[1].set_title('Sex and Age vs Survived')
ax[1].set_yticks(range(0, 110, 10))
plt.show()

[Figure: violin plots of Age vs Survived, split by Pclass and by Sex]

(4) Title and survival

The Name field contains each passenger's title, such as Mr, Miss, or Mrs. These titles carry information about age and sex, and may also signal social status, as with Dr, Lady, Major, and Master.

This field does not lend itself to charts, but we will add it to the feature set during feature engineering.

(5) Port of embarkation and survival

The Titanic departed from Southampton, England, calling at Cherbourg, France, and Queenstown, Ireland; some of the people who disembarked at Cherbourg or Queenstown thereby escaped the disaster.

sns.countplot('Embarked', hue='Survived', data=train_data)
plt.title('Embarked and Survived')

[Figure: passenger counts by Embarked, colored by Survived]

(6) Number of relatives aboard and survival

f, ax = plt.subplots(1, 2, figsize=(18, 8))
train_data[['Parch','Survived']].groupby(['Parch']).mean().plot.bar(ax=ax[0])
ax[0].set_title('Parch and Survived')
train_data[['SibSp','Survived']].groupby(['SibSp']).mean().plot.bar(ax=ax[1])
ax[1].set_title('SibSp and Survived')

[Figure: mean survival rate by Parch (left) and by SibSp (right)]

The plots show that passengers travelling alone had a low survival rate, while having too many relatives aboard was also dangerous, since it became hard to look after everyone.

(7) Other factors

The remaining factors are fare, cabin number, and ticket number. All three may have influenced a passenger's location on the ship and hence the order of escape, but since none of them shows an obvious pattern against survival, we leave it to the models in the later ensembling stage to decide their importance.

3. Feature Engineering

First, concatenate train and test so that the feature engineering is applied to both at once:

train_data_org = pd.read_csv('train.csv')
test_data_org = pd.read_csv('test.csv')
test_data_org['Survived'] = 0
combined_train_test = train_data_org.append(test_data_org)

Feature engineering means extracting, from the raw fields, the features that may influence the final outcome, which the models then use as predictors. It is generally best to start with the fields that contain missing values, i.e. NaN.
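As a quick check, the columns that still need attention can be listed like this (a minimal sketch using the combined frame built above):

# Count missing values per column; Age, Cabin, Embarked and Fare are the ones to fix
combined_train_test.isnull().sum().sort_values(ascending=False).head(10)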

(1) Embarked

Start by filling the missing values; the missing Embarked entries are filled with the mode:

if combined_train_test['Embarked'].isnull().sum() != 0:
    combined_train_test['Embarked'].fillna(combined_train_test['Embarked'].mode().iloc[0], inplace=True)

Then expand the three ports of embarkation into three dummy columns, each containing only the values 0 and 1:

emb_dummies_df = pd.get_dummies(combined_train_test['Embarked'],
                                prefix=combined_train_test[['Embarked']].columns[0])
combined_train_test = pd.concat([combined_train_test, emb_dummies_df], axis=1)

(2) Sex

No missing values here, so we expand it into dummy columns directly:

sex_dummies_df = pd.get_dummies(combined_train_test['Sex'],
                                prefix=combined_train_test[['Sex']].columns[0])
combined_train_test = pd.concat([combined_train_test, sex_dummies_df], axis=1)

(3) Name

Extract the title from the name:

# Take the text after the comma, then everything up to the first period,
# e.g. 'Braund, Mr. Owen Harris' -> 'Mr'
combined_train_test['Title'] = combined_train_test['Name'].str.extract('.+,(.+)').str.extract('^(.+?)\.').str.strip()
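An optional sanity check (a sketch) shows which raw titles the extraction produces before they are normalized:

# Inspect the raw titles extracted from Name
print(combined_train_test['Title'].value_counts())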

Normalize the various titles:

title_Dict = {}
title_Dict.update(dict.fromkeys(['Capt', 'Col', 'Major', 'Dr', 'Rev'], 'Officer'))
title_Dict.update(dict.fromkeys(['Jonkheer', 'Don', 'Sir', 'the Countess', 'Dona', 'Lady'], 'Royalty'))
title_Dict.update(dict.fromkeys(['Mme', 'Ms', 'Mrs'], 'Mrs'))
title_Dict.update(dict.fromkeys(['Mlle', 'Miss'], 'Miss'))
title_Dict.update(dict.fromkeys(['Mr'], 'Mr'))
title_Dict.update(dict.fromkeys(['Master'], 'Master'))
combined_train_test['Title'] = combined_train_test['Title'].map(title_Dict)

Expand into dummy columns:

title_dummies_df = pd.get_dummies(combined_train_test['Title'],
                                  prefix=combined_train_test[['Title']].columns[0])
combined_train_test = pd.concat([combined_train_test, title_dummies_df], axis=1)

(4) Fare

Fill the NaN entries with the mean fare of the corresponding passenger class:

if combined_train_test['Fare'].isnull().sum() != 0:
    combined_train_test['Fare'] = combined_train_test[['Fare']].fillna(
        combined_train_test.groupby('Pclass').transform('mean'))

The Titanic sold family group tickets (this emerges from analyzing the Ticket numbers), so group fares need to be split among the passengers sharing a ticket:

combined_train_test['Group_Ticket'] = combined_train_test['Fare'].groupby(
    by=combined_train_test['Ticket']).transform('count')
combined_train_test['Fare'] = combined_train_test['Fare'] / combined_train_test['Group_Ticket']
combined_train_test.drop(['Group_Ticket'], axis=1, inplace=True)

Bin the fares into categories:

def fare_category(fare):
    if fare <= 4:
        return 0
    elif fare <= 10:
        return 1
    elif fare <= 30:
        return 2
    elif fare <= 45:
        return 3
    else:
        return 4

combined_train_test['Fare_Category'] = combined_train_test['Fare'].map(fare_category)

Expand into dummy columns (this feature works either way, expanded or not):

fare_cat_dummies_df = pd.get_dummies(combined_train_test['Fare_Category'],
                                     prefix=combined_train_test[['Fare_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, fare_cat_dummies_df], axis=1)

(5) Pclass

Pclass itself needs no further processing. To get more out of it, we assume that within each class the fare paid also relates to the means of escape, and split each class into subcategories such as high-fare first class, low-fare first class, and so on.
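The code below calls a helper pclass_fare_category that the original post never shows. A minimal sketch consistent with the six labels used below, assuming a fare at or above the class mean counts as high:

# Hypothetical reconstruction of the missing helper; the >= threshold is an assumption
def pclass_fare_category(df, pclass_1_mean_fare, pclass_2_mean_fare, pclass_3_mean_fare):
    if df['Pclass'] == 1:
        return 'Pclass_1_High_Fare' if df['Fare'] >= pclass_1_mean_fare else 'Pclass_1_Low_Fare'
    elif df['Pclass'] == 2:
        return 'Pclass_2_High_Fare' if df['Fare'] >= pclass_2_mean_fare else 'Pclass_2_Low_Fare'
    else:
        return 'Pclass_3_High_Fare' if df['Fare'] >= pclass_3_mean_fare else 'Pclass_3_Low_Fare'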

from sklearn.preprocessing import LabelEncoder

Pclass_1_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([1]).values[0]
Pclass_2_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([2]).values[0]
Pclass_3_mean_fare = combined_train_test['Fare'].groupby(by=combined_train_test['Pclass']).mean().get([3]).values[0]

# Build the Pclass_Fare_Category feature
combined_train_test['Pclass_Fare_Category'] = combined_train_test.apply(pclass_fare_category, args=(
    Pclass_1_mean_fare, Pclass_2_mean_fare, Pclass_3_mean_fare), axis=1)
p_fare = LabelEncoder()
# Assign a label to each category
p_fare.fit(np.array(['Pclass_1_Low_Fare', 'Pclass_1_High_Fare', 'Pclass_2_Low_Fare',
                     'Pclass_2_High_Fare', 'Pclass_3_Low_Fare', 'Pclass_3_High_Fare']))
# Convert the categories to numeric codes
combined_train_test['Pclass_Fare_Category'] = p_fare.transform(combined_train_test['Pclass_Fare_Category'])

(6) Parch and SibSp

Both of these fields noticeably affect Survived, but not in quite the same way, so we merge them into a Family_Size feature while also keeping the two originals.
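The mapping function family_size_category is likewise not shown in the post; a plausible sketch consistent with the three labels used below (the cut-offs are assumptions):

# Hypothetical reconstruction of the missing helper
def family_size_category(family_size):
    if family_size <= 1:
        return 'Single'
    elif family_size <= 4:
        return 'Small_Family'
    else:
        return 'Large_Family'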

combined_train_test['Family_Size'] = combined_train_test['Parch'] + combined_train_test['SibSp'] + 1
combined_train_test['Family_Size_Category'] = combined_train_test['Family_Size'].map(family_size_category)
le_family = LabelEncoder()
le_family.fit(np.array(['Single', 'Small_Family', 'Large_Family']))
combined_train_test['Family_Size_Category'] = le_family.transform(combined_train_test['Family_Size_Category'])
fam_size_cat_dummies_df = pd.get_dummies(combined_train_test['Family_Size_Category'],
                                         prefix=combined_train_test[['Family_Size_Category']].columns[0])
combined_train_test = pd.concat([combined_train_test, fam_size_cat_dummies_df], axis=1)

(7) Age

Because Age has many missing values, it cannot simply be filled with the mode or the mean. Two approaches are common. The first fills by the average age of each Title group (Mr, Master, Miss, and so on), possibly conditioning on several fields together (Sex, Title, Pclass). The second treats Age as a target and predicts it with machine-learning models trained on the other features. This post takes the second approach.
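For reference, the first approach can be done in one line (a sketch only; the post uses the second approach instead):

# Sketch of approach 1 (not used below): fill Age by the mean age of each Title group
age_by_title = combined_train_test.groupby('Title')['Age'].transform(lambda s: s.fillna(s.mean()))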

Rows with Age present form the training set; rows with Age missing form the test set.

missing_age_df = pd.DataFrame(combined_train_test[['Age', 'Parch', 'Sex', 'SibSp', 'Family_Size',
                                                   'Family_Size_Category', 'Title', 'Fare', 'Fare_Category',
                                                   'Pclass', 'Embarked']])
missing_age_df = pd.get_dummies(missing_age_df, columns=['Title', 'Family_Size_Category', 'Fare_Category',
                                                         'Sex', 'Pclass', 'Embarked'])
missing_age_train = missing_age_df[missing_age_df['Age'].notnull()]
missing_age_test = missing_age_df[missing_age_df['Age'].isnull()]

Build the ensemble model:

from sklearn import ensemble, model_selection
from sklearn.linear_model import LinearRegression

def fill_missing_age(missing_age_train, missing_age_test):
    missing_age_X_train = missing_age_train.drop(['Age'], axis=1)
    missing_age_Y_train = missing_age_train['Age']
    missing_age_X_test = missing_age_test.drop(['Age'], axis=1)

    # Model 1: gradient boosting regressor
    gbm_reg = ensemble.GradientBoostingRegressor(random_state=42)
    gbm_reg_param_grid = {'n_estimators': [2000], 'max_depth': [3], 'learning_rate': [0.01], 'max_features': [3]}
    gbm_reg_grid = model_selection.GridSearchCV(gbm_reg, gbm_reg_param_grid, cv=10, n_jobs=25, verbose=1,
                                                scoring='neg_mean_squared_error')
    gbm_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best GB Params:' + str(gbm_reg_grid.best_params_))
    print('Age feature Best GB Score:' + str(gbm_reg_grid.best_score_))
    print('GB Train Error for "Age" Feature Regressor:' + str(gbm_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test['Age_GB'] = gbm_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_GB'][:4])

    # Model 2: linear regression
    lrf_reg = LinearRegression()
    lrf_reg_param_grid = {'fit_intercept': [True], 'normalize': [True]}
    lrf_reg_grid = model_selection.GridSearchCV(lrf_reg, lrf_reg_param_grid, cv=10, n_jobs=25, verbose=1,
                                                scoring='neg_mean_squared_error')
    lrf_reg_grid.fit(missing_age_X_train, missing_age_Y_train)
    print('Age feature Best LR Params:' + str(lrf_reg_grid.best_params_))
    print('Age feature Best LR Score:' + str(lrf_reg_grid.best_score_))
    print('LR Train Error for "Age" Feature Regressor' + str(lrf_reg_grid.score(missing_age_X_train, missing_age_Y_train)))
    missing_age_test['Age_LRF'] = lrf_reg_grid.predict(missing_age_X_test)
    print(missing_age_test['Age_LRF'][:4])

    # Use the row-wise mean of the two models' predictions as the final result
    # missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LRF']].mode(axis=1)
    missing_age_test['Age'] = missing_age_test[['Age_GB', 'Age_LRF']].mean(axis=1)
    print(missing_age_test['Age'][:4])

    # The original notebook called a small helper (drop_col_not_req) here,
    # which simply drops the two intermediate columns
    missing_age_test.drop(['Age_GB', 'Age_LRF'], axis=1, inplace=True)

    return missing_age_test

Fill in Age:

# Assign only the returned Age column (the original assigned the whole returned frame)
combined_train_test.loc[(combined_train_test['Age'].isnull()), 'Age'] = fill_missing_age(
    missing_age_train, missing_age_test)['Age']

(8) Ticket

Split Ticket into its letter and number parts, Ticket_Letter and Ticket_Number. (Note that one-hot encoding the raw Ticket values below inflates the feature space to several hundred sparse columns, which is why the later feature selection has so many candidates to pick from.)

combined_train_test['Ticket_Letter'] = combined_train_test['Ticket'].str.split().str[0]
combined_train_test['Ticket_Letter'] = combined_train_test['Ticket_Letter'].apply(lambda x: np.nan if x.isnumeric() else x)
combined_train_test['Ticket_Number'] = combined_train_test['Ticket'].apply(lambda x: pd.to_numeric(x, errors='coerce'))
combined_train_test['Ticket_Number'].fillna(0, inplace=True)
combined_train_test = pd.get_dummies(combined_train_test, columns=['Ticket', 'Ticket_Letter'])

(9) Cabin

The Cabin field is missing for most passengers, so about all we can model is the coarse information of whether a cabin is recorded and its deck letter (the first character):

combined_train_test['Cabin_Letter'] = combined_train_test['Cabin'].apply(lambda x: str(x)[0] if pd.notnull(x) else x)
combined_train_test = pd.get_dummies(combined_train_test, columns=['Cabin', 'Cabin_Letter'])

Once this is done, split train and test apart again:

train_data = combined_train_test[:891]
test_data = combined_train_test[891:]
titanic_train_data_X = train_data.drop(['Survived'], axis=1)
titanic_train_data_Y = train_data['Survived']
titanic_test_data_X = test_data.drop(['Survived'], axis=1)

4. Model Ensembling

Model ensembling proceeds in a few steps:

(1) Use several models to screen out the more important features:

import pandas as pd
from sklearn import ensemble, model_selection
from sklearn.ensemble import RandomForestClassifier

def get_top_n_features(titanic_train_data_X, titanic_train_data_Y, top_n_features):
    # Random forest
    rf_est = RandomForestClassifier(random_state=42)
    rf_param_grid = {'n_estimators': [500], 'min_samples_split': [2, 3], 'max_depth': [20]}
    rf_grid = model_selection.GridSearchCV(rf_est, rf_param_grid, n_jobs=25, cv=10, verbose=1)
    rf_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort the features by importance
    feature_imp_sorted_rf = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': rf_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_rf = feature_imp_sorted_rf.head(top_n_features)['feature']
    print('Sample 25 Features from RF Classifier')
    print(str(features_top_n_rf[:25]))

    # AdaBoost
    ada_est = ensemble.AdaBoostClassifier(random_state=42)
    ada_param_grid = {'n_estimators': [500], 'learning_rate': [0.5, 0.6]}
    ada_grid = model_selection.GridSearchCV(ada_est, ada_param_grid, n_jobs=25, cv=10, verbose=1)
    ada_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort
    feature_imp_sorted_ada = pd.DataFrame({'feature': list(titanic_train_data_X),
                                           'importance': ada_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_ada = feature_imp_sorted_ada.head(top_n_features)['feature']

    # ExtraTrees
    et_est = ensemble.ExtraTreesClassifier(random_state=42)
    et_param_grid = {'n_estimators': [500], 'min_samples_split': [3, 4], 'max_depth': [15]}
    et_grid = model_selection.GridSearchCV(et_est, et_param_grid, n_jobs=25, cv=10, verbose=1)
    et_grid.fit(titanic_train_data_X, titanic_train_data_Y)
    # Sort
    feature_imp_sorted_et = pd.DataFrame({'feature': list(titanic_train_data_X),
                                          'importance': et_grid.best_estimator_.feature_importances_}).sort_values('importance', ascending=False)
    features_top_n_et = feature_imp_sorted_et.head(top_n_features)['feature']
    print('Sample 25 Features from ET Classifier:')
    print(str(features_top_n_et[:25]))

    # Merge the top features picked by the three models, dropping duplicates
    features_top_n = pd.concat([features_top_n_rf, features_top_n_ada, features_top_n_et],
                               ignore_index=True).drop_duplicates()
    return features_top_n

(2) Restrict the training and test sets to the selected features

feature_to_pick = 250
feature_top_n = get_top_n_features(titanic_train_data_X, titanic_train_data_Y, feature_to_pick)
titanic_train_data_X = titanic_train_data_X[feature_top_n]
del titanic_train_data_X['Ticket_Number']  # results turned out better after dropping Ticket_Number
titanic_test_data_X = titanic_test_data_X[feature_top_n]
del titanic_test_data_X['Ticket_Number']
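Because the three per-model lists overlap, the merged set contains fewer than 3 × 250 unique features; a quick check (sketch):

# Number of distinct features kept after merging the three top-250 lists
print(len(feature_top_n))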

(3) Build the final prediction model with VotingClassifier

rf_est = ensemble.RandomForestClassifier(n_estimators=750, criterion='gini', max_features='sqrt',
                                         max_depth=3, min_samples_split=4, min_samples_leaf=2,
                                         n_jobs=50, random_state=42, verbose=1)
gbm_est = ensemble.GradientBoostingClassifier(n_estimators=900, learning_rate=0.0008, loss='exponential',
                                              min_samples_split=3, min_samples_leaf=2, max_features='sqrt',
                                              max_depth=3, random_state=42, verbose=1)
et_est = ensemble.ExtraTreesClassifier(n_estimators=750, max_features='sqrt', max_depth=35,
                                       n_jobs=50, criterion='entropy', random_state=42, verbose=1)
voting_est = ensemble.VotingClassifier(estimators=[('rf', rf_est), ('gbm', gbm_est), ('et', et_est)],
                                       voting='soft', weights=[3, 5, 2], n_jobs=50)
voting_est.fit(titanic_train_data_X, titanic_train_data_Y)
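Before predicting, it can be worth sanity-checking the ensemble with cross-validation (a sketch; the 5-fold choice is arbitrary):

from sklearn import model_selection

# Estimate the voting ensemble's generalization accuracy on the training set
cv_scores = model_selection.cross_val_score(voting_est, titanic_train_data_X, titanic_train_data_Y, cv=5)
print(cv_scores.mean(), cv_scores.std())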

PS: If you would rather not use VotingClassifier, you can also assign custom weights to each model's predictions according to their test accuracies and take the weighted average as the final prediction. In my own tests, custom weights performed no worse than VotingClassifier.
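A minimal sketch of that manual alternative (the weights below are illustrative, not tuned):

import numpy as np

# Fit each base model separately and average their class-1 probabilities with custom weights
probas = [est.fit(titanic_train_data_X, titanic_train_data_Y).predict_proba(titanic_test_data_X)[:, 1]
          for est in (rf_est, gbm_est, et_est)]
avg_proba = np.average(np.vstack(probas), axis=0, weights=[3, 5, 2])
manual_pred = (avg_proba >= 0.5).astype(int)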

(4) Predict and generate the submission file

titanic_test_data_X['Survived'] = voting_est.predict(titanic_test_data_X)
submission = pd.DataFrame({'PassengerId': test_data_org.loc[:, 'PassengerId'],
                           'Survived': titanic_test_data_X.loc[:, 'Survived']})
submission.to_csv('submission_result.csv', index=False, sep=',')

That completes the whole pipeline.

The runnable code above draws on what was shared in Kernels. The best run scored 80.8%, placing in the top 8%. The code is fairly involved, more than 400 lines in total, but it considers the various factors thoroughly and the functions are written in quite a standard style, which makes it good material for newcomers to learn from.

My skill is limited, so if there are any mistakes, corrections are welcome.

