Predicting Survivors of the Titanic Disaster
From the column AI很好玩
The dataset used in this article comes from the Kaggle competition platform. It is split into two files, train.csv and test.csv, used for model training and validation respectively. The main purpose of entering a Kaggle competition is to develop a data-analysis mindset: first build an analysis framework and then fill it in, always starting from the problem to be solved, summarising the requirements, settling on a solution, and then deciding which data-mining techniques to use to implement it. Below is the thought process for this competition, organised into five parts.
The code for this article is available on my GitHub:
https://github.com/irving2/Data-Mining/tree/master/pytanic

1. Dataset Overview
First, read in the dataset:
import pandas as pd
import numpy as np
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"   # show every expression's output in a notebook cell

data_train = pd.read_csv('train.csv', index_col='PassengerId')
data_train.head(5)
data_train.info()
These are all the features in the train dataset. Let's start with a rough look at each of them. The Survived feature is the label indicating whether a passenger survived; it is the variable we need to predict for the test set, and all the other features serve as predictors. Note that more features do not automatically give a better model; the goal is to find the most suitable ones.
PassengerId: a random sample ID with no real meaning.
Ticket: a random identifier with little meaning; this feature can be dropped.
Pclass: cabin class, a proxy for a passenger's socio-economic status: 1 = upper class, 2 = middle class, 3 = lower class.
Name: the passenger's name. Some feature engineering can be done here: from Name we can derive the sex and family size, and prefixed titles such as Dr or Master can reflect a passenger's socio-economic status.
Age and Fare: age and ticket fare, both continuous variables.
SibSp: the number of siblings or spouses aboard.
Parch: the number of parents or children aboard.
Cabin: the cabin. We could engineer a feature for where the cabins are located on the ship, but only 204 samples have a valid value; with so many missing values this feature could mislead the model, so it is dropped.
2. Data Cleaning and Feature Engineering
Load test.csv as well and put it, together with a deep copy of data_train, into a single list data_cleaner so that both can be cleaned in one pass:
data_test = pd.read_csv('test.csv')
data_train_copy = data_train.copy(deep=True)  # deep copy: creates a completely new object
data_cleaner = [data_train_copy, data_test]
Check the missing values in the two datasets:
data_train.isnull().sum()
data_test.isnull().sum()
Fill the missing values of Age, Embarked, and Fare as follows, and drop the irrelevant features:
for data_set in data_cleaner:
    data_set['Age'].fillna(data_set['Age'].median(), inplace=True)                  # fill with the median
    data_set['Embarked'].fillna(data_set['Embarked'].mode().iloc[0], inplace=True)  # fill with the mode
    data_set['Fare'].fillna(data_set['Fare'].median(), inplace=True)                # fill with the median
    drop_columns = ['PassengerId', 'Cabin', 'Ticket']
    # PassengerId is already the index of the training copy, so ignore columns that are not present
    data_set.drop(drop_columns, axis=1, inplace=True, errors='ignore')              # drop the irrelevant features
Now the feature engineering: by transforming the existing features we create new ones that carry the information we need:
for data_set in data_cleaner:
    data_set['FamilySize'] = data_set['SibSp'] + data_set['Parch'] + 1  # family size: all relatives plus the passenger (Rose's family includes her fiancé and her mother)
    data_set['IsAlone'] = 1                                             # travelling alone, like Jack; initialise everyone to 1
    data_set.loc[data_set['FamilySize'] > 1, 'IsAlone'] = 0
    data_set['Title'] = data_set['Name'].str.split(', ', expand=True)[1].str.split('.', expand=True)[0]  # extract the title from the name
    data_set['Fare_Bin'] = pd.qcut(data_set['Fare'], 4)                 # quantile-discretise the continuous Fare into 4 bins
    data_set['Age_Bin'] = pd.cut(data_set['Age'], 5)                    # interval-discretise the continuous Age into 5 bins

data_train_copy.head(3)
With the code above, the continuous variables Fare and Age are discretised, and five new features are created in data_train_copy and data_test: FamilySize, IsAlone, Title, Fare_Bin, and Age_Bin.
data_train_copy['Title'].value_counts()

data_test['Title'].value_counts()
The new Title feature contains many titles that appear only once or a few times; some have no obvious meaning and some may be arbitrary. Using 20 as the threshold, titles that appear fewer than 20 times are merged into a single category:
# merge the rare titles into a single category, 'misc'
for data_set in data_cleaner:
    threshold_num = 20
    title_state = data_set['Title'].value_counts() < threshold_num
    title_change = []
    for title in title_state.index:
        if title_state[title]:
            title_change.append(title)
    data_set.loc[data_set['Title'].isin(title_change), 'Title'] = 'misc'

print(data_train_copy['Title'].value_counts())
print(data_test['Title'].value_counts())
The Name feature has now served its purpose, so drop the Name column:
for data_set in data_cleaner:
    data_set.drop('Name', axis=1, inplace=True)
Categorical (object-type) features cannot be understood directly by the algorithms and need to be encoded. We can use pandas' get_dummies, or sklearn's LabelEncoder:
need_get_dummies = ['Sex', 'Embarked', 'Title']
data_cleaner_dummy = []
for data_set in data_cleaner:
    has_get_dummy = pd.get_dummies(data_set[need_get_dummies], prefix=need_get_dummies)
    data_set_dummy = data_set.join(has_get_dummy)
    data_cleaner_dummy.append(data_set_dummy)

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
for dataset in data_cleaner_dummy:
    dataset['AgeBin_Code'] = le.fit_transform(dataset['Age_Bin'])
    dataset['FareBin_Code'] = le.fit_transform(dataset['Fare_Bin'])

drop_col = ['Sex', 'Embarked', 'Title', 'Fare_Bin', 'Age_Bin']
data_train_dummy = data_cleaner_dummy[0].drop(drop_col, axis=1)
data_test_dummy = data_cleaner_dummy[1].drop(drop_col, axis=1)
The code above one-hot encodes [Sex, Embarked, Title] for both data_train_copy and data_test in the data_cleaner list, and uses LabelEncoder to encode [Fare_Bin, Age_Bin].
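To make the difference between the two encodings concrete, here is a tiny illustration on a made-up column (the toy DataFrame below is my own example, not part of the pipeline): get_dummies creates one binary column per category with no implied ordering, while LabelEncoder maps the categories onto a single integer column.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})   # hypothetical toy column

# one-hot encoding: three binary columns Embarked_C, Embarked_Q, Embarked_S
print(pd.get_dummies(toy['Embarked'], prefix='Embarked'))

# label encoding: a single integer column (C -> 0, Q -> 1, S -> 2)
print(LabelEncoder().fit_transform(toy['Embarked']))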
Next, separate the training data into X and y, and then split it into a 75% training set and a 25% test set:
X = data_train_dummy.drop('Survived', axis=1)
Y = data_train['Survived']

from sklearn import model_selection
x_train, x_test, y_train, y_test = model_selection.train_test_split(X, Y, random_state=3, test_size=0.25)
3. Statistical Description of the Data
With data cleaning and feature engineering done, this section uses charts to describe the variables statistically.
The gender composition of the training set:
%matplotlib inline
import matplotlib.pyplot as plt

# handle Chinese characters and the minus sign in plots
plt.rcParams['font.sans-serif'] = ['Microsoft YaHei']
plt.rcParams['axes.unicode_minus'] = False

plt.axes(aspect='equal')  # equal aspect ratio so the pie chart is a circle rather than an ellipse
# value_counts() returns the majority class (male) first, so label in that order
plt.pie(data_train_copy['Sex'].value_counts(), explode=[0.1, 0], startangle=30,
        labels=['male', 'female'], autopct='%.1f%%', radius=2)
In the training set, 64.8% of passengers are male and 35.2% are female.
Now look at the male/female ratio among the passengers who were rescued:
plt.axes(aspect='equal')
plt.pie(data_train_copy['Sex'].groupby(data_train_copy['Survived']).value_counts()[1],
        explode=[0.1, 0], startangle=30, autopct='%.1f%%', radius=2, labels=['female', 'male'])

data_train_copy['Sex'].groupby(data_train_copy['Survived']).value_counts()
In the pie chart above, 68.1% of the rescued passengers are female and only 31.9% are male; women were far more likely to be rescued than men.
Next, let's look at survival by port of embarkation. The Titanic took on passengers at three ports: C = Cherbourg, Q = Queenstown, S = Southampton. First, plot the number of passengers who embarked at each port as a bar chart:
data_train_copy['Embarked'].value_counts().plot(kind='bar', grid=True, title='Embarked')
Southampton has by far the most embarking passengers. Next, compare the survival rate of the passengers who embarked at each port:
df_Embarked = data_train_copy['Embarked'].groupby(data_train_copy['Survived']).value_counts()
survive_rate = []
for i in range(3):
    s_rate = df_Embarked[1][i] / (df_Embarked[0][i] + df_Embarked[1][i])
    survive_rate.append(s_rate)

plt.bar(np.arange(3), height=survive_rate, facecolor='lightskyblue')
for x, y in zip(np.arange(3), survive_rate):
    plt.text(x, y + 0.01, '%.2f' % y, ha='center', va='bottom')
plt.xticks(range(3), ('S', 'C', 'Q'))
Passengers who embarked at S (Southampton) have the lowest survival rate of the three ports, only 34%, while those who embarked at C (Cherbourg) have the highest, 55%.
We won't chart the survival rate for every remaining feature; the following few lines compute them all:
data_tab = ['Title', 'SibSp', 'Parch', 'FamilySize', 'IsAlone', 'Pclass', 'Embarked']
for column in data_tab:
    print('Survival rate by %s:' % column)
    print(data_train_copy[[column, 'Survived']].groupby(column, as_index=False).mean())
    print('*' * 20)
The output is:
Survival rate by Title:
    Title  Survived
0  Master  0.575000
1    Miss  0.697802
2      Mr  0.156673
3     Mrs  0.792000
4    misc  0.444444
********************
Survival rate by SibSp:
   SibSp  Survived
0      0  0.345395
1      1  0.535885
2      2  0.464286
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
********************
Survival rate by Parch:
   Parch  Survived
0      0  0.343658
1      1  0.550847
2      2  0.500000
3      3  0.600000
4      4  0.000000
5      5  0.200000
6      6  0.000000
********************
Survival rate by FamilySize:
   FamilySize  Survived
0           1  0.303538
1           2  0.552795
2           3  0.578431
3           4  0.724138
4           5  0.200000
5           6  0.136364
6           7  0.333333
7           8  0.000000
8          11  0.000000
********************
Survival rate by IsAlone:
   IsAlone  Survived
0        0  0.505650
1        1  0.303538
********************
Survival rate by Pclass:
   Pclass  Survived
0       1  0.629630
1       2  0.472826
2       3  0.242363
********************
Survival rate by Embarked:
  Embarked  Survived
0        C  0.553571
1        Q  0.389610
2        S  0.339009
********************

From these statistics we can see that passengers whose name carries the title Mrs survived at a rate of 79.2%; passengers with one sibling or spouse aboard at 53.5%; passengers with one parent or child aboard at 55.1%. Passengers travelling alone survived at only 30.3%, while those travelling with company survived at 50.6%. When disaster strikes, having people you know who can help one another clearly makes a big difference!
For first class versus third class the survival rates are 63.0% versus 24.2%: more than half of the wealthy survived, but only about a quarter of the poor escaped with their lives.
The following code draws box plots of the three features Fare, Age, and FamilySize:
# box plots
plt.figure(figsize=(16, 7))

plt.subplot(1, 3, 1)
plt.boxplot(x=data_train_copy['Fare'], showmeans=True, meanline=True)
plt.title('Fare boxplot')
plt.ylabel('price of ticket')

plt.subplot(1, 3, 2)
plt.boxplot(x=data_train_copy['Age'], showmeans=True, meanline=True)
plt.title('Age boxplot')
plt.ylabel('Age')

plt.subplot(1, 3, 3)
plt.boxplot(x=data_train_copy['FamilySize'], showmeans=True, meanline=True)
plt.title('FamilySize boxplot')
plt.ylabel('FamilySize')
The box plots show that both the 75th percentile and the mean of the ticket fare sit around $30; the mean is pulled up by a small number of very expensive tickets, and with first-class and third-class fares differing so much, the treatment the passengers received must have been worlds apart. The mean and median age are both in the 28 to 30 range and the 75th percentile is only 35, so passengers around 30 make up the bulk of those aboard. For FamilySize, 75% of passengers have a family size of 2 or fewer.
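These statements are easy to verify numerically; the line below is my own addition and simply prints the quartiles and means discussed above.

print(data_train_copy[['Fare', 'Age', 'FamilySize']].describe())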
The following code gives the Pearson correlation coefficients between the features of the training set:
import seaborn as sns
sns.heatmap(data_train_copy.corr(), cmap="BrBG", annot=True)
So how does age affect survival? Did the young and strong use their advantage to survive, or was it really as the movie Titanic depicts, with children and the elderly given priority for the lifeboats? The following code plots kernel density estimates of age for survivors and victims:
# kernel density estimates of age for survivors and victims
import seaborn as sns
a = sns.FacetGrid(data_train_copy, hue='Survived', aspect=4)
a.map(sns.kdeplot, 'Age', shade=True)
a.set(xlim=(0, data_train_copy['Age'].max()))
a.add_legend()
The plot confirms what the movie depicts: at the two ends of the curve (children and the elderly) the probability of surviving (the area under the curve) is larger than the probability of dying, while around age 30 the probability of dying is clearly larger. More striking than the disaster itself is the greatness of human nature.
4. Building the Models
4.1 Base model
A decision tree is chosen as the base model here. Decision trees have many advantages: the results are easy to understand and interpret; they make few demands on data preparation (tidying, dummy coding), so both categorical and numerical variables can be used for training; they extend easily to multi-class problems; and their time complexity is logarithmic in the amount of training data, so they need little computing power.
Of course, decision trees also have drawbacks: they overfit easily (which can be mitigated by pruning, by setting a minimum number of samples required to split a node, or a minimum number of samples per leaf), and they are unstable, since a small perturbation of the samples can produce a completely different tree (which can be mitigated with ensemble techniques).
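As a rough illustration of the pruning point (my own sketch, reusing the x_train/x_test split created earlier, with an arbitrarily chosen max_depth and min_samples_leaf), constraining the tree usually narrows the gap between training and test accuracy:

from sklearn import tree

for name, clf in [('unconstrained tree', tree.DecisionTreeClassifier(random_state=3)),
                  ('max_depth=4, min_samples_leaf=5',
                   tree.DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=3))]:
    clf.fit(x_train, y_train)                               # fit on the 75% training split
    print(name,
          'train acc: %.3f' % clf.score(x_train, y_train),  # accuracy on the training split
          'test acc: %.3f' % clf.score(x_test, y_test))     # accuracy on the held-out 25%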
A model that predicts very well on the training set can still perform badly on data it has never seen. It is like a student who learns only by rote: they can answer an exam question identical to one they memorised, but cannot handle a new question that requires applying the idea. In other words, the model generalises poorly and has overfit. To guard against this we should train and evaluate on different subsets of the data, which is why cross-validation is so important.
# base model: decision tree
from sklearn import tree
from sklearn import model_selection

dtree = tree.DecisionTreeClassifier(random_state=3)
base_results = model_selection.cross_validate(dtree, X, Y, cv=5, return_train_score=True)
print(dtree)
print('train score: %s' % (base_results['train_score'].mean() * 100))
print('test score: %s' % (base_results['test_score'].mean() * 100))
print('test score std*3: ± %s' % (base_results['test_score'].std() * 100 * 3))
Output:
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=3,
            splitter='best')
train score: 98.4569457895
test score: 76.7663834183
test score std*3: ± 5.93501171153

As the output shows, a decision tree with sklearn's default parameters, evaluated with 5-fold cross-validation, reaches about 98% accuracy on the training folds but only about 77% on the test folds, with a 3-standard-deviation range of ±5.93 on the test accuracy. Next, tune the decision tree with GridSearchCV:
# tune the decision tree with GridSearchCV
import time

param_grid = [{'criterion': ['gini', 'entropy'],                          # splitting criterion: Gini impurity or information gain
               'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None],           # maximum depth of the tree
               'min_samples_split': [2, 4, 6, 8, 10, 0.01, 0.03, 0.05],   # minimum samples needed to split a node (floats are fractions)
               'min_samples_leaf': [1, 5, 10, 0.01, 0.03, 0.05],          # minimum samples required at a leaf node
               'max_features': [None, 'auto', 3, 5, 7, 9],
               'random_state': [3]}]

tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid,
                                          scoring='roc_auc', cv=5)
print('starting GridSearch...')
start = time.time()
tune_model.fit(X, Y)
end = time.time()
print('GridSearch ended...')
time_cost = end - start
print('GridSearch has cost:', time_cost)
starting GridSearch...
GridSearch ended...
GridSearch has cost: 148.57738137245178

The parameters to search over are placed in param_grid; the whole run takes less than 3 minutes on my laptop.
print('*' * 30)
print(tune_model.best_params_)
print('*' * 30)
print('best train score:', tune_model.cv_results_['mean_train_score'][tune_model.best_index_] * 100)
print('best test score:', tune_model.cv_results_['mean_test_score'][tune_model.best_index_] * 100)
print('best test score std*3:±', tune_model.cv_results_['std_test_score'][tune_model.best_index_] * 100 * 3)
******************************
{'criterion': 'entropy', 'max_depth': 4, 'max_features': 9, 'min_samples_leaf': 1, 'min_samples_split': 6, 'random_state': 3}
******************************
best train score: 88.9237895291
best test score: 87.4746824813
best test score std*3:± 8.04227955844

This prints the parameters of the model with the highest cross-validation score.
Build a decision tree with the parameters found by the grid search, generate predictions, and submit them to the Kaggle platform:
mytree = tree.DecisionTreeClassifier(**tune_model.best_params_)
mytree.fit(X, Y)

data_test_copy = pd.read_csv('test.csv')
data_test_copy['Survived'] = mytree.predict(data_test_dummy)
submit = data_test_copy[['PassengerId', 'Survived']]
submit.to_csv('dtree.csv', index=False)
Let's look at the score:
The accuracy is only 78.468%, so let's keep optimising. Keep trying!
4.2 Feature selection and tuning
As mentioned earlier, more features are not necessarily better; the goal is to find the most valuable ones. Here I use sklearn's RFE (recursive feature elimination) combined with CV (cross-validation):
from sklearn import feature_selection

dtree = tree.DecisionTreeClassifier()
dtree_rfe = feature_selection.RFECV(dtree, step=1, scoring='accuracy', cv=5)
dtree_rfe.fit(X=X, y=Y)
X_rfe = X.columns.values[dtree_rfe.get_support()]  # the selected features
# print(X[X_rfe].head())

rfe_results = model_selection.cross_validate(mytree, X[X_rfe], Y, cv=5, return_train_score=True)
print('rfe_train_results:', rfe_results['train_score'].mean() * 100)
print('rfe_test_results:', rfe_results['test_score'].mean() * 100)
print('rfe_test_score std*3:±', rfe_results['test_score'].std() * 100 * 3)
rfe_train_results: 90.3198549855
rfe_test_results: 80.4774367709
rfe_test_score std*3:± 8.19692080796

Next, tune a decision tree on the RFE-selected features:
param_grid = [{'criterion': ['gini', 'entropy'],                          # splitting criterion: Gini impurity or information gain
               'max_depth': [2, 3, 4, 5, 6, 7, 8, 9, 10, None],           # maximum depth of the tree
               'min_samples_split': [2, 4, 6, 8, 10, 0.01, 0.03, 0.05],   # minimum samples needed to split a node (floats are fractions)
               'min_samples_leaf': [1, 5, 10, 0.01, 0.03, 0.05],          # minimum samples required at a leaf node
               'max_features': [None, 'auto'],
               'random_state': [3]}]

rfe_tune_model = model_selection.GridSearchCV(tree.DecisionTreeClassifier(), param_grid=param_grid,
                                              scoring='roc_auc', cv=5)
rfe_tune_model.fit(X[X_rfe], Y)

print(rfe_tune_model.best_params_, '\n', '*' * 25)
print('best train score:%s' % (rfe_tune_model.cv_results_['mean_train_score'][rfe_tune_model.best_index_] * 100))
print('best test score:%s' % (rfe_tune_model.cv_results_['mean_test_score'][rfe_tune_model.best_index_] * 100))
print('best test score std*3: ±%s' % (rfe_tune_model.cv_results_['std_test_score'][rfe_tune_model.best_index_] * 100))
The best parameters and scores:
{'criterion': 'gini', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'random_state': 3}
*************************
best train score:87.8501939882
best test score:83.8446592525
best test score std*3: ±1.97841531283

The resulting accuracy is still not ideal; a single weak classifier has limited power. Next, we combine several models to build a stronger classifier.
4.3 Model ensemble
First, split the data with StratifiedKFold and evaluate the classification accuracy of eight common classifiers. StratifiedKFold is used like KFold, but it samples in a stratified way, ensuring that the class proportions in the training and test folds match those of the original dataset.
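A minimal check (my own addition) of what stratification means in practice: the proportion of survivors in every fold stays close to the overall rate in Y.

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5)
print('overall survival rate: %.3f' % Y.mean())
for fold, (train_idx, test_idx) in enumerate(skf.split(X, Y)):
    # each fold keeps roughly the same class balance as the full dataset
    print('fold %d: train %.3f, test %.3f' % (fold, Y.iloc[train_idx].mean(), Y.iloc[test_idx].mean()))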
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier, ExtraTreesClassifier, VotingClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve

kfold = StratifiedKFold(n_splits=5)
random_state = 3

classifiers = []
classifiers.append(SVC(random_state=random_state))
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(AdaBoostClassifier(DecisionTreeClassifier(random_state=random_state), random_state=random_state, learning_rate=0.1))
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier())
classifiers.append(LogisticRegression(random_state=random_state))

cv_results = []
cv_means = []
cv_std = []
for classifier in classifiers:
    cv_results.append(cross_val_score(classifier, X, Y, scoring='accuracy', cv=kfold))
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({'score_means': cv_means,
                       'score_std': cv_std,
                       'Algorithm': ["SVC", "DecisionTree", "AdaBoost",
                                     "RandomForest", "ExtraTrees", "GradientBoosting",
                                     "KNeighboors", "LogisticRegression"]})
print(cv_res.sort_values('score_means', ascending=False))
The output:
            Algorithm  score_means  score_std
5    GradientBoosting     0.833913   0.014236
7  LogisticRegression     0.820474   0.024601
3        RandomForest     0.798008   0.031581
2            AdaBoost     0.776716   0.031944
4          ExtraTrees     0.773257   0.020311
1        DecisionTree     0.767664   0.019783
0                 SVC     0.746473   0.032660
6         KNeighboors     0.731840   0.029179

# visualise the scores of the classifiers
g = sns.barplot("score_means", "Algorithm",
                data=cv_res.sort_values('score_means', ascending=False),
                palette="Set3", orient="h", **{'xerr': cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
I decided to take the five classifiers with the highest accuracy as the weak learners for the ensemble: GradientBoosting, LogisticRegression, RandomForest, AdaBoost, and ExtraTrees.
Next, tune each of these five weak learners with GridSearchCV:
- GradientBoosting tuning:
# GradientBoosting tuning
GBC = GradientBoostingClassifier()
gb_param_grid = {'loss': ["deviance"],
                 'n_estimators': [200, 300, 350, 400],
                 'learning_rate': [0.1, 0.05, 0.01],
                 'max_depth': [4, 6, 8],
                 'min_samples_leaf': [1, 3, 10, 0.01, 0.03],
                 'min_samples_split': [2, 3, 10, 0.01, 0.03],
                 'max_features': [2, 3, 5, 'auto', None]}

gsGBC = GridSearchCV(GBC, param_grid=gb_param_grid, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gsGBC.fit(X, Y)
GBC_best = gsGBC.best_estimator_
print('GradientBoostingClassifier best params:', GBC_best)
print('*' * 25)
print(gsGBC.best_score_)
The output:
GradientBoostingClassifier best params: GradientBoostingClassifier(criterion='friedman_mse', init=None,
              learning_rate=0.01, loss='deviance', max_depth=6,
              max_features=None, max_leaf_nodes=None,
              min_impurity_decrease=0.0, min_impurity_split=None,
              min_samples_leaf=0.01, min_samples_split=0.03,
              min_weight_fraction_leaf=0.0, n_estimators=300,
              presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False)
*************************
0.847362514029

- LogisticRegression tuning:
# LogisticRegression tuning
LR = LogisticRegression()
LR.get_params().keys()
lr_param_grid = {'tol': [1e-4, 1e-5, 1e-6],
                 'C': [0.1, 0.3, 0.5, 0.7, 1],
                 'max_iter': [200, 300]}

gsLR = GridSearchCV(LR, param_grid=lr_param_grid, cv=kfold, scoring="accuracy", n_jobs=1, verbose=1)
gsLR.fit(X, Y)
LR_best = gsLR.best_estimator_
print('LogisticRegression best params:', LR_best)
gsLR.best_score_
Output:
LogisticRegression best params: LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=200, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)
0.8204264870931538
- RandomForest tuning:
# RandomForest parameter tuning
RFC = RandomForestClassifier()
rf_param_grid = {'max_depth': [None, 5, 7, 9, 10],
                 'max_features': [1, 3, 5, 10],
                 'min_samples_split': [2, 3, 10],
                 'min_samples_leaf': [1, 3, 10],
                 'bootstrap': [False],
                 'n_estimators': [100, 200, 300],
                 'criterion': ['gini', 'entropy']}

gsRFC = GridSearchCV(RFC, param_grid=rf_param_grid, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gsRFC.fit(X, Y)
RFC_best = gsRFC.best_estimator_
print('RFC_best = gsRFC.best_estimator_:', RFC_best)
gsRFC.best_score_
RFC_best = gsRFC.best_estimator_: RandomForestClassifier(bootstrap=False, class_weight=None, criterion='gini',
            max_depth=9, max_features=10, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=10, min_samples_split=10,
            min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=1,
            oob_score=False, random_state=None, verbose=0, warm_start=False)
0.84287317620650959
- AdaBoost tuning:
# AdaBoost tuning
DTC = DecisionTreeClassifier()
adaDTC = AdaBoostClassifier(DTC, random_state=3)
ada_param_grid = {"base_estimator__criterion": ["gini", "entropy"],
                  "base_estimator__splitter": ["best", "random"],
                  "algorithm": ["SAMME", "SAMME.R"],
                  "n_estimators": [200, 400, 600],
                  "learning_rate": [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3, 1.5]}

gsadaDTC = GridSearchCV(adaDTC, param_grid=ada_param_grid, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gsadaDTC.fit(X, Y)
ada_best = gsadaDTC.best_estimator_
print('ada_best:', ada_best)
print('*' * 25)
print(gsadaDTC.best_score_)
The output:
ada_best: AdaBoostClassifier(algorithm='SAMME.R',
          base_estimator=DecisionTreeClassifier(class_weight=None, criterion='entropy', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='random'),
          learning_rate=0.01, n_estimators=600, random_state=3)
*************************
0.803591470258

- ExtraTrees tuning:
# ExtraTrees tuning
ExtC = ExtraTreesClassifier()
ex_param_grid = {"max_depth": [None, 4, 6, 8],
                 "max_features": [2, 3, 7, 10],
                 "min_samples_split": [2, 6, 10],
                 "min_samples_leaf": [1, 3, 5, 10],
                 "bootstrap": [False],
                 "n_estimators": [200, 400],
                 "criterion": ["gini", "entropy"]}

gsExtC = GridSearchCV(ExtC, param_grid=ex_param_grid, cv=kfold, scoring="accuracy", n_jobs=2, verbose=1)
gsExtC.fit(X, Y)
ExtC_best = gsExtC.best_estimator_
print('ExtC_best', ExtC_best)
print('*' * 25)
print(gsExtC.best_score_)
ExtC_best ExtraTreesClassifier(bootstrap=False, class_weight=None, criterion='entropy',
           max_depth=None, max_features=7, max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=1, min_samples_split=10,
           min_weight_fraction_leaf=0.0, n_estimators=200, n_jobs=1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
*************************
0.835016835017

Next, plot the learning curves of the five models. A learning curve shows how training and validation accuracy change as the number of training samples grows, which reflects the bias-variance trade-off and lets us judge whether a model is underfitting or overfitting.
def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None, n_jobs=-1,
                        train_sizes=np.linspace(.1, 1.0, 5)):
    """Plot the learning curve of an estimator."""
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1, color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r", label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g", label="Cross-validation score")
    plt.legend(loc="best")
    return plt

g = plot_learning_curve(gsRFC.best_estimator_, "RF learning curves", X, Y, cv=kfold)
g = plot_learning_curve(gsExtC.best_estimator_, "ExtraTrees learning curves", X, Y, cv=kfold)
g = plot_learning_curve(gsLR.best_estimator_, "LR learning curves", X, Y, cv=kfold)
g = plot_learning_curve(gsadaDTC.best_estimator_, "AdaBoost learning curves", X, Y, cv=kfold)
g = plot_learning_curve(gsGBC.best_estimator_, "GradientBoosting learning curves", X, Y, cv=kfold)
The learning curves show that the AdaBoost and ExtraTrees models tend to overfit, while the two curves of the GradientBoosting and LogisticRegression models end up very close together, so these two generalise better. With more training samples, the RandomForest model would likely perform better.
Finally, combine the five models with a VotingClassifier, which decides the class by voting. The voting parameter is set to soft, which combines the predicted probabilities of the five models to determine the final class; the alternative, hard, takes the five models' predicted labels and uses a simple majority vote.
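Before settling on soft voting, the two modes can be compared directly. The snippet below is my own sketch: it cross-validates a hard-voting and a soft-voting ensemble built from the same five tuned estimators (all five support predict_proba, which soft voting requires).

from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score

estimators = [('rfc', RFC_best), ('extc', ExtC_best), ('lr', LR_best),
              ('adac', ada_best), ('gbc', GBC_best)]

hard_vote = VotingClassifier(estimators=estimators, voting='hard')   # majority vote on predicted labels
soft_vote = VotingClassifier(estimators=estimators, voting='soft')   # average of predicted probabilities

print('hard voting accuracy: %.4f' % cross_val_score(hard_vote, X, Y, cv=kfold, scoring='accuracy').mean())
print('soft voting accuracy: %.4f' % cross_val_score(soft_vote, X, Y, cv=kfold, scoring='accuracy').mean())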
votingC = VotingClassifier(estimators=[('rfc', RFC_best), ('extc', ExtC_best), ('lr', LR_best),
                                       ('adac', ada_best), ('gbc', GBC_best)],
                           voting='soft', n_jobs=2, weights=[0, 0, 0, 0, 1])  # weights applied to each model's probabilities
votingC.fit(X, Y)

# generate the submission file
y_pred = pd.Series(votingC.predict(data_test_dummy), name='Survived')
submit = pd.concat([data_test_copy['PassengerId'], y_pred], axis=1)
submit.to_csv('submit_final.csv', index=False)
After submitting, the Kaggle platform reports a prediction accuracy of 79.9%.
Summary
This hands-on data-mining project sharpened my ability to analyse and solve problems independently, and made me more fluent with data cleaning, feature engineering, and building and tuning models. There is still room to improve the final score; I will keep optimising in two directions: outlier handling and feature engineering.