Into Kaggle's TOP 5% in No Time, Part 2

周知瑞

evilpsycho@icloud.com

2017.8.25

Keywords: ensemble learning, kaggle, xgboost, sklearn

PS: Now that life's little chores are finally out of the way, this express train will soon be departing on a weekly schedule. Don't miss it!

Kaggle series, part 1: zhuanlan.zhihu.com/p/27

Kaggle series, part 2: zhuanlan.zhihu.com/p/28

Code on GitHub: EvilPsyCHo/zhihu_public_code

Next episode preview: taking on a million-dollar challenge

----------- the sexy and seductive divider ----------

The Story So Far

You've probably forgotten most of it by now... so here's a portal back to Kaggle part 1. There we did a fairly thorough analysis of the Titanic survival-prediction data, some data processing/cleaning and feature extraction, and finally predicted with logistic regression, landing somewhere around the TOP 60%.

That result is obviously nothing to brag about, so we rummaged through the feature-engineering toolbox once more and unleashed the heavy artillery of ensemble learning, successfully breaking into the TOP 5%. Without further ado, straight to the main course today. I'm updating this for you on an empty stomach; it really isn't easy...

Feature Engineering

  • Packages we'll use

# data handling
import numpy as np
import pandas as pd

# plotting
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# models and preprocessing utilities
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.metrics import precision_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, StratifiedKFold, learning_curve
import warnings
warnings.filterwarnings("ignore")

  • Loading the data

train_df = pd.read_csv("E:/XXX/zhihu/kaggle/titanic/Part2/input/train.csv")
test_df = pd.read_csv("E:/XXX/zhihu/kaggle/titanic/Part2/input/test.csv")
combine_df = pd.concat([train_df, test_df])

  • The Name feature

Remember how in part 1, out of sheer laziness (allow me a tear of regret), we heartlessly threw the name feature away? Today we pick it back up...

Anyone who frequents social occasions knows that how we address someone carries a lot of implicit information: gender, status, wealth, marital situation and so on. So let's extract the title from each name.

While we're at it, let's also extract the length of each name, because if you run a quick statistic you'll find that name length is correlated with survival:

train_df.groupby(train_df.Name.apply(lambda x: len(x)))["Survived"].mean().plot()

The longer the name, the higher the survival probability? You can speculate about the logic behind this (perhaps longer, more elaborate names hint at higher social standing), so we add name length to the feature set.

combine_df["Name_Len"] = combine_df["Name"].apply(lambda x: len(x))combine_df["Name_Len"] = pd.qcut(combine_df["Name_Len"],5)

Similarly, different titles come with markedly different survival probabilities:

combine_df.groupby(
    combine_df["Name"].apply(lambda x: x.split(", ")[1]).apply(lambda x: x.split(".")[0])
)["Survived"].mean().plot()

So we extract the title. Some titles cover very few passengers, so we need to map them onto the common ones. How? Time for a quick English lesson...

Mme: used for married "upper-class" women of non-English-speaking nationalities, and for professional women; equivalent to Mrs

Jonkheer: country squire (an honorific for minor nobility)

Capt: captain? ...

Lady: the title for a noblewoman

Don: a Spanish honorific for noblemen and men of standing

Sir: you all know this one

the Countess: a countess

Ms: Ms. or Mz, a recent American usage for a woman of unspecified marital status

Col: Colonel (Col.), as opposed to Lieutenant Colonel (Lt. Col.)

Major: major

Mlle: mademoiselle, i.e. miss

Rev: reverend, a clergyman

Dona: an honorific for ladies, which appears only in the test set

There, a free English lesson on the house.

combine_df["Title"] = combine_df["Name"].apply(lambda x: x.split(", ")[1]).apply(lambda x: x.split(".")[0])combine_df["Title"] = combine_df["Title"].replace(["Don","Dona", "Major", "Capt", "Jonkheer", "Rev", "Col","Sir","Dr"],"Mr")combine_df["Title"] = combine_df["Title"].replace(["Mlle","Ms"], "Miss")combine_df["Title"] = combine_df["Title"].replace(["the Countess","Mme","Lady","Dr"], "Mrs")df = pd.get_dummies(combine_df["Title"],prefix="Title")combine_df = pd.concat([combine_df,df],axis=1)

  • Families where a woman died & families where a man survived

In the Titanic setting, a woman dying or a man surviving is a rare event, so a model will readily predict that female passengers survive and male passengers die. To improve the model's ability to recognize these minority groups, we dug into the data and found an important signal: family. Members of the same family tend to share the same survival pattern to a large degree; for example, if one woman in a family died, the other women in that family were also more likely to die.

So all we need to do is flag these special families:

combine_df["Fname"] = combine_df["Name"].apply(lambda x:x.split(",")[0])combine_df["Familysize"] = combine_df["SibSp"]+combine_df["Parch"]dead_female_Fname = list(set(combine_df[(combine_df.Sex=="female") & (combine_df.Age>=12) & (combine_df.Survived==0) & (combine_df.Familysize>1)]["Fname"].values))survive_male_Fname = list(set(combine_df[(combine_df.Sex=="male") & (combine_df.Age>=12) & (combine_df.Survived==1) & (combine_df.Familysize>1)]["Fname"].values))combine_df["Dead_female_family"] = np.where(combine_df["Fname"].isin(dead_female_Fname),1,0)combine_df["Survive_male_family"] = np.where(combine_df["Fname"].isin(survive_male_Fname),1,0)combine_df = combine_df.drop(["Name","Fname"],axis=1)

  • Age

We use the same approach as in part 1 to fill in missing ages, and add a child flag:

group = combine_df.groupby(["Title", "Pclass"])["Age"]
combine_df["Age"] = group.transform(lambda x: x.fillna(x.median()))  # fill missing ages with the Title/Pclass median
combine_df = combine_df.drop("Title", axis=1)
combine_df["IsChild"] = np.where(combine_df["Age"]<=12, 1, 0)
combine_df["Age"] = pd.cut(combine_df["Age"], 5)
combine_df = combine_df.drop("Age", axis=1)  # the binned Age is discarded; only the IsChild flag is kept

  • Familysize

Next we discretize the Familysize feature we built above:

combine_df["Familysize"] = np.where(combine_df["Familysize"]==0, "solo", np.where(combine_df["Familysize"]<=3, "normal", "big"))df = pd.get_dummies(combine_df["Familysize"],prefix="Familysize")combine_df = pd.concat([combine_df,df],axis=1).drop(["SibSp","Parch","Familysize"])

  • Ticket

combine_df["Ticket_Lett"] = combine_df["Ticket"].apply(lambda x: str(x)[0])combine_df["Ticket_Lett"] = combine_df["Ticket_Lett"].apply(lambda x: str(x))combine_df["High_Survival_Ticket"] = np.where(combine_df["Ticket_Lett"].isin(["1", "2", "P"]),1,0)combine_df["Low_Survival_Ticket"] = np.where(combine_df["Ticket_Lett"].isin(["A","W","3","7"]),1,0)combine_df = combine_df.drop(["Ticket","Ticket_Lett"],axis=1)

  • Embarked

As in part 1, missing Embarked values are filled with "S":

combine_df.Embarked = combine_df.Embarked.fillna("S")
df = pd.get_dummies(combine_df["Embarked"], prefix="Embarked")
combine_df = pd.concat([combine_df, df], axis=1).drop("Embarked", axis=1)

  • Cabin

combine_df["Cabin_isNull"] = np.where(combine_df["Cabin"].isnull(),0,1)combine_df = combine_df.drop("Cabin",axis=1)

  • Pclass

df = pd.get_dummies(combine_df["Pclass"],prefix="Pclass")combine_df = pd.concat([combine_df,df],axis=1).drop("Pclass",axis=1)

  • Sex

df = pd.get_dummies(combine_df["Sex"],prefix="Sex")combine_df = pd.concat([combine_df,df],axis=1).drop("Sex",axis=1)

  • Fare

The missing value is filled with the mode, after which Fare is discretized:

combine_df["Fare"] = pd.qcut(combine_df.Fare,3)df = pd.get_dummies(combine_df.Fare,prefix="Fare").drop("Fare_(-0.001, 8.662]",axis=1)combine_df = pd.concat([combine_df,df],axis=1).drop("Fare",axis=1)

  • Converting all features to numeric codes

features = combine_df.drop(["PassengerId","Survived"], axis=1).columns
le = LabelEncoder()
for feature in features:
    le = le.fit(combine_df[feature])
    combine_df[feature] = le.transform(combine_df[feature])

  • Building the train/test sets

X_all = combine_df.iloc[:891,:].drop(["PassengerId","Survived"], axis=1)
Y_all = combine_df.iloc[:891,:]["Survived"]
X_test = combine_df.iloc[891:,:].drop(["PassengerId","Survived"], axis=1)
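As a quick sanity check (not in the original notebook), the Titanic training set has 891 rows and the test set 418, so the split above should satisfy:

# minimal sanity check: Kaggle's Titanic data has 891 train rows and 418 test rows
assert X_all.shape[0] == 891 and X_test.shape[0] == 418
print(X_all.shape, X_test.shape)  # both should have the same number of feature columns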

Model Tuning

We benchmark several families of algorithms: logistic regression, support vector machines, k-nearest neighbors, decision trees, random forests, GBDT, and xgboost.

lr = LogisticRegression()
svc = SVC()
knn = KNeighborsClassifier(n_neighbors=3)
dt = DecisionTreeClassifier()
rf = RandomForestClassifier(n_estimators=300, min_samples_leaf=4, class_weight={0:0.745, 1:0.255})
gbdt = GradientBoostingClassifier(n_estimators=500, learning_rate=0.03, max_depth=3)
xgbGBDT = XGBClassifier(max_depth=3, n_estimators=300, learning_rate=0.05)

clfs = [lr, svc, knn, dt, rf, gbdt, xgbGBDT]
kfold = 10
cv_results = []
for classifier in clfs:
    cv_results.append(cross_val_score(classifier, X_all, y=Y_all, scoring="accuracy", cv=kfold, n_jobs=4))

cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())

cv_res = pd.DataFrame({"CrossValMeans": cv_means, "CrossValerrors": cv_std,
                       "Algorithm": ["LR","SVC","KNN","decision_tree","random_forest","GBDT","xgbGBDT"]})
g = sns.barplot("CrossValMeans", "Algorithm", data=cv_res, palette="Set3", orient="h", **{"xerr": cv_std})
g.set_xlabel("Mean Accuracy")
g = g.set_title("Cross validation scores")
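The hyperparameters above look hand-picked. As a minimal sketch (not the author's actual search, and the grid values are our assumption), here's how such values could be found with the GridSearchCV we already imported:

# hypothetical grid-search sketch for the xgboost model
param_grid = {
    "max_depth": [2, 3, 4],
    "n_estimators": [100, 300, 500],
    "learning_rate": [0.03, 0.05, 0.1],
}
gs = GridSearchCV(XGBClassifier(), param_grid, scoring="accuracy", cv=5, n_jobs=4)
gs.fit(X_all, Y_all)
print(gs.best_params_, gs.best_score_)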

Looking at the fitted models, you'll notice their feature importances differ quite a bit... would combining them work even better?
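For instance, a quick way to eyeball this (a sketch of ours, not from the original post) is to fit two of the tree ensembles and plot their importances side by side:

# compare feature importances of two tree ensembles defined above
rf.fit(X_all, Y_all)
gbdt.fit(X_all, Y_all)
imp = pd.DataFrame({"rf": rf.feature_importances_, "gbdt": gbdt.feature_importances_},
                   index=X_all.columns)
imp.plot(kind="barh", figsize=(8, 10))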

class Ensemble(object):

    def __init__(self, estimators):
        self.estimator_names = []
        self.estimators = []
        for i in estimators:
            self.estimator_names.append(i[0])
            self.estimators.append(i[1])
        self.clf = LogisticRegression()

    def fit(self, train_x, train_y):
        for i in self.estimators:
            i.fit(train_x, train_y)
        x = np.array([i.predict(train_x) for i in self.estimators]).T
        y = train_y
        self.clf.fit(x, y)

    def predict(self, x):
        x = np.array([i.predict(x) for i in self.estimators]).T
        #print(x)
        return self.clf.predict(x)

    def score(self, x, y):
        s = precision_score(y, self.predict(x))
        return s

The stacking framework is ready; let's throw the base classifiers in.

bag = Ensemble([("xgb",xgbGBDT),("lr",lr),("rf",rf),("svc",svc),("gbdt",gbdt)])
score = 0
for i in range(0, 10):
    num_test = 0.20
    X_train, X_cv, Y_train, Y_cv = train_test_split(X_all, Y_all, test_size=num_test)
    bag.fit(X_train, Y_train)
    #Y_test = bag.predict(X_test)
    acc = round(bag.score(X_cv, Y_cv) * 100, 2)
    score += acc
print(score/10)  # ~87.86, i.e. a precision of about 0.8786 (note: Ensemble.score is precision, not accuracy)
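Finally, to actually submit to Kaggle (this step is commented out above; the output file name here is our own choice), refit on all the training data, predict the test set, and write a CSV:

# minimal submission sketch, assuming the hypothetical file name "submission.csv"
bag.fit(X_all, Y_all)
Y_test = bag.predict(X_test).astype(int)
submission = pd.DataFrame({
    "PassengerId": combine_df.iloc[891:, :]["PassengerId"].values,
    "Survived": Y_test,
})
submission.to_csv("submission.csv", index=False)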
