Into the Kaggle TOP 5% in Minutes, Series (1)
Kaggle Titanic baseline
Keywords: visualization, data mining, kaggle, sklearn, matplotlib, baseline
A baseline, kept as simple as possible
周知瑞 (evilpsycho@icloud.com), 2017.6.25
Part 1 of this Kaggle series: https://zhuanlan.zhihu.com/p/27550334/edit
Part 2 of this Kaggle series: https://zhuanlan.zhihu.com/p/28795160/edit
Contents:
1. Introduction
2. Data exploration
3. Simple feature engineering
4. Baseline
5. Preview of the next installment
1.1 Background: the Kaggle Titanic competition
- Competition Description
The sinking of the Titanic in 1912 killed 1,502 of the 2,224 people aboard (our leading man among them). Playing armchair analyst after the fact, we have data on the passengers, plus survival labels for a portion of them. By exploring this data we hope to dig up a few little-known patterns, and, while we're at it, predict whether the other portion of passengers survived!
1.2 Importing packages
First, let's bring in the tools we'll need for our exploration (read: gossiping over the data): data-handling packages (pandas, numpy), visualization packages (matplotlib, seaborn), and the celebrated machine-learning package sklearn.
#!/usr/bin/python3
import os

# data handling
import pandas as pd
import numpy as np
import random
import sklearn.preprocessing as preprocessing

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# ML
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import (GradientBoostingClassifier, GradientBoostingRegressor,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve
1.3 Loading the data
Next, read in the training data (train) and the to-be-predicted data (test) that we downloaded from Kaggle:
path = "E:/data/kaggle/titanic/"train = pd.read_csv(path + "train.csv")test = pd.read_csv(path + "test.csv")#submission_sample = pd.read_csv(path + "gender_submission.csv")
1.4 Data overview
What does the data look like?
train.head(3)
train.info()
2. Data exploration
train.describe()
Let's look at the correlations among the numeric variables. A table of numbers is tiring to read, so let's draw a picture instead (ahem):
sns.set(context="paper", font="monospace")sns.set(stylex="white")f, ax = plt.subplots(figsize=(10,6))train_corr = train.drop("PassengerId",axis=1).corr()sns.heatmap(train_corr, ax=ax, vmax=.9, square=True)ax.set_xticklabels(train_corr.index, size=15)ax.set_yticklabels(train_corr.columns[::-1], size=15)ax.set_title("train feature corr", fontsize=20)
2.1 Age
Alright, let's go a little deeper (deeper? why am I blushing...) and look at Age, again through plots:
from scipy import stats

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
sns.set_style("white")

# fill missing ages with -20 so they show up as a separate bump on the left
sns.distplot(train.Age.fillna(-20), rug=True, color="b", ax=axes[0])
ax0 = axes[0]
ax0.set_title("age distribution")
ax0.set_xlabel("")

ax1 = axes[1]
ax1.set_title("age survived distribution")
k1 = sns.distplot(train[train.Survived == 0].Age.fillna(-20), hist=False, color="r", ax=ax1, label="dead")
k2 = sns.distplot(train[train.Survived == 1].Age.fillna(-20), hist=False, color="g", ax=ax1, label="alive")
ax1.set_xlabel("")
ax1.legend(fontsize=16)
Since we fill missing ages with -20, the plots show the overall age distribution and the age distribution by survival:
1. Survived or not, the Age distribution is wide; children and somewhat older passengers were more likely to survive;
2. Age and Survived are not linearly related: for a linear model, this feature may need to be discretized and fed in as a categorical variable (see the quick check after this list);
3. Among the survivors, fewer ages are missing.
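To make point 2 concrete, here is a quick check (my own addition; the bin edges are an arbitrary choice, not from the post):

# survival rate by coarse age band; rows with missing Age are dropped by groupby
age_bins = pd.cut(train.Age, bins=[0, 10, 20, 35, 60, 80])
print(train.groupby(age_bins)["Survived"].mean())

If the rates rise and fall across bands rather than trending one way, a single linear coefficient on Age cannot capture the pattern.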
f, ax = plt.subplots(figsize=(8,3))ax.set_title("Sex Age dist", size=20)sns.distplot(train[train.Sex=="female"].dropna().Age, hist=False, color="pink", label="female")sns.distplot(train[train.Sex=="male"].dropna().Age, hist=False, color="blue", label="male")ax.legend(fontsize=15)
Men skew older and women younger; among the children, boys are the more numerous.
f, ax = plt.subplots(figsize=(8,3))ax.set_title("Pclass Age dist", size=20)sns.distplot(train[train.Pclass==1].dropna().Age, hist=False, color="pink", label="P1")sns.distplot(train[train.Pclass==2].dropna().Age, hist=False, color="blue", label="p2")sns.distplot(train[train.Pclass==3].dropna().Age, hist=False, color="g", label="p3")ax.legend(fontsize=15)
The higher the cabin class, the older its passengers tend to be, which squares with common sense.
2.2 Pclass
y_dead = train[train.Survived == 0].groupby("Pclass")["Survived"].count()
y_alive = train[train.Survived == 1].groupby("Pclass")["Survived"].count()
pos = [1, 2, 3]
ax = plt.figure(figsize=(8, 4)).add_subplot(111)
ax.bar(pos, y_dead, color="r", alpha=0.6, label="dead")
ax.bar(pos, y_alive, color="g", bottom=y_dead, alpha=0.6, label="alive")
ax.legend(fontsize=16, loc="best")
ax.set_xticks(pos)
ax.set_xticklabels(["Pclass%d" % (i) for i in range(1, 4)], size=15)
ax.set_title("Pclass Survived count", size=20)
Passenger counts in first class (Pclass=1), second class (Pclass=2), and steerage (Pclass=3):
1. No surprise here: steerage has by far the most passengers.
2. By survival rate, though, first class is way out in front, and the death rate in steerage is staggering (a quick numeric check follows this list).
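To put numbers on point 2, a one-liner of my own (not in the original post):

# the mean of Survived per class is the survival rate; first class is far ahead
print(train.groupby("Pclass")["Survived"].mean())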
pos = range(0, 6)
age_list = []
for Pclass_ in range(1, 4):
    for Survived_ in range(0, 2):
        age_list.append(train[(train.Pclass == Pclass_) & (train.Survived == Survived_)].Age.values)

fig, axes = plt.subplots(3, 1, figsize=(10, 6))
i_Pclass = 1
for ax in axes:
    sns.distplot(age_list[i_Pclass * 2 - 2], hist=False, ax=ax, label="Pclass:%d ,survived:0" % (i_Pclass), color="r")
    sns.distplot(age_list[i_Pclass * 2 - 1], hist=False, ax=ax, label="Pclass:%d ,survived:1" % (i_Pclass), color="g")
    i_Pclass += 1
    ax.set_xlabel("age", size=15)
    ax.legend(fontsize=15)
Observations:
- In first class, survivors skew younger;
- In second class, the children were clearly well looked after;
- In steerage, children likewise survived more often (who says steerage has no love? Stand up!)
2.3 Sex
print(train.Sex.value_counts())
print("********************************")
print(train.groupby("Sex")["Survived"].mean())
By headcount, men dominate: 577 of them!
Women survived at a far higher rate, 74%, versus just 18% for men.
I'm a gentleman, I'm up for marriage, and I'll always give a lady my seat!
ax = plt.figure(figsize=(10, 4)).add_subplot(111)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train.dropna(), split=True)
ax.set_xlabel("Sex", size=20)
ax.set_xticklabels(["Female", "Male"], size=18)
ax.set_ylabel("Age", size=20)
ax.legend(fontsize=25, loc="best")
- Among women, the survivors cluster in the middle age range;
- Among men, the young, and especially the children, were more likely to survive;
- It seems the middle-aged gentlemen were rather decent sorts after all.
label = []
for sex_i in ["female", "male"]:
    for pclass_i in range(1, 4):
        label.append("sex:%s,Pclass:%d" % (sex_i, pclass_i))

pos = range(6)
dead_counts = train[train["Survived"] == 0].groupby(["Sex", "Pclass"])["Survived"].count().values
alive_counts = train[train["Survived"] == 1].groupby(["Sex", "Pclass"])["Survived"].count().values

fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(111)
ax.bar(pos, dead_counts, color="r", alpha=0.5, align="center", tick_label=label, label="dead")
ax.bar(pos, alive_counts, bottom=dead_counts, color="g", alpha=0.5, align="center", tick_label=label, label="alive")
ax.tick_params(labelsize=15)
ax.set_title("sex_pclass_survived", size=30)
ax.legend(fontsize=15, loc="best")
Survival by sex and cabin class:
- Overall, the green (survived) bars are clearly taller for women: they were much more likely to survive;
- Within the same sex, the lower the Pclass number (that is, the better the cabin class), the higher the survival rate.
2.4 Fare (ticket price)
fig = plt.figure(figsize=(8, 6))

ax = plt.subplot2grid((2, 2), (0, 0), colspan=2)
ax.tick_params(labelsize=15)
ax.set_title("Fare dist", size=20)
ax.set_ylabel("dist", size=20)
sns.kdeplot(train.Fare, ax=ax)
sns.distplot(train.Fare, ax=ax)
ax.legend(fontsize=15)
pos = range(0, 400, 50)
ax.set_xticks(pos)
ax.set_xlim([0, 200])
ax.set_xlabel("")

ax1 = plt.subplot2grid((2, 2), (1, 0), colspan=2)
ax1.set_title("Fare Pclass dist", size=20)
for i in range(1, 4):
    sns.kdeplot(train[train.Pclass == i].Fare, ax=ax1, label="Pclass %d" % (i))
ax1.set_xlim([0, 200])
ax1.legend(fontsize=15)
The Fare distribution is heavily right-skewed, and the three classes peak at visibly different fares. Next, split Fare by survival:
fig = plt.figure(figsize=(8, 3))
ax1 = fig.add_subplot(111)
sns.kdeplot(train[train.Survived == 0].Fare, ax=ax1, label="dead", color="r")
sns.kdeplot(train[train.Survived == 1].Fare, ax=ax1, label="alive", color="g")
# sns.distplot(train[train.Survived == 0].Fare, ax=ax1, color="r")
# sns.distplot(train[train.Survived == 1].Fare, ax=ax1, color="g")
ax1.set_xlim([0, 300])
ax1.legend(fontsize=15)
ax1.set_title("Fare survived", size=20)
ax1.set_xlabel("Fare", size=15)
Those who paid more were more likely to survive.
2.5 sibsp & parch 表親和直親
fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(211)
sns.countplot(train.SibSp)
ax1.set_title("SibSp", size=20)
ax2 = fig.add_subplot(212, sharex=ax1)
sns.countplot(train.Parch)
ax2.set_title("Parch", size=20)
Most passengers have no relatives aboard; among those who do, one sibling/spouse (SibSp=1) is most common, as are one or two parents/children (Parch=1 or 2).
fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(311)
train.groupby("SibSp")["Survived"].mean().plot(kind="bar", ax=ax1)
ax1.set_title("SibSp Survived Rate", size=16)
ax1.set_xlabel("")
ax2 = fig.add_subplot(312)
train.groupby("Parch")["Survived"].mean().plot(kind="bar", ax=ax2)
ax2.set_title("Parch Survived Rate", size=16)
ax2.set_xlabel("")
ax3 = fig.add_subplot(313)
train.groupby(train.SibSp + train.Parch)["Survived"].mean().plot(kind="bar", ax=ax3)
ax3.set_title("SibSp+Parch Survived Rate", size=16)
Grouping survival rate by the number of relatives, each curve roughly rises and then falls: family size and survival are not related in a simply linear way.
2.6 Embarked (port of embarkation)
plt.style.use("ggplot")ax = plt.figure(figsize=(8,3)).add_subplot(111)pos = [1, 2, 3]y1 = train[train.Survived==0].groupby("Embarked")["Survived"].count().sort_index().valuesy2 = train[train.Survived==1].groupby("Embarked")["Survived"].count().sort_index().valuesax.bar(pos, y1, color="r", alpha=0.4, align="center", label="dead")ax.bar(pos, y2, color="g", alpha=0.4, align="center", label="alive", bottom=y1)ax.set_xticks(pos)ax.set_xticklabels(["C","Q","S"])ax.legend(fontsize=15, loc="best")ax.set_title("Embarked survived count", size=18)
Passengers who embarked at C have a strikingly high survival rate.
ax = plt.figure(figsize=(8, 3)).add_subplot(111)
ax.set_xlim([-20, 80])
sns.kdeplot(train[train.Embarked == "C"].Age.fillna(-10), ax=ax, label="C")
sns.kdeplot(train[train.Embarked == "Q"].Age.fillna(-10), ax=ax, label="Q")
sns.kdeplot(train[train.Embarked == "S"].Age.fillna(-10), ax=ax, label="S")
ax.legend(fontsize=18)
ax.set_title("Embarked Age Dist", size=18)
Missing ages are filled with -10 here, so they show up as the bump at the far left:
- Many passengers who embarked at Q have no recorded age;
- The age distributions for C and S look similar, except that C's is flatter, with larger shares of children and the elderly.
y1 = train[train.Survived==0].groupby(["Embarked","Pclass"])["Survived"].count().reset_index()["Survived"].valuesy2 = train[train.Survived==1].groupby(["Embarked","Pclass"])["Survived"].count().reset_index()["Survived"].valuesax = plt.figure(figsize=(8,3)).add_subplot(111)pos = range(9)ax.bar(pos, y1, align="center", alpha=0.5, color="r", label="dead")ax.bar(pos, y2, align="center", bottom=y1, alpha=0.5, color="g", label="alive")ax.set_xticks(pos)xticklabels = []for embarked_val in ["C","Q","S"]: for pclass_val in range(1,4): xticklabels.append("%s/%d"%(embarked_val,pclass_val))ax.set_xticklabels(xticklabels,size=15)ax.legend(fontsize=15, loc="best")
Judging by the class mix at each port, is C's higher survival rate simply down to it having more first-class passengers?
Comparing C against S class by class, though, C's survival rate remains higher within the same cabin class (the table below makes this concrete).
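One way to check that comparison directly (an added sketch, not from the post):

# survival rate for every Embarked/Pclass cell; compare the C and S rows
print(train.groupby(["Embarked", "Pclass"])["Survived"].mean().unstack())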
Some wild guesses:
- Passengers from C stuck together and helped each other out;
- By headcount, S contributed the most passengers, with a fairly normal class mix, while C has many first-class passengers, almost no second class, and a fair number in steerage.
Conjecture: C's passengers were largely the well-to-do, while S's were merchants from a prosperous trading port? So C's passengers carried more social weight.
2.7 Cabin (cabin number)
2.8 Ticket (ticket number)
2.9 Name
3. Simple feature engineering
print ("***********Train*************")print ("test")print (train.isnull().sum())print ("***********test*************")print (test.isnull().sum())
- Age and Cabin have missing values in both the training set and the prediction set, and Cabin is missing for most rows;
- Embarked has 2 missing values in the training set.
# fill the two missing Embarked values with the most common port, "S"
train.Embarked.fillna("S", inplace=True)
As noted back in the Age analysis, we need to discretize age and handle its missing values.
For the baseline we keep the treatment simple:
- missing ages go into a bucket of their own;
- the rest are discretized into age bands.
# discretize in 5-year bands; under-10 and 60+ each get a single bucket
def age_map(x):
    if x < 10:
        return "10-"
    if x < 60:
        return "%d-%d" % (x // 5 * 5, x // 5 * 5 + 5)
    elif x >= 60:
        return "60+"
    else:
        # NaN fails every comparison above, so missing ages land here
        return "Null"

train["Age_map"] = train["Age"].apply(lambda x: age_map(x))
test["Age_map"] = test["Age"].apply(lambda x: age_map(x))

# print it out and take a look
train.groupby("Age_map")["Survived"].agg(["count", "mean"])
There is also one missing Fare value in test; we fill it below.
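The post doesn't show the fill step, so here is a minimal sketch; using the median fare of that passenger's Pclass is my assumption, not necessarily the author's choice:

# fill the single missing Fare in test with the median fare of its Pclass
test["Fare"] = test.groupby("Pclass")["Fare"].transform(lambda s: s.fillna(s.median()))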
Fare's range is very wide, so we scale it to help the model converge faster.
import sklearn.preprocessing as preprocessing

scaler = preprocessing.StandardScaler()
# Fare
fare_scale_param = scaler.fit(train["Fare"].values.reshape(-1, 1))
train.Fare = fare_scale_param.transform(train["Fare"].values.reshape(-1, 1))
test.Fare = fare_scale_param.transform(test["Fare"].values.reshape(-1, 1))
Now one-hot encode all the categorical variables.
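One caveat first: the Cabin analysis (section 2.7 above) came through empty, and raw cabin numbers barely overlap between train and test, so one-hot encoding Cabin directly would leave train_x and test_x with mismatched columns. A minimal sketch of a common fix, assuming (my guess, not necessarily the lost section's step) that Cabin is first reduced to a has-cabin flag:

# collapse Cabin to "Yes"/"No" so the dummy columns line up across train and test
for df in (train, test):
    df["Cabin"] = df["Cabin"].notnull().map({True: "Yes", False: "No"})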
train_x = pd.concat([train[["SibSp", "Parch", "Fare"]],
                     pd.get_dummies(train[["Pclass", "Sex", "Cabin", "Embarked", "Age_map"]])], axis=1)
train_y = train.Survived
test_x = pd.concat([test[["SibSp", "Parch", "Fare"]],
                    pd.get_dummies(test[["Pclass", "Sex", "Cabin", "Embarked", "Age_map"]])], axis=1)
4. Baseline model
We use logistic regression as the baseline model and run a simple grid search over its parameters:
base_line_model = LogisticRegression()
param = {"penalty": ["l1", "l2"],
         "C": [0.1, 0.5, 1.0, 5.0]}
grd = GridSearchCV(estimator=base_line_model, param_grid=param, cv=5, n_jobs=3)
grd.fit(train_x, train_y)
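It's worth peeking at what the search settled on (a small addition of mine):

# best penalty/C combination and its mean cross-validated accuracy
print(grd.best_params_)
print(grd.best_score_)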
Plot the learning curves from training to check for over- or under-fitting:
def plot_learning_curve(clf, title, X, y, ylim=None, cv=None, n_jobs=3,
                        train_sizes=np.linspace(.05, 1., 5)):
    train_sizes, train_scores, test_scores = learning_curve(
        clf, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)

    ax = plt.figure().add_subplot(111)
    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel(u"train_num_of_samples")
    ax.set_ylabel(u"score")
    ax.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha=0.1, color="b")
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha=0.1, color="r")
    ax.plot(train_sizes, train_scores_mean, "o-", color="b", label=u"train score")
    ax.plot(train_sizes, test_scores_mean, "o-", color="r", label=u"testCV score")
    ax.legend(loc="best")

    midpoint = ((train_scores_mean[-1] + train_scores_std[-1]) +
                (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = (train_scores_mean[-1] + train_scores_std[-1]) - \
           (test_scores_mean[-1] - test_scores_std[-1])
    return midpoint, diff

plot_learning_curve(grd, u"learning_curve", train_x, train_y)
The curves look normal, so let's submit our result and see.
gender_submission = pd.DataFrame({"PassengerId": test.iloc[:, 0],
                                  "Survived": grd.predict(test_x)})
gender_submission.to_csv("C:/Users/evilpsycho/Desktop/gender_submission.csv", index=None)
Now for the score:
Honestly not bad, considering this is only a baseline after some light analysis and processing.
5. Preview of the next installment
Climbing into the TOP 5%:
1. Introduce sklearn's Pipeline mechanism to streamline feature extraction and model tuning (a small taste follows this list)
2. Bad-case analysis
3. Deeper feature engineering
4. Ensemble learning
5. Animated visualization of the model training process
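As a taste of item 1, here is a minimal sklearn Pipeline sketch (my own illustration; the next post's actual pipeline may look different):

from sklearn.pipeline import Pipeline

# chain scaling and the classifier so a single grid search tunes the whole flow
pipe = Pipeline([("scaler", preprocessing.StandardScaler()),
                 ("clf", LogisticRegression())])
pipe_param = {"clf__C": [0.1, 0.5, 1.0, 5.0]}
pipe_grid = GridSearchCV(pipe, param_grid=pipe_param, cv=5)
pipe_grid.fit(train_x, train_y)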