Into the Kaggle Top 5% in No Time, Part 1

kaggle Titanic baseline

Keywords: visualization, data mining, kaggle, sklearn, matplotlib, baseline

A baseline, kept as simple as possible.

周知瑞

evilpsycho@icloud.com

2017.6.25

Part 1 of this Kaggle series: zhuanlan.zhihu.com/p/27

Part 2 of this Kaggle series: zhuanlan.zhihu.com/p/28

Contents:

1. Introduction

2. Data Exploration

3. Simple Feature Engineering

4. Baseline

5. Preview of the Next Installment

1.1 Background: the Kaggle Titanic Competition

  • Competition Description

  The sinking of the Titanic in 1912 killed 1,502 of the 2,224 passengers and crew aboard (our male lead among them). Playing armchair detective after the fact, we have data on the passengers, including whether some of them survived. By exploring this data we hope to uncover a few hidden patterns, and, along the way, predict whether the remaining passengers survived!

1.2 Importing Packages

First, import the tools we will need for our exploration: data-handling packages (pandas, numpy), visualization packages (matplotlib, seaborn), and the famous machine-learning package sklearn.

#!/usr/bin/python3
import os

# data handling
import pandas as pd
import numpy as np
import random
import sklearn.preprocessing as preprocessing

# visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# ML
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import (GradientBoostingClassifier, GradientBoostingRegressor,
                              RandomForestClassifier, RandomForestRegressor)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import learning_curve

1.3 Loading the Data

Next, read in the training data (train) and the data to predict (test) downloaded from Kaggle:

path = "E:/data/kaggle/titanic/"
train = pd.read_csv(path + "train.csv")
test = pd.read_csv(path + "test.csv")
#submission_sample = pd.read_csv(path + "gender_submission.csv")

1.4 Data Overview

What does the data look like?

train.head(3)

train.info()

2. Data Exploration

train.describe()

Let's look at the correlations between the numeric variables. A table of numbers is tiring to read, so we'll draw a heatmap instead:

sns.set(context="paper", font="monospace")
sns.set(style="white")
f, ax = plt.subplots(figsize=(10, 6))
train_corr = train.drop("PassengerId", axis=1).corr()
sns.heatmap(train_corr, ax=ax, vmax=.9, square=True)
ax.set_xticklabels(train_corr.index, size=15)
ax.set_yticklabels(train_corr.columns[::-1], size=15)
ax.set_title("train feature corr", fontsize=20)

2.1 Age

Now let's dig a little deeper and look at Age, again through plots:

from scipy import stats

fig, axes = plt.subplots(2, 1, figsize=(8, 6))
sns.set_style("white")
sns.distplot(train.Age.fillna(-20), rug=True, color="b", ax=axes[0])
ax0 = axes[0]
ax0.set_title("age distribution")
ax0.set_xlabel("")
ax1 = axes[1]
ax1.set_title("age survived distribution")
k1 = sns.distplot(train[train.Survived==0].Age.fillna(-20), hist=False, color="r", ax=ax1, label="dead")
k2 = sns.distplot(train[train.Survived==1].Age.fillna(-20), hist=False, color="g", ax=ax1, label="alive")
ax1.set_xlabel("")
ax1.legend(fontsize=16)

Since missing ages were filled with -20, the plots show the overall age distribution and the distribution under each survival outcome. Observations:

1. The Age distribution is wide for both outcomes; children and somewhat older passengers were more likely to survive;

2. Age and Survived are not linearly related; with a linear model this feature may need to be discretized and fed in as a categorical variable;

3. Among survivors, missing ages are rarer.

f, ax = plt.subplots(figsize=(8,3))ax.set_title("Sex Age dist", size=20)sns.distplot(train[train.Sex=="female"].dropna().Age, hist=False, color="pink", label="female")sns.distplot(train[train.Sex=="male"].dropna().Age, hist=False, color="blue", label="male")ax.legend(fontsize=15)

Men skew older while women are younger; among children, boys outnumber girls.

f, ax = plt.subplots(figsize=(8,3))ax.set_title("Pclass Age dist", size=20)sns.distplot(train[train.Pclass==1].dropna().Age, hist=False, color="pink", label="P1")sns.distplot(train[train.Pclass==2].dropna().Age, hist=False, color="blue", label="p2")sns.distplot(train[train.Pclass==3].dropna().Age, hist=False, color="g", label="p3")ax.legend(fontsize=15)

The higher the cabin class, the older the passengers, which fits common sense.

2.2 Pclass

y_dead = train[train.Survived==0].groupby("Pclass")["Survived"].count()
y_alive = train[train.Survived==1].groupby("Pclass")["Survived"].count()
pos = [1, 2, 3]
ax = plt.figure(figsize=(8, 4)).add_subplot(111)
ax.bar(pos, y_dead, color="r", alpha=0.6, label="dead")
ax.bar(pos, y_alive, color="g", bottom=y_dead, alpha=0.6, label="alive")
ax.legend(fontsize=16, loc="best")
ax.set_xticks(pos)
ax.set_xticklabels(["Pclass%d" % (i) for i in range(1, 4)], size=15)
ax.set_title("Pclass Survived count", size=20)

Passenger counts in first class (Pclass=1), second class (Pclass=2), and steerage (Pclass=3):

1. Unsurprisingly, steerage has by far the most passengers.

2. By survival rate, first class leads by a wide margin, while the steerage death rate is staggering.

pos = range(0, 6)
age_list = []
for Pclass_ in range(1, 4):
    for Survived_ in range(0, 2):
        age_list.append(train[(train.Pclass == Pclass_) & (train.Survived == Survived_)].Age.values)
fig, axes = plt.subplots(3, 1, figsize=(10, 6))
i_Pclass = 1
for ax in axes:
    sns.distplot(age_list[i_Pclass*2-2], hist=False, ax=ax, label="Pclass:%d ,survived:0" % (i_Pclass), color="r")
    sns.distplot(age_list[i_Pclass*2-1], hist=False, ax=ax, label="Pclass:%d ,survived:1" % (i_Pclass), color="g")
    i_Pclass += 1
    ax.set_xlabel("age", size=15)
    ax.legend(fontsize=15)

Observations:

  1. First-class survivors skew younger
  2. Second-class children were looked after very well
  3. In steerage, children likewise survived at higher rates (who says steerage passengers have no love to give?)

2.3 Sex

print(train.Sex.value_counts())
print("********************************")
print(train.groupby("Sex")["Survived"].mean())

By headcount, men dominate: 577 of them.

Women survived at a far higher rate, 74%, versus only 18% for men.

As the resident gentleman: I shall keep giving up my seat to the ladies!

ax = plt.figure(figsize=(10, 4)).add_subplot(111)
sns.violinplot(x="Sex", y="Age", hue="Survived", data=train.dropna(), split=True)
ax.set_xlabel("Sex", size=20)
ax.set_xticklabels(["Female", "male"], size=18)
ax.set_ylabel("Age", size=20)
ax.legend(fontsize=25, loc="best")

  1. Among women, survivors cluster in the middle of the age range;
  2. Among men, the young, and especially children, were more likely to survive;
  3. It seems the middle-aged gentlemen behaved admirably.

label = []
for sex_i in ["female", "male"]:
    for pclass_i in range(1, 4):
        label.append("sex:%s,Pclass:%d" % (sex_i, pclass_i))
pos = range(6)
fig = plt.figure(figsize=(16, 4))
ax = fig.add_subplot(111)
ax.bar(pos, train[train["Survived"]==0].groupby(["Sex", "Pclass"])["Survived"].count().values,
       color="r", alpha=0.5, align="center", tick_label=label, label="dead")
ax.bar(pos, train[train["Survived"]==1].groupby(["Sex", "Pclass"])["Survived"].count().values,
       bottom=train[train["Survived"]==0].groupby(["Sex", "Pclass"])["Survived"].count().values,
       color="g", alpha=0.5, align="center", tick_label=label, label="alive")
ax.tick_params(labelsize=15)
ax.set_title("sex_pclass_survived", size=30)
ax.legend(fontsize=15, loc="best")

Survival by sex and cabin class:

  1. Overall, women (far more green in the chart) were much more likely to survive;
  2. Within each sex, the higher the cabin class (i.e. the lower the Pclass number), the higher the survival rate; the snippet below confirms this
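
To put numbers behind point 2, here is a quick supplementary check of the survival rate for every Sex/Pclass combination (my addition, not in the original post):

# survival rate per sex/class group (supplementary check)
train.groupby(["Sex", "Pclass"])["Survived"].mean().unstack()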

2.4 Fare (ticket price)

fig = plt.figure(figsize=(8, 6))
ax = plt.subplot2grid((2, 2), (0, 0), colspan=2)
ax.tick_params(labelsize=15)
ax.set_title("Fare dist", size=20)
ax.set_ylabel("dist", size=20)
sns.kdeplot(train.Fare, ax=ax)
sns.distplot(train.Fare, ax=ax)
ax.legend(fontsize=15)
pos = range(0, 400, 50)
ax.set_xticks(pos)
ax.set_xlim([0, 200])
ax.set_xlabel("")
ax1 = plt.subplot2grid((2, 2), (1, 0), colspan=2)
ax1.set_title("Fare Pclass dist", size=20)
for i in range(1, 4):
    sns.kdeplot(train[train.Pclass==i].Fare, ax=ax1, label="Pclass %d" % (i))
ax1.set_xlim([0, 200])
ax1.legend(fontsize=15)

Fare distribution:

fig = plt.figure(figsize=(8, 3))
ax1 = fig.add_subplot(111)
sns.kdeplot(train[train.Survived==0].Fare, ax=ax1, label="dead", color="r")
sns.kdeplot(train[train.Survived==1].Fare, ax=ax1, label="alive", color="g")
#sns.distplot(train[train.Survived==0].Fare, ax=ax1, color="r")
#sns.distplot(train[train.Survived==1].Fare, ax=ax1, color="g")
ax1.set_xlim([0, 300])
ax1.legend(fontsize=15)
ax1.set_title("Fare survived", size=20)
ax1.set_xlabel("Fare", size=15)

Those who paid more were more likely to survive.

2.5 SibSp & Parch (siblings/spouses and parents/children aboard)

fig = plt.figure(figsize=(8, 4))
ax1 = fig.add_subplot(211)
sns.countplot(train.SibSp)
ax1.set_title("SibSp", size=20)
ax2 = fig.add_subplot(212, sharex=ax1)
sns.countplot(train.Parch)
ax2.set_title("Parch", size=20)

Most passengers have no relatives aboard; those who do most often have one sibling or spouse, and one or two parents or children.

fig = plt.figure(figsize=(10, 6))
ax1 = fig.add_subplot(311)
train.groupby("SibSp")["Survived"].mean().plot(kind="bar", ax=ax1)
ax1.set_title("SibSp Survived Rate", size=16)
ax1.set_xlabel("")
ax2 = fig.add_subplot(312)
train.groupby("Parch")["Survived"].mean().plot(kind="bar", ax=ax2)
ax2.set_title("Parch Survived Rate", size=16)
ax2.set_xlabel("")
ax3 = fig.add_subplot(313)
train.groupby(train.SibSp + train.Parch)["Survived"].mean().plot(kind="bar", ax=ax3)
ax3.set_title("Parch+SibSp Survived Rate", size=16)

Grouped survival rates by number of relatives first rise, then fall; family size and survival are not related in a simple linear way.

2.6 Embarked (port of embarkation)

plt.style.use("ggplot")
ax = plt.figure(figsize=(8, 3)).add_subplot(111)
pos = [1, 2, 3]
y1 = train[train.Survived==0].groupby("Embarked")["Survived"].count().sort_index().values
y2 = train[train.Survived==1].groupby("Embarked")["Survived"].count().sort_index().values
ax.bar(pos, y1, color="r", alpha=0.4, align="center", label="dead")
ax.bar(pos, y2, color="g", alpha=0.4, align="center", label="alive", bottom=y1)
ax.set_xticks(pos)
ax.set_xticklabels(["C", "Q", "S"])
ax.legend(fontsize=15, loc="best")
ax.set_title("Embarked survived count", size=18)

Passengers who embarked at C had a notably high survival rate.

ax = plt.figure(figsize=(8, 3)).add_subplot(111)
ax.set_xlim([-20, 80])
sns.kdeplot(train[train.Embarked=="C"].Age.fillna(-10), ax=ax, label="C")
sns.kdeplot(train[train.Embarked=="Q"].Age.fillna(-10), ax=ax, label="Q")
sns.kdeplot(train[train.Embarked=="S"].Age.fillna(-10), ax=ax, label="S")
ax.legend(fontsize=18)
ax.set_title("Embarked Age Dist", size=18)

Missing ages are filled with -10 here:

  1. Many passengers who embarked at Q have no recorded age
  2. The C and S age distributions are similar, but C's is flatter, with higher proportions of children and the elderly

y1 = train[train.Survived==0].groupby(["Embarked", "Pclass"])["Survived"].count().reset_index()["Survived"].values
y2 = train[train.Survived==1].groupby(["Embarked", "Pclass"])["Survived"].count().reset_index()["Survived"].values
ax = plt.figure(figsize=(8, 3)).add_subplot(111)
pos = range(9)
ax.bar(pos, y1, align="center", alpha=0.5, color="r", label="dead")
ax.bar(pos, y2, align="center", bottom=y1, alpha=0.5, color="g", label="alive")
ax.set_xticks(pos)
xticklabels = []
for embarked_val in ["C", "Q", "S"]:
    for pclass_val in range(1, 4):
        xticklabels.append("%s/%d" % (embarked_val, pclass_val))
ax.set_xticklabels(xticklabels, size=15)
ax.legend(fontsize=15, loc="best")

Judging from the class mix, is C's higher survival rate simply because it carried more first-class passengers?

Comparing C and S within the same class, however, shows that C's survival rate is still higher; the check below makes it concrete.
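
A quick cross-tabulation (my addition, not shown in the original post) of survival rate by port and class:

# survival rate by embarkation port and class
train.groupby(["Embarked", "Pclass"])["Survived"].mean().unstack()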

Two wild guesses:

  1. Passengers from C stuck together and helped one another
  2. S has the most passengers, with a class mix that looks typical, whereas C has many first-class passengers, almost no second class, and still a fair number in steerage

Conjecture: C's passengers were largely aristocrats, while S's were merchants from busy trading ports, so C's passengers held higher status.

2.7 Cabin (cabin number)
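
The content of this section appears to have been lost. As a placeholder, here is a minimal sketch, under my own assumption of what belongs here, of one standard check: whether simply having a recorded cabin number relates to survival:

# hypothetical check: survival for passengers with vs. without a recorded Cabin
train.groupby(train.Cabin.notnull())["Survived"].agg(["count", "mean"])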

2.8 Ticket (ticket number)
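
This section's content also appears to be missing. One common angle, sketched here purely as an assumption: passengers sharing a ticket number likely travelled together, so the size of a ticket group may carry signal:

# hypothetical check: survival rate by ticket-group size
ticket_counts = train.Ticket.value_counts()
train.groupby(train.Ticket.map(ticket_counts))["Survived"].mean()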

2.9 Name
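
The Name analysis referenced in Section 3 is missing here as well. The usual approach, sketched below under my own assumptions, extracts the title (Mr, Mrs, Miss, Master, ...) from the name, which correlates strongly with both age and sex:

# hypothetical sketch: pull the title out of e.g. "Braund, Mr. Owen Harris"
train["Title"] = train.Name.str.extract(r",\s*([^\.]+)\.", expand=False)
train.groupby("Title")["Survived"].agg(["count", "mean"])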

3. Simple Feature Engineering

print ("***********Train*************")print ("test")print (train.isnull().sum())print ("***********test*************")print (test.isnull().sum())

  1. Age and Cabin have missing values in both the training and test sets; Cabin is missing for most passengers
  2. Embarked has 2 missing values in the training set

# fill the two missing Embarked values with the most common port, "S"
train.Embarked.fillna("S", inplace=True)

As we noted when analyzing Name, Age needs to be discretized, and its missing values handled as well.

For the baseline we keep it simple:

  1. Missing ages become their own category
  2. The rest are binned into discrete age ranges

# discretize age into 5-year bins; group under-10 and 60-plus separately,
# and give missing ages their own "Null" category
def age_map(x):
    if x < 10:
        return "10-"
    if x < 60:
        return "%d-%d" % (x // 5 * 5, x // 5 * 5 + 5)
    elif x >= 60:
        return "60+"
    else:
        return "Null"  # NaN fails every comparison and lands here

train["Age_map"] = train["Age"].apply(lambda x: age_map(x))
test["Age_map"] = test["Age"].apply(lambda x: age_map(x))

# inspect survival by age bin
train.groupby("Age_map")["Survived"].agg(["count", "mean"])

The test set also has one missing Fare value, which we fill first (see the sketch below).
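
The original snippet does not show the fill step; a minimal sketch, assuming we impute with the median fare of that passenger's Pclass:

# hypothetical fill for the single missing test Fare (the post omits this step)
test["Fare"] = test.groupby("Pclass")["Fare"].transform(
    lambda s: s.fillna(s.median()))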

Fare also spans a very wide range, so we standardize it to help the model converge faster.

import sklearn.preprocessing as preprocessing

scaler = preprocessing.StandardScaler()
# fit on train, then apply the same scaling to both train and test
fare_scale_param = scaler.fit(train["Fare"].values.reshape(-1, 1))
train.Fare = fare_scale_param.transform(train["Fare"].values.reshape(-1, 1))
test.Fare = fare_scale_param.transform(test["Fare"].values.reshape(-1, 1))

One-hot encode all the categorical variables:

# cast Pclass to str so get_dummies treats it as categorical
cat_cols = ["Pclass", "Sex", "Cabin", "Embarked", "Age_map"]
train_x = pd.concat([train[["SibSp", "Parch", "Fare"]],
                     pd.get_dummies(train[cat_cols].astype(str))], axis=1)
train_y = train.Survived
test_x = pd.concat([test[["SibSp", "Parch", "Fare"]],
                    pd.get_dummies(test[cat_cols].astype(str))], axis=1)
# align test columns with train (their Cabin values differ)
test_x = test_x.reindex(columns=train_x.columns, fill_value=0)

4. Baseline Model

We take logistic regression as the baseline model and run a simple grid search over its parameters.

base_line_model = LogisticRegression()
param = {"penalty": ["l1", "l2"],
         "C": [0.1, 0.5, 1.0, 5.0]}
grd = GridSearchCV(estimator=base_line_model, param_grid=param, cv=5, n_jobs=3)
grd.fit(train_x, train_y)

Plot the learning curves from training to check for overfitting or underfitting:

def plot_learning_curve(clf, title, X, y, ylim=None, cv=None, n_jobs=3,
                        train_sizes=np.linspace(.05, 1., 5)):
    train_sizes, train_scores, test_scores = learning_curve(
        clf, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    ax = plt.figure().add_subplot(111)
    ax.set_title(title)
    if ylim is not None:
        ax.set_ylim(*ylim)
    ax.set_xlabel(u"train_num_of_samples")
    ax.set_ylabel(u"score")
    # shade one standard deviation around each mean score
    ax.fill_between(train_sizes, train_scores_mean - train_scores_std,
                    train_scores_mean + train_scores_std, alpha=0.1, color="b")
    ax.fill_between(train_sizes, test_scores_mean - test_scores_std,
                    test_scores_mean + test_scores_std, alpha=0.1, color="r")
    ax.plot(train_sizes, train_scores_mean, "o-", color="b", label=u"train score")
    ax.plot(train_sizes, test_scores_mean, "o-", color="r", label=u"testCV score")
    ax.legend(loc="best")
    midpoint = ((train_scores_mean[-1] + train_scores_std[-1])
                + (test_scores_mean[-1] - test_scores_std[-1])) / 2
    diff = ((train_scores_mean[-1] + train_scores_std[-1])
            - (test_scores_mean[-1] - test_scores_std[-1]))
    return midpoint, diff

plot_learning_curve(grd, u"learning_rate", train_x, train_y)

It looks healthy; let's submit our result and see.

gender_submission = pd.DataFrame({"PassengerId": test.iloc[:, 0],
                                  "Survived": grd.predict(test_x)})
gender_submission.to_csv("C:/Users/evilpsycho/Desktop/gender_submission.csv",
                         index=None)

The result:

Quite decent, really; after all, this is just a baseline built on some quick analysis and light processing.

5. Preview of the Next Installment

Getting into the top 5%:

1. Introduce the pipeline mechanism to streamline feature extraction and model tuning

2. Bad-case analysis

3. Deeper feature engineering

4. Ensemble learning

5. Animated visualization of the training process

