Data Mining 02: Kaggle Titanic
Competition link:
Titanic: Machine Learning from Disaster. The code I worked through last time was clearly not beginner-friendly, and I ran into a few small problems reading it myself, so I had to set it aside. This time I am following another person's notebook; compared with the previous one it reads much more easily. Let's take a quick look:
# data analysis and wrangling
import pandas as pd
import numpy as np
import random as rnd

# visualization
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

# machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier

train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')
combine = [train_df, test_df]

print(train_df.columns.values)

# preview the data
train_df.head()
train_df.tail()

train_df.info()
print('_'*40)
test_df.info()
The opening part is much the same as before: read in the CSV files and print their basic information.
train_df.describe()
# Review survived rate using `percentiles=[.61, .62]` knowing our problem description mentions 38% survival rate.
# Review Parch distribution using `percentiles=[.75, .8]`
# SibSp distribution `[.68, .69]`
# Age and Fare `[.1, .2, .3, .4, .5, .6, .7, .8, .9, .99]`
Here you can see what describe() does: for every numeric column it reports the count, mean, standard deviation, minimum, maximum and several percentiles. A percentile is the value below which the given fraction of the data falls, so the 50% row is simply the median; this is just another statistical summary of the data.
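As a quick illustration of the percentiles argument (a minimal sketch; the column and cut points are chosen just for demonstration):

# Summarize Age at a few chosen percentiles; the 50% row equals the median
train_df['Age'].describe(percentiles=[.1, .5, .9])
# The notebook's comments use e.g. percentiles=[.61, .62] on Survived to bracket
# the 38% overall survival rate mentioned in the problem description.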
train_df.describe(include=['O'])
This call differs slightly from the earlier describe(): it passes an extra argument. describe() takes three parameters (percentiles, include and exclude), and this is one of them; include=['O'] asks for a summary of the non-numeric (object-typed) columns instead. The output looks like this:
Next, let's look step by step at how each feature relates to the most important column, Survived, using calls such as:
train_df[['Pclass', 'Survived']].groupby(['Pclass'], as_index=False).mean().sort_values(by='Survived', ascending=False)
train_df[["SibSp", "Survived"]].groupby([SibSp], as_index=False).mean().sort_values(by=Survived, ascending=False)
and so on. This shows how each numeric feature relates to Survived; the mean() step only makes sense on numeric columns, so string-typed features have to be encoded as numbers first, otherwise it fails.
Next comes the visual analysis, using the seaborn library. Here is a simple example:
#g = sns.FacetGrid(train_df, col='Survived')
#g.map(plt.hist, 'Age', bins=20)
g = sns.FacetGrid(train_df, col='Survived')
g.map(plt.hist, 'Age', bins=30)
This is a histogram of Age split by Survived; the bins parameter divides the Age range into that many equal-width intervals. Next is an analysis with two variables:
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Survived')
grid = sns.FacetGrid(train_df, col='Survived', row='Pclass', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=0.5, bins=20)
grid.add_legend();
size is the height of each facet in inches and aspect is the width-to-height ratio (each facet's width is size × aspect). This grid shows the relationship between Age, Pclass and Survived, and you can pick out some patterns directly from it.
# grid = sns.FacetGrid(train_df, col='Embarked')
grid = sns.FacetGrid(train_df, row='Embarked', size=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep')
grid.add_legend()
We can also plot three variables against Survived at once; the figure above relates Sex, Pclass and Embarked to Survived. Of course, it can also be drawn like this:
# grid = sns.FacetGrid(train_df, col='Embarked', hue='Survived', palette={0: 'k', 1: 'w'})
grid = sns.FacetGrid(train_df, row='Embarked', col='Survived', size=2.2, aspect=1.6)
grid.map(sns.barplot, 'Sex', 'Fare', alpha=.5, ci=None)
grid.add_legend()
That more or less wraps up the data-analysis part. None of the functions are very hard, but there are still things I do not fully understand; this is a field that takes real, sustained effort to learn well.
Next is the process of building a model based on the analysis above. Let's walk through it:
print("Before", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape)train_df = train_df.drop([Ticket, Cabin], axis=1)test_df = test_df.drop([Ticket, Cabin], axis=1)combine = [train_df, test_df]"After", train_df.shape, test_df.shape, combine[0].shape, combine[1].shape
From the analysis above and from common sense, the Ticket and Cabin columns contribute little to predicting survival (Ticket has many duplicates and Cabin is mostly missing), so we drop them from the data.
Next come the operations that keep or modify feature columns across the entire dataset.
for dataset in combine:
    dataset['Title'] = dataset.Name.str.extract(' ([A-Za-z]+)\.', expand=False)

pd.crosstab(train_df['Title'], train_df['Sex'])
The for loop walks over the two DataFrames in combine and adds a new Title column to each. The value is extracted from the Name column with a regular expression: ' ([A-Za-z]+)\.' captures the word that is followed by a period, i.e. the title embedded in the name.
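To see what the regex actually captures, here is a minimal sketch on a single sample name (the string is used purely for illustration):

# ' ([A-Za-z]+)\.' captures the letters between a space and a period, i.e. the title
sample = pd.Series(["Braund, Mr. Owen Harris"])
sample.str.extract(' ([A-Za-z]+)\.', expand=False)   # -> "Mr"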
Here is the printed result. From the output, what the regex matches is the title portion of the Name field; comparing against the dataset they line up, although there seem to be a few more distinct titles than expected.
These are the counts taken from the raw data, but this many categories is not helpful for the analysis, so we merge the rare or similar titles into a single label:
for dataset in combine:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess', 'Capt', 'Col',
        'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

train_df[['Title', 'Survived']].groupby(['Title'], as_index=False).mean()
This collapses the Title column into five categories: Mr, Miss, Mrs, Master and Rare, and then prints each one's relationship with Survived:
Then tidy up the remaining information:
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Rare": 5}
for dataset in combine:
    dataset['Title'] = dataset['Title'].map(title_mapping)
    dataset['Title'] = dataset['Title'].fillna(0)

train_df.head()
The titles are replaced with the numbers 1 to 5, and rows with no recognized title are filled with 0. With the Title column now part of our table, the useful information in Name has been extracted, so we can drop the Name column to simplify the data.
train_df = train_df.drop(['Name', 'PassengerId'], axis=1)
test_df = test_df.drop(['Name'], axis=1)
combine = [train_df, test_df]
train_df.shape, test_df.shape
Now train and test have the same columns. Next, convert the string-valued Sex column into numbers:
for dataset in combine:
    dataset['Sex'] = dataset['Sex'].map({'female': 1, 'male': 0}).astype(int)

train_df.head()
With that, the remaining nine columns are all numeric, which makes the rest of the analysis easier.
Next we deal with missing values. Age has quite a few, and it is an important feature for survival, so filling in the missing ages is the next task. There are three straightforward approaches:
1. Simple: generate a random number between the mean minus one standard deviation and the mean plus one standard deviation (sketched after this list).
2. More accurate: guess the missing values from correlated features; here Age correlates with Pclass and Sex.
3. Combine 1 and 2: within each feature combination, generate a random number between the mean minus and plus one standard deviation.
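For reference, a minimal sketch of what method 1 would look like (based on the lines that stay commented out in the notebook's code further down; the notebook itself ends up using method 2):

# Method 1 sketch: draw a random age from [mean - std, mean + std] of the known ages
known_ages = train_df['Age'].dropna()
age_mean, age_std = known_ages.mean(), known_ages.std()
random_age = rnd.uniform(age_mean - age_std, age_mean + age_std)
random_age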
For methods 1 and 3 the random draws introduce too much noise, so we go straight to method 2. We first create an empty array to hold the guessed ages; the guesses are based on Pclass and Sex.
# grid = sns.FacetGrid(train_df, col='Pclass', hue='Gender')
grid = sns.FacetGrid(train_df, row='Pclass', col='Sex', size=2.2, aspect=1.6)
grid.map(plt.hist, 'Age', alpha=.5, bins=20)
grid.add_legend()
Then assign the guessed values directly:
guess_ages = np.zeros((2,3))
guess_ages

for dataset in combine:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) &
                               (dataset['Pclass'] == j+1)]['Age'].dropna()

            # age_mean = guess_df.mean()
            # age_std = guess_df.std()
            # age_guess = rnd.uniform(age_mean - age_std, age_mean + age_std)

            age_guess = guess_df.median()

            # Convert random age float to nearest .5 age
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5

    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)

train_df.head()
Inside the loop, each guess is the median age of that Sex/Pclass combination, rounded to the nearest 0.5; once the array is filled, its values are assigned to the rows whose Age is null.
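The rounding expression is easy to check by hand (a minimal worked example, values made up):

# int(x/0.5 + 0.5) * 0.5 rounds x to the nearest 0.5
age_guess = 28.3
int(age_guess/0.5 + 0.5) * 0.5   # 28.3/0.5 = 56.6 -> int(57.1) = 57 -> 28.5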
Next, Age still spans a wide range, so we cut it into five bands and compute the survival rate of each band:
train_df['AgeBand'] = pd.cut(train_df['Age'], 5)
train_df[['AgeBand', 'Survived']].groupby(['AgeBand'], as_index=False).mean().sort_values(by='AgeBand', ascending=True)
Then replace the ages with the band indices, written back into the Age column as ordinal numbers:
for dataset in combine:
    dataset.loc[ dataset['Age'] <= 16, 'Age'] = 0
    dataset.loc[(dataset['Age'] > 16) & (dataset['Age'] <= 32), 'Age'] = 1
    dataset.loc[(dataset['Age'] > 32) & (dataset['Age'] <= 48), 'Age'] = 2
    dataset.loc[(dataset['Age'] > 48) & (dataset['Age'] <= 64), 'Age'] = 3
    dataset.loc[ dataset['Age'] > 64, 'Age'] = 4

train_df.head()
Then drop the AgeBand helper column:
train_df = train_df.drop(['AgeBand'], axis=1)
combine = [train_df, test_df]
Next, create a new feature, FamilySize, built from the SibSp and Parch columns. It can replace those two original columns, and we can derive further features from it:
for dataset in combine:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

train_df[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).mean().sort_values(by='Survived', ascending=False)
Small families of around two to four people have the highest survival rates. FamilySize is a feature we built ourselves, with its survival rates computed from scratch, so we can now drop the two original columns and keep just this one. Next, build a feature called IsAlone on top of FamilySize, and check its survival rate as well:
for dataset in combine:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

train_df[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).mean()
Of course, we could analyse the relationships between the variables more rigorously and engineer better features; below is a feature combining Pclass and Age:
for dataset in combine:
    dataset['Age*Class'] = dataset.Age * dataset.Pclass

train_df.loc[:, ['Age*Class', 'Age', 'Pclass']].head(10)
freq_port = train_df.Embarked.dropna().mode()[0]
freq_port

for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].fillna(freq_port)

train_df[['Embarked', 'Survived']].groupby(['Embarked'], as_index=False).mean().sort_values(by='Survived', ascending=False)
This computes the most frequent value (the mode) and fills it into the missing entries of the Embarked column, then converts Embarked into a numeric feature:
for dataset in combine:
    dataset['Embarked'] = dataset['Embarked'].map({'S': 0, 'C': 1, 'Q': 2}).astype(int)

train_df.head()
Fare also has a missing value (in the test set), so it needs filling too. Since fare is not expected to influence the result much, simply filling with the median will do:
test_df['Fare'].fillna(test_df['Fare'].dropna().median(), inplace=True)
test_df.head()
Then split Fare into bands and write the band index back into Fare. Here pd.qcut is used, which cuts at quantiles so that each band holds roughly the same number of passengers (unlike pd.cut, which uses equal-width intervals):
train_df['FareBand'] = pd.qcut(train_df['Fare'], 4)
train_df[['FareBand', 'Survived']].groupby(['FareBand'], as_index=False).mean().sort_values(by='FareBand', ascending=True)

for dataset in combine:
    dataset.loc[ dataset['Fare'] <= 7.91, 'Fare'] = 0
    dataset.loc[(dataset['Fare'] > 7.91) & (dataset['Fare'] <= 14.454), 'Fare'] = 1
    dataset.loc[(dataset['Fare'] > 14.454) & (dataset['Fare'] <= 31), 'Fare'] = 2
    dataset.loc[ dataset['Fare'] > 31, 'Fare'] = 3
    dataset['Fare'] = dataset['Fare'].astype(int)

train_df = train_df.drop(['FareBand'], axis=1)
combine = [train_df, test_df]

train_df.head(10)
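A minimal sketch of the difference between pd.cut and pd.qcut on a toy series (the numbers are made up for illustration):

fares = pd.Series([0, 1, 2, 3, 4, 100])
pd.cut(fares, 2)    # equal-width bins: roughly (0, 50] and (50, 100] -> 5 values vs 1
pd.qcut(fares, 2)   # equal-frequency bins split at the median -> 3 values in each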
At this point, all of the data has been turned into simple numbers. What remains is to use the processed data to build a model and predict which passengers in the test set survive.
Model, predict and solve
Now we are ready to train a model and predict the required solution. There are 60+ predictive modelling algorithms to choose from. We must understand the type of problem and solution requirement to narrow down to a select few models which we can evaluate. Our problem is a classification and regression problem. We want to identify the relationship between the output (Survived or not) and the other variables or features (Gender, Age, Port...). We are also performing a category of machine learning called supervised learning, as we are training our model with a given dataset. With these two criteria, Supervised Learning plus Classification and Regression, we can narrow down our choice of models to a few. These include:
- Logistic Regression
- KNN or k-Nearest Neighbors
- Support Vector Machines
- Naive Bayes classifier
- Decision Tree
- Random Forest
- Perceptron
- Artificial neural network
- RVM or Relevance Vector Machine
That is what the referenced article says. The gist is to be clear about how the thing you want to predict relates to the features; there are many algorithms, and we can narrow the choice based on our actual needs. Our task fits supervised learning, which already narrows the field, and the bulleted items above are our candidates.
X_train = train_df.drop("Survived", axis=1)
Y_train = train_df["Survived"]
X_test = test_df.drop("PassengerId", axis=1).copy()
X_train.shape, Y_train.shape, X_test.shape
With the data split like this, the first model used is logistic regression:
# Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, Y_train)
Y_pred = logreg.predict(X_test)
acc_log = round(logreg.score(X_train, Y_train) * 100, 2)
acc_log
Using logistic regression is straightforward: fit the model on the training data and predict on the test data directly, which is a bit rough. The last value printed is not a confidence value but the accuracy on the training set, expressed as a percentage. Looking at the coefficients below: a positive coefficient means that increasing the feature increases the predicted probability of survival, and a negative one means the opposite. Sex has the largest influence, with a coefficient of about 2.2; as Sex goes from 0 (male) to 1 (female) the predicted probability of survival rises sharply, whereas as Pclass increases the chance of survival falls.
coeff_df = pd.DataFrame(train_df.columns.delete(0))
coeff_df.columns = ['Feature']
coeff_df["Correlation"] = pd.Series(logreg.coef_[0])
coeff_df.sort_values(by='Correlation', ascending=False)
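As an aside, a minimal sketch of how these coefficients can be read as odds ratios (assuming the logreg model fitted above):

# exp(coefficient) is the multiplicative change in the odds of survival
# for a one-unit increase in that feature
odds_ratios = np.exp(logreg.coef_[0])
pd.Series(odds_ratios, index=X_train.columns)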
Next is a simple use of a support vector machine:
# Support Vector Machines
svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
The output here is likewise the accuracy on the training set.
Next is KNN:
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
acc_knn = round(knn.score(X_train, Y_train) * 100, 2)
acc_knn
A naive Bayes classifier:
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, Y_train)
Y_pred = gaussian.predict(X_test)
acc_gaussian = round(gaussian.score(X_train, Y_train) * 100, 2)
acc_gaussian
A perceptron:
# Perceptron
perceptron = Perceptron()
perceptron.fit(X_train, Y_train)
Y_pred = perceptron.predict(X_test)
acc_perceptron = round(perceptron.score(X_train, Y_train) * 100, 2)
acc_perceptron
And a few other methods:
# Linear SVC
linear_svc = LinearSVC()
linear_svc.fit(X_train, Y_train)
Y_pred = linear_svc.predict(X_test)
acc_linear_svc = round(linear_svc.score(X_train, Y_train) * 100, 2)
acc_linear_svc

# Stochastic Gradient Descent
sgd = SGDClassifier()
sgd.fit(X_train, Y_train)
Y_pred = sgd.predict(X_test)
acc_sgd = round(sgd.score(X_train, Y_train) * 100, 2)
acc_sgd

# Decision Tree
decision_tree = DecisionTreeClassifier()
decision_tree.fit(X_train, Y_train)
Y_pred = decision_tree.predict(X_test)
acc_decision_tree = round(decision_tree.score(X_train, Y_train) * 100, 2)
acc_decision_tree

# Random Forest
random_forest = RandomForestClassifier(n_estimators=100)
random_forest.fit(X_train, Y_train)
Y_pred = random_forest.predict(X_test)
random_forest.score(X_train, Y_train)
acc_random_forest = round(random_forest.score(X_train, Y_train) * 100, 2)
acc_random_forest
Ranking the methods by their training accuracy gives the following output:
models = pd.DataFrame({
    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
              'Random Forest', 'Naive Bayes', 'Perceptron',
              'Stochastic Gradient Descent', 'Linear SVC',
              'Decision Tree'],
    'Score': [acc_svc, acc_knn, acc_log,
              acc_random_forest, acc_gaussian, acc_perceptron,
              acc_sgd, acc_linear_svc, acc_decision_tree]})
models.sort_values(by='Score', ascending=False)
The best performers on the training data are the random forest and the decision tree, so we take the predictions of the best-performing model and write them into the final submission file.
submission = pd.DataFrame({
        "PassengerId": test_df["PassengerId"],
        "Survived": Y_pred
    })
# submission.to_csv('../output/submission.csv', index=False)
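Note that Y_pred here holds whatever the most recently fitted model predicted (the random forest, given the cell order above). To actually produce the file for upload, the commented line just needs to be enabled; the output path is up to your environment (a minimal sketch, file name assumed):

submission.to_csv('submission.csv', index=False)   # writes PassengerId,Survived rows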
Phew, finally done! Working through the entire notebook from start to finish and taking these notes took roughly a week in total (I admit I was a bit lazy), but this article taught me a great deal, and I feel at least halfway to getting started. A little encouragement for myself: keep going!