Machine Learning in Practice | Census Income Prediction Case Study (Very Complete — Worth Bookmarking)
# Data Manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import missingno
import seaborn as sns
from pandas.plotting import scatter_matrix  # pandas.tools.plotting is deprecated; pandas.plotting is the current location
from mpl_toolkits.mplot3d import Axes3D

# Feature Selection and Encoding
from sklearn.feature_selection import RFE, RFECV
from sklearn.svm import SVR
from sklearn.decomposition import PCA
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, label_binarize

# Machine learning
import sklearn.ensemble as ske
from sklearn import datasets, model_selection, tree, preprocessing, metrics, linear_model
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LinearRegression, LogisticRegression, Ridge, Lasso, SGDClassifier
from sklearn.tree import DecisionTreeClassifier
# import tensorflow as tf

# Grid and Random Search
import scipy.stats as st
from scipy.stats import randint as sp_randint
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

# Metrics
from sklearn.metrics import precision_recall_fscore_support, roc_curve, auc

# Managing Warnings
import warnings
warnings.filterwarnings('ignore')

# Plot the Figures Inline
%matplotlib inline
Objective
Our task: predict whether a person's income exceeds $50K per year.
Census income dataset: https://archive.ics.uci.edu/ml/datasets/adult
# Load Training and Test Data Sets
headers = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
           'marital-status', 'occupation', 'relationship', 'race', 'sex',
           'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'predclass']
training_raw = pd.read_csv('dataset/adult.data', header=None, names=headers,
                           sep=r',\s', na_values=["?"], engine='python')
test_raw = pd.read_csv('dataset/adult.test', header=None, names=headers,
                       sep=r',\s', na_values=["?"], engine='python', skiprows=1)
import pandas as pd
print(help(pd.read_csv))
# Join Datasets (pd.concat replaces the now-removed DataFrame.append)
dataset_raw = pd.concat([training_raw, test_raw])
dataset_raw.reset_index(drop=True, inplace=True)
dataset_raw.head()
- age (double): age
- workclass (string): type of employment
- fnlwgt (double): final sampling weight
- education (string): education level
- education-num (double): years of education
- marital-status (string): marital status
- occupation (string): occupation
- relationship (string): relationship within the household
- race (string): race
- sex (string): sex
- capital-gain (double): capital gains
- capital-loss (double): capital losses
- hours-per-week (double): hours worked per week
- native-country (string): native country
- predclass / income (string): income class, the prediction target
Univariate Analysis
For features, we can analyse each one on its own or look at relationships between them. We start with single features. Features fall into two simple types: numerical and categorical.
- Numerical: plain numbers
- Categorical: categories or strings
# Summary statistics for all numerical features
dataset_raw.describe()
# Summary statistics for all categorical features
dataset_raw.describe(include=['O'])
# Plot the distribution of every feature
import math

def plot_distribution(dataset, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if dataset.dtypes[column] == object:
            g = sns.countplot(y=column, data=dataset)
            substrings = [s.get_text()[:18] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)
            plt.xticks(rotation=25)
        else:
            # Histogram / density estimate of the numerical feature
            g = sns.distplot(dataset[column])
            plt.xticks(rotation=25)

plot_distribution(dataset_raw, cols=3, width=20, height=20, hspace=0.45, wspace=0.5)
# Visualise missing values
missingno.matrix(dataset_raw, figsize=(30, 5))
missingno.bar(dataset_raw, sort='ascending', figsize=(30, 5))
Feature Cleaning, Engineering
Cleaning: data preprocessing steps (a quick sanity-check sketch follows this list):
- Missing values: fill in missing entries
- Special values: values produced by errors, e.g. ±Inf, NA, NaN
- Outliers: points that may distort the results; identify them first
- Invalid values: e.g. a person's age can never be negative
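As an illustration of these checks, here is a minimal sketch (not part of the original notebook) that counts missing values, flags infinite entries, and looks for implausible ages; it only assumes the dataset_raw frame loaded above.

import numpy as np
import pandas as pd

def sanity_check(df):
    # Missing values per column
    print(df.isnull().sum())
    # +/-Inf in numeric columns
    numeric = df.select_dtypes(include=[np.number])
    print("Infinite values:", np.isinf(numeric).sum().sum())
    # Invalid values: age should be a positive, human-plausible number
    bad_age = df[(df['age'] < 0) | (df['age'] > 120)]
    print("Rows with implausible age:", len(bad_age))

sanity_check(dataset_raw)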
Feature engineering: there are multiple techniques for feature engineering (see the sketch after this list):
- Feature decomposition: e.g. turning a timestamp such as 2014-09-20T20:45:40Z into day, hour, and similar components.
- Discretisation: we can discretise some of the continuous variables we have, since some algorithms run faster on discrete inputs. But what effect does this have on the results? We need to compare models built on discretised and non-discretised data:
  - dataset_bin => the dataset whose continuous values are discretised
  - dataset_con => the non-discretised dataset
- Feature crossing: combining several features into a new one
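To make the first two ideas concrete, here is a small illustrative sketch; the timestamp string and the bin count of 10 are example values, not taken from this dataset.

import pandas as pd

# Feature decomposition: split a timestamp into useful parts
ts = pd.to_datetime("2014-09-20T20:45:40Z")
print(ts.year, ts.month, ts.day, ts.hour, ts.dayofweek)

# Discretisation: bin a continuous column into 10 equal-width intervals
age_binned = pd.cut(dataset_raw['age'], 10)
print(age_binned.value_counts().sort_index())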
Missing values: they can be filled in several different ways (a minimal imputation sketch follows this list):
- Supplement with additional data: usually hard to obtain
- Mean imputation: filling with the mean leaves the dataset's overall mean unchanged
- Regression imputation: build a regression model and use its predictions as fill values
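As an illustration, a minimal sketch of mean imputation for the numeric columns and most-frequent-value imputation for the categorical ones; this is one reasonable choice for this dataset, not necessarily the approach used later in the article.

import numpy as np

df = dataset_raw.copy()

# Mean imputation keeps each numeric column's mean unchanged
for col in df.select_dtypes(include=[np.number]).columns:
    df[col] = df[col].fillna(df[col].mean())

# Mode imputation for categorical columns
for col in df.select_dtypes(include=['object']).columns:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df.isnull().sum().sum())  # should now be 0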
# Create two new datasets
dataset_bin = pd.DataFrame()  # To contain our dataframe with our discretised continuous variables
dataset_con = pd.DataFrame()  # To contain our dataframe with our continuous variables
Label Transformation
If income is greater than $50K the label is 1; otherwise it is 0.
# Let's fix the Class Feature
dataset_raw.loc[dataset_raw['predclass'] == '>50K', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass'] == '>50K.', 'predclass'] = 1
dataset_raw.loc[dataset_raw['predclass'] == '<=50K', 'predclass'] = 0
dataset_raw.loc[dataset_raw['predclass'] == '<=50K.', 'predclass'] = 0
dataset_bin['predclass'] = dataset_raw['predclass']
dataset_con['predclass'] = dataset_raw['predclass']
# The classes are fairly imbalanced
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20, 1))
sns.countplot(y="predclass", data=dataset_bin);
Feature: Age
dataset_bin['age'] = pd.cut(dataset_raw['age'], 10)  # discretise the continuous values into 10 bins
dataset_con['age'] = dataset_raw['age']              # non-discretised
# Left: the binned distribution; right: the distribution split by income class
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
sns.countplot(y="age", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 1]['age'], kde_kws={"label": ">$50K"});
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 0]['age'], kde_kws={"label": "<$50K"});
(The remaining features are processed in the same way; that code is omitted here for brevity.)
Bivariate Analysis
Next, we look at relationships between features.
# Look at how the two income classes are distributed across the categorical features
def plot_bivariate_bar(dataset, hue, cols=5, width=20, height=15, hspace=0.2, wspace=0.5):
    dataset = dataset.select_dtypes(include=['object'])
    plt.style.use('seaborn-whitegrid')
    fig = plt.figure(figsize=(width, height))
    fig.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=wspace, hspace=hspace)
    rows = math.ceil(float(dataset.shape[1]) / cols)
    for i, column in enumerate(dataset.columns):
        ax = fig.add_subplot(rows, cols, i + 1)
        ax.set_title(column)
        if dataset.dtypes[column] == object:
            g = sns.countplot(y=column, hue=hue, data=dataset)
            substrings = [s.get_text()[:10] for s in g.get_yticklabels()]
            g.set(yticklabels=substrings)

plot_bivariate_bar(dataset_con, hue='predclass', cols=3, width=20, height=12, hspace=0.4, wspace=0.5)
# Effect of marital status and education on income
plt.style.use('seaborn-whitegrid')
g = sns.FacetGrid(dataset_con, col='marital-status', size=4, aspect=.7)
g = g.map(sns.boxplot, 'predclass', 'education-num')
# Effect of sex and education on income
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20, 4))
plt.subplot(1, 3, 1)
sns.violinplot(x='sex', y='education-num', hue='predclass', data=dataset_con, split=True, scale='count');
plt.subplot(1, 3, 2)
sns.violinplot(x='sex', y='hours-per-week', hue='predclass', data=dataset_con, split=True, scale='count');
plt.subplot(1, 3, 3)
sns.violinplot(x='sex', y='age', hue='predclass', data=dataset_con, split=True, scale='count');
# Pairwise scatter plots between features
sns.pairplot(dataset_con[['age', 'education-num', 'hours-per-week', 'predclass', 'capital-gain', 'capital-loss']],
             hue="predclass", diag_kind="kde", size=4);
Feature Crossing: Age + Hours Per Week
Time to engineer a new variable.
# Crossing Numerical Features
dataset_con['age-hours'] = dataset_con['age'] * dataset_con['hours-per-week']
dataset_bin['age-hours'] = pd.cut(dataset_con['age-hours'], 10)
dataset_con['age-hours'] = dataset_con['age-hours']

plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(20, 5))
plt.subplot(1, 2, 1)
sns.countplot(y="age-hours", data=dataset_bin);
plt.subplot(1, 2, 2)
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 1]['age-hours'], kde_kws={"label": ">$50K"});
sns.distplot(dataset_con.loc[dataset_con['predclass'] == 0]['age-hours'], kde_kws={"label": "<$50K"});
(The remaining crossed features are handled the same way; that code is omitted here.)
Feature Encoding
Features need to be encoded because machine learning models only understand numbers. Additional resources: http://pbpython.com/categorical-encoding.html
# One Hot Encodes
one_hot_cols = dataset_bin.columns.tolist()
one_hot_cols.remove('predclass')
dataset_bin_enc = pd.get_dummies(dataset_bin, columns=one_hot_cols)
dataset_bin_enc.head()
# Label Encode
dataset_con_test = dataset_con
dataset_con_test['workclass'] = dataset_con['workclass'].factorize()[0]
dataset_con_test['occupation'] = dataset_con['occupation'].factorize()[0]
dataset_con_test['native-country'] = dataset_con['native-country'].factorize()[0]
dataset_con_enc = dataset_con_test.apply(LabelEncoder().fit_transform)
dataset_con_enc.head()
Feature Selection
Having many features does not mean they are all useful; we need to pick out the valuable ones and keep only those (a small usage sketch follows this list):
- Dimensionality reduction:
  - Principal Component Analysis (PCA): the most common reduction technique; project the data onto a new basis and keep a chosen number of dimensions
  - Singular Value Decomposition (SVD): extract components that carry specific meaning
  - Linear Discriminant Analysis (LDA): find the feature space best suited for classification
- Feature importance / relevance:
  - Filtering: find the features that influence the outcome the most
  - Subset evaluation: experiment with subsets of the features
  - Ensemble methods: e.g. random forests
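For reference, here is a minimal sketch of how SVD and LDA could be applied to the label-encoded dataset dataset_con_enc built above; this is an illustration, not part of the original workflow, and the 5-component choice is arbitrary.

from sklearn.decomposition import TruncatedSVD
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = dataset_con_enc.drop('predclass', axis=1)
y = dataset_con_enc['predclass']

# SVD: keep the 5 strongest singular directions
svd = TruncatedSVD(n_components=5)
X_svd = svd.fit_transform(X)
print(svd.explained_variance_ratio_)

# LDA: with two classes there is at most one discriminant direction
lda = LinearDiscriminantAnalysis(n_components=1)
X_lda = lda.fit_transform(X, y)
print(X_lda.shape)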
Feature Correlation
Correlation measures how two random variables vary together. Ideally, features should be uncorrelated with each other while being highly correlated with the target we are trying to predict.

# Correlation heatmaps for both datasets
plt.style.use('seaborn-whitegrid')
fig = plt.figure(figsize=(25, 10))

plt.subplot(1, 2, 1)
mask = np.zeros_like(dataset_bin_enc.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_bin_enc.corr(), vmin=-1, vmax=1, square=True,
            cmap=sns.color_palette("RdBu_r", 100), mask=mask, linewidths=.5);

plt.subplot(1, 2, 2)
mask = np.zeros_like(dataset_con_enc.corr(), dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(dataset_con_enc.corr(), vmin=-1, vmax=1, square=True,
            cmap=sns.color_palette("RdBu_r", 100), mask=mask, linewidths=.5);
Feature Importance
Feature importance can be assessed with a random forest.
# Using Random Forest to gain an insight on Feature Importance
clf = RandomForestClassifier()
clf.fit(dataset_con_enc.drop('predclass', axis=1), dataset_con_enc['predclass'])

plt.style.use('seaborn-whitegrid')
importance = clf.feature_importances_
importance = pd.DataFrame(importance, index=dataset_con_enc.drop('predclass', axis=1).columns, columns=["Importance"])
importance.sort_values(by='Importance', ascending=True).plot(kind='barh', figsize=(20, len(importance)/2));
PCA
Should we reduce dimensionality at all? There is no fixed rule; in machine learning no single algorithm or recipe is guaranteed to be right, so we have to experiment.
Relevant parameters (a small usage sketch follows this list):
- n_components: the target dimensionality after PCA. Most commonly, we pass the number of dimensions directly, in which case n_components is an integer >= 1. Alternatively, we can pass a minimum fraction of variance to retain and let PCA decide how many dimensions it needs; in that case n_components is a number in (0, 1].
- whiten: whether to whiten the output. Whitening normalises each projected component to unit variance. For PCA itself whitening is usually unnecessary, but it can help if further processing follows. The default is False (no whitening).
- Besides these input parameters, two attributes of a fitted PCA are worth watching. explained_variance_ gives the variance of each principal component; the larger the variance, the more important the component. explained_variance_ratio_ gives each component's share of the total variance; again, larger means more important.
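A minimal sketch of these two usage modes on the label-encoded dataset; the 3-component and 95%-variance choices are just example values, not from the original.

from sklearn.decomposition import PCA
from sklearn import preprocessing

X = preprocessing.StandardScaler().fit_transform(dataset_con_enc.drop('predclass', axis=1))

# Mode 1: keep an explicit number of components, with whitening enabled
pca_fixed = PCA(n_components=3, whiten=True).fit(X)
print(pca_fixed.explained_variance_)        # variance captured by each component
print(pca_fixed.explained_variance_ratio_)  # share of total variance per component

# Mode 2: keep as many components as needed to retain 95% of the variance
pca_auto = PCA(n_components=0.95).fit(X)
print(pca_auto.n_components_)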
# Calculating PCA for both datasets, and graphing the Variance for each feature, per dataset
std_scale = preprocessing.StandardScaler().fit(dataset_bin_enc.drop('predclass', axis=1))
X = std_scale.transform(dataset_bin_enc.drop('predclass', axis=1))
pca1 = PCA(n_components=len(dataset_bin_enc.columns)-1)
fit1 = pca1.fit(X)

std_scale = preprocessing.StandardScaler().fit(dataset_con_enc.drop('predclass', axis=1))
X = std_scale.transform(dataset_con_enc.drop('predclass', axis=1))
pca2 = PCA(n_components=len(dataset_con_enc.columns)-2)
fit2 = pca2.fit(X)

# Graphing the variance per feature
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(25, 7))

plt.subplot(1, 2, 1)
plt.xlabel('PCA Feature')
plt.ylabel('Variance')
plt.title('PCA for Discretised Dataset')
plt.bar(range(0, fit1.explained_variance_ratio_.size), fit1.explained_variance_ratio_);

plt.subplot(1, 2, 2)
plt.xlabel('PCA Feature')
plt.ylabel('Variance')
plt.title('PCA for Continuous Dataset')
plt.bar(range(0, fit2.explained_variance_ratio_.size), fit2.explained_variance_ratio_);
# PCA components graphed in 2D and 3D
# Apply Scaling
std_scale = preprocessing.StandardScaler().fit(dataset_con_enc.drop('predclass', axis=1))
X = std_scale.transform(dataset_con_enc.drop('predclass', axis=1))
y = dataset_con_enc['predclass']

# Formatting
target_names = [0, 1]
colors = ['navy', 'darkorange']
lw = 2
alpha = 0.3

# 2 Components PCA
plt.style.use('seaborn-whitegrid')
plt.figure(2, figsize=(20, 8))

plt.subplot(1, 2, 1)
pca = PCA(n_components=2)
X_r = pca.fit(X).transform(X)
for color, i, target_name in zip(colors, [0, 1], target_names):
    plt.scatter(X_r[y == i, 0], X_r[y == i, 1], color=color, alpha=alpha, lw=lw, label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('First two PCA directions');

# 3 Components PCA
ax = plt.subplot(1, 2, 2, projection='3d')
pca = PCA(n_components=3)
X_reduced = pca.fit(X).transform(X)
for color, i, target_name in zip(colors, [0, 1], target_names):
    ax.scatter(X_reduced[y == i, 0], X_reduced[y == i, 1], X_reduced[y == i, 2],
               color=color, alpha=alpha, lw=lw, label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
ax.set_title("First three PCA directions")
ax.set_xlabel("1st eigenvector")
ax.set_ylabel("2nd eigenvector")
ax.set_zlabel("3rd eigenvector")

# Rotate the axes
ax.view_init(30, 10)
Recursive Feature Elimination
- The main idea of recursive feature elimination is to build a model (such as an SVM or a regression model) repeatedly, pick the best (or worst) feature in each round (for example based on its coefficient), set that feature aside, and repeat the process on the remaining features until all of them have been visited. The order in which features are eliminated then gives the feature ranking, so this is a greedy algorithm for finding an optimal feature subset.
# Calculating RFE for non-discretised dataset, and graphing the Importance for each feature, per dataset
selector1 = RFECV(LogisticRegression(), step=1, cv=5, n_jobs=-1)
selector1 = selector1.fit(dataset_con_enc.drop('predclass', axis=1).values, dataset_con_enc['predclass'].values)
print("Feature Ranking For Non-Discretised: %s" % selector1.ranking_)
print("Optimal number of features : %d" % selector1.n_features_)

# Plot number of features VS. cross-validation scores
plt.style.use('seaborn-whitegrid')
plt.figure(figsize=(20, 5))
plt.xlabel("Number of features selected - Non-Discretised")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(selector1.grid_scores_) + 1), selector1.grid_scores_);

# Feature space could be subsetted like so:
dataset_con_enc = dataset_con_enc[dataset_con_enc.columns[np.insert(selector1.support_, 0, True)]]
Choosing Between the Encoded Datasets
Let's try different machine learning algorithms on them and compare the results.
Machine Learning Algorithms
- KNN
- Logistic Regression
- Random Forest
- Naive Bayes
- Stochastic Gradient Descent
- Linear SVC
- Decision Tree
- Gradient Boosted Trees
sklearn provides many generic helper functions, so we can put together our own evaluation scheme. The model-fitting snippets below also assume X_train, y_train, X_test and y_test already exist; one way to build them is sketched next.
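The construction of the train/test arrays is not shown in this excerpt. A plausible way to create them from the one-hot-encoded dataset is sketched below; the 80/20 stratified split is an assumption, not taken from the original.

from sklearn.model_selection import train_test_split

X = dataset_bin_enc.drop('predclass', axis=1)
y = dataset_bin_enc['predclass'].astype(int)

# Hold out 20% of the rows for testing; stratify to preserve the class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=0, stratify=y)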
# Compute fpr/tpr at different thresholds and plot the ROC curve
def plot_roc_curve(y_test, preds):
    fpr, tpr, threshold = metrics.roc_curve(y_test, preds)
    roc_auc = metrics.auc(fpr, tpr)
    plt.title('Receiver Operating Characteristic')
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.xlim([-0.01, 1.01])
    plt.ylim([-0.01, 1.01])
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()
# Fit a model and return its test accuracy and 10-fold CV accuracy
def fit_ml_algo(algo, X_train, y_train, X_test, cv):
    # One Pass
    model = algo.fit(X_train, y_train)
    test_pred = model.predict(X_test)
    if (isinstance(algo, (LogisticRegression,
                          KNeighborsClassifier,
                          GaussianNB,
                          DecisionTreeClassifier,
                          RandomForestClassifier,
                          GradientBoostingClassifier))):
        probs = model.predict_proba(X_test)[:, 1]
    else:
        probs = "Not Available"
    acc = round(model.score(X_test, y_test) * 100, 2)  # note: relies on the global y_test
    # CV
    train_pred = model_selection.cross_val_predict(algo, X_train, y_train, cv=cv, n_jobs=-1)
    acc_cv = round(metrics.accuracy_score(y_train, train_pred) * 100, 2)
    return train_pred, test_pred, acc, acc_cv, probs
# Logistic Regression
import time
import datetime

start_time = time.time()
train_pred_log, test_pred_log, acc_log, acc_cv_log, probs_log = fit_ml_algo(LogisticRegression(n_jobs=-1),
                                                                            X_train, y_train, X_test, 10)
log_time = (time.time() - start_time)
print("Accuracy: %s" % acc_log)
print("Accuracy CV 10-Fold: %s" % acc_cv_log)
print("Running Time: %s" % datetime.timedelta(seconds=log_time))
Accuracy: 83.17
Accuracy CV 10-Fold: 82.79
Running Time: 0:00:07.055535

(The same fitting and AUC evaluation is repeated for the other algorithms; that code is omitted here.)
Ranking Results
Let's rank the results for all the algorithms we have used.
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Linear SVC', 'Decision Tree', 'Gradient Boosting Trees'],
    'Score': [acc_knn, acc_log, acc_rf, acc_gaussian,
              acc_sgd, acc_linear_svc, acc_dt, acc_gbt]
})
models.sort_values(by='Score', ascending=False)
models = pd.DataFrame({
    'Model': ['KNN', 'Logistic Regression', 'Random Forest', 'Naive Bayes',
              'Stochastic Gradient Descent', 'Linear SVC', 'Decision Tree', 'Gradient Boosting Trees'],
    'Score': [acc_cv_knn, acc_cv_log, acc_cv_rf, acc_cv_gaussian,
              acc_cv_sgd, acc_cv_linear_svc, acc_cv_dt, acc_cv_gbt]
})
models.sort_values(by='Score', ascending=False)
(Additional ranking output is omitted here.)
—END—