Kaggle in Practice (2):
Hmm... it's the house price prediction problem from last time again: Kaggle: House Price
Following this article: Stacked Regressions, I ended up in the top 12%, a decent improvement over the previous top 25%. I also got a much clearer picture of the overall workflow of a Kaggle competition and the points to watch out for, so it was well worth doing.
Without further ado, let's get started:
Import the necessary libraries
import numpy as np   # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
%matplotlib inline
import matplotlib.pyplot as plt  # Matlab-style plotting
import seaborn as sns
color = sns.color_palette()
sns.set_style('darkgrid')

import warnings
def ignore_warn(*args, **kwargs):
    pass
warnings.warn = ignore_warn  # ignore annoying warnings (from sklearn and seaborn)

from scipy import stats
from scipy.stats import norm, skew  # for some statistics

pd.set_option('display.float_format', lambda x: '{:.3f}'.format(x))  # limit float output to 3 decimal points

from subprocess import check_output
print(check_output(["ls", "data"]).decode("utf8"))  # check the files available in the directory
Read the files and look at the first five rows of train and test:
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
train.head(5)
test.head(5)
# Save the 'Id' column
train_ID = train['Id']
test_ID = test['Id']

# Now drop the 'Id' column since it's unnecessary for the prediction process
train.drop("Id", axis=1, inplace=True)
test.drop("Id", axis=1, inplace=True)
Above we saved the Id column separately and then removed it from the data, since it carries no predictive information.
Next, let's look at a scatter plot:
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
Clearly there are two outliers in the bottom right: houses with a very large GrLivArea but an abnormally low SalePrice. We need to remove their influence.
# Deleting outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)

# Check the graphic again
fig, ax = plt.subplots()
ax.scatter(train['GrLivArea'], train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
OK, we have removed these two outliers.
To visualize the distribution of a single variable, seaborn's distplot function is a suitable tool.
Below we plot the KDE of the SalePrice distribution together with a fitted normal curve, and also draw the QQ plot (probability plot); the closer the points lie to the red line, the closer the data is to a normal distribution. See: Quantile-Quantile Plot
sns.distplot(train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
We can clearly see that the target variable is right-skewed, while we generally want the target to be (roughly) normally distributed, so we apply a log transform:
train["SalePrice"] = np.log1p(train["SalePrice"])

sns.distplot(train['SalePrice'], fit=norm)
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
Look at the new distribution:
It is clear that the transformed target is now very close to a normal distribution.
Feature engineering
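One caveat worth keeping in mind: because we now model log1p(SalePrice), any final prediction has to be mapped back to the original price scale with np.expm1 (the prediction section below does exactly this). As a one-line reminder, with model and X_test standing in as placeholder names:

predicted_price = np.expm1(model.predict(X_test))  # inverse of np.log1p; 'model' and 'X_test' are placeholders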
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))
Missing data
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head(20)
We can see the features with the highest missing ratios:
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
We can see which features have the most severe missingness; we will handle them shortly.
Data correlation
Let's draw a heatmap to see the correlation between the different features and SalePrice:
corrmat = train.corr()
plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=0.9, square=True)
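Besides the heatmap, a quick supplementary check (my own addition, not in the referenced kernel) is to list the features most correlated with SalePrice numerically:

# List the features most strongly correlated with SalePrice (supplementary check)
top_corr = corrmat['SalePrice'].abs().sort_values(ascending=False).head(11)
print(top_corr)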
Next we handle the missing feature values.
For features of object (categorical) type:
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")
Handle LotFrontage: for this feature we can fill missing values with a typical LotFrontage of houses in the same Neighborhood (the referenced kernel uses the Neighborhood median):
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
Handle the missing values in batches. For these object-type features, fill with the string "None":
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
Batch-fill the numerical features with 0 (and the basement quality features, which are categorical, with "None"):
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)

# Basement quality features are categorical, so fill them with 'None' rather than 0
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)
For the following features we fill missing values with the mode (most frequent value):
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

# Check that no missing values remain
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head()
Further feature engineering
Label-encode the categorical features (including numerical variables that are really categorical) whose values may carry ordinal information:
from sklearn.preprocessing import LabelEncoder

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure',
        'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street',
        'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 'YrSold', 'MoSold')

# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape
print('Shape all_data: {}'.format(all_data.shape))
Shape all_data: (2917, 78)
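As a small aside, here is a tiny sketch (with made-up category values) of what LabelEncoder actually does: it assigns integer codes in alphabetical order of the labels, so the codes do not necessarily respect the quality ordering (Ex > Gd > TA > Fa > None).

# Tiny illustration with made-up values: LabelEncoder codes labels alphabetically
lbl = LabelEncoder()
print(lbl.fit_transform(['None', 'Fa', 'TA', 'Gd', 'Ex', 'TA']))  # [3 1 4 2 0 4]
print(lbl.classes_)  # ['Ex' 'Fa' 'Gd' 'None' 'TA']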
Adding one important feature
Since living area is critical to house prices, we add TotalSF, the total area: TotalSF = TotalBsmtSF + 1stFlrSF + 2ndFlrSF, i.e. the basement area plus the first-floor area plus the second-floor area.
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
Check the features with large skewness:
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
Apply a Box-Cox transformation to the highly skewed features:
# Keep only the features whose skew exceeds 0.75 (filter on the Skew column)
skewness = skewness[abs(skewness.Skew) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    # all_data[feat] += 1
    all_data[feat] = boxcox1p(all_data[feat], lam)

# One-hot encode the remaining categorical features
all_data = pd.get_dummies(all_data)
print(all_data.shape)

train = all_data[:ntrain]
test = all_data[ntrain:]
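For some intuition about what boxcox1p does, here is a quick sanity check on hypothetical values (not part of the pipeline): with lambda = 0 it reduces exactly to log1p, and lambda = 0.15 is only a mild deviation from that.

# Sanity check on boxcox1p (hypothetical values)
x = np.array([0.0, 1.0, 10.0, 100.0])
print(boxcox1p(x, 0.0))    # equals np.log1p(x)
print(np.log1p(x))
print(boxcox1p(x, 0.15))   # lam = 0.15, the value used above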
Models:
Import the necessary libraries
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb
Cross-validation strategy
n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse
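Note that rmsle_cv above passes get_n_splits() (just the integer 5) as cv, so the shuffle and random_state settings have no actual effect. If you want shuffled folds, a small variant of my own (not the kernel's code) is to pass the KFold splitter object itself:

def rmsle_cv_shuffled(model):
    # Pass the KFold splitter itself so shuffle/random_state actually apply
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return rmse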
Base models
1. Lasso Regression
This model can be very sensitive to outliers, so we make it more robust by putting sklearn's RobustScaler() in front of it in a pipeline:
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
2. Elastic Net Regression:
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
3. Kernel Ridge Regression:
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
4. Gradient Boosting Regression:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
5. XGBoost:
model_xgb = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state=7, nthread=-1)
6. LightGBM:
model_lgb = lgb.LGBMRegressor(objective='regression', num_leaves=5,
                              learning_rate=0.05, n_estimators=720,
                              max_bin=55, bagging_fraction=0.8,
                              bagging_freq=5, feature_fraction=0.2319,
                              feature_fraction_seed=9, bagging_seed=9,
                              min_data_in_leaf=6, min_sum_hessian_in_leaf=11)
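Before blending, it is worth cross-validating each base model with the rmsle_cv function defined above to get a feel for the individual scores. A small sketch along the lines of the referenced kernel (the printed numbers will vary slightly with library versions):

# Cross-validate each base model defined above
for name, model in [('Lasso', lasso), ('ElasticNet', ENet), ('Kernel Ridge', KRR),
                    ('Gradient Boosting', GBoost), ('XGBoost', model_xgb), ('LightGBM', model_lgb)]:
    score = rmsle_cv(model)
    print("{} score: {:.4f} ({:.4f})".format(name, score.mean(), score.std()))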
Model blending
The simplest stacking approach: averaging the base models
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
Here we simply take the average of four models: ENet, GBoost, KRR and lasso.
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Even this simplest averaging approach already improves the score, which encourages us to go further and explore a less simple stacking method.
A less simple stacking approach: adding a meta-model. The idea is to train clones of the base models on k-1 folds, collect their out-of-fold predictions on the held-out fold, and then train a meta-model (here lasso) on those out-of-fold predictions so that it learns how best to combine the base models.
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Do the predictions of all base models on the test data and use the averaged predictions
    # as meta-features for the final prediction, which is done by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)

stacked_averaged_models = StackingAveragedModels(base_models=(ENet, GBoost, KRR),
                                                 meta_model=lasso)
score = rmsle_cv(stacked_averaged_models)
print("Stacking Averaged models score: {:.4f} ({:.4f})".format(score.mean(), score.std()))
Ensembling the stacked regressor with XGBoost and LightGBM
First, define an rmsle evaluation function:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
Final training and prediction
Stacked regressor:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))
0.077665778507
XGBoost:

model_xgb.fit(train, y_train)
xgb_train_pred = model_xgb.predict(train)
xgb_pred = np.expm1(model_xgb.predict(test))
print(rmsle(y_train, xgb_train_pred))
0.0786703917101
LightGBM:
model_lgb.fit(train, y_train)
lgb_train_pred = model_lgb.predict(train)
lgb_pred = np.expm1(model_lgb.predict(test.values))
print(rmsle(y_train, lgb_train_pred))

Blend the training predictions above with a weighted average:
print('RMSLE score on train data:')
print(rmsle(y_train, stacked_train_pred * 0.70 + xgb_train_pred * 0.15 + lgb_train_pred * 0.15))
Ensemble prediction and submission file:
ensemble = stacked_pred * 0.70 + xgb_pred * 0.15 + lgb_pred * 0.15
sub = pd.DataFrame()
sub['Id'] = test_ID
sub['SalePrice'] = ensemble
sub.to_csv('submission1.csv', index=False)

Submit the result: