Kaggle Competition: House Prices Prediction


The full code is available as a Kaggle kernel or on GitHub.

Competition page: kaggle.com/c/house-pric

The overall setup of this competition: you are given 79 features and must predict the sale price (SalePrice) from them. The features are a mix of categorical and continuous variables, and there are a large number of missing values. Fortunately the organizers provide data_description.txt, which explains the meaning of every feature; once you understand it, most of the missing values can be imputed without much trouble.

The first thing to do in any competition is to understand its evaluation metric; get that wrong at the start and all the later work may be wasted. House Prices is evaluated with the root mean squared error (RMSE), a standard metric for regression problems; here it is computed on the logarithm of the sale price, which is why the target is log-transformed (y_log) later on:

\sqrt{\frac{\sum_{i=1}^{N}(y_i-\hat{y}_i)^2}{N}}
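As a quick sanity check of what a leaderboard score actually measures, the metric can be reproduced in a couple of lines; the snippet below is a toy example with made-up prices, scoring the RMSE of the log prices as noted above:

import numpy as np
from sklearn.metrics import mean_squared_error

# hypothetical true and predicted sale prices
y_true = np.array([200000, 150000, 310000])
y_pred = np.array([195000, 160000, 300000])

# RMSE on the log of the prices, matching the competition's scoring
rmse = np.sqrt(mean_squared_error(np.log(y_true), np.log(y_pred)))
print(rmse)  # about 0.044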

My current score is 0.11421.

Two things contributed the most to my score improvement:

  • Feature engineering: mainly rank-based ordinal encoding of categorical variables, feature combination, and PCA
  • Model ensembling: mainly weighted averaging and Stacking

Both are explained below.

Table of contents:

  1. Exploratory Visualization
  2. Data Cleaning
  3. Feature Engineering
  4. Basic Modeling & Evaluation
  5. Hyperparameters Tuning
  6. Ensemble Methods

Exploratory Visualization

Since there are many raw features, only the construction year (YearBuilt) is visualized here:

import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(15, 8))
sns.boxplot(train.YearBuilt, train.SalePrice)

Newer houses are generally more expensive and older ones cheaper, and the plot roughly confirms that trend. Since YearBuilt takes many distinct values (from 1872 to 2010), one-hot encoding it directly would produce an overly sparse matrix, so during feature engineering it is encoded as ordered integers instead (LabelEncoder).
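A minimal sketch of that encoding step; the full code wraps it in a custom transformer, so treat this snippet as illustrative rather than the exact implementation:

from sklearn.preprocessing import LabelEncoder

# cast to string so LabelEncoder treats the years as categories;
# 4-digit years still sort chronologically, so the label order stays meaningful
lab = LabelEncoder()
full["YearBuilt"] = lab.fit_transform(full["YearBuilt"].astype(str))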


Data Cleaning

The main work here is handling missing values. First, count the missing values per feature:

aa = full.isnull().sum()
aa[aa>0].sort_values(ascending=False)

PoolQC          2908
MiscFeature     2812
Alley           2719
Fence           2346
SalePrice       1459
FireplaceQu     1420
LotFrontage      486
GarageQual       159
GarageCond       159
GarageFinish     159
GarageYrBlt      159
GarageType       157
BsmtExposure      82
BsmtCond          82
BsmtQual          81
BsmtFinType2      80
BsmtFinType1      79
MasVnrType        24
MasVnrArea        23
MSZoning           4
BsmtFullBath       2
BsmtHalfBath       2
Utilities          2
Functional         2
Electrical         1
BsmtUnfSF          1
Exterior1st        1
Exterior2nd        1
TotalBsmtSF        1
GarageCars         1
BsmtFinSF2         1
BsmtFinSF1         1
KitchenQual        1
SaleType           1
GarageArea         1

A careful reading of data_description shows that many of these missing values are not random. For example, the first one, PoolQC, describes pool quality; a missing value simply means the house has no pool, so it can be filled with "None".

All of the following features can be filled with "None":

cols1 = ["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu", "GarageQual",
         "GarageCond", "GarageFinish", "GarageYrBlt", "GarageType", "BsmtExposure",
         "BsmtCond", "BsmtQual", "BsmtFinType2", "BsmtFinType1", "MasVnrType"]
for col in cols1:
    full[col].fillna("None", inplace=True)

The features below mostly describe areas; TotalBsmtSF, for example, is the basement area, and if a house simply has no basement the missing value is filled with 0.

cols = ["MasVnrArea", "BsmtUnfSF", "TotalBsmtSF", "GarageCars", "BsmtFinSF2",
        "BsmtFinSF1", "GarageArea"]
for col in cols:
    full[col].fillna(0, inplace=True)

LotFrontage correlates strongly with LotAreaCut and Neighborhood, so it is imputed with the median within groups defined by those two features.

full["LotFrontage"] = full.groupby(["LotAreaCut", "Neighborhood"])["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))
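LotAreaCut is not an original column; it is a binned version of LotArea created beforehand. A minimal sketch, assuming 10 quantile bins (the exact number of bins used in the full code may differ):

# quantile-bin LotArea so LotFrontage can be imputed within (lot-size bin, Neighborhood) groups
full["LotAreaCut"] = pd.qcut(full["LotArea"], 10, labels=False)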


Feature Engineering

Ordinal encoding of categorical variables

Categorical features are usually numericalized with pandas get_dummies, but in this competition that alone may not be enough. The approach taken here is to group the data by each categorical feature, compute the mean and median of SalePrice for every category, and then assign ordinal values according to that ranking. An example:

MSSubClass describes the type of dwelling; group the data by it:

full.groupby(["MSSubClass"])[["SalePrice"]].agg(["mean", "median", "count"])

Rank the categories according to the resulting table:

180: 1
30: 2,  45: 2
190: 3, 50: 3, 90: 3
85: 4,  40: 4, 160: 4
70: 5,  20: 5, 75: 5, 80: 5, 150: 5
120: 6, 60: 6

In total I ranked roughly 20 features this way; see the full code for details.
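A minimal sketch of how one of these mappings might be applied; the column name oMSSubClass and the dictionary form are illustrative only (the full code generates the "o"-prefixed columns, such as the oMSZoning and oNeighborhood used later, through a helper):

# rank-encode MSSubClass according to the table above; the new column follows the "o" prefix convention
msubclass_map = {180: 1,
                 30: 2, 45: 2,
                 190: 3, 50: 3, 90: 3,
                 85: 4, 40: 4, 160: 4,
                 70: 5, 20: 5, 75: 5, 80: 5, 150: 5,
                 120: 6, 60: 6}
full["oMSSubClass"] = full["MSSubClass"].map(msubclass_map)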

Feature combination

Combining raw features often yields surprisingly good results, but with this many raw features it is impossible to try every combination, so Lasso is first used for feature selection and only the more important features are combined.

lasso = Lasso(alpha=0.001)
lasso.fit(X_scaled, y_log)
FI_lasso = pd.DataFrame({"Feature Importance": lasso.coef_}, index=data_pipe.columns)
FI_lasso[FI_lasso["Feature Importance"] != 0].sort_values("Feature Importance").plot(
    kind="barh", figsize=(15, 25))
plt.xticks(rotation=90)
plt.show()

These are the features that ended up being added (they also reflect the results of many other experiments):

class add_feature(BaseEstimator, TransformerMixin):
    def __init__(self, additional=1):
        self.additional = additional

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        if self.additional == 1:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]
        else:
            X["TotalHouse"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"]
            X["TotalArea"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"]

            X["+_TotalHouse_OverallQual"] = X["TotalHouse"] * X["OverallQual"]
            X["+_GrLivArea_OverallQual"] = X["GrLivArea"] * X["OverallQual"]
            X["+_oMSZoning_TotalHouse"] = X["oMSZoning"] * X["TotalHouse"]
            X["+_oMSZoning_OverallQual"] = X["oMSZoning"] + X["OverallQual"]
            X["+_oMSZoning_YearBuilt"] = X["oMSZoning"] + X["YearBuilt"]
            X["+_oNeighborhood_TotalHouse"] = X["oNeighborhood"] * X["TotalHouse"]
            X["+_oNeighborhood_OverallQual"] = X["oNeighborhood"] + X["OverallQual"]
            X["+_oNeighborhood_YearBuilt"] = X["oNeighborhood"] + X["YearBuilt"]
            X["+_BsmtFinSF1_OverallQual"] = X["BsmtFinSF1"] * X["OverallQual"]

            X["-_oFunctional_TotalHouse"] = X["oFunctional"] * X["TotalHouse"]
            X["-_oFunctional_OverallQual"] = X["oFunctional"] + X["OverallQual"]
            X["-_LotArea_OverallQual"] = X["LotArea"] * X["OverallQual"]
            X["-_TotalHouse_LotArea"] = X["TotalHouse"] + X["LotArea"]
            X["-_oCondition1_TotalHouse"] = X["oCondition1"] * X["TotalHouse"]
            X["-_oCondition1_OverallQual"] = X["oCondition1"] + X["OverallQual"]

            X["Bsmt"] = X["BsmtFinSF1"] + X["BsmtFinSF2"] + X["BsmtUnfSF"]
            X["Rooms"] = X["FullBath"] + X["TotRmsAbvGrd"]
            X["PorchArea"] = X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]
            X["TotalPlace"] = X["TotalBsmtSF"] + X["1stFlrSF"] + X["2ndFlrSF"] + X["GarageArea"] + \
                              X["OpenPorchSF"] + X["EnclosedPorch"] + X["3SsnPorch"] + X["ScreenPorch"]

        return X
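Because add_feature is written as a scikit-learn transformer (fit/transform), it can be chained with the rest of the preprocessing. The pipeline below is only a hedged sketch with placeholder steps, not the exact pipeline from the full code:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# illustrative: build the combined features on the DataFrame, then scale everything;
# any additional value other than 1 triggers the full set of combinations
pipe = Pipeline([
    ("add_feature", add_feature(additional=2)),
    ("scaler", RobustScaler()),
])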

PCA

PCA is an important step here and contributed a lot to the final score. The newly added features are highly correlated with the original ones, which can cause strong multicollinearity, and PCA is exactly the tool to remove that correlation. Since the goal is not dimensionality reduction, n_components is set close to the original number of dimensions; the final combination of how many features to add up front and how many dimensions to keep afterwards came out of a lot of experimentation.

pca = PCA(n_components=410)
X_scaled = pca.fit_transform(X_scaled)
test_X_scaled = pca.transform(test_X_scaled)
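A quick way to confirm that 410 components keep essentially all of the information, so that PCA only decorrelates the features rather than discarding variance, is to check the cumulative explained variance ratio:

# share of total variance retained by the chosen components; a value near 1.0 means nothing substantial was dropped
print(pca.explained_variance_ratio_.sum())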


Basic Modeling & Evaluation

First, define a cross-validated RMSE evaluation function:

def rmse_cv(model, X, y):
    rmse = np.sqrt(-cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

Thirteen algorithms are evaluated with 5-fold cross-validation to establish baselines (a sketch of the corresponding models list follows the algorithm list below):

  • LinearRegression
  • Ridge
  • Lasso
  • Random Forest
  • Gradient Boosting Tree
  • Support Vector Regression
  • Linear Support Vector Regression
  • ElasticNet
  • Stochastic Gradient Descent
  • BayesianRidge
  • KernelRidge
  • ExtraTreesRegressor
  • XgBoost

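The models list referenced in the loop below is not shown in the post; it might be assembled roughly like this (the hyperparameters here are placeholders, the actual baseline settings are in the full code):

from sklearn.linear_model import (LinearRegression, Ridge, Lasso, ElasticNet,
                                  SGDRegressor, BayesianRidge)
from sklearn.ensemble import (RandomForestRegressor, GradientBoostingRegressor,
                              ExtraTreesRegressor)
from sklearn.svm import SVR, LinearSVR
from sklearn.kernel_ridge import KernelRidge
from xgboost import XGBRegressor

# one estimator per abbreviation in the names list below, in the same order as the algorithms above
models = [LinearRegression(), Ridge(), Lasso(alpha=0.01, max_iter=10000),
          RandomForestRegressor(), GradientBoostingRegressor(), SVR(),
          LinearSVR(), ElasticNet(alpha=0.001, max_iter=10000), SGDRegressor(),
          BayesianRidge(), KernelRidge(), ExtraTreesRegressor(), XGBRegressor()]
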
names = ["LR", "Ridge", "Lasso", "RF", "GBR", "SVR", "LinSVR", "Ela", "SGD", "Bay", "Ker", "Extra", "Xgb"]
for name, model in zip(names, models):
    score = rmse_cv(model, X_scaled, y_log)
    print("{}: {:.6f}, {:.4f}".format(name, score.mean(), score.std()))

The results are below. Overall the tree models do worse than the linear models, probably because of the sparsity introduced by get_dummies; note that none of these models have been tuned yet.

LR:     1026870159.526766, 488528070.4534
Ridge:  0.117596, 0.0091
Lasso:  0.121474, 0.0060
RF:     0.140764, 0.0052
GBR:    0.124154, 0.0072
SVR:    0.112727, 0.0047
LinSVR: 0.121564, 0.0081
Ela:    0.111113, 0.0059
SGD:    0.159686, 0.0092
Bay:    0.110577, 0.0060
Ker:    0.109276, 0.0055
Extra:  0.136668, 0.0073
Xgb:    0.126614, 0.0070

Next, set up a helper for hyperparameter tuning. Keep in mind that the evaluation metric is RMSE, so the printed scores should be RMSE as well.

class grid():
    def __init__(self, model):
        self.model = model

    def grid_get(self, X, y, param_grid):
        grid_search = GridSearchCV(self.model, param_grid, cv=5, scoring="neg_mean_squared_error")
        grid_search.fit(X, y)
        print(grid_search.best_params_, np.sqrt(-grid_search.best_score_))
        # convert the negative MSE scores back to RMSE before printing
        grid_search.cv_results_["mean_test_score"] = np.sqrt(-grid_search.cv_results_["mean_test_score"])
        print(pd.DataFrame(grid_search.cv_results_)[["params", "mean_test_score", "std_test_score"]])

An example of tuning Lasso:

grid(Lasso()).grid_get(X_scaled, y_log,
                       {"alpha": [0.0004, 0.0005, 0.0007, 0.0006, 0.0009, 0.0008],
                        "max_iter": [10000]})

{'max_iter': 10000, 'alpha': 0.0005} 0.111296607965
                                  params  mean_test_score  std_test_score
0   {'max_iter': 10000, 'alpha': 0.0003}         0.111869        0.001513
1   {'max_iter': 10000, 'alpha': 0.0002}         0.112745        0.001753
2   {'max_iter': 10000, 'alpha': 0.0004}         0.111463        0.001392
3   {'max_iter': 10000, 'alpha': 0.0005}         0.111297        0.001339
4   {'max_iter': 10000, 'alpha': 0.0007}         0.111538        0.001284
5   {'max_iter': 10000, 'alpha': 0.0006}         0.111359        0.001315
6   {'max_iter': 10000, 'alpha': 0.0009}         0.111915        0.001206
7   {'max_iter': 10000, 'alpha': 0.0008}         0.111706        0.001229

After many long rounds of testing, these six models were finally chosen:

lasso = Lasso(alpha=0.0005, max_iter=10000)
ridge = Ridge(alpha=60)
svr = SVR(gamma=0.0004, kernel="rbf", C=13, epsilon=0.009)
ker = KernelRidge(alpha=0.2, kernel="polynomial", degree=3, coef0=0.8)
ela = ElasticNet(alpha=0.005, l1_ratio=0.08, max_iter=10000)
bay = BayesianRidge()


Ensemble Methods

Weighted average

Average the models' predictions according to a set of weights:

class AverageWeight(BaseEstimator, RegressorMixin):
    def __init__(self, mod, weight):
        self.mod = mod
        self.weight = weight

    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.mod]
        for model in self.models_:
            model.fit(X, y)
        return self

    def predict(self, X):
        w = list()
        pred = np.array([model.predict(X) for model in self.models_])
        # for every data point, multiply each single-model prediction by its weight, then sum them up
        for data in range(pred.shape[1]):
            single = [pred[model, data] * weight for model, weight in zip(range(pred.shape[0]), self.weight)]
            w.append(np.sum(single))
        return w

weight_avg = AverageWeight(mod=[lasso, ridge, svr, ker, ela, bay],
                           weight=[w1, w2, w3, w4, w5, w6])  # w1..w6: the chosen weights (see the full code)
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10768459878025885

The score is 0.10768, better than any single model.

However, using only the two models SVR and Kernel Ridge gives an even better result, so the other models were apparently dragging it down.

weight_avg = AverageWeight(mod=[svr, ker], weight=[0.5, 0.5])
score = rmse_cv(weight_avg, X_scaled, y_log)
print(score.mean())  # 0.10668349587195189

Stacking

Stacking works as follows (the original post illustrates it with a diagram):

With a two-layer stacking like the one in the diagram, the first layer has 5 models and the second layer has 1 meta-model. The first-layer models are trained to produce a feature matrix in \mathbb{R}^{n \times m} that is fed into the second-layer model, where n is the number of training rows and m is the number of first-layer models.

class stacking(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, mod, meta_model):
        self.mod = mod
        self.meta_model = meta_model
        self.kf = KFold(n_splits=5, random_state=42, shuffle=True)

    def fit(self, X, y):
        self.saved_model = [list() for i in self.mod]
        oof_train = np.zeros((X.shape[0], len(self.mod)))

        # out-of-fold predictions of every first-layer model become the meta-model's training features
        for i, model in enumerate(self.mod):
            for train_index, val_index in self.kf.split(X, y):
                renew_model = clone(model)
                renew_model.fit(X[train_index], y[train_index])
                self.saved_model[i].append(renew_model)
                oof_train[val_index, i] = renew_model.predict(X[val_index])

        self.meta_model.fit(oof_train, y)
        return self

    def predict(self, X):
        # average the per-fold models of each first-layer algorithm, then feed the result to the meta-model
        whole_test = np.column_stack([np.column_stack([model.predict(X) for model in single_model]).mean(axis=1)
                                      for single_model in self.saved_model])
        return self.meta_model.predict(whole_test)

    def get_oof(self, X, y, test_X):
        oof = np.zeros((X.shape[0], len(self.mod)))
        test_single = np.zeros((test_X.shape[0], 5))
        test_mean = np.zeros((test_X.shape[0], len(self.mod)))
        for i, model in enumerate(self.mod):
            for j, (train_index, val_index) in enumerate(self.kf.split(X, y)):
                clone_model = clone(model)
                clone_model.fit(X[train_index], y[train_index])
                oof[val_index, i] = clone_model.predict(X[val_index])
                test_single[:, j] = clone_model.predict(test_X)
            test_mean[:, i] = test_single.mean(axis=1)
        return oof, test_mean

At first I used get_oof to extract the first-layer feature matrix and concatenated it with the original features; the CV score dropped to 0.1018, but the leaderboard score actually got worse, so this approach seems to overfit.

# a, b: the training feature matrix and log-price target arrays (defined in the full code)
X_train_stack, X_test_stack = stack_model.get_oof(a, b, test_X_scaled)
X_train_add = np.hstack((a, X_train_stack))
X_test_add = np.hstack((test_X_scaled, X_test_stack))
print(rmse_cv(stack_model, X_train_add, b).mean())  # 0.101824682747

For the final submission I used Lasso, Ridge, SVR, Kernel Ridge, ElasticNet, and BayesianRidge as the first-layer models and Kernel Ridge as the second-layer meta-model.

stack_model = stacking(mod=[lasso, ridge, svr, ker, ela, bay], meta_model=ker)
stack_model.fit(a, b)
pred = np.exp(stack_model.predict(test_X_scaled))
result = pd.DataFrame({"Id": test.Id, "SalePrice": pred})
result.to_csv("submission.csv", index=False)


