Kaggle Advanced Series: Feature Extraction and Model Ensembling for the Zillow Competition (LB ~0.644)

02-02

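This article presents a solution to the Zillow home-value prediction competition on Kaggle. It focuses on feature engineering and model ensembling; interested readers can dig deeper into both stages. Good luck with your scores.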

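1. Introduction to Zillow

Zillow is the largest online real-estate marketplace in the United States. The Zestimate home-valuation model is one of Zillow's core competitive advantages; its median margin of error has improved from 14% eleven years ago to 5% this year.

Participants build new models to help Zillow improve the accuracy of the Zestimate.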

The competition has two phases:

Phase 1: 2017-5-24 to 2018-1-17
Phase 2: 2018-2-1 to 2019-1-15

Competition rules:

Up to 5 submissions per day
External data is not allowed

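Kaggle already hosts many good Kernels on data exploration, so this article concentrates on feature engineering and model ensembling.

Note: this article uses Python. You will need NumPy, Pandas, Matplotlib and scikit-learn, along with the currently very popular XGBoost and Microsoft's LightGBM.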

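2. Feature Engineering

Feature engineering here has two parts: transforming existing features and adding new ones.

We can use pandas' built-in info() method to get a rough look at the data types: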
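As a minimal sketch of what that inspection looks like, the snippet below calls info() on a tiny hand-made frame. The column names are borrowed from the competition data, but the values are invented for illustration.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for properties_2016.csv; the values are made up.
toy = pd.DataFrame({
    "parcelid": [1, 2, 3],
    "bathroomcnt": [2.0, np.nan, 1.5],
    "propertyzoningdesc": ["LAR1", None, "LCR1"],
})

# info() prints to stdout by default; capture it in a buffer so it can be reused.
buf = io.StringIO()
toy.info(buf=buf)
summary = buf.getvalue()
print(summary)

# Non-numeric columns are the ones that will need encoding later.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_cols)
```

The summary lists each column with its non-null count and dtype, which is usually enough to decide which columns need encoding or imputation.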

I will paste my code here directly and explain my approach with reference to it.

import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def load_data():
    train = pd.read_csv('../input/train_2016_v2.csv')
    properties = pd.read_csv('../input/properties_2016.csv')
    sample = pd.read_csv('../input/sample_submission.csv')

    print("Preprocessing...")
    # Downcast float64 columns to float32 to save memory.
    for c, dtype in zip(properties.columns, properties.dtypes):
        if dtype == np.float64:
            properties[c] = properties[c].astype(np.float32)

    print("Set train/test data...")
    id_feature = ['heatingorsystemtypeid', 'propertylandusetypeid', 'storytypeid', 'airconditioningtypeid',
                  'architecturalstyletypeid', 'buildingclasstypeid', 'buildingqualitytypeid', 'typeconstructiontypeid']
    for c in properties.columns:
        # Missing values become -1 in every column.
        properties[c] = properties[c].fillna(-1)
        if properties[c].dtype == object:
            # Scheme 1: label-encode text columns.
            lbl = LabelEncoder()
            lbl.fit(list(properties[c].values))
            properties[c] = lbl.transform(list(properties[c].values))
        if c in id_feature:
            # Scheme 2: label-encode ID columns, then append one-hot columns.
            lbl = LabelEncoder()
            lbl.fit(list(properties[c].values))
            properties[c] = lbl.transform(list(properties[c].values))
            dum_df = pd.get_dummies(properties[c])
            dum_df = dum_df.rename(columns=lambda x: c + str(x))
            properties = pd.concat([properties, dum_df], axis=1)

    #
    # Add features
    #
    # Error in the calculation of the finished living area of the home
    properties['N-LivingAreaError'] = properties['calculatedfinishedsquarefeet'] / properties[
        'finishedsquarefeet12']
    # Proportion of living area
    properties['N-LivingAreaProp'] = properties['calculatedfinishedsquarefeet'] / properties[
        'lotsizesquarefeet']
    properties['N-LivingAreaProp2'] = properties['finishedsquarefeet12'] / properties[
        'finishedsquarefeet15']
    # Total number of rooms
    properties['N-TotalRooms'] = properties['bathroomcnt'] + properties['bedroomcnt']
    # Average room size
    # properties['N-AvRoomSize'] = properties['calculatedfinishedsquarefeet'] / properties['roomcnt']

    properties['N-location-2'] = properties['latitude'] * properties['longitude']

    # Ratio of assessed property value to tax amount
    properties['N-ValueRatio'] = properties['taxvaluedollarcnt'] / properties['taxamount']

    # Total tax score
    properties['N-TaxScore'] = properties['taxvaluedollarcnt'] * properties['taxamount']

    #
    # Make train and test dataframes
    #
    train = train.merge(properties, on='parcelid', how='left')
    sample['parcelid'] = sample['ParcelId']
    test = sample.merge(properties, on='parcelid', how='left')

    # Drop outliers
    train = train[train.logerror > -0.4]
    train = train[train.logerror < 0.42]

    train['transactiondate'] = pd.to_datetime(train['transactiondate'])
    train['Month'] = train['transactiondate'].dt.month

    x_train = train.drop(['parcelid', 'logerror', 'transactiondate', 'propertyzoningdesc',
                          'propertycountylandusecode'], axis=1)
    y_train = train['logerror'].values
    test['Month'] = 10
    x_test = test[x_train.columns]
    del test, train
    print(x_train.shape, y_train.shape, x_test.shape)

    return x_train, y_train, x_test

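I defined a load_data function and put all of the feature-engineering code inside it.

Part 1: Feature transformation

(1) Reading the data

Three files need to be read: train_2016_v2.csv, properties_2016.csv and sample_submission.csv. Each can be loaded into a DataFrame with pandas' read_csv().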

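File descriptions

properties_2016.csv: all of the home features for 2016.

train_2016_v2.csv: the training set, covering January 2016 to December 2016.

sample_submission.csv: an example of a correctly formatted submission file.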
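A minimal sketch of the read-and-join step, using two small in-memory CSVs in place of the real files so that it runs anywhere. The rows are invented; only the column names come from the competition data.

```python
import io

import numpy as np
import pandas as pd

# In-memory stand-ins for train_2016_v2.csv and properties_2016.csv (made-up rows).
train_csv = io.StringIO(
    "parcelid,logerror,transactiondate\n"
    "10,0.02,2016-01-05\n"
    "11,-0.01,2016-03-17\n"
)
props_csv = io.StringIO(
    "parcelid,bathroomcnt,bedroomcnt\n"
    "10,2.0,3.0\n"
    "11,1.0,2.0\n"
    "12,3.0,4.0\n"
)

train = pd.read_csv(train_csv, parse_dates=["transactiondate"])
props = pd.read_csv(props_csv)

# Downcast float64 columns to float32 to roughly halve their memory use.
float_cols = props.select_dtypes(include="float64").columns
props = props.astype({c: np.float32 for c in float_cols})

# A left join keeps every transaction and attaches that parcel's features.
merged = train.merge(props, on="parcelid", how="left")
print(merged.dtypes)
```

The left join matters because the properties file covers far more parcels than were actually sold; only sold parcels carry a logerror label.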

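(2) Type conversion

Most of our models only accept numeric data, so non-numeric columns have to be converted to numeric ones. I used two schemes here:

Scheme 1: label-encode the columns of type 'object'.

Scheme 2: label-encode the ID-type columns first, then one-hot encode them.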

The meaning of each field is described in the data dictionary file provided with the Zillow competition, zillow_data_dictionary.xlsx.
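The two schemes can be sketched on a toy frame as follows. The values are invented, and the use of fit_transform and of the prefix argument are choices made for this illustration rather than what load_data does line for line.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame: one free-text column and one ID-type column (values are made up).
df = pd.DataFrame({
    "propertyzoningdesc": ["LAR1", "LCR1", "LAR1", "missing"],
    "heatingorsystemtypeid": [2, 7, 2, -1],
})

# Scheme 1: label-encode the text column into integer codes.
# LabelEncoder assigns codes in sorted order of the distinct values.
zoning_encoder = LabelEncoder()
df["propertyzoningdesc"] = zoning_encoder.fit_transform(df["propertyzoningdesc"])

# Scheme 2: label-encode the ID column, then expand the codes into one-hot columns.
heating_encoder = LabelEncoder()
codes = pd.Series(
    heating_encoder.fit_transform(df["heatingorsystemtypeid"]), index=df.index
)
one_hot = pd.get_dummies(codes, prefix="heatingorsystemtypeid")
df = pd.concat([df, one_hot], axis=1)
print(df)
```

Label encoding alone implies an ordering between categories; the extra one-hot columns let a model treat each ID as an independent category.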

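Part 2: Constructing new features

I built several new features by adding, multiplying and dividing the original ones, which helped the score to some extent.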
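A small sketch of the three kinds of arithmetic feature, on invented rows. Replacing infinite ratios with NaN is an illustrative choice added here; the load_data function above does not do this.

```python
import numpy as np
import pandas as pd

# Toy rows; the column names follow the competition data, the values are made up.
props = pd.DataFrame({
    "calculatedfinishedsquarefeet": [1500.0, 2000.0, 900.0],
    "lotsizesquarefeet": [6000.0, 0.0, 4500.0],
    "bathroomcnt": [2.0, 3.0, 1.0],
    "bedroomcnt": [3.0, 4.0, 2.0],
    "taxvaluedollarcnt": [300000.0, 450000.0, 200000.0],
    "taxamount": [3600.0, 5400.0, 2500.0],
})

# Division: share of the lot covered by living area.
# A zero denominator yields inf, which is replaced with NaN here.
ratio = props["calculatedfinishedsquarefeet"] / props["lotsizesquarefeet"]
props["N-LivingAreaProp"] = ratio.replace([np.inf, -np.inf], np.nan)

# Addition: total number of bathrooms and bedrooms.
props["N-TotalRooms"] = props["bathroomcnt"] + props["bedroomcnt"]

# Multiplication: product of assessed value and tax amount.
props["N-TaxScore"] = props["taxvaluedollarcnt"] * props["taxamount"]

print(props[["N-LivingAreaProp", "N-TotalRooms", "N-TaxScore"]])
```

Tree-based models cannot easily represent ratios of two columns on their own, which is one plausible reason such hand-built features help.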

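That essentially completes my feature engineering; the final set has roughly 130 features. Feature engineering is the most important step in a Kaggle competition and takes a great deal of time and experimentation. I spent close to 80% of my effort on it.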

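3. Model Ensembling

Feature engineering sets the upper bound on what machine learning can achieve, and the models are how we keep approaching that bound.

In a Kaggle competition, if we have no other good ideas, model ensembling is a very good option.

There are many excellent articles online that explain model ensembling. The theory is not the focus of this article, so here are a few well-written references.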

Kaggle Ensembling Guide
English: https://mlwave.com/kaggle-ensembling-guide/
Chinese translation: http://blog.csdn.net/a358463121/article/details/53054686
Notes on Model Ensembling (Stacking) for Kaggle Machine Learning
Link: https://zhuanlan.zhihu.com/p/26890738

There are many ensembling methods; I used stacking. The articles above explain the theory clearly, so I will go straight to my code.

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.ensemble import AdaBoostRegressor, ExtraTreesRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor


class Ensemble(object):
    def __init__(self, n_splits, stacker, base_models):
        self.n_splits = n_splits
        self.stacker = stacker
        self.base_models = base_models

    def fit_predict(self, X, y, T):
        X = np.array(X)
        y = np.array(y)
        T = np.array(T)

        folds = list(KFold(n_splits=self.n_splits, shuffle=True, random_state=2016).split(X, y))

        # One column of level-1 predictions per base model.
        S_train = np.zeros((X.shape[0], len(self.base_models)))
        S_test = np.zeros((T.shape[0], len(self.base_models)))
        for i, clf in enumerate(self.base_models):

            S_test_i = np.zeros((T.shape[0], self.n_splits))

            for j, (train_idx, test_idx) in enumerate(folds):
                X_train = X[train_idx]
                y_train = y[train_idx]
                X_holdout = X[test_idx]
                print("Fit Model %d fold %d" % (i, j))
                clf.fit(X_train, y_train)
                y_pred = clf.predict(X_holdout)[:]

                # Out-of-fold predictions become the stacker's training data.
                S_train[test_idx, i] = y_pred
                S_test_i[:, j] = clf.predict(T)[:]
            # Average this model's test predictions over the folds.
            S_test[:, i] = S_test_i.mean(axis=1)

        # results = cross_val_score(self.stacker, S_train, y, cv=5, scoring='r2')
        # print("Stacker score: %.4f (%.4f)" % (results.mean(), results.std()))

        self.stacker.fit(S_train, y)
        res = self.stacker.predict(S_test)[:]
        return res


# rf params
rf_params = {}
rf_params['n_estimators'] = 50
rf_params['max_depth'] = 8
rf_params['min_samples_split'] = 100
rf_params['min_samples_leaf'] = 30

# xgb params
xgb_params = {}
xgb_params['n_estimators'] = 50
xgb_params['min_child_weight'] = 12
xgb_params['learning_rate'] = 0.27
xgb_params['max_depth'] = 6
xgb_params['subsample'] = 0.77
xgb_params['reg_lambda'] = 0.8
xgb_params['reg_alpha'] = 0.4
xgb_params['base_score'] = 0
# xgb_params['seed'] = 400
xgb_params['silent'] = 1  # older XGBoost flag; newer releases use 'verbosity' instead

# lgb params
lgb_params = {}
lgb_params['n_estimators'] = 50
lgb_params['max_bin'] = 10
lgb_params['learning_rate'] = 0.321  # shrinkage_rate
lgb_params['metric'] = 'l1'  # or 'mae'
lgb_params['sub_feature'] = 0.34
lgb_params['bagging_fraction'] = 0.85  # sub_row
lgb_params['bagging_freq'] = 40
lgb_params['num_leaves'] = 512  # num_leaf
lgb_params['min_data'] = 500  # min_data_in_leaf
lgb_params['min_hessian'] = 0.05  # min_sum_hessian_in_leaf
lgb_params['verbose'] = 0
lgb_params['feature_fraction_seed'] = 2
lgb_params['bagging_seed'] = 3

# XGB model
xgb_model = XGBRegressor(**xgb_params)

# LGB model
lgb_model = LGBMRegressor(**lgb_params)

# RF model
rf_model = RandomForestRegressor(**rf_params)

# ET model
et_model = ExtraTreesRegressor()

# SVR model
# SVM is too slow on more than 10000 samples
# svr_model = SVR(kernel='rbf', C=1.0, epsilon=0.05)

# DecisionTree model
dt_model = DecisionTreeRegressor()

# AdaBoost model
ada_model = AdaBoostRegressor()

stack = Ensemble(n_splits=5,
                 stacker=LinearRegression(),
                 base_models=(rf_model, xgb_model, lgb_model, et_model, ada_model, dt_model))

x_train, y_train, x_test = load_data()
y_test = stack.fit_predict(x_train, y_train, x_test)

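The code looks long, but it is actually very simple.

Core idea: I used a two-level ensemble. Level 1 uses six models: XGBoost, LightGBM, RandomForest, ExtraTrees, DecisionTree and AdaBoost. Level 2 uses LinearRegression to fit the outputs of the first level.

Implementation:

Define an Ensemble object. We hand it the models and the data, and it returns the predictions. Space is limited, so the details of how the algorithm is implemented are left for another time.
Set the model parameters and construct the models.
Pass the models and the data to the Ensemble object to obtain the predictions.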
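The out-of-fold mechanism behind these steps can be sketched on synthetic data as follows. The data, the two base models (Ridge and a shallow decision tree) and their settings are assumptions chosen to keep the example small; they are not the six models used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data standing in for the real features (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
T = rng.normal(size=(50, 5))  # plays the role of the test set

base_models = [Ridge(alpha=1.0), DecisionTreeRegressor(max_depth=4, random_state=0)]
n_splits = 5
folds = list(KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X))

# Level-1 outputs: one column per base model.
S_train = np.zeros((X.shape[0], len(base_models)))
S_test = np.zeros((T.shape[0], len(base_models)))

for i, model in enumerate(base_models):
    fold_test_preds = np.zeros((T.shape[0], n_splits))
    for j, (fit_idx, holdout_idx) in enumerate(folds):
        model.fit(X[fit_idx], y[fit_idx])
        # Each training row is predicted by a model that never saw it.
        S_train[holdout_idx, i] = model.predict(X[holdout_idx])
        fold_test_preds[:, j] = model.predict(T)
    # Average the per-fold test predictions for this base model.
    S_test[:, i] = fold_test_preds.mean(axis=1)

# Level 2: a linear model learns how to weight the base models.
stacker = LinearRegression().fit(S_train, y)
final_pred = stacker.predict(S_test)
print(final_pred.shape, stacker.coef_)
```

Training the second level only on out-of-fold predictions keeps it from simply rewarding whichever base model overfits the training rows most.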

The last step is to submit our results.
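A minimal sketch of building the submission file, assuming the sample file has a ParcelId column followed by one column per prediction period and that the same prediction is written into every period. The period column names and the numbers below are placeholders for illustration.

```python
import numpy as np
import pandas as pd

# Toy stand-in for sample_submission.csv. The period column names used here
# are assumptions for illustration.
sample = pd.DataFrame({
    "ParcelId": [10, 11, 12],
    "201610": 0.0,
    "201611": 0.0,
    "201612": 0.0,
})

# Stand-in for the array returned by stack.fit_predict(...).
y_test = np.array([0.0123, -0.0045, 0.0310])

# Write the same prediction into every period column.
for col in sample.columns:
    if col != "ParcelId":
        sample[col] = y_test

# With no path argument, to_csv returns the CSV text instead of writing a file.
submission_csv = sample.to_csv(index=False, float_format="%.4f")
print(submission_csv)
```

In practice the frame would come from reading sample_submission.csv, and to_csv would be given an output path. Rounding to four decimals keeps the file size manageable.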

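Closing remarks

At the time of writing, I am ranked around 400th in the Zillow competition (top 13%). I am a beginner and this is my first competition, so feel free to get in touch.

The code I used is public on Kaggle.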

https://www.kaggle.com/wangsg/ensemble-stacking-lb-0-644

WeChat official account: kaggle數據分析