Kaggle入門:House Prices Prediction
1、Data Description
Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence.
With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, the task is to predict the final price of each home.
Dataset link:
House Prices: Advanced Regression Techniques
2、Feature Exploration
```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm
from sklearn.preprocessing import StandardScaler
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

df_train = pd.read_csv('../train.csv')
df_test = pd.read_csv('../test.csv')
test_ID = df_test['Id']

# Now drop the 'Id' column since it's unnecessary for the prediction process.
df_train.drop("Id", axis=1, inplace=True)
df_test.drop("Id", axis=1, inplace=True)

df_train.columns
```

```
Index(['MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street', 'Alley',
       'LotShape', 'LandContour', 'Utilities', 'LotConfig', 'LandSlope',
       'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle',
       'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'RoofStyle',
       'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'MasVnrArea',
       'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual', 'BsmtCond',
       'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1', 'BsmtFinType2',
       'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating', 'HeatingQC',
       'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF', 'LowQualFinSF',
       'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath', 'HalfBath',
       'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual', 'TotRmsAbvGrd',
       'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType', 'GarageYrBlt',
       'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual', 'GarageCond',
       'PavedDrive', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch',
       'ScreenPorch', 'PoolArea', 'PoolQC', 'Fence', 'MiscFeature', 'MiscVal',
       'MoSold', 'YrSold', 'SaleType', 'SaleCondition', 'SalePrice'],
      dtype='object')
```
1. Analysing SalePrice
```python
df_train['SalePrice'].describe()
```

```
count      1460.000000
mean     180921.195890
std       79442.502883
min       34900.000000
25%      129975.000000
50%      163000.000000
75%      214000.000000
max      755000.000000
Name: SalePrice, dtype: float64
```

```python
# histogram
sns.distplot(df_train['SalePrice'])
```
The SalePrice distribution deviates from the normal distribution, has appreciable positive skewness, and shows peakedness.
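To quantify these observations, a quick sketch (not in the original notebook) is to print the skewness and kurtosis of the raw SalePrice column:

```python
# Skewness and kurtosis of the raw SalePrice; positive skew and high kurtosis are expected
print("Skewness: {:.2f}".format(df_train['SalePrice'].skew()))
print("Kurtosis: {:.2f}".format(df_train['SalePrice'].kurt()))
```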
2. Relationship with numerical variables

```python
# scatter plot GrLivArea / SalePrice
var = 'GrLivArea'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
```
SalePrice and GrLivArea have a positive linear relationship.
```python
# scatter plot TotalBsmtSF / SalePrice
var = 'TotalBsmtSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
```
SalePrice and TotalBsmtSF have a positive linear relationship.
Moreover, it's clear that houses with no basement (TotalBsmtSF = 0) contribute nothing to SalePrice through this variable.

```python
# scatter plot LotFrontage / SalePrice
var = 'LotFrontage'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
```
SalePrice and LotFrontage have a positive linear relationship.
```python
# scatter plot 1stFlrSF / SalePrice
var = '1stFlrSF'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
data.plot.scatter(x=var, y='SalePrice', ylim=(0, 800000))
```
SalePrice and 1stFlrSF have a positive linear relationship.
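As a quick numeric check of these scatter-plot observations, one possible sketch is to print the Pearson correlation of each of these variables with SalePrice (using the df_train loaded above):

```python
# Pearson correlation between SalePrice and the numerical variables inspected above
num_vars = ['GrLivArea', 'TotalBsmtSF', 'LotFrontage', '1stFlrSF']
print(df_train[num_vars + ['SalePrice']].corr()['SalePrice'].sort_values(ascending=False))
```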
3. Relationship with categorical features
```python
# boxplot OverallQual / SalePrice
var = 'OverallQual'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", data=data, palette=sns.color_palette("hls", 8))
fig.axis(ymin=0, ymax=800000)
```
SalePrice and OverallQual have a strong positive relationship.
```python
# boxplot YearBuilt / SalePrice
var = 'YearBuilt'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(16, 8))
fig = sns.boxplot(x=var, y="SalePrice", data=data, palette=sns.color_palette("hls", 8))
fig.axis(ymin=0, ymax=800000)
plt.xticks(rotation=90)
```
Although it is not a strong tendency, buyers are more prone to spend more money on recently built houses than on older ones.
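One way to double-check this tendency (a sketch, not part of the original notebook) is to look at the median SalePrice by decade of construction:

```python
# Median sale price grouped by decade of construction; recent decades should show higher medians
decade_built = (df_train['YearBuilt'] // 10) * 10
print(df_train.groupby(decade_built)['SalePrice'].median())
```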
```python
# boxplot Foundation / SalePrice
var = 'Foundation'
data = pd.concat([df_train['SalePrice'], df_train[var]], axis=1)
f, ax = plt.subplots(figsize=(8, 6))
fig = sns.boxplot(x=var, y="SalePrice", order=["PConc", "Stone", "Wood", "CBlock", "BrkTil", "Slab"],
                  data=data, palette=sns.color_palette("hls", 8))
fig.axis(ymin=0, ymax=800000)
```
SalePrice also varies noticeably with the Foundation type.
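The ordering used in the boxplot (from PConc down to Slab) can be checked numerically; a small sketch:

```python
# Median sale price per foundation type, sorted from most to least expensive
print(df_train.groupby('Foundation')['SalePrice'].median().sort_values(ascending=False))
```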
4. Correlation Matrix (heatmap style)
```python
# correlation matrix
corrmat = df_train.corr()
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, vmax=.8, square=True)
```
```python
# SalePrice correlation matrix
k = 10  # number of variables for the heatmap
cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index
cm = np.corrcoef(df_train[cols].values.T)
sns.set(font_scale=1.25)
hm = sns.heatmap(cm, cbar=True, annot=True, square=True, fmt='.2f',
                 annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
```
According to our crystal ball, these are the variables most correlated with SalePrice.
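Besides the heatmap, we can also list the top correlation coefficients directly (a sketch using the corrmat computed above):

```python
# Ten variables with the highest correlation to SalePrice
print(corrmat['SalePrice'].sort_values(ascending=False).head(10))
```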
OverallQual, GrLivArea and TotalBsmtSF are strongly correlated with SalePrice.
GarageCars and GarageArea are also among the most strongly correlated variables. However, the number of cars that fit into the garage is a consequence of the garage area: GarageCars and GarageArea are like twin brothers, and you'll never be able to distinguish them. Therefore, we only need one of these variables in our analysis (we can keep GarageCars, since its correlation with SalePrice is higher). TotalBsmtSF and 1stFlrSF also look like twins, so we keep TotalBsmtSF; the same goes for TotRmsAbvGrd and GrLivArea. Finally, YearBuilt is slightly correlated with SalePrice, which suggests a bit of time-series thinking might be needed to get it exactly right.

3、Data Preprocessing
1. Log-transformation of the target variable
```python
# Get the SalePrice QQ-plot
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()
```
The target variable is right-skewed. As (linear) models love normally distributed data, we need to transform this variable to make it more normally distributed.
```python
# We use the numpy function log1p which applies log(1+x) to all elements of the column
df_train["SalePrice"] = np.log1p(df_train["SalePrice"])

# Check the new distribution
sns.distplot(df_train['SalePrice'], fit=norm)

# Get the fitted parameters used by the function
(mu, sigma) = norm.fit(df_train['SalePrice'])
print('\n mu = {:.2f} and sigma = {:.2f}\n'.format(mu, sigma))

# Now plot the distribution
plt.legend(['Normal dist. ($\mu=$ {:.2f} and $\sigma=$ {:.2f} )'.format(mu, sigma)], loc='best')
plt.ylabel('Frequency')
plt.title('SalePrice distribution')

# Get also the QQ-plot
fig = plt.figure()
res = stats.probplot(df_train['SalePrice'], plot=plt)
plt.show()
```

```
mu = 12.02 and sigma = 0.40
```
2. Missing Values
```python
ntrain = df_train.shape[0]
ntest = df_test.shape[0]
y_train = df_train.SalePrice.values
all_data = pd.concat((df_train, df_test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print("all_data size is : {}".format(all_data.shape))

# Percentage of missing values per feature
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
missing_data.head(20)

f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)

# For these features, NA means the house does not have that item
all_data["PoolQC"] = all_data["PoolQC"].fillna("None")
all_data["MiscFeature"] = all_data["MiscFeature"].fillna("None")
all_data["Alley"] = all_data["Alley"].fillna("None")
all_data["Fence"] = all_data["Fence"].fillna("None")
all_data["FireplaceQu"] = all_data["FireplaceQu"].fillna("None")

# LotFrontage: fill with the median LotFrontage of the neighborhood
all_data["LotFrontage"] = all_data.groupby("Neighborhood")["LotFrontage"].transform(
    lambda x: x.fillna(x.median()))

# Garage: categorical -> "None", numerical -> 0
for col in ('GarageType', 'GarageFinish', 'GarageQual', 'GarageCond'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)

# Basement: numerical -> 0, categorical -> "None"
for col in ('BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFullBath', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtQual', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2'):
    all_data[col] = all_data[col].fillna('None')

all_data["MasVnrType"] = all_data["MasVnrType"].fillna("None")
all_data["MasVnrArea"] = all_data["MasVnrArea"].fillna(0)

# Fill the remaining features with the most frequent value (mode), or a sensible default
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
all_data = all_data.drop(['Utilities'], axis=1)
all_data["Functional"] = all_data["Functional"].fillna("Typ")
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
all_data['MSSubClass'] = all_data['MSSubClass'].fillna("None")

# Check remaining missing values, if any
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
print("remaining missing value:", missing_data)
```

```
all_data size is : (2919, 79)
remaining missing value: Empty DataFrame
Columns: [Missing Ratio]
Index: []
```
The missing values have all been handled.
3. Transforming some numerical variables that are really categorical
```python
# MSSubClass = the building class
all_data['MSSubClass'] = all_data['MSSubClass'].apply(str)

# Changing OverallCond into a categorical variable
all_data['OverallCond'] = all_data['OverallCond'].astype(str)

# Year and month sold are transformed into categorical features
all_data['YrSold'] = all_data['YrSold'].astype(str)
all_data['MoSold'] = all_data['MoSold'].astype(str)
```
4. Label Encoding some categorical variables that may contain information in their ordering
```python
from sklearn.preprocessing import LabelEncoder

cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual',
        'BsmtFinType1', 'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure',
        'GarageFinish', 'LandSlope', 'LotShape', 'PavedDrive', 'Street',
        'Alley', 'CentralAir', 'MSSubClass', 'OverallCond', 'YrSold', 'MoSold')

# process columns, apply LabelEncoder to categorical features
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))

# shape
print('Shape all_data: {}'.format(all_data.shape))
```

```
Shape all_data: (2919, 78)
```
5. Adding one more important feature
Since area-related features are very important for determining house prices, we add one more feature: the total area of the basement, first and second floors of each house.
```python
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
```
6. Skewed features
```python
from scipy.stats import skew

numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index

# Check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew': skewed_feats})
skewness.head(10)
```
7. Box Cox Transformation of (highly) skewed features
We use the scipy function boxcox1p, which computes the Box-Cox transformation of 1+x.
Note that setting λ=0 is equivalent to the log1p transform used above for the target variable.
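As a quick sanity check of that statement, the following sketch compares boxcox1p with λ=0 against np.log1p on a few sample values:

```python
from scipy.special import boxcox1p
import numpy as np

x = np.array([0.0, 0.5, 10.0, 1000.0])
# boxcox1p(x, 0) computes log(1 + x), i.e. exactly the log1p transform used for SalePrice
print(np.allclose(boxcox1p(x, 0), np.log1p(x)))  # expected: True
```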
```python
skewness = skewness[abs(skewness) > 0.75]
print("There are {} skewed numerical features to Box Cox transform".format(skewness.shape[0]))

from scipy.special import boxcox1p
skewed_features = skewness.index
lam = 0.15
for feat in skewed_features:
    all_data[feat] = boxcox1p(all_data[feat], lam)
```

```
There are 59 skewed numerical features to Box Cox transform
```
8. Getting dummy categorical features
```python
all_data = pd.get_dummies(all_data)
print(all_data.shape)
```

```
(2919, 221)
```
9. Getting the new train and test sets.
```python
train = all_data[:ntrain]
test = all_data[ntrain:]
```
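An optional sanity check (not in the original notebook) to confirm the split matches the original row counts:

```python
# train should have ntrain (1460) rows and test ntest (1459) rows,
# both with the full set of dummy-encoded columns printed above
print(train.shape, test.shape, y_train.shape)
```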
4、Modelling
```python
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
```
(1) Define a cross validation strategy
We use the cross_val_score function of sklearn.
```python
# Validation function
n_folds = 5

def rmsle_cv(model):
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring="neg_mean_squared_error", cv=kf))
    return(rmse)
```
(2) Base models
LASSO Regression:

```python
lasso = make_pipeline(RobustScaler(), Lasso(alpha=0.0005, random_state=1))
```
Elastic Net Regression:

```python
ENet = make_pipeline(RobustScaler(), ElasticNet(alpha=0.0005, l1_ratio=.9, random_state=3))
```
Kernel Ridge Regression:

```python
KRR = KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
```
Gradient Boosting Regression:

```python
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
```
(3) Base models scores
```python
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

```
Lasso score: 0.1240 (0.0165)
ElasticNet score: 0.1240 (0.0165)
Kernel Ridge score: 0.1262 (0.0127)
Gradient Boosting score: 0.1248 (0.0130)
```
(4) Stacking models
```python
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # Train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # Now we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([
            model.predict(X) for model in self.models_
        ])
        return np.mean(predictions, axis=1)
```
Score:

```python
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score = rmsle_cv(averaged_models)
print(" Averaged base models score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
```

```
Averaged base models score: 0.1197 (0.0147)
```
(5) Final Prediction
```python
averaged_models.fit(train.values, y_train)
y_pred = np.expm1(averaged_models.predict(test.values))

y_test = pd.DataFrame()
y_test['Id'] = test_ID
y_test['SalePrice'] = y_pred
y_test.to_csv('C:/Users/test/Desktop/機器學習小組作業/數據/y_test.csv', index=False)
```