用線性回歸預測灣區的房屋價格

02-25

動機

為了預測灣區的房價，我選擇了來自灣區住房銷售資料庫和Zillow的房價數據集。此數據集基於2013年1月至2015年12月期間出售的房屋。它具有許多學習特點，數據集可從此處下載。

數據預處理

import pandas as pd
sf = pd.read_csv(final_data.csv)
sf.head()

有幾個我們不需要的特徵，比如「info」，「z_address」，「zipcode」（我們有「neighborhood」作為位置變數），「zipid」和「zestimate」（這是Zillow估計的價格，我們不希望我們的模型受到這個影響），所以會丟棄它們。

sf.drop(sf.columns[[0, 2, 3, 15, 17, 18]], axis=1, inplace=True)
sf.info()

「zindexvalue」的數據類型應該是數字，所以讓我們更改它。

sf[zindexvalue] = sf[zindexvalue].str.replace(,, )
sf[zindexvalue] = pd.to_numeric(sf[zindexvalue])
sf.lastsolddate.min(), sf.lastsolddate.max()

(『01/02/2013』, 『12/31/2015』)

房屋銷售期限為2013年1月至2015年12月。

我現在使用describe（）方法來顯示數字變數的匯總統計信息。

sf.describe()

計數，平均值，最小值和最大值是不言自明的。 std顯示標準偏差，25％，50％和75％的行顯示相應的百分位數。

為了了解我們處理的數據類型，我們繪製了每個數值變數的直方圖。

%matplotlib inline
import matplotlib.pyplot as plt
sf.hist(bins=50, figsize=(20,15))
plt.savefig("attribute_histogram_plots")
plt.show()

圖1.每個數值變數的直方圖

有些直方圖有一點偏右，但這並不是異常的。

讓我們創建一個包含經度和緯度的散點圖來顯示數據：

sf.plot(kind="scatter", x="longitude", y="latitude", alpha=0.2)
plt.savefig(map1.png)

圖2.數據的散點圖

現在讓我們根據價格從高到低繪圖：

sf.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4, figsize=(10,7),
c="lastsoldprice", cmap=plt.get_cmap("jet"), colorbar=True,
sharex=False)

圖3.灣區房價

這張圖片告訴我們，最昂貴的房子出售在北部地區。

我們要預測的變數是「最後售出的價格」。那麼我們來看看每個獨立變數與這個因變數的相關程度。

corr_matrix = sf.corr()
corr_matrix["lastsoldprice"].sort_values(ascending=False)

當平方尺和浴室數量增加時，最後售出的價格往往會增加。您可以看到建成年份和最後售出價格之間存在小的負相關關係。最後，接近零的係數表明沒有線性相關。

我們現在要通過使用Pandas的scatter_matrix函數來可視化變數之間的相關性。我們只關注一些有「前途」的變數，這些變數似乎與最後售出的價格最相關。

from pandas.plotting import scatter_matrix
attributes = ["lastsoldprice", "finishedsqft", "bathrooms", "zindexvalue"]
scatter_matrix(sf[attributes], figsize=(12, 8))
plt.savefig(matrix.png)

圖4.散布矩陣

影響最後售出價格的最有可能的變數是平方英尺，所以讓我們放大它們的相關散點圖。

sf.plot(kind="scatter", x="finishedsqft", y="lastsoldprice", alpha=0.5)
plt.savefig(scatter.png)

圖5.平方英尺與最近出售的價格

相關性確實非常強;你可以清楚地看到上升的趨勢，並且這些點不是太分散。

由於每棟房屋面積不同，每個社區的房價都不同，我們真正需要的是每平方英尺的價格。所以，我們添加一個新的變數「pricepersqft」。然後，我們檢查這個新的自變數與最後售出價格的相關程度。

sf[price_per_sqft] = sf[lastsoldprice]/sf[finishedsqft]
corr_matrix = sf.corr()
corr_matrix["lastsoldprice"].sort_values(ascending=False)

不巧的是，新的pricepersqft變數只顯示與最後售出價格非常小的正相關。但是我們仍然需要這個變數來分組社區。

數據中有71個街區，我們將對他們進行分組。

len(sf[neighborhood].value_counts())

以下步驟將街區分為三組：1.低價格; 2.高價低頻; 3.高價高頻。

freq = sf.groupby(neighborhood).count()[address]
mean = sf.groupby(neighborhood).mean()[price_per_sqft]
cluster = pd.concat([freq, mean], axis=1)
cluster[neighborhood] = cluster.index
cluster.columns = [freq, price_per_sqft,neighborhood]
cluster.describe()

這些是低價格的街區：

cluster1 = cluster[cluster.price_per_sqft < 756]
cluster1.index

Index([『Bayview』, 『Central Richmond』, 『Central Sunset』, 『Crocker Amazon』,

『Daly City』, 『Diamond Heights』, 『Excelsior』, 『Forest Hill』,
『Forest Hill Extension』, 『Golden Gate Heights』, 『Ingleside』,
『Ingleside Heights』, 『Ingleside Terrace』, 『Inner Parkside』,
『Inner Richmond』, 『Inner Sunset』, 『Lakeshore』, 『Little Hollywood』,
『Merced Heights』, 『Mission Terrace』, 『Mount Davidson Manor』,
『Oceanview』, 『Outer Mission』, 『Outer Parkside』, 『Outer Richmond』,
『Outer Sunset』, 『Parkside』, 『Portola』, 『Silver Terrace』, 『Sunnyside』,
『Visitacion Valley』, 『West Portal』, 『Western Addition』,
『Westwood Highlands』, 『Westwood Park』],

dtype=』object』, name=』neighborhood』

這些是高價格和低頻率的街區：

cluster_temp = cluster[cluster.price_per_sqft >= 756]
cluster2 = cluster_temp[cluster_temp.freq <123]
cluster2.index

Index([『Buena Vista Park』, 『Central Waterfront?—?Dogpatch』, 『Corona Heights』, 『Haight-Ashbury』, 『Lakeside』, 『Lone Mountain』, 『Midtown Terrace』,

『North Beach』, 『North Waterfront』, 『Parnassus?—?Ashbury』, 『Presidio Heights』, 『Sea Cliff』, 『St. Francis Wood』, 『Telegraph Hill』, 『Twin Peaks』], dtype=』object』, name=』neighborhood』)

這些是高價格和高頻率的街區：

cluster3 = cluster_temp[cluster_temp.freq >=123]
cluster3.index

Index([『Bernal Heights』, 『Cow Hollow』, 『Downtown』, 『Eureka Valley?—?Dolores Heights?—?Castro』, 『Glen Park』, 『Hayes Valley』, 『Lake』, 『Lower Pacific Heights』, 『Marina』, 『Miraloma Park』, 『Mission』, 『Nob Hill』, 『Noe Valley』, 『North Panhandle』, 『Pacific Heights』, 『Potrero Hill』, 『Russian Hill』, 『South Beach』, 『South of Market』, 『Van Ness?—?Civic Center』, 『Yerba Buena』],

dtype=』object』, name=』neighborhood』)

我們根據集群添加一個組列：

def get_group(x):
if x in cluster1.index:
return low_price
elif x in cluster2.index:
return high_price_low_freq
else:
return high_price_high_freq
sf[group] = sf.neighborhood.apply(get_group)

在執行上述預處理之後，我們不再需要以下列：「address, lastsolddate, latitude, longitude, neighborhood, pricepersqft」，所以將它們刪除。

sf.drop(sf.columns[[0, 4, 6, 7, 8, 13]], axis=1, inplace=True)
sf = sf[[bathrooms, bedrooms, finishedsqft, totalrooms, usecode, yearbuilt,zindexvalue, group, lastsoldprice]]
sf.head()

數據看起來完美！

但在構建模型之前，我們需要為這兩個分類變數創建啞變數：「usecode」和「group」。

X = sf[[bathrooms, bedrooms, finishedsqft, totalrooms, usecode, yearbuilt,
zindexvalue, group]]
Y = sf[lastsoldprice]
n = pd.get_dummies(sf.group)
X = pd.concat([X, n], axis=1)
m = pd.get_dummies(sf.usecode)
X = pd.concat([X, m], axis=1)
drops = [group, usecode]
X.drop(drops, inplace=True, axis=1)
X.head()

如下圖所示：

訓練並建立線性回歸模型

from sklearn.cross_validation import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

LinearRegression(copyX=True, fitintercept=True, n_jobs=1, normalize=False)

搞定！我們現在有一個可用的線性回歸模型。

計算R平方：

y_pred = regressor.predict(X_test)
print(Linear Regression R squared: %.4f % regressor.score(X_test, y_test))

Linear Regression R squared: 0.5619

因此，我們的模型為56.19％。這並不令人興奮。

計算均方根誤差（RMSE）：

import numpy as np
from sklearn.metrics import mean_squared_error
lin_mse = mean_squared_error(y_pred, y_test)
lin_rmse = np.sqrt(lin_mse)
print(Linear Regression RMSE: %.4f % lin_rmse)

Linear Regression RMSE: 616071.5748

我們的模型能夠在真實價格的616071美元之內預測測試集中每個房子的價值。

計算平均絕對誤差（MAE）：

from sklearn.metrics import mean_absolute_error
lin_mae = mean_absolute_error(y_pred, y_test)
print(Linear Regression MAE: %.4f % lin_mae)

Linear Regression MAE: 363742.1631

隨機森林

讓我們嘗試一個更複雜的模型來看看是否可以改進結果 - RandomForestRegressor：

from sklearn.ensemble import RandomForestRegressor
forest_reg = RandomForestRegressor(random_state=42)
forest_reg.fit(X_train, y_train)

RandomForestRegressor(bootstrap=True, criterion=』mse』, maxdepth=None, maxfeatures=』auto』, maxleafnodes=None,

minimpuritysplit=1e-07, minsamplesleaf=1,
minsamplessplit=2, minweightfractionleaf=0.0,
nestimators=10, njobs=1, oobscore=False, randomstate=42,
verbose=0, warmstart=False)

print(Random Forest R squared": %.4f % forest_reg.score(X_test, y_test))

Random Forest R squared」: 0.6491

y_pred = forest_reg.predict(X_test)
forest_mse = mean_squared_error(y_pred, y_test)
forest_rmse = np.sqrt(forest_mse)
print(Random Forest RMSE: %.4f % forest_rmse)

Random Forest RMSE: 551406.0926

好多了！讓我們再試試。

Gradient boosting

from sklearn import ensemble
from sklearn.ensemble import GradientBoostingRegressor
model = ensemble.GradientBoostingRegressor()
model.fit(X_train, y_train)

GradientBoostingRegressor(alpha=0.9, criterion=friedmanmse, init=None,

learningrate=0.1, loss=ls, maxdepth=3, maxfeatures=None,
maxleafnodes=None, minimpuritydecrease=0.0,
minimpuritysplit=None, minsamplesleaf=1,
minsamplessplit=2, minweightfractionleaf=0.0,
nestimators=100, presort=auto, randomstate=None,
subsample=1.0, verbose=0, warmstart=False

print(Gradient Boosting R squared": %.4f % model.score(X_test, y_test))

Gradient Boosting R squared」: 0.6616

y_pred = model.predict(X_test)
model_mse = mean_squared_error(y_pred, y_test)
model_rmse = np.sqrt(model_mse)
print(Gradient Boosting RMSE: %.4f % model_rmse)

Gradient Boosting RMSE: 544579.8296

這些是迄今為止我們獲得的最佳結果，因此，我認為這是我們的最終模型。

特徵的重要性

我們在模型中使用了19個特徵（變數）。讓我們找出哪些功能很重要，反之亦然。

feature_labels = np.array([bathrooms, bedrooms, finishedsqft, totalrooms, yearbuilt, zindexvalue,
high_price_high_freq, high_price_low_freq, low_price, Apartment, Condominium, Cooperative,
Duplex, Miscellaneous, Mobile, MultiFamily2To4, MultiFamily5Plus, SingleFamily,
Townhouse])
importance = model.feature_importances_
feature_indexes_by_importance = importance.argsort()
for index in feature_indexes_by_importance:
print({}-{:.2f}%.format(feature_labels[index], (importance[index] *100.0)))

最重要的特徵是finishedsqft，zindexvalue，bathrooms，totalrooms，yearsbuilt等等。而最不重要的特徵是Apartment，這意味著無論是否是公寓，與售價無關。總體而言，這19個特徵中的大多數都被使用。

到你了！

希望這篇文章能夠讓你了解機器學習回歸項目是什麼樣的。正如您所看到的，大部分工作都在數據準備步驟中，而機器學習大部分時間花在這些過程上。

現在是時候行動，開始探索和清理您的數據。嘗試兩種或三種演算法，並讓我知道它是怎麼回事。

這篇文章的源代碼可以在這裡找到。我很樂意收到任何反饋或問題。

原作者：Susan Li

原文地址：https://towardsdatascience.com/linear-regression-in-python-predict-the-bay-areas-home-price-5c91c8378878

以上。