Kaggle HousePrice : LB 0.11666(前15%), 用搭積木的方式(2.實踐-特徵工程部分)

關鍵詞: 機器學習,Kaggle 比賽,特徵工程,pandas Pipe, sklearn Pipeline,刷法大法, 自動化

從上篇文章發布到我這篇文章,一共收到了78個贊。謝謝各位看官捧場。

本文正文部分閱讀預計要花30分鐘左右。假定讀者已經對Kaggle, Python, Pandas,統計有一定了解。後面附相關代碼,閱讀需要時間因人而異。

這兩天在忙著刷Kaggle梅塞德斯賓士生產線測試案例,剛剛有了些思路,還是用管道方法達了個積木。這才有空開始寫第二篇文章。(吐個槽,Kaggle上面的很多比賽,比的是財力。伺服器內存不行,或者計算速度不夠就是浪費時間。)

上回說道,用搭樂高積木的方式就可以多快好省的刷Kaggle分。整個過程可以分成兩個部分,一是特徵工程,二是管道調參。

今天這篇文章,主要分享和討論的是特徵工程這部分。 主要使用的是Pandas 的表級別函數Pipe 。

這個Pipe就像是樂高小火車。有火車頭,火車身,火車廂。根據需要連接起來就是一輛漂亮的小火車。有什麼功能,有多少功能,全看各種組合的方式。

首先,火車頭:

在Kaggle 比賽中,有原始數據,train, 和test 部分。 把這兩部分合併在一起。作為火車頭的輸入。以House Price 為例:

import numpy as nptrain = pd.read_csv("train.csv.gz")test = pd.read_csv("test.csv.gz")#gz文件不需要解壓縮,強大的Pandas 內置的解壓功能。combined = pd.concat([train,test],axis =0, ignore_index =True)#合併好的train和test文件,SalePrice列只有前1460個數據有效,後面1459個數據都是nan指,也就是要預測的值

火車頭有了,要搞清楚火車往哪裡開?

在House Price 比賽中,對應為目標是什麼?方向盤是什麼? 終點到了後送什麼貨?

目標是 預測Test 數據集中的SalePrice (房屋銷售價)

方向盤(或者說衡量標準)是 Root-Mean-Squared-Error (RMSE) 均方根誤差

到站後,要將貨物送出。一個文件,兩列,一列是ID, 另一列是機器學習後預測出來的房屋銷售價。 RMSE值越小,則越好。

這個比賽要在Kaggle拿到好名次,需要方向盤對準,要把RMSE值將下來,降低預測的均方根誤差。

註:比賽方也特別註明了會用log轉換消除太貴或太便宜導致的誤差)

下面來看看目標的分布情況(預測房屋售價的變數分布圖)

如果是老司機的話,基本上可以看出來如下幾個特點:

  1. 基本上是正態分布(如果不是,就可以洗洗睡了,或者要重新讓數據變成正態分布)
  2. 長尾, 尤其是右邊(不是完美的正態, 看起來有清洗工作要做)
  3. 可能存在Outliers(異常點),特別是60萬以上的房價(能處理就處理,不能就粗暴的說,我沒看見。 哈哈)
  4. 用KDEPlot其時間能看見多個核。(在梅塞德斯賓士比賽中,有人用聚類方法提煉出特徵,也可以提高比分0.0X。由於提前引入了預測功能, 不太符合我簡約的觀點。 所以我的這個分享中沒有專門涉及。)

接下來,火車身:

如果說combined Dataframe是火車頭的輸出,那麼Pandas Pipe就是火車身的輸入。 火車身好長,可以放很多車廂。 一節連著一節。每一節的輸出是上一節的輸入,直到最後。

Pandas Pipe可以作為火車身,每一個函數可以向樂高積木一樣,插在一個pipe 裡面。

Pipe是Pandas 裡面一個Tablewise的函數(v16.2的新功能原廠說明鏈接)。

比較一下,下面兩種方法,哪種更加簡潔和易於理解?

函數方法

# f, g, and h are functions taking and returning ``DataFrames`` >>> f(g(h(df), arg1=1), arg2=2, arg3=3)

Pipe大法

>>> (df.pipe(h) .pipe(g, arg1=1) .pipe(f, arg2=2, arg3=3) )

儘管看起來很簡潔,不過非常不起眼,貌似也沒apply, map, group等明星函數受到重視。其實原廠還有一個例子,pipe就是用來做statsmodel回歸特徵工程的前期清洗。不過當時,我沒有特別留意。 直到,Kaggle 比賽的House Price,出現瓶頸。

- local CV及pulic board成績不穩定上升。 我刷Kaggle 出現了成績上上下下,結果非常不穩定時。我想Pipe 把各種功能放在一起,會不會好一些?

- 調參太費時間(每次以小時計算,並且特徵工程結果稍微有些改變,優化好的參數又要重新再調一遍),貌似特徵工程差之毫厘,調參的參數和性能,預測的結果也會謬之千里。

-繁雜的程序,導致內存經常用光。基本上用上一會兒,就要重新啟動Jupyter 的 Kernel來回收內存。 (尤其是我租的阿里雲伺服器只有1核1G內存+2G虛擬內存。用硬碟虛擬內存。物理內存用完後, 一個簡單的回歸演算法也能算上幾分鐘時間)

這是,Pandas pipe(原廠說明鏈接) 重新回到了我的視野。

pipe

pipe

pipe

重要的事情說三遍。

比較早的時候,我學會了用Sklearn的pipeline和Gridsearch 一起調參,調函數,那個功能之強大是誰用誰知道。

只要備選參數設好,備選演算法設好,就交給計算就可以。所以特別喜歡pipe這類自動化的東西。 可以把苦臟累的活一下子變成高大上的,可以偷懶的活。訓練雖然花時間,還是有pipeline的套路可以使用。

不過在發現和利用Pandas 的pipe之前,特徵工程簡直就是苦臟累的活。例如:

- 大量的特徵需要琢磨,

尤其是House Price 這個案例,一共有80個特徵,1個目標(售價)。 為了搞明白美國的房價到底和那些有關,我還特別讀了好幾篇美國的房價深度分析報告。然並卵,這些特徵還會互相影響。

- 絕大多數的特徵都不知道琢磨後是否有價值,(單變數回歸)

例如,房子外立面材料,房間的電器開關用的什麼標準,多少安培等等等,Frontage大小,宗土圖形狀等等, 貸款是否還清了,

- 更不知道和其他特徵配合後結果會如何。(多元變數回歸)

街區,成交的月份,年份(中間還有07,08年經濟危機),房屋大小,宗土面積互相關關聯。

- 是不是有聚類的情況(非監管內機器學習方法)

前兩天剛學了一個知識,用Kmeans方法可以挖掘出來新的特徵。這種特徵方法不是基於經驗和知識,而僅僅是依賴於機器學習。 

插句話,特徵工程,尤其本文後面附錄的的特徵工程函數,大量依賴於個人的經驗和知識水平,老司機和新司機的特徵工程結果會差別很大。 基於機器學習的特徵工程,應該是未來的一個趨勢。前兩個月在網上看到以為百度前搜索廣告推廣工程師發的文章。講的就是百度在2010年前後的變化,之前還是比較依賴於人工的特徵工程。後來特徵工程太多,人工完全無法適應,他用類似的Kmeans方法作了聚類方法的特徵工程(希望我沒記錯)。

上面說了4中特徵工程的苦臟累。 我在House Price 比賽中全都碰到了。 前期還有點興趣,後期簡直是乏味。 在加上我的機器學習環境只有 1核1G內存的Centos阿里雲主機。經常調試就內存用光來,或者變成用硬碟虛擬內存,慢的無法忍受。

Pandas pipe來了, 簡單,有效。 可以有效地隔離,有效地連接,並且可以批量產生大量特徵工程結果組合。

上面嘮叨了許多Pipe的文字觀點,也許讀者已經有點煩了,下面再上兩張積木的照片,來圖示比較積木和pipe的關係。

Pipe 就像樂高小火車積木中的車身,本身沒有任何功能,但是有很好的輸入、輸出機制

車身搭上一個積木,小火車好玩。Pipe也要裝入特徵工程函數,才有用

兩個或者更多車身積木連在一起,就變成長長的火車身。兩個或者多個pipe(裝入特徵函數)就變成了特徵工程管道

接下來,我們來看看,這個積木方法是否能夠解決特徵工程中4個問題?

- 大量的特徵需要琢磨,

一個特徵就來一個pipe。pipe裡面放對應的特徵函數。多個特徵就用多個pipe.

雖然大量的特徵依然需要逐一琢磨。但是pipe可能帶入項目管理中WBS工作分解的思路。把多個特徵分解給不同的人(不同領域有不同的專家)來做,最後用pipe鏈接起來。

- 絕大多數的特徵都不知道琢磨後是否有價值,(單變數分析)

pipe分解了特徵工作後,每一個具體的特徵函數,可以深入的使用本領域的方法來設kpi和指標,在函數級別確定是否有價值,或者設定處理的量綱,過濾級別等等。

- 更不知道和其他特徵配合後結果會如何。(多元變數回歸)

pipe的強處就在這裡。搭積木呀,簡單的各種pipe連在一起就好。甚至還可以專門做一個測試成績的pipe.在pipes的結尾放一個快速測試函數(例如:測試R2)就可以方便快速的得到這次特徵工程大致效果。

註:

回歸類比賽,成績通常是兩種:R2 或者是 MSE以及其變種(RMSE,AMSE等)

分類內比賽,成績類型比較多:accuracy, precise,f1, mutual等等。

Sklearn 的成績評估函數介紹 -比較詳細

- 是不是有聚類的情況(非監管內機器學習方法)

簡單,做一個聚類的特徵函數,然後用一個pipe裝起來就好。

下面,開始無聊的代碼時間吧!

機器環境:

硬體- 阿里雲最便宜一款 -1核 1GB(共享基本型 xn4,ecs.xn4.small)(

系統 , Centos 7(虛擬內存2G)

Python: 3.6, Jupyter notebook 4.3, 用Anaconda裝好了常見的科學計算庫。

手工裝了:XGBoost(kaggle刷分利器之一) , lightbm 庫(微軟開源,速度比XGBoost快)。有兩上面這兩個庫,sklearn 裡面的gradientboost就沒有必要用了,太慢了,score也不如這兩個庫好。

  • 1.導入函數和Panas庫

import numpy as npimport pandas as pdfrom my-lib import *#例如: eda_plot#定義函數和pipe都放在my-lib裡面(這部分沒有測試過)。我用my-lib.func方式來調用這些函數#為了文章顯示美觀,我在最後逐一分享函數,實際代碼這些函數是方法最前面的from datetime import datetimeimport matplotlib.pyplot as pltimport seaborn as sns%matplotlib inlineSKIPMAP = False #此外全局變數開關,用來關閉/啟動自定義函數的畫圖。節約時間和內存#註:特徵工程階段一般來說Pandas 和numpy就夠了。 本文側重特徵工程,EDA部分跳過,不談。

  • 2.導入數據,準備combined數據集。做好火車頭

train = pd.read_csv("train.csv.gz")test = pd.read_csv("test.csv.gz")combined = pd.concat([train,test],axis =0, ignore_index =True)ntrain = train.shape[0]Y_train = train["SalePrice"]X_train = train.drop(["Id","SalePrice"],axis=1)print("train data shape: ",train.shape)print("test data shape: ",test.shape)print("combined data shape: ",combined.shape)

train data shape: (1460, 81)

test data shape: (1459, 80)

combined data shape: (2919, 81)

數據集:train 1460個,test,1459個,特徵為80個,另外一個是需要預測的目標(SalePrice)

  • 3. Exploring Data analysis 數據探索分析

3.1 要事第一,先看看目標數據情況(Y -SalePrice)

print(Y_train.skew()) #skew是單變數工具,用來監測數據是否有長尾,左偏或者右偏eda_plot(cols=["SalePrice"]) #自定義函數,用來顯示單變數的正太分布圖和雙變數(和銷售價格)關係圖。

文章開頭已經介紹過了,銷售價格是正態分布,但是長尾(high skews)且有離散點(outliers)

  • 3.2 特徵分析 (X), 80個,每列代表了一個特徵
  • 3.2.1 偏度分析 - 銷售價格由偏度角度,那麼就先對特徵檢查一下偏度。

np.abs(combined[:ntrain].skew()).sort_values(ascending = False ).head(20)#np.abs 是絕對值函數,用來取整個向量絕對值

MiscVal 24.476794

PoolArea 14.828374

LotArea 12.207688

3SsnPorch 10.304342

LowQualFinSF 9.011341

KitchenAbvGr 4.488397

BsmtFinSF2 4.255261

ScreenPorch 4.122214

BsmtHalfBath 4.103403

EnclosedPorch 3.089872

MasVnrArea 2.669084

OpenPorchSF 2.364342

LotFrontage 2.163569

SalePrice 1.882876

BsmtFinSF1 1.685503

WoodDeckSF 1.541376

TotalBsmtSF 1.524255

MSSubClass 1.407657

1stFlrSF 1.376757

GrLivArea 1.366560

dtype: float64

看起來蠻厲害,前面20個特徵偏度都超過了1.36。後期需要處理

Found:

More than 20 features(columns) showed high skews(left,right).

Next Actions:

if the skew() value > certain value, they have to been transfer by log1p to improve the accuracy

3.2.2 缺失值分析

cols_missing_value = combined.isnull().sum()/combined.shape[0]cols_missing_value = cols_missing_value[cols_missing_value>0]print("How many features is bad/missing value? The answer is:",cols_missing_value.shape[0])cols_missing_value.sort_values(ascending=False).head(10).plot.barh()

How many features is bad/missing value? The answer is: 35

Out[10]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f303cf5aef0>

有35個數據都有缺失值情況。 大多數機器學習演算法要求不可以有缺失值。 因此,處理缺失值是 必要工作之一。

Fill the missing value is first step to in Machine learning as most of the learning estimator expect the pre-process data.

Found:

35 features were found missing values

top 6 features missed more than 50% of data. They were: PoolQC, MiscFeature, Alley, Fence, FirePlaceQu

Next Actions:

All features found missing value will be fillNA with different approaches. for example:

Basic_fillna, Just fiilna with median/mean(). In this case, we choise median as significant skew() was identified. It quick & easy.

Manual_fillna, fillna with detail data exploring manually. It will be time consuming actions & manual handled, some time depend on the people peronsal experience. To wash the data with my knowledge(maybe bias), and fillNA with

grouped mean/medain (based on location, similar group)

specific values.(0, "NA",and etc)

總結:初步EDA了解到:

1.有偏度 - 需要處理。通常是用log1p (提分關鍵)

2.有缺失 - 需要填充或者刪除,通常用均值或者中指,或者用人工分析(人工分析是提分關鍵)

3.有類別變數 - 需要轉換成數值 (個人感覺刷近前20%關鍵)

(此處沒有做EDA, 直接看data_description.txt文件)

  • 4.準備pipe,裝載特徵函數,小火車出發啦!!!

from my-lib import *#導入一下自定義的特徵函數#例如:#定義函數和pipe都放在my-lib裡面(這部分沒有測試過)。我用my-lib.func方式來調用這些函數#為了文章顯示美觀,我在最後逐一分享函數,實際代碼這些函數是方法最前面的

為了演示,我定義三個pipes, 每個pipes裡面都有若干個特徵處理函數和一個快速測試R2(越高越好,最大值是1)的函數。實際刷分時更多,加上不同的特徵函數參數,做的pipes組合大概至少幾十種。

為了對齊美觀,用bypass函數來填充空白的地方。用一個列表,將這三個pipe放入。

現在三輛火車準備出發,都在車站等候待命。 他們分別是簡單型,自定義型(部分特徵轉換成有序的category類型),和概念型(提煉並新增加房屋每平米單價係數)。

pipe_basic = [pipe_basic_fillna,pipe_bypass, pipe_bypass,pipe_bypass, pipe_bypass,pipe_bypass, pipe_log_getdummies,pipe_bypass, pipe_export,pipe_r2test]pipe_ascat = [pipe_fillna_ascat,pipe_drop_cols, pipe_drop4cols,pipe_outliersdrop, pipe_extract,pipe_bypass, pipe_log_getdummies,pipe_drop_dummycols, pipe_export,pipe_r2test]pipe_ascat_unitprice = [pipe_fillna_ascat,pipe_drop_cols, pipe_drop4cols,pipe_outliersdrop, pipe_extract,pipe_unitprice, pipe_log_getdummies,pipe_drop_dummycols, pipe_export,pipe_r2test]pipes = [pipe_basic,pipe_ascat,pipe_ascat_unitprice ]

好了,小火車啟動了,發車。

for i in range(len(pipes)): print("*"*10,"
") pipe_output=pipes[i] output_name ="_".join([x.__name__[5:] for x in pipe_output if x.__name__ is not "pipe_bypass"]) output_name = "PIPE_" +output_name print(output_name) (combined.pipe(pipe_output[0]) .pipe(pipe_output[1]) .pipe(pipe_output[2]) .pipe(pipe_output[3]) .pipe(pipe_output[4]) .pipe(pipe_output[5]) .pipe(pipe_output[6]) .pipe(pipe_output[7]) .pipe(pipe_output[8],name=output_name) .pipe(pipe_output[9]) )

精簡的結果如下:

PIPE_basic_fillna_log_getdummies_export_r2test

****Testing*****

R2 Scoring by lightGBM = 0.7290 PIPE_fillna_ascat_drop_cols_drop4cols_outliersdrop_extract_log_getdummies_drop_dummycols_export_r2test

*****Drop outlier based on ratio > 0.999 quantile :

****Testing*****

R2 Scoring by lightGBM = 0.7342 PIPE_fillna_ascat_drop_cols_drop4cols_outliersdrop_extract_unitprice_log_getdummies_drop_dummycols_export_r2test

*****Drop outlier based on ratio > 0.999 quantile :

****Testing*****

R2 Scoring by lightGBM = 0.7910

三個pipes的r2(默認參數,無優化調參)結果分別是

0.729, (填充均值)

0.7390,(自定義填充,和類型轉換)

0.7910(增加單價的特徵工程)

那麼在這一步,我們可以初步看到三個特徵工程的性能。並且文件已經輸出到hd5格式文件。後期在訓練和預測時,直接取出預處理的文件就可以。

好了,特徵工程到這裡就完成了。這三個小火車PIPES的R2結果不是最好,只能說尚可。都保留下來。可以進入訓練,優化,調參,集成的階段了。本文也就寫完了。

註:特徵工程函數代碼見文章最後部分。

非常感謝你的閱讀和花費的時間。

如果本文的點贊數量超過100,那麼我將會完成第三步分 Sklearn的 pipeline 調參大法。

同時你會發現本文中三個小火車裡面有一個已經埋下了過擬合的種子。

分享是對自己最好的投資!

======================================================

如果,你讀到這裡還有耐心繼續往下看的話,我把幾個重要的函數分享出來。

註:火車頭要保護好。 Pipe裡面的第一個特徵函數一定要用 copy函數獲得一份完整DataFrame.這樣可以避免原始數據被修改,且不能重複使用的問題。

def pipe_basic_fillna(df=combined): local_ntrain = ntrain pre_combined=df.copy() #print("The input train dimension: ", pre_combined[0:ntrain].shape) #print("The input test dimension: ", pre_combined[ntrain:].drop("SalePrice",axis=1).shape) num_cols = pre_combined.drop(["Id","SalePrice"],axis=1).select_dtypes(include=[np.number]).columns cat_cols = pre_combined.select_dtypes(include=[np.object]).columns pre_combined[num_cols]= pre_combined[num_cols].fillna(pre_combined[num_cols].median()) # Median is my favoraite fillna mode, which can eliminate the skew impact. pre_combined[cat_cols]= pre_combined[cat_cols].fillna("NA") pre_combined= pd.concat([pre_combined[["Id","SalePrice"]],pre_combined[cat_cols],pre_combined[num_cols]],axis=1) return pre_combined

def pipe_drop4cols(pre_combined=pipe_fillna_ascat()): cols_drop =["PoolQC", "MiscFeature", "Alley", "Fence", "FireplaceQu"] # the 4 features(columns was identified earlier which missing data >50%) #pre_combined, ntrain = customize_fillna_extract_outliersdrop() #pre_combined = customize_fillna_extract_outliersdrop() pre_combined = pre_combined.drop(cols_drop,axis=1) return pre_combined

def pipe_drop_cols(pre_combined = pipe_fillna_ascat()): pre_combied = pre_combined.drop(["Street", "Utilities", "Condition2", "RoofMatl", "Heating"], axis = 1) return pre_combined

def pipe_fillna_ascat(df=combined): from datetime import datetime local_ntrain = train.shape[0] pre_combined=df.copy() #convert quality feature to category type #def feature group with same categories value cols_Subclass = ["MSSubClass"] cols_Zone = ["MSZoning"] cols_Overall =["OverallQual","OverallCond"] cols_Qual = ["BsmtCond","BsmtQual", "ExterQual","ExterCond", "FireplaceQu","GarageQual","GarageCond", "HeatingQC","KitchenQual", "PoolQC"] cols_BsmtFinType = ["BsmtFinType1","BsmtFinType2"] cols_access = ["Alley","Street"] cols_condition = ["Condition1","Condition2"] cols_fence =["Fence"] cols_exposure = ["BsmtExposure"] cols_miscfeat = ["MiscFeature"] cols_exter = ["Exterior1st","Exterior2nd"] cols_MasVnr =["MasVnrType"] cols_GarageType = ["GarageType"] cols_GarageFinish =["GarageFinish"] cols_Functional = ["Functional"] cols_Util =["Utilities"] cols_SaleType = ["SaleType"] cols_Electrical = ["Electrical"] #define the map of categories valus group cat_Subclass = ["20",#1-STORY 1946 & NEWER ALL STYLES "30",#1-STORY 1945 & OLDER "40",#1-STORY W/FINISHED ATTIC ALL AGES "45",#1-1/2 STORY - UNFINISHED ALL AGES "50",#1-1/2 STORY FINISHED ALL AGES "60",#2-STORY 1946 & NEWER "70",#2-STORY 1945 & OLDER "75",#2-1/2 STORY ALL AGES "80",#SPLIT OR MULTI-LEVEL "85",#SPLIT FOYER "90",#DUPLEX - ALL STYLES AND AGES "120",#1-STORY PUD (Planned Unit Development) - 1946 & NEWER "150",#1-1/2 STORY PUD - ALL AGES "160",#2-STORY PUD - 1946 & NEWER "180",#PUD - MULTILEVEL - INCL SPLIT LEV/FOYER "190",#2 FAMILY CONVERSION - ALL STYLES AND AGES ] cat_Zone = ["A",#Agriculture "C (all)",#Commercial #the train/test value is different than the data_description file. "FV",#Floating Village Residential "I",#Industrial "RH",#Residential High Density "RL",#Residential Low Density "RP",#Residential Low Density Park "RM",#Residential Medium Density ] cat_Overall = ["10","9","8","7","6","5","4","3","2","1"] cat_Qual = ["Ex","Gd","TA","Fa","Po","NA"] cat_BsmtFinType = ["GLQ","ALQ","BLQ","Rec","LwQ","Unf","NA"] cat_access = ["Grvl","Pave","NA"] cat_conditions= ["Artery","Feedr","Norm","RRNn","RRAn","PosN","PosA","RRNe","RRAe"] cat_fence = ["GdPrv",#Good Privacy "MnPrv",#Minimum Privacy "GdWo",#Good Wood "MnWw",#Minimum Wood/Wire "NA",#No Fence ] cat_exposure = ["Gd", #Good Exposure "Av", #Average Exposure (split levels or foyers typically score average or above) "Mn", #Mimimum Exposure "No", #No Exposure "NA", #No Basement ] cat_miscfeat = ["Elev",#Elevator "Gar2",#2nd Garage (if not described in garage section) "Othr",#Other "Shed",#Shed (over 100 SF) "TenC",#Tennis Court "NA",#None ] cat_exter =["AsbShng",#Asbestos Shingles "AsphShn",#Asphalt Shingles "BrkComm",#Brick Common Brk Cmn BrkComm "BrkFace",#Brick Face "CBlock",#Cinder Block "CementBd",#Cement Board #CementBd was the data_description value "HdBoard",#Hard Board "ImStucc",#Imitation Stucco "MetalSd",#Metal Siding "Other",#Other "Plywood",#Plywood "PreCast",#PreCast,# "Stone",#Stone "Stucco",#Stucco "VinylSd",#Vinyl Siding "Wd Sdng",#Wood Siding "WdShing",#Wood Shingles #Wd Shng WdShing ] cat_MasVnr =["BrkCmn",#Brick Common "BrkFace",#Brick Face "CBlock",#Cinder Block "None",#None "Stone",#Stone ] cat_GarageType =["2Types",#More than one type of garage "Attchd",#Attached to home "Basment",#Basement Garage "BuiltIn",#Built-In (Garage part of house - typically has room above garage) "CarPort",#Car Port "Detchd",#Detached from home "NA",#No Garage ] cat_GarageFinish =["Fin",#Finished "RFn",#Rough Finished,# "Unf",#Unfinished "NA",#No Garage ] cat_Functional = ["Typ",#Typical Functionality "Min1",#Minor Deductions 1 "Min2",#Minor Deductions 2 "Mod",#Moderate Deductions "Maj1",#Major Deductions 1 "Maj2",#Major Deductions 2 "Sev",#Severely Damaged "Sal",#Salvage only ] cat_Util =["AllPub",#All public Utilities (E,G,W,& S) "NoSewr",#Electricity, Gas, and Water (Septic Tank) "NoSeWa",#Electricity and Gas Only "ELO",#Electricity only,# ] cat_SaleType =["WD",#Warranty Deed - Conventional "CWD",#Warranty Deed - Cash "VWD",#Warranty Deed - VA Loan "New",#Home just constructed and sold "COD",#Court Officer Deed/Estate "Con",#Contract 15% Down payment regular terms "ConLw",#Contract Low Down payment and low interest "ConLI",#Contract Low Interest "ConLD",#Contract Low Down "Oth",#Other ] cat_Electrical = ["SBrkr",#Standard Circuit Breakers & Romex "FuseA",#Fuse Box over 60 AMP and all Romex wiring (Average),# "FuseF",#60 AMP Fuse Box and mostly Romex wiring (Fair) "FuseP",#60 AMP Fuse Box and mostly knob & tube wiring (poor) "Mix",#Mixed ] ########################################################################### #define the collection of group features &categories value by diction type Dict_category={"Qual":[cols_Qual,cat_Qual,"NA","Ordinal"], "Overall":[cols_Overall,cat_Overall,"5","Ordinal"], # It is integer already. no need overwork "BsmtFinType":[cols_BsmtFinType,cat_BsmtFinType,"NA","Ordinal"], "Access":[cols_access,cat_access,"NA","Ordinal"], "Fence":[cols_fence,cat_fence,"NA","Ordinal"], "Exposure":[cols_exposure,cat_exposure,"NA","v"], "GarageFinish":[cols_GarageFinish,cat_GarageFinish,"NA","Ordinal"], "Functional":[cols_Functional,cat_Functional,"Typ","Ordinal"], #fill na with lowest quality "Utility":[cols_Util,cat_Util,"ELO","Ordinal"], # fillNA with lowest quality "Subclass":[cols_Subclass,cat_Subclass,"NA","Nominal"], "Zone":[cols_Zone,cat_Zone,"RL","Nominal"], #RL is most popular zone value. "C(all) is the study result" "Cond":[cols_condition,cat_conditions,"Norm","Nominal"], "MiscFeature":[cols_miscfeat,cat_miscfeat,"NA","Nominal"], "Exter":[cols_exter,cat_exter,"Other","Nominal"], "MasVnr":[cols_MasVnr,cat_MasVnr,"None","Nominal"], "GarageType":[cols_GarageType,cat_GarageType,"NA","Nominal"], "SaleType":[cols_SaleType, cat_SaleType,"WD","Nominal"], "Electrical":[cols_Electrical,cat_Electrical,"SBrkr","Nominal"], } #Change input feature type to string, especailly to below integer type pre_combined[cols_Overall] = pre_combined[cols_Overall].astype(str) pre_combined[cols_Subclass] = pre_combined[cols_Subclass].astype(str) #fix the raw data mistyping exter_map = {"Brk Cmn":"BrkComm", "CmentBd":"CementBd", "CemntBd":"CementBd", "Wd Shng":"WdShing" } pre_combined[cols_exter]=pre_combined[cols_exter].replace(exter_map) for v in Dict_category.values(): cols_cat = v[0] cat_order =v[1] cat_fillnavalue=v[2] pre_combined[cols_cat]=pre_combined[cols_cat].fillna(cat_fillnavalue) #if not isOrdinal: if v[3] =="Nominal": for col in cols_cat: pre_combined[col]=pre_combined[col].astype("category",ordered =True,categories=cat_order) elif v[3]=="Ordinal": for col in cols_cat: pre_combined[col]=pre_combined[col].astype("category",ordered =True,categories=cat_order).cat.codes pre_combined[col] = pre_combined[col].astype(np.number) #pre_combined[cols_Overall] = pre_combined[cols_Overall].fillna(pre_combined[cols_Overall].median()) #Lotfrontage fill mssing value pre_combined["LotFrontage"] = pre_combined.groupby("Neighborhood")["LotFrontage"].transform(lambda x: x.fillna(x.median())) #fill missing value to Garage related features #Assuming no garage for thos missing value pre_combined["GarageCars"] =pre_combined["GarageCars"].fillna(0).astype(int) pre_combined["GarageArea"] =pre_combined["GarageArea"].fillna(0).astype(int) #fill missing value to Basement related features pre_combined[["BsmtFinSF1","BsmtFinSF2","BsmtUnfSF"]]= pre_combined[["BsmtFinSF1","BsmtFinSF2","BsmtUnfSF"]].fillna(0) pre_combined["TotalBsmtSF"]= pre_combined["BsmtFinSF1"] + pre_combined["BsmtFinSF2"]+pre_combined["BsmtUnfSF"] cols_Bsmt_Bath = ["BsmtHalfBath","BsmtFullBath"] pre_combined[cols_Bsmt_Bath] =pre_combined[cols_Bsmt_Bath].fillna(0) #assuming mean pre_combined["MasVnrArea"] = pre_combined["MasVnrArea"].fillna(0) #filled per study #solve Year related feature missing value #cols_time = ["YearBuilt","YearRemodAdd","GarageYrBlt","MoSold","YrSold"] pre_combined["GarageYrBlt"] = pre_combined["GarageYrBlt"].fillna(pre_combined["YearBuilt"]) #use building year for garage even no garage. return pre_combined

def pipe_outliersdrop(pre_combined=pipe_fillna_ascat(),ratio =0.001): # note, it could done by statsmodel as well. it will explored in future ratio =1-ratio ntrain = pre_combined["SalePrice"].notnull().sum() Y_train = pre_combined["SalePrice"][:ntrain] num_cols = pre_combined.select_dtypes(include=[np.number]).columns out_df = pre_combined[0:ntrain][num_cols] top5 = np.abs(out_df.corrwith(Y_train)).sort_values(ascending=False)[:5] #eda_plot(df=pre_combined[:ntrain],cols=top5.index) limit = out_df["GrLivArea"].quantile(ratio) # limit use to remove the outliers dropindex = out_df[out_df["GrLivArea"]>limit].index dropped_pre_combined =pre_combined.drop(dropindex) #***************************** dropped_Y_train = Y_train.drop(dropindex) #***************************** print("

*****Drop outlier based on ratio > {0:.3f} quantile :".format(ratio)) #print("New shape of collected data",dropped_pre_combined.shape) return dropped_pre_combined

def cat_col_compress(col, threshold=0.005): #copy the code from stackoverflow # removes the bind dummy_col=col.copy() # what is the ratio of a dummy in whole column count = pd.value_counts(dummy_col) / len(dummy_col) # cond whether the ratios is higher than the threshold mask = dummy_col.isin(count[count > threshold].index) # replace the ones which ratio is lower than the threshold by a special name dummy_col[~mask] = "dum_others" return dummy_coldef pipe_log_getdummies(pre_combined = pipe_fillna_ascat(),skew_ratio=0.75,cat_ratio=0): from scipy.stats import skew skew_limit =skew_ratio #I got this limit from Kaggle directly. Someone use 1 , someone use 0.75. I just use 0.75 by random and has no detail study yet cat_threshold = cat_ratio num_cols = pre_combined.select_dtypes(include=[np.number]).columns cat_cols = pre_combined.select_dtypes(include=[np.object]).columns #log transform skewed numeric features: skewed_Series = np.abs(pre_combined[num_cols].skew()) #compute skewness skewed_cols = skewed_Series[skewed_Series > skew_limit].index.values pre_combined[skewed_cols] = np.log1p(pre_combined[skewed_cols]) skewed_Series = abs(pre_combined.skew()) #compute skewness skewed_cols = skewed_Series[skewed_Series > skew_limit].index.tolist() for col in cat_cols: pre_combined[col]=cat_col_compress(pre_combined[col],threshold=cat_threshold) # threshold set to zero as it get high core for all estimatior except ridge based pre_combined= pd.get_dummies(pre_combined,drop_first=True) return pre_combined

def pipe_drop_dummycols(pre_combined=pipe_log_getdummies()): cols = ["MSSubClass_160","MSZoning_C (all)"] pre_combined=pre_combined.drop(cols,axis=1) return pre_combined

def pipe_export(pre_output,name): if (pre_output is None) : print("None input! Expect pre_combined dataframe name as parameter") return elif pre_output.drop("SalePrice",axis=1).isnull().sum().sum()>0: print("Dataframe still missing value! pls check again") return elif type(name) is not str: print("Expect preparing option name to generate output file") print("The out file name will be [Preparing_Output_<name>_20171029.h5] ") return else: from datetime import datetime savetime=datetime.now().strftime("%m-%d-%H_%M") directory_name = "./prepare/" filename = directory_name + name +"_"+ savetime +".h5" local_ntrain = pre_output.SalePrice.notnull().sum() pre_train = pre_output[0:local_ntrain] pre_test =pre_output[local_ntrain:].drop("SalePrice",axis=1) pre_train.to_hdf(filename,"pre_train") pre_test.to_hdf(filename,"pre_test") #print("
***Exported*** :{0}".format(filename)) #print(" train set size : ",local_ntrain) #print(" pre_train shape: ", pre_train.shape) #print(" pre_test shape: ", pre_test.shape) return pre_output

def pipe_r2test(df): import statsmodels.api as sm import warnings warnings.filterwarnings("ignore") print("****Testing*****") train_df = df ntrain = train_df["SalePrice"].notnull().sum() train = train_df[:ntrain] X_train = train.drop(["Id","SalePrice"],axis =1) Y_train = train["SalePrice"] from lightgbm import LGBMRegressor,LGBMClassifier from sklearn.model_selection import cross_val_score LGB =LGBMRegressor() nCV=3 score = cross_val_score(LGB,X_train,Y_train,cv=nCV,scoring="r2") print("R2 Scoring by lightGBM = {0:.4f}".format(score.mean())) #print(pd.concat([X_train,Y_train],axis=1).head()) result = sm.OLS(Y_train, X_train).fit() result_str= str(result.summary()) results1 = result_str.split("
")[:10] for result in results1: print(result) print("*"*20) return df

def pipe_extract( pre_combined = pipe_fillna_ascat()): #extract 3 age feature(building, Garage, and remodel) pre_combined["BldAge"] = pre_combined["YrSold"] - pre_combined["YearBuilt"] pre_combined["GarageAge"] = pre_combined["YrSold"] - pre_combined["GarageYrBlt"] pre_combined["RemodelAge"] = pre_combined["YrSold"] - pre_combined["YearRemodAdd"] SoldYM_df = pd.DataFrame({"year":pre_combined.YrSold,"month":pre_combined.MoSold.astype("int").astype("object"),"day":1}) SoldYM_df = pd.to_datetime(SoldYM_df,format="%Y%m%d",unit="D") pre_combined["SoldYM"]=SoldYM_df.apply(lambda x: x.toordinal()) #extract total space features # three options for calculating the total Square feet. (Garage & basement has very high sknew, remove them ) #pre_combined["TotalSQF"] = pre_combined["GarageArea"] + pre_combined["TotalBsmtSF"] +pre_combined["GrLivArea"]+pre_combined["1stFlrSF"] + pre_combined["2ndFlrSF"] #pre_combined["TotalSQF"] = pre_combined["TotalBsmtSF"] + pre_combined["1stFlrSF"] + pre_combined["2ndFlrSF"] pre_combined["TotalSQF"] = pre_combined["1stFlrSF"] + pre_combined["2ndFlrSF"] +pre_combined["GrLivArea"] return pre_combined

def pipe_drop_dummycols(pre_combined=pipe_log_getdummies()): cols = ["MSSubClass_160","MSZoning_C (all)"] pre_combined=pre_combined.drop(cols,axis=1) return pre_combined

好了,非常感謝你的閱讀和花費的時間。 如果本文的點贊數量超過100,那麼我將會完成第三步分 Sklearn的 pipeline 調參大法。

同時你會發現本文中三個小火車裡面有一個已經埋下了過擬合的種子。

分享是對自己最好的投資!

相關鏈接:

Kaggle HousePrice : LB 0.11666(前15%), 用搭積木的方式(2.實踐-特徵工程部分)

Kaggle HousePrice : LB 0.11666(排名前15%), 用搭積木的方式(1.原理)

用Python分析指數: 11月16日熱門指數Z值表

用Python分析指數: 10月18日指數高低Z值表。

滬深300進入低估區了嗎? 沒有。好像中證環保現在低估了。

一個韭菜用Python採集,清洗和分析中證指數


推薦閱讀:

八大排序演算法python實現合輯
[Python + SQL] NBA史上最弱的球隊是哪一個
簡單三步,用 Python 發郵件

TAG:Python | 机器学习 | Kaggle |