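A Summary of Categorical Feature Encoding for Kaggle
From the column 機器學習演算法與自然語言處理 (Machine Learning Algorithms and Natural Language Processing)
Kaggle competitions are, at heart, competitions of well-worn tricks. This article covers the common ways to handle categorical features in Kaggle competitions, mainly for tree models (LightGBM, XGBoost, etc.). The focus is on target encoding and beta target encoding.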
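Summary:
- label encoding
  - the feature has an intrinsic order (ordinal feature)
- one-hot encoding
  - the feature has no intrinsic order, number of categories < 4
- target encoding (mean encoding, likelihood encoding, impact encoding)
  - the feature has no intrinsic order, number of categories > 4
- beta target encoding
  - the feature has no intrinsic order, number of categories > 4, K-fold cross validation
- no processing (the model encodes automatically)
  - CatBoost, LightGBM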
1. Label encoding
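For a feature with m categories, label encoding maps each category to an integer between 0 and m-1. Label encoding is suited to ordinal features (features with an intrinsic order).
Code:

    import numpy as np
    from sklearn.preprocessing import LabelEncoder

    # train -> training dataframe
    # test -> test dataframe
    # cat_cols -> categorical columns
    for col in cat_cols:
        le = LabelEncoder()
        # fit on train and test together so every category gets a code
        le.fit(np.concatenate([train[col], test[col]]))
        train[col] = le.transform(train[col])
        test[col] = le.transform(test[col])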
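One thing to watch with ordinal features: LabelEncoder assigns codes in sorted label order, which is not necessarily the feature's intrinsic order. A minimal sketch of an explicit mapping follows; the column name `size`, the toy data and the order list are all made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: "size" has a natural order S < M < L.
train = pd.DataFrame({"size": ["M", "S", "L", "S"]})
test = pd.DataFrame({"size": ["L", "M"]})

# Assumed order, written out by hand. LabelEncoder would sort the
# labels alphabetically (L, M, S) and lose the S < M < L order.
size_order = ["S", "M", "L"]
size_to_code = {label: code for code, label in enumerate(size_order)}

train["size_encoded"] = train["size"].map(size_to_code)
test["size_encoded"] = test["size"].map(size_to_code)
print(train)
```

With this mapping a split such as `size_encoded <= 1` means "S or M", which is the kind of threshold a tree model can exploit.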
2. One-hot encoding (OHE)
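For a feature with m categories, one-hot encoding (OHE) turns it into m binary features, one per category. These m binary features are mutually exclusive: exactly one is active for any given row.
One-hot encoding solves the problem that the original feature has no intrinsic order. The drawback is that for a high-cardinality categorical feature (one with many categories) the encoded feature space becomes too large (PCA dimensionality reduction can be considered here). In addition, because one-hot features are quite unbalanced, each split in a tree model yields only a small gain, so the tree usually has to grow very deep to reach good accuracy. OHE is therefore generally used when the number of categories is < 4.
Reference: Using Categorical Data with One Hot Encoding
Code:

    import pandas as pd

    # train -> training dataframe
    # test -> test dataframe
    # cat_cols -> categorical columns
    # stack train and test so both get the same dummy columns;
    # reset_index() keeps the original row index in an "index" column
    df = pd.concat([train, test]).reset_index()
    original_column = list(df.columns)
    df = pd.get_dummies(df, columns=cat_cols, dummy_na=True)
    # names of the newly created one-hot columns
    new_column = [c for c in df.columns if c not in original_column]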
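To make the "m mutually exclusive binary features" point concrete, here is a tiny sketch on a made-up `color` column with m = 3 categories (the data is purely illustrative):

```python
import pandas as pd

# Hypothetical toy column with m = 3 categories.
df = pd.DataFrame({"color": ["red", "green", "blue", "green"]})

# dtype=int gives 0/1 columns whatever the pandas default dtype is.
ohe = pd.get_dummies(df, columns=["color"], dtype=int)

# One binary column per category ...
print(ohe.columns.tolist())
# ... and exactly one of them is active in every row.
print(ohe.sum(axis=1).tolist())
```

With thousands of categories the same call would create thousands of sparse columns, which is the feature-space blow-up described above.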
3. Target encoding (or likelihood encoding, impact encoding, mean encoding)
Target encoding encodes a categorical feature with the target mean value (among each category). To reduce target variable leak, the mainstream approach is to compute the target mean with 2 levels of cross-validation, as follows:
- Split the train data into 20 folds (e.g. infold: folds #2-20, out of fold: fold #1)
- Split each infold (folds #2-20) again into 10 folds (e.g. inner_infold: folds #2-10, inner_oof: fold #1)
- Compute the inner out-of-fold values for the 10 folds (e.g. use the target mean of inner_infold #2-10 as the predicted value for inner_oof #1)
- Average the 10 inner out-of-fold values to get inner_oof_mean
- Compute oof_mean (e.g. use the inner_oof_mean of infold #2-20 to predict the oof_mean of out of fold #1)
- Map the oof_mean of the train data onto the test data to complete the encoding
Reference: Likelihood encoding of categorical features
Open source package category_encoders: scikit-learn-contrib/categorical-encoding
Code:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    # train -> training dataframe
    # test -> test dataframe
    # feature -> name of the categorical column to encode
    # target -> name of the target column
    n_folds = 20
    n_inner_folds = 10
    likelihood_coding_map = {}
    oof_default_mean = train[target].mean()  # global prior mean
    kf = KFold(n_splits=n_folds, shuffle=True)
    encoded_parts = []  # encoded values for each out-of-fold part of train
    fold_means = []     # per-category means from every (fold, inner fold) pair
    for split, (infold, oof) in enumerate(kf.split(train[feature])):
        print('============== level 1 encoding..., fold %s ============' % split)
        infold_df = train.iloc[infold]
        inner_kf = KFold(n_splits=n_inner_folds, shuffle=True)
        inner_oof_default_mean = infold_df[target].mean()
        inner_means = []
        for inner_split, (inner_infold, inner_oof) in enumerate(inner_kf.split(infold_df)):
            print('============== level 2 encoding..., inner fold %s ============' % inner_split)
            # per-category target mean on the inner in-fold rows
            # (inner_infold holds positions within infold_df, not within train)
            oof_mean = infold_df.iloc[inner_infold].groupby(feature)[target].mean()
            inner_means.append(oof_mean.rename('fold%s_%s' % (split, inner_split)))
        # outer-join the inner-fold means; a category missing from an
        # inner fold falls back to the in-fold mean
        inner_oof_mean_cv = pd.concat(inner_means, axis=1).fillna(inner_oof_default_mean)
        fold_means.append(inner_oof_mean_cv)
        print('============ final mapping... ===========')
        # encode the out-of-fold rows with the average of the inner-fold means;
        # categories unseen in the infold get the global prior mean
        inner_oof_mean = inner_oof_mean_cv.mean(axis=1)
        encoded_parts.append(train.iloc[oof][feature].map(inner_oof_mean).fillna(oof_default_mean))
    likelihood_encoded = pd.concat(encoded_parts)

    # map into test dataframe
    train[feature] = likelihood_encoded
    oof_mean_cv = pd.concat(fold_means, axis=1).fillna(oof_default_mean)
    likelihood_coding_mapping = oof_mean_cv.mean(axis=1)
    default_coding = oof_default_mean
    likelihood_coding_map[feature] = (likelihood_coding_mapping, default_coding)
    mapping, default_mean = likelihood_coding_map[feature]
    test[feature] = test[feature].map(mapping).fillna(default_mean)
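If the two-level scheme is hard to follow, it may help to look at the single-level version first. The sketch below is a simplified variant, not the two-level method described above: each training row is encoded with category means computed on the other folds only, and test rows use means from the full training set. The helper name `oof_target_encode`, the 5-fold default and the toy data are assumptions made for illustration.

```python
import pandas as pd
from sklearn.model_selection import KFold


def oof_target_encode(train, test, feature, target, n_splits=5, seed=0):
    """Single-level out-of-fold mean encoding (illustrative helper)."""
    prior = train[target].mean()  # fallback for unseen categories
    encoded = pd.Series(index=train.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for infold, oof in kf.split(train):
        # category means come only from rows outside the fold being encoded
        means = train.iloc[infold].groupby(feature)[target].mean()
        encoded.iloc[oof] = train.iloc[oof][feature].map(means).fillna(prior).values
    # test rows use means computed on the full training set
    full_means = train.groupby(feature)[target].mean()
    return encoded, test[feature].map(full_means).fillna(prior)


# Hypothetical toy data; "z" never appears in train.
train = pd.DataFrame({"city": list("aabbbcabca"),
                      "y": [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]})
test = pd.DataFrame({"city": ["a", "b", "z"]})
train_enc, test_enc = oof_target_encode(train, test, "city", "y")
print(test_enc.tolist())
```

No training row ever sees its own target value in its encoding, which is the leak the cross-validation is meant to prevent. The second level in the article adds further averaging on top of this idea.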
4. Beta target encoding
I first came across this method in the 14th-place solution write-up for the Kaggle competition Avito Demand Prediction Challenge: 14th Place Solution: The Almost Golden Defenders
Like target encoding, beta target encoding encodes a categorical feature with the target mean value (among each category). The difference is that, to further reduce target variable leak, beta target encoding happens inside the 5-fold CV rather than before it:
- split the train data into 5 folds (5-fold cross validation)
- target encoding based on infold data
- train model
- get out of fold prediction
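Beta target encoding also adds a smoothing term, replacing the mean with a Bayesian mean. The idea behind the Bayesian mean (Bayesian average): if a category has few rows (< N_min), its mean is noisy, so the data is padded to get a smoothing effect. Each padded data point takes the value of the prior mean. N_min acts as a regularization term: the larger N_min is, the stronger the regularization.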
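A small worked example may make the padding idea clearer. The function below is a minimal sketch of the Bayesian mean as described in this section; the name `bayesian_mean` and the numbers are made up for illustration.

```python
def bayesian_mean(target_sum, count, prior_mean, n_min):
    """Pad a category with prior_mean pseudo-observations up to n_min rows."""
    n_prior = max(n_min - count, 0)  # number of padded data points
    return (prior_mean * n_prior + target_sum) / (n_prior + count)


# Rare category: 2 rows, both positive. The raw mean would be 1.0.
rare = bayesian_mean(target_sum=2, count=2, prior_mean=0.5, n_min=10)

# Frequent category: 200 rows, 150 positive. No padding is needed.
frequent = bayesian_mean(target_sum=150, count=200, prior_mean=0.5, n_min=10)

print(rare, frequent)
```

The rare category is padded with 8 points of value 0.5, so its encoding is (0.5 * 8 + 2) / 10 = 0.6 rather than 1.0, while the frequent category keeps its raw mean of 0.75.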
Reference: Beta Target Encoding
Code:

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import KFold

    # train -> training dataframe
    # test -> test dataframe
    # N_min -> smoothing term, minimum sample size; if the sample size is less than N_min, add up to N_min
    # target_col -> target column
    # cat_cols -> categorical columns

    def bayesian_mean_encode(df, stats, var_name, prior_mean, N_min):
        # look up each row's category statistics; unseen categories get NaN
        df_stats = pd.merge(df[[var_name]], stats, how='left', on=var_name)
        # an unseen category is treated as one observation equal to the prior mean
        total = df_stats['sum'].fillna(prior_mean).values
        count = df_stats['count'].fillna(1.0).values
        N_prior = np.maximum(N_min - count, 0)  # prior parameters
        # Bayesian mean is equivalent to adding N_prior data points of value
        # prior_mean to the data set
        return (prior_mean * N_prior + total) / (N_prior + count)

    # Step 1: fill NA in train and test dataframe
    # Step 2: 5-fold CV (beta target encoding within each fold)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    for i, (dev_index, val_index) in enumerate(kf.split(train.index.values)):
        # split data into dev set and validation set (kf.split yields positions)
        dev = train.iloc[dev_index].reset_index(drop=True)
        val = train.iloc[val_index].reset_index(drop=True)
        feature_cols = []
        for var_name in cat_cols:
            feature_name = f'{var_name}_mean'
            feature_cols.append(feature_name)
            prior_mean = np.mean(dev[target_col])
            # per-category sum and count of the target, from the dev set only
            stats = dev.groupby(var_name)[target_col].agg(['sum', 'count']).reset_index()
            # beta target encoding by Bayesian average for dev, val and test sets
            dev[feature_name] = bayesian_mean_encode(dev, stats, var_name, prior_mean, N_min)
            val[feature_name] = bayesian_mean_encode(val, stats, var_name, prior_mean, N_min)
            test[feature_name] = bayesian_mean_encode(test, stats, var_name, prior_mean, N_min)
            del stats
        # Step 3: train model (K-fold CV), get oof prediction
Also, target encoding and beta target encoding do not have to use the target mean (or Bayesian mean). Other statistics work too, including median, frequency, mode, variance, skewness, and kurtosis, or any statistic that is correlated with the target.
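As a rough sketch of what swapping in other statistics could look like, pandas `groupby(...).agg` can compute several per-category statistics of the target in one pass. The toy data and the `city_y_*` column names are made up for illustration. In practice these statistics leak the target just like the mean does, so they would presumably be computed inside the same fold scheme.

```python
import pandas as pd

# Hypothetical toy data.
train = pd.DataFrame({
    "city": ["a", "a", "a", "b", "b", "b"],
    "y": [1.0, 2.0, 6.0, 4.0, 4.0, 10.0],
})

# Per-category median, frequency (count), variance and skewness of the target.
stats = train.groupby("city")["y"].agg(["median", "count", "var", "skew"])
stats.columns = [f"city_y_{name}" for name in stats.columns]

# Attach the statistics to every row as new numerical features.
train = train.join(stats, on="city")
print(train)
```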
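5. No processing (the model encodes automatically)
- XGBoost and Random Forest cannot handle categorical features directly; they must first be encoded as numerical features. (This was the case at the time of writing; more recent XGBoost releases have added categorical support behind the enable_categorical option.)
- LightGBM and CatBoost can handle categorical features directly.
  - LightGBM: label encoding is needed first. It uses a dedicated algorithm (On Grouping for Maximum Homogeneity) to find the optimal split, which works better than OHE. One-hot encoding can also be used instead. Features - LightGBM documentation
  - CatBoost: no label encoding is needed first. One-hot encoding or target encoding (with regularization) can be chosen. CatBoost — Transforming categorical features to numerical features — Yandex Technologies
Reference: https://towardsdatascience.com/catboost-vs-light-gbm-vs-xgboost-5f93620723db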
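For the "label encoding first" step, one common way to prepare a pandas frame is to cast the categorical columns to the `category` dtype. The sketch below only shows that preparation step on made-up data and does not train a model. As far as I know, LightGBM's Python package treats `category` columns as categorical by default (`categorical_feature='auto'`), while CatBoost takes the column names through its `cat_features` argument; check the respective documentation before relying on either.

```python
import pandas as pd

# Hypothetical toy frame with one categorical and one numerical column.
df = pd.DataFrame({
    "city": ["paris", "tokyo", "paris", "lima"],
    "x": [1.0, 2.0, 3.0, 4.0],
})
cat_cols = ["city"]

# Cast to pandas' category dtype; each category gets an integer code
# internally, which is the same idea as label encoding.
for col in cat_cols:
    df[col] = df[col].astype("category")

codes = df["city"].cat.codes
print(df.dtypes)
print(codes.tolist())
```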
Let's connect!
https://www.linkedin.com/in/liyinxiao/