
Selecting good features – Part I: univariate selection

Posted November 2, 2014

Having a good understanding of feature selection/ranking can be a great asset for a data scientist or machine learning practitioner. A good grasp of these methods leads to better performing models, a better understanding of the underlying structure and characteristics of the data, and better intuition about the algorithms that underlie many machine learning models.

There are in general two reasons why feature selection is used:

1. Reducing the number of features, to reduce overfitting and improve the generalization of models.

2. To gain a better understanding of the features and their relationship to the response variables.

These two goals are often at odds with each other and thus require different approaches: depending on the data at hand, a feature selection method that is good for goal (1) isn't necessarily good for goal (2) and vice versa. What seems to happen often, though, is that people use their favourite method (or whatever is most conveniently accessible from their tool of choice) indiscriminately, especially methods more suitable for (1) for achieving (2).

At the same time, feature selection is not particularly thoroughly covered in machine learning or data mining textbooks, partly because it is often looked at as a natural side effect of learning algorithms that doesn't require separate coverage.

In this blog post I'll try to cover some of the more popular approaches for feature selection and their pros, cons and gotchas, along with code samples in Python and scikit-learn. I'll also run the methods side-by-side on a sample dataset, which should highlight some of the major differences between them. In all my examples I concentrate on regression datasets, but most of the discussion and examples are equally applicable to classification datasets and methods.

Univariate feature selection

Univariate feature selection examines each feature individually to determine the strength of its relationship with the response variable. These methods are simple to run and understand and are in general particularly good for gaining a better understanding of the data (but not necessarily for optimizing the feature set for better generalization). There are a lot of different options for univariate selection.

Pearson Correlation

One of the simplest methods for understanding a feature's relation to the response variable is the Pearson correlation coefficient, which measures the linear correlation between two variables. The resulting value lies in [-1;1], with -1 meaning perfect negative correlation (as one variable increases, the other decreases), +1 meaning perfect positive correlation and 0 meaning no linear correlation between the two variables.

It's fast and easy to calculate and is often the first thing to be run on the data. SciPy's pearsonr method computes both the correlation and the p-value for the correlation, roughly showing the probability of an uncorrelated system creating a correlation value of this magnitude.


import numpy as np

from scipy.stats import pearsonr

np.random.seed(0)

size = 300

x = np.random.normal(0, 1, size)

print "Lower noise", pearsonr(x, x + np.random.normal(0, 1, size))

print "Higher noise", pearsonr(x, x + np.random.normal(0, 10, size))

Lower noise (0.71824836862138386, 7.3240173129992273e-49)

Higher noise (0.057964292079338148, 0.31700993885324746)

As you can see from the example, we compare a variable with a noisy version of itself. With a smaller amount of noise, the correlation is relatively strong with a very low p-value, while for the noisier comparison the correlation is very small and the p-value is high, meaning that it is quite likely to observe such a correlation on a dataset of this size purely by chance.

Scikit-learn provides the f_regression method for computing the p-values (and underlying F-values) in batch for a set of features, so it is a convenient way to run a correlation test on a dataset in one go and, for example, to include it in a sklearn pipeline.
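As a quick sketch of what that looks like in practice (my addition, not part of the original post; it assumes the Boston housing data used later in this series), f_regression returns the F-values and p-values for all features in one call:

from sklearn.datasets import load_boston
from sklearn.feature_selection import f_regression

boston = load_boston()
X, Y = boston["data"], boston["target"]

#One F-statistic and p-value per feature, computed against the response
f_values, p_values = f_regression(X, Y)
for name, f, p in zip(boston["feature_names"], f_values, p_values):
    print "%s: F=%.1f, p=%.2e" % (name, f, p)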

One obvious drawback of Pearson correlation as a feature ranking mechanism is that it is only sensitive to a linear relationship. If the relation is non-linear, Pearson correlation can be close to zero even if there is a 1-1 correspondence between the two variables.

For example, the correlation between x and x^2 is zero when x is centered on 0.

x = np.random.uniform(-1, 1, 100000)
print pearsonr(x, x**2)[0]

-0.00230804707612

For more examples similar to the above, see the following sample plots.

Furthermore, as illustrated by Anscombe's quartet, relying only on the correlation value for interpreting the relationship of two variables can be highly misleading, so it is always worth plotting the data.

Mutual information and maximal information coefficient (MIC)

Another, more robust option for correlation estimation is mutual information, defined as

$$I(X,Y) = \sum_{y \in Y} \sum_{x \in X} p(x,y) \log\left(\frac{p(x,y)}{p(x)\,p(y)}\right),$$

which measures mutual dependence between variables, typically in bits.

It can be inconvenient to use directly for feature ranking, for two reasons though. Firstly, it is not a metric and is not normalized (i.e. it doesn't lie in a fixed range), so MI values can be incomparable between two datasets. Secondly, it can be inconvenient to compute for continuous variables: in general the variables need to be discretized by binning, but the mutual information score can be quite sensitive to bin selection.
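To make the binning sensitivity concrete, here is a small sketch (my addition, not from the original post) that estimates MI for a noisy linear pair from a 2D histogram and shows how the estimate moves with the number of bins:

import numpy as np

def mi_binned(x, y, bins):
    #Estimate mutual information (in bits) from a joint histogram
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nonzero = pxy > 0
    return np.sum(pxy[nonzero] * np.log2(pxy[nonzero] / (px * py)[nonzero]))

np.random.seed(0)
x = np.random.normal(0, 1, 1000)
y = x + np.random.normal(0, 1, 1000)
#The same data gives noticeably different MI estimates under different binnings
for bins in [5, 10, 50]:
    print "bins=%d, MI estimate=%.3f" % (bins, mi_binned(x, y, bins))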

Maximal information coefficient is a technique developed to address these shortcomings. It searches for an optimal binning and turns the mutual information score into a metric that lies in the range [0;1]. In Python, MIC is available in the minepy library.

Looking back at the y=x^2 example, MIC finds that the mutual information is 1, i.e. maximal.

from minepy import MINE

m = MINE()

x = np.random.uniform(-1, 1, 10000)

m.compute_score(x, x**2)

print m.mic()

1.0

There has been some critique about MIC's statistical power, i.e. the ability to reject the null hypothesis when the null hypothesis is false. This may or may not be a concern, depending on the particular dataset and its noisiness.

Distance correlation

Another robust, competing method of correlation estimation is distance correlation, designed explicitly to address the shortcomings of Pearson correlation. While for Pearson correlation, the correlation value 0 does not imply independence (as we saw from the x vs x^2 example), distance correlation of 0 does imply that there is no dependence between the two variables.

Distance correlation is available, for example, in R's energy package (and there's also a Python gist).

#R-code

> x = runif(1000, -1, 1)

> dcor(x, x**2)

[1] 0.4943864
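For readers who prefer to stay in Python, here is a minimal NumPy sketch of distance correlation (my own illustration, not the gist referenced above), which should give a value in the same ballpark as the R output for the x vs x^2 example:

import numpy as np

def distance_correlation(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    #Pairwise distance matrices, double-centered
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvar_x, dvar_y = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

np.random.seed(0)
x = np.random.uniform(-1, 1, 1000)
print distance_correlation(x, x**2)   #clearly non-zero, unlike Pearson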

There are at least two reasons why one might prefer Pearson correlation over more sophisticated methods such as MIC or distance correlation when the relationship is close to linear. For one, computing the Pearson correlation is much faster, which may be important in the case of big datasets. Secondly, the range of the correlation coefficient is [-1;1] (instead of [0;1] for MIC and distance correlation). This can convey useful extra information on whether the relationship is negative or positive, i.e. whether higher feature values imply higher values of the response variable or vice versa. But of course the question of negative versus positive correlation is only well-posed if the relationship between the two variables is monotonic.

Model based ranking

Finally, one can use an arbitrary machine learning method to build a predictive model for the response variable using each individual feature, and measure the performance of each model. In fact, this is already put to use with Pearson's correlation coefficient, since it is equivalent to the standardized regression coefficient used for prediction in linear regression. If the relationship between a feature and the response variable is non-linear, there are a number of alternatives, for example tree based methods (decision trees, random forests), linear models with basis expansion, etc. Tree based methods are probably among the easiest to apply, since they can model non-linear relations well and don't require much tuning. The main thing to avoid is overfitting, so the depth of the tree(s) should be relatively small, and cross-validation should be applied.

Here's univariate selection using sklearn's random forest regressor on the Boston housing price dataset, a sample which includes housing prices in suburbs of Boston together with a number of key attributes.


from sklearn.cross_validation import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
import numpy as np

#Load boston housing dataset as an example
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor(n_estimators=20, max_depth=4)
scores = []
for i in range(X.shape[1]):
    score = cross_val_score(rf, X[:, i:i+1], Y, scoring="r2",
                            cv=ShuffleSplit(len(X), 3, .3))
    scores.append((round(np.mean(score), 3), names[i]))
print sorted(scores, reverse=True)


[(0.636, LSTAT), (0.59, RM), (0.472, NOX), (0.369, INDUS), (0.311, PTRATIO), (0.24, TAX), (0.24, CRIM), (0.185, RAD), (0.16, ZN), (0.087, B), (0.062, DIS), (0.036, CHAS), (0.027, AGE)]

Summary

Univariate feature selection is generally best for gaining a better understanding of the data, its structure and characteristics. It can work for selecting the top features for model improvement in some settings, but since it is unable to remove redundancy (for example, selecting only the best feature among a subset of strongly correlated features), this task is better left for other methods. Which brings us to the next topic: Selecting good features Part II: Linear models and regularization.

Selecting good features – Part II: linear models and regularization

Posted November 12, 2014

In my previous post I discussed univariate feature selection, where each feature is evaluated independently with respect to the response variable. Another popular approach is to utilize machine learning models for feature ranking. Many machine learning models have either some inherent internal ranking of features, or it is easy to generate the ranking from the structure of the model. This applies to regression models, SVMs, decision trees, random forests, etc.

In this post, I will discuss using coefficients of regression models for selecting and interpreting features. This is based on the idea that when all features are on the same scale, the most important features should have the highest coefficients in the model, while features uncorrelated with the output variables should have coefficient values close to zero. This approach can work well even with simple linear regression models when the data is not very noisy (or there is a lot of data compared to the number of features) and the features are (relatively) independent:

from sklearn.linear_model import LinearRegression
import numpy as np

np.random.seed(0)
size = 5000

#A dataset with 3 features
X = np.random.normal(0, 1, (size, 3))
#Y = X0 + 2*X1 + noise
Y = X[:,0] + 2*X[:,1] + np.random.normal(0, 2, size)

lr = LinearRegression()
lr.fit(X, Y)

#A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names = None, sort = False):
    if names == None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key = lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

print "Linear model:", pretty_print_linear(lr.coef_)

Linear model: 0.984 * X0 + 1.995 * X1 + -0.041 * X2

As you can see in this example, the model indeed recovers the underlying structure of the data very well, despite quite significant noise in the data. But actually, this learning problem was particularly well suited for a linear model: purely linear relationship between features and the response variable, and no correlations between features.

When there are multiple (linearly) correlated features (as is the case with very many real life datasets), the model becomes unstable, meaning that small changes in the data can cause large changes in the model (i.e. coefficient values), making model interpretation very difficult (the so-called multicollinearity problem). For example, assume we have a dataset where the "true" model for the data is Y = X1 + X2, while we observe Ŷ = X1 + X2 + ε, with ε being noise. Furthermore, assume that X1 and X2 are linearly correlated such that X1 ≈ X2. Ideally the learned model will be Y = X1 + X2. But depending on the amount of noise ε, the amount of data at hand and the correlation between X1 and X2, it could also be Y = 2*X1 (i.e. using only X1 as the predictor) or Y = -X1 + 3*X2 (shifting the coefficients might happen to give a better fit on the noisy training set), etc.

Let's look at the same correlated dataset as in the random forest example, with some noise added.

from sklearn.linear_model import LinearRegression

size = 100

np.random.seed(seed=5)

X_seed = np.random.normal(0, 1, size)

X1 = X_seed + np.random.normal(0, .1, size)

X2 = X_seed + np.random.normal(0, .1, size)

X3 = X_seed + np.random.normal(0, .1, size)

Y = X1 + X2 + X3 + np.random.normal(0,1, size)

X = np.array([X1, X2, X3]).T

lr = LinearRegression()

lr.fit(X,Y)

print "Linear model:", pretty_print_linear(lr.coef_)

Linear model: -1.291 * X0 + 1.591 * X1 + 2.747 * X2

The coefficients sum up to ~3, so we can expect the learned model to perform well. On the other hand, if we were to interpret the coefficients at face value, then according to the model X3 (printed as X2 above) has a strong positive impact on the output variable, while X1 (printed as X0) has a negative one. In reality, all features are correlated almost equally with the output variable.

The same method and concerns apply to other similar linear methods, for example logistic regression.

Regularized models

Regularization is a method for adding additional constraints or penalty to a model, with the goal of preventing overfitting and improving generalization. Instead of minimizing a loss function E(X,Y), the loss function to minimize becomes E(X,Y) + α‖w‖, where w is the vector of model coefficients, ‖·‖ is typically the L1 or L2 norm and α is a tunable free parameter, specifying the amount of regularization (so α=0 implies an unregularized model). For regression models, the two widely used regularization methods are L1 and L2 regularization, also called lasso and ridge regression when applied in linear regression.

L1 regularization / Lasso

L1 regularization adds the L1-norm penalty $\alpha \sum_{i=1}^{n} |w_i|$ to the loss function. Since each non-zero coefficient adds to the penalty, it forces weak features to have zero as coefficients. Thus L1 regularization produces sparse solutions, inherently performing feature selection.

Scikit-learn offers Lasso for linear regression, and logistic regression with an L1 penalty for classification.

Let's run Lasso on the Boston housing dataset with a good α (which can be found for example via grid search):

from sklearn.linear_model import Lasso

from sklearn.preprocessing import StandardScaler

from sklearn.datasets import load_boston

boston = load_boston()

scaler = StandardScaler()

X = scaler.fit_transform(boston["data"])

Y = boston["target"]

names = boston["feature_names"]

lasso = Lasso(alpha=.3)

lasso.fit(X, Y)

print "Lasso model: ", pretty_print_linear(lasso.coef_, names, sort = True)

Lasso model: -3.707 * LSTAT + 2.992 * RM + -1.757 * PTRATIO + -1.081 * DIS + -0.7 * NOX + 0.631 * B + 0.54 * CHAS + -0.236 * CRIM + 0.081 * ZN + -0.0 * INDUS + -0.0 * AGE + 0.0 * RAD + -0.0 * TAX

We see that a number of features have coefficient 0. If we increased α further, the solution would become sparser and sparser, i.e. more and more features would have 0 as coefficients.
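As a quick check of that behaviour (a sketch added here, re-loading the same scaled Boston data as above), one can sweep over α and count how many coefficients survive:

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

boston = load_boston()
X = StandardScaler().fit_transform(boston["data"])
Y = boston["target"]

#Larger alpha -> stronger penalty -> fewer non-zero coefficients
for alpha in [0.1, 0.3, 1.0, 3.0]:
    lasso = Lasso(alpha=alpha).fit(X, Y)
    print "alpha=%.1f, non-zero coefficients: %d" % (alpha, np.sum(lasso.coef_ != 0))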

Note however that L1 regularized regression is unstable in a similar way as unregularized linear models are, meaning that the coefficients (and thus feature ranks) can vary significantly even on small data changes when there are correlated features in the data. Which brings us to L2 regularization.

L2 regularization / Ridge regression

L2 regularization (called ridge regression for linear regression) adds the L2-norm penalty $\alpha \sum_{i=1}^{n} w_i^2$ to the loss function.

Since the coefficients are squared in the penalty expression, it has a different effect from the L1-norm, namely it forces the coefficient values to be spread out more equally. For correlated features, it means that they tend to get similar coefficients. Coming back to the example of Y = X1 + X2, with strongly correlated X1 and X2: for L1, the penalty is the same whether the learned model is Y = 1*X1 + 1*X2 or Y = 2*X1 + 0*X2; in both cases the penalty is $2\alpha$. For L2 however, the first model's penalty is $\alpha(1^2 + 1^2) = 2\alpha$, while the second model is penalized with $\alpha(2^2 + 0^2) = 4\alpha$.

The effect of this is that models are much more stable (coefficients do not fluctuate on small data changes as is the case with unregularized or L1 models). So while L2 regularization does not perform feature selection the same way as L1 does, it is more useful for feature *interpretation*: a predictive feature will get a non-zero coefficient, which is often not the case with L1.

Let's look at the example with three correlated features again, running the example 10 times with different random seeds, to emphasize the stability of ridge (L2) regression.

from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

size = 100

#We run the method 10 times with different random seeds
for i in range(10):
    print "Random seed %s" % i
    np.random.seed(seed=i)
    X_seed = np.random.normal(0, 1, size)
    X1 = X_seed + np.random.normal(0, .1, size)
    X2 = X_seed + np.random.normal(0, .1, size)
    X3 = X_seed + np.random.normal(0, .1, size)
    Y = X1 + X2 + X3 + np.random.normal(0, 1, size)
    X = np.array([X1, X2, X3]).T

    lr = LinearRegression()
    lr.fit(X,Y)
    print "Linear model:", pretty_print_linear(lr.coef_)

    ridge = Ridge(alpha=10)
    ridge.fit(X,Y)
    print "Ridge model:", pretty_print_linear(ridge.coef_)
    print

Random seed 0

Linear model: 0.728 * X0 + 2.309 * X1 + -0.082 * X2

Ridge model: 0.938 * X0 + 1.059 * X1 + 0.877 * X2

Random seed 1

Linear model: 1.152 * X0 + 2.366 * X1 + -0.599 * X2

Ridge model: 0.984 * X0 + 1.068 * X1 + 0.759 * X2

Random seed 2

Linear model: 0.697 * X0 + 0.322 * X1 + 2.086 * X2

Ridge model: 0.972 * X0 + 0.943 * X1 + 1.085 * X2

Random seed 3

Linear model: 0.287 * X0 + 1.254 * X1 + 1.491 * X2

Ridge model: 0.919 * X0 + 1.005 * X1 + 1.033 * X2

Random seed 4

Linear model: 0.187 * X0 + 0.772 * X1 + 2.189 * X2

Ridge model: 0.964 * X0 + 0.982 * X1 + 1.098 * X2

Random seed 5

Linear model: -1.291 * X0 + 1.591 * X1 + 2.747 * X2

Ridge model: 0.758 * X0 + 1.011 * X1 + 1.139 * X2

Random seed 6

Linear model: 1.199 * X0 + -0.031 * X1 + 1.915 * X2

Ridge model: 1.016 * X0 + 0.89 * X1 + 1.091 * X2

Random seed 7

Linear model: 1.474 * X0 + 1.762 * X1 + -0.151 * X2

Ridge model: 1.018 * X0 + 1.039 * X1 + 0.901 * X2

Random seed 8

Linear model: 0.084 * X0 + 1.88 * X1 + 1.107 * X2

Ridge model: 0.907 * X0 + 1.071 * X1 + 1.008 * X2

Random seed 9

Linear model: 0.714 * X0 + 0.776 * X1 + 1.364 * X2

Ridge model: 0.896 * X0 + 0.903 * X1 + 0.98 * X2

As you can see from the example, the coefficients can vary widely for linear regression, depending on the generated data. For L2 regularized model however, the coefficients are quite stable and closely reflect how the data was generated (all coefficients close to 1).

Summary

Regularized linear models are a powerful set of tools for feature interpretation and selection. Lasso produces sparse solutions and as such is very useful for selecting a strong subset of features for improving model performance. Ridge regression on the other hand can be used for data interpretation due to its stability and the fact that useful features tend to have non-zero coefficients. Since the relationship between the response variable and features is often non-linear, basis expansion can be used to convert features into a more suitable space, while keeping the simple linear models fully applicable.
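As a brief illustration of that last point (my sketch, not part of the original post): scikit-learn's PolynomialFeatures can generate such an expanded basis, after which the same regularized linear models apply unchanged.

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

np.random.seed(0)
size = 1000
X = np.random.uniform(-1, 1, (size, 3))
#A non-linear response: quadratic in the first feature, linear in the second
Y = X[:, 0]**2 + X[:, 1] + np.random.normal(0, .1, size)

#Expand the features with squares and pairwise products, then fit Lasso as usual
poly = PolynomialFeatures(degree=2, include_bias=False)
X_expanded = StandardScaler().fit_transform(poly.fit_transform(X))
lasso = Lasso(alpha=0.05).fit(X_expanded, Y)
#With a reasonable alpha, the coefficients for the squared first feature
#and the linear second feature should dominate
print np.round(lasso.coef_, 2)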

Next up: Selecting good features Part III: tree based methods.

Selecting good features – Part III: random forests

In my previous posts, I looked at univariate feature selection and linear models and regularization for feature selection.

In this post, I'll discuss random forests, another popular approach for feature ranking.

Random forest feature importance

Random forests are among the most popular machine learning methods thanks to their relatively good accuracy, robustness and ease of use. They also provide two straightforward methods for feature selection: mean decrease impurity and mean decrease accuracy.

Mean decrease impurity

A random forest consists of a number of decision trees. Every node in the decision trees is a condition on a single feature, designed to split the dataset into two so that similar response values end up in the same set. The measure based on which the (locally) optimal condition is chosen is called impurity. For classification it is typically either Gini impurity or information gain/entropy, and for regression trees it is variance. Thus when training a tree, one can compute how much each feature decreases the weighted impurity in the tree. For a forest, the impurity decrease from each feature can be averaged and the features ranked according to this measure.

This is the feature importance measure exposed in sklearn's Random Forest implementations (random forest classifier and random forest regressor).

from sklearn.datasets import load_boston

from sklearn.ensemble import RandomForestRegressor

import numpy as np

#Load boston housing dataset as an example

boston = load_boston()

X = boston["data"]

Y = boston["target"]

names = boston["feature_names"]

rf = RandomForestRegressor()

rf.fit(X, Y)

print "Features sorted by their score:"

print sorted(zip(map(lambda x: round(x, 4), rf.feature_importances_), names),

reverse=True)

Features sorted by their score:

[(0.5298, LSTAT), (0.4116, RM), (0.0252, DIS), (0.0172, CRIM), (0.0065, NOX), (0.0035, PTRATIO), (0.0021, TAX), (0.0017, AGE), (0.0012, B), (0.0008, INDUS), (0.0004, RAD), (0.0001, CHAS), (0.0, ZN)]

There are a few things to keep in mind when using the impurity based ranking. Firstly, feature selection based on impurity reduction is biased towards preferring variables with more categories (see Bias in random forest variable importance measures). Secondly, when the dataset has two (or more) correlated features, then from the point of view of the model, any of these correlated features can be used as the predictor, with no concrete preference of one over the others. But once one of them is used, the importance of others is significantly reduced since effectively the impurity they can remove is already removed by the first feature. As a consequence, they will have a lower reported importance. This is not an issue when we want to use feature selection to reduce overfitting, since it makes sense to remove features that are mostly duplicated by other features. But when interpreting the data, it can lead to the incorrect conclusion that one of the variables is a strong predictor while the others in the same group are unimportant, while actually they are very close in terms of their relationship with the response variable.

The effect of this phenomenon is somewhat reduced thanks to the random selection of features at each node creation, but in general the effect is not removed completely. In the following example, we have three correlated variables X0, X1, X2, and no noise in the data, with the output variable simply being the sum of the three features:

size = 10000

np.random.seed(seed=10)

X_seed = np.random.normal(0, 1, size)

X0 = X_seed + np.random.normal(0, .1, size)

X1 = X_seed + np.random.normal(0, .1, size)

X2 = X_seed + np.random.normal(0, .1, size)

X = np.array([X0, X1, X2]).T

Y = X0 + X1 + X2

rf = RandomForestRegressor(n_estimators=20, max_features=2)

rf.fit(X, Y);

print "Scores for X0, X1, X2:", map(lambda x:round (x,3),

rf.feature_importances_)

Scores for X0, X1, X2: [0.278, 0.66, 0.062]

When we compute the feature importances, we see that X1 is computed to have over 10x higher importance than X2, while their "true" importance is very similar. This happens despite the fact that the data is noiseless, we use 20 trees, random selection of features (at each split, only two of the three features are considered) and a sufficiently large dataset.

One thing to point out though is that the difficulty of interpreting the importance/ranking of correlated variables is not random forest specific, but applies to most model based feature selection methods.

Mean decrease accuracy

Another popular feature selection method is to directly measure the impact of each feature on accuracy of the model. The general idea is to permute the values of each feature and measure how much the permutation decreases the accuracy of the model. Clearly, for unimportant variables, the permutation should have little to no effect on model accuracy, while permuting important variables should significantly decrease it.

This method is not directly exposed in sklearn, but it is straightforward to implement it. Continuing from the previous example of ranking the features in the Boston housing dataset:

from sklearn.cross_validation import ShuffleSplit
from sklearn.metrics import r2_score
from collections import defaultdict

X = boston["data"]
Y = boston["target"]

rf = RandomForestRegressor()
scores = defaultdict(list)

#crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(len(X), 100, .3):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    r = rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc-shuff_acc)/acc)
print "Features sorted by their score:"
print sorted([(round(np.mean(score), 4), feat) for
              feat, score in scores.items()], reverse=True)

Features sorted by their score:

[(0.7276, LSTAT), (0.5675, RM), (0.0867, DIS), (0.0407, NOX), (0.0351, CRIM), (0.0233, PTRATIO), (0.0168, TAX), (0.0122, AGE), (0.005, B), (0.0048, INDUS), (0.0043, RAD), (0.0004, ZN), (0.0001, CHAS)]

In this example LSTAT and RM are the two features that strongly impact model performance: permuting them decreases model performance by ~73% and ~57% respectively. Keep in mind though that these measurements are made only after the model has been trained on (and depends on) all of these features. This doesn't mean that if we trained the model without one of these features, the model performance would drop by that amount, since other, correlated features could be used instead.
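To make that last point concrete, here is a small sketch (added for this write-up) that drops LSTAT entirely, retrains, and compares cross-validated R^2 against the full model:

import numpy as np
from sklearn.cross_validation import cross_val_score, ShuffleSplit
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
X, Y, names = boston["data"], boston["target"], boston["feature_names"]
rf = RandomForestRegressor()

full_score = np.mean(cross_val_score(rf, X, Y, scoring="r2",
                                     cv=ShuffleSplit(len(X), 10, .3)))

#Retrain without LSTAT: expect a much smaller drop than the ~73% permutation
#score suggests, since correlated features partly take over its role
idx = list(names).index("LSTAT")
X_reduced = np.delete(X, idx, axis=1)
reduced_score = np.mean(cross_val_score(rf, X_reduced, Y, scoring="r2",
                                        cv=ShuffleSplit(len(X), 10, .3)))

print "R^2 with all features: %.3f" % full_score
print "R^2 without LSTAT: %.3f" % reduced_score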

Summary

Random forests are a popular method for feature ranking, since they are so easy to apply: in general they require very little feature engineering and parameter tuning and mean decrease impurity is exposed in most random forest libraries. But they come with their own gotchas, especially when data interpretation is concerned. With correlated features, strong features can end up with low scores and the method can be biased towards variables with many categories. As long as the gotchas are kept in mind, there really is no reason not to try them out on your data.

Next up: Stability selection, recursive feature elimination, and an example comparing all discussed methods side by side.

Selecting good features – Part IV: stability selection, RFE and everything side by side

Posted December 20, 2014

In my previous posts, I looked at univariate methods, linear models and regularization, and random forests for feature selection.

In this post, I'll look at two other methods: stability selection and recursive feature elimination (RFE), which can both be considered wrapper methods. They both build on top of other (model based) selection methods such as regression or SVM, building models on different subsets of data and extracting the ranking from the aggregates.

As a wrap-up, I'll run all of the previously discussed methods to highlight their pros, cons and gotchas with respect to each other.

Stability selection

Stability selection is a relatively novel method for feature selection, based on subsampling in combination with selection algorithms (which could be regression, SVMs or another similar method). The high level idea is to apply a feature selection algorithm on different subsets of the data and with different subsets of features. After repeating the process a number of times, the selection results can be aggregated, for example by checking how many times a feature ended up being selected as important when it was in an inspected feature subset. We can expect strong features to have scores close to 100%, since they are always selected when possible. Weaker but still relevant features will also have non-zero scores, since they will be selected when stronger features are not present in the currently selected subset, while irrelevant features would have scores (close to) zero, since they would never be among the selected features.
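Before turning to the sklearn classes, here is a bare-bones sketch of the idea itself (my own illustration, not the actual stability selection algorithm, which also randomly rescales the penalty per feature): fit Lasso on many random subsamples and count how often each feature keeps a non-zero coefficient.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

boston = load_boston()
X = StandardScaler().fit_transform(boston["data"])
Y = boston["target"]
names = boston["feature_names"]

np.random.seed(0)
n_runs = 100
counts = np.zeros(X.shape[1])
for _ in range(n_runs):
    #Fit Lasso on a random half of the rows and record which features survive
    idx = np.random.choice(len(X), size=len(X) / 2, replace=False)
    lasso = Lasso(alpha=0.3).fit(X[idx], Y[idx])
    counts += (lasso.coef_ != 0)

print sorted(zip(np.round(counts / n_runs, 2), names), reverse=True)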

Sklearn implements stability selection in the randomized lasso and randomized logistic regression classes.

from sklearn.linear_model import RandomizedLasso

from sklearn.datasets import load_boston

boston = load_boston()

#using the Boston housing data.

#Data gets scaled automatically by sklearn's implementation

X = boston["data"]

Y = boston["target"]

names = boston["feature_names"]

rlasso = RandomizedLasso(alpha=0.025)

rlasso.fit(X, Y)

print "Features sorted by their score:"

print sorted(zip(map(lambda x: round(x, 4), rlasso.scores_),

names), reverse=True)

Features sorted by their score:

[(1.0, RM), (1.0, PTRATIO), (1.0, LSTAT), (0.62, CHAS), (0.595, B), (0.39, TAX), (0.385, CRIM), (0.25, DIS), (0.22, NOX), (0.125, INDUS), (0.045, ZN), (0.02, RAD), (0.015, AGE)]

As you can see from the example, the top 3 features have equal scores of 1.0, meaning they were always selected as useful features (of course this could and would change when changing the regularization parameter, but sklearn's randomized lasso implementation can choose a good α parameter automatically). The scores drop smoothly from there, but in general the drop-off is not as sharp as is often the case with pure lasso or random forest. This means stability selection is useful both for pure feature selection to reduce overfitting and for data interpretation: in general, good features won't get 0 as coefficients just because there are similar, correlated features in the dataset (as is the case with lasso). For feature selection, I've found it to be among the top performing methods for many different datasets and settings.

Recursive feature elimination

Recursive feature elimination is based on the idea of repeatedly constructing a model (for example an SVM or a regression model) and choosing either the best or the worst performing feature (for example based on its coefficient), setting that feature aside, and then repeating the process with the rest of the features. This process is applied until all features in the dataset are exhausted. Features are then ranked according to when they were eliminated. As such, it is a greedy optimization for finding the best performing subset of features.

The stability of RFE depends heavily on the type of model that is used for feature ranking at each iteration. Just as non-regularized regression can be unstable, so can RFE when utilizing it, while using ridge regression can provide more stable results.

Sklearn provides RFE for recursive feature elimination and RFECV for finding the ranks together with the optimal number of features via a cross-validation loop.

from sklearn.feature_selection import RFE

from sklearn.linear_model import LinearRegression

boston = load_boston()

X = boston["data"]

Y = boston["target"]

names = boston["feature_names"]

#use linear regression as the model

lr = LinearRegression()

#rank all features, i.e continue the elimination until the last one

rfe = RFE(lr, n_features_to_select=1)

rfe.fit(X,Y)

print "Features sorted by their rank:"

print sorted(zip(map(lambda x: round(x, 4), rfe.ranking_), names))

Features sorted by their rank:

[(1.0, NOX), (2.0, RM), (3.0, CHAS), (4.0, PTRATIO), (5.0, DIS), (6.0, LSTAT), (7.0, RAD), (8.0, CRIM), (9.0, INDUS), (10.0, ZN), (11.0, TAX), (12.0, B), (13.0, AGE)]
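RFECV, mentioned above, wraps the same procedure in a cross-validation loop to also pick how many features to keep; a minimal sketch on the same data (assuming a scikit-learn version where RFECV accepts a scoring argument):

from sklearn.datasets import load_boston
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LinearRegression

boston = load_boston()
X, Y, names = boston["data"], boston["target"], boston["feature_names"]

#Let cross-validation decide how many features are worth keeping
rfecv = RFECV(LinearRegression(), cv=5, scoring="r2")
rfecv.fit(X, Y)
print "Optimal number of features:", rfecv.n_features_
print "Selected features:", [n for n, s in zip(names, rfecv.support_) if s]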

Example: running the methods side by side

I'll now take all the examples from this post and the three previous ones and run the methods on a sample dataset to compare them side by side. The dataset will be the so-called Friedman #1 regression dataset (from Friedman's Multivariate Adaptive Regression Splines paper). The data is generated according to the formula $y = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10 x_4 + 5 x_5 + \epsilon$, where $x_1$ to $x_5$ are drawn from a uniform distribution and $\epsilon$ is the standard normal deviate N(0,1). Additionally, the original dataset had five noise variables $x_6, \dots, x_{10}$, independent of the response variable. We will increase the number of variables further and add four variables $x_{11}, \dots, x_{14}$, each of which is very strongly correlated with $x_1, \dots, x_4$, respectively, generated by $f(x) = x + N(0, 0.01)$. This yields a correlation coefficient of more than 0.999 between the variables. This will illustrate how different feature ranking methods deal with correlations in the data.

We'll run each of the above listed methods on the dataset and normalize the scores so that they are between 0 (for the lowest ranked feature) and 1 (for the highest ranked feature). For recursive feature elimination, the top five features will all get a score of 1, with the remaining ranks spaced equally between 0 and 1 according to their rank.

from sklearn.datasets import load_boston
from sklearn.linear_model import (LinearRegression, Ridge,
                                  Lasso, RandomizedLasso)
from sklearn.feature_selection import RFE, f_regression
from sklearn.preprocessing import MinMaxScaler
from sklearn.ensemble import RandomForestRegressor
import numpy as np
from minepy import MINE

np.random.seed(0)
size = 750
X = np.random.uniform(0, 1, (size, 14))

#"Friedman #1" regression problem
Y = (10 * np.sin(np.pi*X[:,0]*X[:,1]) + 20*(X[:,2] - .5)**2 +
     10*X[:,3] + 5*X[:,4] + np.random.normal(0,1))
#Add 4 additional correlated variables (correlated with X1-X4)
X[:,10:] = X[:,:4] + np.random.normal(0, .025, (size,4))

names = ["x%s" % i for i in range(1,15)]

ranks = {}

def rank_to_dict(ranks, names, order=1):
    minmax = MinMaxScaler()
    ranks = minmax.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks))

lr = LinearRegression(normalize=True)
lr.fit(X, Y)
ranks["Linear reg"] = rank_to_dict(np.abs(lr.coef_), names)

ridge = Ridge(alpha=7)
ridge.fit(X, Y)
ranks["Ridge"] = rank_to_dict(np.abs(ridge.coef_), names)

lasso = Lasso(alpha=.05)
lasso.fit(X, Y)
ranks["Lasso"] = rank_to_dict(np.abs(lasso.coef_), names)

rlasso = RandomizedLasso(alpha=0.04)
rlasso.fit(X, Y)
ranks["Stability"] = rank_to_dict(np.abs(rlasso.scores_), names)

#stop the search when 5 features are left (they will get equal scores)
rfe = RFE(lr, n_features_to_select=5)
rfe.fit(X,Y)
ranks["RFE"] = rank_to_dict(map(float, rfe.ranking_), names, order=-1)

rf = RandomForestRegressor()
rf.fit(X,Y)
ranks["RF"] = rank_to_dict(rf.feature_importances_, names)

f, pval = f_regression(X, Y, center=True)
ranks["Corr."] = rank_to_dict(f, names)

mine = MINE()
mic_scores = []
for i in range(X.shape[1]):
    mine.compute_score(X[:,i], Y)
    m = mine.mic()
    mic_scores.append(m)
ranks["MIC"] = rank_to_dict(mic_scores, names)

r = {}
for name in names:
    r[name] = round(np.mean([ranks[method][name]
                             for method in ranks.keys()]), 2)

methods = sorted(ranks.keys())
ranks["Mean"] = r
methods.append("Mean")

print " %s" % " ".join(methods)
for name in names:
    print "%s %s" % (name, " ".join(map(str,
                     [ranks[method][name] for method in methods])))

Here's the resulting table, with the results from each method plus the mean:

FEATURE   LIN. CORR.   LINEAR REG.   LASSO   MIC    RF     RFE    RIDGE   STABILITY   MEAN
x1        0.3          1.0           0.79    0.39   0.18   1.0    0.77    0.61        0.63
x2        0.44         0.56          0.83    0.61   0.24   1.0    0.75    0.7         0.64
x3        0.0          0.5           0.0     0.34   0.01   1.0    0.05    0.0         0.24
x4        1.0          0.57          1.0     1.0    0.45   1.0    1.0     1.0         0.88
x5        0.1          0.27          0.51    0.2    0.04   0.78   0.88    0.6         0.42
x6        0.0          0.02          0.0     0.0    0.0    0.44   0.05    0.0         0.06
x7        0.01         0.0           0.0     0.07   0.0    0.0    0.01    0.0         0.01
x8        0.02         0.03          0.0     0.05   0.0    0.56   0.09    0.0         0.09
x9        0.01         0.0           0.0     0.09   0.0    0.11   0.0     0.0         0.03
x10       0.0          0.01          0.0     0.04   0.0    0.33   0.01    0.0         0.05
x11       0.29         0.6           0.0     0.43   0.14   1.0    0.59    0.39        0.43
x12       0.44         0.14          0.0     0.71   0.12   0.67   0.68    0.42        0.4
x13       0.0          0.48          0.0     0.23   0.01   0.89   0.02    0.0         0.2
x14       0.99         0.0           0.16    1.0    1.0    0.22   0.95    0.53        0.61

The example should highlight some of the interesting characteristics of the different methods.

With linear correlation (Lin. corr.), each feature is evaluated independently, so the scores for features x1…x4 are very similar to those for x11…x14, while the noise features x6…x10 are correctly identified as having almost no relation with the response variable. Linear correlation is not able to identify any relationship between x3 and the response variable, since that relationship is quadratic (in fact, this applies to almost all of the other methods except MIC). It's also clear that while the method is able to measure the linear relationship between each feature and the response variable, it is not optimal for selecting the top performing features for improving the generalization of a model, since all top performing features would essentially be picked twice.

Lasso picks out the top performing features, while forcing other features to be close to zero. It is clearly useful when reducing the number of features is required, but not necessarily for data interpretation (since it might lead one to believe that features x11…x13 do not have a strong relationship with the output variable).

MIC is similar to the correlation coefficient in treating all features "equally"; additionally, it is able to find the non-linear relationship between x3 and the response.

Random forest's impurity based ranking is typically aggressive in the sense that there is a sharp drop-off of scores after the first few top ones. This can be seen in the example, where the third ranked feature already has a score 4x smaller than that of the top feature (whereas for the other ranking methods the drop-off is clearly not that aggressive).

Ridge regression forces regression coefficients to spread out similarly between correlated variables. This is clearly visible in the example, where x11…x14 are close to x1…x4 in terms of scores.

Stability selection is often able to make a useful compromise between data interpretation and top feature selection for model improvement. This is illustrated well in the example. Just like Lasso it is able to identify the top features (x1, x2, x4, x5). At the same time their correlated shadow variables also get a high score, illustrating their relation with the response.

Conclusions

Feature ranking can be incredibly useful in a number of machine learning and data mining scenarios. The key though is to have the end goal clearly in mind and to understand which method works best for achieving it. When selecting top features for model performance improvement, it is easy to verify whether a particular method works well against alternatives simply by doing cross-validation. It's not as straightforward when using feature ranking for data interpretation, where the stability of the ranking method is crucial and a method that doesn't have this property (such as lasso) could easily lead to incorrect conclusions. What can help there is subsampling the data and running the selection algorithms on the subsets. If the results are consistent across the subsets, it is relatively safe to trust the stability of the method on this particular data, and therefore straightforward to interpret the data in terms of the ranking.
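As a rough sketch of that sanity check (added here, not in the original post): run the same ranking method on a few random halves of the data and eyeball whether the top of the ordering stays put.

import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor

boston = load_boston()
X, Y, names = boston["data"], boston["target"], boston["feature_names"]

np.random.seed(0)
for run in range(3):
    #Rank features on a random half of the data
    idx = np.random.choice(len(X), size=len(X) / 2, replace=False)
    rf = RandomForestRegressor(n_estimators=50).fit(X[idx], Y[idx])
    top = sorted(zip(rf.feature_importances_, names), reverse=True)[:5]
    print "Subset %d, top features: %s" % (run, [n for _, n in top])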

