Never do this mistake when using Feature Selection
Feature selection is a powerful way of reducing the complexity of a machine learning or statistical model. But feature selection must be done in the right way, or you may end up cheating yourself or others. Later we will also see if green jelly beans are truly associated with acne.
When working with machine learning, it may be a good idea to reduce the number of used features in the model. Maybe the features or measurements are costly to achieve and using fewer features in the model can ease interpretation. Additionally the model will use less computational resources and time of execution.
There is however an inherent danger of (un?)willingly introduce a bias into the data modelling. This example script demonstrates the effect. First some imports is done. Numpy for data arrays and numeric computing. Some scikit-learn (sklearn) modules for feature selection and model building and matplotlib for plotting.
import numpy as np from sklearn.feature_selection import SelectKBest, f_regression from sklearn.linear_model import Ridge from sklearn.cross_validation import train_test_split import matplotlib.pyplot as plt
Numerical python is used to generate some data to play with.
# Some dataset to play with X = np.random.randn(80,10000) y = np.random.randn(80)
For feature selection I use the sklearn utilities. I use the SelectKbest, which selects the specified number of features based on the passed test, here the f_regression test also from the sklearn package. All features are evaluated each on their own with the test and ranked according to the f statistical regression test. The 40 best features are selected and a new X data block is build. All with one line of code.
#Feature Selection X_selected = SelectKBest(f_regression, k=40).fit_transform(X, y)
The next steps more or less follow the same approach to model building as previously described. The dataset is divided into a test and a training set, followed by a tuning of the alpha hyper parameter for the ridge regression model. To keep things simple I just use the test set for tuning, instead of cross validation.
#Divide into training and test-set X_train, X_test, y_train, y_test = train_test_split(X_selected, y, test_size=0.50, random_state=42) #Tune Alpha #prepare scoring lists fitscores =  predictscores =  #prepare a log spaced list of alpha values to test alphas = np.logspace(-1, 4, num=25) #Iterate through alphas and fit with Ridge Regression for alpha in alphas: estimator = Ridge(alpha = alpha) estimator.fit(X_train,y_train) fitscores.append(estimator.score(X_train,y_train)) predictscores.append(estimator.score(X_test,y_test))
This results in three lists, one with the alphas tests, one with the scores for the fit of the training set and one with the scores of the test set prediction. To get a visual overview a plot is useful.
#make a plot ax = plt.gca() ax.set_xscale('log') ax.plot(alphas, fitscores,'g', label = 'Train') ax.plot(alphas, predictscores,'b', label = 'Test') #Set limits and titles plt.ylim([0,1]) plt.xlabel('alpha') plt.ylabel('Correlation Coefficient') plt.legend() plt.title('Tune regularization') plt.savefig('Tuning.png') plt.show()
The plot nicely follows what is expected. Some degree of regularization gives a model with a not too large over-fit to the data. An alpha around 45 seems adequate.
A final model is produced from the entire training set and the prediction of the test set is compared to the actual values with an additional plot.
#Build a model on the training set and test it. ridgeEstimator = Ridge(alpha = 45) ridgeEstimator.fit(X_train,y_train) ridgeEstimator.score(X_train,y_train) ridgeEstimator.score(X_test,y_test) plt.plot(y_train,ridgeEstimator.predict(X_train),'go', label = 'Train') plt.plot(y_test,ridgeEstimator.predict(X_test),'b*', label = 'Test') plt.legend() plt.xlabel('Y') plt.ylabel('Predicted Y') plt.title('Prediction Plot') plt = watermark(plt) #Can be skipped plt.savefig('Prediction.png') plt.show()
This looks like a fairly good model. But how can this be? The data was made in a completely random way. Is there a mistake in the random module of Numerical Python? Should I have used cross validation on the training set to tune the hyper-parameter? The first is highly unlikely and the second could give a small effect, but the true problem is the way feature selection was done. Is there then an error in the way the sklearn module works? No, the true reason is related to a sampling effect as also featured in this comic strip by Randall Munroe: Jelly Beans Causes Acne (XKCD.org)
If enough random data is sampled and tested with statistical tools that do not take the number of testings into account, a wrong understanding of the correlations in the dataset will result. The mistake in the example script above is that the feature selection is done on the entire dataset and not just using the training set. This way 40 features was selected out a large pool that by chance had a correlation to the y values. Later splitting and model building will then re find this correlation no matter how the dataset is split and tested.
This effect may not just be active when using automated feature selection as above, but also as the human modeler tries to optimize the models. The more learners, the more features, the more different ways of producing fingerprints from molecules are tested, the more likely one is to find a combination that by chance works for the dataset at hand. This is why is it always a good idea to split out a validation set at the very beginning of the modeling exercise or to obtain a brand new validation set after modeling by procuring new data or do new measurements in the lab.
Esben Jannik Bjerrum
I leave it as an exercise to interested readers to switch the order of feature selection and train-test splitting.
The entire script for convenience: Feature_Selection_Script.py