Safer fitting through regularization
Last time a simple multiple linear regression (MLR) model was seriously overfitted to molecular solubility data. This time the concept of regularization will be tested.
Recall that the MLR model was of the form:
y = b1*x1 + b2*x2 + … + bi*xi + b0
Where each x correspond to the absence or presence of a particular bit in the morgan fingerprint and the coefficients b1 to bi was fit to the solubility data in y. With 1024 features and only 1311 molecular samples the model was seriously over-fit and had no predictive power in cross validation.
Regularization is the concept of penalizing “complex” solutions. So what is a complex solution and what is a simple solution? A complex solution may be one that has a lot of large coefficients with opposing signs, whereas a simple solution could be seen as one where there was a lot of small contributions with low coefficients. This can be build into the fit itself.
The coefficients in the MLR is fit so that the sum of squared errors between the predicted y and the measured y is lowest possible. The average squared error is usually used as the cost function that is optimized in the machine learning model. By adding an extra term to the cost function that penalizes “complex” solutions it is possible to regularize the solution so that over-fitting is avoided. For MLR the sum of squared coefficients are usually added to the cost-function multiplied by a parameter alpha that thus controls the amount of regularization.
This regularized MLR is available in Scikit-learn as Ridge Regression. It will be explored what happens in cross-validation of the previously generated data when varying the regularization parameter alpha.
The X matrix and y vector were produced in the same way as in the last blog post. In python a tuning of the regularization could be implemented in this way.
#prepare scoring lists fitscores =  predictscores =  #prepare a log spaced list of alpha values to test alphas = np.logspace(-1, 4, num=25) #Iterate through alphas and fit with Ridge Regression for alpha in alphas: estimator = linear_model.Ridge(alpha = alpha) estimator.fit(X_train,y_train) fitscores.append(estimator.score(X_train,y_train)) predictscores.append(estimator.score(X_test,y_test))
#show a plot import matplotlib.pyplot as plt ax = plt.gca() ax.set_xscale('log') ax.plot(alphas, fitscores,'g') ax.plot(alphas, predictscores,'b') plt.xlabel('alpha') plt.ylabel('Correlation Coefficient') plt.show()
The plot is exactly as expected. The figure shows that in the left side a low regularization leads to a high correlation with the training set (green curve) but low correlation with the test set (blue curve). Raising the regularization parameter alpha makes it harder to over-fit as the fit to the train set goes down while at the same time the prediction of the test set becomes better. An optimum is found around alpha = 10. Going further the correlation with both the test and train set is worsened, as the over regularized model can not fit the complexity found in the data set.The web page from where the dataset was downloaded cites a model with RMS around 0.49 and spread of 0.38. How well does thissimple Ridge Regression model compare?
estimator= linear_model.Ridge(alpha = 10) estimator.fit(X_train,y_train) #Predict the test set y_pred = estimator.predict(X_test) rms = (np.mean((y_test - y_pred)**2))**0.5 s = np.std(y_test -y_pred) print rms, s
The fastly build model has an RMSE around 1.15 with a similar standard deviation. Obviously the simple linear modelling done here has a way to go compared with the results obtained with Neural Networks using molecular weight and E-state indices(1). Molecular weight and E-state indices (rdkit.Chem.EState.Fingerprinter) can also be calculated with RDKit, maybe this will be attempted another time…..
1: Tetko, I. V.; Tanchuk, V. Y.; Kasheva, T. N.; Villa, A. E. Estimation of aqueous solubility of chemical compounds using E-state indices, J. Chem. Inf. Comput. Sci., 2001, 41, 1488-93