Wash that gold: Modelling solubility with Molecular fingerprints….
In the last blog entry, the conversion between a molecule and its fingerprint was briefly touched upon. Now the fingerprints will be used as the basis for a simple attempt to build a solubility model using some data from the web. A file with some 1300 molecules as SMILES strings and their aqueous solubility was downloaded from http://www.vcclab.org/lab/alogps/
To prepare it for import into Python, a header was added to the file and the comments at the end were double hashed (##).
name smiles solubility
60-35-5 CC(N)=O 1.58
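With that layout, NumPy's genfromtxt can parse the file into a structured array whose field names come from the header line. A minimal sketch with a small in-memory sample (the second row and its solubility value are made up for illustration):

```python
import numpy as np
from io import StringIO

# A made-up two-row sample in the same layout as the downloaded file
text = """name smiles solubility
60-35-5 CC(N)=O 1.58
64-17-5 CCO 2.06
## a trailing comment line, ignored thanks to comments='##'
"""

data = np.genfromtxt(StringIO(text), delimiter=' ', names=True,
                     dtype=None, comments='##', encoding='utf-8')
print(data.dtype.names)    # → ('name', 'smiles', 'solubility')
print(data['solubility'])  # → [1.58 2.06]
```

The header becomes the field names, so columns can be addressed by name instead of by index later on.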
Then this can be done in Python:
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import linear_model

#First import data from the text file using NumPy's genfromtxt
data = np.genfromtxt("Solubility_data.smiles", delimiter=' ', names=True,
                     dtype=None, comments='##', encoding='utf-8')

#For each row, convert the SMILES to a mol, the mol to a fingerprint
#and store the results in prepared Python lists
molfps = []
solub = []
for row in data:
    mol = Chem.MolFromSmiles(row['smiles'])
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)  #radius 2
    bitstring = fp.ToBitString()
    fp_vect = list(map(int, bitstring))  #list() needed in Python 3
    molfps.append(fp_vect)
    solub.append(row['solubility'])

#Convert the Python lists into NumPy arrays for efficiency
X = np.array(molfps)
y = np.array(solub)
print(X.shape)
This gives a matrix X with all the molecules (samples) along the first dimension and the Morgan fingerprint bits along the second (the features). y is a vector with the solubility data.
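The bit string trick in the loop works on any string of 0s and 1s, so it can be illustrated without RDKit. The eight-bit string below is just a made-up example; the real Morgan fingerprints above are 1024 bits:

```python
# A made-up 8-bit "fingerprint" bit string
bitstring = "10110010"

# map(int, ...) converts each character to an integer;
# in Python 3, map is lazy, so list() is needed before NumPy sees it
fp_vect = list(map(int, bitstring))
print(fp_vect)  # → [1, 0, 1, 1, 0, 0, 1, 0]
```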
Now deep learning neural networks will be used to… Joking aside, 95% of machine learning problems do not require deep learning ;-), so a simple solution will be tried first.
Multiple linear regression (MLR) is simply an expansion of the well-known linear formula y = a*x + b:
y = b1*x1 + b2*x2 + … + bi*xi + b0
The parameters b1 .. bi will be fitted so that the features of each sample (x1 to xi, the Morgan fingerprint bits) can predict the solubility (y). In Python it can be done with scikit-learn in three lines :-)
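Under the hood this is just a least-squares fit, which can be sketched with plain NumPy. The toy data below is made up: y is generated from known coefficients, and the fit recovers them (a column of ones is appended so the intercept b0 is fitted together with b1 .. bi):

```python
import numpy as np

# Made-up toy data: 5 samples, 2 features, generated from known coefficients
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0], [4.0, 1.0], [0.0, 2.0]])
b_true = np.array([2.0, -1.0])   # b1, b2
b0_true = 0.5                    # intercept
y = X @ b_true + b0_true

# Append a column of ones so b0 is estimated as the last coefficient
X1 = np.hstack([X, np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coef)  # → approximately [ 2.  -1.   0.5]
```

scikit-learn's LinearRegression does the same job (plus the intercept handling) behind its fit/score interface.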
#Load the LinearRegression from sklearn.linear_model
estimator = linear_model.LinearRegression()
#Fit and score
estimator.fit(X, y)
print(estimator.score(X, y))
This reports a coefficient of determination (R²) around 0.96. With 1024 features and only 1311 molecules there is a high risk of overfitting, which can be tested with cross-validation. Scikit-learn also contains modules for exactly that.
#Cross validate
#(the old sklearn.cross_validation module has since been renamed model_selection)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
estimator.fit(X_train, y_train)
print(estimator.score(X_train, y_train))
#Out: 0.98429997844791894
print(estimator.score(X_test, y_test))
#Out: -2.7528207967268459e+24
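This failure mode is easy to reproduce with synthetic data: when there are more features than training samples, ordinary least squares can interpolate the training set exactly, even pure noise, and then generalises badly. Everything below is a made-up illustration, not the solubility data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, n_feat = 50, 50, 200  # more features than training samples

X_train = rng.standard_normal((n_train, n_feat))
X_test = rng.standard_normal((n_test, n_feat))
y_train = rng.standard_normal(n_train)  # pure noise, nothing to learn
y_test = rng.standard_normal(n_test)

# With n_feat > n_train, least squares fits the training data exactly
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
print(np.allclose(X_train @ coef, y_train))  # → True: perfect training fit

# R² on the test set: 1 - SS_res/SS_tot, typically negative here,
# i.e. worse than just predicting the mean of y_test
ss_res = np.sum((y_test - X_test @ coef) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(r2)
```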
Obviously something is wrong. A score that negative is no round-off error: an R² below zero means the model predicts the test set worse than simply guessing the mean of y. A quick plot shows the bad correlation:
import matplotlib.pyplot as plt
plt.scatter(estimator.predict(X_test), y_test)
plt.show()
Simple multiple linear regression on 1024 molecular bits needs far more than 1311 molecular samples to avoid overfitting. However, next time I'll try a special trick….
Esben Jannik Bjerrum