### Wash that gold: Modelling solubility with molecular fingerprints

In the last blog entry, the conversion between molecule and fingerprint was briefly touched upon. Now the fingerprints will be used as the basis for a simple attempt to build a solubility model from some data from the web. A file with some 1300 molecules as SMILES strings and their aqueous solubility was downloaded from http://www.vcclab.org/lab/alogps/

To prepare it for import into Python, a header line was added to the file and the comment lines at the end were marked with a double hash (##).

`name smiles solubility `

`60-35-5 CC(N)=O 1.58`

....
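The file layout can be checked with a minimal inline stand-in before loading the real file. The second row below is a made-up placeholder (not from the dataset), added only so the sketch has more than one line of data:

```python
import io
import numpy as np

# A tiny stand-in for the downloaded file, same layout: a header line,
# space-separated columns, and trailing comment lines marked with ##.
# The second data row is a made-up placeholder for illustration only.
text = """name smiles solubility
60-35-5 CC(N)=O 1.58
0-00-0 CCO 0.00
## data from http://www.vcclab.org/lab/alogps/"""

data = np.genfromtxt(io.StringIO(text), delimiter=' ', names=True,
                     dtype=None, comments='##', encoding='utf-8')
print(data['smiles'], data['solubility'])
```

The `names=True` option turns the header into field names, so columns can be accessed as `data['smiles']` and `data['solubility']`.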

The full file can then be loaded and converted in Python:

```
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import linear_model

# First import the data from the text file using NumPy's genfromtxt
data = np.genfromtxt("Solubility_data.smiles", delimiter=' ', names=True,
                     dtype=None, comments='##', encoding='utf-8')

# For each row, convert the SMILES to a mol, the mol to a fingerprint,
# and store the results in prepared Python lists
molfps = []
solub = []
for row in data:
    mol = Chem.MolFromSmiles(row['smiles'])
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024)  # radius 2, 1024 bits
    bitstring = fp.ToBitString()
    fp_vect = list(map(int, bitstring))  # list() so it also works in Python 3
    molfps.append(fp_vect)
    solub.append(row['solubility'])

# Convert the Python lists into NumPy arrays for efficiency
X = np.array(molfps)
y = np.array(solub)
print(X.shape)
```

This gives a matrix X with one molecule (sample) per row and the 1024 Morgan fingerprint bits (features) along the columns. The y is a vector with the solubility data.

Now deep learning neural networks will be used to… Joking aside, the large majority of machine learning problems do not require deep learning ;-), so a simple solution will be tried first.

Multiple linear regression (MLR) is simply an expansion of the well-known linear formula **y = a*x + b**:

y = b1*x1 + b2*x2 + … + bi*xi + b0
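As a toy illustration of the formula above, with made-up numbers (two features, six samples, coefficients chosen as b1 = 2, b2 = -3, b0 = 5), the parameters can be recovered with NumPy's least-squares solver:

```python
import numpy as np

# Made-up data generated exactly as y = 2*x1 - 3*x2 + 5
X = np.array([[0., 0.], [1., 0.], [0., 1.], [1., 1.], [2., 1.], [1., 2.]])
y = 2*X[:, 0] - 3*X[:, 1] + 5

# Append a column of ones so the intercept b0 is fitted along with b1 and b2
X1 = np.hstack([X, np.ones((len(X), 1))])
coeffs, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(coeffs)  # approximately [ 2. -3.  5.]
```

This is the same mathematics scikit-learn performs under the hood, just without the convenient estimator interface.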

The parameters b1 to bi are fitted so that the features of each sample (x1 to xi, the Morgan fingerprint bits) predict the solubility (y). In Python it can be done with scikit-learn in three lines :-)

```
#Build the LinearRegression estimator from sklearn.linear_model
estimator = linear_model.LinearRegression()
#Fit and score
estimator.fit(X, y)
print(estimator.score(X, y))
```

This reports a score around 0.96 (the score() method returns R², the coefficient of determination). With 1024 features and only 1311 molecules there is a high risk of overfitting, which can be tested with cross-validation. Scikit-learn also contains modules for exactly that.

```
#Cross validate on a held-out test set
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older versions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
estimator.fit(X_train, y_train)
print(estimator.score(X_train, y_train))
#0.98429997844791894
print(estimator.score(X_test, y_test))
#-2.7528207967268459e+24
```
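A single train/test split can be unlucky, so the same check is often done with k-fold cross-validation. A sketch with random stand-in data of a similar shape (the real X and y from above would be used instead):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
# Stand-in data with the same pathology: far more features than samples per fold
X_demo = rng.randint(0, 2, size=(200, 500)).astype(float)
y_demo = rng.randn(200)

# One R^2 score per fold; large negative values signal severe overfitting
scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5, scoring='r2')
print(scores)
```

Because the features here are pure noise and outnumber the training samples in every fold, the model fits the training data perfectly and the test-fold scores come out strongly negative, the same failure mode as above.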

Obviously something is wrong: a correlation should be between 0 and 1. A round-off error? No, the score is R², which has no lower bound and turns negative whenever the model predicts worse than simply returning the mean of y. A quick plot shows the bad correlation:

```
import matplotlib.pyplot as plt
plt.scatter(estimator.predict(X_test), y_test)
plt.xlabel("Predicted solubility")
plt.ylabel("Measured solubility")
plt.show()
```
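To make the negative score concrete: R² = 1 − SS_res/SS_tot, and it drops below zero as soon as the squared prediction errors exceed the variance of y around its mean. A quick check by hand with made-up numbers:

```python
import numpy as np

y_true = np.array([1., 2., 3., 4.])
y_bad = np.array([10., -5., 8., 0.])   # deliberately terrible predictions

ss_res = ((y_true - y_bad) ** 2).sum()               # residual sum of squares
ss_tot = ((y_true - y_true.mean()) ** 2).sum()       # total sum of squares
r2 = 1 - ss_res / ss_tot
print(r2)  # -33.2, well below zero
```

The astronomically negative score on the test set above is the same effect taken to an extreme.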

Simple multiple linear regression using 1024 molecular bits needs far more than 1311 molecular samples to avoid overfitting. However, next time I'll try a special trick….

Best Regards

Esben Jannik Bjerrum
