Wash that gold: Modelling solubility with Molecular fingerprints….

In the last blog entry, the conversion between molecules and fingerprints was briefly touched upon. Now the fingerprints will be used as the basis for a simple attempt to build a solubility model with some data from the web. A file with some 1000 molecules as SMILES strings and their aqueous solubility was downloaded. To prepare it for import into Python, a header was added to the file and the comments at the end were double-hashed (##), so the lines look like this:

name smiles solubility
60-35-5 CC(N)=O 1.58


The data can then be read in Python:

import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn import linear_model

#First import data from the text file using NumPy's genfromtxt
data = np.genfromtxt("Solubility_data.smiles",delimiter=' ',names=True, dtype=None, comments='##')
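As a quick illustration of what the genfromtxt call does with that header line, here is a small self-contained sketch parsing the same three-column format from memory instead of a file (the second data row is made up for illustration; only the acetamide row is from the original file):

```python
import numpy as np
from io import BytesIO

# Same layout as the solubility file: a header line, space-separated columns.
# The second row (ethanol) uses an invented solubility value, purely illustrative.
text = b"""name smiles solubility
60-35-5 CC(N)=O 1.58
64-17-5 CCO 2.06
## a trailing comment line, ignored via comments='##'
"""

data = np.genfromtxt(BytesIO(text), delimiter=' ', names=True, dtype=None, comments='##')
print(data['smiles'])      # the SMILES column, accessible by the header name
print(data['solubility'])  # the numeric solubility column
```

Because names=True picks up the header, each column can be accessed by name, which is what the loop below relies on.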

#For each row, convert the SMILES to a mol, the mol to a fingerprint, and store the results in prepared Python lists
molfps = []
solub = []
for row in data:
	mol = Chem.MolFromSmiles(row['smiles'])
	fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=1024) #radius 2
	bitstring = fp.ToBitString()
	fp_vect = list(map(int, bitstring))
	molfps.append(fp_vect)
	solub.append(row['solubility'])

#Convert the python list into NumPy arrays for efficiency.
X = np.array(molfps)
y = np.array(solub)

print(X.shape)

This gives a matrix X with the molecules (samples) along the first dimension and the Morgan fingerprint bits (features) along the second. y is a vector with the solubility data.

Now deep learning neural networks will be used to… Jokes aside, 95% of machine learning problems do not require deep learning ;-), so a simple solution will be tried first.

Multiple linear regression (MLR) is simply an expansion of the well-known linear formula y = a*x + b.

y = b1*x1 + b2*x2 + … + bi*xi + b0
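In matrix form this is an ordinary least-squares problem. A minimal NumPy sketch with synthetic data (all names and values here are illustrative, not from the solubility set) shows that fitting recovers known coefficients:

```python
import numpy as np

# Synthetic example: 3 features with known coefficients b = (2, -1, 0.5) and intercept b0 = 1
rng = np.random.RandomState(0)
X = rng.rand(50, 3)
b_true = np.array([2.0, -1.0, 0.5])
y = X.dot(b_true) + 1.0

# Append a column of ones so the intercept b0 is fitted alongside b1..bi
X1 = np.hstack([X, np.ones((50, 1))])
coeffs, _, _, _ = np.linalg.lstsq(X1, y, rcond=None)
print(coeffs)  # close to [2, -1, 0.5, 1]
```

scikit-learn's LinearRegression does exactly this kind of least-squares fit behind the scenes, with the intercept handled automatically.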

The parameters b1 .. bi will be fitted so that the features of each sample (x1 to xi, the Morgan fingerprint bits) can predict the solubility (y). In Python it can be done with scikit-learn in three lines :-)

#Load the LinearRegression model from sklearn.linear_model
estimator = linear_model.LinearRegression()
#Fit and score
estimator.fit(X, y)
print(estimator.score(X, y))

This reports a score (R², the coefficient of determination) around 0.96. With 1024 features and only 1311 molecules there’s a high risk of overfitting, which can be tested with cross-validation. Scikit-learn also contains modules for exactly that.

#Cross Validate
from sklearn import cross_validation

X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.4, random_state=0)
estimator.fit(X_train, y_train)
print(estimator.score(X_train, y_train))
#Out[95]: 0.98429997844791894

print(estimator.score(X_test, y_test))
#Out[94]: -2.7528207967268459e+24

Obviously something is wrong; the score should be between 0 and 1 for a useful model, and a negative value means the predictions are worse than simply guessing the mean of y. Maybe some round-off error? A fast plot shows the bad correlation:

import matplotlib.pyplot as plt
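The plotting code after the import was lost in the page conversion. A sketch of what such a predicted-vs-measured plot could look like, using synthetic random data in place of the fingerprint matrix (everything below is illustrative; the modern sklearn.model_selection import is used instead of the older cross_validation module):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this also runs headless
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the fingerprint matrix: far more features than training samples
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(200, 1024)).astype(float)
y = rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=0)
estimator = LinearRegression()
estimator.fit(X_train, y_train)

# Training points hug the diagonal (near-perfect fit), test points scatter wildly
plt.scatter(y_train, estimator.predict(X_train), label="train")
plt.scatter(y_test, estimator.predict(X_test), label="test")
plt.plot([-3, 3], [-3, 3], "k--")  # identity line
plt.xlabel("Measured")
plt.ylabel("Predicted")
plt.legend()
plt.savefig("overfit_scatter.png")
```

With more features than training samples the least-squares fit can interpolate the training data exactly, which is precisely the overfitting the train/test scores above reveal.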



Simple multiple linear regression using 1024 molecular bits needs far more than 1311 molecular samples to avoid overfitting. However, next time I’ll try a special trick….

Best Regards
Esben Jannik Bjerrum



