Learn how to teach your computer to “See” Chemistry: Free Chemception models with RDKit and Keras

Chemception with RDKit and Keras

The film Inception with Leonardo DiCaprio is about dreams within dreams, and gave rise to the meme “We need to go deeper”. The title also lent its name to the Inception modules Google uses in its image-recognition networks. I recently stumbled across two interesting papers by Garrett Goh and coworkers, https://arxiv.org/abs/1706.06689 and https://arxiv.org/abs/1710.02238, where the authors encode molecules as images and feed them into these deep neural networks designed for image recognition. Their code was not available, but it should not be too difficult to recreate their Chemception networks with RDKit and Keras. Read on below to see how. It may be a bit technical and long, but I hope you will read on as there are some pretty images in the end 😉

First there’s a couple of imports of some Python libraries.

from __future__ import print_function
import rdkit
from rdkit import Chem
from rdkit.Chem import AllChem
import pandas as pd
import numpy as np

from matplotlib import pyplot as plt
%matplotlib inline
print("RDKit: %s"%rdkit.__version__)
RDKit: 2017.09.1.dev1

Next follow the Keras objects which will be needed

import keras
from keras.models import Sequential, Model
from keras.layers import Conv2D, MaxPooling2D, Input, GlobalMaxPooling2D
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau
print("Keras: %s"%keras.__version__)
Using TensorFlow backend.

Keras: 2.1.1

Preparing the Dataset

I had a pet dataset lying around which I have used previously. It has the SMILES strings in a column named “smiles” and the assay values in a column named “PC_uM_value”. After reading in the dataset, RDKit molecules are added to a column called “mol” using the pandas DataFrame’s apply function.

data = pd.read_hdf("Sutherland.h5","Data")
data["mol"] = data["smiles"].apply(Chem.MolFromSmiles)

Chemception is named after the Inception modules which will be used for the neural network. The first step is to encode the molecule into an “image”. The function below takes an RDKit mol and encodes the molecular graph as an image with 4 channels.
After reading in the molecule, the Gasteiger charges are calculated and the 2D drawing coordinates computed. They are usually computed before generating depictions of the molecule, but we are going to need them “raw”, so they are extracted to the coords matrix.
The vect matrix is defined and filled with zeros (vacuum) and is of the shape (image_width, image_height, 4). Each layer is then used to encode different information from the molecule. Layer zero is filled with information about the bonds and encodes the bond order. The next three layers encode the atomic number, Gasteiger charges and hybridization. The reason I chose this information is that this combination seems to fare well in most of the tests used in this preprint, but other combinations could of course also be tried.
When we have a 4-channel image, it will be possible to use the standard image processing tools from Keras, which are needed to get proper training. More channels can of course be added, but that requires some extra hacking of some Keras modules.

def chemcepterize_mol(mol, embed=20.0, res=0.5):
    #The image is a square of side 2*embed, sampled at resolution res
    dims = int(embed*2/res)
    #Work on a copy so the original molecule is left untouched
    cmol = Chem.Mol(mol.ToBinary())
    AllChem.ComputeGasteigerCharges(cmol)
    AllChem.Compute2DCoords(cmol)
    coords = cmol.GetConformer(0).GetPositions()
    vect = np.zeros((dims,dims,4))
    #Bonds first
    for i,bond in enumerate(cmol.GetBonds()):
        bondorder = bond.GetBondTypeAsDouble()
        bidx = bond.GetBeginAtomIdx()
        eidx = bond.GetEndAtomIdx()
        bcoords = coords[bidx]
        ecoords = coords[eidx]
        #Sample points along the bond and draw it into the first channel
        frac = np.linspace(0,1,int(1/res*2))
        for f in frac:
            c = (f*bcoords + (1-f)*ecoords)
            idx = int(round((c[0] + embed)/res))
            idy = int(round((c[1] + embed)/res))
            #Save the bond order in the first channel
            vect[ idx , idy ,0] = bondorder
    #Atom Layers
    for i,atom in enumerate(cmol.GetAtoms()):
        idx = int(round((coords[i][0] + embed)/res))
        idy = int(round((coords[i][1] + embed)/res))
        #Atomic number
        vect[ idx , idy, 1] = atom.GetAtomicNum()
        #Gasteiger Charges
        charge = atom.GetDoubleProp("_GasteigerCharge")
        vect[ idx , idy, 3] = charge
        #Hybridization
        hyptype = atom.GetHybridization().real
        vect[ idx , idy, 2] = hyptype
    return vect

To better understand what the code has done, let’s try to “chemcepterize” a molecule and show it as an image. The embedding and the resolution are set lower than they will be for the final dataset. Matplotlib only supports RGB, so only the first three channels are used.

mol = data["mol"][0]
v = chemcepterize_mol(mol, embed=10, res=0.2)
print(v.shape)
plt.imshow(v[:,:,:3])
(100, 100, 4)


Nice. We can see that the different atoms and bonds are coloured differently, as they carry different chemical information. The next step is to “chemcepterize” the entire collection of RDKit molecules and add a new column with the “images” to the dataframe.

def vectorize(mol):
    return chemcepterize_mol(mol, embed=12)
data["molimage"] = data["mol"].apply(vectorize)

The dataset already had a split value indicating whether each compound belongs to the train or test set. The shapes of the final numpy arrays are (samples, height, width, channels).

X_train = np.array(list(data["molimage"][data["split"]==1]))
X_test = np.array(list(data["molimage"][data["split"]==0]))
print(X_train.shape)
print(X_test.shape)
(609, 48, 48, 4)
(147, 48, 48, 4)

We also need to prepare the values to predict. Here it is the IC50 for some DHFR inhibitors. The data is converted to log space and the robust scaler from scikit-learn is used to scale the data to roughly between -1 and 1 (neural networks like this range and it makes training somewhat easier).

assay = "PC_uM_value"
y_train = data[assay][data["split"]==1].values.reshape(-1,1)
y_test = data[assay][data["split"]==0].values.reshape(-1,1)

from sklearn.preprocessing import RobustScaler
rbs = RobustScaler(with_centering=True, with_scaling=True, quantile_range=(5.0, 95.0), copy=True)
y_train_s = rbs.fit_transform(np.log(y_train))
y_test_s = rbs.transform(np.log(y_test))
h = plt.hist(y_train_s, bins=20)


Building the Chemception model

Now comes the interesting part, where we will build the Chemception network. For that we will use the Inception modules. These are interesting architectures: instead of choosing or optimizing a single convolutional kernel size, each module splits the data stream into three towers. Two of the towers apply a 1 by 1 convolution followed by convolutions with kernel sizes of 3,3 and 5,5, respectively, while the third combines a MaxPooling2D layer with a 1 by 1 convolution. The 1 by 1 convolutions serve as a dimension reduction by recombining the features from the previous layers (or the image channels) into a lower number of channels.
The first Inception module is special, because I have removed the MaxPooling layer. The justification is that the features encoded in the channels are not naturally ordered. Are positive charges more relevant than negative ones? Is oxygen more important than carbon and nitrogen because it has a higher atomic number? I think not, so I let the network figure it out in the first layer by recombining the channels, before I start to use the standard module with max pooling.

After defining the Inception modules in some helper functions, the network is built using the Keras functional API, where the flow of Keras tensor objects is defined.

After the Inception modules, I put a max pooling layer that spans the entire output of each kernel from the last Inception module, in the hope that each kernel comes to represent a distinct feature. We are interested in whether a feature is present, but not necessarily exactly where in the image. The output is flattened to one dimension and fed into a standard dense network with 100 neurons with the ReLU activation function and a single output neuron with a linear activation function, as we are aiming at a regression model.
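As an aside, the same kernel-wise pooling could also be written with the GlobalMaxPooling2D layer imported earlier, which takes the maximum over the spatial dimensions for each channel and returns an already flattened tensor. A minimal sketch of that alternative head, assuming x holds the output of the last Inception module (the model below uses an explicit MaxPooling2D plus Flatten instead):

#Alternative head: GlobalMaxPooling2D takes the max over height and width per channel
x = GlobalMaxPooling2D()(x)           #output is already flat, so no Flatten() is needed
x = Dense(100, activation='relu')(x)
output = Dense(1, activation='linear')(x)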

In the end, the computational graph is used with the Model class, where the inputs and outputs are defined.

input_shape = X_train.shape[1:]
print(input_shape)
(48, 48, 4)
def Inception0(input):
    tower_1 = Conv2D(16, (1, 1), padding='same', activation='relu')(input)
    tower_1 = Conv2D(16, (3, 3), padding='same', activation='relu')(tower_1)

    tower_2 = Conv2D(16, (1, 1), padding='same', activation='relu')(input)
    tower_2 = Conv2D(16, (5, 5), padding='same', activation='relu')(tower_2)

    tower_3 = Conv2D(16, (1, 1), padding='same', activation='relu')(input)

    output = keras.layers.concatenate([tower_1, tower_2, tower_3], axis=-1)
    return output
def Inception(input):
    tower_1 = Conv2D(16, (1, 1), padding='same', activation='relu')(input)
    tower_1 = Conv2D(16, (3, 3), padding='same', activation='relu')(tower_1)

    tower_2 = Conv2D(16, (1, 1), padding='same', activation='relu')(input)
    tower_2 = Conv2D(16, (5, 5), padding='same', activation='relu')(tower_2)

    tower_3 = MaxPooling2D((3, 3), strides=(1, 1), padding='same')(input)
    tower_3 = Conv2D(16, (1, 1), padding='same', activation='relu')(tower_3)

    output = keras.layers.concatenate([tower_1, tower_2, tower_3], axis=-1)
    return output
input_img = Input(shape=input_shape)

x = Inception0(input_img)
x = Inception(x)
x = Inception(x)
od=int(x.shape[1])
x = MaxPooling2D(pool_size=(od,od), strides=(1,1))(x)
x = Flatten()(x)
x = Dense(100, activation='relu')(x)
output = Dense(1, activation='linear')(x)

model = Model(inputs=input_img, outputs=output)

print(model.summary())
__________________________________________________________________________________________________
 Layer (type) Output Shape Param # Connected to
 ==================================================================================================
 input_1 (InputLayer) (None, 48, 48, 4) 0
 __________________________________________________________________________________________________
 conv2d_1 (Conv2D) (None, 48, 48, 16) 80 input_1[0][0]
 __________________________________________________________________________________________________
 conv2d_3 (Conv2D) (None, 48, 48, 16) 80 input_1[0][0]
 __________________________________________________________________________________________________
 conv2d_2 (Conv2D) (None, 48, 48, 16) 2320 conv2d_1[0][0]
 __________________________________________________________________________________________________
 conv2d_4 (Conv2D) (None, 48, 48, 16) 6416 conv2d_3[0][0]
 __________________________________________________________________________________________________
 conv2d_5 (Conv2D) (None, 48, 48, 16) 80 input_1[0][0]
 __________________________________________________________________________________________________
 concatenate_1 (Concatenate) (None, 48, 48, 48) 0 conv2d_2[0][0]
 conv2d_4[0][0]
 conv2d_5[0][0]
 __________________________________________________________________________________________________
 conv2d_6 (Conv2D) (None, 48, 48, 16) 784 concatenate_1[0][0]
 __________________________________________________________________________________________________
 conv2d_8 (Conv2D) (None, 48, 48, 16) 784 concatenate_1[0][0]
 __________________________________________________________________________________________________
 max_pooling2d_1 (MaxPooling2D) (None, 48, 48, 48) 0 concatenate_1[0][0]
 __________________________________________________________________________________________________
 conv2d_7 (Conv2D) (None, 48, 48, 16) 2320 conv2d_6[0][0]
 __________________________________________________________________________________________________
 conv2d_9 (Conv2D) (None, 48, 48, 16) 6416 conv2d_8[0][0]
 __________________________________________________________________________________________________
 conv2d_10 (Conv2D) (None, 48, 48, 16) 784 max_pooling2d_1[0][0]
 __________________________________________________________________________________________________
 concatenate_2 (Concatenate) (None, 48, 48, 48) 0 conv2d_7[0][0]
 conv2d_9[0][0]
 conv2d_10[0][0]
 __________________________________________________________________________________________________
 conv2d_11 (Conv2D) (None, 48, 48, 16) 784 concatenate_2[0][0]
 __________________________________________________________________________________________________
 conv2d_13 (Conv2D) (None, 48, 48, 16) 784 concatenate_2[0][0]
 __________________________________________________________________________________________________
 max_pooling2d_2 (MaxPooling2D) (None, 48, 48, 48) 0 concatenate_2[0][0]
 __________________________________________________________________________________________________
 conv2d_12 (Conv2D) (None, 48, 48, 16) 2320 conv2d_11[0][0]
 __________________________________________________________________________________________________
 conv2d_14 (Conv2D) (None, 48, 48, 16) 6416 conv2d_13[0][0]
 __________________________________________________________________________________________________
 conv2d_15 (Conv2D) (None, 48, 48, 16) 784 max_pooling2d_2[0][0]
 __________________________________________________________________________________________________
 concatenate_3 (Concatenate) (None, 48, 48, 48) 0 conv2d_12[0][0]
 conv2d_14[0][0]
 conv2d_15[0][0]
 __________________________________________________________________________________________________
 max_pooling2d_3 (MaxPooling2D) (None, 1, 1, 48) 0 concatenate_3[0][0]
 __________________________________________________________________________________________________
 flatten_1 (Flatten) (None, 48) 0 max_pooling2d_3[0][0]
 __________________________________________________________________________________________________
 dense_1 (Dense) (None, 100) 4900 flatten_1[0][0]
 __________________________________________________________________________________________________
 dense_2 (Dense) (None, 1) 101 dense_1[0][0]
 ==================================================================================================
 Total params: 36,153
 Trainable params: 36,153
 Non-trainable params: 0
 __________________________________________________________________________________________________
 None

In the end the model does not have a terribly large number of parameters, only 36,153, which is a lot lower than other models I have trained on even smaller datasets. This is because I have lowered the number of kernels for each tower from 64 to 16, as I judge the dataset to be rather small. The images may be a lot simpler than, say, photos of cats, so it may work anyway.
For the optimization I use the Adam optimizer and the mean squared error as a loss function.

optimizer = Adam(lr=0.00025)
model.compile(loss="mse", optimizer=optimizer)

The next part is crucial to avoid overfitting. Here the ImageDataGenerator object is used to perform random rotations and flips of the images on the fly during training as a way of augmenting the training dataset. By doing this, the network will learn how to handle rotations, and seeing the features in different orientations will help the model generalize better. Not including this augmentation leads to completely overfit models. We have not encoded stereochemical information in the images; otherwise the flipping should be done by other means.
The training set is concatenated to 50 times its original length to get epochs of a sensible size.

#ImageDataGenerator was already imported from keras.preprocessing.image above
generator = ImageDataGenerator(rotation_range=180,
                               width_shift_range=0.1, height_shift_range=0.1,
                               fill_mode="constant", cval=0,
                               horizontal_flip=True, vertical_flip=True,
                               data_format='channels_last')

#Concatenate for longer epochs
Xt = np.concatenate([X_train]*50, axis=0)
yt = np.concatenate([y_train_s]*50, axis=0)

batch_size=128
g = generator.flow(Xt, yt, batch_size=batch_size, shuffle=True)
steps_per_epoch = 10000/batch_size
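
To sanity check the augmentation, one can pull a single batch from the generator and display one of the augmented images. A minimal sketch (the rotated and shifted molecule should still be recognizable, with the same kind of pixel artifacts as the originals):

#Draw one augmented batch and show the first three channels of the first image
Xb, yb = next(g)
print(Xb.shape)
plt.imshow(Xb[0,:,:,:3])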

Training the Chemception model

Now for the interesting part: training. To lower the learning rate once the validation loss starts to plateau, I use the ReduceLROnPlateau callback available as part of Keras.

reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=1e-6, verbose=1)
history = model.fit_generator(g,
                              steps_per_epoch=len(Xt)//batch_size,
                              epochs=150,
                              validation_data=(X_test,y_test_s),
                              callbacks=[reduce_lr])
Epoch 1/150
 237/237 [==============================] - 31s 131ms/step - loss: 0.1241 - val_loss: 0.1208
 Epoch 2/150
 237/237 [==============================] - 28s 119ms/step - loss: 0.1135 - val_loss: 0.1068
 Epoch 3/150
 237/237 [==============================] - 28s 119ms/step - loss: 0.1123 - val_loss: 0.1071
 Epoch 4/150


.....

237/237 [==============================] - 29s 122ms/step - loss: 0.0362 - val_loss: 0.0361
 Epoch 150/150
 237/237 [==============================] - 29s 122ms/step - loss: 0.0355 - val_loss: 0.0360

Models can be saved and loaded. The history object’s history dictionary is pickled for later plotting.

name = "Chemception_std_notebook_demo"
model.save("%s.h5"%name)

hist = history.history
import pickle
pickle.dump(hist, file("%s_history.pickle"%name,"w"))
#from keras.model import load_model
#model = load_model("%s.h5"%name)
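
For completeness, a minimal sketch of getting the model and the pickled history back in a later session, assuming the file names used above:

from keras.models import load_model
import pickle

#Reload the trained model and the pickled training history
model = load_model("Chemception_std_notebook_demo.h5")
with open("Chemception_std_notebook_demo_history.pickle", "rb") as f:
    hist = pickle.load(f)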

The convergence of the training can be judged from a plot of the learning process. Somewhat unusually for a model with no regularization, the validation loss drops below the training loss. The validation set is not augmented and thus consists of “perfect” pictures, whereas it may take the network longer to deal with all the rotations of the training images, which also introduce some pixel artifacts due to the low resolution.

for label in ['val_loss','loss']:
    plt.plot(hist[label], label = label)
plt.legend()
plt.yscale("log")
plt.xlabel("Epochs")
plt.ylabel("Loss/lr")

Plotting and Evaluating the Performance


y_pred_t = rbs.inverse_transform(model.predict(X_train))
y_pred = rbs.inverse_transform(model.predict(X_test))
plt.scatter(np.log(y_train), y_pred_t, label="Train")
plt.scatter(np.log(y_test), y_pred, label="Test")
plt.xlabel("log(PC_uM)")
plt.ylabel("predicted")
plt.plot([-10,6],[-10,6])
plt.legend()

corr2 = np.corrcoef(np.log(y_test).reshape(1,-1), y_pred.reshape(1,-1))[0][1]**2
rmse = np.mean((np.log(y_test) - y_pred)**2)**0.5
print("R2 : %0.2F"%corr2)
print("RMSE : %0.2F"%rmse)
R2 : 0.67
RMSE : 1.50

This looks very similar to the performance I have obtained with this dataset previously with other means.

Visualizing the Layers

It can be interesting to try and understand how the model “sees” the molecules. For this I’ll take an example molecule and plot some of the outputs from the different layers. I’ve taken the compound with the lowest IC50, number 143 in the dataset.

molnum = 143
molimage = np.array(list(data["molimage"][molnum:molnum+1]))
mol = data["mol"][molnum]

The molecule looks like this

from rdkit.Chem import Draw
Draw.MolToImage(mol)

And has this “chemcepterized” image as shown below

plt.imshow(molimage[0,:,:,:3])

The first example is the third layer, which is the 1,1 convolution that feeds the 5,5 convolutional layer in tower 2.

layer1_model = Model(inputs=model.input,
                    outputs=model.layers[2].output)

kernels1 = layer1_model.predict(molimage)[0]

def plot_kernels(kernels):
    fig, axes = plt.subplots(2,3, figsize=(12,8))
    for i,ax in enumerate(axes.flatten()):
        ax.matshow(kernels[:,:,i])
        ax.set_title("Kernel %s"%i)

plot_kernels(kernels1)

The different kernels (here the first 6 out of 16) have already recombined the information in the Chemception image. Kernel 5 seems to focus on bonds, as it has removed the bond information where there are atoms in the other channels. Kernel 1 focuses on atoms and is most activated for aliphatic carbon. Kernel 4 is most excited by the chlorine atoms, but also contains bond information. Kernels 2 and 3 seem empty. Maybe they are activated by features not present in the current molecule, or maybe they were unneeded. Let’s go deeper…

for layer in [7,13,15,19,20]:
    print("Layer %i"%layer)
    plot_kernels(Model(inputs=model.input,outputs=model.layers[layer].output).predict(molimage)[0])
    plt.show()

Layer 7

Layer 13

Layer 15

Layer 19

Layer 20

A general trend as we go through the layers is that more and more abstract and specific features are created. The chlorine atoms seem to light up in a lot of the kernels, but others focus on other features. Kernels 0 to 2 in layer 19 seem to focus on everything that is not the chlorine atoms. Kernel 5 in the same layer activates near the double-bonded oxygens. The last layer just before the total kernel-wise max pooling only seems to focus on very specific parts of the molecule. Kernel 0 could be the amide oxygen. Kernels 2 and 5 the chlorine atoms. Kernel 4 seems to like the double-bonded oxygens, but only those from the carboxylic acid groups, not the amide. Kernel 3 gets activated near the terminal carboxylic acid’s OH. These are only the first 6 of the features that are extracted from the images and fed forward to the dense layer.

Summary

In this blog post I have shown how to create a Chemception model with a few lines of Python code, using RDKit for the chemistry and Keras for the neural networks. The Sutherland DHFR dataset is very dependent on the classes of molecules in it, and the random division of the classes into the train and test sets does not show whether these kinds of Chemception models can carry any transfer of the SAR from class to class, or whether they merely recognize the compound class and assign an average IC50. If you do happen to recreate a Chemception model and make a division based on classes for the train and test sets, let me know how the performance was in the comments below.
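
If you want to attempt such a class-based split, a minimal sketch of one possible approach is shown below. It uses RDKit’s Murcko scaffolds as a rough stand-in for the compound classes and holds out whole scaffolds; the column names match the dataframe used above, and putting every fifth scaffold in the test set is an arbitrary choice:

from rdkit.Chem.Scaffolds import MurckoScaffold

#Group the compounds by their Murcko scaffold as a rough proxy for compound class
data["scaffold"] = data["mol"].apply(
    lambda m: MurckoScaffold.MurckoScaffoldSmiles(mol=m))

#Hold out entire scaffolds: here every fifth scaffold goes into the test set
scaffolds = sorted(data["scaffold"].unique())
test_scaffolds = set(scaffolds[::5])
data["split"] = (~data["scaffold"].isin(test_scaffolds)).astype(int)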


Comment

  1. Troels Marstrand
    November 30, 2017 at 16:54

    Hi Esben, this looks really interesting. Have you tried using a similar approach in a variational auto-encoder to get a latent space representation? This could be of interest to have the algorithm dream up new molecules directly from drawings — thus getting rid of some of the problems around SMILE representation.

    Also using this with approaches such as grad-cam and similar, are you able to show which features of the molecule / drawing are most important for a given prediction. This would be interesting within a chemical series to see which areas lights up for particular assay read-outs.

    • Esben Jannik Bjerrum
      November 30, 2017 at 19:47

      Hi Troels,
      Good to hear from you. Hope everything is fine in Boston.
      I have thought about using this representation as a base for auto-encoders, but haven’t had the time to experiment yet, and yes, it also opens up some interesting possibilities for mapping the importance. Let me know your results if you try it yourself.

      Best Regards
      Esben
