The Blog is a collection of News, Blog posts, technical tips and tricks and other writings from my sphere of interest.
- Esben Jannik Bjerrum


Master your molecule generator: Seq2seq RNN models with SMILES in Keras

I’ve previously written about molecular generators based on long short-term memory recurrent neural networks (LSTM-RNNs). The networks learn rules about how SMILES strings of molecules are formatted and are then able to create novel SMILES following the same rules by iterating through the characters. The results are “creative” computers that can draw correct molecules.
An interesting aspect is that the distribution of molecular properties are the same for the training dataset and the later samples “You get what you train for” could sound like a commercial for a fitness center but equally apply here.
However it takes a few thousand molecules to train one of these character based generators without over-fitting and getting the chemical rules underlying the SMILES generation correct. Pre-training on larger datasets work to some degree, but the molecules anyway end up looking strange with unlikely anti-aromatic rings.
Another approach is to make an autoencoder. The basic idea is to code the input (here SMILES strings) string to a latent space and then recreate the original input from this space. After training, the autoencoder can be split into an encoder and a decoder. The decoder can be used to sample the area around a molecule of interest (Lead-compound?) and generate similar, but not equal, molecules.
I have also written about my experiences using keras-molecules to map molecules to latent space with subsequent visualization. This code uses a convolutional encoder and a GRU based decoder.
However, it should also be possible to use an RNN based encoder, which was done by Zheng Xu and Co-workers. Their code doesn’t seem to be readily available, but read on below and see how it can be implemented with Keras.

# General Imports
import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Draw, Descriptors
from matplotlib import pyplot as plt
%matplotlib inline

There is of course a need for some SMILES strings to train and test on. To keep things a bit simple (and training shorter) for this blog post, the file with SMILES containing 8 atoms from the GDB-11 dataset will be used. It contains all molecules possible to generate with a subset of atoms and bond types. The dataset can be downloaded here: A random split into train- and test-set is done immediately before further processing.

smifile = "gdb11_size08.smi"
data = pd.read_csv(smifile, delimiter = "\t", names = ["smiles","No","Int"])
from sklearn.cross_validation import train_test_split
smiles_train, smiles_test = train_test_split(data["smiles"], random_state=42)
print smiles_train.shape
print smiles_test.shape

The SMILES must be vectorized to one-hot encoded arrays. To do this a character set is built from all characters found in the SMILES string (both train and test). Also, some start and stop characters are added, which will be used to initiate the decoder and to signal when SMILES generation has stopped. The stop character also work as padding to get the same length of all vectors, so that the network can be trained in batch mode. The character set is used to define two dictionaries to translate back and forth between index and character. The maximum length of the SMILES strings is needed as the RNN’s will be trained in batch mode, and is set to the maximum encountered + some extra.

charset = set("".join(list(data.smiles))+"!E")
char_to_int = dict((c,i) for i,c in enumerate(charset))
int_to_char = dict((i,c) for i,c in enumerate(charset))
embed = max([len(smile) for smile in data.smiles]) + 5
print str(charset)
print(len(charset), embed)
    set(['!', '#', ')', '(', '+', '-', '1', '3', '2', '4', '=', 'C', 'E', 'F', 'H', 'O', 'N', '[', ']', 'c', 'o', 'n'])
    (22, 28)

Afterwards the character set and dictionaries are used to set the necessary bits in the Numpy arrays. The result will be a “piano-roll” of each molecules SMILES string. The X data starts with !, but the output Y is offset by one character, and starts with the first character of the actual SMILES.

def vectorize(smiles):
        one_hot =  np.zeros((smiles.shape[0], embed , len(charset)),dtype=np.int8)
        for i,smile in enumerate(smiles):
            #encode the startchar
            one_hot[i,0,char_to_int["!"]] = 1
            #encode the rest of the chars
            for j,c in enumerate(smile):
                one_hot[i,j+1,char_to_int] = 1
            #Encode endchar
            one_hot[i,len(smile)+1:,char_to_int["E"]] = 1
        #Return two, one for input and the other for output
        return one_hot[:,0:-1,:], one_hot[:,1:,:]
X_train, Y_train = vectorize(smiles_train.values)
X_test,Y_test = vectorize(smiles_test.values)
print smiles_train.iloc[0]
#print X_train.shape

The int_to_char dictionary can be used to go from vectorized form back to a readable string, here with a joined list comprehension.

"".join([int_to_char[idx] for idx in np.argmax(X_train[0,:,:], axis=1)])

With the vectorized data ready, it time to build the autoencoder. First some Keras objects will be imported and the dimensions for the input and output calculated from the vectorized data. Additionally the number of LSTM cell to use for the decoder and encoder is specified and the latent dimension is specified.

#Import Keras objects
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Concatenate
from keras import regularizers
input_shape = X_train.shape[1:]
output_dim = Y_train.shape[-1]
latent_dim = 64
lstm_dim = 64
    Using TensorFlow backend.

Long Short-Term Memory Cells (LSTM)

It may sound like an oxymoron, but long short-term memory cells are special kinds of neural network units that are designed to keep an internal state for longer iterations through a recurrent neural network.  They have been designed with input, output and forget gates, that control what to do with the cells internal state C. The state H is passed as a copy between the recurrent iterations and helps determine what should be done with the internal state C in the next iteration. The gating mechanisms ensures that the network can easily retain an internal state by closing off the input and forget gates until a new condition or input opens them. They work as drop-in replacements for standard recurrent units and usually increase the performance of the RNN with regard to recognizing long term interactions. Luckily they are already implemented in Keras. An also popular, simplified version is the gated recurrent unit (GRU) that is also available in Keras.

The model is overall kept rather simple. A single layer of 64 LSTM cells is used to read the input SMILES string. The outputs from the layer is ignored, but the final internal C and H state is concatenated and recombined into a the bottleneck layer “neck”. This comprises the encoder.
The functional API of Keras is used and this will make it easier to rebuilt the various layers into encoder and decoder models later.

unroll = False
encoder_inputs = Input(shape=input_shape)
encoder = LSTM(lstm_dim, return_state=True,
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
states = Concatenate(axis=-1)([state_h, state_c])
neck = Dense(latent_dim, activation="relu")
neck_outputs = neck(states)

The neck_output tensor is then put through two different Dense layers to decode the states that should be set on the decoder LSTM layer. This recombination of the internal states from the encoder to decoder, makes it possible to use a different size latent space than encoded by the LSTM layers in themselves. The LSTM layer then receives the input once again, and is put to the task of predicting the next character in the sequence. So from the latent representation of the molecule and the character “!”, it should output what the next character is, representing the first atom, e.g. “C”, “N” etc. This is why the Y vectors are offset from the X vectors with the “!”character. This way of training is called teacher enforcing in contrast to the approach where the output of one recurrence step is used as input to the next. Even though the network may make a mistake in the start of the sequence, it will not affect the training of the later parts of the sequence. This trains the decoder much more efficiently, than just feeding it the internal states and then see if it can construct the whole sequence. The LSTM cells output at each step is put into a Dense network with each neurons task to predict the correct character.
The final model for training is then built from the input-layer objects and the output layer.

decode_h = Dense(lstm_dim, activation="relu")
decode_c = Dense(lstm_dim, activation="relu")
state_h_decoded =  decode_h(neck_outputs)
state_c_decoded =  decode_c(neck_outputs)
encoder_states = [state_h_decoded, state_c_decoded]
decoder_inputs = Input(shape=input_shape)
decoder_lstm = LSTM(lstm_dim,
decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(output_dim, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
#Define the model, that inputs the training vector for two places, and predicts one character ahead of the input
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print model.summary()
    Layer (type)                    Output Shape         Param #     Connected to                     
    input_9 (InputLayer)            (None, 27, 22)       0                                            
    lstm_9 (LSTM)                   [(None, 64), (None,  22272       input_9[0][0]                    
    concatenate_5 (Concatenate)     (None, 128)          0           lstm_9[0][1]                     
    dense_15 (Dense)                (None, 64)           8256        concatenate_5[0][0]              
    input_10 (InputLayer)           (None, 27, 22)       0                                            
    dense_16 (Dense)                (None, 64)           4160        dense_15[0][0]                   
    dense_17 (Dense)                (None, 64)           4160        dense_15[0][0]                   
    lstm_10 (LSTM)                  (None, 27, 64)       22272       input_10[0][0]                   
    dense_18 (Dense)                (None, 27, 22)       1430        lstm_10[0][0]                    
    Total params: 62,550
    Trainable params: 62,550
    Non-trainable params: 0

After preparing some Keras callbacks to record the history and reduce the learning rate once a training plateau is reached, the model is compiled with optimizer and loss function and the training can begin. Note how the X_train is fed two times to the model, to give the input at two different places in the model.

from keras.callbacks import History, ReduceLROnPlateau
h = History()
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, epsilon=1e-5)
from keras.optimizers import RMSprop, Adam
opt=Adam(lr=0.005) #Default 0.001
model.compile(optimizer=opt, loss='categorical_crossentropy')[X_train,X_train],Y_train,
                    callbacks=[h, rlr],
                    validation_data=[[X_test,X_test],Y_test ])
    Train on 50029 samples, validate on 16677 samples
    Epoch 1/200
    50029/50029 [==============================] - 21s 417us/step - loss: 0.9442 - val_loss: 0.6275
    Epoch 199/200
    50029/50029 [==============================] - 18s 363us/step - loss: 3.2088e-05 - val_loss: 2.5274e-04
    Epoch 200/200
    50029/50029 [==============================] - 18s 368us/step - loss: 3.1181e-05 - val_loss: 2.5532e-04
import pickle
pickle.dump(h.history, file("Blog_history.pickle","w"))

The training process can be plotted using the data recorded in the history object. There are some spikes in the loss and validation loss, but in the end the loss and validation loss levels more or less out. Note the Logarithmic scale make the growing difference between the loss and validation loss appear larger than it really is.

plt.plot(h.history["loss"], label="Loss")
plt.plot(h.history["val_loss"], label="Val_Loss")

A quick test directly on the model, before we reassemble the trained parts into new models show that there is a good reconstruction accuracy also on the test set. The code below outputs nothing so the 100 tested SMILES from the test set is perfectly reconstructed.

for i in range(100):
    v = model.predict([X_test[i:i+1], X_test[i:i+1]]) #Can't be done as output not necessarely 1
    idxs = np.argmax(v, axis=2)
    pred=  "".join([int_to_char[h] for h in idxs[0]])[:-1]
    idxs2 = np.argmax(X_test[i:i+1], axis=2)
    true =  "".join([int_to_char[k] for k in idxs2[0]])[1:]
    if true != pred:
        print true, pred

Now the parts of the trained autoencoder will be used to build the various encoder and decoder models. Building a new model from the input layer to an output layer inside another model is very simple. The input layer defined when the model was built it reused and the neck_output reused as the output. This model will take a vectorized smile and encode it to the latent space.

smiles_to_latent_model = Model(encoder_inputs, neck_outputs)"Blog_simple_smi2lat.h5")

The next model that will be needed is a model that can decode the latent space into the states that need to be set at the decoder LSTM cells. A new input matching the latent space is defined, but the layers from before can be reused to get the h and c states. That way the weights are inherited from the trained model.

latent_input = Input(shape=(latent_dim,))
state_h_decoded_2 =  decode_h(latent_input)
state_c_decoded_2 =  decode_c(latent_input)
latent_to_states_model = Model(latent_input, [state_h_decoded_2, state_c_decoded_2])"Blog_simple_lat2state.h5")

The decoder model needs a bit more work. It was trained in batch mode, but it will be used in stateful mode predicting one character at a time. So the layers are defined precisely as before, except with a new batch_shape and the LSTM layer set to stateful.

#Last one is special, we need to change it to stateful, and change the input shape
inf_decoder_inputs = Input(batch_shape=(1, 1, input_shape[1]))
inf_decoder_lstm = LSTM(lstm_dim,
inf_decoder_outputs = inf_decoder_lstm(inf_decoder_inputs)
inf_decoder_dense = Dense(output_dim, activation='softmax')
inf_decoder_outputs = inf_decoder_dense(inf_decoder_outputs)
sample_model = Model(inf_decoder_inputs, inf_decoder_outputs)

After defining the decoder model, the corresponding weights from the trained autoencoder model are transfered.

#Transfer Weights
for i in range(1,3):
    Layer (type)                 Output Shape              Param #   
    input_13 (InputLayer)        (1, 1, 22)                0         
    lstm_11 (LSTM)               (1, 1, 64)                22272     
    dense_19 (Dense)             (1, 1, 22)                1430      
    Total params: 23,702
    Trainable params: 23,702
    Non-trainable params: 0

Using the latent space as a fingerprint

The smiles to latent model can be used to encode the SMILES into the fingerprint like latent space

x_latent = smiles_to_latent_model.predict(X_test)

A useful feature of a fingerprint is if similar molecules produces similar fingerprints. To see if similar molecules produce similar vectors in the latent space, a simple search for similar molecules can be performed. Here the absolute difference between the latent vectors is used as a metric of similarity. This test do not rule out that similar molecules can get latent vectors that are dissimilar, but is a quick test of the immediate neighborhood.

molno = 5
latent_mol = smiles_to_latent_model.predict(X_test[molno:molno+1])
sorti = np.argsort(np.sum(np.abs(x_latent - latent_mol), axis=1))
print sorti[0:10]
print smiles_test.iloc[sorti[0:8]]
    [    5  1243 15472   744 15039 14589  2449  3991  5251 16502]
    45051    COCc1cocn1
    45007    COCc1ccoc1
    45061    COCc1conn1
    45025    CCOc1cnoc1
    45069    COCc1ncon1
    44719    COCc1ccno1
    45063    CCOc1conn1
    44727    COCc1cnno1
    Name: smiles, dtype: object

The first one is the query molecule. They do look similar, but maybe because the SMILES strings are very similar, rather than following our understanding of molecular similarity. Lets see what is the most different molecules in the latent spaces.


Chemical properties in the latent space

As a proxy for chemical properties, lets see how the calculated LogP and MR maps to a PCA reduction of the latent space

logp = smiles_test.apply(Chem.MolFromSmiles).apply(Descriptors.MolLogP)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
red = pca.fit_transform(x_latent)
plt.scatter(red[:,0], red[:,1],marker='.', c= logp)
print(pca.explained_variance_ratio_, np.sum(pca.explained_variance_ratio_))
    (array([ 0.24815299,  0.15272172], dtype=float32), 0.4008747)

molwt = smiles_test.apply(Chem.MolFromSmiles).apply(Descriptors.MolMR)
plt.scatter(red[:,0], red[:,1],marker='.', c= molwt)

There seem to be some distribution of the predicted molecular properties.

Modeling properties from the latent space

It could also be interesting to see if the fingerprints work good as a basis for QSAR models. As a proxy for the target value of QSAR modelling, a model of the predicted LogP will be built. Building a model using a modeled property is only done for demonstration purposes.

#Model LogP?
x_train_latent = smiles_to_latent_model.predict(X_train)
logp_train = smiles_train.apply(Chem.MolFromSmiles).apply(Descriptors.MolLogP)
from keras.models import Sequential
logp_model = Sequential()
logp_model.add(Dense(128, input_shape=(latent_dim,), activation="relu"))
logp_model.add(Dense(128, activation="relu"))
logp_model.compile(optimizer="adam", loss="mse")
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, epsilon=1e-5), logp_train, batch_size=128, epochs=400, callbacks = [rlr])
    Epoch 1/200
    50029/50029 [==============================] - 1s 25us/step - loss: 0.0236
    Epoch 200/200
    50029/50029 [==============================] - 1s 23us/step - loss: 0.0187
logp_pred_train = logp_model.predict(x_train_latent)
logp_pred_test = logp_model.predict(x_latent)
plt.scatter(logp, logp_pred_test, label="Test")
plt.scatter(logp_train, logp_pred_train, label="Train")

I would have expected even better fit when modeling a modeled property.

From latent space to SMILES

To sample the latent space, we need two steps. First compute the c and h states using the latent_to_states_model and then set the initial states of the decoder LSTM network. The LSTM network will be fed the input char vector and iteratively sample the next character until the end char “E” is encountered.

def latent_to_smiles(latent):
    #decode states and set Reset the LSTM cells with them
    states = latent_to_states_model.predict(latent)
    #Prepare the input char
    startidx = char_to_int["!"]
    samplevec = np.zeros((1,1,22))
    samplevec[0,0,startidx] = 1
    smiles = ""
    #Loop and predict next char
    for i in range(28):
        o = sample_model.predict(samplevec)
        sampleidx = np.argmax(o)
        samplechar = int_to_char[sampleidx]
        if samplechar != "E":
            smiles = smiles + int_to_char[sampleidx]
            samplevec = np.zeros((1,1,22))
            samplevec[0,0,sampleidx] = 1
    return smiles
smiles = latent_to_smiles(x_latent[0:1])
print smiles
print smiles_test.iloc[0]

It can be relevant to know how large a fraction of the SMILES are malformed when sampling the test set?

wrong = 0
for i in range(1000):
    smiles = latent_to_smiles(x_latent[i:i+1])
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        print smiles
        wrong = wrong + 1
print "%0.1F percent wrongly formatted smiles"%(wrong/float(1000)*100)
    0.1 percent wrongly formatted smiles

This is actually quite good. Another interesting thing could be to see if the molecules can be “interpolated” in the latent space.

#Interpolation test in latent_space
i = 0
j= 2
latent1 = x_latent[j:j+1]
latent0 = x_latent[i:i+1]
mols1 = []
ratios = np.linspace(0,1,25)
for r in ratios:
    #print r
    rlatent = (1.0-r)*latent0 + r*latent1
    smiles  = latent_to_smiles(rlatent)
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        print smiles
Draw.MolsToGridImage(mols1, molsPerRow=5)

The mis sampling seem to be higher as two out of 25 molecules could not be parsed. The interpolation also shows how the same molecule is generated for slightly different areas of the latent space. This maybe reflects that the molecules are discrete and the latent space continuous, with a higher chance of failing when we are at the border between two molecules.
Another interesting property to examine is if novel molecules more or less similar to a lead molecule can be by adding a bit of randomness to an existing latent vector.

#Sample around the latent wector
latent = x_latent[0:1]
scale = 0.40
mols = []
for i in range(20):
    latent_r = latent + scale*(np.random.randn(latent.shape[1])) #TODO, try with 
    smiles = latent_to_smiles(latent_r)
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        print smiles
Draw.MolsToGridImage(mols, molsPerRow=5)

Here we really see a lot of wrong SMILES generated, but nevertheless see some molecules that may be more or less similar to the one used to encode the latent vector. (Compare to the first molecule in the interpolation above). Tuning the scaling can be a way to tune the creativity.


In this blog post it was shown how to make an LSTM based autoencoder of SMILES strings of molecules and efficiently train it with teacher enforcing. The latent space may reflect the structure of the SMILES but nevertheless seem useful for a variety of cheminformatics related tasks. The SMILES reconstruction accuracy seem to drop when bordering domains between molecules or adding random noise. Thus the latent space may not be completely continuous reflecting the discrete nature of the molecules. I’ve not mentioned AI in this blog-post as I’m not sure how to define intelligence in the first place, but I do find it fascinating how the chemical rules of SMILES strings can be picked up by the weights and computations done in the neural networks of the LSTM cells and later used to create novel SMILES strings.

Leave a Reply

Your email address will not be published. Required fields are marked *