The Blog is a collection of News, Blog posts, technical tips and tricks and other writings from my sphere of interest.
- Esben Jannik Bjerrum


Master your molecule generator: Seq2seq RNN models with SMILES in Keras

I’ve previously written about molecular generators based on long short-term memory recurrent neural networks (LSTM-RNNs). The networks learn rules about how SMILES strings of molecules are formatted and are then able to create novel SMILES following the same rules by iterating through the characters. The results are “creative” computers that can draw correct molecules.
An interesting aspect is that the distribution of molecular properties are the same for the training dataset and the later samples “You get what you train for” could sound like a commercial for a fitness center but equally apply here.
However it takes a few thousand molecules to train one of these character based generators without over-fitting and getting the chemical rules underlying the SMILES generation correct. Pre-training on larger datasets work to some degree, but the molecules anyway end up looking strange with unlikely anti-aromatic rings.
Another approach is to make an autoencoder. The basic idea is to code the input (here SMILES strings) string to a latent space and then recreate the original input from this space. After training, the autoencoder can be split into an encoder and a decoder. The decoder can be used to sample the area around a molecule of interest (Lead-compound?) and generate similar, but not equal, molecules.
I have also written about my experiences using keras-molecules to map molecules to latent space with subsequent visualization. This code uses a convolutional encoder and a GRU based decoder.
However, it should also be possible to use an RNN based encoder, which was done by Zheng Xu and Co-workers. Their code doesn’t seem to be readily available, but read on below and see how it can be implemented with Keras.

# General Imports
import os
import pandas as pd
import numpy as np
from rdkit import Chem
from rdkit.Chem import Draw, Descriptors
from matplotlib import pyplot as plt
%matplotlib inline

There is of course a need for some SMILES strings to train and test on. To keep things a bit simple (and training shorter) for this blog post, the file with SMILES containing 8 atoms from the GDB-11 dataset will be used. It contains all molecules possible to generate with a subset of atoms and bond types. The dataset can be downloaded here: A random split into train- and test-set is done immediately before further processing.

smifile = "gdb11_size08.smi"
data = pd.read_csv(smifile, delimiter = "\t", names = ["smiles","No","Int"])
from sklearn.cross_validation import train_test_split
smiles_train, smiles_test = train_test_split(data["smiles"], random_state=42)
print smiles_train.shape
print smiles_test.shape

The SMILES must be vectorized to one-hot encoded arrays. To do this a character set is built from all characters found in the SMILES string (both train and test). Also, some start and stop characters are added, which will be used to initiate the decoder and to signal when SMILES generation has stopped. The stop character also work as padding to get the same length of all vectors, so that the network can be trained in batch mode. The character set is used to define two dictionaries to translate back and forth between index and character. The maximum length of the SMILES strings is needed as the RNN’s will be trained in batch mode, and is set to the maximum encountered + some extra.

charset = set("".join(list(data.smiles))+"!E")
char_to_int = dict((c,i) for i,c in enumerate(charset))
int_to_char = dict((i,c) for i,c in enumerate(charset))
embed = max([len(smile) for smile in data.smiles]) + 5
print str(charset)
print(len(charset), embed)
    set(['!', '#', ')', '(', '+', '-', '1', '3', '2', '4', '=', 'C', 'E', 'F', 'H', 'O', 'N', '[', ']', 'c', 'o', 'n'])
    (22, 28)

Afterwards the character set and dictionaries are used to set the necessary bits in the Numpy arrays. The result will be a “piano-roll” of each molecules SMILES string. The X data starts with !, but the output Y is offset by one character, and starts with the first character of the actual SMILES.

def vectorize(smiles):
        one_hot =  np.zeros((smiles.shape[0], embed , len(charset)),dtype=np.int8)
        for i,smile in enumerate(smiles):
            #encode the startchar
            one_hot[i,0,char_to_int["!"]] = 1
            #encode the rest of the chars
            for j,c in enumerate(smile):
                one_hot[i,j+1,char_to_int[c]] = 1
            #Encode endchar
            one_hot[i,len(smile)+1:,char_to_int["E"]] = 1
        #Return two, one for input and the other for output
        return one_hot[:,0:-1,:], one_hot[:,1:,:]
X_train, Y_train = vectorize(smiles_train.values)
X_test,Y_test = vectorize(smiles_test.values)
print smiles_train.iloc[0]
#print X_train.shape

The int_to_char dictionary can be used to go from vectorized form back to a readable string, here with a joined list comprehension.

"".join([int_to_char[idx] for idx in np.argmax(X_train[0,:,:], axis=1)])

With the vectorized data ready, it time to build the autoencoder. First some Keras objects will be imported and the dimensions for the input and output calculated from the vectorized data. Additionally the number of LSTM cell to use for the decoder and encoder is specified and the latent dimension is specified.

#Import Keras objects
from keras.models import Model
from keras.layers import Input
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Concatenate
from keras import regularizers
input_shape = X_train.shape[1:]
output_dim = Y_train.shape[-1]
latent_dim = 64
lstm_dim = 64
    Using TensorFlow backend.

Long Short-Term Memory Cells (LSTM)

It may sound like an oxymoron, but long short-term memory cells are special kinds of neural network units that are designed to keep an internal state for longer iterations through a recurrent neural network.  They have been designed with input, output and forget gates, that control what to do with the cells internal state C. The state H is passed as a copy between the recurrent iterations and helps determine what should be done with the internal state C in the next iteration. The gating mechanisms ensures that the network can easily retain an internal state by closing off the input and forget gates until a new condition or input opens them. They work as drop-in replacements for standard recurrent units and usually increase the performance of the RNN with regard to recognizing long term interactions. Luckily they are already implemented in Keras. An also popular, simplified version is the gated recurrent unit (GRU) that is also available in Keras.

The model is overall kept rather simple. A single layer of 64 LSTM cells is used to read the input SMILES string. The outputs from the layer is ignored, but the final internal C and H state is concatenated and recombined into a the bottleneck layer “neck”. This comprises the encoder.
The functional API of Keras is used and this will make it easier to rebuilt the various layers into encoder and decoder models later.

unroll = False
encoder_inputs = Input(shape=input_shape)
encoder = LSTM(lstm_dim, return_state=True,
encoder_outputs, state_h, state_c = encoder(encoder_inputs)
states = Concatenate(axis=-1)([state_h, state_c])
neck = Dense(latent_dim, activation="relu")
neck_outputs = neck(states)

The neck_output tensor is then put through two different Dense layers to decode the states that should be set on the decoder LSTM layer. This recombination of the internal states from the encoder to decoder, makes it possible to use a different size latent space than encoded by the LSTM layers in themselves. The LSTM layer then receives the input once again, and is put to the task of predicting the next character in the sequence. So from the latent representation of the molecule and the character “!”, it should output what the next character is, representing the first atom, e.g. “C”, “N” etc. This is why the Y vectors are offset from the X vectors with the “!”character. This way of training is called teacher enforcing in contrast to the approach where the output of one recurrence step is used as input to the next. Even though the network may make a mistake in the start of the sequence, it will not affect the training of the later parts of the sequence. This trains the decoder much more efficiently, than just feeding it the internal states and then see if it can construct the whole sequence. The LSTM cells output at each step is put into a Dense network with each neurons task to predict the correct character.
The final model for training is then built from the input-layer objects and the output layer.

decode_h = Dense(lstm_dim, activation="relu")
decode_c = Dense(lstm_dim, activation="relu")
state_h_decoded =  decode_h(neck_outputs)
state_c_decoded =  decode_c(neck_outputs)
encoder_states = [state_h_decoded, state_c_decoded]
decoder_inputs = Input(shape=input_shape)
decoder_lstm = LSTM(lstm_dim,
decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = Dense(output_dim, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
#Define the model, that inputs the training vector for two places, and predicts one character ahead of the input
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
print model.summary()
    Layer (type)                    Output Shape         Param #     Connected to                     
    input_9 (InputLayer)            (None, 27, 22)       0                                            
    lstm_9 (LSTM)                   [(None, 64), (None,  22272       input_9[0][0]                    
    concatenate_5 (Concatenate)     (None, 128)          0           lstm_9[0][1]                     
    dense_15 (Dense)                (None, 64)           8256        concatenate_5[0][0]              
    input_10 (InputLayer)           (None, 27, 22)       0                                            
    dense_16 (Dense)                (None, 64)           4160        dense_15[0][0]                   
    dense_17 (Dense)                (None, 64)           4160        dense_15[0][0]                   
    lstm_10 (LSTM)                  (None, 27, 64)       22272       input_10[0][0]                   
    dense_18 (Dense)                (None, 27, 22)       1430        lstm_10[0][0]                    
    Total params: 62,550
    Trainable params: 62,550
    Non-trainable params: 0

After preparing some Keras callbacks to record the history and reduce the learning rate once a training plateau is reached, the model is compiled with optimizer and loss function and the training can begin. Note how the X_train is fed two times to the model, to give the input at two different places in the model.

from keras.callbacks import History, ReduceLROnPlateau
h = History()
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, epsilon=1e-5)
from keras.optimizers import RMSprop, Adam
opt=Adam(lr=0.005) #Default 0.001
model.compile(optimizer=opt, loss='categorical_crossentropy')[X_train,X_train],Y_train,
                    callbacks=[h, rlr],
                    validation_data=[[X_test,X_test],Y_test ])
    Train on 50029 samples, validate on 16677 samples
    Epoch 1/200
    50029/50029 [==============================] - 21s 417us/step - loss: 0.9442 - val_loss: 0.6275
    Epoch 199/200
    50029/50029 [==============================] - 18s 363us/step - loss: 3.2088e-05 - val_loss: 2.5274e-04
    Epoch 200/200
    50029/50029 [==============================] - 18s 368us/step - loss: 3.1181e-05 - val_loss: 2.5532e-04
import pickle
pickle.dump(h.history, file("Blog_history.pickle","w"))

The training process can be plotted using the data recorded in the history object. There are some spikes in the loss and validation loss, but in the end the loss and validation loss levels more or less out. Note the Logarithmic scale make the growing difference between the loss and validation loss appear larger than it really is.

plt.plot(h.history["loss"], label="Loss")
plt.plot(h.history["val_loss"], label="Val_Loss")

A quick test directly on the model, before we reassemble the trained parts into new models show that there is a good reconstruction accuracy also on the test set. The code below outputs nothing so the 100 tested SMILES from the test set is perfectly reconstructed.

for i in range(100):
    v = model.predict([X_test[i:i+1], X_test[i:i+1]]) #Can't be done as output not necessarely 1
    idxs = np.argmax(v, axis=2)
    pred=  "".join([int_to_char[h] for h in idxs[0]])[:-1]
    idxs2 = np.argmax(X_test[i:i+1], axis=2)
    true =  "".join([int_to_char[k] for k in idxs2[0]])[1:]
    if true != pred:
        print true, pred

Now the parts of the trained autoencoder will be used to build the various encoder and decoder models. Building a new model from the input layer to an output layer inside another model is very simple. The input layer defined when the model was built it reused and the neck_output reused as the output. This model will take a vectorized smile and encode it to the latent space.

smiles_to_latent_model = Model(encoder_inputs, neck_outputs)"Blog_simple_smi2lat.h5")

The next model that will be needed is a model that can decode the latent space into the states that need to be set at the decoder LSTM cells. A new input matching the latent space is defined, but the layers from before can be reused to get the h and c states. That way the weights are inherited from the trained model.

latent_input = Input(shape=(latent_dim,))
state_h_decoded_2 =  decode_h(latent_input)
state_c_decoded_2 =  decode_c(latent_input)
latent_to_states_model = Model(latent_input, [state_h_decoded_2, state_c_decoded_2])"Blog_simple_lat2state.h5")

The decoder model needs a bit more work. It was trained in batch mode, but it will be used in stateful mode predicting one character at a time. So the layers are defined precisely as before, except with a new batch_shape and the LSTM layer set to stateful.

#Last one is special, we need to change it to stateful, and change the input shape
inf_decoder_inputs = Input(batch_shape=(1, 1, input_shape[1]))
inf_decoder_lstm = LSTM(lstm_dim,
inf_decoder_outputs = inf_decoder_lstm(inf_decoder_inputs)
inf_decoder_dense = Dense(output_dim, activation='softmax')
inf_decoder_outputs = inf_decoder_dense(inf_decoder_outputs)
sample_model = Model(inf_decoder_inputs, inf_decoder_outputs)

After defining the decoder model, the corresponding weights from the trained autoencoder model are transfered.

#Transfer Weights
for i in range(1,3):
    Layer (type)                 Output Shape              Param #   
    input_13 (InputLayer)        (1, 1, 22)                0         
    lstm_11 (LSTM)               (1, 1, 64)                22272     
    dense_19 (Dense)             (1, 1, 22)                1430      
    Total params: 23,702
    Trainable params: 23,702
    Non-trainable params: 0

Using the latent space as a fingerprint

The smiles to latent model can be used to encode the SMILES into the fingerprint like latent space

x_latent = smiles_to_latent_model.predict(X_test)

A useful feature of a fingerprint is if similar molecules produces similar fingerprints. To see if similar molecules produce similar vectors in the latent space, a simple search for similar molecules can be performed. Here the absolute difference between the latent vectors is used as a metric of similarity. This test do not rule out that similar molecules can get latent vectors that are dissimilar, but is a quick test of the immediate neighborhood.

molno = 5
latent_mol = smiles_to_latent_model.predict(X_test[molno:molno+1])
sorti = np.argsort(np.sum(np.abs(x_latent - latent_mol), axis=1))
print sorti[0:10]
print smiles_test.iloc[sorti[0:8]]
    [    5  1243 15472   744 15039 14589  2449  3991  5251 16502]
    45051    COCc1cocn1
    45007    COCc1ccoc1
    45061    COCc1conn1
    45025    CCOc1cnoc1
    45069    COCc1ncon1
    44719    COCc1ccno1
    45063    CCOc1conn1
    44727    COCc1cnno1
    Name: smiles, dtype: object

The first one is the query molecule. They do look similar, but maybe because the SMILES strings are very similar, rather than following our understanding of molecular similarity. Lets see what is the most different molecules in the latent spaces.


Chemical properties in the latent space

As a proxy for chemical properties, lets see how the calculated LogP and MR maps to a PCA reduction of the latent space

logp = smiles_test.apply(Chem.MolFromSmiles).apply(Descriptors.MolLogP)
from sklearn.decomposition import PCA
pca = PCA(n_components = 2)
red = pca.fit_transform(x_latent)
plt.scatter(red[:,0], red[:,1],marker='.', c= logp)
print(pca.explained_variance_ratio_, np.sum(pca.explained_variance_ratio_))
    (array([ 0.24815299,  0.15272172], dtype=float32), 0.4008747)

molwt = smiles_test.apply(Chem.MolFromSmiles).apply(Descriptors.MolMR)
plt.scatter(red[:,0], red[:,1],marker='.', c= molwt)

There seem to be some distribution of the predicted molecular properties.

Modeling properties from the latent space

It could also be interesting to see if the fingerprints work good as a basis for QSAR models. As a proxy for the target value of QSAR modelling, a model of the predicted LogP will be built. Building a model using a modeled property is only done for demonstration purposes.

#Model LogP?
x_train_latent = smiles_to_latent_model.predict(X_train)
logp_train = smiles_train.apply(Chem.MolFromSmiles).apply(Descriptors.MolLogP)
from keras.models import Sequential
logp_model = Sequential()
logp_model.add(Dense(128, input_shape=(latent_dim,), activation="relu"))
logp_model.add(Dense(128, activation="relu"))
logp_model.compile(optimizer="adam", loss="mse")
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, epsilon=1e-5), logp_train, batch_size=128, epochs=400, callbacks = [rlr])
    Epoch 1/200
    50029/50029 [==============================] - 1s 25us/step - loss: 0.0236
    Epoch 200/200
    50029/50029 [==============================] - 1s 23us/step - loss: 0.0187
logp_pred_train = logp_model.predict(x_train_latent)
logp_pred_test = logp_model.predict(x_latent)
plt.scatter(logp, logp_pred_test, label="Test")
plt.scatter(logp_train, logp_pred_train, label="Train")

I would have expected even better fit when modeling a modeled property.

From latent space to SMILES

To sample the latent space, we need two steps. First compute the c and h states using the latent_to_states_model and then set the initial states of the decoder LSTM network. The LSTM network will be fed the input char vector and iteratively sample the next character until the end char “E” is encountered.

def latent_to_smiles(latent):
    #decode states and set Reset the LSTM cells with them
    states = latent_to_states_model.predict(latent)
    #Prepare the input char
    startidx = char_to_int["!"]
    samplevec = np.zeros((1,1,22))
    samplevec[0,0,startidx] = 1
    smiles = ""
    #Loop and predict next char
    for i in range(28):
        o = sample_model.predict(samplevec)
        sampleidx = np.argmax(o)
        samplechar = int_to_char[sampleidx]
        if samplechar != "E":
            smiles = smiles + int_to_char[sampleidx]
            samplevec = np.zeros((1,1,22))
            samplevec[0,0,sampleidx] = 1
    return smiles
smiles = latent_to_smiles(x_latent[0:1])
print smiles
print smiles_test.iloc[0]

It can be relevant to know how large a fraction of the SMILES are malformed when sampling the test set?

wrong = 0
for i in range(1000):
    smiles = latent_to_smiles(x_latent[i:i+1])
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        print smiles
        wrong = wrong + 1
print "%0.1F percent wrongly formatted smiles"%(wrong/float(1000)*100)
    0.1 percent wrongly formatted smiles

This is actually quite good. Another interesting thing could be to see if the molecules can be “interpolated” in the latent space.

#Interpolation test in latent_space
i = 0
j= 2
latent1 = x_latent[j:j+1]
latent0 = x_latent[i:i+1]
mols1 = []
ratios = np.linspace(0,1,25)
for r in ratios:
    #print r
    rlatent = (1.0-r)*latent0 + r*latent1
    smiles  = latent_to_smiles(rlatent)
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        print smiles
Draw.MolsToGridImage(mols1, molsPerRow=5)

The mis sampling seem to be higher as two out of 25 molecules could not be parsed. The interpolation also shows how the same molecule is generated for slightly different areas of the latent space. This maybe reflects that the molecules are discrete and the latent space continuous, with a higher chance of failing when we are at the border between two molecules.
Another interesting property to examine is if novel molecules more or less similar to a lead molecule can be by adding a bit of randomness to an existing latent vector.

#Sample around the latent wector
latent = x_latent[0:1]
scale = 0.40
mols = []
for i in range(20):
    latent_r = latent + scale*(np.random.randn(latent.shape[1])) #TODO, try with 
    smiles = latent_to_smiles(latent_r)
    mol = Chem.MolFromSmiles(smiles)
    if mol:
        print smiles
Draw.MolsToGridImage(mols, molsPerRow=5)

Here we really see a lot of wrong SMILES generated, but nevertheless see some molecules that may be more or less similar to the one used to encode the latent vector. (Compare to the first molecule in the interpolation above). Tuning the scaling can be a way to tune the creativity.


In this blog post it was shown how to make an LSTM based autoencoder of SMILES strings of molecules and efficiently train it with teacher enforcing. The latent space may reflect the structure of the SMILES but nevertheless seem useful for a variety of cheminformatics related tasks. The SMILES reconstruction accuracy seem to drop when bordering domains between molecules or adding random noise. Thus the latent space may not be completely continuous reflecting the discrete nature of the molecules. I’ve not mentioned AI in this blog-post as I’m not sure how to define intelligence in the first place, but I do find it fascinating how the chemical rules of SMILES strings can be picked up by the weights and computations done in the neural networks of the LSTM cells and later used to create novel SMILES strings.


  1. Josh
    February 15, 2018 at 17:42 Reply


    How can I save and load the model(s)?

    • Esben Jannik Bjerrum
      February 15, 2018 at 20:19 Reply

      The models are Keras models and are easily saved and loaded via the‘your_filename.h5’) and the load_model function from keras.models. You can read more in the Keras FAQ:

      But you will also need to save the char_to_int and int_to_char for correctly encoding and parsing the one-hot encoded characters when you want to use the models again. You can either use pickle and save to an auxiliary file.
      Alternatively, the Keras models configuration and weights are stored in a HDF formated file which can be handled in Python with the h5py package:
      The hdf encoded Keras model file can thus be opened with h5py and the dictionaries stored there as well.

    • Josh
      February 15, 2018 at 22:13 Reply

      Great, thanks, Esben that worked incredibly well!

      What would be the best method to use to sample the latent space to generate novel molecules as smile strings? Is it possible to generate molecules from a random seed each time? Would I just apply scaling or interpolation as shown in your blog post? In addition, where does the range(28) come from in the code? Is it possible to speed up the code when generating multiple molecules from the latent space? I tried looping the code in the post and it took a few seconds for each molecule, shouldn’t it be as fast as calling .predict() on a model?

      Lastly, is it possible to apply transfer learning to this seq2seq model? Say I had data of 20 molecules that were biologically active and bound to a specific biological target, how would I apply transfer learning to tune the model to generate molecules with similar properties to the biologically active molecules? Would I just load the weights of the initial model and then retrain with a few epochs and an extremely small learning rate?

      Thank you for your reply!

    • bergi
      April 14, 2018 at 23:20 Reply

      I developed Keras Gaia for the same use case:

      It simplifies saving and loading of models by using project descriptions like this one:

      There is also an abstraction layer between the data and the numpy arrays processed by Keras. The data is stored with namespaces. That makes testing different datasets and models much easier.

  2. Esben Jannik Bjerrum
    February 16, 2018 at 10:17 Reply

    Wow, what a lot of questions. I’m not sure short answers are possible for all of them. I tried to keep the code in the blog post as simple as possible, leaving out some more advanced options. Just to give an idea of the process and a skeleton to build further on. But maybe the code is too simple for production use.

    > In addition, where does the range(28) come from in the code?

    For this one I can give a simple answer so I start with this: The range 28 comes from the size of the batch training. No SMILES + padding was longer than 28 in the training set, so it is highly unlikely that the generator would output longer sequences. The code any way stop sampling as soon as a stop codon is encountered, so it is just there to prevent a badly trained network to enter an infinite loop if it never outputs a stop codon.

    > What would be the best method to use to sample the latent space to generate novel molecules as smile strings? Is it possible to generate molecules from a random seed each time? Would I just apply scaling or interpolation as shown in your blog post?

    I think this depends on you intended use case? Do you just want to sample a lot of novel molecules with similar properties as the training set? then maybe a charRNN based approach actually works best. I published a preprint about this:

    One of the advantage points of an auto-encoder is to be able to steer the generator towards molecules of interest. For this you will need the latent space vector of an existing molecule. In your example with 20 active molecules, you could try and use the encoder to get 20 vectors. And then try and interpolate them or randomly permute each of them some and see what comes out. But there may be some issues with this, which I’ll try and cover briefly below.

    What to use for random permutation, linear random? Gaussian random? magnitude? All permuted, or just some? If this has been implemented as a variational auto-encoder the latent representation would contain both means and variance, which could be used as a starting point for sampling.

    Also, In the code example in the blog, I simply use the output character with the highest likelihood. It is also possible to sample from the output probability distribution at each character, which could lead to different molecules for multiple different samplings from the same coordinates. Especially if the probability space is rescaled a bit (the temperature factor, some details in

    > Is it possible to speed up the code when generating multiple molecules from the latent space? I tried looping the code in the post and it took a few seconds for each molecule, shouldn’t it be as fast as calling .predict() on a model?

    I’m not sure what part of the code you are looping over? It doesn’t take terrible long for me. Anyway, the generator is changed from batch mode to a stateful mode, thus calling predict for each character on a single molecule. If you want to generate many molecules, it could be done in parallel by changing the batch shape of the input to receive many chars in parallel, and then also adapt the sampling loop to work in parallel. How many molecules you can generate in parallel will depend on the size of your memory. For really fast generation it would be possible to just keep the generator in batch mode, giving you the molecules with one call to predict. But then it would not be possible to sample the probability distribution at each character and other fancy stuff. And the architecture would maybe also need to be changed to one that doesn’t uses teacher enforcing?

    > Lastly, is it possible to apply transfer learning to this seq2seq model? Say I had data of 20 molecules that were biologically active and bound to a specific biological target, how would I apply transfer learning to tune the model to generate molecules with similar properties to the biologically active molecules? Would I just load the weights of the initial model and then retrain with a few epochs and an extremely small learning rate?

    I think transfer learning for such a small sample would be very difficult. It is possible, but the output molecules quickly gets queer. I did some experimentation long time ago: , note the high proportion of highly unlikely anti-aromatic rings in the output. On the other hand, the retrained network is better than training from scratch on the small dataset and maybe I just re-trained for too long with too high a learning rate. Please, let me know the results if you experiment with this.

    For the use-case you sketch (without knowing your precise setup or intended use) I would train the network on a larger database of SMILES strings that have properties of interest (say filtered for size, clogp, HBD, HBA, bio activity on other targets etc. etc.) You would need to adjust the batch length of the input, once you know the length of your SMILES. Then after training, sample the latent space from your existing molecules, and use that as starting point for sampling. Check the distances between the latent vectors of the known bio actives. They could end up in very different parts of the latent space, as the model is quite simple and unrestrained with respect to the latent space. If they are close you could try and do an interpolation, otherwise experiment with random sampling around the latent vectors.

    The output SMILES will also be very similar to the input SMILES, as the network tries to reconstruct the SMILES and not the underlying molecule. But as each molecule can have several different non-canonical SMILES strings, it could be possible to enumerate these first, so that your 20 molecules becomes 2000 SMILES strings, giving 2000 vectors, sampled to 20000 new SMILES strings. I wrote something about the SMILES enumeration here: And there is some generator code here:

    However, I can’t give you an exact answer to what is “best”. This is an area of active research. There is actually a community effort starting to generate benchmarks and compare the various methods:
    I don’t know if it will take off and produce something useful, but you are welcome to follow and possibly contribute.

    I hope my answers clarified things somewhat? If not it may actually be more productive to have a brief chat about this over Skype or similar if you want to start a project or collaborate or something.

  3. Zheng Xu
    March 20, 2018 at 06:03 Reply

    It is nice to know you are sharing similar interests in applying RNN in molecular representations.

    We have some preliminary results with seq2seq model too.
    The paper is called “Seq2seq Fingerprint: An Unsupervised Deep Molecular Embedding for Drug Discovery
    ” (
    The code is located in
    DeepChem folks make a nice online tutorial at

    • Esben Jannik Bjerrum
      March 20, 2018 at 10:00 Reply

      Hi, Thanks for commenting and the update. I already link to your cool publication in the introduction of this post. Good that you posted the code and thank you for the link to the deepchem page. What do you think of the possibilities to make some project together?

      • Zheng Xu
        March 24, 2018 at 21:16 Reply

        Yeah. Sure.

        Let me know if you have any cooperating ideas.

Leave a Reply

Your email address will not be published. Required fields are marked *