Master your molecule generator: Seq2seq RNN models with SMILES in Keras
UPDATE: Be sure to check out the follow-up to this post if you want to improve the model: Learn how to improve SMILES based molecular autoencoders with heteroencoders
I’ve previously written about molecular generators based on long short-term memory recurrent neural networks (LSTM-RNNs). The networks learn rules about how SMILES strings of molecules are formatted and are then able to create novel SMILES following the same rules by iterating through the characters. The results are “creative” computers that can draw correct molecules.
An interesting aspect is that the distribution of molecular properties are the same for the training dataset and the later samples https://arxiv.org/abs/1705.04612. “You get what you train for” could sound like a commercial for a fitness center but equally apply here.
However it takes a few thousand molecules to train one of these character based generators without over-fitting and getting the chemical rules underlying the SMILES generation correct. Pre-training on larger datasets work to some degree, but the molecules anyway end up looking strange with unlikely anti-aromatic rings.
Another approach is to make an autoencoder. The basic idea is to code the input (here SMILES strings) string to a latent space and then recreate the original input from this space. After training, the autoencoder can be split into an encoder and a decoder. The decoder can be used to sample the area around a molecule of interest (Lead-compound?) and generate similar, but not equal, molecules.
I have also written about my experiences using keras-molecules to map molecules to latent space with subsequent visualization. This code uses a convolutional encoder and a GRU based decoder.
However, it should also be possible to use an RNN based encoder, which was done by Zheng Xu and Co-workers. Their code doesn’t seem to be readily available, but read on below and see how it can be implemented with Keras.
# General Imports import os import pandas as pd import numpy as np from rdkit import Chem from rdkit.Chem import Draw, Descriptors from matplotlib import pyplot as plt %matplotlib inline
There is of course a need for some SMILES strings to train and test on. To keep things a bit simple (and training shorter) for this blog post, the file with SMILES containing 8 atoms from the GDB-11 dataset will be used. It contains all molecules possible to generate with a subset of atoms and bond types. The dataset can be downloaded here: http://gdb.unibe.ch/downloads/. A random split into train- and test-set is done immediately before further processing.
smifile = "gdb11_size08.smi" data = pd.read_csv(smifile, delimiter = "\t", names = ["smiles","No","Int"]) from sklearn.cross_validation import train_test_split smiles_train, smiles_test = train_test_split(data["smiles"], random_state=42) print smiles_train.shape print smiles_test.shape
The SMILES must be vectorized to one-hot encoded arrays. To do this a character set is built from all characters found in the SMILES string (both train and test). Also, some start and stop characters are added, which will be used to initiate the decoder and to signal when SMILES generation has stopped. The stop character also work as padding to get the same length of all vectors, so that the network can be trained in batch mode. The character set is used to define two dictionaries to translate back and forth between index and character. The maximum length of the SMILES strings is needed as the RNN’s will be trained in batch mode, and is set to the maximum encountered + some extra.
charset = set("".join(list(data.smiles))+"!E") char_to_int = dict((c,i) for i,c in enumerate(charset)) int_to_char = dict((i,c) for i,c in enumerate(charset)) embed = max([len(smile) for smile in data.smiles]) + 5 print str(charset) print(len(charset), embed)
set(['!', '#', ')', '(', '+', '-', '1', '3', '2', '4', '=', 'C', 'E', 'F', 'H', 'O', 'N', '[', ']', 'c', 'o', 'n']) (22, 28)
Afterwards the character set and dictionaries are used to set the necessary bits in the Numpy arrays. The result will be a “piano-roll” of each molecules SMILES string. The X data starts with !, but the output Y is offset by one character, and starts with the first character of the actual SMILES.
def vectorize(smiles): one_hot = np.zeros((smiles.shape, embed , len(charset)),dtype=np.int8) for i,smile in enumerate(smiles): #encode the startchar one_hot[i,0,char_to_int["!"]] = 1 #encode the rest of the chars for j,c in enumerate(smile): one_hot[i,j+1,char_to_int[c]] = 1 #Encode endchar one_hot[i,len(smile)+1:,char_to_int["E"]] = 1 #Return two, one for input and the other for output return one_hot[:,0:-1,:], one_hot[:,1:,:] X_train, Y_train = vectorize(smiles_train.values) X_test,Y_test = vectorize(smiles_test.values) print smiles_train.iloc plt.matshow(X_train.T) #print X_train.shape
The int_to_char dictionary can be used to go from vectorized form back to a readable string, here with a joined list comprehension.
"".join([int_to_char[idx] for idx in np.argmax(X_train[0,:,:], axis=1)])
With the vectorized data ready, it time to build the autoencoder. First some Keras objects will be imported and the dimensions for the input and output calculated from the vectorized data. Additionally the number of LSTM cell to use for the decoder and encoder is specified and the latent dimension is specified.
#Import Keras objects from keras.models import Model from keras.layers import Input from keras.layers import LSTM from keras.layers import Dense from keras.layers import Concatenate from keras import regularizers input_shape = X_train.shape[1:] output_dim = Y_train.shape[-1] latent_dim = 64 lstm_dim = 64
Using TensorFlow backend.
Long Short-Term Memory Cells (LSTM)
It may sound like an oxymoron, but long short-term memory cells are special kinds of neural network units that are designed to keep an internal state for longer iterations through a recurrent neural network. They have been designed with input, output and forget gates, that control what to do with the cells internal state C. The state H is passed as a copy between the recurrent iterations and helps determine what should be done with the internal state C in the next iteration. The gating mechanisms ensures that the network can easily retain an internal state by closing off the input and forget gates until a new condition or input opens them. They work as drop-in replacements for standard recurrent units and usually increase the performance of the RNN with regard to recognizing long term interactions. Luckily they are already implemented in Keras. An also popular, simplified version is the gated recurrent unit (GRU) that is also available in Keras.
The model is overall kept rather simple. A single layer of 64 LSTM cells is used to read the input SMILES string. The outputs from the layer is ignored, but the final internal C and H state is concatenated and recombined into a the bottleneck layer “neck”. This comprises the encoder.
The functional API of Keras is used and this will make it easier to rebuilt the various layers into encoder and decoder models later.
unroll = False encoder_inputs = Input(shape=input_shape) encoder = LSTM(lstm_dim, return_state=True, unroll=unroll) encoder_outputs, state_h, state_c = encoder(encoder_inputs) states = Concatenate(axis=-1)([state_h, state_c]) neck = Dense(latent_dim, activation="relu") neck_outputs = neck(states)
The neck_output tensor is then put through two different Dense layers to decode the states that should be set on the decoder LSTM layer. This recombination of the internal states from the encoder to decoder, makes it possible to use a different size latent space than encoded by the LSTM layers in themselves. The LSTM layer then receives the input once again, and is put to the task of predicting the next character in the sequence. So from the latent representation of the molecule and the character “!”, it should output what the next character is, representing the first atom, e.g. “C”, “N” etc. This is why the Y vectors are offset from the X vectors with the “!”character. This way of training is called teacher enforcing in contrast to the approach where the output of one recurrence step is used as input to the next. Even though the network may make a mistake in the start of the sequence, it will not affect the training of the later parts of the sequence. This trains the decoder much more efficiently, than just feeding it the internal states and then see if it can construct the whole sequence. The LSTM cells output at each step is put into a Dense network with each neurons task to predict the correct character.
The final model for training is then built from the input-layer objects and the output layer.
decode_h = Dense(lstm_dim, activation="relu") decode_c = Dense(lstm_dim, activation="relu") state_h_decoded = decode_h(neck_outputs) state_c_decoded = decode_c(neck_outputs) encoder_states = [state_h_decoded, state_c_decoded] decoder_inputs = Input(shape=input_shape) decoder_lstm = LSTM(lstm_dim, return_sequences=True, unroll=unroll ) decoder_outputs = decoder_lstm(decoder_inputs, initial_state=encoder_states) decoder_dense = Dense(output_dim, activation='softmax') decoder_outputs = decoder_dense(decoder_outputs) #Define the model, that inputs the training vector for two places, and predicts one character ahead of the input model = Model([encoder_inputs, decoder_inputs], decoder_outputs) print model.summary()
__________________________________________________________________________________________________ Layer (type) Output Shape Param # Connected to ================================================================================================== input_9 (InputLayer) (None, 27, 22) 0 __________________________________________________________________________________________________ lstm_9 (LSTM) [(None, 64), (None, 22272 input_9 __________________________________________________________________________________________________ concatenate_5 (Concatenate) (None, 128) 0 lstm_9 lstm_9 __________________________________________________________________________________________________ dense_15 (Dense) (None, 64) 8256 concatenate_5 __________________________________________________________________________________________________ input_10 (InputLayer) (None, 27, 22) 0 __________________________________________________________________________________________________ dense_16 (Dense) (None, 64) 4160 dense_15 __________________________________________________________________________________________________ dense_17 (Dense) (None, 64) 4160 dense_15 __________________________________________________________________________________________________ lstm_10 (LSTM) (None, 27, 64) 22272 input_10 dense_16 dense_17 __________________________________________________________________________________________________ dense_18 (Dense) (None, 27, 22) 1430 lstm_10 ================================================================================================== Total params: 62,550 Trainable params: 62,550 Non-trainable params: 0 __________________________________________________________________________________________________ None
After preparing some Keras callbacks to record the history and reduce the learning rate once a training plateau is reached, the model is compiled with optimizer and loss function and the training can begin. Note how the X_train is fed two times to the model, to give the input at two different places in the model.
from keras.callbacks import History, ReduceLROnPlateau h = History() rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, epsilon=1e-5)
from keras.optimizers import RMSprop, Adam opt=Adam(lr=0.005) #Default 0.001 model.compile(optimizer=opt, loss='categorical_crossentropy')
model.fit([X_train,X_train],Y_train, epochs=200, batch_size=256, shuffle=True, callbacks=[h, rlr], validation_data=[[X_test,X_test],Y_test ])
Train on 50029 samples, validate on 16677 samples Epoch 1/200 50029/50029 [==============================] - 21s 417us/step - loss: 0.9442 - val_loss: 0.6275 ....... Epoch 199/200 50029/50029 [==============================] - 18s 363us/step - loss: 3.2088e-05 - val_loss: 2.5274e-04 Epoch 200/200 50029/50029 [==============================] - 18s 368us/step - loss: 3.1181e-05 - val_loss: 2.5532e-04
import pickle pickle.dump(h.history, file("Blog_history.pickle","w"))
The training process can be plotted using the data recorded in the history object. There are some spikes in the loss and validation loss, but in the end the loss and validation loss levels more or less out. Note the Logarithmic scale make the growing difference between the loss and validation loss appear larger than it really is.
plt.plot(h.history["loss"], label="Loss") plt.plot(h.history["val_loss"], label="Val_Loss") plt.yscale("log") plt.legend()
A quick test directly on the model, before we reassemble the trained parts into new models show that there is a good reconstruction accuracy also on the test set. The code below outputs nothing so the 100 tested SMILES from the test set is perfectly reconstructed.
for i in range(100): v = model.predict([X_test[i:i+1], X_test[i:i+1]]) #Can't be done as output not necessarely 1 idxs = np.argmax(v, axis=2) pred= "".join([int_to_char[h] for h in idxs])[:-1] idxs2 = np.argmax(X_test[i:i+1], axis=2) true = "".join([int_to_char[k] for k in idxs2])[1:] if true != pred: print true, pred
Now the parts of the trained autoencoder will be used to build the various encoder and decoder models. Building a new model from the input layer to an output layer inside another model is very simple. The input layer defined when the model was built it reused and the neck_output reused as the output. This model will take a vectorized smile and encode it to the latent space.
smiles_to_latent_model = Model(encoder_inputs, neck_outputs)
The next model that will be needed is a model that can decode the latent space into the states that need to be set at the decoder LSTM cells. A new input matching the latent space is defined, but the layers from before can be reused to get the h and c states. That way the weights are inherited from the trained model.
latent_input = Input(shape=(latent_dim,)) #reuse_layers state_h_decoded_2 = decode_h(latent_input) state_c_decoded_2 = decode_c(latent_input) latent_to_states_model = Model(latent_input, [state_h_decoded_2, state_c_decoded_2]) latent_to_states_model.save("Blog_simple_lat2state.h5")
The decoder model needs a bit more work. It was trained in batch mode, but it will be used in stateful mode predicting one character at a time. So the layers are defined precisely as before, except with a new batch_shape and the LSTM layer set to stateful.
#Last one is special, we need to change it to stateful, and change the input shape inf_decoder_inputs = Input(batch_shape=(1, 1, input_shape)) inf_decoder_lstm = LSTM(lstm_dim, return_sequences=True, unroll=unroll, stateful=True ) inf_decoder_outputs = inf_decoder_lstm(inf_decoder_inputs) inf_decoder_dense = Dense(output_dim, activation='softmax') inf_decoder_outputs = inf_decoder_dense(inf_decoder_outputs) sample_model = Model(inf_decoder_inputs, inf_decoder_outputs)
After defining the decoder model, the corresponding weights from the trained autoencoder model are transfered.
#Transfer Weights for i in range(1,3): sample_model.layers[i].set_weights(model.layers[i+6].get_weights()) sample_model.save("Blog_simple_samplemodel.h5")
_________________________________________________________________ Layer (type) Output Shape Param # ================================================================= input_13 (InputLayer) (1, 1, 22) 0 _________________________________________________________________ lstm_11 (LSTM) (1, 1, 64) 22272 _________________________________________________________________ dense_19 (Dense) (1, 1, 22) 1430 ================================================================= Total params: 23,702 Trainable params: 23,702 Non-trainable params: 0 _________________________________________________________________
Using the latent space as a fingerprint
The smiles to latent model can be used to encode the SMILES into the fingerprint like latent space
x_latent = smiles_to_latent_model.predict(X_test)
A useful feature of a fingerprint is if similar molecules produces similar fingerprints. To see if similar molecules produce similar vectors in the latent space, a simple search for similar molecules can be performed. Here the absolute difference between the latent vectors is used as a metric of similarity. This test do not rule out that similar molecules can get latent vectors that are dissimilar, but is a quick test of the immediate neighborhood.
molno = 5 latent_mol = smiles_to_latent_model.predict(X_test[molno:molno+1]) sorti = np.argsort(np.sum(np.abs(x_latent - latent_mol), axis=1)) print sorti[0:10] print smiles_test.iloc[sorti[0:8]] Draw.MolsToImage(smiles_test.iloc[sorti[0:8]].apply(Chem.MolFromSmiles))
[ 5 1243 15472 744 15039 14589 2449 3991 5251 16502] 45051 COCc1cocn1 45007 COCc1ccoc1 45061 COCc1conn1 45025 CCOc1cnoc1 45069 COCc1ncon1 44719 COCc1ccno1 45063 CCOc1conn1 44727 COCc1cnno1 Name: smiles, dtype: object
The first one is the query molecule. They do look similar, but maybe because the SMILES strings are very similar, rather than following our understanding of molecular similarity. Lets see what is the most different molecules in the latent spaces.
Chemical properties in the latent space
As a proxy for chemical properties, lets see how the calculated LogP and MR maps to a PCA reduction of the latent space
logp = smiles_test.apply(Chem.MolFromSmiles).apply(Descriptors.MolLogP)
from sklearn.decomposition import PCA pca = PCA(n_components = 2) red = pca.fit_transform(x_latent) plt.figure() plt.scatter(red[:,0], red[:,1],marker='.', c= logp) print(pca.explained_variance_ratio_, np.sum(pca.explained_variance_ratio_))
(array([ 0.24815299, 0.15272172], dtype=float32), 0.4008747)
molwt = smiles_test.apply(Chem.MolFromSmiles).apply(Descriptors.MolMR) plt.figure() plt.scatter(red[:,0], red[:,1],marker='.', c= molwt)
There seem to be some distribution of the predicted molecular properties.
Modeling properties from the latent space
It could also be interesting to see if the fingerprints work good as a basis for QSAR models. As a proxy for the target value of QSAR modelling, a model of the predicted LogP will be built. Building a model using a modeled property is only done for demonstration purposes.
#Model LogP? x_train_latent = smiles_to_latent_model.predict(X_train) logp_train = smiles_train.apply(Chem.MolFromSmiles).apply(Descriptors.MolLogP)
from keras.models import Sequential logp_model = Sequential() logp_model.add(Dense(128, input_shape=(latent_dim,), activation="relu")) logp_model.add(Dense(128, activation="relu")) logp_model.add(Dense(1)) logp_model.compile(optimizer="adam", loss="mse")
rlr = ReduceLROnPlateau(monitor='val_loss', factor=0.5,patience=10, min_lr=0.000001, verbose=1, epsilon=1e-5) logp_model.fit(x_train_latent, logp_train, batch_size=128, epochs=400, callbacks = [rlr])
Epoch 1/200 50029/50029 [==============================] - 1s 25us/step - loss: 0.0236 ..... Epoch 200/200 50029/50029 [==============================] - 1s 23us/step - loss: 0.0187
logp_pred_train = logp_model.predict(x_train_latent) logp_pred_test = logp_model.predict(x_latent) plt.scatter(logp, logp_pred_test, label="Test") plt.scatter(logp_train, logp_pred_train, label="Train") plt.legend()
I would have expected even better fit when modeling a modeled property.
From latent space to SMILES
To sample the latent space, we need two steps. First compute the c and h states using the latent_to_states_model and then set the initial states of the decoder LSTM network. The LSTM network will be fed the input char vector and iteratively sample the next character until the end char “E” is encountered.
def latent_to_smiles(latent): #decode states and set Reset the LSTM cells with them states = latent_to_states_model.predict(latent) sample_model.layers.reset_states(states=[states,states]) #Prepare the input char startidx = char_to_int["!"] samplevec = np.zeros((1,1,22)) samplevec[0,0,startidx] = 1 smiles = "" #Loop and predict next char for i in range(28): o = sample_model.predict(samplevec) sampleidx = np.argmax(o) samplechar = int_to_char[sampleidx] if samplechar != "E": smiles = smiles + int_to_char[sampleidx] samplevec = np.zeros((1,1,22)) samplevec[0,0,sampleidx] = 1 else: break return smiles
smiles = latent_to_smiles(x_latent[0:1]) print smiles print smiles_test.iloc
It can be relevant to know how large a fraction of the SMILES are malformed when sampling the test set?
wrong = 0 for i in range(1000): smiles = latent_to_smiles(x_latent[i:i+1]) mol = Chem.MolFromSmiles(smiles) if mol: pass else: print smiles wrong = wrong + 1 print "%0.1F percent wrongly formatted smiles"%(wrong/float(1000)*100)
Cc1ccc(N)n[nH]1 0.1 percent wrongly formatted smiles
This is actually quite good. Another interesting thing could be to see if the molecules can be “interpolated” in the latent space.
#Interpolation test in latent_space i = 0 j= 2 latent1 = x_latent[j:j+1] latent0 = x_latent[i:i+1] mols1 =  ratios = np.linspace(0,1,25) for r in ratios: #print r rlatent = (1.0-r)*latent0 + r*latent1 smiles = latent_to_smiles(rlatent) mol = Chem.MolFromSmiles(smiles) if mol: mols1.append(mol) else: print smiles Draw.MolsToGridImage(mols1, molsPerRow=5)
The mis sampling seem to be higher as two out of 25 molecules could not be parsed. The interpolation also shows how the same molecule is generated for slightly different areas of the latent space. This maybe reflects that the molecules are discrete and the latent space continuous, with a higher chance of failing when we are at the border between two molecules.
Another interesting property to examine is if novel molecules more or less similar to a lead molecule can be by adding a bit of randomness to an existing latent vector.
#Sample around the latent wector latent = x_latent[0:1] scale = 0.40 mols =  for i in range(20): latent_r = latent + scale*(np.random.randn(latent.shape)) #TODO, try with smiles = latent_to_smiles(latent_r) mol = Chem.MolFromSmiles(smiles) if mol: mols.append(mol) else: print smiles Draw.MolsToGridImage(mols, molsPerRow=5)
NCCO1C=CC1O CONC(CCN1CO1 NCC#CCCN1N NCCC=NO1CN1 NCC1CO=CN1O NCCC1O(C)OO CNN1C(CFC1=O1 NCC=CCO1N1 COC1CC=NN=C NCCO1CCON1
Here we really see a lot of wrong SMILES generated, but nevertheless see some molecules that may be more or less similar to the one used to encode the latent vector. (Compare to the first molecule in the interpolation above). Tuning the scaling can be a way to tune the creativity.
In this blog post it was shown how to make an LSTM based autoencoder of SMILES strings of molecules and efficiently train it with teacher enforcing. The latent space may reflect the structure of the SMILES but nevertheless seem useful for a variety of cheminformatics related tasks. The SMILES reconstruction accuracy seem to drop when bordering domains between molecules or adding random noise. Thus the latent space may not be completely continuous reflecting the discrete nature of the molecules. I’ve not mentioned AI in this blog-post as I’m not sure how to define intelligence in the first place, but I do find it fascinating how the chemical rules of SMILES strings can be picked up by the weights and computations done in the neural networks of the LSTM cells and later used to create novel SMILES strings.