Master your molecule generator 2. Direct steering of conditional recurrent neural networks (cRNNs)

esbenbjerrum / November 12, 2019 / Blog, Cheminformatics, Neural Network, RDKit, Science

A long time ago in a GPU far, far away, the deep learning rebels were happy. They had created new ways of working with chemistry using deep learning technology and condemned “the old ways” to be a mere provider of SMILES strings. But then something appeared in deep space …
In a previous blog post I covered how to create a molecular encoder and use the generated latent space to control the generation of novel molecules. I've also covered the unbiased version of the generator in what now seems like a long time ago, so that blog post may be a bit technically outdated. Together with Panagiotis-Christos Kotsias, a bright, young graduate scientist, and a couple of other researchers, we've investigated how we can use pre-calculated descriptors and activity predictions to get closer to the holy grail of direct inverse QSAR. The details were recently published as a ChemRxiv preprint, and the code and two pre-trained models are available on GitHub. The architecture is in a way related to existing auto- and heteroencoder architectures, but the encoder part is not created via deep learning. Instead, the molecular encoding is done with existing methods from the cheminformatics toolkit RDKit, as illustrated below. I kind of like the way we can combine existing engineered methods with new deep learning technology to create a hybrid method, and I hope that this combination will in the future bring along a new balance of the fo^D^D source. (My teenage sons would scorn me for misuse of memes, but I couldn't resist….)
The network will adjust its output to match these inputs; for example, it will probably not put a lot of hydroxy groups (O) in a molecule we just told it has a large logP, but rather assign a higher probability to carbons and thereby attain a lower overall loss. If the amount of information is high, as when the conditional RNN is fed a structural fingerprint, the number of possible molecules will be low, and the network will be closer to a perfect 1:1 autoencoder molecular mapping. But if the amount of chemical information is lower, like the seven descriptors and properties in the tutorial below, the network will output orders of magnitude more molecules and be intermediate in performance between autoencoders and unbiased RNN molecular generators. It can also be pre-trained on predictions from a QSAR model and thus solve the inverse QSAR task directly. The code in the GitHub repository contains a pre-trained model, and in this blog post I'll show how to use it to generate novel molecules.
Illustration of how to change an autoencoder architecture into a conditional RNN
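To make the idea more concrete, here is a minimal Keras-style sketch of a conditional RNN decoder: a small dense network maps the condition vector to the initial states of an LSTM, which then predicts the next SMILES character at every step. This is only an illustration of the principle with assumed dimensions (7 conditions, 142 characters, 35-token charset, 256 units); the actual pcb model in the repository is larger and stacks several recurrent layers, as can be seen in the model summary further down.

from keras.layers import Input, Dense, LSTM, TimeDistributed
from keras.models import Model

n_conditions, maxlen, charset_size, units = 7, 142, 35, 256

# Map the condition vector (descriptors + activity probability) to initial LSTM states
conditions_in = Input(shape=(n_conditions,), name="conditions")
state_h = Dense(units, activation="relu")(conditions_in)
state_c = Dense(units, activation="relu")(conditions_in)

# The decoder reads one-hot encoded SMILES and predicts the next character at each step
smiles_in = Input(shape=(maxlen, charset_size), name="smiles_one_hot")
rnn_out = LSTM(units, return_sequences=True)(smiles_in, initial_state=[state_h, state_c])
char_probs = TimeDistributed(Dense(charset_size, activation="softmax"))(rnn_out)

decoder = Model([conditions_in, smiles_in], char_probs)
decoder.compile(optimizer="adam", loss="categorical_crossentropy")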

Installation of deep drug coder

Follow the instructions in the GitHub repository for how to install the module. In short: git clone https://github.com/pcko1/Deep-Drug-Coder.git, create a conda environment from the provided .yml file, activate your conda or virtualenv environment, cd Deep-Drug-Coder, and run python setup.py install. Then it should be possible to load the ddc_v3 class.
The code seems to break with the recent upgrade of NumPy to 1.17, but it has been tested to work with 1.16.

import numpy as np
import pickle
np.__version__
    '1.16.5'
from ddc_pub import ddc_v3 as ddc
    Using TensorFlow backend.

Using the pre-trained models

The Git project should come with two pre-trained models. The “phys-chem” pcb model is steered using a basic selection of molecular descriptors as well as a feature linked to the probability of being active in a DRD2 QSAR model. The new version of DDC includes the vectorizer from molvecgen directly, but in order to use some of the previously trained models, the molvecgen package from https://github.com/EBjerrum/molvecgen also needs to be installed.

import molvecgen

Importing the model from the zip file (without the extension!) will load the needed files, build the decoder model and show a summary of the loaded model.

model_name = "Deep-Drug-Coder/models/pcb_model"
model = ddc.DDC(model_name=model_name)
    Initializing model in test mode.
    Loading model.
    'mol_to_latent_model' not found, setting to None.
    /projects/mai/kfxl284/envs/ddc_env/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/base.py:306: UserWarning: Trying to unpickle estimator StandardScaler from version 0.20.2 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
      UserWarning)
    /projects/mai/kfxl284/envs/ddc_env/lib/python3.6/site-packages/Keras-2.2.5-py3.6.egg/keras/engine/saving.py:310: UserWarning: No training configuration found in save file: the model was *not* compiled. Compile it manually.
      warnings.warn('No training configuration found in save file: '
    Loading finished in 8 seconds.
    Model: "model_1"
    __________________________________________________________________________________________________
    Layer (type)                    Output Shape         Param #     Connected to                     
    ==================================================================================================
    Latent_Input (InputLayer)       (None, 7)            0                                            
    __________________________________________________________________________________________________
    Decoder_Inputs (InputLayer)     (None, 142, 35)      0                                            
    __________________________________________________________________________________________________
    latent_to_states_model (Model)  [(None, 256), (None, 18432       Latent_Input[0][0]               
    __________________________________________________________________________________________________
    batch_model (Model)             (None, 142, 35)      1364771     Decoder_Inputs[0][0]             
                                                                     latent_to_states_model[1][0]     
                                                                     latent_to_states_model[1][1]     
                                                                     latent_to_states_model[1][2]     
                                                                     latent_to_states_model[1][3]     
                                                                     latent_to_states_model[1][4]     
                                                                     latent_to_states_model[1][5]     
    ==================================================================================================
    Total params: 1,383,203
    Trainable params: 1,378,595
    Non-trainable params: 4,608
    __________________________________________________________________________________________________
    None

Calculating conditions from an existing molecule

Before we can generate molecules we need to feed in the target conditions. It is a good idea to initially take the calculated properties from an existing molecule to get values that are realistic and mutually consistent, as unrealistic and contradictory settings will lead to problems during generation: the molecules will not meet the targets, and the validity percentage of the generated molecules will eventually collapse. So we write a small function that returns a list with the conditions that the loaded model was trained on. Let's use the classical aspirin molecule but try to find compounds that are DRD2 active with a probability of 0.9. This is entirely for illustrative purposes; it would probably be better to start with a compound closer to the actives of the intended activity model, but let's see how it goes. Six of the properties for the model were calculated with standard RDKit functions, whereas the last one is the active probability from a DRD2 QSAR model.

qsar_model_name = "Deep-Drug-Coder/models/qsar_model.pickle"
with open(qsar_model_name, "rb") as file:
    qsar_model = pickle.load(file)["classifier_sv"]
    /projects/mai/kfxl284/envs/ddc_env/lib/python3.6/site-packages/scikit_learn-0.21.3-py3.6-linux-x86_64.egg/sklearn/base.py:306: UserWarning: Trying to unpickle estimator SVC from version 0.20.2 when using version 0.21.3. This might lead to breaking code or invalid results. Use at your own risk.
      UserWarning)
from rdkit import Chem, DataStructs
from rdkit.Chem import Descriptors, rdMolDescriptors, AllChem, QED
from rdkit.Chem.Draw import IPythonConsole
#We suppress stdout from invalid smiles and validations
from rdkit import rdBase
rdBase.DisableLog('rdApp.*')
def get_descriptors(mol):
    logp  = Descriptors.MolLogP(mol)
    tpsa  = Descriptors.TPSA(mol)
    molwt = Descriptors.ExactMolWt(mol)
    hba   = rdMolDescriptors.CalcNumHBA(mol)
    hbd   = rdMolDescriptors.CalcNumHBD(mol)
    qed   = QED.qed(mol)

    # Calculate fingerprints
    fp = AllChem.GetMorganFingerprintAsBitVect(mol,2, nBits=2048)
    ecfp4 = np.zeros((2048,))
    DataStructs.ConvertToNumpyArray(fp, ecfp4) 
    # Predict activity and pick only the second component
    active_probability = qsar_model.predict_proba([ecfp4])[0][1]
    return [logp, tpsa, molwt, qed, hba, hbd, active_probability]
mol = Chem.MolFromSmiles("c1cccc(OC(=O)C)c1C(=O)O")
display(mol)
conditions = get_descriptors(mol)
conditions


[1.3100999999999998,
63.60000000000001,
180.042258736,
0.5501217966938848,
3,
1,
9.431141115603303e-06]
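
The order of the values follows the return statement of get_descriptors. As a quick reminder of which number is which, you can pair them with names; this name list is just for convenience here and is defined again for the plots further below.

target_names = ["logp", "tpsa", "molwt", "qed", "hba", "hbd", "active_probability"]
print(dict(zip(target_names, conditions)))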

Sampling molecules with similar properties

We’ll reset the DRD2 activity probability to 0.9 to search for active ligands with properties similar to aspirin.

conditions[-1] = 0.9

We’ll sample the Deep Drug Coder multiple times with the same conditions and examine the diversity of the output. Can we really match the input conditions, and will our QSAR model for DRD2 predict the compounds as being active? The molecular weight of the seed compound may be a bit low for the DDC to solve the task, but let's see what happens.

target = np.array(conditions)
smiles_out, _ = model.predict(latent=target, temp=1)
Chem.MolFromSmiles(smiles_out)


Running it a few times shows some “interesting” molecules, but also sometimes gives validation errors. We’ll use the batch mode to generate a set of molecules and check their properties. The model.batch_input_length property specifies the number of predicted SMILES in batch mode, but resetting it will trigger a rebuild of the model. Its default is 256, and setting it to large values will put strain on the GPU memory and can lead to delays, so it may be faster to predict a couple of batches of 256.

#model.batch_input_length = 256
smiles_out = []
for i in range(4):
    smiles, _ = model.predict_batch(latent=target.reshape(1,-1), temp=1.0)
    smiles_out.append(smiles)
smiles_out = np.concatenate(smiles_out)
smiles_out.shape
    (1024,)
mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]

We’ll just check the molecules with sanitization to ensure they are well accepted by RDKit.

def sanitize(mol):
    try:
        Chem.SanitizeMol(mol)
        return mol
    except Exception as e:
        print(e)
        
sani_mols = [sanitize(mol) for mol in mols if mol != None]
    Sanitization error: Explicit valence for atom # 12 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 13 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 11 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 13 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 12 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 13 O, 3, is greater than permitted
    Sanitization error: Explicit valence for atom # 6 N, 5, is greater than permitted
len(sani_mols)/len(smiles_out)
    0.80859375

Roughly 80% validity; this may indicate conditions that are slightly self-contradictory. Good, self-consistent conditions usually give above 85%. Let's compute the properties and see if they match our targets.

properties = [get_descriptors(mol) for mol in sani_mols if mol != None]
import pandas as pd
target_names = ["logp", "tpsa", "molwt", "qed", "hba", "hbd", "active_probability"]
data = pd.DataFrame(properties, columns=target_names)
data.head()
logp tpsa molwt qed hba hbd active_probability
0 1.35750 78.02 197.164046 0.301813 3 1 0.667666
1 0.91848 64.33 188.058577 0.665338 4 1 0.000024
2 0.19990 67.48 181.121512 0.595266 3 2 0.004157
3 2.00240 69.94 194.026232 0.444074 4 1 0.000183
4 1.26688 60.07 176.093072 0.655230 3 1 0.020928
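
As a quick numeric check before plotting, one can compare the mean and spread of the sampled properties with the targets. This is just a convenience sketch using the data DataFrame and the target array defined above:

# Compare the sampled property distributions with the target values
summary = pd.DataFrame({"target": target, "sampled_mean": data.mean(), "sampled_std": data.std()}, index=target_names)
print(summary.round(2))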

Plotting the distributions of the sampled molecules

target_dict = {t[0]:t[1] for t in zip(target_names, target)}
axes = data.hist(bins=25,figsize=(8,6))
for i, ax in enumerate(axes.flatten()):
    title = ax.title.__dict__["_text"]
    if title:
        ax.axvline(x=target_dict[title], color='r', linestyle='dashed', linewidth=2)


The properties seem to follow distributions around the target (red line), except for QED and active_probability. The jump from aspirin's properties to the properties of a DRD2 active molecule may be too large. We'll try with a DRD2 active molecule as seed. Here we'll choose the well-known antipsychotic chlorpromazine, which acts as an antagonist on the DRD2 receptor.
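
To put a rough number on how few of the aspirin-seeded molecules are predicted active, we can count how many exceed an (arbitrarily chosen) probability threshold of 0.5; a small sketch using the data DataFrame from above:

# Fraction of sampled molecules predicted active with p > 0.5 (threshold chosen just for illustration)
frac_active = (data["active_probability"] > 0.5).mean()
print(f"Fraction predicted active: {frac_active:.2f}")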

Trying another molecular seed

mol2 = Chem.MolFromSmiles("CN(C)CCCN1C2=CC=CC=C2SC3=C1C=C(C=C3)Cl")
display(mol2)
conditions = get_descriptors(mol2)
print(conditions)
target = np.array(conditions)


[4.894400000000004, 6.48, 318.095747288, 0.7918053413291345, 3, 0, 0.9939953772318398]
Not surprisingly, it is predicted as active by the QSAR model; there's a good chance it was already part of the training set. Next we generate the SMILES strings, convert them to molecules, sanitize them and compute the percentage validity.

smiles_out = []
for i in range(4):
    smiles, _ = model.predict_batch(latent=target.reshape(1,-1), temp=1)
    smiles_out.append(smiles)
smiles_out = np.concatenate(smiles_out)
  
mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]
sani_mols = [sanitize(mol) for mol in mols if mol != None]
len(sani_mols)/len(smiles_out)
    0.900390625

This is a much better validity range, but we also used exact seeding conditions from an existing molecule. Let's quickly look at the property distributions of the resulting molecules.

properties = [get_descriptors(mol) for mol in sani_mols if mol != None]
data = pd.DataFrame(properties, columns=target_names)
target_dict = {t[0]:t[1] for t in zip(target_names, target)}
axes = data.hist(bins=25,figsize=(8,6))
for i, ax in enumerate(axes.flatten()):
    title = ax.title.__dict__["_text"]
    if title:
        ax.axvline(x=target_dict[title], color='r', linestyle='dashed', linewidth=2)


This again shows more or less narrow distributions around the target properties, and there's a much larger fraction of predicted actives. We can browse the molecules with a small interactive ipywidget. WordPress doesn't support those, so it's just an image in this blog post, but it is quite useful in a Jupyter notebook.

from ipywidgets import interact
def showmol(i):
    display(sani_mols[i])
    
_ = interact(showmol, i = range(len(sani_mols)))
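
Since the widget only works in a live notebook, an alternative for a static view (e.g. for a blog post) is to draw a subset of the molecules as a grid with RDKit; a small sketch:

from rdkit.Chem import Draw
# Draw the first twelve successfully sanitized molecules as a static grid image
Draw.MolsToGridImage([m for m in sani_mols if m is not None][:12], molsPerRow=4, subImgSize=(200, 200))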

Adjusting one of the properties independently

LogP for the seed compound was close to 5, but let's assume a scenario where we want to lower the logP value. We take the seed conditions from before but lower the target logP to 3.

conditions[0] = 3
target = np.array(conditions)
smiles_out = []
for i in range(4):
    smiles, _ = model.predict_batch(latent=target.reshape(1,-1), temp=1)
    smiles_out.append(smiles)
smiles_out = np.concatenate(smiles_out)
mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]
sani_mols = [sanitize(mol) for mol in mols if mol != None]
len(sani_mols)/len(smiles_out)
    0.9599609375

The validity rate seems to be even higher. Good, and what do the property distributions and molecules look like?

properties = [get_descriptors(mol) for mol in sani_mols if mol != None]
data = pd.DataFrame(properties, columns=target_names)
target_dict = {t[0]:t[1] for t in zip(target_names, target)}
axes = data.hist(bins=25,figsize=(8,6))
for i, ax in enumerate(axes.flatten()):
    title = ax.title.__dict__["_text"]
    if title:
        ax.axvline(x=target_dict[title], color='r', linestyle='dashed', linewidth=2)


This is interesting. We lowered the logP target to 3 and suddenly get better validity and a higher proportion of actives. This could be because the distribution of the actives used to train the QSAR model was centered around a logP of 3, so the model has learned from a lot more molecules close to this logP value (the distribution is shown in Figure 2 of the ChemRxiv preprint). We can take a brief look at the molecules with an interactive widget as before.

_ = interact(showmol, i = range(len(sani_mols)))
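
The sample, sanitize and profile steps above were repeated three times with different condition vectors. If you want to explore more settings, it may be convenient to wrap them in a small helper function; this is just a sketch reusing the objects defined earlier in the post (model, sanitize, get_descriptors and target_names):

def sample_and_profile(conditions, n_batches=4, temp=1.0):
    """Sample SMILES for a condition vector, return the validity and a property DataFrame."""
    target = np.array(conditions)
    smiles_out = []
    for _ in range(n_batches):
        smiles, _ = model.predict_batch(latent=target.reshape(1, -1), temp=temp)
        smiles_out.append(smiles)
    smiles_out = np.concatenate(smiles_out)
    mols = [Chem.MolFromSmiles(smi) for smi in smiles_out]
    sani_mols = [sanitize(mol) for mol in mols if mol is not None]
    validity = len(sani_mols) / len(smiles_out)
    props = pd.DataFrame([get_descriptors(m) for m in sani_mols if m is not None], columns=target_names)
    return validity, props

# Example: the current seed conditions with the logP target lowered to 3
validity, props = sample_and_profile(conditions)
print(validity)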

Summary

In this blog post it was demonstrated how to use a pre-trained Deep Drug Coder model to generate molecules with specified properties. We saw that it can be important to select the right properties, as some tasks can't be solved completely when the chosen conditions are contradictory. Selecting an existing active compound to derive the seed conditions can provide a larger fraction of predicted actives while still allowing adjustment of some of the properties, although here it may have been easier because the property was changed to a region that was highly populated during training. This methodology for steering molecular generation with pre-specified properties such as logP and molecular weight makes it much easier for a human-in-the-loop to select which regions could be interesting for generation of datasets in a given drug discovery project context. The method can thus be used to generate huge datasets of novel molecules that can then be processed through further filters and methods (e.g. in-silico filters, ADME, tox, docking, synthesizability assessment).

Happy Deep Learning

Panagiotis-Christos and Esben
