Deep learning is a specialized part of machine learning. The buzzword has acquired an almost magical ring to it, but there is no magic involved, only mathematical modelling and data science. Deep learning distinguishes itself from other machine learning approaches through data-driven feature extraction, rather than relying on hand-engineered and designed features of the data. With multiple layers of neural networks, the deeper layers of the network extract features that serve as descriptors for the layers closer to the task or prediction. By choosing the right network architecture for the right data embedding, the network can itself build the concepts needed for the task.
In more traditional QSAR modelling, a large number of descriptors are calculated for the molecules to be modelled. Many such descriptors have been developed over the last decades; as an example, the Dragon software alone calculates over 5000. Alternatively, systematic fingerprints such as circular fingerprints (Morgan/ECFP) can be used. The calculated descriptors are then used together with various machine learning models such as SVMs, random forests, boosted trees or even neural networks to build predictive models of the molecular properties.
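As a minimal illustration of this descriptor-based workflow, the sketch below computes toy character-count "descriptors" from SMILES strings and feeds them to a nearest-neighbour regressor. This is only a stand-in: a real pipeline would use Dragon descriptors or ECFP fingerprints and a model such as a random forest, and the example molecules and property values here are made up.

```python
from collections import Counter

def toy_descriptors(smiles):
    # Crude stand-in for real descriptor calculation: count the
    # element symbols appearing in the SMILES string.
    counts = Counter(ch for ch in smiles if ch.isalpha())
    return [counts.get(atom, 0) for atom in "CNOSP"]

def knn_predict(query_desc, training_set, k=1):
    # training_set: list of (descriptor_vector, property_value).
    # Predict by averaging the k nearest neighbours in descriptor space.
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(training_set, key=lambda t: sq_dist(t[0], query_desc))[:k]
    return sum(value for _, value in nearest) / k

# Hypothetical training data: (SMILES, property value)
training = [(toy_descriptors(s), y) for s, y in [("CCO", -0.3), ("CCCCCC", 3.9)]]
prediction = knn_predict(toy_descriptors("CCC"), training)
```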
A deep learning approach to the same task is to choose a suitable embedding for the molecules and then let the network itself extract the features necessary for modelling the desired property. This way, good models can be built directly from the SMILES strings commonly used as compact storage of molecular structures in databases and spreadsheets, skipping the descriptor calculation step entirely. The neural networks learn to understand the SMILES strings at the same time as they learn to predict the molecular properties. We have good experience using recurrent neural network (RNN) architectures together with SMILES as the low-level molecular representation: the sequence-based SMILES representation pairs naturally with RNNs, which are optimized for sequence data.
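A sketch of how a SMILES string can be turned into RNN-ready input: each character becomes a one-hot vector, and short strings are zero-padded to a fixed length. The character set and maximum length below are hypothetical choices; in practice the vocabulary is built from the training data.

```python
def one_hot_smiles(smiles, charset="CNO()=#123", max_len=10):
    # One row per sequence position, one column per character;
    # positions past the end of the string stay all-zero (padding).
    index = {ch: i for i, ch in enumerate(charset)}
    matrix = [[0] * len(charset) for _ in range(max_len)]
    for pos, ch in enumerate(smiles[:max_len]):
        matrix[pos][index[ch]] = 1
    return matrix

encoded = one_hot_smiles("CCO")  # ethanol
```

The resulting max_len × charset matrix is the per-timestep input a recurrent layer consumes.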
Chemical Deep Learning using 2D embeddings of molecules
Another approach to deep learning on molecules is to encode the molecules as small “pictures”. This enables us to leverage the great progress made in image modelling and classification over the last couple of years. For this spatial 2D embedding of the molecule, convolutional neural network architectures fit best. One such transfer of model architecture is the use of Inception modules, as in the GoogLeNet models, to model chemical structures; the resulting models are called Chemception models.
The lower layers extract information and build concepts such as atoms and bonds, whereas the following layers build more complex features such as aromatic rings, molecular shape and specific molecular fragments. For an example, see the figure above and the instructive article on our blog.
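A toy sketch of such a 2D embedding, assuming we already have 2D atom coordinates scaled to [0, 1). Real Chemception-style encodings are produced with a cheminformatics toolkit and typically use several channels (bonds, charges, etc.); here a single channel carries the atomic number, and the ethanol coordinates are made up.

```python
def molecule_to_grid(atoms, size=8):
    # atoms: list of (symbol, x, y) with coordinates scaled to [0, 1).
    # Write the atomic number into the matching grid cell, giving a
    # single-channel "picture" a convolutional network could consume.
    atomic_number = {"H": 1, "C": 6, "N": 7, "O": 8}
    grid = [[0] * size for _ in range(size)]
    for symbol, x, y in atoms:
        grid[int(y * size)][int(x * size)] = atomic_number[symbol]
    return grid

# Ethanol with made-up coordinates: C-C-O along one row of the grid.
ethanol = [("C", 0.1, 0.5), ("C", 0.4, 0.5), ("O", 0.7, 0.5)]
picture = molecule_to_grid(ethanol)
```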
Do you need “Big Data”?
We’re sorry to throw another buzzword at you, but big data is often thought of as a prerequisite for deep learning: the default assumption is that you need massive datasets with millions of examples. Large datasets are certainly nice to have, but in our experience they are not always a necessity. Smaller datasets of a few hundred molecules can also be used in deep learning settings, although adjustments have to be made: the architecture needs to be adapted and the available data used efficiently. One approach is to enlarge the dataset by pooling different tasks in multi-task learning.
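Pooling tasks can be sketched as follows: each record keeps a task index so that a single multi-task network can be trained on the combined data. The task names and property values below are made up for illustration.

```python
def pool_tasks(datasets):
    # datasets: {task_name: [(smiles, value), ...]}
    # Returns rows of (smiles, task_index, value) so that one
    # multi-task model can be trained on the pooled data.
    task_index = {name: i for i, name in enumerate(sorted(datasets))}
    return [(smiles, task_index[name], value)
            for name, rows in sorted(datasets.items())
            for smiles, value in rows]

pooled = pool_tasks({
    "solubility": [("CCO", -0.3), ("c1ccccc1", -2.1)],
    "logP": [("CCO", -0.31)],
})
```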
In our experience, performance gains can be had by developing or implementing a proper data augmentation technique. As an example, we developed the SMILES enumeration technique to make SMILES-based deep learning QSAR models more accurate and robust. The technique exploits the fact that a molecule has a number of associated valid SMILES strings (see figure). This is an example of how important domain knowledge (cheminformatics) can be for utilizing data with the new methods. A detailed example of using this technique is available in our publication and in some blog posts.
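The idea can be sketched in plain Python for acyclic, single-bonded molecules: walking the molecular graph from different start atoms (and in different neighbour orders) yields different, equally valid SMILES for the same molecule. This is only a toy; practical implementations use a cheminformatics toolkit such as RDKit to renumber atoms and rewrite full SMILES.

```python
import random

def write_smiles(mol, atom, parent=None):
    # mol: {atom_index: (symbol, [neighbour indices])}, acyclic,
    # single bonds only. Depth-first traversal emits a SMILES string.
    symbol, neighbours = mol[atom]
    branches = [write_smiles(mol, n, atom) for n in neighbours if n != parent]
    if not branches:
        return symbol
    *side, last = branches
    return symbol + "".join("(%s)" % b for b in side) + last

def enumerate_smiles(mol, n_tries=10, seed=42):
    # Different traversals give different but equivalent SMILES --
    # the basis of SMILES enumeration as a data augmentation.
    rng = random.Random(seed)
    variants = set()
    for _ in range(n_tries):
        shuffled = {a: (s, rng.sample(nb, len(nb))) for a, (s, nb) in mol.items()}
        variants.add(write_smiles(shuffled, rng.choice(list(mol))))
    return variants

ethanol = {0: ("C", [1]), 1: ("C", [0, 2]), 2: ("O", [1])}
variants = enumerate_smiles(ethanol)
```

Every string produced ("CCO", "OCC", "C(C)O", "C(O)C") denotes the same molecule, so one training example can be multiplied into several.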
If you are unsure whether your dataset is of suitable size and quality, we can test it in a short pilot project and give recommendations on how best to proceed with data gathering and usage.
Spectroscopical Deep Learning a.k.a. Deep Chemometrics
Another example where deep learning concepts have been successfully applied to limited dataset sizes is spectroscopic data. Using convolutional neural network architectures and a customized data augmentation technique, models could be built to predict the content of the active ingredient in tablets from NIR spectra. Only a few hundred spectra were needed to fit the models, which showed superior model transferability between instruments and in tough extrapolation challenges. Even though we gave the baseline PLS model an advantage by tuning its performance towards the test set (rather than the development set, as for the neural network model), the deep learning model outperformed the PLS models.
We have described our endeavors in the publication Data Augmentation of Spectral Data for Convolutional Neural Network (CNN) Based Deep Chemometrics.
A nice touch was that the best model performance was obtained by combining the data augmentation technique with the established chemometric preprocessing steps. This again underlines how important domain knowledge is to successful modelling, even with deep learning techniques.
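The augmentation can be sketched as random offset, slope and multiplicative variations applied to each training spectrum. The scale parameters below are made-up values; in practice they are tuned to the kind of variation seen between instruments and batches.

```python
import random

def augment_spectrum(spectrum, rng, offset_sd=0.05, slope_sd=0.05, mult_sd=0.05):
    # Simulate instrument-to-instrument variation with a random
    # baseline offset, a random linear slope across the wavelength
    # axis, and a random multiplicative intensity change.
    n = len(spectrum)
    offset = rng.gauss(0.0, offset_sd)
    slope = rng.gauss(0.0, slope_sd)
    mult = 1.0 + rng.gauss(0.0, mult_sd)
    return [mult * y + offset + slope * i / (n - 1) for i, y in enumerate(spectrum)]

rng = random.Random(0)
original = [0.2, 0.4, 0.8, 0.4, 0.2]  # toy 5-point "spectrum"
augmented = augment_spectrum(original, rng)
```

Each epoch can draw fresh variations, so a few hundred measured spectra effectively become an unlimited stream of training examples.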
Peptide Deep Learning
Deep learning is certainly also possible with peptide sequences. Here the natural sequence of amino acids is an even more obvious match for recurrent neural networks (RNNs) than the SMILES sequences used for small molecules above. An example of how a deep recurrent neural network can be used to model properties of peptide sequences is described in our publication on the DeepIEP model.
Here a model of the peptide isoelectric point (pI or IEP) was built from scratch using a database of peptide sequences and their associated pI values. There was no need to figure out which pKa value sets to use, as the model inferred that from the data. Moreover, the model was seemingly able to make adjustments based on local sequence context, proximity to the terminal groups, and clustering of like amino acids. The model and associated source code are available from GitHub.
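For comparison, the traditional approach that the model makes unnecessary looks roughly like this: pick a pKa set, compute the net charge as a function of pH with the Henderson–Hasselbalch relation, and bisect for the pH where the charge crosses zero. The pKa values below are approximate, EMBOSS-style numbers used purely as an illustration; exactly which published set to use is the choice DeepIEP sidesteps.

```python
# Approximate side-chain and terminal pKa values (assumed set).
PKA_POSITIVE = {"Nterm": 8.6, "K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"Cterm": 3.6, "D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}

def net_charge(sequence, ph):
    # Sum fractional charges of all ionizable groups at the given pH.
    groups_pos = ["Nterm"] + [aa for aa in sequence if aa in PKA_POSITIVE]
    groups_neg = ["Cterm"] + [aa for aa in sequence if aa in PKA_NEGATIVE]
    charge = sum(1.0 / (1.0 + 10 ** (ph - PKA_POSITIVE[g])) for g in groups_pos)
    charge -= sum(1.0 / (1.0 + 10 ** (PKA_NEGATIVE[g] - ph)) for g in groups_neg)
    return charge

def isoelectric_point(sequence, lo=0.0, hi=14.0, tol=1e-4):
    # Net charge decreases monotonically with pH, so bisection finds
    # the pH where it crosses zero: the isoelectric point.
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(sequence, mid) > 0:
            lo = mid
        else:
            hi = mid
    return round((lo + hi) / 2.0, 2)
```

Note everything here hinges on the assumed pKa table and ignores sequence context, which is precisely what the data-driven model learns on its own.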
Start your Chemical Deep Learning project today!
Wildcard Pharmaceutical Consulting has the necessary domain insight to help your drug discovery or pharmaceutical research department get the most from its data using deep learning. We have specialized insight into the use of chemical, biological and spectroscopic data. Please contact us for a non-committal and confidential discussion about how we can help you achieve your goals with your data.