Tutorials
Molecular Fingerprints ¶
Molecules can be represented in many ways. This tutorial introduces a type of representation called a "molecular fingerprint". It is a very simple representation that often works well for small drug-like molecules.
Colab ¶
This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.
!pip install --pre deepchem
We can now import the
deepchem
package to play with.
import deepchem as dc
dc.__version__
'2.4.0-rc1.dev'
What is a Fingerprint? ¶
Deep learning models almost always take arrays of numbers as their inputs. If we want to process molecules with them, we somehow need to represent each molecule as one or more arrays of numbers.
Many (but not all) types of models require their inputs to have a fixed size. This can be a challenge for molecules, since different molecules have different numbers of atoms. If we want to use these types of models, we somehow need to represent variable sized molecules with fixed sized arrays.
Fingerprints are designed to address these problems. A fingerprint is a fixed length array, where different elements indicate the presence of different features in the molecule. If two molecules have similar fingerprints, that indicates they contain many of the same features, and therefore will likely have similar chemistry.
DeepChem supports a particular type of fingerprint called an "Extended Connectivity Fingerprint", or "ECFP" for short. They also are sometimes called "circular fingerprints". The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds. Each unique pattern is a feature. For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any molecule that contains that feature. It then iteratively identifies new features by looking at larger circular neighborhoods. One specific feature bonded to two other specific features becomes a higher level feature, and the corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most often two.
Let's take a look at a dataset that has been featurized with ECFP.
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)
<DiskDataset X.shape: (6264, 1024), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>
The feature array
X
has shape (6264, 1024). That means there are 6264 samples in the training set. Each one is represented by a fingerprint of length 1024. Also notice that the label array
y
has shape (6264, 12): this is a multitask dataset. Tox21 contains information about the toxicity of molecules. 12 different assays were used to look for signs of toxicity. The dataset records the results of all 12 assays, each as a different task.
Let's also take a look at the weights array.
train_dataset.w
array([[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ..., 1.060388945752303, 1.1895710249165168, 1.0700990099009902], [1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ..., 0.0, 1.1895710249165168, 1.0700990099009902], [0.0, 0.0, 0.0, ..., 1.060388945752303, 0.0, 0.0], ..., [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0], [1.0433141624730409, 1.0369942196531792, 8.53921568627451, ..., 1.060388945752303, 0.0, 0.0], [1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ..., 1.060388945752303, 1.1895710249165168, 1.0700990099009902]], dtype=object)
Notice that some elements are 0. The weights are being used to indicate missing data. Not all assays were actually performed on every molecule. Setting the weight for a sample or sample/task pair to 0 causes it to be ignored during fitting and evaluation. It will have no effect on the loss function or other metrics.
Most of the other weights are close to 1, but not exactly 1. This is done to balance the overall weight of positive and negative samples on each task. When training the model, we want each of the 12 tasks to contribute equally, and on each task we want to put equal weight on positive and negative samples. Otherwise, the model might just learn that most of the training samples are non-toxic, and therefore become biased toward identifying other molecules as non-toxic.
Training a Model on Fingerprints ¶
Let's train a model. In earlier tutorials we use
GraphConvModel
, which is a fairly complicated architecture that takes a complex set of inputs. Because fingerprints are so simple, just a single fixed length array, we can use a much simpler type of model.
model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])
MultitaskClassifier
is a simple stack of fully connected layers. In this example we tell it to use a single hidden layer of width 1000. We also tell it that each input will have 1024 features, and that it should produce predictions for 12 different tasks.
Why not train a separate model for each task? We could do that, but it turns out that training a single model for multiple tasks often works better. We will see an example of that in a later tutorial.
Let's train and evaluate the model.
import numpy as np
model.fit(train_dataset, nb_epoch=10)
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))
training set score: {'roc_auc_score': 0.9550063590563469} test set score: {'roc_auc_score': 0.7781819573695475}
Not bad performance for such a simple model and featurization. More sophisticated models do slightly better on this dataset, but not enormously better.
Congratulations! Time to join the Community! ¶
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub ¶
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter ¶
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!
Citing This Tutorial ¶
If you found this tutorial useful please consider citing it using the provided BibTeX.
@manual{Intro4,
title={Molecular Fingerprints},
organization={DeepChem},
author={Ramsundar, Bharath},
howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Molecular_Fingerprints.ipynb}},
year={2021},
}