Molecular Fingerprints

Molecules can be represented in many ways. This tutorial introduces a type of representation called a "molecular fingerprint". It is a very simple representation that often works well for small drug-like molecules.

Colab

This tutorial and the rest in this sequence can be done in Google Colab. If you are running this notebook in Colab, the following cell installs DeepChem.

In [ ]:

!pip install --pre deepchem

We can now import the deepchem package and check which version is installed.

In [1]:

import deepchem as dc
dc.__version__

Out[1]:

'2.4.0-rc1.dev'

What is a Fingerprint?

Deep learning models almost always take arrays of numbers as their inputs. If we want to process molecules with them, we somehow need to represent each molecule as one or more arrays of numbers.

Many (but not all) types of models require their inputs to have a fixed size. This can be a challenge for molecules, since different molecules have different numbers of atoms. If we want to use these types of models, we somehow need to represent variable-sized molecules with fixed-size arrays.

Fingerprints are designed to address these problems. A fingerprint is a fixed-length array, where different elements indicate the presence of different features in the molecule. If two molecules have similar fingerprints, that indicates they contain many of the same features, and will therefore likely have similar chemistry.
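To make "similar fingerprints" concrete, here is a minimal sketch of one common way to compare two binary fingerprints, the Tanimoto (Jaccard) coefficient. The tutorial does not name a similarity measure, so the choice of Tanimoto is an assumption, and the 8-bit fingerprints are made up for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two equal-length 0/1 fingerprints."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    # Features present in both molecules.
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    # Features present in at least one molecule.
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    # Two empty fingerprints are treated as identical.
    return both / either if either else 1.0


# Made-up 8-bit fingerprints (real ones are much longer, e.g. 1024 elements).
fp_1 = [1, 1, 0, 0, 1, 0, 1, 0]
fp_2 = [1, 1, 0, 0, 1, 0, 0, 0]
fp_3 = [0, 0, 1, 1, 0, 1, 0, 0]

print(tanimoto(fp_1, fp_2))  # 0.75: three shared features out of four
print(tanimoto(fp_1, fp_3))  # 0.0: no shared features
```

A value near 1 means the two molecules share most of their features; a value near 0 means they share almost none.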

DeepChem supports a particular type of fingerprint called an "Extended Connectivity Fingerprint", or "ECFP" for short. They are also sometimes called "circular fingerprints". The ECFP algorithm begins by classifying atoms based only on their direct properties and bonds. Each unique pattern is a feature. For example, "carbon atom bonded to two hydrogens and two heavy atoms" would be a feature, and a particular element of the fingerprint is set to 1 for any molecule that contains that feature. It then iteratively identifies new features by looking at larger circular neighborhoods. One specific feature bonded to two other specific features becomes a higher level feature, and the corresponding element is set for any molecule that contains it. This continues for a fixed number of iterations, most often two.
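The iteration described above can be sketched in a few lines of plain Python. This is a deliberately simplified toy, not the ECFP implementation that DeepChem or RDKit use: the atom label strings, the helper names `stable_hash` and `toy_circular_fingerprint`, and the 64-element length are all assumptions chosen for illustration.

```python
import hashlib


def stable_hash(text):
    """Deterministic integer hash of a string (built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.sha1(text.encode("utf-8")).digest()[:8], "big")


def toy_circular_fingerprint(atom_labels, bonds, n_bits=64, n_iterations=2):
    """Simplified ECFP-style fingerprint.

    atom_labels: one description string per atom (hypothetical encoding).
    bonds: list of (i, j) index pairs for bonded atoms.
    """
    neighbors = {i: [] for i in range(len(atom_labels))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Iteration 0: each atom's identifier depends only on its own properties.
    identifiers = [stable_hash(label) for label in atom_labels]
    features = set(identifiers)

    # Each further iteration combines an atom's identifier with those of its
    # neighbors, so it describes a circular neighborhood one bond larger.
    for _ in range(n_iterations):
        updated = []
        for i, own in enumerate(identifiers):
            around = sorted(identifiers[j] for j in neighbors[i])
            updated.append(stable_hash(repr((own, around))))
        identifiers = updated
        features.update(identifiers)

    # Fold every feature identifier into a fixed-length array of 0s and 1s.
    fingerprint = [0] * n_bits
    for feature in features:
        fingerprint[feature % n_bits] = 1
    return fingerprint


# Heavy atoms of ethanol (C-C-O) and propane (C-C-C), in a made-up label format.
ethanol = toy_circular_fingerprint(["C,H3", "C,H2", "O,H1"], [(0, 1), (1, 2)])
propane = toy_circular_fingerprint(["C,H3", "C,H2", "C,H3"], [(0, 1), (1, 2)])
print(len(ethanol), len(propane))  # both 64, regardless of molecule size
```

Because neighbor identifiers are sorted before hashing and features are collected in a set, the result does not depend on the order in which atoms are listed. Folding with a modulo means two different features can land on the same element, which is the price of a fixed length.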

Let's take a look at a dataset that has been featurized with ECFP.

In [2]:

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train_dataset, valid_dataset, test_dataset = datasets
print(train_dataset)

<DiskDataset X.shape: (6264, 1024), y.shape: (6264, 12), w.shape: (6264, 12), task_names: ['NR-AR' 'NR-AR-LBD' 'NR-AhR' ... 'SR-HSE' 'SR-MMP' 'SR-p53']>

The feature array X has shape (6264, 1024). That means there are 6264 samples in the training set. Each one is represented by a fingerprint of length 1024. Also notice that the label array y has shape (6264, 12): this is a multitask dataset. Tox21 contains information about the toxicity of molecules. 12 different assays were used to look for signs of toxicity. The dataset records the results of all 12 assays, each as a different task.

Let's also take a look at the weights array.

In [3]:

train_dataset.w

Out[3]:

array([[1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
        1.060388945752303, 1.1895710249165168, 1.0700990099009902],
       [1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
        0.0, 1.1895710249165168, 1.0700990099009902],
       [0.0, 0.0, 0.0, ..., 1.060388945752303, 0.0, 0.0],
       ...,
       [0.0, 0.0, 0.0, ..., 0.0, 0.0, 0.0],
       [1.0433141624730409, 1.0369942196531792, 8.53921568627451, ...,
        1.060388945752303, 0.0, 0.0],
       [1.0433141624730409, 1.0369942196531792, 1.1326397919375812, ...,
        1.060388945752303, 1.1895710249165168, 1.0700990099009902]],
      dtype=object)

Notice that some elements are 0. The weights are being used to indicate missing data. Not all assays were actually performed on every molecule. Setting the weight for a sample or sample/task pair to 0 causes it to be ignored during fitting and evaluation. It will have no effect on the loss function or other metrics.
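As a rough illustration of why a zero weight removes an entry from the loss, here is a minimal sketch of a weighted binary cross-entropy over sample/task pairs. The function name `weighted_log_loss` and the normalization by the sum of weights are assumptions made for this sketch; DeepChem's own loss may differ in its details.

```python
import math


def weighted_log_loss(y_true, y_prob, weights):
    """Weighted average binary cross-entropy over (sample, task) entries."""
    total, weight_sum = 0.0, 0.0
    for y_row, p_row, w_row in zip(y_true, y_prob, weights):
        for y, p, w in zip(y_row, p_row, w_row):
            if w == 0:
                continue  # missing measurement: contributes nothing
            total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
            weight_sum += w
    return total / weight_sum if weight_sum else 0.0


# Two samples, two tasks. The second task was not measured for the second
# sample, so its label is only a placeholder and its weight is 0.
y_true = [[1, 0], [0, 0]]
weights = [[1.0, 1.0], [1.0, 0.0]]

y_prob = [[0.9, 0.2], [0.1, 0.5]]
y_prob_changed = [[0.9, 0.2], [0.1, 0.99]]  # only the zero-weight entry differs

print(weighted_log_loss(y_true, y_prob, weights))
print(weighted_log_loss(y_true, y_prob_changed, weights))  # same value
```

Changing the prediction for the unmeasured entry leaves the loss unchanged, so the model is neither rewarded nor penalized for it.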

Most of the other weights are close to 1, but not exactly 1. This is done to balance the overall weight of positive and negative samples on each task. When training the model, we want each of the 12 tasks to contribute equally, and on each task we want to put equal weight on positive and negative samples. Otherwise, the model might just learn that most of the training samples are non-toxic, and therefore become biased toward identifying other molecules as non-toxic.
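One simple balancing scheme that produces this pattern (negatives slightly above 1, positives much larger) is to divide the number of measured samples on a task by the number of samples in each class. The sketch below assumes that scheme for illustration; the function name `balanced_weights` and the use of `None` for missing labels are hypothetical, and DeepChem's transformer may compute its weights differently.

```python
def balanced_weights(labels):
    """Per-sample weights for one task; labels are 1, 0, or None (not measured)."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    n_measured = n_pos + n_neg
    weights = []
    for y in labels:
        if y is None:
            weights.append(0.0)                 # missing data is ignored
        elif y == 1:
            weights.append(n_measured / n_pos)  # rare class gets a large weight
        else:
            weights.append(n_measured / n_neg)  # common class stays close to 1
    return weights


# One task with seven negatives, one positive, and two unmeasured samples.
labels = [0, 0, 0, 0, 0, 0, 0, 1, None, None]
weights = balanced_weights(labels)

total_pos = sum(w for w, y in zip(weights, labels) if y == 1)
total_neg = sum(w for w, y in zip(weights, labels) if y == 0)
print(weights[0], weights[7], weights[8])
print(total_pos, total_neg)  # the two classes carry equal total weight
```

With these weights a model cannot lower its loss just by predicting the majority class everywhere, because the single positive counts as much as all seven negatives combined.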

Training a Model on Fingerprints

Let's train a model. In earlier tutorials we used GraphConvModel, which is a fairly complicated architecture that takes a complex set of inputs. Because fingerprints are so simple, just a single fixed-length array, we can use a much simpler type of model.

In [4]:

model = dc.models.MultitaskClassifier(n_tasks=12, n_features=1024, layer_sizes=[1000])

MultitaskClassifier is a simple stack of fully connected layers. In this example we tell it to use a single hidden layer of width 1000. We also tell it that each input will have 1024 features, and that it should produce predictions for 12 different tasks.
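To show what "a stack of fully connected layers" means in practice, here is a minimal forward pass in plain Python with toy sizes. It is a sketch under stated assumptions, not DeepChem's implementation: the helper names, the ReLU activation, the random untrained weights, and the single sigmoid output per task are all choices made for illustration.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


def dense(x, layer):
    """Apply a fully connected layer: one weighted sum per output unit."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def forward(x, hidden_layer, output_layer):
    """Fingerprint -> hidden layer with ReLU -> one probability per task."""
    hidden = [max(0.0, h) for h in dense(x, hidden_layer)]
    logits = dense(hidden, output_layer)
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


rng = random.Random(0)
n_features, n_hidden, n_tasks = 16, 8, 3  # toy sizes; the tutorial uses 1024, 1000, 12
hidden_layer = make_layer(n_features, n_hidden, rng)
output_layer = make_layer(n_hidden, n_tasks, rng)

fingerprint = [rng.randint(0, 1) for _ in range(n_features)]
probabilities = forward(fingerprint, hidden_layer, output_layer)
print(len(probabilities))  # one prediction per task
```

All tasks share the same hidden layer and differ only in the final output unit, which is what lets a multitask model reuse what it learns on one task for the others.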

Why not train a separate model for each task? We could do that, but it turns out that training a single model for multiple tasks often works better. We will see an example of that in a later tutorial.

Let's train and evaluate the model.

In [5]:

# Train for 10 epochs.
model.fit(train_dataset, nb_epoch=10)

# Score the training and test sets with ROC AUC.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('training set score:', model.evaluate(train_dataset, [metric], transformers))
print('test set score:', model.evaluate(test_dataset, [metric], transformers))

training set score: {'roc_auc_score': 0.9550063590563469}
test set score: {'roc_auc_score': 0.7781819573695475}

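This is good performance for such a simple model and featurization, although the gap between the training score (about 0.96) and the test score (about 0.78) indicates some overfitting. More sophisticated models do slightly better on this dataset, but not enormously better.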
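The scores above are ROC AUC values. For readers unfamiliar with the metric, the following sketch computes it for a single task directly from its definition: the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The function name `roc_auc` and the toy data are assumptions for illustration; in practice `dc.metrics.roc_auc_score` does this work.

```python
def roc_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative sample")
    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, y_score))  # 0.75: three of the four pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which puts the test score of about 0.78 in context.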

Congratulations! Time to join the Community!

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Gitter

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

Citing This Tutorial

If you found this tutorial useful, please consider citing it using the provided BibTeX.

In [ ]:

@manual{Intro4, 
 title={Molecular Fingerprints}, 
 organization={DeepChem},
 author={Ramsundar, Bharath}, 
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Molecular_Fingerprints.ipynb}}, 
 year={2021}, 
}