Introduction to Graph Convolutions
In this tutorial we will learn more about "graph convolutions." These are one of the most powerful deep learning tools for working with molecular data. The reason for this is that molecules can be naturally viewed as graphs.
Note how standard chemical diagrams of the sort we're used to from high school lend themselves naturally to visualizing molecules as graphs. In the remainder of this tutorial, we'll dig into this relationship in significantly more detail. This will let us get a deeper understanding of how these systems work.
Colab
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
!pip install --pre deepchem
What are Graph Convolutions?
Consider a standard convolutional neural network (CNN) of the sort commonly used to process images. The input is a grid of pixels. There is a vector of data values for each pixel, for example the red, green, and blue color channels. The data passes through a series of convolutional layers. Each layer combines the data from a pixel and its neighbors to produce a new data vector for the pixel. Early layers detect small-scale local patterns, while later layers detect larger, more abstract patterns. Often the convolutional layers alternate with pooling layers that perform some operation such as max or average pooling over local regions.
Graph convolutions are similar, but they operate on a graph. They begin with a data vector for each node of the graph (for example, the chemical properties of the atom that node represents). Convolutional and pooling layers combine information from connected nodes (for example, atoms that are bonded to each other) to produce a new data vector for each node.
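To make this concrete, here is a minimal sketch of a single graph-convolution step written in plain NumPy. This is not how DeepChem implements its layers; the toy adjacency list and the weight matrices are made up purely for illustration. Each atom's new feature vector is a nonlinear mix of its own features and the summed features of its bonded neighbors.
import numpy as np

# Toy molecule: 3 atoms, 4 features per atom (random values, purely illustrative).
node_features = np.random.rand(3, 4)

# Adjacency list: atom 0 bonds to atom 1, atom 1 bonds to atoms 0 and 2, atom 2 bonds to atom 1.
adjacency = [[1], [0, 2], [1]]

# Stand-ins for learned weights: one matrix for an atom's own features, one for its neighbors'.
W_self = np.random.rand(4, 8)
W_neigh = np.random.rand(4, 8)

def graph_conv_step(features, adjacency, W_self, W_neigh):
    # Mix each atom's own features with the summed features of its bonded neighbors.
    new_features = []
    for i, neighbors in enumerate(adjacency):
        neighbor_sum = features[neighbors].sum(axis=0)
        new_features.append(np.tanh(features[i] @ W_self + neighbor_sum @ W_neigh))
    return np.stack(new_features)

print(graph_conv_step(node_features, adjacency, W_self, W_neigh).shape)  # (3, 8)
Stacking several such steps lets information propagate along longer paths in the graph, which is exactly what the convolutional layers below will do for molecules.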
Training a GraphConvModel
Let's use the MoleculeNet suite to load the Tox21 dataset. To featurize the data in a way that graph convolutional networks can use, we set the featurizer option to 'GraphConv'. The MoleculeNet call returns a training set, a validation set, and a test set for us to use. It also returns tasks, a list of the task names, and transformers, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)
Note: While importing DeepChem, if you see any warnings, ignore them for now. DeepChem is a vast library and there are many things that can cause minor warnings to occur. They almost never require any action on your part.
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets
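As an optional sanity check (not part of the original workflow), you can print what was just loaded; the exact contents depend on your DeepChem version.
print('Number of tasks:', len(tasks))
print('Tasks:', tasks)
print('Transformers:', transformers)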
Let's now train a graph convolutional network on this dataset. DeepChem has the class GraphConvModel that wraps a standard graph convolutional architecture under the hood for user convenience. Let's instantiate an object of this class and train it on our dataset.
n_tasks = len(tasks)
num_features = train_dataset.X[0].get_atom_features().shape[1]
model = dc.models.torch_models.GraphConvModel(n_tasks, mode='classification', number_input_features=[num_features, 64])
model.fit(train_dataset, nb_epoch=50)
0.29102970123291017
Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. dc.metrics holds a collection of metrics already. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which measures the tradeoff between the true positive rate and the false positive rate). Luckily, the ROC-AUC score is already available in DeepChem. To measure the performance of the model under this metric, we can use the convenience function model.evaluate().
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('Training set score:', model.evaluate(train_dataset, [metric], transformers))
print('Test set score:', model.evaluate(test_dataset, [metric], transformers))
Training set score: {'roc_auc_score': 0.970785822904073}
Test set score: {'roc_auc_score': 0.7112009940440461}
The results are pretty good, and GraphConvModel is very easy to use. But what's going on under the hood? Could we build GraphConvModel ourselves? Of course! DeepChem provides both Keras and PyTorch layers for all the calculations involved in a graph convolution. We are going to apply the following layers from DeepChem.
- GraphConv layer: This layer implements the graph convolution. The graph convolution combines per-node feature vectors in a nonlinear fashion with the feature vectors of neighboring nodes. This "blends" information in local neighborhoods of a graph.
- GraphPool layer: This layer does a max-pooling over the feature vectors of atoms in a neighborhood. You can think of this layer as analogous to a max-pooling layer for 2D convolutions, but one that operates on graphs instead.
- GraphGather layer: Many graph convolutional networks manipulate feature vectors per graph node. For a molecule, for example, each node might represent an atom, and the network would manipulate atomic feature vectors that summarize the local chemistry of the atom. However, at the end of the application, we will likely want to work with a molecule-level feature representation. This layer creates a graph-level feature vector by combining all the node-level feature vectors.
Apart from these, we are going to apply standard neural network layers such as Dense, BatchNormalization, and Softmax.
Training a custom Graph Convolution network
As you may have seen in the previous tutorials, DeepChem offers both PyTorch and TensorFlow functionality. However, most of our work moving forward will leverage the PyTorch ecosystem. Let's look at the TensorFlow implementation first.
TensorFlow
from deepchem.models.layers import GraphConv, GraphPool, GraphGather
import tensorflow as tf
import tensorflow.keras.layers as layers
batch_size = 100
class GraphConvModelTensorflow(tf.keras.Model):

    def __init__(self):
        super(GraphConvModelTensorflow, self).__init__()
        # First graph convolution block
        self.gc1 = GraphConv(128, activation_fn=tf.nn.tanh)
        self.batch_norm1 = layers.BatchNormalization()
        self.gp1 = GraphPool()

        # Second graph convolution block
        self.gc2 = GraphConv(128, activation_fn=tf.nn.tanh)
        self.batch_norm2 = layers.BatchNormalization()
        self.gp2 = GraphPool()

        # Dense layer, graph-level readout, and per-task classification head
        self.dense1 = layers.Dense(256, activation=tf.nn.tanh)
        self.batch_norm3 = layers.BatchNormalization()
        self.readout = GraphGather(batch_size=batch_size, activation_fn=tf.nn.tanh)

        self.dense2 = layers.Dense(n_tasks*2)
        self.logits = layers.Reshape((n_tasks, 2))
        self.softmax = layers.Softmax()

    def call(self, inputs):
        gc1_output = self.gc1(inputs)
        batch_norm1_output = self.batch_norm1(gc1_output)
        gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

        gc2_output = self.gc2([gp1_output] + inputs[1:])
        batch_norm2_output = self.batch_norm2(gc2_output)
        gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

        dense1_output = self.dense1(gp2_output)
        batch_norm3_output = self.batch_norm3(dense1_output)
        readout_output = self.readout([batch_norm3_output] + inputs[1:])

        logits_output = self.logits(self.dense2(readout_output))
        return self.softmax(logits_output)
We can now see more clearly what is happening. There are two convolutional blocks, each consisting of a GraphConv, followed by batch normalization, followed by a GraphPool to do max pooling. We finish up with a dense layer, another batch normalization, a GraphGather to combine the data from all the different nodes, and a final dense layer to produce the global output.
Let's now create the DeepChem model which will be a wrapper around the Keras model that we just created. We will also specify the loss function so the model knows the objective to minimize.
model = dc.models.KerasModel(GraphConvModelTensorflow(), loss=dc.models.losses.CategoricalCrossEntropy())
What are the inputs to this model? A graph convolution requires a complete description of each molecule, including the list of nodes (atoms) and a description of which ones are bonded to each other. In fact, if we inspect the dataset we see that the feature array contains Python objects of type ConvMol.
test_dataset.X[0]
<deepchem.feat.mol_graphs.ConvMol at 0x7bf66bfa1160>
Models expect arrays of numbers as their inputs, not Python objects. We must convert the ConvMol objects into the particular set of arrays expected by the GraphConv, GraphPool, and GraphGather layers. Fortunately, the ConvMol class includes the code to do this, as well as to combine all the molecules in a batch to create a single set of arrays.
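It can also help to inspect one of these ConvMol objects directly before batching them up. This optional check uses only accessors that appear elsewhere in this tutorial (get_atom_features() and get_deg_adjacency_lists()); the exact numbers printed depend on the molecule.
sample_mol = test_dataset.X[0]
print('Atom features shape:', sample_mol.get_atom_features().shape)
print('Adjacency lists grouped by degree:', len(sample_mol.get_deg_adjacency_lists()))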
The following code creates a Python generator that, given a batch of data, generates the lists of inputs, labels, and weights whose values are NumPy arrays. atom_features holds a feature vector of length 75 for each atom. The other inputs are required to support minibatching in TensorFlow. degree_slice is an indexing convenience that makes it easy to locate atoms from all molecules with a given degree. membership determines the membership of atoms in molecules (atom i belongs to molecule membership[i]). deg_adjs is a list that contains adjacency lists grouped by atom degree. For more details, check out the code.
from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol
import numpy as np
def data_generator(dataset, epochs=1):
    for ind, (X_b, y_b, w_b, ids_b) in enumerate(dataset.iterbatches(batch_size, epochs,
                                                                     deterministic=False, pad_batches=True)):
        # Merge all molecules in the batch into a single ConvMol with combined arrays.
        multiConvMol = ConvMol.agglomerate_mols(X_b)
        inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, np.array(multiConvMol.membership)]
        # Append the adjacency lists, grouped by atom degree.
        for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
            inputs.append(multiConvMol.get_deg_adjacency_lists()[i])
        labels = [to_one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)]
        weights = [w_b]
        yield (inputs, labels, weights)
Now, we can train the model using fit_generator(generator), which will use the generator we've defined to train the model.
model.fit_generator(data_generator(train_dataset, epochs=50))
0.23354644775390626
Now that we have trained our graph convolutional method, let's evaluate its performance. We again have to use our defined generator to evaluate model performance.
print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric], transformers))
Training set score: {'roc_auc_score': 0.8370577643901682}
Test set score: {'roc_auc_score': 0.6610993488016647}
PyTorch
Before working on the PyTorch implementation, we must import a few crucial layers from the torch_models collection. These are PyTorch implementations of GraphConv, GraphPool, and GraphGather, which we used in the TensorFlow implementation as well.
import torch
import torch.nn as nn
from deepchem.models.torch_models.layers import GraphConv, GraphGather, GraphPool
PyTorch's GraphConv requires the number of input features to be specified, so we can extract that piece of information with the following steps:
- First we get a sample from the dataset.
- Next we slice and separate the node_features (which is the first element of the list, hence the index 0).
- Finally, we obtain the number of features by finding the shape of the array.
sample_batch = next(data_generator(train_dataset))
node_features = sample_batch[0][0]
num_input_features = node_features.shape[1]
print(f"Number of input features: {num_input_features}")
Number of input features: 75
class GraphConvModelTorch(nn.Module):

    def __init__(self):
        super(GraphConvModelTorch, self).__init__()
        # First graph convolution block
        self.gc1 = GraphConv(out_channel=128, number_input_features=num_input_features, activation_fn=nn.Tanh())
        self.batch_norm1 = nn.BatchNorm1d(128)
        self.gp1 = GraphPool()

        # Second graph convolution block
        self.gc2 = GraphConv(out_channel=128, number_input_features=128, activation_fn=nn.Tanh())
        self.batch_norm2 = nn.BatchNorm1d(128)
        self.gp2 = GraphPool()

        # Dense layer, graph-level readout, and per-task classification head
        self.dense1 = nn.Linear(128, 256)
        self.act3 = nn.Tanh()
        self.batch_norm3 = nn.BatchNorm1d(256)
        self.readout = GraphGather(batch_size=batch_size, activation_fn=nn.Tanh())

        self.dense2 = nn.Linear(512, n_tasks * 2)
        self.logits = lambda data: data.view(-1, n_tasks, 2)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, inputs):
        gc1_output = self.gc1(inputs)
        batch_norm1_output = self.batch_norm1(gc1_output)
        gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

        gc2_output = self.gc2([gp1_output] + inputs[1:])
        batch_norm2_output = self.batch_norm2(gc2_output)
        gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

        dense1_output = self.act3(self.dense1(gp2_output))
        batch_norm3_output = self.batch_norm3(dense1_output)
        readout_output = self.readout([batch_norm3_output] + inputs[1:])

        dense2_output = self.dense2(readout_output)
        logits_output = self.logits(dense2_output)
        softmax_output = self.softmax(logits_output)
        return softmax_output
model = dc.models.TorchModel(GraphConvModelTorch(), loss=dc.models.losses.CategoricalCrossEntropy())
model.fit_generator(data_generator(train_dataset, epochs=50))
0.2121513557434082
print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric], transformers))
Training set score: {'roc_auc_score': 0.9838238607233897}
Test set score: {'roc_auc_score': 0.6923516284964811}
Success! Both the models we've constructed behave nearly identically to GraphConvModel. If you're looking to build your own custom models, you can follow the examples we've provided here to do so. We hope to see exciting constructions from your end soon!
Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!