Introduction to Graph Convolutions
In this tutorial we will learn more about "graph convolutions." These are one of the most powerful deep learning tools for working with molecular data. The reason for this is that molecules can be naturally viewed as graphs.
Note how standard chemical diagrams of the sort we're used to from high school lend themselves naturally to visualizing molecules as graphs. In the remainder of this tutorial, we'll dig into this relationship in significantly more detail. This will let us get a deeper understanding of how these systems work.
Colab
This tutorial and the rest in this sequence are designed to be done in Google Colab. If you'd like to open this notebook in Colab, you can use the following link.
!pip install --pre deepchem
What are Graph Convolutions?
Consider a standard convolutional neural network (CNN) of the sort commonly used to process images. The input is a grid of pixels. There is a vector of data values for each pixel, for example the red, green, and blue color channels. The data passes through a series of convolutional layers. Each layer combines the data from a pixel and its neighbors to produce a new data vector for the pixel. Early layers detect small-scale local patterns, while later layers detect larger, more abstract patterns. Often the convolutional layers alternate with pooling layers that perform some operation such as max or average pooling over local regions.
Graph convolutions are similar, but they operate on a graph. They begin with a data vector for each node of the graph (for example, the chemical properties of the atom that node represents). Convolutional and pooling layers combine information from connected nodes (for example, atoms that are bonded to each other) to produce a new data vector for each node.
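To make this concrete, here is a minimal sketch of a single graph-convolution step written in plain NumPy. This is not how DeepChem implements its layers; the toy adjacency list and the weight matrices are made up purely for illustration. Each atom's new feature vector is a nonlinear mix of its own features and the summed features of its bonded neighbors.
import numpy as np

# Toy molecule: 3 atoms, 4 features per atom (random values, purely illustrative).
node_features = np.random.rand(3, 4)

# Adjacency list: atom 0 bonds to atom 1, atom 1 bonds to atoms 0 and 2, atom 2 bonds to atom 1.
adjacency = [[1], [0, 2], [1]]

# Stand-ins for learned weights: one matrix for an atom's own features, one for its neighbors'.
W_self = np.random.rand(4, 8)
W_neigh = np.random.rand(4, 8)

def graph_conv_step(features, adjacency, W_self, W_neigh):
    # Mix each atom's own features with the summed features of its bonded neighbors.
    new_features = []
    for i, neighbors in enumerate(adjacency):
        neighbor_sum = features[neighbors].sum(axis=0)
        new_features.append(np.tanh(features[i] @ W_self + neighbor_sum @ W_neigh))
    return np.stack(new_features)

print(graph_conv_step(node_features, adjacency, W_self, W_neigh).shape)  # (3, 8)
Stacking several such steps lets information propagate along longer paths in the graph, which is exactly what the convolutional layers below will do for molecules.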
Training a GraphConvModel
Let's use the MoleculeNet suite to load the Tox21 dataset. To featurize the data in a way that graph convolutional networks can use, we set the featurizer option to 'GraphConv'. The MoleculeNet call returns a training set, a validation set, and a test set for us to use. It also returns tasks, a list of the task names, and transformers, a list of data transformations that were applied to preprocess the dataset. (Most deep networks are quite finicky and require a set of data transformations to ensure that training proceeds stably.)
Note: While importing DeepChem, if you see any warnings, ignore them for now. DeepChem is a vast library and there are many things that can cause minor warnings to occur. They almost never require any action on your part.
import deepchem as dc
tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='GraphConv')
train_dataset, valid_dataset, test_dataset = datasets
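As an optional sanity check (not part of the original workflow), you can print what was just loaded; the exact contents depend on your DeepChem version.
print('Number of tasks:', len(tasks))
print('Tasks:', tasks)
print('Transformers:', transformers)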
Let's now train a graph convolutional network on this dataset. DeepChem has the class GraphConvModel that wraps a standard graph convolutional architecture under the hood for user convenience. Let's instantiate an object of this class and train it on our dataset.
n_tasks = len(tasks)
num_features = train_dataset.X[0].get_atom_features().shape[1]
model = dc.models.torch_models.GraphConvModel(n_tasks, mode='classification', number_input_features=[num_features, 64])
model.fit(train_dataset, nb_epoch=50)
0.29102970123291017
Let's try to evaluate the performance of the model we've trained. For this, we need to define a metric, a measure of model performance. dc.metrics holds a collection of metrics already. For this dataset, it is standard to use the ROC-AUC score, the area under the receiver operating characteristic curve (which measures the tradeoff between the true positive rate and the false positive rate). Luckily, the ROC-AUC score is already available in DeepChem. To measure the performance of the model under this metric, we can use the convenience function model.evaluate().
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
print('Training set score:', model.evaluate(train_dataset, [metric], transformers))
print('Test set score:', model.evaluate(test_dataset, [metric], transformers))
Training set score: {'roc_auc_score': 0.970785822904073}
Test set score: {'roc_auc_score': 0.7112009940440461}
The results are pretty good, and GraphConvModel is very easy to use. But what's going on under the hood? Could we build GraphConvModel ourselves? Of course! DeepChem provides both Keras and PyTorch layers for all the calculations involved in a graph convolution. We are going to apply the following layers from DeepChem.
- GraphConv layer: This layer implements the graph convolution. The graph convolution combines per-node feature vectors in a nonlinear fashion with the feature vectors of neighboring nodes. This "blends" information in local neighborhoods of a graph.
- GraphPool layer: This layer does a max-pooling over the feature vectors of atoms in a neighborhood. You can think of this layer as analogous to a max-pooling layer for 2D convolutions, but one that operates on graphs instead.
- GraphGather layer: Many graph convolutional networks manipulate feature vectors per graph node. For a molecule, for example, each node might represent an atom, and the network would manipulate atomic feature vectors that summarize the local chemistry of the atom. However, at the end of the application, we will likely want to work with a molecule-level feature representation. This layer creates a graph-level feature vector by combining all the node-level feature vectors.
Apart from these, we are going to apply standard neural network layers such as Dense, BatchNormalization, and Softmax.
Training a custom Graph Convolution network
As you may have seen in the previous tutorials, DeepChem offers both PyTorch and TensorFlow functionality. However, most of our work moving forward will leverage the PyTorch ecosystem. Let's look at the TensorFlow implementation first.
TensorFlow
from deepchem.models.layers import GraphConv, GraphPool, GraphGather
import tensorflow as tf
import tensorflow.keras.layers as layers
batch_size = 100
class GraphConvModelTensorflow(tf.keras.Model):

    def __init__(self):
        super(GraphConvModelTensorflow, self).__init__()
        # First graph convolution block
        self.gc1 = GraphConv(128, activation_fn=tf.nn.tanh)
        self.batch_norm1 = layers.BatchNormalization()
        self.gp1 = GraphPool()

        # Second graph convolution block
        self.gc2 = GraphConv(128, activation_fn=tf.nn.tanh)
        self.batch_norm2 = layers.BatchNormalization()
        self.gp2 = GraphPool()

        # Dense layer, graph-level readout, and per-task classification head
        self.dense1 = layers.Dense(256, activation=tf.nn.tanh)
        self.batch_norm3 = layers.BatchNormalization()
        self.readout = GraphGather(batch_size=batch_size, activation_fn=tf.nn.tanh)

        self.dense2 = layers.Dense(n_tasks*2)
        self.logits = layers.Reshape((n_tasks, 2))
        self.softmax = layers.Softmax()

    def call(self, inputs):
        gc1_output = self.gc1(inputs)
        batch_norm1_output = self.batch_norm1(gc1_output)
        gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

        gc2_output = self.gc2([gp1_output] + inputs[1:])
        batch_norm2_output = self.batch_norm2(gc2_output)
        gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

        dense1_output = self.dense1(gp2_output)
        batch_norm3_output = self.batch_norm3(dense1_output)
        readout_output = self.readout([batch_norm3_output] + inputs[1:])

        logits_output = self.logits(self.dense2(readout_output))
        return self.softmax(logits_output)
We can now see more clearly what is happening. There are two convolutional blocks, each consisting of a GraphConv, followed by batch normalization, followed by a GraphPool to do max pooling. We finish up with a dense layer, another batch normalization, a GraphGather to combine the data from all the different nodes, and a final dense layer to produce the global output.
Let's now create the DeepChem model which will be a wrapper around the Keras model that we just created. We will also specify the loss function so the model knows the objective to minimize.
model = dc.models.KerasModel(GraphConvModelTensorflow(), loss=dc.models.losses.CategoricalCrossEntropy())
What are the inputs to this model? A graph convolution requires a complete description of each molecule, including the list of nodes (atoms) and a description of which ones are bonded to each other. In fact, if we inspect the dataset we see that the feature array contains Python objects of type ConvMol.
test_dataset.X[0]
<deepchem.feat.mol_graphs.ConvMol at 0x7bf66bfa1160>
Models expect arrays of numbers as their inputs, not Python objects. We must convert the ConvMol objects into the particular set of arrays expected by the GraphConv, GraphPool, and GraphGather layers. Fortunately, the ConvMol class includes the code to do this, as well as to combine all the molecules in a batch to create a single set of arrays.
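It can also help to inspect one of these ConvMol objects directly before batching them up. This optional check uses only accessors that appear elsewhere in this tutorial (get_atom_features() and get_deg_adjacency_lists()); the exact numbers printed depend on the molecule.
sample_mol = test_dataset.X[0]
print('Atom features shape:', sample_mol.get_atom_features().shape)
print('Adjacency lists grouped by degree:', len(sample_mol.get_deg_adjacency_lists()))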
The following code creates a Python generator that, given a batch of data, generates the lists of inputs, labels, and weights whose values are NumPy arrays. atom_features holds a feature vector of length 75 for each atom. The other inputs are required to support minibatching in TensorFlow. degree_slice is an indexing convenience that makes it easy to locate atoms from all molecules with a given degree. membership determines the membership of atoms in molecules (atom i belongs to molecule membership[i]). deg_adjs is a list that contains adjacency lists grouped by atom degree. For more details, check out the code.
from deepchem.metrics import to_one_hot
from deepchem.feat.mol_graphs import ConvMol
import numpy as np
def data_generator(dataset, epochs=1):
    for ind, (X_b, y_b, w_b, ids_b) in enumerate(dataset.iterbatches(batch_size, epochs,
                                                                     deterministic=False, pad_batches=True)):
        # Merge all molecules in the batch into a single ConvMol with combined arrays.
        multiConvMol = ConvMol.agglomerate_mols(X_b)
        inputs = [multiConvMol.get_atom_features(), multiConvMol.deg_slice, np.array(multiConvMol.membership)]
        # Append the adjacency lists, grouped by atom degree.
        for i in range(1, len(multiConvMol.get_deg_adjacency_lists())):
            inputs.append(multiConvMol.get_deg_adjacency_lists()[i])
        labels = [to_one_hot(y_b.flatten(), 2).reshape(-1, n_tasks, 2)]
        weights = [w_b]
        yield (inputs, labels, weights)
Now, we can train the model using fit_generator(generator), which will use the generator we've defined to train the model.
model.fit_generator(data_generator(train_dataset, epochs=50))
0.23354644775390626
Now that we have trained our graph convolutional method, let's evaluate its performance. We again have to use our defined generator to evaluate model performance.
print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric], transformers))
Training set score: {'roc_auc_score': 0.8370577643901682}
Test set score: {'roc_auc_score': 0.6610993488016647}
PyTorch
Before working on the PyTorch implementation, we must import a few crucial layers from the torch_models collection. These are PyTorch implementations of GraphConv, GraphPool, and GraphGather, which we used in the TensorFlow implementation as well.
import torch
import torch.nn as nn
from deepchem.models.torch_models.layers import GraphConv, GraphGather, GraphPool
PyTorch's GraphConv requires the number of input features to be specified, so we can extract that piece of information with the following steps:
- First we get a sample from the dataset.
- Next we slice and separate the node_features (which is the first element of the list, hence the index 0).
- Finally, we obtain the number of features by finding the shape of the array.
sample_batch = next(data_generator(train_dataset))
node_features = sample_batch[0][0]
num_input_features = node_features.shape[1]
print(f"Number of input features: {num_input_features}")
Number of input features: 75
class GraphConvModelTorch(nn.Module):

    def __init__(self):
        super(GraphConvModelTorch, self).__init__()
        # First graph convolution block
        self.gc1 = GraphConv(out_channel=128, number_input_features=num_input_features, activation_fn=nn.Tanh())
        self.batch_norm1 = nn.BatchNorm1d(128)
        self.gp1 = GraphPool()

        # Second graph convolution block
        self.gc2 = GraphConv(out_channel=128, number_input_features=128, activation_fn=nn.Tanh())
        self.batch_norm2 = nn.BatchNorm1d(128)
        self.gp2 = GraphPool()

        # Dense layer, graph-level readout, and per-task classification head
        self.dense1 = nn.Linear(128, 256)
        self.act3 = nn.Tanh()
        self.batch_norm3 = nn.BatchNorm1d(256)
        self.readout = GraphGather(batch_size=batch_size, activation_fn=nn.Tanh())

        self.dense2 = nn.Linear(512, n_tasks * 2)
        self.logits = lambda data: data.view(-1, n_tasks, 2)
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, inputs):
        gc1_output = self.gc1(inputs)
        batch_norm1_output = self.batch_norm1(gc1_output)
        gp1_output = self.gp1([batch_norm1_output] + inputs[1:])

        gc2_output = self.gc2([gp1_output] + inputs[1:])
        batch_norm2_output = self.batch_norm2(gc2_output)
        gp2_output = self.gp2([batch_norm2_output] + inputs[1:])

        dense1_output = self.act3(self.dense1(gp2_output))
        batch_norm3_output = self.batch_norm3(dense1_output)
        readout_output = self.readout([batch_norm3_output] + inputs[1:])

        dense2_output = self.dense2(readout_output)
        logits_output = self.logits(dense2_output)
        softmax_output = self.softmax(logits_output)
        return softmax_output
model = dc.models.TorchModel(GraphConvModelTorch(), loss=dc.models.losses.CategoricalCrossEntropy())
model.fit_generator(data_generator(train_dataset, epochs=50))
0.2121513557434082
print('Training set score:', model.evaluate_generator(data_generator(train_dataset), [metric], transformers))
print('Test set score:', model.evaluate_generator(data_generator(test_dataset), [metric], transformers))
Training set score: {'roc_auc_score': 0.9838238607233897}
Test set score: {'roc_auc_score': 0.6923516284964811}
Success! Both the models we've constructed behave nearly identically to GraphConvModel. If you're looking to build your own custom models, you can follow the examples we've provided here to do so. We hope to see exciting constructions from your end soon!
Congratulations! Time to join the Community!
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!