Tutorials

1. Introduction To Deepchem

1. The Basic Tools of the Deep Life Sciences

2. Working With Datasets

3. An Introduction To MoleculeNet

4. Molecular Fingerprints

5. Creating Models with TensorFlow and PyTorch

6. Introduction to Graph Convolutions

7. Going Deeper on Molecular Featurizations

8. Working With Splitters

9. Advanced Model Training

10. Creating a high fidelity model from experimental data

11. Putting Multitask Learning to Work

12. Modeling Protein Ligand Interactions

13. Modeling Protein Ligand Interactions With Atomic Convolutions

14. Conditional Generative Adversarial Networks

15. Training a Generative Adversarial Network on MNIST

16. Advanced model training using hyperopt

17. Introduction to Gaussian Processes

18. PytorchLightning Integration

19. Compiling DeepChem Torch Models

2. Molecular Machine Learning

1. Molecular Fingerprints

2. Going Deeper on Molecular Featurizations

3. Learning Unsupervised Embeddings for Molecules

4. Synthetic Feasibility Scoring

5. Atomic Contributions for Molecules

6. Interactive Model Evaluation with Trident Chemwidgets

7. Transfer Learning With ChemBERTa Transformers

8. Training a Normalizing Flow on QM9

9. Large Scale Chemical Screens

10. Introduction to Molecular Attention Transformer

11. Generating molecules with MolGAN

12. Introduction to GROVER

13. Introduction to PROTACs

14. Druggablity Assessment with Fpocket and Machine Learning

3. Modeling Proteins

1. Protein Deep Learning

2. DeepChem AntibodyTutorial Simplified

3. Protein Structure Prediction with ESMFold

4. Introduction to ProtBERT

5. ProteinLM Tutorial0

4. Protein Ligand Modeling

1. Introduction to Binding Sites

2. Modeling Protein Ligand Interactions

3. Modeling Protein Ligand Interactions With Atomic Convolutions

4. DeepChemXAlphafold

5. UniProt Data Preprocessing for Binding Sites

5. Quantum Chemistry

1. Exploring Quantum Chemistry with GDB1k

2. DeepQMC tutorial

3. Training an Exchange Correlation Functional using Deepchem

6. Bioinformatics

1. Introduction to Bioinformatics

2. Multisequence Alignments

3. Scanpy

4. Deep probabilistic analysis of single-cell omics data

5. Cell Counting Tutorial

7. Material Science

1. Introduction To Material Science

8. Machine Learning Methods

1. Using Reinforcement Learning to Play Pong

2. Introduction to Model Interpretability

3. Uncertainty In Deep Learning

9. Deep Differential Equations

1. Physics Informed Neural Networks

2. Introducing JaxModel and PINNModel

3. About nODE Using Torchdiffeq in Deepchem

4. Differentiation Infrastructure in Deepchem

5. Ordinary Differential Equation Solving using deepchem

10. Equivariance

1. Introduction to Equivariance

11. Olfaction

1. Predict Multi Label Odor Descriptors using OpenPOM

12. Polymer Science

1. An Introduction to the Polymers and Their Representation

2. crystallization tendency regression

3. Introduction to Polymer SMILES

4. Understanding Weighted Directed Graphs for Polymer Implimentations

5. PolyBERT Introduction to Chemical Language Models for Fingerprint Generation

Training a Normalizing Flow on QM9 ¶

By Nathan C. Frey | Twitter

In this tutorial, we will train a Normalizing Flow (NF) on the QM9 dataset . The dataset comprises 133,885 stable small organic molecules made up of CHNOF atoms. We will try to train a network that is an invertible transformation between a simple base distribution and the distribution of molecules in QM9. One of the key advantages of normalizing flows is that they can be constructed to efficiently sample from a distribution (generative modeling) and do probability density calculations (exactly compute log-likelihoods), whereas other models make tradeoffs between the two or can only approximate probability densities. This work has been published and considered as FastFlows see reference .

NFs are useful whenever we need a probabilistic model with one or both of these capabilities. Note that because NFs are completely invertible, there is no "latent space" in the sense used when referring to generative adversarial networks or variational autoencoders. For more on NFs, we refer to this review paper .

To encode the QM9 dataset, we'll make use of the SELFIES (SELF-referencIng Embedded Strings) representation, which is a 100% robust molecular string representation. SMILES strings produced by generative models are often syntactically invalid (they do not correspond to a molecular graph), or they violate chemical rules like the maximum number of bonds between atoms. SELFIES are designed so that even totally random SELFIES strings correspond to valid molecular graphs, so they are a great framework for generative modeling. For more details about SELFIES, see the GitHub repo and the associated paper .

Colab ¶

This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

Setup ¶

To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.

In [1]:

!pip install --pre deepchem
import deepchem
deepchem.__version__

Requirement already satisfied: deepchem in /usr/local/lib/python3.7/dist-packages (2.6.1)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.3.5)
Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.0.2)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.21.5)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.4.1)
Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.1.0)
Requirement already satisfied: rdkit-pypi in /usr/local/lib/python3.7/dist-packages (from deepchem) (2021.9.5.1)
Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->deepchem) (2018.9)
Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->deepchem) (2.8.2)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->deepchem) (1.15.0)
Requirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (from rdkit-pypi->deepchem) (7.1.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->deepchem) (3.1.0)

Out[1]:

'2.6.1'

Install the SELFIES library to translate SMILES strings.

In [2]:

!pip install selfies

Requirement already satisfied: selfies in /usr/local/lib/python3.7/dist-packages (2.0.0)

In [3]:

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os

import deepchem as dc
from deepchem.models.normalizing_flows import NormalizingFlow, NormalizingFlowModel
from deepchem.models.optimizers import Adam
from deepchem.data import NumpyDataset
from deepchem.splits import RandomSplitter
from deepchem.molnet import load_tox21

import rdkit
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw

from IPython.display import Image, display

import selfies as sf

import tensorflow as tf
import tensorflow_probability as tfp

tfd = tfp.distributions
tfb = tfp.bijectors
tfk = tf.keras

tfk.backend.set_floatx('float64')

First, let's get a dataset of 2500 small organic molecules from the QM9 dataset. We'll then convert the molecules to SELFIES, one-hot encode them, and dequantize the inputs so they can be processed by a normalizing flow. 2000 molecules will be used for training, while the remaining 500 will be split into validation and test sets. We'll use the validation set to see how our architecture is doing at learning the underlying the distribution, and leave the test set alone. You should feel free to experiment with this notebook to get the best model you can and evaluate it on the test set when you're done!

In [4]:

# Download from MolNet
tasks, datasets, transformers = dc.molnet.load_qm9(featurizer='ECFP')
df = pd.DataFrame(data={'smiles': datasets[0].ids})

In [5]:

data = df[['smiles']].sample(2500, random_state=42)

SELFIES defines a dictionary called bond_constraints that enforces how many bonds every atom or ion can make. E.g., 'C': 4, 'H': 1, etc. The ? symbol is used for any atom or ion that isn't defined in the dictionary, and it defaults to 8 bonds. Because QM9 contains ions and we don't want to allow those ions to form up to 8 bonds, we'll constrain them to 3. This will really improve the percentage of valid molecules we generate. You can read more about setting constraints in the SELFIES documentation .

In [6]:

sf.set_semantic_constraints()  # reset constraints
constraints = sf.get_semantic_constraints()
constraints['?'] = 3

sf.set_semantic_constraints(constraints)
constraints

Out[6]:

{'?': 3,
 'B': 3,
 'B+1': 2,
 'B-1': 4,
 'Br': 1,
 'C': 4,
 'C+1': 5,
 'C-1': 3,
 'Cl': 1,
 'F': 1,
 'H': 1,
 'I': 1,
 'N': 3,
 'N+1': 4,
 'N-1': 2,
 'O': 2,
 'O+1': 3,
 'O-1': 1,
 'P': 5,
 'P+1': 6,
 'P-1': 4,
 'S': 6,
 'S+1': 7,
 'S-1': 5}

In [7]:

def preprocess_smiles(smiles):
  return sf.encoder(smiles)  

def keys_int(symbol_to_int):
  d={}
  i=0
  for key in symbol_to_int.keys():
    d[i]=key
    i+=1
  return d

data['selfies'] = data['smiles'].apply(preprocess_smiles)

Let's take a look at some short SMILES strings and their corresponding SELFIES representations. We can see right away that there is a key difference in how the two representations deal with Rings and Branches. SELFIES is designed so that branch length and ring size are stored locally with the Branch and Ring identifiers, and the SELFIES grammar prevents invalid strings.

In [8]:

data['len'] = data['smiles'].apply(lambda x: len(x))
data.sort_values(by='len').head()

Out[8]:

	smiles	selfies	len
6728	[H]c1nnc([H])o1	[H][C][=N][N][=C][Branch1][C][H][O][Ring1][=Br...	15
72803	[H]c1nnnc([H])n1	[H][C][=N][N][=N][C][Branch1][C][H][=N][Ring1]...	16
97670	[H]c1onnc1C(F)(F)F	[H][C][O][N][=N][C][=Ring1][Branch1][C][Branch...	18
25487	[H]n1nnc(C#CC#N)n1	[H][N][N][=N][C][Branch1][Branch1][C][#C][C][#...	18
32004	[H]C#Cc1nnc(F)nc1[H]	[H][C][#C][C][=N][N][=C][Branch1][C][F][N][=C]...	20

To convert SELFIES to a one-hot encoded representation, we need to construct an alphabet of all the characters that occur in the list of SELFIES strings. We also have to know what the longest SELFIES string is, so that all the shorter SELFIES can be padded with '[nop]' to be equal length.

In [9]:

selfies_list = np.asanyarray(data.selfies)
selfies_alphabet = sf.get_alphabet_from_selfies(selfies_list)
selfies_alphabet.add('[nop]')  # Add the "no operation" symbol as a padding character
selfies_alphabet.add('.') 
selfies_alphabet = list(sorted(selfies_alphabet))
largest_selfie_len = max(sf.len_selfies(s) for s in selfies_list)
symbol_to_int = dict((c, i) for i, c in enumerate(selfies_alphabet))
int_mol=keys_int(symbol_to_int)

selfies has a handy utility function to translate SELFIES strings into one-hot encoded vectors.

In [10]:

onehots=sf.batch_selfies_to_flat_hot(selfies_list, symbol_to_int,largest_selfie_len)

Next, we "dequantize" the inputs by adding random noise from the interval [0, 1) to every input in the encodings. This allows the normalizing flow to operate on continuous inputs (rather than discrete), and the original inputs can easily be recovered by applying a floor function.

In [11]:

input_tensor = tf.convert_to_tensor(onehots, dtype='float64')
noise_tensor = tf.random.uniform(shape=input_tensor.shape, minval=0, maxval=1, dtype='float64')
dequantized_data = tf.add(input_tensor, noise_tensor)

The dequantized data is ready to be processed as a DeepChem dataset and split into training, validation, and test sets. We'll also keep track of the SMILES strings for the training set so we can compare the training data to our generated molecules later on.

In [12]:

ds = NumpyDataset(dequantized_data)  # Create a DeepChem dataset
splitter = RandomSplitter()
train, val, test = splitter.train_valid_test_split(dataset=ds, seed=42)
train_idx, val_idx, test_idx = splitter.split(dataset=ds, seed=42)

dim = len(train.X[0])  # length of one-hot encoded vectors
train.X.shape  # 2000 samples, N-dimensional one-hot vectors that represent molecules

Out[12]:

(2000, 2596)

In [13]:

# SMILES strings of training data
train_smiles = data['smiles'].iloc[train_idx].values

Next we'll set up the normalizing flow model. The base distribution is a multivariate Normal distribution. The permutation layer permutes the dimensions of the input so that the normalizing flow layers will operate along multiple dimensions of the inputs. To understand why the permutation is needed, we need to know a bit about how the normalizing flow architecture works.

In [14]:

base_dist = tfd.MultivariateNormalDiag(loc=np.zeros(dim), scale_diag=np.ones(dim))

if dim % 2 == 0:
    permutation = tf.cast(np.concatenate((np.arange(dim / 2, dim), np.arange(0, dim / 2))),
                                  tf.int32)
else:
    permutation = tf.cast(np.concatenate((np.arange(dim / 2 + 1, dim), np.arange(0, dim / 2))), tf.int32)

For this simple example, we'll set up a flow of repeating Masked Autoregressive Flow layers. The autoregressive property is enforced by using the Masked Autoencoder for Distribution Estimation architecture. The layers of the flow are a bijector, an invertible mapping between the base and target distributions.

MAF takes the inputs from the base distribution and transforms them with a simple scale-and-shift (affine) operation, but crucially the scale-and-shift for each dimension of the output depends on the previously generated dimensions of the output. That independence of future dimensions preserves the autoregressive property and ensures that the normalizing flow is invertible. Now we can see that we need permutations to change the ordering of the inputs, or else the normalizing flow would only transform certain dimensions of the inputs.

Batch Normalization layers can be added for additional stability in training, but may have strange effects on the outputs and require some input reshaping to work properly. Increasing num_layers and hidden_units can make more expressive flows capable of modeling more complex target distributions.

In [15]:

num_layers = 8
flow_layers = []

Made = tfb.AutoregressiveNetwork(params=2,
                                 hidden_units=[512, 512], activation='relu')

for i in range(num_layers):
    flow_layers.append(        
        (tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=Made)
    ))

    permutation = tf.cast(np.random.permutation(np.arange(0, dim)), tf.int32)
    
    flow_layers.append(tfb.Permute(permutation=permutation))
    
#     if (i + 1) % int(2) == 0:
#         flow_layers.append(tfb.BatchNormalization())

We can draw samples from the untrained distribution, but for now they don't have any relation to the QM9 dataset distribution.

In [16]:

%%time
nf = NormalizingFlow(base_distribution=base_dist,
                    flow_layers=flow_layers)

CPU times: user 280 ms, sys: 10.2 ms, total: 290 ms
Wall time: 289 ms

A NormalizingFlowModel takes a NormalizingFlow and any parameters used by deepchem.models.KerasModel .

In [17]:

nfm = NormalizingFlowModel(nf, learning_rate=1e-4, batch_size=128)

Now to train the model! We'll try to minimize the negative log likelihood loss, which measures the likelihood that generated samples are drawn from the target distribution, i.e. as we train the model, it should get better at modeling the target distribution and it will generate samples that look like molecules from the QM9 dataset.

In [18]:

losses = []
val_losses = []

In [19]:

%%time
max_epochs = 10 # maximum number of epochs of the training

for epoch in range(max_epochs):
  loss = nfm.fit(train, nb_epoch=1, all_losses=losses)
  val_loss = nfm.create_nll(val.X)
  val_losses.append(val_loss.numpy())

WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596).
CPU times: user 13min 40s, sys: 20.9 s, total: 14min 1s
Wall time: 7min 27s

In [20]:

f, ax = plt.subplots()
ax.scatter(range(len(losses)), losses, label='train loss')
ax.scatter(range(len(val_losses)), val_losses, label='val loss')
plt.legend(loc='upper right');

No description has been provided for this image

The normalizing flow is learning a mapping between the multivariate Gaussian and the target distribution! We can see this by visualizing the loss on the validation set. We can now use nfm.flow.sample() to generate new QM9-like molecules and nfm.flow.log_prob() to evaluate the likelihood that a molecule was drawn from the underlying distribution.

In [21]:

generated_samples = nfm.flow.sample(10)  # generative modeling
log_probs = nfm.flow.log_prob(generated_samples)  # probability density estimation

Now we transform the generated samples back into SELFIES. We have to quantize the outputs and add padding characters to any one-hot encoding vector that has all zeros.

In [22]:

mols = tf.math.floor(generated_samples)  # quantize data
mols = tf.clip_by_value(mols, 0, 1)  # Set negative values to 0 and values > 1 to 1
mols_list = mols.numpy().tolist()

# Add padding characters if needed
for mol in mols_list:
  for i in range(largest_selfie_len):
    row = mol[len(selfies_alphabet) * i: len(selfies_alphabet) * (i + 1)]
    if all(elem == 0 for elem in row):
      mol[len(selfies_alphabet) * (i+1) - 1] = 1

selfies has another utility function to translate one-hot encoded representations back to SELFIES strings.

In [23]:

mols=sf.batch_flat_hot_to_selfies(mols_list, int_mol)

We can use RDKit to find valid generated molecules. Some have unphysical valencies and should be discarded. If you've ever tried to generate valid SMILES strings, you'll notice right away that this model is doing much better than we would expect! Using SELFIES, 90% of the generated molecules are valid, even though our normalizing flow architecture doesn't know any rules that govern chemical validity.

In [24]:

from rdkit import RDLogger  
from rdkit import Chem
RDLogger.DisableLog('rdApp.*')  # suppress error messages

valid_count = 0
valid_selfies, invalid_selfies = [], []
for idx, selfies in enumerate(mols):
  try:
    if Chem.MolFromSmiles(sf.decoder(mols[idx]), sanitize=True) is not None:
        valid_count += 1
        valid_selfies.append(selfies)
    else:
      invalid_selfies.append(selfies)
  except Exception:
    pass
print('%.2f' % (valid_count / len(mols)),  '% of generated samples are valid molecules.')

1.00 % of generated samples are valid molecules.

Let's take a look at some of the generated molecules! We'll borrow some helper functions from the Modeling Solubility tutorial to display molecules with RDKit.

In [25]:

gen_mols = [Chem.MolFromSmiles(sf.decoder(vs)) for vs in valid_selfies]

In [26]:

def display_images(filenames):
    """Helper to pretty-print images."""
    for file in filenames:
      display(Image(file))

def mols_to_pngs(mols, basename="generated_mol"):
    """Helper to write RDKit mols to png files."""
    filenames = []
    for i, mol in enumerate(mols):
        filename = "%s%d.png" % (basename, i)
        Draw.MolToFile(mol, filename)
        filenames.append(filename)
    return filenames

In [27]:

display_mols = []
for i in range(10):
  display_mols.append(gen_mols[i])

display_images(mols_to_pngs(display_mols))

Finally, we can compare generated molecules with our training data via a similarity search with Tanimoto similarity. This gives an indication of how "original" the generated samples are, versus simply producing samples that are extremely similar to molecules the model has already seen. We have to keep in mind that QM9 contains all stable small molecules with up to 9 heavy atoms (CONF). So anything new we generate either exists in the full QM9 dataset, or else will not obey the charge neutrality and stability criteria used to generated QM9.

In [28]:

from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol
from rdkit.DataStructs import FingerprintSimilarity
from IPython.display import display

def tanimoto_similarity(database_mols, query_mol):
    """Compare generated molecules to database by Tanimoto similarity."""
    # convert Mol to datastructure type
    fps = [FingerprintMol(m) for m in database_mols]
    
    # set a query molecule to compare against database
    query = FingerprintMol(query_mol)
    
    similarities = []
    
    # loop through to find Tanimoto similarity
    for idx, f in enumerate(fps):
        # tuple: (idx, similarity)
        similarities.append((idx, FingerprintSimilarity(query, f)))
    
    # sort sim using the similarities
    similarities.sort(key=lambda x:x[1], reverse=True)
    
    return similarities

We'll consider our generated molecules and look at the top 3 most similar molecules from the training data by Tanimoto similarity. Here's an example where the Tanimoto similarity scores are medium. There are molecules in our training set that are similar to our generated sample. This might be interesting, or it might mean that the generated molecule is unrealistic.

In [29]:

train_mols = [Chem.MolFromSmiles(smiles) for smiles in train_smiles]

In [30]:

# change the second argument to compare different generated molecules to QM9
tanimoto_scores = tanimoto_similarity(train_mols, gen_mols[3])
similar_mols = []

In [31]:

for idx, ts in tanimoto_scores[:3]:
    print(round(ts, 3))
    similar_mols.append(train_mols[idx])

display_images(mols_to_pngs(similar_mols, 'qm9_mol'))

0.521
0.471
0.468

Molecules of the previous tutorial: ¶

download (9).png

download (8).png

download (7).png

download (6).png

download (5).png

download (4).png

download (3).png

download (2).png

download (1).png

These molecules were obteined through sampling.

Comparing with the tanimoto similarity:

download (12).png

download (11).png

download (10).png

With scores of:

0.243
0.243
0.241

Congratulations! Time to join the Community! ¶

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub ¶

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Gitter ¶

The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!