Tutorials
Training a Normalizing Flow on QM9 ¶
By Nathan C. Frey | Twitter
In this tutorial, we will train a Normalizing Flow (NF) on the QM9 dataset . The dataset comprises 133,885 stable small organic molecules made up of CHNOF atoms. We will try to train a network that is an invertible transformation between a simple base distribution and the distribution of molecules in QM9. One of the key advantages of normalizing flows is that they can be constructed to efficiently sample from a distribution (generative modeling) and do probability density calculations (exactly compute log-likelihoods), whereas other models make tradeoffs between the two or can only approximate probability densities. This work has been published and considered as FastFlows see reference .
NFs are useful whenever we need a probabilistic model with one or both of these capabilities. Note that because NFs are completely invertible, there is no "latent space" in the sense used when referring to generative adversarial networks or variational autoencoders. For more on NFs, we refer to this review paper .
To encode the QM9 dataset, we'll make use of the SELFIES (SELF-referencIng Embedded Strings) representation, which is a 100% robust molecular string representation. SMILES strings produced by generative models are often syntactically invalid (they do not correspond to a molecular graph), or they violate chemical rules like the maximum number of bonds between atoms. SELFIES are designed so that even totally random SELFIES strings correspond to valid molecular graphs, so they are a great framework for generative modeling. For more details about SELFIES, see the GitHub repo and the associated paper .
Colab ¶
This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.
Setup ¶
To run DeepChem within Colab, you'll need to run the following cell of installation commands. This will take about 5 minutes to run to completion and install your environment.
!pip install --pre deepchem
import deepchem
deepchem.__version__
Requirement already satisfied: deepchem in /usr/local/lib/python3.7/dist-packages (2.6.1) Requirement already satisfied: pandas in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.3.5) Requirement already satisfied: scikit-learn in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.0.2) Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.21.5) Requirement already satisfied: scipy in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.4.1) Requirement already satisfied: joblib in /usr/local/lib/python3.7/dist-packages (from deepchem) (1.1.0) Requirement already satisfied: rdkit-pypi in /usr/local/lib/python3.7/dist-packages (from deepchem) (2021.9.5.1) Requirement already satisfied: pytz>=2017.3 in /usr/local/lib/python3.7/dist-packages (from pandas->deepchem) (2018.9) Requirement already satisfied: python-dateutil>=2.7.3 in /usr/local/lib/python3.7/dist-packages (from pandas->deepchem) (2.8.2) Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.7/dist-packages (from python-dateutil>=2.7.3->pandas->deepchem) (1.15.0) Requirement already satisfied: Pillow in /usr/local/lib/python3.7/dist-packages (from rdkit-pypi->deepchem) (7.1.2) Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.7/dist-packages (from scikit-learn->deepchem) (3.1.0)
'2.6.1'
Install the SELFIES library to translate SMILES strings.
!pip install selfies
Requirement already satisfied: selfies in /usr/local/lib/python3.7/dist-packages (2.0.0)
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import os
import deepchem as dc
from deepchem.models.normalizing_flows import NormalizingFlow, NormalizingFlowModel
from deepchem.models.optimizers import Adam
from deepchem.data import NumpyDataset
from deepchem.splits import RandomSplitter
from deepchem.molnet import load_tox21
import rdkit
from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
from rdkit.Chem import Draw
from IPython.display import Image, display
import selfies as sf
import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
tfk = tf.keras
tfk.backend.set_floatx('float64')
First, let's get a dataset of 2500 small organic molecules from the QM9 dataset. We'll then convert the molecules to SELFIES, one-hot encode them, and dequantize the inputs so they can be processed by a normalizing flow. 2000 molecules will be used for training, while the remaining 500 will be split into validation and test sets. We'll use the validation set to see how our architecture is doing at learning the underlying the distribution, and leave the test set alone. You should feel free to experiment with this notebook to get the best model you can and evaluate it on the test set when you're done!
# Download from MolNet
tasks, datasets, transformers = dc.molnet.load_qm9(featurizer='ECFP')
df = pd.DataFrame(data={'smiles': datasets[0].ids})
data = df[['smiles']].sample(2500, random_state=42)
SELFIES defines a dictionary called
bond_constraints
that enforces how many bonds every atom or ion can make. E.g., 'C': 4, 'H': 1, etc. The
?
symbol is used for any atom or ion that isn't defined in the dictionary, and it defaults to 8 bonds. Because QM9 contains ions and we don't want to allow those ions to form up to 8 bonds, we'll constrain them to 3. This will really improve the percentage of valid molecules we generate. You can read more about setting constraints in the
SELFIES documentation
.
sf.set_semantic_constraints() # reset constraints
constraints = sf.get_semantic_constraints()
constraints['?'] = 3
sf.set_semantic_constraints(constraints)
constraints
{'?': 3, 'B': 3, 'B+1': 2, 'B-1': 4, 'Br': 1, 'C': 4, 'C+1': 5, 'C-1': 3, 'Cl': 1, 'F': 1, 'H': 1, 'I': 1, 'N': 3, 'N+1': 4, 'N-1': 2, 'O': 2, 'O+1': 3, 'O-1': 1, 'P': 5, 'P+1': 6, 'P-1': 4, 'S': 6, 'S+1': 7, 'S-1': 5}
def preprocess_smiles(smiles):
return sf.encoder(smiles)
def keys_int(symbol_to_int):
d={}
i=0
for key in symbol_to_int.keys():
d[i]=key
i+=1
return d
data['selfies'] = data['smiles'].apply(preprocess_smiles)
Let's take a look at some short SMILES strings and their corresponding SELFIES representations. We can see right away that there is a key difference in how the two representations deal with Rings and Branches. SELFIES is designed so that branch length and ring size are stored locally with the
Branch
and
Ring
identifiers, and the SELFIES grammar prevents invalid strings.
data['len'] = data['smiles'].apply(lambda x: len(x))
data.sort_values(by='len').head()
smiles | selfies | len | |
---|---|---|---|
6728 | [H]c1nnc([H])o1 | [H][C][=N][N][=C][Branch1][C][H][O][Ring1][=Br... | 15 |
72803 | [H]c1nnnc([H])n1 | [H][C][=N][N][=N][C][Branch1][C][H][=N][Ring1]... | 16 |
97670 | [H]c1onnc1C(F)(F)F | [H][C][O][N][=N][C][=Ring1][Branch1][C][Branch... | 18 |
25487 | [H]n1nnc(C#CC#N)n1 | [H][N][N][=N][C][Branch1][Branch1][C][#C][C][#... | 18 |
32004 | [H]C#Cc1nnc(F)nc1[H] | [H][C][#C][C][=N][N][=C][Branch1][C][F][N][=C]... | 20 |
To convert SELFIES to a one-hot encoded representation, we need to construct an
alphabet
of all the characters that occur in the list of SELFIES strings. We also have to know what the longest SELFIES string is, so that all the shorter SELFIES can be padded with
'[nop]'
to be equal length.
selfies_list = np.asanyarray(data.selfies)
selfies_alphabet = sf.get_alphabet_from_selfies(selfies_list)
selfies_alphabet.add('[nop]') # Add the "no operation" symbol as a padding character
selfies_alphabet.add('.')
selfies_alphabet = list(sorted(selfies_alphabet))
largest_selfie_len = max(sf.len_selfies(s) for s in selfies_list)
symbol_to_int = dict((c, i) for i, c in enumerate(selfies_alphabet))
int_mol=keys_int(symbol_to_int)
selfies
has a handy utility function to translate SELFIES strings into one-hot encoded vectors.
onehots=sf.batch_selfies_to_flat_hot(selfies_list, symbol_to_int,largest_selfie_len)
Next, we "dequantize" the inputs by adding random noise from the interval
[0, 1)
to every input in the encodings. This allows the normalizing flow to operate on continuous inputs (rather than discrete), and the original inputs can easily be recovered by applying a floor function.
input_tensor = tf.convert_to_tensor(onehots, dtype='float64')
noise_tensor = tf.random.uniform(shape=input_tensor.shape, minval=0, maxval=1, dtype='float64')
dequantized_data = tf.add(input_tensor, noise_tensor)
The dequantized data is ready to be processed as a DeepChem dataset and split into training, validation, and test sets. We'll also keep track of the SMILES strings for the training set so we can compare the training data to our generated molecules later on.
ds = NumpyDataset(dequantized_data) # Create a DeepChem dataset
splitter = RandomSplitter()
train, val, test = splitter.train_valid_test_split(dataset=ds, seed=42)
train_idx, val_idx, test_idx = splitter.split(dataset=ds, seed=42)
dim = len(train.X[0]) # length of one-hot encoded vectors
train.X.shape # 2000 samples, N-dimensional one-hot vectors that represent molecules
(2000, 2596)
# SMILES strings of training data
train_smiles = data['smiles'].iloc[train_idx].values
Next we'll set up the normalizing flow model. The base distribution is a multivariate Normal distribution. The
permutation
layer permutes the dimensions of the input so that the normalizing flow layers will operate along multiple dimensions of the inputs. To understand why the permutation is needed, we need to know a bit about how the normalizing flow architecture works.
base_dist = tfd.MultivariateNormalDiag(loc=np.zeros(dim), scale_diag=np.ones(dim))
if dim % 2 == 0:
permutation = tf.cast(np.concatenate((np.arange(dim / 2, dim), np.arange(0, dim / 2))),
tf.int32)
else:
permutation = tf.cast(np.concatenate((np.arange(dim / 2 + 1, dim), np.arange(0, dim / 2))), tf.int32)
For this simple example, we'll set up a flow of repeating Masked Autoregressive Flow layers. The autoregressive property is enforced by using the Masked Autoencoder for Distribution Estimation architecture. The layers of the flow are a bijector, an invertible mapping between the base and target distributions.
MAF takes the inputs from the base distribution and transforms them with a simple scale-and-shift (affine) operation, but crucially the scale-and-shift for each dimension of the output depends on the previously generated dimensions of the output. That independence of future dimensions preserves the autoregressive property and ensures that the normalizing flow is invertible. Now we can see that we need permutations to change the ordering of the inputs, or else the normalizing flow would only transform certain dimensions of the inputs.
Batch Normalization layers can be added for additional stability in training, but may have strange effects on the outputs and require some input reshaping to work properly. Increasing
num_layers
and
hidden_units
can make more expressive flows capable of modeling more complex target distributions.
num_layers = 8
flow_layers = []
Made = tfb.AutoregressiveNetwork(params=2,
hidden_units=[512, 512], activation='relu')
for i in range(num_layers):
flow_layers.append(
(tfb.MaskedAutoregressiveFlow(shift_and_log_scale_fn=Made)
))
permutation = tf.cast(np.random.permutation(np.arange(0, dim)), tf.int32)
flow_layers.append(tfb.Permute(permutation=permutation))
# if (i + 1) % int(2) == 0:
# flow_layers.append(tfb.BatchNormalization())
We can draw samples from the untrained distribution, but for now they don't have any relation to the QM9 dataset distribution.
%%time
nf = NormalizingFlow(base_distribution=base_dist,
flow_layers=flow_layers)
CPU times: user 280 ms, sys: 10.2 ms, total: 290 ms Wall time: 289 ms
A
NormalizingFlowModel
takes a
NormalizingFlow
and any parameters used by
deepchem.models.KerasModel
.
nfm = NormalizingFlowModel(nf, learning_rate=1e-4, batch_size=128)
Now to train the model! We'll try to minimize the negative log likelihood loss, which measures the likelihood that generated samples are drawn from the target distribution, i.e. as we train the model, it should get better at modeling the target distribution and it will generate samples that look like molecules from the QM9 dataset.
losses = []
val_losses = []
%%time
max_epochs = 10 # maximum number of epochs of the training
for epoch in range(max_epochs):
loss = nfm.fit(train, nb_epoch=1, all_losses=losses)
val_loss = nfm.create_nll(val.X)
val_losses.append(val_loss.numpy())
WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). WARNING:tensorflow:Model was constructed with shape (None, 2596) for input KerasTensor(type_spec=TensorSpec(shape=(None, 2596), dtype=tf.float64, name='input_1'), name='input_1', description="created by layer 'input_1'"), but it was called on an input with incompatible shape (1, 128, 2596). CPU times: user 13min 40s, sys: 20.9 s, total: 14min 1s Wall time: 7min 27s
f, ax = plt.subplots()
ax.scatter(range(len(losses)), losses, label='train loss')
ax.scatter(range(len(val_losses)), val_losses, label='val loss')
plt.legend(loc='upper right');
The normalizing flow is learning a mapping between the multivariate Gaussian and the target distribution! We can see this by visualizing the loss on the validation set. We can now use
nfm.flow.sample()
to generate new QM9-like molecules and
nfm.flow.log_prob()
to evaluate the likelihood that a molecule was drawn from the underlying distribution.
generated_samples = nfm.flow.sample(10) # generative modeling
log_probs = nfm.flow.log_prob(generated_samples) # probability density estimation
Now we transform the generated samples back into SELFIES. We have to quantize the outputs and add padding characters to any one-hot encoding vector that has all zeros.
mols = tf.math.floor(generated_samples) # quantize data
mols = tf.clip_by_value(mols, 0, 1) # Set negative values to 0 and values > 1 to 1
mols_list = mols.numpy().tolist()
# Add padding characters if needed
for mol in mols_list:
for i in range(largest_selfie_len):
row = mol[len(selfies_alphabet) * i: len(selfies_alphabet) * (i + 1)]
if all(elem == 0 for elem in row):
mol[len(selfies_alphabet) * (i+1) - 1] = 1
selfies
has another utility function to translate one-hot encoded representations back to SELFIES strings.
mols=sf.batch_flat_hot_to_selfies(mols_list, int_mol)
We can use RDKit to find valid generated molecules. Some have unphysical valencies and should be discarded. If you've ever tried to generate valid SMILES strings, you'll notice right away that this model is doing much better than we would expect! Using SELFIES, 90% of the generated molecules are valid, even though our normalizing flow architecture doesn't know any rules that govern chemical validity.
from rdkit import RDLogger
from rdkit import Chem
RDLogger.DisableLog('rdApp.*') # suppress error messages
valid_count = 0
valid_selfies, invalid_selfies = [], []
for idx, selfies in enumerate(mols):
try:
if Chem.MolFromSmiles(sf.decoder(mols[idx]), sanitize=True) is not None:
valid_count += 1
valid_selfies.append(selfies)
else:
invalid_selfies.append(selfies)
except Exception:
pass
print('%.2f' % (valid_count / len(mols)), '% of generated samples are valid molecules.')
1.00 % of generated samples are valid molecules.
Let's take a look at some of the generated molecules! We'll borrow some helper functions from the Modeling Solubility tutorial to display molecules with RDKit.
gen_mols = [Chem.MolFromSmiles(sf.decoder(vs)) for vs in valid_selfies]
def display_images(filenames):
"""Helper to pretty-print images."""
for file in filenames:
display(Image(file))
def mols_to_pngs(mols, basename="generated_mol"):
"""Helper to write RDKit mols to png files."""
filenames = []
for i, mol in enumerate(mols):
filename = "%s%d.png" % (basename, i)
Draw.MolToFile(mol, filename)
filenames.append(filename)
return filenames
display_mols = []
for i in range(10):
display_mols.append(gen_mols[i])
display_images(mols_to_pngs(display_mols))
Finally, we can compare generated molecules with our training data via a similarity search with Tanimoto similarity. This gives an indication of how "original" the generated samples are, versus simply producing samples that are extremely similar to molecules the model has already seen. We have to keep in mind that QM9 contains all stable small molecules with up to 9 heavy atoms (CONF). So anything new we generate either exists in the full QM9 dataset, or else will not obey the charge neutrality and stability criteria used to generated QM9.
from rdkit.Chem.Fingerprints.FingerprintMols import FingerprintMol
from rdkit.DataStructs import FingerprintSimilarity
from IPython.display import display
def tanimoto_similarity(database_mols, query_mol):
"""Compare generated molecules to database by Tanimoto similarity."""
# convert Mol to datastructure type
fps = [FingerprintMol(m) for m in database_mols]
# set a query molecule to compare against database
query = FingerprintMol(query_mol)
similarities = []
# loop through to find Tanimoto similarity
for idx, f in enumerate(fps):
# tuple: (idx, similarity)
similarities.append((idx, FingerprintSimilarity(query, f)))
# sort sim using the similarities
similarities.sort(key=lambda x:x[1], reverse=True)
return similarities
We'll consider our generated molecules and look at the top 3 most similar molecules from the training data by Tanimoto similarity. Here's an example where the Tanimoto similarity scores are medium. There are molecules in our training set that are similar to our generated sample. This might be interesting, or it might mean that the generated molecule is unrealistic.
train_mols = [Chem.MolFromSmiles(smiles) for smiles in train_smiles]
# change the second argument to compare different generated molecules to QM9
tanimoto_scores = tanimoto_similarity(train_mols, gen_mols[3])
similar_mols = []
for idx, ts in tanimoto_scores[:3]:
print(round(ts, 3))
similar_mols.append(train_mols[idx])
display_images(mols_to_pngs(similar_mols, 'qm9_mol'))
0.521 0.471 0.468
Molecules of the previous tutorial: ¶
These molecules were obteined through sampling.
Comparing with the tanimoto similarity:
With scores of:
- 0.243
- 0.243
- 0.241
Further reading ¶
So far we have looked at a measure of validity and done a bit of investigation into the novelty of the generated compounds. There are more dimensions along which we can and should evaluate the performance of a generative model. For an example of some standard benchmarks, see the GuacaMol evaluation framework .
For more information about FastFlows look at this paper where the workflow is crearly explained.
For examples of normalizing flow-based molecular graph generation frameworks, check out the MoFlow , GraphAF , and GraphNVP papers.
Congratulations! Time to join the Community! ¶
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub ¶
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter ¶
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!