Basic Protein-Ligand Affinity Models

#Tutorial: Use machine learning to model protein-ligand affinity.

Written by Evan Feinberg and Bharath Ramsundar

Copyright 2016, Stanford University

This DeepChem tutorial demonstrates how to use mach.ine learning for modeling protein-ligand binding affinity


In this tutorial, you will trace an arc from loading a raw dataset to fitting a cutting edge ML technique for predicting binding affinities. This will be accomplished by writing simple commands to access the deepchem Python API, encompassing the following broad steps:

  1. Loading a chemical dataset, consisting of a series of protein-ligand complexes.
  2. Featurizing each protein-ligand complexes with various featurization schemes.
  3. Fitting a series of models with these featurized protein-ligand complexes.
  4. Visualizing the results.

First, let’s point to a “dataset” file. This can come in the format of a CSV file or Pandas DataFrame. Regardless of file format, it must be columnar data, where each row is a molecular system, and each column represents a different piece of information about that system. For instance, in this example, every row reflects a protein-ligand complex, and the following columns are present: a unique complex identifier; the SMILES string of the ligand; the binding affinity (Ki) of the ligand to the protein in the complex; a Python list of all lines in a PDB file for the protein alone; and a Python list of all lines in a ligand file for the ligand alone.

This should become clearer with the example. (Make sure to set DISPLAY = True)

%load_ext autoreload
%autoreload 2
%pdb off
# set DISPLAY = True when running tutorial
# set PARALLELIZE to true if you want to use ipyparallel
import warnings
Automatic pdb calling has been turned OFF
import deepchem as dc
from deepchem.utils import download_url

import os

data_dir = os.path.join(dc.utils.get_data_dir())
dataset_file= os.path.join(dc.utils.get_data_dir(), "pdbbind_core_df.csv.gz")
raw_dataset =

Let’s see what dataset looks like:

print("Type of dataset is: %s" % str(type(raw_dataset)))
print("Shape of dataset is: %s" % str(raw_dataset.shape))
Type of dataset is: <class 'pandas.core.frame.DataFrame'>
  pdb_id                                             smiles  0   2d3u        CC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O
1   3cyx  CC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1CCCCC...
2   3uo4        OC(O)C1CCC(NC2NCCC(NC3CCCCC3C3CCCCC3)N2)CC1
3   1p1q                         CC1ONC(O)C1CC([NH3+])C(O)O
4   3ag9  NC(O)C(CCC[NH2+]C([NH3+])[NH3+])NC(O)C(CCC[NH2...

                                          complex_id  0    2d3uCC1CCCCC1S(O)(O)NC1CC(C2CCC(CN)CC2)SC1C(O)O
1  3cyxCC(C)(C)NC(O)C1CC2CCCCC2C[NH+]1CC(O)C(CC1C...
3                     1p1qCC1ONC(O)C1CC([NH3+])C(O)O
4  3ag9NC(O)C(CCC[NH2+]C([NH3+])[NH3+])NC(O)C(CCC...

                                         protein_pdb  0  ['HEADER    2D3U PROTEINn', 'COMPND    2D3U P...
2  ['HEADER    3UO4 PROTEINn', 'COMPND    3UO4 P...
3  ['HEADER    1P1Q PROTEINn', 'COMPND    1P1Q P...
4  ['HEADER    3AG9 PROTEINn', 'COMPND    3AG9 P...

                                          ligand_pdb  0  ['COMPND    2d3u ligand n', 'AUTHOR    GENERA...
1  ['COMPND    3cyx ligand n', 'AUTHOR    GENERA...
2  ['COMPND    3uo4 ligand n', 'AUTHOR    GENERA...
3  ['COMPND    1p1q ligand n', 'AUTHOR    GENERA...
4  ['COMPND    3ag9 ligand n', 'AUTHOR    GENERA...

                                         ligand_mol2  label
0  ['### n', '### Created by X-TOOL on Thu Aug 2...   6.92
1  ['### n', '### Created by X-TOOL on Thu Aug 2...   8.00
2  ['### n', '### Created by X-TOOL on Fri Aug 2...   6.52
3  ['### n', '### Created by X-TOOL on Thu Aug 2...   4.89
4  ['### n', '### Created by X-TOOL on Thu Aug 2...   8.05
Shape of dataset is: (193, 7)

One of the missions of deepchem is to form a synapse between the chemical and the algorithmic worlds: to be able to leverage the powerful and diverse array of tools available in Python to analyze molecules. This ethos applies to visual as much as quantitative examination:

import nglview
import tempfile
import os
import mdtraj as md
import numpy as np
import deepchem.utils.visualization
#from deepchem.utils.visualization import combine_mdtraj, visualize_complex, convert_lines_to_mdtraj

def combine_mdtraj(protein, ligand):
  chain = protein.topology.add_chain()
  residue = protein.topology.add_residue("LIG", chain, resSeq=1)
  for atom in ligand.topology.atoms:
      protein.topology.add_atom(, atom.element, residue) = np.hstack([,])
  return protein

def visualize_complex(complex_mdtraj):
  ligand_atoms = [a.index for a in complex_mdtraj.topology.atoms if "LIG" in str(a.residue)]
  binding_pocket_atoms = md.compute_neighbors(complex_mdtraj, 0.5, ligand_atoms)[0]
  binding_pocket_residues = list(set([complex_mdtraj.topology.atom(a).residue.resSeq for a in binding_pocket_atoms]))
  binding_pocket_residues = [str(r) for r in binding_pocket_residues]
  binding_pocket_residues = " or ".join(binding_pocket_residues)

  traj = nglview.MDTrajTrajectory( complex_mdtraj ) # load file from RCSB PDB
  ngltraj = nglview.NGLWidget( traj )
  ngltraj.representations = [
  { "type": "cartoon", "params": {
  "sele": "protein", "color": "residueindex"
  } },
  { "type": "licorice", "params": {
  "sele": "(not hydrogen) and (%s)" %  binding_pocket_residues
  } },
  { "type": "ball+stick", "params": {
  "sele": "LIG"
  } }
  return ngltraj

def visualize_ligand(ligand_mdtraj):
  traj = nglview.MDTrajTrajectory( ligand_mdtraj ) # load file from RCSB PDB
  ngltraj = nglview.NGLWidget( traj )
  ngltraj.representations = [
    { "type": "ball+stick", "params": {"sele": "all" } } ]
  return ngltraj

def convert_lines_to_mdtraj(molecule_lines):
  molecule_lines = molecule_lines.strip('[').strip(']').replace("'","").replace("\\n", "").split(", ")
  tempdir = tempfile.mkdtemp()
  molecule_file = os.path.join(tempdir, "molecule.pdb")
  with open(molecule_file, "w") as f:
    for line in molecule_lines:
        f.write("%s\n" % line)
  molecule_mdtraj = md.load(molecule_file)
  return molecule_mdtraj

first_protein, first_ligand = raw_dataset.iloc[0]["protein_pdb"], raw_dataset.iloc[0]["ligand_pdb"]
protein_mdtraj = convert_lines_to_mdtraj(first_protein)
ligand_mdtraj = convert_lines_to_mdtraj(first_ligand)
complex_mdtraj = combine_mdtraj(protein_mdtraj, ligand_mdtraj)
ngltraj = visualize_complex(complex_mdtraj)
A Jupyter Widget

Now that we’re oriented, let’s use ML to do some chemistry.

So, step (2) will entail featurizing the dataset.

The available featurizations that come standard with deepchem are ECFP4 fingerprints, RDKit descriptors, NNScore-style bdescriptors, and hybrid binding pocket descriptors. Details can be found on

grid_featurizer = dc.feat.RdkitGridFeaturizer(
    voxel_width=16.0, feature_types="voxel_combined",
    voxel_feature_types=["ecfp", "splif", "hbond", "pi_stack", "cation_pi", "salt_bridge"],
    ecfp_power=5, splif_power=5, parallel=True, flatten=True)
compound_featurizer = dc.feat.CircularFingerprint(size=128)

Note how we separate our featurizers into those that featurize individual chemical compounds, compound_featurizers, and those that featurize molecular complexes, complex_featurizers.

Now, let’s perform the actual featurization. Calling loader.featurize() will return an instance of class Dataset. Internally, loader.featurize() (a) computes the specified features on the data, (b) transforms the inputs into X and y NumPy arrays suitable for ML algorithms, and (c) constructs a Dataset() instance that has useful methods, such as an iterator, over the featurized data. This is a little complicated, so we will use MoleculeNet to featurize the PDBBind core set for us.

PDBBIND_tasks, (train_dataset, valid_dataset, test_dataset), transformers = dc.molnet.load_pdbbind_grid()
Loading dataset from disk.
TIMING: dataset construction took 0.020 s
Loading dataset from disk.
TIMING: dataset construction took 0.008 s
Loading dataset from disk.
TIMING: dataset construction took 0.008 s
Loading dataset from disk.

Now, we conduct a train-test split. If you’d like, you can choose splittype="scaffold" instead to perform a train-test split based on Bemis-Murcko scaffolds.

We generate separate instances of the Dataset() object to hermetically seal the train dataset from the test dataset. This style lends itself easily to validation-set type hyperparameter searches, which we will illustate in a separate section of this tutorial.

The performance of many ML algorithms hinges greatly on careful data preprocessing. Deepchem comes standard with a few options for such preprocessing.

Now, we’re ready to do some learning!

To fit a deepchem model, first we instantiate one of the provided (or user-written) model classes. In this case, we have a created a convenience class to wrap around any ML model available in Sci-Kit Learn that can in turn be used to interoperate with deepchem. To instantiate an SklearnModel, you will need (a) task_types, (b) model_params, another dict as illustrated below, and (c) a model_instance defining the type of model you would like to fit, in this case a RandomForestRegressor.

from sklearn.ensemble import RandomForestRegressor

sklearn_model = RandomForestRegressor(n_estimators=100)
model = dc.models.SklearnModel(sklearn_model)
from deepchem.utils.evaluate import Evaluator
import pandas as pd

metric = dc.metrics.Metric(dc.metrics.r2_score)

evaluator = Evaluator(model, train_dataset, transformers)
train_r2score = evaluator.compute_model_performance([metric])
print("RF Train set R^2 %f" % (train_r2score["r2_score"]))

evaluator = Evaluator(model, valid_dataset, transformers)
valid_r2score = evaluator.compute_model_performance([metric])
print("RF Valid set R^2 %f" % (valid_r2score["r2_score"]))
computed_metrics: [0.88553937155927021]
RF Train set R^2 0.885539
computed_metrics: [0.31742980243495422]
RF Valid set R^2 0.317430

In this simple example, in few yet intuitive lines of code, we traced the machine learning arc from featurizing a raw dataset to fitting and evaluating a model.

Here, we featurized only the ligand. The signal we observed in R^2 reflects the ability of circular fingerprints and random forests to learn general features that make ligands “drug-like.”

predictions = model.predict(test_dataset)
[ 5.2497  5.0708  6.2361  7.1606  5.1387  6.6723  6.7259  6.2996  6.5986
  5.5831  6.7202  6.6823  6.4065  6.1988  6.8872  6.0689  6.7044  7.2988
  6.6718  6.7882]
# TODO(rbharath): This cell visualizes the ligand with highest predicted activity. Commenting it out for now. Fix this later
#from deepchem.utils.visualization import visualize_ligand

#top_ligand = predictions.iloc[0]['ids']
#ligand1 = convert_lines_to_mdtraj(dataset.loc[dataset['complex_id']==top_ligand]['ligand_pdb'].values[0])
#    ngltraj = visualize_ligand(ligand1)
#    ngltraj
# TODO(rbharath): This cell visualizes the ligand with lowest predicted activity. Commenting it out for now. Fix this later
#worst_ligand = predictions.iloc[predictions.shape[0]-2]['ids']
#ligand1 = convert_lines_to_mdtraj(dataset.loc[dataset['complex_id']==worst_ligand]['ligand_pdb'].values[0])
#    ngltraj = visualize_ligand(ligand1)
#    ngltraj

The protein-ligand complex view.

The preceding simple example, in few yet intuitive lines of code, traces the machine learning arc from featurizing a raw dataset to fitting and evaluating a model.

In this next section, we illustrate deepchem’s modularity, and thereby the ease with which one can explore different featurization schemes, different models, and combinations thereof, to achieve the best performance on a given dataset. We will demonstrate this by examining protein-ligand interactions.

In the previous section, we featurized only the ligand. The signal we observed in R^2 reflects the ability of grid fingerprints and random forests to learn general features that make ligands “drug-like.” In this section, we demonstrate how to use hyperparameter searching to find a higher scoring ligands.

def rf_model_builder(model_params, model_dir):
  sklearn_model = RandomForestRegressor(**model_params)
  return dc.models.SklearnModel(sklearn_model, model_dir)

params_dict = {
    "n_estimators": [10, 50, 100],
    "max_features": ["auto", "sqrt", "log2", None],

metric = dc.metrics.Metric(dc.metrics.r2_score)
optimizer = dc.hyper.HyperparamOpt(rf_model_builder)
best_rf, best_rf_hyperparams, all_rf_results = optimizer.hyperparam_search(
    params_dict, train_dataset, valid_dataset, transformers,
Fitting model 1/12
hyperparameters: {'n_estimators': 10, 'max_features': 'auto'}
computed_metrics: [0.25804004860413576]
Model 1/12, Metric r2_score, Validation set 0: 0.258040
    best_validation_score so far: 0.258040
Fitting model 2/12
hyperparameters: {'n_estimators': 10, 'max_features': 'sqrt'}
computed_metrics: [0.31894454238841463]
Model 2/12, Metric r2_score, Validation set 1: 0.318945
    best_validation_score so far: 0.318945
Fitting model 3/12
hyperparameters: {'n_estimators': 10, 'max_features': 'log2'}
computed_metrics: [0.40151233479373227]
Model 3/12, Metric r2_score, Validation set 2: 0.401512
    best_validation_score so far: 0.401512
Fitting model 4/12
hyperparameters: {'n_estimators': 10, 'max_features': None}
computed_metrics: [0.2379004574282495]
Model 4/12, Metric r2_score, Validation set 3: 0.237900
    best_validation_score so far: 0.401512
Fitting model 5/12
hyperparameters: {'n_estimators': 50, 'max_features': 'auto'}
computed_metrics: [0.31407341153260904]
Model 5/12, Metric r2_score, Validation set 4: 0.314073
    best_validation_score so far: 0.401512
Fitting model 6/12
hyperparameters: {'n_estimators': 50, 'max_features': 'sqrt'}
computed_metrics: [0.30561076245035912]
Model 6/12, Metric r2_score, Validation set 5: 0.305611
    best_validation_score so far: 0.401512
Fitting model 7/12
hyperparameters: {'n_estimators': 50, 'max_features': 'log2'}
computed_metrics: [0.15265818454692826]
Model 7/12, Metric r2_score, Validation set 6: 0.152658
    best_validation_score so far: 0.401512
Fitting model 8/12
hyperparameters: {'n_estimators': 50, 'max_features': None}
computed_metrics: [0.3023042218537606]
Model 8/12, Metric r2_score, Validation set 7: 0.302304
    best_validation_score so far: 0.401512
Fitting model 9/12
hyperparameters: {'n_estimators': 100, 'max_features': 'auto'}
computed_metrics: [0.33459631598792483]
Model 9/12, Metric r2_score, Validation set 8: 0.334596
    best_validation_score so far: 0.401512
Fitting model 10/12
hyperparameters: {'n_estimators': 100, 'max_features': 'sqrt'}
computed_metrics: [0.31318915098968536]
Model 10/12, Metric r2_score, Validation set 9: 0.313189
    best_validation_score so far: 0.401512
Fitting model 11/12
hyperparameters: {'n_estimators': 100, 'max_features': 'log2'}
computed_metrics: [0.21173765911803388]
Model 11/12, Metric r2_score, Validation set 10: 0.211738
    best_validation_score so far: 0.401512
Fitting model 12/12
hyperparameters: {'n_estimators': 100, 'max_features': None}
computed_metrics: [0.35420481043072938]
Model 12/12, Metric r2_score, Validation set 11: 0.354205
    best_validation_score so far: 0.401512
computed_metrics: [0.83300249084895006]
Best hyperparameters: (10, 'log2')
train_score: 0.833002
validation_score: 0.401512
%matplotlib inline

import matplotlib
import numpy as np
import matplotlib.pyplot as plt

rf_predicted_test = best_rf.predict(test_dataset)
rf_true_test = test_dataset.y
plt.scatter(rf_predicted_test, rf_true_test)
plt.xlabel('Predicted pIC50s')
plt.ylabel('True IC50')
plt.title(r'RF predicted IC50 vs. True pIC50')
plt.xlim([2, 11])
plt.ylim([2, 11])
plt.plot([2, 11], [2, 11], color='k')