Tutorials

1. Introduction To Deepchem

1. The Basic Tools of the Deep Life Sciences

2. Working With Datasets

3. An Introduction To MoleculeNet

4. Molecular Fingerprints

5. Creating Models with TensorFlow and PyTorch

6. Introduction to Graph Convolutions

7. Going Deeper on Molecular Featurizations

8. Working With Splitters

9. Advanced Model Training

10. Creating a high fidelity model from experimental data

11. Putting Multitask Learning to Work

12. Modeling Protein Ligand Interactions

13. Modeling Protein Ligand Interactions With Atomic Convolutions

14. Conditional Generative Adversarial Networks

15. Training a Generative Adversarial Network on MNIST

16. Advanced model training using hyperopt

17. Introduction to Gaussian Processes

18. PytorchLightning Integration

19. Compiling DeepChem Torch Models

2. Molecular Machine Learning

1. Molecular Fingerprints

2. Going Deeper on Molecular Featurizations

3. Learning Unsupervised Embeddings for Molecules

4. Synthetic Feasibility Scoring

5. Atomic Contributions for Molecules

6. Interactive Model Evaluation with Trident Chemwidgets

7. Transfer Learning With ChemBERTa Transformers

8. Training a Normalizing Flow on QM9

9. Large Scale Chemical Screens

10. Introduction to Molecular Attention Transformer

11. Generating molecules with MolGAN

12. Introduction to GROVER

13. Introduction to PROTACs

14. Druggablity Assessment with Fpocket and Machine Learning

3. Modeling Proteins

1. Protein Deep Learning

2. DeepChem AntibodyTutorial Simplified

3. Protein Structure Prediction with ESMFold

4. Introduction to ProtBERT

5. ProteinLM Tutorial0

4. Protein Ligand Modeling

1. Introduction to Binding Sites

2. Modeling Protein Ligand Interactions

3. Modeling Protein Ligand Interactions With Atomic Convolutions

4. DeepChemXAlphafold

5. UniProt Data Preprocessing for Binding Sites

5. Quantum Chemistry

1. Exploring Quantum Chemistry with GDB1k

2. DeepQMC tutorial

3. Training an Exchange Correlation Functional using Deepchem

6. Bioinformatics

1. Introduction to Bioinformatics

2. Multisequence Alignments

3. Scanpy

4. Deep probabilistic analysis of single-cell omics data

5. Cell Counting Tutorial

7. Material Science

1. Introduction To Material Science

8. Machine Learning Methods

1. Using Reinforcement Learning to Play Pong

2. Introduction to Model Interpretability

3. Uncertainty In Deep Learning

9. Deep Differential Equations

1. Physics Informed Neural Networks

2. Introducing JaxModel and PINNModel

3. About nODE Using Torchdiffeq in Deepchem

4. Differentiation Infrastructure in Deepchem

5. Ordinary Differential Equation Solving using deepchem

10. Equivariance

1. Introduction to Equivariance

11. Olfaction

1. Predict Multi Label Odor Descriptors using OpenPOM

12. Polymer Science

1. An Introduction to the Polymers and Their Representation

2. crystallization tendency regression

3. Introduction to Polymer SMILES

4. Understanding Weighted Directed Graphs for Polymer Implimentations

5. PolyBERT Introduction to Chemical Language Models for Fingerprint Generation

Applications of DeepChem with Alphafold: Docking and protein-ligand interaction from protein sequence ¶

Author: Sriphani Vardhan Bellamkonda

In this tutorial, we explore a potential use case where we combine the capabilities of AlphaFold and DeepChem.

Alphafold2 has made immense strides in predicting protein structure folding without the use of costly lab equipment and DeepChem comprises a repertoire of easy to use modules which can then be applied on these protein structures for further analysis. In the first part of our tutorial we will predict the protein structure when given a protein sequence. Then, in the second part of our tutorial, we sample a few ligands from the protein-complex dataset(PDBbind) and perform programmatic docking to estimate binding affinities between our protein and a number of ligands.

This tutorial is meant to be run in google colab. You can follow the below link to open this notebook in colab.

Setup ¶

We start off with all the installations and configurations. If you would like to directly skip reading this part, you can head over to the Input Query Section below.

We will first be installing deepchem as a runtime restart might be required. Along with that lets also install condacolab, vina and pdbfixer which will be used in the later parts of this tutorial.

In [ ]:

!pip install -q condacolab
import condacolab
condacolab.install()
!conda install -c conda-forge openmm -y
!mkdir vina_test

!conda install -c conda-forge pdbfixer
!conda install -c conda-forge vina
!pip install deepchem
!pip install -q mdtraj nglview

Restart may be prompted

Part 1: Predict Protein Structure in pdb format from a given input sequence ¶

Note: The cells until part 2 of this tutorial are directly from the Colabfold's google colab implementation and have been further annotated in this tutorial.

Alphafold Colabfold Openfold .....so many folds! ¶

In 2020, a major breakthrough in protein structure prediction occurred when AlphaFold2, a newer version of AlphaFold, vastly outperformed other models in the CASP (Critical Assessment of Protein Structure Prediction) competition. After this breakthrough event, several organizations worked ahead of AlphaFold2 and provided slightly different solutions for the protein folding problem. For example, ColabFold provides faster protein structure modeling because it uses a separate server for the sequence alignment step of AlphaFold2, and OpenFold provides a few enhancements, such as faster inference and custom CUDA attention kernels, while being implemented in PyTorch. Apart from these, there are other protein folding models, such as AlphaFold-multimer, OmegaFold, and RosettaFold, designed for different scenarios. Due to its simplicity, we will be using Colabfold in our tutorial.

ColabFold v1.5.3-patch: AlphaFold2 using MMseqs2 ¶

ColabFold is an easy to use protien structure and complex prediction model which uses Alphafold2 and Alphafold2-multimer for sequence to structure prediction of single and multiple protein chains. It also abstracts out the Sequence Alignment/Template Generation step or MSA, Multiple Sequence Alignemnt, through MMSeqs2 and HHsearch. MSA step is explained in more detail in the cells below.

For more details, checkout the ColabFold GitHub and read the below manuscript.

Mirdita M, Schütze K, Moriwaki Y, Heo L, Ovchinnikov S, Steinegger M. ColabFold: Making protein folding accessible to all. Nature Methods , 2022

We then create a job-name folder and store all the sequences as queries in a text file which we will reference later to call model inference on. The way that Alphafold is structured is that it interacts with a job directory to process the inputs, templates, and save the outpus. Hence, it is important to create a unique job folder.

In [ ]:

#@title Input protein sequence(s), then hit `Runtime` -> `Run all`
from google.colab import files
import os
import re
import hashlib
import random

from sys import version_info
python_version = f"{version_info.major}.{version_info.minor}"

# query_sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK' #@param {type:"string"}
query_sequence='#'
#@markdown  - Use `:` to specify inter-protein chainbreaks for **modeling complexes** (supports homo- and hetro-oligomers). For example **PI...SK:PI...SK** for a homodimer
jobname = 'test' #@param {type:"string"}
# number of models to use
num_relax = 0 #@param [0, 1, 5] {type:"raw"}
#@markdown - specify how many of the top ranked structures to relax using amber

use_amber = num_relax > 0

Use the below parameter to set template configs.

AlphaFold uses templates from a vast protein structure database to guide its predictions. It aligns the target protein's sequence with similar sequences from the database to generate structural constraints, aiding in the accurate prediction of the protein's 3D structure.

none = no template information is used. In this case the sequence alignment step is skipped and prediction is made without it.

pdb100 = templates are detected from the pdb100 database. pdb100 is a subset of the Protein Data Bank (PDB) containing 100 diverse protein structures used for training and validation in structural biology and bioinformatics research.

custom = the user has the option to upload their own templates database to search on (PDB or mmCIF format)

In [ ]:

template_mode = "none" #@param ["none", "pdb100","custom"]

Lets specify a hashing function which we will use to create a job name. Hashing is mainly used to reduce the length of the folder name in order to have reasonable and unique jobname everytime we run the notebook.

In [ ]:

def add_hash(x,y):
  return x+"_"+hashlib.sha1(y.encode()).hexdigest()[:5]

# remove whitespaces
query_sequence = "".join(query_sequence.split())

basejobname = "".join(jobname.split())
basejobname = re.sub(r'\W+', '', basejobname)
jobname = add_hash(basejobname, query_sequence)

We then create the directory where we will store our protein seqeunce (query) and respective files which will be generated in the later parts of this tutorial. We also define a check function below which prevents us from creating duplicate directories.

In [ ]:

# check if directory with jobname exists
def check(folder):
  if os.path.exists(folder):
    return False
  else:
    return True
if not check(jobname):
  n = 0
  while not check(f"{jobname}_{n}"): n += 1
  jobname = f"{jobname}_{n}"

# make directory to save results
os.makedirs(jobname, exist_ok=True)

Based on the template_mode specified earlier, we automatically adjust a few parameters such as use_templates and custom_template_path. If the template_mode is custom, we will create an additional directory for it.

In [ ]:

if template_mode == "pdb100":
  use_templates = True
  custom_template_path = None
elif template_mode == "custom":
  custom_template_path = os.path.join(jobname,f"template")
  os.makedirs(custom_template_path, exist_ok=True)
  uploaded = files.upload()
  use_templates = True
  for fn in uploaded.keys():
    os.rename(fn,os.path.join(custom_template_path,fn))
else:
  custom_template_path = None
  use_templates = False

Install dependencies ¶

Based on the parameters mentioned above we will respectively install the dependencies with the code below.

First we have to install the latest version of ColabFold from their github repo. After a successful installation, a file called COLABFOLD_READY will be created to mark its completion.

In [ ]:

%%time
import os
USE_AMBER = use_amber
USE_TEMPLATES = use_templates
PYTHON_VERSION = python_version

if not os.path.isfile("COLABFOLD_READY"):
  print("installing colabfold...")
  os.system("pip install -q --no-warn-conflicts 'colabfold[alphafold-minus-jax] @ git+https://github.com/sokrypton/ColabFold'")
  os.system("pip install --upgrade dm-haiku")
  os.system("ln -s /usr/local/lib/python3.*/dist-packages/colabfold colabfold")
  os.system("ln -s /usr/local/lib/python3.*/dist-packages/alphafold alphafold")
  # patch for jax > 0.3.25
  os.system("sed -i 's/weights = jax.nn.softmax(logits)/logits=jnp.clip(logits,-1e8,1e8);weights=jax.nn.softmax(logits)/g' alphafold/model/modules.py")
  os.system("touch COLABFOLD_READY")

CPU times: user 0 ns, sys: 35 µs, total: 35 µs
Wall time: 40.1 µs

Next, if we need amber relaxation or protein templates, we will install mamba which will help us in installing further packages

In [ ]:

if USE_AMBER or USE_TEMPLATES:
  if not os.path.isfile("CONDA_READY"):
    print("installing conda...")
    os.system("wget -qnc https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh")
    os.system("bash Mambaforge-Linux-x86_64.sh -bfp /usr/local")
    os.system("mamba config --set auto_update_conda false")
    os.system("touch CONDA_READY")

Then, we will install hhsuite for db-search/tempate-retireval and openmm for amber relaxation.

In [ ]:

if USE_TEMPLATES and not os.path.isfile("HH_READY") and USE_AMBER and not os.path.isfile("AMBER_READY"):
  print("installing hhsuite and amber...")
  os.system(f"mamba install -y -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0 openmm=7.7.0 python='{PYTHON_VERSION}' pdbfixer")
  os.system("touch HH_READY")
  os.system("touch AMBER_READY")
else:
  if USE_TEMPLATES and not os.path.isfile("HH_READY"):
    print("installing hhsuite...")
    os.system(f"mamba install -y -c conda-forge -c bioconda kalign2=2.04 hhsuite=3.3.0 python='{PYTHON_VERSION}'")
    os.system("touch HH_READY")
  if USE_AMBER and not os.path.isfile("AMBER_READY"):
    print("installing amber...")
    os.system(f"mamba install -y -c conda-forge openmm=7.7.0 python='{PYTHON_VERSION}' pdbfixer")
    os.system("touch AMBER_READY")

A little about the softwares we are installing:

HHsuite is a widely used open source software suite for sensitive sequence similarity searches and protein fold recognition.

Openmm is a high-performance toolkit for molecular simulation. In our tutorial this tooklkit helps us simulate amber force field which is used to "relax" the position of atoms with respect to each other in order to remove any definite obstructions between them. This helps us to better model protein folding in edge cases and offers better refinement of the protein structures.

Now lets specifiy the different msa options available.

MSA ¶

Multi Sequence Alignment is a process in Alphafold which aligns multiple sequences of amino acids from different sources which have a similar sequence. In this step, Alphafold2 forms a grid aligning identical amino acids in the same columns and leaving gaps where there are differences. We have the option of how we would like to pair the respective MSA with the options: "unpaired_paired" to pair sequences from same species and an unpaired MSA, "unpaired" to seperate MSA for each chain, and "paired" to only use paired sequences. Additionally, we have multiple options in the way Alphafold searches for the respective sequences and few of them are mentioned below.

MMSEQ2 : This is a sequence searching tool which finds similar sequences to our input sequence from a large database.

Single_sequence : This option restricts Alphafold from searching for any similar amino acid sequences and restricts it utilize the only given one.

Custom : This options lets Alphafold do the sequence search from user defined sequence search space.

In [ ]:

#@markdown ### MSA options (custom MSA upload, single sequence, pairing mode)
msa_mode = "mmseqs2_uniref_env" #@param ["mmseqs2_uniref_env", "mmseqs2_uniref","single_sequence","custom"]
pair_mode = "unpaired_paired" #@param ["unpaired_paired","paired","unpaired"] {type:"string"}
#@markdown - "unpaired_paired" = pair sequences from same species + unpaired MSA, "unpaired" = seperate MSA for each chain, "paired" - only use paired sequences.

So based on the above msa parameters we will set the path to the A3M file. An A3M file (Alignment to Multiple Models) is a type of input file used in the protein structure prediction process which contains multiple sequence alignments (MSAs) of related protein sequences that are used as input data for the AlphaFold model. A3M file format is an extentions to FASTA file format and can be read about over at FASTA format extension .

Additionally, for the purpose of this tutorial we don't need to get into the details of custom MSA(where user inputs their own template databse for search).

In [ ]:

# decide which a3m to use
if "mmseqs2" in msa_mode:
  a3m_file = os.path.join(jobname,f"{jobname}.a3m")

Advanced settings ¶

Below we can specify more advanced settings with Alphafold. We can choose which model parameters to use from the options given below i.e alphafold2, alphafold2_multimer_v1 etc.., recycle early stop tolerence, saving to google drive, and image resolution options. Also note that there is no need to fully understand the parameters and just stick to the defaults. But here are a few details about the parameters which can be changed.

model_type : If auto selected, will use alphafold2_ptm for monomer prediction and alphafold2_multimer_v3 for complex prediction. Any of the mode_types can be used (regardless if input is monomer or complex).

num_recycles : "auto" with other options available as ["auto", "0", "1", "3", "6", "12", "24", "48"]

recycle_early_stop_tolerance : "auto" with other options available as ["auto", "0.0", "0.5", "1.0"]

By "recycling" Alphafold refers to the process of iterative refinement of protein structure prediction to imporve accuracy.

Pairing strategy : "greedy" or "complete"

greedy = pair any taxonomically matching subsets
complete = all sequences have to match in one line.

In [ ]:

#@markdown ### Advanced settings
model_type = "auto" #@param ["auto", "alphafold2_ptm", "alphafold2_multimer_v1", "alphafold2_multimer_v2", "alphafold2_multimer_v3"]
#@markdown - if `auto` selected, will use `alphafold2_ptm` for monomer prediction and `alphafold2_multimer_v3` for complex prediction.
#@markdown Any of the mode_types can be used (regardless if input is monomer or complex).
num_recycles = "3" #@param ["auto", "0", "1", "3", "6", "12", "24", "48"]
#@markdown - if `auto` selected, will use `num_recycles=20` if `model_type=alphafold2_multimer_v3`, else `num_recycles=3` .
recycle_early_stop_tolerance = "auto" #@param ["auto", "0.0", "0.5", "1.0"]
#@markdown - if `auto` selected, will use `tol=0.5` if `model_type=alphafold2_multimer_v3` else `tol=0.0`.
relax_max_iterations = 200 #@param [0, 200, 2000] {type:"raw"}
#@markdown - max amber relax iterations, `0` = unlimited (AlphaFold2 default, can take very long)
pairing_strategy = "greedy" #@param ["greedy", "complete"] {type:"string"}
#@markdown - `greedy` = pair any taxonomically matching subsets, `complete` = all sequences have to match in one line.

We can also set the maximum length of Multiple Sequence Alignment, Number of Seeds, and whether to use dropout or not.

max_msa = "auto" with other optinos as ["auto", "512:1024", "256:512", "64:128", "32:64", "16:32"]. Here left is the minimum and right is the maximum msa length.

num_seeds = 1 with other options as [1,2,4,8,16] {type:"raw"}

use_dropout = False other option True

In [ ]:

#@markdown #### Sample settings
#@markdown -  enable dropouts and increase number of seeds to sample predictions from uncertainty of the model.
#@markdown -  decrease `max_msa` to increase uncertainity
max_msa = "auto" #@param ["auto", "512:1024", "256:512", "64:128", "32:64", "16:32"]
num_seeds = 1 #@param [1,2,4,8,16] {type:"raw"}
use_dropout = False #@param {type:"boolean"}

num_recycles = None if num_recycles == "auto" else int(num_recycles)
recycle_early_stop_tolerance = None if recycle_early_stop_tolerance == "auto" else float(recycle_early_stop_tolerance)
if max_msa == "auto": max_msa = None

Save to Google Drive ¶

We define a simple save_to_google_drive function which authenticates into our respective google drive and allows to save our files.

In [ ]:

#@markdown #### Save settings
save_all = False #@param {type:"boolean"}
save_recycles = False #@param {type:"boolean"}
save_to_google_drive = False #@param {type:"boolean"}
#@markdown -  if the save_to_google_drive option was selected, the result zip will be uploaded to your Google Drive
dpi = 200 #@param {type:"integer"}
#@markdown - set dpi for image resolution

if save_to_google_drive:
  from pydrive.drive import GoogleDrive
  from pydrive.auth import GoogleAuth
  from google.colab import auth
  from oauth2client.client import GoogleCredentials
  auth.authenticate_user()
  gauth = GoogleAuth()
  gauth.credentials = GoogleCredentials.get_application_default()
  drive = GoogleDrive(gauth)
  print("You are logged into Google Drive and are good to go!")

Now we will run the prediction model using all the inputs and specifications from above.

We import the respective files from Colabfold's package for inference and plotting, and we then check if we have a specific GPU. We also define 2 helper functions: input_features_callback, and prediction_callback, which help us to visualize the respective input featues and prediction results.

Lets import utility, batch and plotting functions from colabfold!

In [ ]:

#@title Run Prediction
display_images = True #@param {type:"boolean"}

import sys
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
from Bio import BiopythonDeprecationWarning
warnings.simplefilter(action='ignore', category=BiopythonDeprecationWarning)
from pathlib import Path
from colabfold.download import download_alphafold_params, default_data_dir
from colabfold.utils import setup_logging
from colabfold.batch import get_queries, run, set_model_type
from colabfold.plot import plot_msa_v2

import os
import numpy as np

Next we check if K80 GPU is available.

In [ ]:

try:
  K80_chk = os.popen('nvidia-smi | grep "Tesla K80" | wc -l').read()
except:
  K80_chk = "0"
  pass
if "1" in K80_chk:
  print("WARNING: found GPU Tesla K80: limited to total length < 1000")
  if "TF_FORCE_UNIFIED_MEMORY" in os.environ:
    del os.environ["TF_FORCE_UNIFIED_MEMORY"]
  if "XLA_PYTHON_CLIENT_MEM_FRACTION" in os.environ:
    del os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"]

Now, let's define some helper functions to visualize input features and output preduction. ¶

In [ ]:

from colabfold.colabfold import plot_protein
from pathlib import Path
import matplotlib.pyplot as plt

# For some reason we need that to get pdbfixer to import
if use_amber and f"/usr/local/lib/python{python_version}/site-packages/" not in sys.path:
    sys.path.insert(0, f"/usr/local/lib/python{python_version}/site-packages/")

def input_features_callback(input_features):
  if display_images:
    plot_msa_v2(input_features)
    plt.show()
    plt.close()

def prediction_callback(protein_obj, length,
                        prediction_result, input_features, mode):
  model_name, relaxed = mode
  if not relaxed:
    if display_images:
      fig = plot_protein(protein_obj, Ls=length, dpi=150)
      plt.show()
      plt.close()

Let's define our logging environment and retrieve the input queries from our job folder. ¶

In [ ]:

result_dir = jobname
log_filename = os.path.join(jobname,"log.txt")
setup_logging(Path(log_filename))

Input Query Sequence ¶

Lets specify our input for Sequence to Structure prediction. We should specify the input sequence (seperated by ':' for specifying inter-protein chain breaks to model complexes) and whether we would like to use amber relaxation or not.

In [ ]:

query_sequence = 'PIAQIHILEGRSDEQKETLIREVSEAISRSLDAPLTSVRVIITEMAKGHFGIGGELASK' #@param {type:"string"}

We then store our query_sequence in a csv inside the jobname director. This facilates multiple query execution and provides input in the format expected by ColabFold.

In [ ]:

# save queries
queries_path = os.path.join(jobname, f"{jobname}.csv")
with open(queries_path, "w") as text_file:
  text_file.write(f"id,sequence\n{jobname},{query_sequence}")

We utilize the get_queries function, which is a colabfold utility function, to fetch the queries from the directory.

One downside to having a high number of sequence alignemts(~128) detected in the MSA step is that it is quadrateically expensive to continue the later steps in inference. In order, to reduce this the MSA's are clustered smartly, based on sequence similarity, in order to reduce the computational cost while ensuring that each MSA has some influence on the final prediction. This clustering step is captured by use_cluster_profile which is also configured below.

In [ ]:

queries, is_complex = get_queries(queries_path)
model_type = set_model_type(is_complex, model_type)

if "multimer" in model_type and max_msa is not None:
  use_cluster_profile = False
else:
  use_cluster_profile = True

Inference ¶

Our final step is to download the alphafold parameters for the model and run inference using all the previous given specifications and inputs.

We will then save the results in a zip file and download it.

In [ ]:

download_alphafold_params(model_type, Path("."))
results = run(
    queries=queries,
    result_dir=result_dir,
    use_templates=use_templates,
    custom_template_path=custom_template_path,
    num_relax=num_relax,
    msa_mode=msa_mode,
    model_type=model_type,
    num_models=5,
    num_recycles=num_recycles,
    relax_max_iterations=relax_max_iterations,
    recycle_early_stop_tolerance=recycle_early_stop_tolerance,
    num_seeds=num_seeds,
    use_dropout=use_dropout,
    model_order=[1,2,3,4,5],
    is_complex=is_complex,
    data_dir=Path("."),
    keep_existing_results=False,
    rank_by="auto",
    pair_mode=pair_mode,
    pairing_strategy=pairing_strategy,
    stop_at_score=float(100),
    prediction_callback=prediction_callback,
    dpi=dpi,
    zip_results=False,
    save_all=save_all,
    max_msa=max_msa,
    use_cluster_profile=use_cluster_profile,
    input_features_callback=input_features_callback,
    save_recycles=save_recycles,
    user_agent="colabfold/google-colab-main",
)
results_zip = f"{jobname}.result.zip"
os.system(f"zip -r {results_zip} {jobname}")

2023-11-14 00:35:33,396 Running on GPU
2023-11-14 00:35:33,401 Found 4 citations for tools or databases
2023-11-14 00:35:33,401 Query 1/1: test_d08f8 (length 59)

COMPLETE: 100%|██████████| 150/150 [elapsed: 00:02 remaining: 00:00]

2023-11-14 00:35:37,107 Setting max_seq=512, max_extra_seq=5120
2023-11-14 00:35:58,595 alphafold2_ptm_model_1_seed_000 recycle=0 pLDDT=96.6 pTM=0.755
2023-11-14 00:36:01,632 alphafold2_ptm_model_1_seed_000 recycle=1 pLDDT=96.6 pTM=0.758 tol=0.311
2023-11-14 00:36:04,668 alphafold2_ptm_model_1_seed_000 recycle=2 pLDDT=96.4 pTM=0.757 tol=0.043
2023-11-14 00:36:07,706 alphafold2_ptm_model_1_seed_000 recycle=3 pLDDT=96.2 pTM=0.757 tol=0.0399
2023-11-14 00:36:07,708 alphafold2_ptm_model_1_seed_000 took 25.1s (3 recycles)

2023-11-14 00:36:10,843 alphafold2_ptm_model_2_seed_000 recycle=0 pLDDT=96.9 pTM=0.76
2023-11-14 00:36:13,881 alphafold2_ptm_model_2_seed_000 recycle=1 pLDDT=96.9 pTM=0.765 tol=0.375
2023-11-14 00:36:16,920 alphafold2_ptm_model_2_seed_000 recycle=2 pLDDT=96.9 pTM=0.766 tol=0.12
2023-11-14 00:36:19,959 alphafold2_ptm_model_2_seed_000 recycle=3 pLDDT=96.8 pTM=0.767 tol=0.0779
2023-11-14 00:36:19,960 alphafold2_ptm_model_2_seed_000 took 12.2s (3 recycles)

2023-11-14 00:36:23,114 alphafold2_ptm_model_3_seed_000 recycle=0 pLDDT=97.2 pTM=0.775
2023-11-14 00:36:26,153 alphafold2_ptm_model_3_seed_000 recycle=1 pLDDT=97.4 pTM=0.782 tol=0.297
2023-11-14 00:36:29,190 alphafold2_ptm_model_3_seed_000 recycle=2 pLDDT=97.4 pTM=0.782 tol=0.105
2023-11-14 00:36:32,230 alphafold2_ptm_model_3_seed_000 recycle=3 pLDDT=97.4 pTM=0.784 tol=0.0659
2023-11-14 00:36:32,231 alphafold2_ptm_model_3_seed_000 took 12.2s (3 recycles)

2023-11-14 00:36:35,372 alphafold2_ptm_model_4_seed_000 recycle=0 pLDDT=97.4 pTM=0.775
2023-11-14 00:36:38,411 alphafold2_ptm_model_4_seed_000 recycle=1 pLDDT=97.4 pTM=0.781 tol=0.282
2023-11-14 00:36:41,451 alphafold2_ptm_model_4_seed_000 recycle=2 pLDDT=97.2 pTM=0.778 tol=0.0839
2023-11-14 00:36:44,490 alphafold2_ptm_model_4_seed_000 recycle=3 pLDDT=97.1 pTM=0.779 tol=0.0421
2023-11-14 00:36:44,491 alphafold2_ptm_model_4_seed_000 took 12.2s (3 recycles)

2023-11-14 00:36:47,670 alphafold2_ptm_model_5_seed_000 recycle=0 pLDDT=97.4 pTM=0.784
2023-11-14 00:36:50,712 alphafold2_ptm_model_5_seed_000 recycle=1 pLDDT=96.9 pTM=0.784 tol=0.237
2023-11-14 00:36:53,750 alphafold2_ptm_model_5_seed_000 recycle=2 pLDDT=96.3 pTM=0.777 tol=0.18
2023-11-14 00:36:56,788 alphafold2_ptm_model_5_seed_000 recycle=3 pLDDT=96.2 pTM=0.777 tol=0.121
2023-11-14 00:36:56,789 alphafold2_ptm_model_5_seed_000 took 12.2s (3 recycles)

2023-11-14 00:36:56,927 reranking models by 'plddt' metric
2023-11-14 00:36:56,927 rank_001_alphafold2_ptm_model_3_seed_000 pLDDT=97.4 pTM=0.784
2023-11-14 00:36:56,929 rank_002_alphafold2_ptm_model_4_seed_000 pLDDT=97.1 pTM=0.779
2023-11-14 00:36:56,930 rank_003_alphafold2_ptm_model_2_seed_000 pLDDT=96.8 pTM=0.767
2023-11-14 00:36:56,930 rank_004_alphafold2_ptm_model_1_seed_000 pLDDT=96.2 pTM=0.757
2023-11-14 00:36:56,931 rank_005_alphafold2_ptm_model_5_seed_000 pLDDT=96.2 pTM=0.777
2023-11-14 00:36:58,495 Done

Out[ ]:

Display 3D structure of the protein file generated based on a few options and using py3Dmol package ¶

Alphafold generates top n ranked model estimates of the protein structure. Here we have 5 ranked structures. The lower the rank the higher the accuracy and quality of the predicted model.

Here we can display the structure with various color schemas: chair, lDDT, and rainbow. We also have options to show the sidechain and mainchains of the protein's structure.

Let's import a few important visualization libraries and set the visualization variables.

In [ ]:

#@title Display 3D structure {run: "auto"}
import py3Dmol
import glob
import matplotlib.pyplot as plt
from colabfold.colabfold import plot_plddt_legend
from colabfold.colabfold import pymol_color_list, alphabet_list
rank_num = 1 #@param ["1", "2", "3", "4", "5"] {type:"raw"}
color = "lDDT" #@param ["chain", "lDDT", "rainbow"]
show_sidechains = False #@param {type:"boolean"}
show_mainchains = False #@param {type:"boolean"}

tag = results["rank"][0][rank_num - 1]
jobname_prefix = ".custom" if msa_mode == "custom" else ""
pdb_filename = f"{jobname}/{jobname}{jobname_prefix}_unrelaxed_{tag}.pdb"
pdb_file = glob.glob(pdb_filename)

Capture the protein structure ¶

Let us store the file path to the generated protein structure so that we can utilize it in the second part of our tutorial for pocket finding and protein-ligand interactions!

In [ ]:

print("Files generated: ", pdb_filename, pdb_file)
pdb_filename_captured = pdb_filename

Files generated:  test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb ['test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb']

Now lets define the visualization function show_pdb using py3Dmol. This function takes in the pdb and various visualization parameters such as rank number, sidechains, and mainchains and visualizes the protein accordingly Here we have passed "lDDT" for the color parameter of the function. lDDT is in short for local Distance Difference Test. We display the color based on the score for each part of the protein as mentioned in the key of the .

In [ ]:

def show_pdb(rank_num=1, show_sidechains=False, show_mainchains=False, color="lDDT"):
  model_name = f"rank_{rank_num}"
  view = py3Dmol.view(js='https://3dmol.org/build/3Dmol.js',)
  view.addModel(open(pdb_file[0],'r').read(),'pdb')

  if color == "lDDT":
    view.setStyle({'cartoon': {'colorscheme': {'prop':'b','gradient': 'roygb','min':50,'max':90}}})
  elif color == "rainbow":
    view.setStyle({'cartoon': {'color':'spectrum'}})
  elif color == "chain":
    chains = len(queries[0][1]) + 1 if is_complex else 1
    for n,chain,color in zip(range(chains),alphabet_list,pymol_color_list):
       view.setStyle({'chain':chain},{'cartoon': {'color':color}})

  if show_sidechains:
    BB = ['C','O','N']
    view.addStyle({'and':[{'resn':["GLY","PRO"],'invert':True},{'atom':BB,'invert':True}]},
                        {'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
    view.addStyle({'and':[{'resn':"GLY"},{'atom':'CA'}]},
                        {'sphere':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
    view.addStyle({'and':[{'resn':"PRO"},{'atom':['C','O'],'invert':True}]},
                        {'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})
  if show_mainchains:
    BB = ['C','O','N','CA']
    view.addStyle({'atom':BB},{'stick':{'colorscheme':f"WhiteCarbon",'radius':0.3}})

  view.zoomTo()
  return view

show_pdb(rank_num, show_sidechains, show_mainchains, color).show()
if color == "lDDT":
  plot_plddt_legend().show()

You appear to be running in JupyterLab (or JavaScript failed to load for some other reason). You need to install the 3dmol extension:
jupyter labextension install jupyterlab_3dmol

PLOTS of Alphafold ¶

The following 3 types of plots are generated with Alphafold.

PAE (Predicted Average Error): It measures the average error in the predicted atomic positions of a protein's 3D structure compared to the true positions. In our experiment below, the PAE plot of the top 5 ranked sturctures is mostly blue which means that we have low errors.
COV (Sequence Coverage): It indicates the percentage of aminoacid sequence for which Alphafold has provided structural prediction and to what degree. In the plot below, the x-axis represents the amino acid sequence and the y-axis represents the number sequences for that amino acid.
iDDT (local Distance Difference Test) or lDDT: It provides a per-residue measure of the predicted model's confidence. It assigns a score to each residue in the protein structure, indicating the reliability of the prediction at that location. For our protein sequence the lDDT is high throughout the amino acid sequence but tends to get lower towards the right end.

In [ ]:

#@title Plots {run: "auto"}
from IPython.display import display, HTML
import base64
from html import escape

# see: https://stackoverflow.com/a/53688522
def image_to_data_url(filename):
  ext = filename.split('.')[-1]
  prefix = f'data:image/{ext};base64,'
  with open(filename, 'rb') as f:
    img = f.read()
  return prefix + base64.b64encode(img).decode('utf-8')

pae = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_pae.png"))
cov = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_coverage.png"))
plddt = image_to_data_url(os.path.join(jobname,f"{jobname}{jobname_prefix}_plddt.png"))
display(HTML(f"""
<style>
  img {{
    float:left;
  }}
  .full {{
    max-width:100%;
  }}
  .half {{
    max-width:50%;
  }}
  @media (max-width:640px) {{
    .half {{
      max-width:100%;
    }}
  }}
</style>
<div style="max-width:90%; padding:2em;">
  <h1>Plots for {escape(jobname)}</h1>
  <img src="{pae}" class="full" />
  <img src="{cov}" class="half" />
  <img src="{plddt}" class="half" />
</div>
"""))

Plots for test_d08f8

Package and download results(optional) ¶

The below code can be used to download the generated zip folder to your local system or saved to google drive.

In [ ]:

if msa_mode == "custom":
  print("Don't forget to cite your custom MSA generation method.")

files.download(f"{jobname}.result.zip")

if save_to_google_drive == True and drive:
  uploaded = drive.CreateFile({'title': f"{jobname}.result.zip"})
  uploaded.SetContentFile(f"{jobname}.result.zip")
  uploaded.Upload()
  print(f"Uploaded {jobname}.result.zip to Google Drive with ID {uploaded.get('id')}")

Part 2: Estimate binding affinities between the predicted protein and a sample set of ligands ¶

Let's import deepchem utility functions and rdkit which helps us in visualization.

In [ ]:

import os
import numpy as np
import pandas as pd

import tempfile

from rdkit import Chem
from rdkit.Chem import AllChem

import deepchem as dc
from deepchem.utils import download_url, load_from_disk

2023-11-14 00:45:56,266 Enabling RDKit 2023.09.1 jupyter extensions
2023-11-14 00:46:00,469 Skipped loading modules with pytorch-geometric dependency, missing a dependency. No module named 'torch_geometric'
2023-11-14 00:46:00,471 Skipped loading modules with pytorch-geometric dependency, missing a dependency. cannot import name 'DMPNN' from 'deepchem.models.torch_models' (/usr/local/lib/python3.10/site-packages/deepchem/models/torch_models/__init__.py)
2023-11-14 00:46:00,473 Skipped loading modules with pytorch-lightning dependency, missing a dependency. No module named 'pytorch_lightning'

To sample a set of ligands we will use PDBind . We will download the respective dataset file and store it in a variable called raw_dataset.

In [ ]:

data_dir = dc.utils.get_data_dir()
dataset_file = os.path.join(data_dir, "pdbbind_core_df.csv.gz")

if not os.path.exists(dataset_file):
    print('File does not exist. Downloading file...')
    download_url("https://s3-us-west-1.amazonaws.com/deepchem.io/datasets/pdbbind_core_df.csv.gz")
    print('File downloaded...')

raw_dataset = load_from_disk(dataset_file)
raw_dataset = raw_dataset[['pdb_id', 'smiles', 'label']]


from openmm.app import PDBFile
from pdbfixer import PDBFixer

from deepchem.utils.vina_utils import prepare_inputs

File does not exist. Downloading file...
File downloaded...

Pocket finding in the predicted protein structure ¶

By utilizing DeepChem's docking modules we can find the number of pockets in the generated structure. We do so by utilizing the 3D-ConvexHullPocketFinder.

In [ ]:

ligands10 = raw_dataset['smiles'].iloc[0:10]

# %%time
import os
#'test_a5e17/test_a5e17_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb'
generated_pdb = pdb_filename_captured
generated_pdb_no_extension = os.path.splitext(os.path.basename(generated_pdb))[0]

finder = dc.dock.binding_pocket.ConvexHullPocketFinder()
pockets = finder.find_pockets(generated_pdb)

print("Pockets for protein: "  + str(len(pockets)) ) # number of identified pockets

Pockets for protein: 8

Docking with Vina Pose Generator ¶

For docking the protein and ligand, we will be using DeepChem's VinaPoseGenerator which automatically installs AutoDock Vina engine and allows us to seamlessly use it. And before we dock a given ligand and protein we have to ensure that the inputs are reasonable for docking which is also handled by a DeepChem's utility function called prepare_inputs. It fixes the protein geometry, optimizes ligand geometry, and sanitizes the molecules. Next, we generate the complex combining the protein and ligand and then display the scores. Analysis of the scores of the combined complex is printed below. For the sake of simplicity, we will be iterating over 3 ligands and comparing the complex scores.

In [ ]:

import locale
locale.getpreferredencoding = lambda: "UTF-8"
!mkdir results

vpg = dc.dock.pose_generation.VinaPoseGenerator()
count=0
scores_matrix =[]
complex_mol_array = []
for count in range(0,3):
  print("Docking ligand "+str(count))
  ligand = ligands10[count]
  p, m = None, None

  vpg = dc.dock.pose_generation.VinaPoseGenerator()

  try:
      p, m = prepare_inputs('%s' % (generated_pdb), ligand)
  except:
      print('%s failed PDB fixing' % (generated_pdb))

  if p and m:  # protein and molecule are readable by RDKit
      print(generated_pdb, p.GetNumAtoms())
      Chem.rdmolfiles.MolToPDBFile(p, 'results/protein_%s.pdb' % (count))
      Chem.rdmolfiles.MolToPDBFile(m, 'results/ligand_%s.pdb' % (count))

  complexes, scores = vpg.generate_poses(molecular_complex=('results/protein_%s.pdb' % (count),'results/ligand_%s.pdb' % (count)),  # protein-ligand files for docking,
                                        out_dir='vina_test',
                                        generate_scores=True
                                      )
  complex_mol = Chem.CombineMols(complexes[0][0], complexes[0][1])
  complex_mol_array.append(complex_mol)
  print(scores)
  scores_matrix.append(scores)

Docking ligand 0

<ipython-input-42-a86da5d11cfe>:17: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding function in deepchem.utils.docking_utils.
  p, m = prepare_inputs('%s' % (generated_pdb), ligand)
[00:47:58] UFFTYPER: Unrecognized atom type: S_5+4 (7)

test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb 448
2023-11-14 00:48:01,498 Pockets not specified. Will use whole protein to dock
2023-11-14 00:48:03,344 Docking in pocket 1/1
2023-11-14 00:48:03,345 Docking with center: [0.28462623 1.04385902 1.65269617]
2023-11-14 00:48:03,345 Box dimensions: [45.593 35.786 38.447]
2023-11-14 00:48:03,346 About to call Vina

/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
<ipython-input-42-a86da5d11cfe>:17: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding function in deepchem.utils.docking_utils.
  p, m = prepare_inputs('%s' % (generated_pdb), ligand)

[-4.321, -4.142, -4.135, -4.109, -4.083, -4.069, -4.046, -4.036, -3.993]
Docking ligand 1
test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb 448
2023-11-14 00:49:13,834 Pockets not specified. Will use whole protein to dock
2023-11-14 00:49:15,455 Docking in pocket 1/1
2023-11-14 00:49:15,457 Docking with center: [0.28062951 1.0434776  1.64895082]
2023-11-14 00:49:15,457 Box dimensions: [45.662 35.909 38.417]
2023-11-14 00:49:15,458 About to call Vina

/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)
<ipython-input-42-a86da5d11cfe>:17: DeprecationWarning: Call to deprecated function prepare_inputs. Please use the corresponding function in deepchem.utils.docking_utils.
  p, m = prepare_inputs('%s' % (generated_pdb), ligand)

[-6.083, -6.022, -5.811, -5.797, -5.796, -5.73, -5.689, -5.654, -5.643]
Docking ligand 2
test_a5e17_0/test_a5e17_0_unrelaxed_rank_001_alphafold2_ptm_model_3_seed_000.pdb 448
2023-11-14 00:55:55,258 Pockets not specified. Will use whole protein to dock
2023-11-14 00:55:56,761 Docking in pocket 1/1
2023-11-14 00:55:56,762 Docking with center: [0.28300874 1.0426     1.64975847]
2023-11-14 00:55:56,762 Box dimensions: [45.657 35.819 38.429]
2023-11-14 00:55:56,763 About to call Vina
[-5.96, -5.9, -5.791, -5.733, -5.704, -5.662, -5.617, -5.605, -5.591]

/usr/local/lib/python3.10/site-packages/vina/vina.py:260: DeprecationWarning: `np.int` is a deprecated alias for the builtin `int`. To silence this warning, use `int` by itself. Doing this will not modify any behavior and is safe. When replacing `np.int`, you may wish to use e.g. `np.int64` or `np.int32` to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  self._voxels = np.ceil(np.array(box_size) / self._spacing).astype(np.int)

Visualize the Protein ligand Complex! ¶

We will use nglview to display the complex which we have stored previously. We add output.enable_custom_widget_manager() so that we can run nglview in colab.

In [ ]:

from google.colab import output

import mdtraj as md
import nglview

from IPython.display import display, Image
output.enable_custom_widget_manager()

Now lets visualize the first 3 protein ligand complexes which we have stored in the complex_mol_array

In [ ]:

v = nglview.show_rdkit(complex_mol_array[0])
display(v)

No description has been provided for this image

In [ ]:

v = nglview.show_rdkit(complex_mol_array[1])
display(v)

In [ ]:

v = nglview.show_rdkit(complex_mol_array[2])
display(v)

In [ ]:

print(scores_matrix)

[[-4.321, -4.142, -4.135, -4.109, -4.083, -4.069, -4.046, -4.036, -3.993], [-6.083, -6.022, -5.811, -5.797, -5.796, -5.73, -5.689, -5.654, -5.643], [-5.96, -5.9, -5.791, -5.733, -5.704, -5.662, -5.617, -5.605, -5.591]]

Next, we can see that all the scores generated from Vina Pose Generator for the respective complexes are negative. This is because protein–ligand binding occurs only when the change in Gibbs free energy (ΔG) of the system is negative and more negative the free energy is the more stable the complex would be as show in Ref. Additionally, molecular docking evaluation based on the paper here showed that the binding affinities of all the derivatives range from (- 3.2 and -18.5 kcal/mol).

Hence based on our experiment we can successfully predict the potential affinity between a protein sequence and a ligand even of we just have the protein sequence!

Congratulations! Time to join the Community! ¶

Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:

Star DeepChem on GitHub ¶

This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.

Join the DeepChem Discord ¶

The DeepChem Discord hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!

Citing This Tutorial ¶

If you found this tutorial useful please consider citing it using the provided BibTeX.

@manual{DeepChemXAlphafold,
 title={Applications of DeepChem with Alphafold: Docking and protein-ligand interaction from protein sequence},
 organization={DeepChem},
 author={Bellamkonda, Sriphani Vardhan},
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/DeepChemXAlphafold.ipynb}},
 year={2023},
}