Tutorials
Protein Deep Learning ¶
by David Ricardo Figueroa Blanco
In this tutorial we will compare protein sequence featurizations such as one-hot encoding and amino acid composition. We will use DeepChem together with a few additional packages to build models that predict the melting temperature of proteins, a good measure of protein stability.
The melting temperature (Tm) of a protein is a measure of protein stability. Its value varies with experimental conditions, but curated databases can be found in the literature, e.g. https://aip.scitation.org/doi/10.1063/1.4947493 . This paper collects a large amount of thermodynamic information on proteins and is therefore a great resource for the study of protein stability. Another quantity related to protein stability is the change in Gibbs free energy $\Delta\Delta G$ due to a mutation.
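Both quantities are usually defined relative to the wild type (this is the standard convention, stated here for reference rather than taken from the database itself):

$$\Delta T_m = T_m^{\mathrm{mut}} - T_m^{\mathrm{wt}}, \qquad \Delta\Delta G = \Delta G^{\mathrm{mut}} - \Delta G^{\mathrm{wt}}$$

so under this convention a negative $\Delta T_m$ indicates a destabilizing mutation.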
The study of protein stability is important in areas such as protein engineering and biocatalysis, because catalytic efficiency can be directly related to the tertiary structure of the protein under study.
Setup ¶
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
!pip install --pre deepchem
!pip install propy3
Data extraction ¶
In this cell we download the dataset published in the paper ( https://aip.scitation.org/doi/10.1063/1.4947493 ) from the DeepChem dataset repository.
import deepchem as dc
import os
from deepchem.utils import download_url
data_dir = dc.utils.get_data_dir()
download_url("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pucci-proteins-appendixtable1.csv",dest_dir=data_dir)
print('Dataset downloaded at {}'.format(data_dir))
dataset_file = os.path.join(data_dir, "pucci-proteins-appendixtable1.csv")
Dataset downloaded at /tmp
A closer look at the dataset: it contains the PDB id, the respective mutation, and the change in thermodynamic properties for each studied protein.
import pandas as pd
data = pd.read_csv(dataset_file)
data
| | Unnamed: 0 | N | PDBid | Chain | RESN | RESwt | RESmut | ΔTmexp | Tmexp [wt] | ΔΔHmexp | ... | ΔΔGexp(T) | T | Nres | R (Å) | Protein | Organism | Ref. | pH | Exp.Tech. | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 1 | 1aky | A | 8 | VAL | ILE | -1.5 | 47.6 | 70 | ... | 5.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.5] | FL | NaN |
| 1 | NaN | 2 | 1aky | A | 48 | GLN | GLU | -1.3 | 47.6 | 60 | ... | 4.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.7] | FL | NaN |
| 2 | NaN | 3 | 1aky | A | 77 | THR | HIS | -1.1 | 47.6 | 130 | ... | 9.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.5] | FL | NaN |
| 3 | NaN | 4 | 1aky | A | 110 | THR | HIS | -4.8 | 47.6 | 165 | ... | 11.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.6] | FL | NaN |
| 4 | NaN | 5 | 1aky | A | 169 | ASN | ASP | -0.6 | 47.6 | 140 | ... | 9.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.5] | FL | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1621 | NaN | 1622 | 5pti_m52l | A | 15 | LYS | SER | -1.3 | 91.7 | -5 | ... | 1.2 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
| 1622 | NaN | 1623 | 5pti_m52l | A | 15 | LYS | THR | -1.1 | 91.7 | -9 | ... | -3.6 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
| 1623 | NaN | 1624 | 5pti_m52l | A | 15 | LYS | VAL | -6.3 | 91.7 | 4 | ... | 4.7 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
| 1624 | NaN | 1625 | 5pti_m52l | A | 15 | LYS | TRP | -7.5 | 91.7 | 17 | ... | 8.5 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [2.5] | DSC | NaN |
| 1625 | NaN | 1626 | 5pti_m52l | A | 15 | LYS | TYR | -6.6 | 91.7 | 4 | ... | 4.6 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
1626 rows × 23 columns
Here we extract a small DataFrame that contains only the PDB id and the corresponding wild-type melting temperature of each entry.
WT_Tm = data[['PDBid','Tmexp [wt]']]
WT_Tm.set_index('PDBid',inplace=True)
WT_Tm
| PDBid | Tmexp [wt] |
|---|---|
| 1aky | 47.6 |
| 1aky | 47.6 |
| 1aky | 47.6 |
| 1aky | 47.6 |
| 1aky | 47.6 |
| ... | ... |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
1626 rows × 1 columns
Here we create a dictionary whose keys are the PDB ids of the proteins and whose values are the wild-type melting temperatures.
dict_WT_TM = {}
for k, v in WT_Tm.itertuples():  # yields (PDBid, Tmexp [wt]) pairs
    if k not in dict_WT_TM:
        dict_WT_TM[k] = float(v)
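For example, we can look up the wild-type melting temperature of 1aky, which should match the 47.6 °C reported in the table above:
print(dict_WT_TM['1aky'])  # 47.6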
Here we keep entries whose PDB id is a plain four-character code (dropping mutant backgrounds such as 5pti_m52l) and whose mutation is reported in chain A.
pdbs = data[data['PDBid'].str.len()<5]
pdbs = pdbs[pdbs['Chain'] == "A"]
pdbs[['RESN','RESwt','RESmut']]
| | RESN | RESwt | RESmut |
|---|---|---|---|
| 0 | 8 | VAL | ILE |
| 1 | 48 | GLN | GLU |
| 2 | 77 | THR | HIS |
| 3 | 110 | THR | HIS |
| 4 | 169 | ASN | ASP |
| ... | ... | ... | ... |
| 1604 | 36 | GLY | ALA |
| 1605 | 36 | GLY | SER |
| 1606 | 37 | GLY | ALA |
| 1607 | 39 | ARG | ALA |
| 1608 | 46 | LYS | ALA |
1509 rows × 3 columns
This cell extracts the full set of mutations and changes in Tm. In addition, we use a dictionary to convert each residue name to its one-letter code.
alls = []
# Collect the values of each relevant column, in order:
# RESN, RESwt, RESmut, PDBid, ΔTmexp
for colname, col in pdbs[['RESN','RESwt','RESmut','PDBid','ΔTmexp']].items():
    alls.append(col.values)
# Three-letter to one-letter amino acid codes
d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
     'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}
resnum = alls[0]                       # residue numbers
wt  = [d[x.strip()] for x in alls[1]]  # wild-type amino acids, one-letter code
mut = [d[x.strip()] for x in alls[2]]  # mutant amino acids, one-letter code
codes = alls[3]                        # PDB codes
tms = alls[4]                          # changes in melting temperature (ΔTmexp)
print("pdbid {}, WT-AA {}, Resnum {}, MUT-AA {}, DeltaTm {}".format(codes[0], wt[0], resnum[0], mut[0], tms[0]))
pdbid 1aky, WT-AA V, Resnum 8, MUT-AA I, DeltaTm -1.5
PDB Download ¶
Here we download all the PDB files by PDB id using the pdbfixer tool.
from pdbfixer import PDBFixer
from simtk.openmm.app import PDBFile
!mkdir PDBs
Using PDBFixer, we download each protein from its PDB code and fix some common problems present in Protein Data Bank files. This process takes around 15 minutes and about 100 MB of disk space. Documentation for PDBFixer can be found at https://htmlpreview.github.io/?https://github.com/openmm/pdbfixer/blob/master/Manual.html . In our case, we download each PDB file from its PDB code and perform some curation: finding missing residues, replacing nonstandard residues, removing heterogens, and adding missing atoms.
import os
import time
t0 = time.time()
downloaded = os.listdir("PDBs")
PDBs_ids= set(pdbs['PDBid'])
pdb_list = []
print("Start Download ")
for pdbid in PDBs_ids:
name=pdbid+".pdb"
if(name in downloaded):
continue
try:
fixer = PDBFixer(pdbid=pdbid)
fixer.findMissingResidues()
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.removeHeterogens(True)
fixer.findMissingAtoms()
fixer.addMissingAtoms()
PDBFile.writeFile(fixer.topology, fixer.positions, open('./PDBs/%s.pdb' % (pdbid), 'w'),keepIds=True)
    except Exception:
print("Problem with {}".format(pdbid))
print("Total Time {}".format(time.time()-t0))
The following function helps us mutate a sequence. Mutations are denoted as A###B, where A is the wild-type amino acid, ### the position, and B the new amino acid.
import re
def MutateSeq(seq,Mutant):
'''
Mutate a sequence based on a string (Mutant) that has the notation :
A###B where A is the wildtype aminoacid ### the position and B the mutation
'''
aalist = re.findall('([A-Z])([0-9]+)([A-Z])', Mutant)
#(len(aalist)==1):
newseq=seq
listseq=list(newseq)
for aas in aalist:
wildAA = aas[0]
pos = int(aas[1]) -1
if(pos >= len(listseq)):
print("Mutation not in the range of the protein")
return None
MutAA = aas[-1]
if(listseq[pos]==wildAA):
listseq[pos]=MutAA
else:
#print("WildType AA does not match")
return None
return("".join(listseq))
The following function helps us extract the amino acid sequence from a PDB structure.
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
ppb = PPBuilder()
def GetSeqFromPDB(pdbid):
    # Parse the structure and return its polypeptides
    structure = PDBParser().get_structure(pdbid.split(".")[0], 'PDBs/{}'.format(pdbid))
    return ppb.build_peptides(structure)
Some examples of the functions described above. GetSeqFromPDB takes one of the PDB files we previously downloaded and extracts its sequence in one-letter code.
import warnings; warnings.simplefilter('ignore')
test="1ezm"
print(test)
seq = GetSeqFromPDB("{}.pdb".format(test))[0].get_sequence()
print("Original Sequence")
print(seq)
1ezm
Original Sequence
AEAGGPGGNQKIGKYTYGSDYGPLIVNDRCEMDDGNVITVDMNSSTDDSKTTPFRFACPTNTYKQVNGAYSPLNDAHFFGGVVFKLYRDWFGTSPLTHKLYMKVHYGRSVENAYWDGTAMLFGDGATMFYPLVSLDVAAHEVSHGFTEQNSGLIYRGQSGGMNEAFSDMAGEAAEFYMRGKNDFLIGYDIKKGSGALRYMDQPSRDGRSIDNASQYYNGIDVHHSSGVYNRAFYLLANSPGWDTRKAFEVFVDANRYYWTATSNYNSGACGVIRSAQNRNYSAADVTRAFSTVGVTCPSAL
Information about the mutation
informSeq=GetSeqFromPDB(test+".pdb")[0].__repr__()
print("Seq information",informSeq)
start = re.findall('[0-9]+',informSeq)[0]
print("Reported Mutation {}{}{}".format("R",179,"A"))
numf = 179 - int(start) + 1 # renumber the mutation relative to the first resolved residue
Seq information <Polypeptide start=1 end=301>
Reported Mutation R179A
Now we apply the mutation to the sequence.
mutfinal = "R{}A".format(numf)
print("Real Mutation = ",mutfinal)
mutseq = MutateSeq(seq,mutfinal)
print(mutseq)
Real Mutation =  R179A
AEAGGPGGNQKIGKYTYGSDYGPLIVNDRCEMDDGNVITVDMNSSTDDSKTTPFRFACPTNTYKQVNGAYSPLNDAHFFGGVVFKLYRDWFGTSPLTHKLYMKVHYGRSVENAYWDGTAMLFGDGATMFYPLVSLDVAAHEVSHGFTEQNSGLIYRGQSGGMNEAFSDMAGEAAEFYMAGKNDFLIGYDIKKGSGALRYMDQPSRDGRSIDNASQYYNGIDVHHSSGVYNRAFYLLANSPGWDTRKAFEVFVDANRYYWTATSNYNSGACGVIRSAQNRNYSAADVTRAFSTVGVTCPSAL
In this for loop we extract the sequences of all proteins in the dataset. In addition, we create the mutated sequences and record the associated melting-temperature change. In some cases, gaps in the PDB files cause the MutateSeq function to fail, so those entries are skipped. This is an important step in the whole process because it creates the final tabulated data containing each sequence and its melting temperature (our label).
information = {}
count = 1
failures=[]
for code,tm,numr,wt_val,mut_val in zip(codes,tms,resnum,wt,mut):
count += 1
seq = GetSeqFromPDB("{}.pdb".format(code))[0].get_sequence()
mutfinal="WT"
if("{}-{}".format(code,mutfinal) not in information):
informSeq=GetSeqFromPDB(code+".pdb")[0].__repr__()
start = re.findall('[-0-9]+',informSeq)[0]
if(int(start)<0):
numf =numr - int(start) # if start is negative 0 is not used as resnumber
else:
numf =numr - int(start) + 1
mutfinal = "{}{}{}".format(wt_val,numf,mut_val)
mutseq = MutateSeq(seq,mutfinal)
    if mutseq is None:
failures.append((code,mutfinal))
continue
information["{}-{}".format(code,mutfinal)]=[mutseq,dict_WT_TM[code]-float(tm)]
Deep Learning and Machine Learning Models Using Protein Sequences ¶
import deepchem as dc
import tensorflow as tf
Here we extract two lists: sequences (data) and melting temperatures (labels).
seq_list=[]
deltaTm=[]
for i in information.values():
seq_list.append(i[0])
deltaTm.append(i[1])
# Length of the longest sequence, used to pad the one-hot encoding
max_seq = max(len(s) for s in seq_list)
Here we use a OneHotFeaturizer to convert protein sequences into numeric arrays.
codes = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
featurizer = dc.feat.OneHotFeaturizer(codes, max_length=max_seq)
features = featurizer.featurize(seq_list)
Note that the OneHotFeaturizer produces, for each sequence, a matrix whose rows are the one-hot vectors of the residues; we flatten each matrix into a single feature vector.
features_vector = []
for i in range(len(features)):
features_vector.append(features[i].flatten())
dc_dataset = dc.data.NumpyDataset(X=features_vector,y=deltaTm)
dc_dataset
<NumpyDataset X.shape: (1497, 13188), y.shape: (1497,), w.shape: (1497,), task_names: [0]>
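As a quick sanity check (a small sketch using the variables above), each featurized sequence should be a padded one-hot matrix whose flattened length matches the feature length of the dataset; the 13188 seen above should factor as max_seq positions times the 20 amino acids plus DeepChem's extra padding/unknown column:
# Each featurized sequence is a (max_length x charset) one-hot matrix;
# flattening gives the feature vectors stored in dc_dataset.
print("One-hot matrix shape:", features[0].shape)
print("Flattened length:", features[0].size)            # expected: max_seq * (len(codes) + 1)
print("Dataset feature length:", dc_dataset.X.shape[1])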
Here we create a splitter to perform a train/test split of the dataset.
from deepchem import splits
splitter = splits.RandomSplitter()
train, test = splitter.train_test_split(dc_dataset,seed=42)
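train_test_split uses an 80/20 split by default; a quick look at the resulting sizes:
print("Train size:", len(train), "Test size:", len(test))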
Here we create a neural network using tensorflow.keras, with mean absolute error (MAE) as the loss function to evaluate the regression.
import tensorflow.keras as keras
model = tf.keras.Sequential([
keras.layers.Dense(units=32, activation='relu', input_shape=dc_dataset.X.shape[1:]),
keras.layers.Dropout(0.2),
keras.layers.Dense(units=32, activation='relu'),
keras.layers.Dropout(0.2),
keras.layers.Dense(units=1),
])
model.compile(loss='mae', optimizer='adam')
print(model.summary())
history = model.fit(
train.X, train.y,
validation_data=(test.X,test.y),
batch_size=100,
epochs=30,
)
# Plot training and validation loss against the epoch number
import matplotlib.pyplot as plt
history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_6 (Dense) (None, 32) 422048 _________________________________________________________________ dropout_4 (Dropout) (None, 32) 0 _________________________________________________________________ dense_7 (Dense) (None, 32) 1056 _________________________________________________________________ dropout_5 (Dropout) (None, 32) 0 _________________________________________________________________ dense_8 (Dense) (None, 1) 33 ================================================================= Total params: 423,137 Trainable params: 423,137 Non-trainable params: 0 _________________________________________________________________ None Epoch 1/30 12/12 [==============================] - 1s 20ms/step - loss: 62.4284 - val_loss: 51.3894 Epoch 2/30 12/12 [==============================] - 0s 10ms/step - loss: 45.6483 - val_loss: 22.2757 Epoch 3/30 12/12 [==============================] - 0s 12ms/step - loss: 19.4803 - val_loss: 11.7996 Epoch 4/30 12/12 [==============================] - 0s 11ms/step - loss: 18.0858 - val_loss: 8.6031 Epoch 5/30 12/12 [==============================] - 0s 11ms/step - loss: 14.5904 - val_loss: 9.4577 Epoch 6/30 12/12 [==============================] - 0s 10ms/step - loss: 14.3444 - val_loss: 7.3325 Epoch 7/30 12/12 [==============================] - 0s 10ms/step - loss: 13.6787 - val_loss: 7.5799 Epoch 8/30 12/12 [==============================] - 0s 10ms/step - loss: 14.0674 - val_loss: 6.6186 Epoch 9/30 12/12 [==============================] - 0s 11ms/step - loss: 12.8215 - val_loss: 7.4920 Epoch 10/30 12/12 [==============================] - 0s 11ms/step - loss: 13.0748 - val_loss: 5.4614 Epoch 11/30 12/12 [==============================] - 0s 11ms/step - loss: 12.3646 - val_loss: 6.7943 Epoch 12/30 12/12 [==============================] - 0s 10ms/step - loss: 11.5250 - val_loss: 5.2098 Epoch 13/30 12/12 [==============================] - 0s 11ms/step - loss: 11.5254 - val_loss: 5.4262 Epoch 14/30 12/12 [==============================] - 0s 11ms/step - loss: 12.1312 - val_loss: 6.4607 Epoch 15/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4129 - val_loss: 5.6157 Epoch 16/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4406 - val_loss: 6.0425 Epoch 17/30 12/12 [==============================] - 0s 12ms/step - loss: 11.4762 - val_loss: 5.3330 Epoch 18/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4820 - val_loss: 8.5933 Epoch 19/30 12/12 [==============================] - 0s 11ms/step - loss: 11.1607 - val_loss: 5.6460 Epoch 20/30 12/12 [==============================] - 0s 12ms/step - loss: 11.2637 - val_loss: 5.3362 Epoch 21/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4390 - val_loss: 6.2368 Epoch 22/30 12/12 [==============================] - 0s 10ms/step - loss: 10.8070 - val_loss: 6.2875 Epoch 23/30 12/12 [==============================] - 0s 11ms/step - loss: 11.1277 - val_loss: 5.3496 Epoch 24/30 12/12 [==============================] - 0s 10ms/step - loss: 11.1626 - val_loss: 5.0670 Epoch 25/30 12/12 [==============================] - 0s 10ms/step - loss: 10.9386 - val_loss: 5.0440 Epoch 26/30 12/12 [==============================] - 0s 14ms/step - loss: 11.0402 - val_loss: 5.6448 Epoch 27/30 12/12 [==============================] - 0s 14ms/step - loss: 10.7278 - val_loss: 5.3868 Epoch 28/30 
12/12 [==============================] - 0s 13ms/step - loss: 10.6073 - val_loss: 5.4149 Epoch 29/30 12/12 [==============================] - 0s 12ms/step - loss: 11.0812 - val_loss: 5.9349 Epoch 30/30 12/12 [==============================] - 0s 14ms/step - loss: 10.2336 - val_loss: 5.7489
<AxesSubplot:>
DeepChem Keras Model ¶
Note that DeepChem's KerasModel lets us create a DeepChem model from the Keras model built above. The underlying Keras model's training history can be accessed with model_dc.model.history.
model_dc = dc.models.KerasModel(model, dc.models.losses.L1Loss())
model_dc.fit(train)
10.399123382568359
History_df = pd.DataFrame(model_dc.model.history.history)
History_df[['loss', 'val_loss']].plot()
<AxesSubplot:>
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('test dataset R2:', model_dc.evaluate(test, [metric]))
test dataset R2: {'pearson_r2_score': 0.7125054529591869}
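To see how the predictions are distributed, we can plot predicted versus experimental values on the test set. This is a minimal sketch using KerasModel.predict and the matplotlib import from above:
# Predicted vs. experimental values on the test set
preds = model_dc.predict(test).flatten()
plt.scatter(test.y, preds, s=10)
plt.xlabel("Experimental value")
plt.ylabel("Predicted value")
plt.show()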
Examples of Classic ML models ¶
Finally, we will compare other descriptors, such as amino acid composition and the composition, transition, and distribution (CTD) descriptors ( https://www.pnas.org/content/92/19/8700 ).
from propy import PyPro
In the following cells we create a PyPro object for each protein sequence; propy lets us compute descriptors such as the amino acid composition vector. Here we build lists with the amino acid composition and CTD vectors for each sequence used in the previous model.
import numpy as np
aaComplist = []
CTDList =[]
for seq in seq_list:
Obj = PyPro.GetProDes(seq)
aaComplist.append(np.array(list(Obj.GetAAComp().values())))
CTDList.append(np.array(list(Obj.GetCTD().values())))
dc_dataset_aacomp = dc.data.NumpyDataset(X=aaComplist,y=deltaTm)
dc_dataset_ctd = dc.data.NumpyDataset(X=CTDList,y=deltaTm)
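Before training, it is worth checking the dimensionality of each descriptor set (a small sketch; the amino acid composition vector has one entry per amino acid, while the CTD block is considerably larger):
# Compare the size of the two descriptor sets
print("AA composition length:", len(aaComplist[0]))
print("CTD descriptor length:", len(CTDList[0]))
print("AAComp dataset shape:", dc_dataset_aacomp.X.shape)
print("CTD dataset shape:", dc_dataset_ctd.X.shape)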
Evaluation of classical machine learning models ¶
In the following cell we create a RandomForestRegressor and wrap it in a DeepChem SklearnModel. As with the DL models, we use the MAE score to evaluate the regression results.
from deepchem import splits
splitter = splits.RandomSplitter()
train, test = splitter.train_test_split(dc_dataset_aacomp,seed=42)
from sklearn.ensemble import RandomForestRegressor
from deepchem.utils.evaluate import Evaluator
import pandas as pd
print("RandomForestRegressor")
seed = 42 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=seed)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train)
metric = dc.metrics.Metric(dc.metrics.mae_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
print("Train score is : {}".format(train_score))
print("Test score is : {}".format(test_score))
RandomForestRegressor
Train score is : {'mae_score': 1.7916551501995608}
Test score is : {'mae_score': 3.8967191996673947}
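For a direct comparison with the deep learning model above, we can also evaluate the random forest with the Pearson R² metric, reusing the objects defined above:
# Same metric as used for the Keras model, for a direct comparison
r2_metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('test dataset R2:', model.evaluate(test, [r2_metric]))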
In the following cell we create a Support Vector Regressor and wrap it in a DeepChem SklearnModel. As with the DL models, we use the MAE score to evaluate the regression results.
print("SupportVectorMachineRegressor")
from sklearn.svm import SVR
svr_sklearn = SVR(kernel="poly",degree=4)
svr_sklearn.random_state = seed  # note: sklearn's SVR takes no random_state parameter, so this attribute is ignored
model = dc.models.SklearnModel(svr_sklearn)
model.fit(train)
metric = dc.metrics.Metric(dc.metrics.mae_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
print("Train score is : {}".format(train_score))
print("Test score is : {}".format(test_score))
SupportVectorMachineRegressor
Train score is : {'mae_score': 3.275727325767219}
Test score is : {'mae_score': 4.058136267284038}
Congratulations! Time to join the Community! ¶
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub ¶
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter ¶
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!