Tutorials
Protein Deep Learning ¶
by David Ricardo Figueroa Blanco
In this tutorial we will compare protein sequence featurizations such as one-hot encoding and amino acid composition. We will use DeepChem together with a few additional packages to build models that predict the melting temperature of proteins, a good measure of protein stability.
The melting temperature (Tm) of a protein is a measure of protein stability. Its value varies with experimental conditions, but curated databases can be found in the literature, e.g. https://aip.scitation.org/doi/10.1063/1.4947493 . This paper collects a large amount of thermodynamic information on proteins and is therefore a great resource for the study of protein stability. Another quantity related to protein stability is the change in Gibbs free energy $\Delta\Delta G$ due to a mutation.
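Both quantities are usually defined relative to the wild type (this is the standard convention, stated here for reference rather than taken from the database itself):

$$\Delta T_m = T_m^{\mathrm{mut}} - T_m^{\mathrm{wt}}, \qquad \Delta\Delta G = \Delta G^{\mathrm{mut}} - \Delta G^{\mathrm{wt}}$$

so under this convention a negative $\Delta T_m$ indicates a destabilizing mutation.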
The study of protein stability is important in areas such as protein engineering and biocatalysis, because catalytic efficiency can be directly related to the tertiary structure of the protein under study.
Setup ¶
To run DeepChem within Colab, you'll need to run the following installation commands. This will take about 5 minutes to run to completion and install your environment. You can of course run this tutorial locally if you prefer. In that case, don't run these cells since they will download and install Anaconda on your local machine.
!curl -Lo conda_installer.py https://raw.githubusercontent.com/deepchem/deepchem/master/scripts/colab_install.py
import conda_installer
conda_installer.install()
!/root/miniconda/bin/conda info -e
!pip install --pre deepchem
!pip install propy3
Data extraction ¶
In this cell we download the dataset published in the paper ( https://aip.scitation.org/doi/10.1063/1.4947493 ) from the DeepChem dataset repository.
import deepchem as dc
import os
from deepchem.utils import download_url
data_dir = dc.utils.get_data_dir()
download_url("https://deepchemdata.s3-us-west-1.amazonaws.com/datasets/pucci-proteins-appendixtable1.csv",dest_dir=data_dir)
print('Dataset downloaded at {}'.format(data_dir))
dataset_file = os.path.join(data_dir, "pucci-proteins-appendixtable1.csv")
Dataset downloaded at /tmp
A closer look at the dataset: it contains the PDB id, the respective mutation, and the change in thermodynamic properties for each studied protein.
import pandas as pd
data = pd.read_csv(dataset_file)
data
| | Unnamed: 0 | N | PDBid | Chain | RESN | RESwt | RESmut | ΔTmexp | Tmexp [wt] | ΔΔHmexp | ... | ΔΔGexp(T) | T | Nres | R (Å) | Protein | Organism | Ref. | pH | Exp.Tech. | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | 1 | 1aky | A | 8 | VAL | ILE | -1.5 | 47.6 | 70 | ... | 5.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.5] | FL | NaN |
| 1 | NaN | 2 | 1aky | A | 48 | GLN | GLU | -1.3 | 47.6 | 60 | ... | 4.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.7] | FL | NaN |
| 2 | NaN | 3 | 1aky | A | 77 | THR | HIS | -1.1 | 47.6 | 130 | ... | 9.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.5] | FL | NaN |
| 3 | NaN | 4 | 1aky | A | 110 | THR | HIS | -4.8 | 47.6 | 165 | ... | 11.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.6] | FL | NaN |
| 4 | NaN | 5 | 1aky | A | 169 | ASN | ASP | -0.6 | 47.6 | 140 | ... | 9.0 | 25 | 220 | 1.63 | ADK | Yeast | [1] | [7.5] | FL | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1621 | NaN | 1622 | 5pti_m52l | A | 15 | LYS | SER | -1.3 | 91.7 | -5 | ... | 1.2 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
| 1622 | NaN | 1623 | 5pti_m52l | A | 15 | LYS | THR | -1.1 | 91.7 | -9 | ... | -3.6 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
| 1623 | NaN | 1624 | 5pti_m52l | A | 15 | LYS | VAL | -6.3 | 91.7 | 4 | ... | 4.7 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
| 1624 | NaN | 1625 | 5pti_m52l | A | 15 | LYS | TRP | -7.5 | 91.7 | 17 | ... | 8.5 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [2.5] | DSC | NaN |
| 1625 | NaN | 1626 | 5pti_m52l | A | 15 | LYS | TYR | -6.6 | 91.7 | 4 | ... | 4.6 | 25 | 58 | 1.00 | PTI M52L | Bovine | [232] | [3.0] | DSC | NaN |
1626 rows × 23 columns
Here we extract a small DataFrame that contains only the PDB id and the corresponding wild-type melting temperature of each entry.
WT_Tm = data[['PDBid','Tmexp [wt]']]
WT_Tm.set_index('PDBid',inplace=True)
WT_Tm
| PDBid | Tmexp [wt] |
|---|---|
| 1aky | 47.6 |
| 1aky | 47.6 |
| 1aky | 47.6 |
| 1aky | 47.6 |
| 1aky | 47.6 |
| ... | ... |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
| 5pti_m52l | 91.7 |
1626 rows × 1 columns
Here we create a dictionary whose keys are the PDB ids of the proteins and whose values are the wild-type melting temperatures.
dict_WT_TM = {}
for k, v in WT_Tm.itertuples():  # yields (PDBid, Tmexp [wt]) pairs
    if k not in dict_WT_TM:
        dict_WT_TM[k] = float(v)
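For example, we can look up the wild-type melting temperature of 1aky, which should match the 47.6 °C reported in the table above:
print(dict_WT_TM['1aky'])  # 47.6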
Here we keep entries whose PDB id is a plain four-character code (dropping mutant backgrounds such as 5pti_m52l) and whose mutation is reported in chain A.
pdbs = data[data['PDBid'].str.len()<5]
pdbs = pdbs[pdbs['Chain'] == "A"]
pdbs[['RESN','RESwt','RESmut']]
| | RESN | RESwt | RESmut |
|---|---|---|---|
| 0 | 8 | VAL | ILE |
| 1 | 48 | GLN | GLU |
| 2 | 77 | THR | HIS |
| 3 | 110 | THR | HIS |
| 4 | 169 | ASN | ASP |
| ... | ... | ... | ... |
| 1604 | 36 | GLY | ALA |
| 1605 | 36 | GLY | SER |
| 1606 | 37 | GLY | ALA |
| 1607 | 39 | ARG | ALA |
| 1608 | 46 | LYS | ALA |
1509 rows × 3 columns
This cell extracts the full set of mutations and changes in Tm. In addition, we use a dictionary to convert each residue name to its one-letter code.
alls = []
# Collect the values of each relevant column, in order:
# RESN, RESwt, RESmut, PDBid, ΔTmexp
for colname, col in pdbs[['RESN','RESwt','RESmut','PDBid','ΔTmexp']].items():
    alls.append(col.values)
# Three-letter to one-letter amino acid codes
d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W',
     'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}
resnum = alls[0]                       # residue numbers
wt  = [d[x.strip()] for x in alls[1]]  # wild-type amino acids, one-letter code
mut = [d[x.strip()] for x in alls[2]]  # mutant amino acids, one-letter code
codes = alls[3]                        # PDB codes
tms = alls[4]                          # changes in melting temperature (ΔTmexp)
print("pdbid {}, WT-AA {}, Resnum {}, MUT-AA {}, DeltaTm {}".format(codes[0], wt[0], resnum[0], mut[0], tms[0]))
pdbid 1aky, WT-AA V, Resnum 8, MUT-AA I, DeltaTm -1.5
PDB Download ¶
Here we download all the PDB files by PDB id using the pdbfixer tool.
from pdbfixer import PDBFixer
from simtk.openmm.app import PDBFile
!mkdir PDBs
Using PDBFixer, we download each protein from its PDB code and fix some common problems present in Protein Data Bank files. This process takes around 15 minutes and about 100 MB of disk space. Documentation for PDBFixer can be found at https://htmlpreview.github.io/?https://github.com/openmm/pdbfixer/blob/master/Manual.html . In our case, we download each PDB file from its PDB code and perform some curation: finding missing residues, replacing nonstandard residues, removing heterogens, and adding missing atoms.
import os
import time
t0 = time.time()
downloaded = os.listdir("PDBs")
PDBs_ids= set(pdbs['PDBid'])
pdb_list = []
print("Start Download ")
for pdbid in PDBs_ids:
name=pdbid+".pdb"
if(name in downloaded):
continue
try:
fixer = PDBFixer(pdbid=pdbid)
fixer.findMissingResidues()
fixer.findNonstandardResidues()
fixer.replaceNonstandardResidues()
fixer.removeHeterogens(True)
fixer.findMissingAtoms()
fixer.addMissingAtoms()
PDBFile.writeFile(fixer.topology, fixer.positions, open('./PDBs/%s.pdb' % (pdbid), 'w'),keepIds=True)
    except Exception:
print("Problem with {}".format(pdbid))
print("Total Time {}".format(time.time()-t0))
The following function helps us mutate a sequence. Mutations are denoted as A###B, where A is the wild-type amino acid, ### the position, and B the new amino acid.
import re
def MutateSeq(seq,Mutant):
'''
Mutate a sequence based on a string (Mutant) that has the notation :
A###B where A is the wildtype aminoacid ### the position and B the mutation
'''
aalist = re.findall('([A-Z])([0-9]+)([A-Z])', Mutant)
#(len(aalist)==1):
newseq=seq
listseq=list(newseq)
for aas in aalist:
wildAA = aas[0]
pos = int(aas[1]) -1
if(pos >= len(listseq)):
print("Mutation not in the range of the protein")
return None
MutAA = aas[-1]
if(listseq[pos]==wildAA):
listseq[pos]=MutAA
else:
#print("WildType AA does not match")
return None
return("".join(listseq))
The following function helps us extract the amino acid sequence from a PDB structure.
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Polypeptide import PPBuilder
ppb = PPBuilder()
def GetSeqFromPDB(pdbid):
    # Parse the structure and return its polypeptides
    structure = PDBParser().get_structure(pdbid.split(".")[0], 'PDBs/{}'.format(pdbid))
    return ppb.build_peptides(structure)
Some examples of the functions described above. GetSeqFromPDB takes one of the PDB files we previously downloaded and extracts its sequence in one-letter code.
import warnings; warnings.simplefilter('ignore')
test="1ezm"
print(test)
seq = GetSeqFromPDB("{}.pdb".format(test))[0].get_sequence()
print("Original Sequence")
print(seq)
1ezm
Original Sequence
AEAGGPGGNQKIGKYTYGSDYGPLIVNDRCEMDDGNVITVDMNSSTDDSKTTPFRFACPTNTYKQVNGAYSPLNDAHFFGGVVFKLYRDWFGTSPLTHKLYMKVHYGRSVENAYWDGTAMLFGDGATMFYPLVSLDVAAHEVSHGFTEQNSGLIYRGQSGGMNEAFSDMAGEAAEFYMRGKNDFLIGYDIKKGSGALRYMDQPSRDGRSIDNASQYYNGIDVHHSSGVYNRAFYLLANSPGWDTRKAFEVFVDANRYYWTATSNYNSGACGVIRSAQNRNYSAADVTRAFSTVGVTCPSAL
Information about the mutation
informSeq=GetSeqFromPDB(test+".pdb")[0].__repr__()
print("Seq information",informSeq)
start = re.findall('[0-9]+',informSeq)[0]
print("Reported Mutation {}{}{}".format("R",179,"A"))
numf = 179 - int(start) + 1 # renumber the mutation relative to the first resolved residue
Seq information <Polypeptide start=1 end=301>
Reported Mutation R179A
Now we apply the mutation to the sequence.
mutfinal = "R{}A".format(numf)
print("Real Mutation = ",mutfinal)
mutseq = MutateSeq(seq,mutfinal)
print(mutseq)
Real Mutation =  R179A
AEAGGPGGNQKIGKYTYGSDYGPLIVNDRCEMDDGNVITVDMNSSTDDSKTTPFRFACPTNTYKQVNGAYSPLNDAHFFGGVVFKLYRDWFGTSPLTHKLYMKVHYGRSVENAYWDGTAMLFGDGATMFYPLVSLDVAAHEVSHGFTEQNSGLIYRGQSGGMNEAFSDMAGEAAEFYMAGKNDFLIGYDIKKGSGALRYMDQPSRDGRSIDNASQYYNGIDVHHSSGVYNRAFYLLANSPGWDTRKAFEVFVDANRYYWTATSNYNSGACGVIRSAQNRNYSAADVTRAFSTVGVTCPSAL
In this for loop we extract the sequences of all proteins in the dataset. In addition, we create the mutated sequences and record the associated melting-temperature change. In some cases, gaps in the PDB files cause the MutateSeq function to fail, so those entries are skipped. This is an important step in the whole process because it creates the final tabulated data containing each sequence and its melting temperature (our label).
information = {}
count = 1
failures=[]
for code,tm,numr,wt_val,mut_val in zip(codes,tms,resnum,wt,mut):
count += 1
seq = GetSeqFromPDB("{}.pdb".format(code))[0].get_sequence()
mutfinal="WT"
if("{}-{}".format(code,mutfinal) not in information):
informSeq=GetSeqFromPDB(code+".pdb")[0].__repr__()
start = re.findall('[-0-9]+',informSeq)[0]
if(int(start)<0):
numf =numr - int(start) # if start is negative 0 is not used as resnumber
else:
numf =numr - int(start) + 1
mutfinal = "{}{}{}".format(wt_val,numf,mut_val)
mutseq = MutateSeq(seq,mutfinal)
    if mutseq is None:
failures.append((code,mutfinal))
continue
information["{}-{}".format(code,mutfinal)]=[mutseq,dict_WT_TM[code]-float(tm)]
Deep Learning and Machine Learning Models Using Protein Sequences ¶
import deepchem as dc
import tensorflow as tf
Here we extract two lists: sequences (data) and melting temperatures (labels).
seq_list=[]
deltaTm=[]
for i in information.values():
seq_list.append(i[0])
deltaTm.append(i[1])
# Length of the longest sequence, used to pad the one-hot encoding
max_seq = max(len(s) for s in seq_list)
Here we use a OneHotFeaturizer to convert protein sequences into numeric arrays.
codes = ['A', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'K', 'L',
'M', 'N', 'P', 'Q', 'R', 'S', 'T', 'V', 'W', 'Y']
featurizer = dc.feat.OneHotFeaturizer(codes, max_length=max_seq)
features = featurizer.featurize(seq_list)
Note that the OneHotFeaturizer produces, for each sequence, a matrix whose rows are the one-hot vectors of the residues; we flatten each matrix into a single feature vector.
features_vector = []
for i in range(len(features)):
features_vector.append(features[i].flatten())
dc_dataset = dc.data.NumpyDataset(X=features_vector,y=deltaTm)
dc_dataset
<NumpyDataset X.shape: (1497, 13188), y.shape: (1497,), w.shape: (1497,), task_names: [0]>
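As a quick sanity check (a small sketch using the variables above), each featurized sequence should be a padded one-hot matrix whose flattened length matches the feature length of the dataset; the 13188 seen above should factor as max_seq positions times the 20 amino acids plus DeepChem's extra padding/unknown column:
# Each featurized sequence is a (max_length x charset) one-hot matrix;
# flattening gives the feature vectors stored in dc_dataset.
print("One-hot matrix shape:", features[0].shape)
print("Flattened length:", features[0].size)            # expected: max_seq * (len(codes) + 1)
print("Dataset feature length:", dc_dataset.X.shape[1])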
Here we create a splitter to perform a train/test split of the dataset.
from deepchem import splits
splitter = splits.RandomSplitter()
train, test = splitter.train_test_split(dc_dataset,seed=42)
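train_test_split uses an 80/20 split by default; a quick look at the resulting sizes:
print("Train size:", len(train), "Test size:", len(test))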
Here we create a neural network using tensorflow.keras, with mean absolute error (MAE) as the loss function to evaluate the regression.
import tensorflow.keras as keras
model = tf.keras.Sequential([
keras.layers.Dense(units=32, activation='relu', input_shape=dc_dataset.X.shape[1:]),
keras.layers.Dropout(0.2),
keras.layers.Dense(units=32, activation='relu'),
keras.layers.Dropout(0.2),
keras.layers.Dense(units=1),
])
model.compile(loss='mae', optimizer='adam')
print(model.summary())
history = model.fit(
train.X, train.y,
validation_data=(test.X,test.y),
batch_size=100,
epochs=30,
)
# Plot training and validation loss against the epoch number
import matplotlib.pyplot as plt
history_df = pd.DataFrame(history.history)
history_df[['loss', 'val_loss']].plot()
Model: "sequential_2" _________________________________________________________________ Layer (type) Output Shape Param # ================================================================= dense_6 (Dense) (None, 32) 422048 _________________________________________________________________ dropout_4 (Dropout) (None, 32) 0 _________________________________________________________________ dense_7 (Dense) (None, 32) 1056 _________________________________________________________________ dropout_5 (Dropout) (None, 32) 0 _________________________________________________________________ dense_8 (Dense) (None, 1) 33 ================================================================= Total params: 423,137 Trainable params: 423,137 Non-trainable params: 0 _________________________________________________________________ None Epoch 1/30 12/12 [==============================] - 1s 20ms/step - loss: 62.4284 - val_loss: 51.3894 Epoch 2/30 12/12 [==============================] - 0s 10ms/step - loss: 45.6483 - val_loss: 22.2757 Epoch 3/30 12/12 [==============================] - 0s 12ms/step - loss: 19.4803 - val_loss: 11.7996 Epoch 4/30 12/12 [==============================] - 0s 11ms/step - loss: 18.0858 - val_loss: 8.6031 Epoch 5/30 12/12 [==============================] - 0s 11ms/step - loss: 14.5904 - val_loss: 9.4577 Epoch 6/30 12/12 [==============================] - 0s 10ms/step - loss: 14.3444 - val_loss: 7.3325 Epoch 7/30 12/12 [==============================] - 0s 10ms/step - loss: 13.6787 - val_loss: 7.5799 Epoch 8/30 12/12 [==============================] - 0s 10ms/step - loss: 14.0674 - val_loss: 6.6186 Epoch 9/30 12/12 [==============================] - 0s 11ms/step - loss: 12.8215 - val_loss: 7.4920 Epoch 10/30 12/12 [==============================] - 0s 11ms/step - loss: 13.0748 - val_loss: 5.4614 Epoch 11/30 12/12 [==============================] - 0s 11ms/step - loss: 12.3646 - val_loss: 6.7943 Epoch 12/30 12/12 [==============================] - 0s 10ms/step - loss: 11.5250 - val_loss: 5.2098 Epoch 13/30 12/12 [==============================] - 0s 11ms/step - loss: 11.5254 - val_loss: 5.4262 Epoch 14/30 12/12 [==============================] - 0s 11ms/step - loss: 12.1312 - val_loss: 6.4607 Epoch 15/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4129 - val_loss: 5.6157 Epoch 16/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4406 - val_loss: 6.0425 Epoch 17/30 12/12 [==============================] - 0s 12ms/step - loss: 11.4762 - val_loss: 5.3330 Epoch 18/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4820 - val_loss: 8.5933 Epoch 19/30 12/12 [==============================] - 0s 11ms/step - loss: 11.1607 - val_loss: 5.6460 Epoch 20/30 12/12 [==============================] - 0s 12ms/step - loss: 11.2637 - val_loss: 5.3362 Epoch 21/30 12/12 [==============================] - 0s 11ms/step - loss: 11.4390 - val_loss: 6.2368 Epoch 22/30 12/12 [==============================] - 0s 10ms/step - loss: 10.8070 - val_loss: 6.2875 Epoch 23/30 12/12 [==============================] - 0s 11ms/step - loss: 11.1277 - val_loss: 5.3496 Epoch 24/30 12/12 [==============================] - 0s 10ms/step - loss: 11.1626 - val_loss: 5.0670 Epoch 25/30 12/12 [==============================] - 0s 10ms/step - loss: 10.9386 - val_loss: 5.0440 Epoch 26/30 12/12 [==============================] - 0s 14ms/step - loss: 11.0402 - val_loss: 5.6448 Epoch 27/30 12/12 [==============================] - 0s 14ms/step - loss: 10.7278 - val_loss: 5.3868 Epoch 28/30 
12/12 [==============================] - 0s 13ms/step - loss: 10.6073 - val_loss: 5.4149 Epoch 29/30 12/12 [==============================] - 0s 12ms/step - loss: 11.0812 - val_loss: 5.9349 Epoch 30/30 12/12 [==============================] - 0s 14ms/step - loss: 10.2336 - val_loss: 5.7489
<AxesSubplot:>
DeepChem Keras Model ¶
Note that DeepChem's KerasModel lets us create a DeepChem model from the Keras model built above. The underlying Keras model's training history can be accessed with model_dc.model.history.
model_dc = dc.models.KerasModel(model, dc.models.losses.L1Loss())
model_dc.fit(train)
10.399123382568359
History_df = pd.DataFrame(model_dc.model.history.history)
History_df[['loss', 'val_loss']].plot()
<AxesSubplot:>
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('test dataset R2:', model_dc.evaluate(test, [metric]))
test dataset R2: {'pearson_r2_score': 0.7125054529591869}
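To see how the predictions are distributed, we can plot predicted versus experimental values on the test set. This is a minimal sketch using KerasModel.predict and the matplotlib import from above:
# Predicted vs. experimental values on the test set
preds = model_dc.predict(test).flatten()
plt.scatter(test.y, preds, s=10)
plt.xlabel("Experimental value")
plt.ylabel("Predicted value")
plt.show()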
Examples of Classic ML models ¶
Finally, we will compare other descriptors, such as amino acid composition and the composition, transition, and distribution (CTD) descriptors ( https://www.pnas.org/content/92/19/8700 ).
from propy import PyPro
In the following cells we create a PyPro object for each protein sequence; propy lets us compute descriptors such as the amino acid composition vector. Here we build lists with the amino acid composition and CTD vectors for each sequence used in the previous model.
import numpy as np
aaComplist = []
CTDList =[]
for seq in seq_list:
Obj = PyPro.GetProDes(seq)
aaComplist.append(np.array(list(Obj.GetAAComp().values())))
CTDList.append(np.array(list(Obj.GetCTD().values())))
dc_dataset_aacomp = dc.data.NumpyDataset(X=aaComplist,y=deltaTm)
dc_dataset_ctd = dc.data.NumpyDataset(X=CTDList,y=deltaTm)
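Before training, it is worth checking the dimensionality of each descriptor set (a small sketch; the amino acid composition vector has one entry per amino acid, while the CTD block is considerably larger):
# Compare the size of the two descriptor sets
print("AA composition length:", len(aaComplist[0]))
print("CTD descriptor length:", len(CTDList[0]))
print("AAComp dataset shape:", dc_dataset_aacomp.X.shape)
print("CTD dataset shape:", dc_dataset_ctd.X.shape)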
Evaluation of classical machine learning models ¶
In the following cell we create a RandomForestRegressor and wrap it in a DeepChem SklearnModel. As with the DL models, we use the MAE score to evaluate the regression results.
from deepchem import splits
splitter = splits.RandomSplitter()
train, test = splitter.train_test_split(dc_dataset_aacomp,seed=42)
from sklearn.ensemble import RandomForestRegressor
from deepchem.utils.evaluate import Evaluator
import pandas as pd
print("RandomForestRegressor")
seed = 42 # Set a random seed to get stable results
sklearn_model = RandomForestRegressor(n_estimators=100, max_features='sqrt', random_state=seed)
model = dc.models.SklearnModel(sklearn_model)
model.fit(train)
metric = dc.metrics.Metric(dc.metrics.mae_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
print("Train score is : {}".format(train_score))
print("Test score is : {}".format(test_score))
RandomForestRegressor
Train score is : {'mae_score': 1.7916551501995608}
Test score is : {'mae_score': 3.8967191996673947}
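For a direct comparison with the deep learning model above, we can also evaluate the random forest with the Pearson R² metric, reusing the objects defined above:
# Same metric as used for the Keras model, for a direct comparison
r2_metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print('test dataset R2:', model.evaluate(test, [r2_metric]))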
In the following cell we create a Support Vector Regressor and wrap it in a DeepChem SklearnModel. As with the DL models, we use the MAE score to evaluate the regression results.
print("SupportVectorMachineRegressor")
from sklearn.svm import SVR
svr_sklearn = SVR(kernel="poly",degree=4)
svr_sklearn.random_state = seed  # note: sklearn's SVR takes no random_state parameter, so this attribute is ignored
model = dc.models.SklearnModel(svr_sklearn)
model.fit(train)
metric = dc.metrics.Metric(dc.metrics.mae_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])
print("Train score is : {}".format(train_score))
print("Test score is : {}".format(test_score))
SupportVectorMachineRegressor
Train score is : {'mae_score': 3.275727325767219}
Test score is : {'mae_score': 4.058136267284038}
Congratulations! Time to join the Community! ¶
Congratulations on completing this tutorial notebook! If you enjoyed working through the tutorial, and want to continue working with DeepChem, we encourage you to finish the rest of the tutorials in this series. You can also help the DeepChem community in the following ways:
Star DeepChem on GitHub ¶
This helps build awareness of the DeepChem project and the tools for open source drug discovery that we're trying to build.
Join the DeepChem Gitter ¶
The DeepChem Gitter hosts a number of scientists, developers, and enthusiasts interested in deep learning for the life sciences. Join the conversation!