Tutorials

1. Introduction To Deepchem

1. The Basic Tools of the Deep Life Sciences

2. Working With Datasets

3. An Introduction To MoleculeNet

4. Molecular Fingerprints

5. Creating Models with TensorFlow and PyTorch

6. Introduction to Graph Convolutions

7. Going Deeper on Molecular Featurizations

8. Working With Splitters

9. Advanced Model Training

10. Creating a high fidelity model from experimental data

11. Putting Multitask Learning to Work

12. Modeling Protein Ligand Interactions

13. Modeling Protein Ligand Interactions With Atomic Convolutions

14. Conditional Generative Adversarial Networks

15. Training a Generative Adversarial Network on MNIST

16. Advanced model training using hyperopt

17. Introduction to Gaussian Processes

18. PytorchLightning Integration

19. Compiling DeepChem Torch Models

2. Molecular Machine Learning

1. Molecular Fingerprints

2. Going Deeper on Molecular Featurizations

3. Learning Unsupervised Embeddings for Molecules

4. Synthetic Feasibility Scoring

5. Atomic Contributions for Molecules

6. Interactive Model Evaluation with Trident Chemwidgets

7. Transfer Learning With ChemBERTa Transformers

8. Training a Normalizing Flow on QM9

9. Large Scale Chemical Screens

10. Introduction to Molecular Attention Transformer

11. Generating molecules with MolGAN

12. Introduction to GROVER

13. Introduction to PROTACs

14. Druggablity Assessment with Fpocket and Machine Learning

3. Modeling Proteins

1. Protein Deep Learning

2. DeepChem AntibodyTutorial Simplified

3. Protein Structure Prediction with ESMFold

4. Introduction to ProtBERT

5. ProteinLM Tutorial0

4. Protein Ligand Modeling

1. Introduction to Binding Sites

2. Modeling Protein Ligand Interactions

3. Modeling Protein Ligand Interactions With Atomic Convolutions

4. DeepChemXAlphafold

5. UniProt Data Preprocessing for Binding Sites

5. Quantum Chemistry

1. Exploring Quantum Chemistry with GDB1k

2. DeepQMC tutorial

3. Training an Exchange Correlation Functional using Deepchem

6. Bioinformatics

1. Introduction to Bioinformatics

2. Multisequence Alignments

3. Scanpy

4. Deep probabilistic analysis of single-cell omics data

5. Cell Counting Tutorial

7. Material Science

1. Introduction To Material Science

8. Machine Learning Methods

1. Using Reinforcement Learning to Play Pong

2. Introduction to Model Interpretability

3. Uncertainty In Deep Learning

9. Deep Differential Equations

1. Physics Informed Neural Networks

2. Introducing JaxModel and PINNModel

3. About nODE Using Torchdiffeq in Deepchem

4. Differentiation Infrastructure in Deepchem

5. Ordinary Differential Equation Solving using deepchem

10. Equivariance

1. Introduction to Equivariance

11. Olfaction

1. Predict Multi Label Odor Descriptors using OpenPOM

12. Polymer Science

1. An Introduction to the Polymers and Their Representation

2. crystallization tendency regression

3. Introduction to Polymer SMILES

4. Understanding Weighted Directed Graphs for Polymer Implimentations

5. PolyBERT Introduction to Chemical Language Models for Fingerprint Generation

Deep Probabilistic Analysis of Single-Cell Omics Data ¶

Recordings at single-cell resolution can give us a better understanding of the biological differences in our sample. As sequencing technologies and instruments have become better and cheaper, generating single-cell data is becoming more popular. In order to derive meaningful biological insights, it is important to select reliable analysis tools such as the one we will cover in this tutorial.

scvi-tools (single-cell variational inference tools) is a package for probabilistic modeling and analysis of single-cell omics data, built on top of PyTorch and AnnData that aims to address some of the limitations that arise when developing and implementing probabilistic models. scvi-tools is used in tandem with Scanpy for which Deepchem also offers a tutorial. In the broader analysis pipeline, scVI sits downstream of initial quality control (QC)-driven preprocessing and generates outputs that may be further interpreted via general single-cell analysis tools.

In this introductory tutorial, we go through the different steps of an scvi-tools workflow. While we focus on scVI in this tutorial, the API is consistent across all models. Please note that this tutorial was largely adapted from the one provided by scvi-tools and you can head to their page to find more information.

Colab ¶

This tutorial and the rest in this sequence are designed to be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

In [ ]:

# install necessary packages
!pip install --quiet scvi-colab
from scvi_colab import install
install()

In [ ]:

import scvi
import scanpy as sc
import matplotlib.pyplot as plt

# set preferences for figures and plots
sc.set_figure_params(figsize=(4, 4))

# for white background of figures (only for docs rendering)
%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

Global seed set to 0

Loading and preparing data ¶

Let us first load a subsampled version of the heart cell atlas dataset described in Litviňuková et al. (2020). scvi-tools has many "built-in" datasets as well as support for loading arbitrary .csv , .loom , and .h5ad (AnnData) files. Please see our tutorial on data loading for more examples.

Litviňuková, M., Talavera-López, C., Maatz, H., Reichart, D., Worth, C. L., Lindberg, E. L., ... & Teichmann, S. A. (2020). Cells of the adult human heart. Nature, 588(7838), 466-472.

Important

All scvi-tools models require AnnData objects as input.

In [ ]:

adata = scvi.data.heart_cell_atlas_subsampled()

INFO     File data/hca_subsampled_20k.h5ad already downloaded

Now we preprocess the data to remove, for example, genes that are very lowly expressed and other outliers. For these tasks we prefer the Scanpy preprocessing module .

In [ ]:

sc.pp.filter_genes(adata, min_counts=3)

In scRNA-seq analysis, it's popular to normalize the data. These values are not used by scvi-tools, but given their popularity in other tasks as well as for visualization, we store them in the anndata object separately (via the .raw attribute).

Important

Unless otherwise specified, scvi-tools models require the raw counts (not log library size normalized).

In [ ]:

adata.layers["counts"] = adata.X.copy() # preserve counts
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata # freeze the state in `.raw`

Finally, we perform feature selection, to reduce the number of features (genes in this case) used as input to the scvi-tools model. For best practices of how/when to perform feature selection, please refer to the model-specific tutorial. For scVI, we recommend anywhere from 1,000 to 10,000 HVGs, but it will be context-dependent.

In [ ]:

sc.pp.highly_variable_genes(
    adata, 
    n_top_genes=1200, 
    subset=True, 
    layer="counts", 
    flavor="seurat_v3",
    batch_key="cell_source"
)

Now it's time to run setup_anndata() , which alerts scvi-tools to the locations of various matrices inside the anndata. It's important to run this function with the correct arguments so scvi-tools is notified that your dataset has batches, annotations, etc. For example, if batches are registered with scvi-tools, the subsequent model will correct for batch effects. See the full documentation for details.

In this dataset, there is a "cell_source" categorical covariate, and within each "cell_source", multiple "donors", "gender" and "age_group". There are also two continuous covariates we'd like to correct for: "percent_mito" and "percent_ribo". These covariates can be registered using the categorical_covariate_keys argument. If you only have one categorical covariate, you can also use the batch_key argument instead.

In [ ]:

scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    categorical_covariate_keys=["cell_source", "donor"],
    continuous_covariate_keys=["percent_mito", "percent_ribo"]
)

Warning

If the adata is modified after running setup_anndata , please run setup_anndata again, before creating an instance of a model.

Creating and training a model ¶

While we highlight the scVI model here, the API is consistent across all scvi-tools models and is inspired by that of scikit-learn . For a full list of options, see the scvi documentation .

In [ ]:

model = scvi.model.SCVI(adata)

We can see an overview of the model by printing it.

In [ ]:

model

SCVI Model with the following params: 
n_hidden: 128, n_latent: 10, n_layers: 1, dropout_rate: 0.1, dispersion: gene, 
gene_likelihood: zinb, latent_distribution: normal
Training status: Not Trained

Out[ ]:

Important

All scvi-tools models run faster when using a GPU. By default, scvi-tools will use a GPU if one is found to be available. Please see the installation page for more information about installing scvi-tools when a GPU is available.

In [ ]:

model.train()

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

Epoch 400/400: 100%|██████████| 400/400 [05:43<00:00,  1.16it/s, loss=284, v_num=1]

Saving and loading ¶

Saving consists of saving the model neural network weights, as well as parameters used to initialize the model.

In [ ]:

# model.save("my_model/")

In [ ]:

# model = scvi.model.SCVI.load("my_model/", adata=adata, use_gpu=True)

Obtaining model outputs ¶

In [ ]:

latent = model.get_latent_representation()

It's often useful to store the outputs of scvi-tools back into the original anndata, as it permits interoperability with Scanpy.

In [ ]:

adata.obsm["X_scVI"] = latent

The model.get...() functions default to using the anndata that was used to initialize the model. It's possible to also query a subset of the anndata, or even use a completely independent anndata object as long as the anndata is organized in an equivalent fashion.

In [ ]:

adata_subset = adata[adata.obs.cell_type == "Fibroblast"]
latent_subset = model.get_latent_representation(adata_subset)

INFO     Received view of anndata, making copy.

INFO:scvi.model.base._base_model:Received view of anndata, making copy.

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

INFO:scvi.model.base._base_model:Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

In [ ]:

denoised = model.get_normalized_expression(adata_subset, library_size=1e4)
denoised.iloc[:5, :5]

INFO     Received view of anndata, making copy.

INFO:scvi.model.base._base_model:Received view of anndata, making copy.

INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

INFO:scvi.model.base._base_model:Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup

Out[ ]:

	ISG15	TNFRSF18	VWA1	HES5	SPSB1
GTCAAGTCATGCCACG-1-HCAHeart7702879	0.343808	0.118372	1.945633	0.062702	4.456603
GAGTCATTCTCCGTGT-1-HCAHeart8287128	1.552080	0.275485	1.457585	0.013442	14.617260
CCTCTGATCGTGACAT-1-HCAHeart7702881	5.157080	0.295140	1.195748	0.143792	2.908867
CGCCATTCATCATCTT-1-H0035_apex	0.352172	0.019281	0.570386	0.105633	6.325405
TCGTAGAGTAGGACTG-1-H0015_septum	0.290155	0.040910	0.400771	0.723409	8.142258

Let's store the normalized values back in the anndata.

In [ ]:

adata.layers["scvi_normalized"] = model.get_normalized_expression(
    library_size=10e4
)

Interoperability with Scanpy ¶

Scanpy is a powerful python library for visualization and downstream analysis of scRNA-seq data. We show here how to feed the objects produced by scvi-tools into a scanpy workflow.

Visualization without batch correction ¶

Warning

We use UMAP to qualitatively assess our low-dimension embeddings of cells. We do not advise using UMAP or any similar approach quantitatively. We do recommend using the embeddings produced by scVI as a plug-in replacement of what you would get from PCA, as we show below.

First, we demonstrate the presence of nuisance variation with respect to nuclei/whole cell, age group, and donor by plotting the UMAP results of the top 30 PCA components for the raw count data.

In [ ]:

# run PCA then generate UMAP plots
sc.tl.pca(adata)
sc.pp.neighbors(adata, n_pcs=30, n_neighbors=20)
sc.tl.umap(adata, min_dist=0.3)

In [ ]:

sc.pl.umap(
    adata, 
    color=["cell_type"], 
    frameon=False,
)
sc.pl.umap(
    adata, 
    color=["donor", "cell_source"],
    ncols=2,
    frameon=False,
)

No description has been provided for this image

We see that while the cell types are generally well separated, nuisance variation plays a large part in the variation of the data.

Visualization with batch correction (scVI) ¶

Now, let us try using the scVI latent space to generate the same UMAP plots to see if scVI successfully accounts for batch effects in the data.

In [ ]:

# use scVI latent space for UMAP generation
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata, min_dist=0.3)

In [ ]:

sc.pl.umap(
    adata, 
    color=["cell_type"], 
    frameon=False,
)
sc.pl.umap(
    adata, 
    color=["donor", "cell_source"],
    ncols=2,
    frameon=False,
)

We can see that scVI was able to correct for nuisance variation due to nuclei/whole cell, age group, and donor, while maintaining separation of cell types.

Clustering on the scVI latent space ¶

The user will note that we imported curated labels from the original publication. Our interface with scanpy makes it easy to cluster the data with scanpy from scVI's latent space and then reinject them into scVI (e.g., for differential expression).

In [ ]:

# neighbors were already computed using scVI
sc.tl.leiden(adata, key_added="leiden_scVI", resolution=0.5)

In [ ]:

sc.pl.umap(
    adata, 
    color=["leiden_scVI"], 
    frameon=False,
)

Differential expression ¶

We can also use many scvi-tools models for differential expression. For further details on the methods underlying these functions as well as additional options, please see the API docs .

In [ ]:

adata.obs.cell_type.head()

Out[ ]:

AACTCCCCACGAGAGT-1-HCAHeart7844001                      Myeloid
ATAACGCAGAGCTGGT-1-HCAHeart7829979    Ventricular_Cardiomyocyte
GTCAAGTCATGCCACG-1-HCAHeart7702879                   Fibroblast
GGTGATTCAAATGAGT-1-HCAHeart8102858                  Endothelial
AGAGAATTCTTAGCAG-1-HCAHeart8102863                  Endothelial
Name: cell_type, dtype: category
Categories (11, object): ['Adipocytes', 'Atrial_Cardiomyocyte', 'Endothelial', 'Fibroblast', ..., 'Neuronal', 'Pericytes', 'Smooth_muscle_cells', 'Ventricular_Cardiomyocyte']

For example, a 1-vs-1 DE test is as simple as:

In [ ]:

de_df = model.differential_expression(
    groupby="cell_type", 
    group1="Endothelial", 
    group2="Fibroblast"
)
de_df.head()

DE...: 100%|██████████| 1/1 [00:00<00:00,  3.07it/s]

Out[ ]:

	proba_de	proba_not_de	bayes_factor	scale1	scale2	delta	lfc_mean	lfc_median	lfc_std	...	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2	is_de_fdr_0.05	comparison	group1	group2
SOX17	0.9998	0.0002	8.516943	0.001615	0.000029	0.25	6.222365	6.216846	1.967564	...	0.784371	0.006541	0.307617	0.004497	17.128170	0.185868	True	Endothelial vs Fibroblast	Endothelial	Fibroblast
SLC9A3R2	0.9996	0.0004	7.823621	0.010660	0.000171	0.25	5.977907	6.049340	1.672150	...	4.451492	0.045380	0.712339	0.034342	111.582703	1.657324	True	Endothelial vs Fibroblast	Endothelial	Fibroblast
ABCA10	0.9990	0.0010	6.906745	0.000081	0.006355	0.25	-8.468659	-9.058912	2.959383	...	0.014359	1.282508	0.007788	0.499591	0.501825	69.613876	True	Endothelial vs Fibroblast	Endothelial	Fibroblast
EGFL7	0.9986	0.0014	6.569875	0.008471	0.000392	0.25	4.751251	4.730982	1.546327	...	2.376779	0.036795	0.741543	0.025756	89.507553	1.169474	True	Endothelial vs Fibroblast	Endothelial	Fibroblast
VWF	0.9984	0.0016	6.436144	0.014278	0.000553	0.25	5.013347	5.029471	1.758744	...	5.072563	0.054375	0.808226	0.032298	169.693512	2.207696	True	Endothelial vs Fibroblast	Endothelial	Fibroblast

5 rows × 22 columns

We can also do a 1-vs-all DE test, which compares each cell type with the rest of the dataset:

In [ ]:

de_df = model.differential_expression(
    groupby="cell_type", 
)
de_df.head()

DE...: 100%|██████████| 11/11 [00:03<00:00,  3.08it/s]

Out[ ]:

	proba_de	proba_not_de	bayes_factor	scale1	scale2	delta	lfc_mean	lfc_median	lfc_std	...	raw_mean1	raw_mean2	non_zeros_proportion1	non_zeros_proportion2	raw_normalized_mean1	raw_normalized_mean2	is_de_fdr_0.05	comparison	group1	group2
CIDEC	0.9988	0.0012	6.724225	0.002336	0.000031	0.25	7.082959	7.075700	2.681833	...	1.137931	0.001406	0.510345	0.001406	21.809771	0.057564	True	Adipocytes vs Rest	Adipocytes	Rest
ADIPOQ	0.9988	0.0012	6.724225	0.003627	0.000052	0.25	7.722131	7.461277	3.332577	...	2.324136	0.003622	0.593103	0.003352	33.361748	0.217763	True	Adipocytes vs Rest	Adipocytes	Rest
GPAM	0.9986	0.0014	6.569875	0.025417	0.000202	0.25	7.365266	7.381156	2.562121	...	17.372416	0.035791	0.896552	0.031520	280.340485	1.565905	True	Adipocytes vs Rest	Adipocytes	Rest
PLIN1	0.9984	0.0016	6.436144	0.004482	0.000048	0.25	7.818194	7.579515	2.977385	...	2.799999	0.004379	0.806897	0.004325	52.921761	0.196486	True	Adipocytes vs Rest	Adipocytes	Rest
GPD1	0.9974	0.0026	5.949637	0.002172	0.000044	0.25	6.543847	6.023436	2.865962	...	1.758620	0.007353	0.620690	0.007299	30.203791	0.255079	True	Adipocytes vs Rest	Adipocytes	Rest

5 rows × 22 columns

We now extract top markers for each cluster using the DE results.

In [ ]:

markers = {}
cats = adata.obs.cell_type.cat.categories
for i, c in enumerate(cats):
    cid = "{} vs Rest".format(c)
    cell_type_df = de_df.loc[de_df.comparison == cid]

    cell_type_df = cell_type_df[cell_type_df.lfc_mean > 0]

    cell_type_df = cell_type_df[cell_type_df["bayes_factor"] > 3]
    cell_type_df = cell_type_df[cell_type_df["non_zeros_proportion1"] > 0.1]

    markers[c] = cell_type_df.index.tolist()[:3]

In [ ]:

sc.tl.dendrogram(adata, groupby="cell_type", use_rep="X_scVI")

In [ ]:

sc.pl.dotplot(
    adata, 
    markers, 
    groupby='cell_type', 
    dendrogram=True, 
    color_map="Blues",
    swap_axes=True,
    use_raw=True,
    standard_scale="var",
)

We can also visualize the scVI normalized gene expression values with the layer option.

In [ ]:

sc.pl.heatmap(
    adata, 
    markers, 
    groupby='cell_type', 
    layer="scvi_normalized",
    standard_scale="var",
    dendrogram=True,
    figsize=(8, 12)
)

Logging information ¶

Verbosity varies in the following way:

logger.setLevel(logging.WARNING) will show a progress bar.
logger.setLevel(logging.INFO) will show global logs including the number of jobs done.
logger.setLevel(logging.DEBUG) will show detailed logs for each training (e.g the parameters tested).

This function's behaviour can be customized, please refer to its documentation for information about the different parameters available.

In general, you can use scvi.settings.verbosity to set the verbosity of the scvi package. Note that verbosity corresponds to the logging levels of the standard python logging module. By default, that verbosity level is set to INFO (=20). As a reminder the logging levels are:

Level	Numeric value
`CRITICAL`	50
`ERROR`	40
`WARNING`	30
`INFO`	20
`DEBUG`	10
`NOTSET`	0

Reference ¶

If you use scvi-tools in your research, please consider citing

In [ ]:

@article{Gayoso2022,
         author={Gayoso, Adam and Lopez, Romain and Xing, Galen and Boyeau, Pierre and Valiollah Pour Amiri, Valeh and Hong, Justin and Wu, Katherine and Jayasuriya, Michael and   Mehlman, Edouard and Langevin, Maxime and Liu, Yining and Samaran, Jules and Misrachi, Gabriel and Nazaret, Achille and Clivio, Oscar and Xu, Chenling and Ashuach, Tal and Gabitto, Mariano and Lotfollahi, Mohammad and Svensson, Valentine and da Veiga Beltrame, Eduardo and Kleshchevnikov, Vitalii and Talavera-L{\'o}pez, Carlos and Pachter, Lior and Theis, Fabian J. and Streets, Aaron and Jordan, Michael I. and Regier, Jeffrey and Yosef, Nir},
         title={A Python library for probabilistic analysis of single-cell omics data},
         journal={Nature Biotechnology},
         year={2022},
         month={Feb},
         day={07},
         issn={1546-1696},
         doi={10.1038/s41587-021-01206-w},
         url={https://doi.org/10.1038/s41587-021-01206-w}
}

along with the publicaton describing the model used.

This tutorial was contributed to Deepchem by:

In [ ]:

@manual{Bioinformatics, 
 title={Deep Probabilistic Analysis of Single-Cell Omics Data}, 
 organization={DeepChem},
 author={Paiz, Paulina}, 
 howpublished = {\url{https://github.com/deepchem/deepchem/blob/master/examples/tutorials/Deep_probabilistic_analysis_of_single-cell_omics_data.ipynb}}, 
 year={2022}, 
}