Tutorials

1. Introduction To Deepchem

1. The Basic Tools of the Deep Life Sciences

2. Working With Datasets

3. An Introduction To MoleculeNet

4. Molecular Fingerprints

5. Creating Models with TensorFlow and PyTorch

6. Introduction to Graph Convolutions

7. Going Deeper on Molecular Featurizations

8. Working With Splitters

9. Advanced Model Training

10. Creating a high fidelity model from experimental data

11. Putting Multitask Learning to Work

12. Modeling Protein Ligand Interactions

13. Modeling Protein Ligand Interactions With Atomic Convolutions

14. Conditional Generative Adversarial Networks

15. Training a Generative Adversarial Network on MNIST

16. Advanced model training using hyperopt

17. Introduction to Gaussian Processes

18. PytorchLightning Integration

19. Compiling DeepChem Torch Models

2. Molecular Machine Learning

1. Molecular Fingerprints

2. Going Deeper on Molecular Featurizations

3. Learning Unsupervised Embeddings for Molecules

4. Synthetic Feasibility Scoring

5. Atomic Contributions for Molecules

6. Interactive Model Evaluation with Trident Chemwidgets

7. Transfer Learning With ChemBERTa Transformers

8. Training a Normalizing Flow on QM9

9. Large Scale Chemical Screens

10. Introduction to Molecular Attention Transformer

11. Generating molecules with MolGAN

12. Introduction to GROVER

13. Introduction to PROTACs

14. Druggablity Assessment with Fpocket and Machine Learning

3. Modeling Proteins

1. Protein Deep Learning

2. DeepChem AntibodyTutorial Simplified

3. Protein Structure Prediction with ESMFold

4. Introduction to ProtBERT

5. ProteinLM Tutorial0

4. Protein Ligand Modeling

1. Introduction to Binding Sites

2. Modeling Protein Ligand Interactions

3. Modeling Protein Ligand Interactions With Atomic Convolutions

4. DeepChemXAlphafold

5. UniProt Data Preprocessing for Binding Sites

5. Quantum Chemistry

1. Exploring Quantum Chemistry with GDB1k

2. DeepQMC tutorial

3. Training an Exchange Correlation Functional using Deepchem

6. Bioinformatics

1. Introduction to Bioinformatics

2. Multisequence Alignments

3. Scanpy

4. Deep probabilistic analysis of single-cell omics data

5. Cell Counting Tutorial

7. Material Science

1. Introduction To Material Science

8. Machine Learning Methods

1. Using Reinforcement Learning to Play Pong

2. Introduction to Model Interpretability

3. Uncertainty In Deep Learning

9. Deep Differential Equations

1. Physics Informed Neural Networks

2. Introducing JaxModel and PINNModel

3. About nODE Using Torchdiffeq in Deepchem

4. Differentiation Infrastructure in Deepchem

5. Ordinary Differential Equation Solving using deepchem

10. Equivariance

1. Introduction to Equivariance

11. Olfaction

1. Predict Multi Label Odor Descriptors using OpenPOM

12. Polymer Science

1. An Introduction to the Polymers and Their Representation

2. crystallization tendency regression

3. Introduction to Polymer SMILES

4. Understanding Weighted Directed Graphs for Polymer Implimentations

5. PolyBERT Introduction to Chemical Language Models for Fingerprint Generation

ScanPy ¶

ScanPy is a scalable toolkit for analyzing single-cell gene expression data. It includes methods for preprocessing, visualization, clustering, pseudotime and trajectory inference, differential expression testing, and simulation of gene regulatory networks. There are many advantage of using a Python-based platform to process scRNA-seq data including increased processing efficiency and running speed as well as seamless integration with machine learning frameworks.

ANNDATA was presented alongside ScanPy as a generic class for handling annotated data matrices that can deal with the sparsity inherent in gene expression data.

This tutorial is largely adapted from the original tutorials which can be found in Scanpy's read the docs and from this notebook.

Colab ¶

This tutorial and the rest in this sequence can be done in Google colab. If you'd like to open this notebook in colab, you can use the following link.

Make sure you've installed Scanpy:

In [2]:

%conda install -c conda-forge scanpy python-igraph leidenalg

Collecting package metadata (current_repodata.json): done
Solving environment: done

# All requested packages already installed.


Note: you may need to restart the kernel to use updated packages.

In [3]:

# import necessary packages
import numpy as np
import pandas as pd
import scanpy as sc

In [4]:

sc.settings.verbosity = 3   # verbosity: errors (0), warnings (1), info (2), hints (3)

In [ ]:

# see what package versions you have installed 
sc.logging.print_versions()
# customize resolution and color of your figures
sc.settings.set_figure_params(dpi=80, facecolor='white')

In [ ]:

# download the test data for this tutorial
!mkdir data
!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
!cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
!mkdir write

In [7]:

results_file = 'write/pbmc3k.h5ad' # the file that will store the analysis results

In [8]:

adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)                              # write a cache file for faster subsequent reading

... reading from cache file cache/data-filtered_gene_bc_matrices-hg19-matrix.h5ad

See [anndata-tutorials/getting-started] for a more comprehensive introduction to AnnData.

In [9]:

adata.var_names_make_unique()  # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`

In [10]:

# look at what the AnnData object contains
adata

Out[10]:

AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'

Pre-processing ¶

Check for highly expressed genes ¶

Show genes that yield the highest fraction of counts in each single cell, across all cells. The sc.pl.highest_expr_genes command normalizes counts per cell, and plots the genes that are most abundant in each cell.

In [11]:

sc.pl.highest_expr_genes(adata, n_top=20, )

normalizing counts per cell
    finished (0:00:00)

No description has been provided for this image

Note that MALAT1, a non-coding RNA that is known to be extremely abundant in many cells, ranks at the top.

Basic filtering: remove cells and genes with low expression or missing values. ¶

In [12]:

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

filtered out 19024 genes that are detected in less than 3 cells

Check mitochondrial genes for Quality Control ¶

Let’s assemble some information about mitochondrial genes, which are important for quality control.

Citing from “Simple Single Cell” workflows (Lun, McCarthy & Marioni, 2017):

High proportions are indicative of poor-quality cells (Islam et al. 2014; Ilicic et al. 2016), possibly because of loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and less likely to escape through tears in the cell membrane.

With pp.calculate_qc_metrics , we can compute many metrics very efficiently.

In [13]:

adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)

A violin plot of some of the computed quality measures:

the number of genes expressed in the count matrix
the total counts per cell
the percentage of counts in mitochondrial genes

In [14]:

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

Remove cells that have too many mitochondrial genes expressed or too many total counts. High proportions of mitochondrial genes indicate poor-quality cells, potentially because of loss of cytoplasmic RNA from perforated cells.

In [15]:

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

Filter data based on QC ¶

Check current datset and filter it by slicing the AnnData object.

In [16]:

print(adata)

AnnData object with n_obs × n_vars = 2700 × 13714
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

In [17]:

# slice the adata object so you only keep genes and cells that pass the QC
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
print(adata)

View of AnnData object with n_obs × n_vars = 2638 × 13714
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts'

Data normalization ¶

To correct differences in library sizes across cells, normalize the total read count of the data matrix to 10,000 reads per cell so that counts become comparable among cells.

In [18]:

sc.pp.normalize_total(adata, target_sum=1e4)

normalizing counts per cell
    finished (0:00:00)

/Users/paulinampaiz/opt/anaconda3/envs/deepchem/lib/python3.10/site-packages/scanpy/preprocessing/_normalization.py:170: UserWarning: Received a view of an AnnData. Making a copy.
  view_to_actual(adata)

Log transform the data for later use in differential gene expression as well as in visualizations. The natural logarithm is used, and log1p means that an extra read is added to cells of the count matrix as a pseudo-read. See here for more information on why log scale makes more sense for genomic data.

In [19]:

sc.pp.log1p(adata)

Identify highly-variable genes. ¶

The function sc.pp.highly_variable_genes can detect marker genes that can help us identify cells based on a few manually set parameters, including mininum mean expression, maximum mean expression, and minimum dispersion. We will focus our analysis on such genes.

In [20]:

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)

extracting highly variable genes
    finished (0:00:00)
--> added
    'highly_variable', boolean vector (adata.var)
    'means', float vector (adata.var)
    'dispersions', float vector (adata.var)
    'dispersions_norm', float vector (adata.var)

In [21]:

# visualize the highly variable genes with a plot 
sc.pl.highly_variable_genes(adata)

Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.

In [22]:

adata.raw = adata

Filter the adata object so that only genes that are highly variable are kept.

In [23]:

adata = adata[:, adata.var.highly_variable]

Correct for the effects of counts per cell and mitochondrial gene expression ¶

Regress out effects of total counts per cell and the percentage of mitochondrial genes expressed. This can consume some memory and take some time because the input data is sparse.

In [24]:

sc.pp.regress_out(adata, ['n_genes_by_counts', 'pct_counts_mt'])

regressing out ['n_genes_by_counts', 'pct_counts_mt']
    sparse input is densified and may lead to high memory use
    finished (0:00:11)

Center the data to zero and scale to unit variance ¶

Use sc.pp.scale to center the average expression per gene to zero. Here we are also clipping scaled values that exceed standard deviation of 10.

In [25]:

sc.pp.scale(adata, max_value=10)

Dimension reduction with PCA ¶

We first use principal component analysis (PCA), a linear dimention-reduction technique, to reveal the main axes of variation and denoise the data.

In [26]:

sc.tl.pca(adata, svd_solver='arpack')

computing PCA
    on highly variable genes
    with n_comps=50
    finished (0:00:00)

We can make a scatter plot in the PCA coordinates, but we will not use that later on.

In [27]:

sc.pl.pca(adata, color="CST3")

The variance ratio plot lists contributions of individual principal components (PC) to the total variance in the data. This piece of information helps us to choose an appropriate number of PCs in order to compute the neighborhood relationships between the cells, for instance, using the clustering method Louvain sc.tl.louvain() or the embedding method tSNE sc.tl.tsne() for dimension-reduction.

According to the authors of Scanpy, a rough estimate of the number of PCs does fine.

In [28]:

sc.pl.pca_variance_ratio(adata, log=True)

Save the result up to PCA analysis.

In [29]:

! mkdir -p write
adata.write(results_file)

Note that our adata object has following elements: observations annotation (obs), variables (var), unstructured annotation (uns), multi-dimensional observations annotation (obsm), and multi-dimensional variables annotation (varm). The meanings of these parameters are documented in the anndata package, available at anndata documentation .

In [30]:

adata

Out[30]:

AnnData object with n_obs × n_vars = 2638 × 1838
    obs: 'n_genes', 'n_genes_by_counts', 'total_counts', 'total_counts_mt', 'pct_counts_mt'
    var: 'gene_ids', 'n_cells', 'mt', 'n_cells_by_counts', 'mean_counts', 'pct_dropout_by_counts', 'total_counts', 'highly_variable', 'means', 'dispersions', 'dispersions_norm', 'mean', 'std'
    uns: 'log1p', 'hvg', 'pca'
    obsm: 'X_pca'
    varm: 'PCs'

Computing and Embedding the neighborhood graph ¶

Use the PCA representation of the data matrix to compute the neighborhood graph of cells.

In [ ]:

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)

The auhours of Scanpy suggest embedding the graph in two dimensions using UMAP ( McInnes et al., 2018 ). UMAP is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories.

In [ ]:

sc.tl.umap(adata)

In [ ]:

sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])

As we set the .raw attribute of adata, the previous plots showed the “raw” (normalized, logarithmized, but uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you don’t want to use .raw .

In [ ]:

sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)

In some ocassions, you might still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:

In [ ]:

tl.paga(adata)
pl.paga(adata, plot=False)  # remove `plot=False` if you want to see the coarse-grained graph
tl.umap(adata, init_pos='paga')

Clustering the neighborhood graph ¶

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018) . Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section. Compared with the Louvain algorithm, the Leiden algorithm yields communities that are guaranteed to be connected. When applied iteratively, the Leiden algorithm converges to a partition in which all subsets of all communicities are locally optimally assigned. Last but not least, it runs faster.

In [ ]:

sc.tl.leiden(adata)

Plot the clusters using sc.pl.umap . Note that the color parameter accepts both individual genes and the clustering method (leiden in this case).

In [ ]:

sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])

We save the result again.

In [ ]:

adata.write(results_file)

Identifying marker genes ¶

Compute a ranking for the highly differential genes in each cluster. For this, by default, the .raw attribute of AnnData is used in case it has been initialized before. The simplest and fastest method to do so is the t-test. Other methods include Wilcoxon rank-sum (Mann-Whitney-U) test, MAST, limma, DESeq2, and diffxpy by the Theis lab. The authours of Scanpy reccomend using the Wilcoxon rank-sum test in publications.

The Wilcoxon's test ¶

For simplicity, we start with the Mann-Whitney-U test. The null hypothesis is that the rank of a gene in a cluster is the same as its rank in all cells. The alternative hypothesis is that the rank of a gene in a cluster is much higher than its rank in all cells (one-sided). The function sc.tl.rank_genes_groups performs the test, and sc.pl.rank_genes_groups plots the top genes.

In [ ]:

sc.tl.rank_genes_groups(adata, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)

In [ ]:

sc.settings.verbosity = 2  # reduce the verbosity

In [ ]:

adata.write(results_file) # write the output to the results file

The Student's t-test ¶

An alternative to the non-parametric Wilcoxon test is the t-test.

In [ ]:

sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)

As an alternative, let us rank genes using logistic regression. For instance, this has been suggested by Natranos et al. (2018). The essential difference is that here, we use a multi-variate appraoch whereas conventional differential tests are uni-variate. Clark et al. (2014) has more details.

In [ ]:

sc.tl.rank_genes_groups(adata, 'leiden', method="logreg")
sc.pl.rank_genes_groups(adata, n_genes=15, sharey=False)

Let us also define a list of marker genes for later reference.

In [ ]:

marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

Listing signatures using the results of the Wilcoxon's test ¶

We use the results of the Wilcoxon's test for downstream analysis. Reload the object that has been save with the Wilcoxon Rank-Sum test result.

In [ ]:

adata = sc.read(results_file)

Show the 10 top ranked genes per cluster 0, 1, …, 7 in a dataframe.

In [ ]:

pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(5)

Get a table with the scores and groups.

In [ ]:

result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).head(5)

Compare to a single cluster:

In [ ]:

sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)

If we want a more detailed view for a certain group, use sc.pl.rank_genes_groups_violin .

In [ ]:

sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)

If you want to compare a certain gene across groups, use the following.

In [ ]:

sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')

Actually mark the cell types.

In [ ]:

new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)

In [ ]:

# plot the UMAP
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')

Now that we annotated the cell types, let us visualize the marker genes.

In [ ]:

sc.pl.dotplot(adata, marker_genes, groupby='leiden');

There is also a very compact violin plot.

In [ ]:

sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90);

During the course of this analysis, the AnnData accumulated the following annotations.

In [ ]:

adata

If you want to save your results:

In [ ]:

adata.write(results_file, compression='gzip')  # `compression='gzip'` saves disk space, but slows down writing and subsequent reading

Get a rough overview of the file using h5ls, which has many options - for more details see here. The file format might still be subject to further optimization in the future. All reading functions will remain backwards-compatible, though.

If you want to share this file with people who merely want to use it for visualization, a simple way to reduce the file size is by removing the dense scaled and corrected data matrix. The file still contains the raw data used in the visualizations in adata.raw .

In [ ]:

adata.raw.to_adata().write('./write/pbmc3k_withoutX.h5ad')

If you want to export to “csv”, you have the following options:

In [ ]:

# Export single fields of the annotation of observations
# adata.obs[['n_counts', 'louvain_groups']].to_csv(
#     './write/pbmc3k_corrected_louvain_groups.csv')

# Export single columns of the multidimensional annotation
# adata.obsm.to_df()[['X_pca1', 'X_pca2']].to_csv(
#     './write/pbmc3k_corrected_X_pca.csv')

# Or export everything except the data using `.write_csvs`.
# Set `skip_data=False` if you also want to export the data.
# adata.write_csvs(results_file[:-5], )