Seurat single-cell transcriptome sequencing data analysis tutorial (2) - python (scanpy)
This article follows the official scanpy tutorial; see the scanpy website for a more detailed explanation.
Data consist of 3k PBMC from healthy donors and are freely available from 10x Genomics. On unix systems, you can uncomment and run the following commands to download and unpack the data. The last line creates a directory to which the processed data will be written.
# !mkdir data
# !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir write
Import Data
import numpy as np
import pandas as pd
import scanpy as sc
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')
Read the count matrix into an AnnData object, which contains a number of slots for annotations and different data representations. It also comes with its own HDF5-based file format.
adata = sc.read_10x_mtx(
'data/filtered_gene_bc_matrices/hg19/', # the directory with the `.mtx` file
var_names='gene_symbols', # use gene symbols for the variable names (variables-axis index)
cache=True) # write a cache file for faster subsequent reading
adata.var_names_make_unique() # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
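The step above is needed because several gene IDs can share one gene symbol. As a rough illustration only, the sketch below de-duplicates a list of names by leaving the first occurrence untouched and suffixing later ones. The helper make_names_unique and the example symbols are hypothetical, and the suffix scheme is an assumption about the general idea, not a copy of AnnData's implementation (for instance, it does not handle a suffixed name that already exists).

```python
from collections import Counter


def make_names_unique(names, join="-"):
    """Hypothetical helper: keep the first occurrence, suffix later duplicates."""
    seen = Counter()
    unique = []
    for name in names:
        if seen[name] == 0:
            unique.append(name)  # first occurrence stays as it is
        else:
            unique.append(f"{name}{join}{seen[name]}")  # later ones get -1, -2, ...
        seen[name] += 1
    return unique


symbols = ["CD8A", "MT-ND1", "CD8A", "CD8A"]  # made-up duplicates for illustration
unique_symbols = make_names_unique(symbols)
```

With var_names='gene_ids' the names are already unique, which is why the call is unnecessary in that case.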
Preprocessing
Show the genes that account for the highest fraction of counts in each cell, across all cells.
sc.pl.highest_expr_genes(adata, n_top=20, )
Basic filtering:
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
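What the two filters do can be sketched with plain NumPy on a toy matrix. The values and the small thresholds below are made up for illustration (the tutorial uses min_genes=200 and min_cells=3), and the sketch only mirrors the logic; it is not scanpy's code.

```python
import numpy as np

# Toy count matrix: rows are cells, columns are genes (values are made up).
X = np.array([
    [1, 0, 2, 0],
    [0, 0, 1, 0],
    [3, 1, 0, 0],
    [2, 0, 4, 1],
])
min_genes, min_cells = 2, 2  # illustrative thresholds, not the tutorial's

# Keep cells that express at least `min_genes` genes.
genes_per_cell = (X > 0).sum(axis=1)
keep_cells = genes_per_cell >= min_genes
X_cells = X[keep_cells]

# Then keep genes detected in at least `min_cells` of the remaining cells.
cells_per_gene = (X_cells > 0).sum(axis=0)
keep_genes = cells_per_gene >= min_cells
X_filtered = X_cells[:, keep_genes]
```

Because the gene filter runs after the cell filter, a gene is judged only on the cells that survived the first step.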
Let's gather some information about mitochondrial genes, which are important for quality control.
Quoted from the "Simple Single Cell" workflow (Lun, McCarthy & Marioni, 2017):
A high ratio indicates poor cell quality (Islam et al., 2014; Ilicic et al., 2016), possibly due to loss of cytoplasmic RNA from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and are less likely to escape through tears in the cell membrane.
With pp.calculate_qc_metrics we can calculate many metrics very efficiently.
adata.var['mt'] = adata.var_names.str.startswith('MT-') # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
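The three metrics used in the rest of this section can be reproduced by hand on a toy matrix. The gene names and counts below are invented, and the variable names simply mirror the column names that appear in adata.obs; treat this as a sketch of the definitions rather than of scanpy's implementation.

```python
import numpy as np

gene_names = np.array(["MT-CO1", "GENE_A", "GENE_B", "MT-ND1"])  # hypothetical genes
X = np.array([
    [1, 0, 3, 0],
    [0, 0, 2, 2],
    [5, 5, 0, 0],
])  # rows are cells, columns are genes

is_mt = np.char.startswith(gene_names, "MT-")  # same rule as adata.var['mt']

n_genes_by_counts = (X > 0).sum(axis=1)        # genes with at least one count
total_counts = X.sum(axis=1)                   # library size per cell
pct_counts_mt = 100.0 * X[:, is_mt].sum(axis=1) / total_counts
```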
Violin plots of some of the computed quality metrics:
- the number of genes expressed in the count matrix
- the total counts per cell
- the percentage of counts in mitochondrial genes
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
jitter=0.4, multi_panel=True)
Remove cells with excessive mitochondrial gene expression or excessive total counts:
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
Actually do the filtering by slicing the AnnData object.
adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]
Total-count normalize (library-size correct) the data matrix X to 10,000 reads per cell, so that counts become comparable among cells.
sc.pp.normalize_total(adata, target_sum=1e4)
Logarithmize the data:
sc.pp.log1p(adata)
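The effect of these two steps can be seen on a toy example. In the sketch below the second cell was sequenced ten times deeper than the first but has the same composition, so after scaling each cell to the same total the two rows coincide. The numbers are made up and the code is a plain NumPy illustration, not scanpy's implementation.

```python
import numpy as np

# Two cells with identical composition but a 10x difference in depth.
X = np.array([
    [1.0, 3.0],
    [10.0, 30.0],
])
target_sum = 1e4

totals = X.sum(axis=1, keepdims=True)   # library size per cell
X_norm = X / totals * target_sum        # every cell now sums to target_sum
X_log = np.log1p(X_norm)                # log(1 + x), defined for zero counts too
```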
Identify highly variable genes.
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
Set the .raw attribute of the AnnData object to the normalized and logarithmized gene expression for later use in differential testing and visualization of gene expression. This simply freezes the state of the AnnData object.
adata.raw = adata
Actually do the filtering:
adata = adata[:, adata.var.highly_variable]
Regress out the effects of total counts per cell and the percentage of mitochondrial gene counts.
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
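Regressing out a covariate means fitting a linear model per gene and keeping only the residuals. The sketch below does this with ordinary least squares on simulated data; the covariates stand in for total_counts and pct_counts_mt, all numbers are synthetic, and the choice of plain least squares with an intercept is an assumption made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 50

# Synthetic stand-ins for total_counts and pct_counts_mt.
covariates = rng.normal(size=(n_cells, 2))
# Made-up effects of the 2 covariates on 3 genes, plus a little noise.
true_effects = np.array([[2.0, 0.0, -1.0],
                         [0.5, 1.0, 0.0]])
expression = covariates @ true_effects + rng.normal(scale=0.1, size=(n_cells, 3))

# Design matrix with an intercept column; one least-squares fit covers all genes.
design = np.column_stack([np.ones(n_cells), covariates])
coef, *_ = np.linalg.lstsq(design, expression, rcond=None)

# The "regressed out" expression is what the covariates cannot explain.
residuals = expression - design @ coef
```

By construction the residuals are uncorrelated with the covariates, which is the property the downstream steps rely on.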
Scale each gene to unit variance. Clip values exceeding 10 standard deviations.
sc.pp.scale(adata, max_value=10)
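Scaling and clipping can be sketched as a per-gene z-score followed by a cap. The toy values and the cap of 1.0 are chosen only so that the clipping is visible on four cells (the tutorial uses max_value=10). Clipping only the upper tail is an assumption of this sketch, not a statement about scanpy's behavior.

```python
import numpy as np

X = np.array([
    [1.0, 10.0],
    [2.0, 20.0],
    [3.0, 30.0],
    [4.0, 100.0],
])  # rows are cells, columns are genes
max_value = 1.0  # illustrative cap so that clipping shows up on a tiny example

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)            # sample standard deviation per gene
Z = (X - mean) / std                   # zero mean, unit variance per gene
Z_clipped = np.minimum(Z, max_value)   # cap large positive values only
```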
Principal component analysis
Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.
sc.tl.pca(adata, svd_solver='arpack')
We can plot a scatter plot in PCA coordinates, but we won't use it later.
sc.pl.pca(adata, color='CST3')
Let us inspect the contribution of single PCs to the total variance in the data. This gives us information about how many PCs we should consider in order to compute the neighborhood relations of cells, e.g. as used in the clustering function sc.tl.louvain() or in tSNE via sc.tl.tsne(). In our experience, a rough estimate of the number of PCs usually suffices.
sc.pl.pca_variance_ratio(adata, log=True)
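The quantity shown in the variance-ratio plot can be computed directly from a singular value decomposition. The sketch below uses synthetic data with five features of decreasing spread; it illustrates the definition of the explained variance ratio and is not meant to reproduce scanpy's solver.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 cells, 5 features with decreasing spread.
X = rng.normal(size=(100, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

pc_scores = U * S                                   # cell coordinates in PC space
explained_variance = S**2 / (X.shape[0] - 1)        # variance along each PC
variance_ratio = explained_variance / explained_variance.sum()
```

A sharp drop in variance_ratio after the first few entries suggests that later PCs mostly carry noise.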
Compute Neighborhood Graph
Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You might simply use the default values here. To reproduce Seurat's results, we take the following values.
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
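The core idea behind the neighborhood graph is a k-nearest-neighbor search in PC space. The brute-force sketch below uses exact Euclidean distances on five made-up points; the helper knn_indices is hypothetical, and real pipelines typically use faster approximate searches and also derive edge weights, which this sketch omits.

```python
import numpy as np


def knn_indices(points, n_neighbors):
    """Hypothetical helper: indices of the nearest other points, by Euclidean distance."""
    diffs = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diffs**2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # in this sketch a cell is not its own neighbor
    return np.argsort(dist, axis=1)[:, :n_neighbors]


# Toy "PCA coordinates": three cells near the origin, two far away.
pcs = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [0.0, 0.2],
    [5.0, 5.0],
    [5.1, 5.0],
])
neighbors = knn_indices(pcs, n_neighbors=2)
```

With only two cells in the distant group, each of them is forced to pick one neighbor from the other group, which shows how n_neighbors influences the connectivity of the graph.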
Embed neighborhood graph
We suggest embedding the graph in two dimensions using UMAP (McInnes et al. 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. In some cases, you may still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running:
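# Optional remedy. PAGA needs cluster labels in adata.obs (e.g. from sc.tl.leiden), so run these lines only after clustering:
# sc.tl.paga(adata)
# sc.pl.paga(adata, plot=False) # remove `plot=False` if you want to see the coarse-grained graph
# sc.tl.umap(adata, init_pos='paga')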
sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
Because we set the .raw attribute of adata, the previous plots show the "raw" (normalized, logarithmized, but uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you do not want to use .raw.
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)
Clustering Neighborhood Graph
Like Seurat and many other frameworks, we recommend the Leiden graph clustering method of Traag et al. (2018) (community detection based on optimized modularity). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we have already calculated in the previous section.
sc.tl.leiden(adata)
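To make "optimized modularity" concrete, the sketch below computes the modularity score of two candidate partitions of a tiny graph made of two triangles joined by one edge. It only evaluates the score; it does not implement the Leiden algorithm, and the helper modularity and the toy graph are illustrative assumptions.

```python
import numpy as np


def modularity(adjacency, labels):
    """Hypothetical helper: Newman modularity of a partition of an undirected graph."""
    two_m = adjacency.sum()                      # each edge is counted twice
    degrees = adjacency.sum(axis=1)
    same = labels[:, None] == labels[None, :]    # True where both nodes share a cluster
    expected = np.outer(degrees, degrees) / two_m
    return ((adjacency - expected) * same).sum() / two_m


# Two triangles (0-1-2 and 3-4-5) connected by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

q_triangles = modularity(A, np.array([0, 0, 0, 1, 1, 1]))  # one cluster per triangle
q_mixed = modularity(A, np.array([0, 1, 0, 1, 0, 1]))      # arbitrary alternating split
```

The partition that follows the triangles scores higher, which is the kind of grouping a modularity-based method is designed to favor.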
Plot the clusters, which agree quite well with Seurat's results.
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
Find marker genes
Let us compute a ranking of the highly differential genes in each cluster. For this, the .raw attribute of AnnData is used by default, in case it has been initialized before. The simplest and fastest method to do so is the t-test.
sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
sc.settings.verbosity = 2 # reduce the verbosity
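The idea behind a t-test ranking is to compare each gene's expression inside one cluster with its expression in all other cells and to sort genes by the resulting statistic. The sketch below uses a Welch-style statistic (unequal variances), which is an assumption about the flavor; the helper welch_scores, the gene names and the expression values are all made up.

```python
import numpy as np


def welch_scores(X, in_group):
    """Hypothetical helper: per-gene Welch t statistic, group versus rest."""
    a, b = X[in_group], X[~in_group]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return mean_diff / se


gene_names = np.array(["GENE_A", "GENE_B", "GENE_C"])  # hypothetical genes
X = np.array([
    [5.0, 1.0, 1.0],
    [6.0, 1.2, 1.1],
    [5.5, 0.8, 0.9],   # first three cells: the cluster of interest
    [1.0, 1.1, 2.0],
    [1.2, 0.9, 2.2],
    [0.8, 1.0, 1.8],   # last three cells: everything else
])
in_cluster = np.array([True, True, True, False, False, False])

scores = welch_scores(X, in_cluster)
ranking = gene_names[np.argsort(-scores)]  # highest score first
```

GENE_A is strongly up in the cluster and ranks first, GENE_B is unchanged, and GENE_C is down and ranks last.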
The results of the Wilcoxon rank-sum (Mann-Whitney U) test are very similar. We recommend using the latter in publications; see, for example, Soneson & Robinson (2018). You might also consider much more powerful differential testing packages such as MAST, limma, DESeq2 and, for Python, the recent diffxpy.
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
As an alternative, let us rank genes using logistic regression. This has been proposed, for example, by Ntranos et al. (2018). The essential difference is that here we use a multivariate approach, whereas conventional differential tests are univariate. Clark et al. (2014) give more details.
sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
All marker genes were recovered in all methods except IL7R, which was found only by t-test and FCER1A, which was found only by the other two methods.
Let us also define a list of marker genes for later reference.
marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']
Actually label the clusters with cell types. The list must contain exactly one name per Leiden cluster, in cluster order.
new_cluster_names = [
'CD4 T', 'CD14 Monocytes',
'B', 'CD8 T',
'NK', 'FCGR3A Monocytes',
'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')
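Renaming works position by position: the first new name replaces the first cluster category, and so on. The pandas sketch below shows the same mechanism on a tiny categorical column; the cluster labels and the three names are an illustrative subset, not the tutorial's result.

```python
import pandas as pd

# Toy cluster assignments as they might appear in a categorical obs column.
clusters = pd.Series(["0", "1", "0", "2"], dtype="category")
new_names = ["CD4 T", "B", "NK"]  # illustrative subset, one name per category

# Guard against a mismatch, e.g. when clustering returns a different cluster count.
if len(new_names) != len(clusters.cat.categories):
    raise ValueError("need exactly one new name per cluster")

labelled = clusters.cat.rename_categories(new_names)
```

Since the number of Leiden clusters can vary with software versions and parameters, checking the category count before renaming avoids mislabeled clusters.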
Now that we have annotated cell types, let's visualize the marker genes.
sc.pl.dotplot(adata, marker_genes, groupby='leiden');
There is also a very compact violin plot:
sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90);
During this analysis, the AnnData object accumulated a number of annotations, which can be listed by printing adata.