Seurat single-cell transcriptome sequencing data analysis tutorial (2) - python (scanpy)

This article follows the official scanpy tutorial; see the scanpy website for a more detailed explanation.

The data consist of 3k PBMCs from a healthy donor and are freely available from 10x Genomics. On a Unix system, you can uncomment and run the following commands to download and unpack the data. The last line creates a directory to which the processed data will be written.

# !mkdir data
# !wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
# !cd data; tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz
# !mkdir write

Import Data

import numpy as np
import pandas as pd
import scanpy as sc
sc.settings.verbosity = 3             # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')

Read the count matrix into an AnnData object, which contains a number of slots for annotations and different data representations. It also comes with its own HDF5-based file format.

adata = sc.read_10x_mtx(
    'data/filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True)                              # write a cache file for faster subsequent reading
adata.var_names_make_unique()  # this is unnecessary if using `var_names='gene_ids'` in `sc.read_10x_mtx`
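
Several genes can share one symbol, and an index with repeated names makes lookups by gene symbol ambiguous, which is why the names are made unique above. The sketch below mimics the renaming idea on a plain Python list, assuming that later duplicates receive a numeric suffix. The helper `make_names_unique` is hypothetical and skips the collision handling a real implementation needs.

```python
from collections import Counter

def make_names_unique(names, join="-"):
    """Keep the first occurrence of each name; suffix later ones with -1, -2, ..."""
    seen = Counter()
    unique = []
    for name in names:
        count = seen[name]
        # First occurrence stays untouched, repeats get a running suffix.
        unique.append(name if count == 0 else f"{name}{join}{count}")
        seen[name] += 1
    return unique

symbols = ["CD3D", "LYZ", "CD3D", "CD3D"]
unique_symbols = make_names_unique(symbols)
print(unique_symbols)
```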

Preprocessing

Show the genes that yield the highest fraction of counts in each cell, across all cells.

sc.pl.highest_expr_genes(adata, n_top=20, )

Basic filtering:

sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
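
As a rough illustration of what these two filters do, the following NumPy sketch applies the same kind of thresholds to a small made-up count matrix. The matrix `toy` and the helper `basic_filter` are hypothetical; scanpy works on the AnnData object itself and also stores the per-cell and per-gene tallies as annotations.

```python
import numpy as np

def basic_filter(counts, min_genes=200, min_cells=3):
    """Drop cells expressing fewer than `min_genes` genes, then genes
    detected in fewer than `min_cells` of the remaining cells."""
    counts = np.asarray(counts)
    # A gene counts as expressed in a cell if it has at least one count.
    genes_per_cell = (counts > 0).sum(axis=1)
    keep_cells = genes_per_cell >= min_genes
    counts = counts[keep_cells, :]
    cells_per_gene = (counts > 0).sum(axis=0)
    keep_genes = cells_per_gene >= min_cells
    return counts[:, keep_genes], keep_cells, keep_genes

# Toy data: 4 cells x 5 genes, with thresholds scaled down to match.
toy = np.array([
    [1, 0, 3, 0, 2],
    [0, 0, 1, 0, 0],   # only one gene expressed, so this cell is dropped
    [2, 1, 4, 0, 1],
    [5, 0, 2, 0, 3],
])
filtered, keep_cells, keep_genes = basic_filter(toy, min_genes=2, min_cells=3)
print(filtered.shape)
```

Filtering cells first and genes second mirrors the order of the two calls above; a gene that was detected only in low-quality cells is removed as well.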

Let's gather some information about mitochondrial genes, which are important for quality control.

As the "Simple Single Cell" workflow (Lun, McCarthy & Marioni, 2017) explains:

A high proportion of mitochondrial counts indicates poor cell quality (Islam et al., 2014; Ilicic et al., 2016), possibly because cytoplasmic RNA has been lost from perforated cells. The reasoning is that mitochondria are larger than individual transcript molecules and are less likely to escape through tears in the cell membrane.

With pp.calculate_qc_metrics, we can compute many metrics very efficiently.

adata.var['mt'] = adata.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
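
To make the three per-cell quantities used below concrete, here is a minimal NumPy sketch that computes them for a tiny made-up matrix. The function `qc_metrics`, the gene list and the counts are hypothetical; the returned names merely echo the column names that scanpy writes to `adata.obs`.

```python
import numpy as np

def qc_metrics(counts, gene_names, mt_prefix="MT-"):
    """Per-cell QC values: genes detected, total counts, percent mitochondrial."""
    counts = np.asarray(counts, dtype=float)
    is_mt = np.array([name.startswith(mt_prefix) for name in gene_names])
    n_genes_by_counts = (counts > 0).sum(axis=1)
    total_counts = counts.sum(axis=1)
    mt_counts = counts[:, is_mt].sum(axis=1)
    # Cells without any counts get 0 percent instead of a division by zero.
    pct_counts_mt = np.divide(100.0 * mt_counts, total_counts,
                              out=np.zeros_like(total_counts),
                              where=total_counts > 0)
    return n_genes_by_counts, total_counts, pct_counts_mt

genes = ["MT-CO1", "CD3D", "LYZ", "MT-ND1"]
toy = np.array([[5, 10, 0, 5],
                [0, 8, 12, 0]])
n_genes, totals, pct_mt = qc_metrics(toy, genes)
print(n_genes, totals, pct_mt)
```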

A violin plot of some of the computed quality measures:

  • the number of genes expressed in the count matrix
  • the total counts per cell
  • the percentage of counts in mitochondrial genes

sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

Remove cells that have too many mitochondrial genes expressed or too many total counts:

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

Actually do the filtering by slicing the AnnData object:

adata = adata[adata.obs.n_genes_by_counts < 2500, :]
adata = adata[adata.obs.pct_counts_mt < 5, :]

Total-count normalize (library-size correct) the data matrix X to 10,000 reads per cell, so that counts become comparable among cells.

sc.pp.normalize_total(adata, target_sum=1e4)

Logarithmize the data:

sc.pp.log1p(adata)
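
The two steps above amount to simple arithmetic, sketched here with NumPy on a made-up matrix: each cell is rescaled so that its counts sum to the target, and the result is passed through log(1 + x). The helper `normalize_and_log` is hypothetical and assumes every cell has at least one count, which holds after the basic filtering.

```python
import numpy as np

def normalize_and_log(counts, target_sum=1e4):
    """Rescale each cell (row) to `target_sum` total counts, then apply log1p."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)  # library size per cell
    normalized = counts / totals * target_sum
    return normalized, np.log1p(normalized)

# Two cells with the same composition but a tenfold difference in depth.
toy = np.array([[1, 3, 0],
                [10, 30, 0]])
normalized, logged = normalize_and_log(toy)
print(normalized)
```

After normalization both toy cells are identical, which is the point of the correction: differences in sequencing depth no longer look like differences in expression.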

Identify highly variable genes:

sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
sc.pl.highly_variable_genes(adata)
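
The thresholds above refer to each gene's mean expression and its dispersion. The sketch below computes only these raw ingredients for a made-up matrix, taking dispersion as variance divided by mean. It is deliberately simplified: it omits the binning of genes by mean and the normalization of dispersions within each bin that a Seurat-style selection is understood to perform, so its numbers are not comparable to the `min_disp` cutoff. The helper `mean_and_dispersion` is hypothetical.

```python
import numpy as np

def mean_and_dispersion(expr):
    """Per-gene mean and variance-to-mean ratio (unnormalized dispersion)."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    var = expr.var(axis=0, ddof=1)
    # Genes with zero mean get dispersion 0 instead of a division by zero.
    disp = np.divide(var, mean, out=np.zeros_like(mean), where=mean > 0)
    return mean, disp

# Gene 0 is constant, gene 1 is never expressed, gene 2 is on in half the cells.
toy = np.array([[1, 0, 5],
                [1, 0, 0],
                [1, 0, 5],
                [1, 0, 0]])
mean, disp = mean_and_dispersion(toy)
print(mean, disp)
```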

Set the .raw attribute of the AnnData object to the normalized and logarithmized raw gene expression for later use in differential testing and visualizations of gene expression. This simply freezes the state of the AnnData object.

adata.raw = adata

Actually do the filtering:

adata = adata[:, adata.var.highly_variable]

Regress out the effects of total counts per cell and the percentage of mitochondrial genes expressed:

sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
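
Regressing out a covariate means fitting, for each gene, a linear model of expression on the covariates and keeping only the residuals. A minimal NumPy sketch of that idea on simulated data follows; the helper `regress_out_residuals` and the data are hypothetical, and the ordinary least-squares fit with an intercept is an assumption about the model, not a copy of scanpy's code.

```python
import numpy as np

def regress_out_residuals(expr, covariates):
    """Per-gene residuals of an ordinary least-squares fit on the covariates."""
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float)
    # Design matrix: an intercept column plus one column per covariate.
    design = np.column_stack([np.ones(len(covariates)), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

rng = np.random.default_rng(0)
covariates = rng.normal(size=(50, 2))           # stand-ins for the two covariates
weights = np.array([[2.0, 0.0],
                    [0.5, 1.0]])
expr = covariates @ weights + 0.1 * rng.normal(size=(50, 2))
residuals = regress_out_residuals(expr, covariates)
print(np.corrcoef(covariates[:, 0], residuals[:, 0])[0, 1])
```

The printed correlation is numerically zero: what remains in the residuals can no longer be explained linearly by the covariates.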

Scale each gene to unit variance. Clip values that exceed 10 standard deviations.

sc.pp.scale(adata, max_value=10)
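
In NumPy terms, scaling is a per-gene z-score followed by a cap. The sketch below uses a hypothetical helper `scale_and_clip` on made-up data; it caps only values above `max_value`, and how the negative tail is treated is an assumption to check against the installed scanpy version.

```python
import numpy as np

def scale_and_clip(expr, max_value=10.0):
    """Z-score each gene (column) and cap values above `max_value`."""
    expr = np.asarray(expr, dtype=float)
    mean = expr.mean(axis=0)
    std = expr.std(axis=0, ddof=1)
    std[std == 0] = 1.0  # constant genes stay at zero instead of dividing by zero
    scaled = (expr - mean) / std
    return np.minimum(scaled, max_value)

toy = np.zeros((200, 2))
toy[0, 0] = 1000.0            # a single extreme outlier in gene 0
toy[:, 1] = np.arange(200)    # a well-behaved gene
scaled = scale_and_clip(toy, max_value=10.0)
print(scaled[0, 0], scaled[:, 1].std(ddof=1))
```

Without the cap, the outlier cell would sit about 14 standard deviations above the mean and could dominate the principal components computed next.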

Principal component analysis

Reduce the dimensionality of the data by running principal component analysis (PCA), which reveals the main axes of variation and denoises the data.

sc.tl.pca(adata, svd_solver='arpack')

We can plot a scatter plot in PCA coordinates, but we won't use it later.

sc.pl.pca(adata, color='CST3')

Let us inspect the contribution of single PCs to the total variance in the data. This tells us how many PCs we should consider when computing the neighborhood relations of cells, which are used, for example, in the clustering function sc.tl.louvain() or in the tSNE embedding sc.tl.tsne(). In our experience, a rough estimate of the number of PCs usually suffices.

sc.pl.pca_variance_ratio(adata, log=True)
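
The variance ratio plotted here can be reproduced in a few lines with a singular value decomposition. The sketch below is a plain NumPy illustration on simulated data with one dominant axis of variation; the helper `pca_variance_ratio` is hypothetical and uses a full SVD, not the ARPACK solver selected above.

```python
import numpy as np

def pca_variance_ratio(expr, n_comps=3):
    """PCA via SVD: cell scores and the fraction of variance per component."""
    expr = np.asarray(expr, dtype=float)
    centered = expr - expr.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    variances = s**2 / (expr.shape[0] - 1)   # variance along each component
    ratio = variances / variances.sum()
    scores = u[:, :n_comps] * s[:n_comps]    # cell coordinates in PC space
    return scores, ratio[:n_comps]

rng = np.random.default_rng(0)
latent = rng.normal(size=(100, 1))           # one hidden factor drives most genes
loadings = np.array([[3.0, 2.0, 1.0, 0.0]])
toy = latent @ loadings + 0.1 * rng.normal(size=(100, 4))
scores, ratio = pca_variance_ratio(toy, n_comps=3)
print(ratio)
```

A sharp drop after the first few ratios, as in this toy case, is the "elbow" one looks for when choosing how many PCs to pass on.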

Compute Neighborhood Graph

Let us compute the neighborhood graph of cells using the PCA representation of the data matrix. You can simply use the default values here; to reproduce Seurat's results, we take the following values.

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
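
Conceptually, the neighborhood graph connects each cell to its nearest cells in PCA space. The brute-force sketch below finds those neighbors for a handful of made-up 2D points. It is only an illustration of the idea: the helper `knn_indices` is hypothetical, excludes the cell itself, and computes all pairwise distances, whereas scanpy is not expected to work this way on real data and additionally derives connectivity weights.

```python
import numpy as np

def knn_indices(coords, n_neighbors=10):
    """Indices of the `n_neighbors` nearest other points for every point."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff**2).sum(axis=-1))   # all pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)           # a point is not its own neighbor
    return np.argsort(dist, axis=1)[:, :n_neighbors]

# Two well-separated groups of three points each.
toy = np.array([[0, 0], [0, 1], [1, 0],
                [10, 10], [10, 11], [11, 10]])
neighbors = knn_indices(toy, n_neighbors=2)
print(neighbors)
```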

Embed neighborhood graph

We suggest embedding the graph in two dimensions using UMAP (McInnes et al., 2018), see below. It is potentially more faithful to the global connectivity of the manifold than tSNE, i.e., it better preserves trajectories. In some cases, you may still observe disconnected clusters and similar connectivity violations. They can usually be remedied by running the commands below. They are commented out here because sc.tl.paga needs cluster labels, such as the 'leiden' labels computed in the next section:

# sc.tl.paga(adata)
# sc.pl.paga(adata, plot=False)  # remove `plot=False` if you want to see the coarse-grained graph
# sc.tl.umap(adata, init_pos='paga')

sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])

As we set the .raw attribute of adata, the previous plots showed the "raw" (normalized, logarithmized, but uncorrected) gene expression. You can also plot the scaled and corrected gene expression by explicitly stating that you don't want to use .raw.

sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)

Clustering Neighborhood Graph

As with Seurat and many other frameworks, we recommend the Leiden graph-clustering method (community detection based on optimizing modularity) by Traag et al. (2018). Note that Leiden clustering directly clusters the neighborhood graph of cells, which we already computed in the previous section.

sc.tl.leiden(adata)
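
"Optimizing modularity" means searching for a partition of the graph whose clusters contain more edges than expected by chance. The sketch below only evaluates that score for a given partition of a tiny made-up graph; it does not implement the Leiden search itself. The helper `modularity` is hypothetical and uses the standard definition Q = (1/2m) Σ_ij (A_ij - k_i k_j / 2m) δ(c_i, c_j).

```python
import numpy as np

def modularity(adjacency, labels):
    """Modularity Q of a partition of an undirected graph."""
    adjacency = np.asarray(adjacency, dtype=float)
    labels = np.asarray(labels)
    degrees = adjacency.sum(axis=1)
    two_m = degrees.sum()                          # twice the number of edges
    same = labels[:, None] == labels[None, :]      # pairs in the same cluster
    expected = np.outer(degrees, degrees) / two_m  # edges expected by chance
    return ((adjacency - expected) * same).sum() / two_m

# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
adjacency = np.zeros((6, 6))
for i, j in edges:
    adjacency[i, j] = adjacency[j, i] = 1.0

q_split = modularity(adjacency, [0, 0, 0, 1, 1, 1])   # one cluster per triangle
q_single = modularity(adjacency, [0, 0, 0, 0, 0, 0])  # everything in one cluster
q_mixed = modularity(adjacency, [0, 1, 0, 1, 0, 1])   # clusters cut across triangles
print(q_split, q_single, q_mixed)
```

The partition that follows the two triangles scores highest, which is the kind of solution a modularity-based method is designed to find.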

Plot the clusters, which agree quite well with Seurat's results.

sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])

Find marker genes

Let us compute a ranking of the highly differential genes in each cluster. By default, the .raw attribute of AnnData is used for this if it has been initialized before. The simplest and fastest method is the t-test.

sc.tl.rank_genes_groups(adata, 'leiden', method='t-test')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
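
The idea behind the ranking is to compare each cluster against all remaining cells, gene by gene, and to sort genes by the test statistic. The NumPy sketch below does this with a Welch-type t-statistic on simulated data with one planted marker gene. The helper `rank_genes_welch` and the data are hypothetical, and the exact statistic scanpy computes is not reproduced here.

```python
import numpy as np

def rank_genes_welch(expr, labels, group):
    """Rank genes by a Welch t-statistic of `group` versus all other cells."""
    expr = np.asarray(expr, dtype=float)
    in_group = np.asarray(labels) == group
    a, b = expr[in_group], expr[~in_group]
    mean_diff = a.mean(axis=0) - b.mean(axis=0)
    # Unequal-variance standard error of the difference in means.
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    scores = mean_diff / se
    order = np.argsort(-scores)   # highest score first
    return order, scores

rng = np.random.default_rng(0)
labels = np.array(["A"] * 10 + ["B"] * 10)
expr = rng.normal(size=(20, 4))
expr[:10, 2] += 5.0               # plant gene 2 as a marker of cluster "A"
order, scores = rank_genes_welch(expr, labels, group="A")
print(order[0], scores[2])
```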

sc.settings.verbosity = 2  # reduce the verbosity

The result of a Wilcoxon rank-sum (Mann-Whitney U) test is very similar. We recommend using the latter in publications; see, for example, Soneson & Robinson (2018). You might also consider much more powerful differential testing packages such as MAST, limma, DESeq2 and, for Python, the recent diffxpy.

sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

As an alternative, let us rank genes using logistic regression, as suggested, for example, by Ntranos et al. (2018). The essential difference is that here we use a multivariate approach, whereas conventional differential tests are univariate. Clark et al. (2014) has more details.

sc.tl.rank_genes_groups(adata, 'leiden', method='logreg')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)

All marker genes are recovered by all methods, with two exceptions: IL7R is found only by the t-test, and FCER1A is found only by the other two methods.

Let us also define a list of marker genes for later reference.

marker_genes = ['IL7R', 'CD79A', 'MS4A1', 'CD8A', 'CD8B', 'LYZ', 'CD14',
                'LGALS3', 'S100A8', 'GNLY', 'NKG7', 'KLRB1',
                'FCGR3A', 'MS4A7', 'FCER1A', 'CST3', 'PPBP']

Actually mark the cell types. The list must contain exactly one name per cluster, in cluster order.

new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')

Now that we have annotated cell types, let's visualize the marker genes.

sc.pl.dotplot(adata, marker_genes, groupby='leiden');

There is also a very compact violin plot:

sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90);

During the course of this analysis, the AnnData object accumulated a number of annotations, which can be listed by printing adata.

Origin blog.csdn.net/coffeeii/article/details/130818253