A Python beginner's introduction to single-cell analysis with Scanpy

Hello everyone, today we walk through the standard Scanpy workflow.

Introduction to basic concepts

Scanpy and Seurat follow essentially the same workflow. Scanpy stores its data in an AnnData object, which is organized into four main components, described below.

picture

If the Scanpy data structure is unfamiliar, compare it with the Seurat data structure; see "Single cell live broadcast 3: Seurat data structure and data visualization".

Here X is the count matrix. Note that, unlike in R, rows in Scanpy are samples (cells) and columns are genes, which follows the usual Python convention.

  • obs stores per-cell annotations, the counterpart of meta.data in the Seurat object

  • X is the count matrix, transposed relative to the Seurat matrix (cells x genes instead of genes x cells)

  • var stores gene (feature) information

  • uns stores unstructured information added later, such as analysis results
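To make the layout concrete, here is a minimal pure-Python sketch of the same four slots for a toy data set of 2 cells and 3 genes. It uses plain lists and dicts rather than the real AnnData API, and the barcodes, sample label, and counts are made up for illustration.

```python
# Toy stand-in for an AnnData object: plain Python containers, not the real API.
cells = ["cell_A", "cell_B"]          # hypothetical barcodes (rows of X)
genes = ["CST3", "NKG7", "PPBP"]      # gene names (columns of X)

toy = {
    "X": [[5, 0, 1],                  # one row per cell, one column per gene
          [0, 3, 0]],
    "obs": {c: {"sample": "s1"} for c in cells},      # per-cell metadata
    "var": {g: {"gene_symbol": g} for g in genes},    # per-gene metadata
    "uns": {},                                        # free-form results added later
}

# Seurat keeps the matrix the other way round: genes x cells.
seurat_like = [list(col) for col in zip(*toy["X"])]

n_cells, n_genes = len(toy["X"]), len(toy["X"][0])
print((n_cells, n_genes))                          # shape in Scanpy orientation
print((len(seurat_like), len(seurat_like[0])))     # shape in Seurat orientation
```

The number of rows in X always equals the number of entries in obs, and the number of columns equals the number of entries in var.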

Official sample code

import scanpy as sc
import os
import math
import itertools
import warnings
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
%config InlineBackend.figure_format = 'svg'
warnings.filterwarnings("ignore")
plt.rc('font',family='Times New Roman')
my_colors = ["#1EB2A6","#ffc4a3","#e2979c","#F67575"]
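sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
# ?sc.settings.verbosity
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor='white')  # set the figure style
os.makedirs('write', exist_ok=True)  # make sure the output directory exists
results_file = 'write/pbmc3k.h5ad'  # file that stores the analysis results
# Output of print_header():
# scanpy==1.6.0 anndata==0.7.5 umap==0.4.6 numpy==1.19.2 scipy==1.4.1 pandas==1.1.3 scikit-learn==0.23.2 statsmodels==0.12.0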

Reading files works much like constructing a Seurat object in R (the official documentation lists 12 reading functions).
Two approaches are covered here.
The first approach requires the input directory to contain three files:

  1. barcodes

  2. genes

  3. matrix
    Then read the directory with sc.read_10x_mtx
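Before calling sc.read_10x_mtx it can help to check that the directory really holds the three files. The sketch below is a hypothetical helper, not part of Scanpy. It assumes the usual 10x file names: barcodes.tsv, genes.tsv, and matrix.mtx for the older layout (the one used by the hg19 PBMC data), and the gzipped barcodes/features/matrix files for the newer layout.

```python
import os
import tempfile

# File names of the older and the newer (gzipped) 10x output layouts.
LEGACY_FILES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")
V3_FILES = ("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")

def missing_10x_files(directory):
    """Return the files missing from `directory`, or [] if either layout is complete.

    Hypothetical helper for illustration; it is not part of Scanpy.
    """
    present = set(os.listdir(directory))
    missing_by_layout = [
        [name for name in layout if name not in present]
        for layout in (LEGACY_FILES, V3_FILES)
    ]
    # Report the layout that is closest to complete.
    return min(missing_by_layout, key=len)

# Usage on a throwaway directory that lacks the matrix file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("barcodes.tsv", "genes.tsv"):
        open(os.path.join(tmp, name), "w").close()
    demo_missing = missing_10x_files(tmp)
    print(demo_missing)
```

An empty list means the directory is ready to be passed to the reader.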

The second approach builds the AnnData object directly:
read the expression matrix, cell information, and gene information separately, then combine them. The code is as follows

# Second approach
# Create the AnnData object from separate tables
#df = pd.read_csv('processfile/count.csv', index_col=0)
#meta = pd.read_csv('processfile/metadata.csv', index_col=0)
#cellinfo = pd.DataFrame(df.index,index=df.index,columns=['sample_index'])
#geneinfo = pd.DataFrame(df.columns,index=df.columns,columns=['genes_index'])
#sce = sc.AnnData(df, obs=cellinfo, var = geneinfo)
# First approach
adata = sc.read_10x_mtx(
    './filtered_gene_bc_matrices/hg19/',  # the directory with the `.mtx` file
    var_names='gene_symbols',                # use gene symbols for the variable names (variables-axis index)
    cache=True) 
adata.var_names_make_unique()
adata
Tip: Python and R conventions differ. In Python, rows are usually samples and columns are features.
adata.obs.shape # 2,700 cells
adata.var.shape # 32,738 genes
adata.to_df().shape # 2700 x 32738
adata.obs.head()
adata.var.head()
adata.to_df().iloc[0:5,0:5]

Data preprocessing

Scanpy's functions are grouped into three commonly used modules:

  1. pp: data preprocessing

  2. tl: tools, i.e. analysis steps that add results to the object (embeddings, clustering, marker ranking)

  3. pl: visualization

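Plot the genes with the highest fraction of counts per cell, then filter

sc.pl.highest_expr_genes(adata, n_top=20) # share of each cell's counts taken by a gene (as a percentage); shows the 20 genes with the highest mean share
sc.pp.filter_cells(adata, min_genes=200) # keep cells that express at least 200 genes
sc.pp.filter_genes(adata, min_cells=3) # keep genes expressed in at least 3 cells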
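What the two filters do can be sketched in pure Python on a tiny cells x genes matrix. This is a simplified illustration of the logic, not Scanpy's implementation, and the thresholds are shrunk to fit the toy data.

```python
def filter_cells_and_genes(matrix, min_genes, min_cells):
    """Pure-Python sketch of the two filters on a cells x genes count matrix.

    First drop cells that express fewer than `min_genes` genes, then drop
    genes detected in fewer than `min_cells` of the remaining cells.
    Returns (filtered_matrix, kept_cell_indices, kept_gene_indices).
    """
    kept_cells = [i for i, row in enumerate(matrix)
                  if sum(1 for v in row if v > 0) >= min_genes]
    rows = [matrix[i] for i in kept_cells]
    n_genes = len(matrix[0]) if matrix else 0
    kept_genes = [j for j in range(n_genes)
                  if sum(1 for row in rows if row[j] > 0) >= min_cells]
    filtered = [[row[j] for j in kept_genes] for row in rows]
    return filtered, kept_cells, kept_genes

counts = [
    [2, 0, 1, 0],   # expresses 2 genes
    [0, 0, 0, 4],   # expresses 1 gene, removed with min_genes=2
    [1, 3, 2, 0],   # expresses 3 genes
]
filtered, kept_cells, kept_genes = filter_cells_and_genes(
    counts, min_genes=2, min_cells=2)
print(kept_cells, kept_genes, filtered)
```

Because the cell filter runs first, a gene that was only detected in removed cells is dropped by the second filter as well.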

picture

Filter by mitochondrial gene content

str.startswith only tests a literal prefix and does not support regular expressions. To match with a regular expression, use .str.match instead
adata.var_names[adata.var_names.str.match(r'^MT-')]
adata.var_names[adata.var_names.str.match(r'^RP[SL0-9]')]
adata.var_names[adata.var_names.str.match(r'^ERCC-')]
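The difference between a literal prefix test and a regular-expression match can be tried with the standard library alone. The sketch below uses Python's re module on a made-up list of gene names; pandas' .str.match applies the same kind of match anchored at the start of each string. The ribosomal pattern here is a narrower variant chosen for the example.

```python
import re

gene_names = ["MT-CO1", "MT-ND1", "RPS27", "RPL13", "ERCC-00002", "CST3", "MTOR"]

# Literal prefix test: what str.startswith does.
mito_prefix = [g for g in gene_names if g.startswith("MT-")]

# Regular-expression test anchored at the start of the name.
ribo_pattern = re.compile(r"^RP[SL]")
ribo_regex = [g for g in gene_names if ribo_pattern.match(g)]

print(mito_prefix)
print(ribo_regex)
```

Including the hyphen in the prefix matters: "MTOR" starts with "MT" but is not a mitochondrial gene, and it is correctly left out.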
# Flag mitochondrial genes (names starting with 'MT-')
adata.var['mt'] = adata.var_names.str.startswith('MT-')
# Compute the QC metrics
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
# Plot the QC metrics before filtering
sc.pl.violin(adata, ['n_genes_by_counts'],jitter=0.4)
sc.pl.violin(adata, ['total_counts'],jitter=0.4)
sc.pl.violin(adata, ['pct_counts_mt'],jitter=0.4)

sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')

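# Keep cells with less than 5% mitochondrial counts
adata = adata[adata.obs.pct_counts_mt < 5, :]
# Keep cells with fewer than 2,500 detected genes (copy so later steps work on real data, not a view)
adata = adata[adata.obs.n_genes_by_counts < 2500, :].copy()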
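The three QC columns used above can be reproduced by hand on a toy matrix. The following pure-Python sketch shows how n_genes_by_counts, total_counts, and pct_counts_mt relate to the raw counts; it is an illustration of the definitions, not Scanpy's code, and the counts are invented.

```python
def qc_metrics(matrix, gene_names, mito_prefix="MT-"):
    """Per-cell QC values for a cells x genes count matrix (pure-Python sketch)."""
    is_mito = [name.startswith(mito_prefix) for name in gene_names]
    metrics = []
    for row in matrix:
        total = sum(row)
        mito = sum(v for v, m in zip(row, is_mito) if m)
        metrics.append({
            "n_genes_by_counts": sum(1 for v in row if v > 0),  # genes detected
            "total_counts": total,                              # library size
            "pct_counts_mt": 100.0 * mito / total if total else 0.0,
        })
    return metrics

gene_names = ["MT-CO1", "CST3", "NKG7", "PPBP"]
counts = [
    [1, 40, 9, 0],    # 2% mitochondrial
    [10, 30, 0, 10],  # 20% mitochondrial, fails the 5% cutoff
]
metrics = qc_metrics(counts, gene_names)

# Same thresholds as in the filtering step above.
keep = [i for i, m in enumerate(metrics)
        if m["pct_counts_mt"] < 5 and m["n_genes_by_counts"] < 2500]
print(metrics[0], keep)
```

A high mitochondrial percentage often points to damaged cells, and an unusually high gene count can indicate doublets, which is why both cutoffs are applied.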

The following is the standard Scanpy workflow, with the corresponding Seurat function for each step:

  1. Normalize and log-transform: NormalizeData

  2. Find highly variable genes: FindVariableFeatures

  3. Scale: ScaleData

  4. PCA: RunPCA

  5. Build the neighbor graph: FindNeighbors

  6. Cluster: FindClusters

  7. t-SNE / UMAP: RunTSNE, RunUMAP

  8. Differentially expressed genes: FindAllMarkers / FindMarkers

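sc.pp.normalize_total(adata, target_sum=1e4) # normalize library size first; do not swap the order with log1p
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
# Plot the highly variable genes
sc.pl.highly_variable_genes(adata)
# Keep the normalized, log-transformed data in .raw
adata.raw = adata
# Keep only the highly variable genes (copy so later steps work on real data, not a view)
adata = adata[:, adata.var.highly_variable].copy()
# Regress out the effects of total counts and mitochondrial percentage
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
# Scale each gene to unit variance and clip values exceeding 10
sc.pp.scale(adata, max_value=10)
# PCA
sc.tl.pca(adata, svd_solver='arpack')
sc.pl.pca(adata, color='CST3')
sc.pl.pca_variance_ratio(adata, log=True)
# Save the results
adata.write(results_file)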
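The arithmetic behind the normalize, log, and scale steps above is simple enough to write out by hand. The sketch below is a pure-Python illustration on a 3 x 2 toy matrix, not Scanpy's implementation. The choice of the sample standard deviation (n - 1 denominator) and of clipping only large positive values are assumptions of this sketch.

```python
import math

def normalize_total(matrix, target_sum=1e4):
    """Rescale every cell (row) so that its counts sum to `target_sum`."""
    out = []
    for row in matrix:
        total = sum(row)
        factor = target_sum / total if total else 0.0
        out.append([v * factor for v in row])
    return out

def log1p(matrix):
    """Natural log of (1 + x), element by element."""
    return [[math.log1p(v) for v in row] for row in matrix]

def scale(matrix, max_value=None):
    """Give every gene (column) zero mean and unit variance, then clip large values.

    Uses the sample standard deviation (n - 1 denominator); treat that choice
    as an assumption of this sketch.
    """
    n = len(matrix)
    scaled_cols = []
    for col in zip(*matrix):
        mean = sum(col) / n
        var = sum((v - mean) ** 2 for v in col) / (n - 1)
        std = math.sqrt(var) or 1.0   # avoid dividing by zero for constant genes
        z = [(v - mean) / std for v in col]
        if max_value is not None:
            z = [min(v, max_value) for v in z]
        scaled_cols.append(z)
    return [list(row) for row in zip(*scaled_cols)]

counts = [[1, 3], [10, 10], [0, 8]]            # 3 cells x 2 genes
normalized = normalize_total(counts, target_sum=100)
logged = log1p(normalized)
scaled = scale(logged, max_value=10)
print(normalized[0])   # first cell after library-size normalization
```

Normalizing before taking the logarithm matters: after normalization every cell has the same total, so the log-transformed values are comparable between cells with different sequencing depth.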

# Build the neighbor graph
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.umap(adata)
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'])
sc.pl.umap(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)

sc.tl.tsne(adata)
sc.pl.tsne(adata, color=['CST3', 'NKG7', 'PPBP'])

sc.pl.tsne(adata, color=['CST3', 'NKG7', 'PPBP'], use_raw=False)
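The neighbor graph that UMAP, t-SNE, and clustering build on is, at its core, a list of nearest cells for every cell in PCA space. The sketch below is a brute-force pure-Python version with made-up coordinates. The real sc.pp.neighbors is more elaborate (it also stores distances and connectivity weights), so treat this only as an illustration of the idea.

```python
import math

def knn_graph(points, n_neighbors):
    """Brute-force k-nearest-neighbour lists using Euclidean distance.

    `points` holds one coordinate list per cell (for example PCA scores).
    Returns, for every cell, the indices of its nearest other cells.
    """
    graph = []
    for i, p in enumerate(points):
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()
        graph.append([j for _, j in dists[:n_neighbors]])
    return graph

# Toy "PCA" coordinates: two tight groups of cells.
pcs = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.2], [5.0, 5.0], [5.1, 5.0]]
neighbors = knn_graph(pcs, n_neighbors=2)
print(neighbors)
```

Cells within the same tight group list each other first, which is the structure that graph-based clustering later picks up.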

picture

sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
sc.tl.leiden(adata)
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
sc.pl.tsne(adata, color=['leiden', 'CST3', 'NKG7'])
# Save the results

adata.write(results_file)

picture

Find differentially expressed genes

# Use the Wilcoxon rank-sum test
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
adata.write(results_file)
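For intuition about the rank-sum test used above, here is a simplified pure-Python sketch that scores one gene by comparing the cells of a cluster with all other cells. It uses the normal approximation without tie or continuity correction, so it is not a substitute for the library routine; the expression values are invented.

```python
import math

def rank_sum_z(group, rest):
    """Wilcoxon rank-sum z-score for one gene: `group` cells versus `rest`.

    Ties receive the average of the ranks they span. No tie or continuity
    correction is applied, so this is only a simplified sketch.
    """
    values = sorted((v, label)
                    for label, vals in ((0, group), (1, rest)) for v in vals)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[j + 1][0] == values[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1      # ranks are 1-based
        for k in range(i, j + 1):
            ranks[k] = average_rank
        i = j + 1
    n1, n2 = len(group), len(rest)
    rank_sum = sum(r for r, (_, label) in zip(ranks, values) if label == 0)
    expected = n1 * (n1 + n2 + 1) / 2
    std = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (rank_sum - expected) / std

# Expression of one hypothetical gene in a cluster and in all other cells.
z_up = rank_sum_z([5.0, 6.0, 7.0], [1.0, 2.0, 3.0])
z_same = rank_sum_z([1.0, 2.0], [1.0, 2.0])
print(round(z_up, 3), z_same)
```

A large positive score means the gene is ranked higher in the cluster than elsewhere; a score near zero means no difference.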

picture

num = 2 # number of top marker genes to take from each cluster
marker_genes = list(set(np.array(pd.DataFrame(adata.uns['rank_genes_groups']['names']).head(num)).reshape(-1)))
len(marker_genes)
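The one-liner above takes the top num gene names of every cluster and removes duplicates with set(), which returns them in arbitrary order. A pure-Python sketch of the same idea that keeps a stable order is shown below; the ranking table is hypothetical and the genes are grouped cluster by cluster.

```python
# Hypothetical ranking table: for each cluster, gene names sorted from best to worst.
ranked_names = {
    "0": ["LDHB", "CD3D", "IL7R"],
    "1": ["LYZ", "CST3", "S100A9"],
    "2": ["CD79A", "CST3", "MS4A1"],   # made-up order to show a shared marker
}

def top_markers(ranked, num):
    """Collect the first `num` genes of every cluster, dropping duplicates.

    dict.fromkeys keeps first-seen order, whereas set() returns an arbitrary order.
    """
    picked = [gene for names in ranked.values() for gene in names[:num]]
    return list(dict.fromkeys(picked))

markers = top_markers(ranked_names, num=2)
print(markers)
```

A stable order keeps the genes of one cluster next to each other, which makes dot plots and stacked violin plots easier to read.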
# Look at the marker genes of each group

adata = sc.read(results_file) 
result = adata.uns['rank_genes_groups']
groups = result['names'].dtype.names
pd.DataFrame(
    {group + '_' + key[:1]: result[key][group]
    for group in groups for key in ['names', 'pvals']}).iloc[0:6,0:6]
# Compare two groups directly: cluster 0 against cluster 1
sc.tl.rank_genes_groups(adata, 'leiden', groups=['0'], reference='1', method='wilcoxon')
sc.pl.rank_genes_groups(adata, groups=['0'], n_genes=20)
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)
# Reload the saved results here; otherwise the plot differs, because the 0-vs-1 comparison above overwrote the earlier ranking
adata = sc.read(results_file)
sc.pl.rank_genes_groups_violin(adata, groups='0', n_genes=8)

picture

sc.pl.violin(adata, ['CST3', 'NKG7', 'PPBP'], groupby='leiden')

picture


new_cluster_names = [
    'CD4 T', 'CD14 Monocytes',
    'B', 'CD8 T',
    'NK', 'FCGR3A Monocytes',
    'Dendritic', 'Megakaryocytes']
adata.rename_categories('leiden', new_cluster_names)
sc.pl.umap(adata, color='leiden', legend_loc='on data', title='', frameon=False, save='.pdf')
sc.pl.dotplot(adata, marker_genes, groupby='leiden');
sc.pl.stacked_violin(adata, marker_genes, groupby='leiden', rotation=90);
adata.raw.to_adata().write('./write/pbmc3k_withoutX.h5ad')
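Renaming assigns the new names to the existing categories by position, so the list needs exactly one name per cluster, in the order of the cluster labels. The number of Leiden clusters can change with the resolution or the package versions, so a quick check before renaming can save a confusing error. The helper below is a hypothetical pure-Python sketch, not a Scanpy function.

```python
def build_rename_map(old_categories, new_names):
    """Pair existing cluster labels with new names, refusing mismatched lengths."""
    if len(old_categories) != len(new_names):
        raise ValueError(
            f"{len(old_categories)} clusters but {len(new_names)} names were given"
        )
    return dict(zip(old_categories, new_names))

old = [str(i) for i in range(8)]          # Leiden labels are the strings '0', '1', ...
new = ["CD4 T", "CD14 Monocytes", "B", "CD8 T",
       "NK", "FCGR3A Monocytes", "Dendritic", "Megakaryocytes"]
mapping = build_rename_map(old, new)
print(mapping["0"], mapping["7"])
```

Printing the mapping also makes it easy to confirm, against the marker genes, that each cluster received the intended cell type.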

WARNING: saving figure to file figures\umap.pdf

picture

picture

picture

 
 

If you have read this far, the next post shows how I use Scanpy to process a very large COVID-19 data set:

GSE158055: COVID-19 lung tissue, 600,000 single cells in practice


Origin blog.csdn.net/qq_52813185/article/details/134826213