A summary of dimensionality reduction and clustering methods for identifying cell subtypes in single-cell sequencing data

 

Figure 1. Identification and analysis of cell subtypes (Stegle et al., Nature Reviews Genetics, 2015)

 

With the development of single-cell sequencing technology, the number of cells profiled per study or experiment has increased significantly. Many single-cell studies now generate anywhere from a few hundred cells to hundreds of thousands, or even more. Identifying cell subtypes (cell subpopulations) is an important foundation for applying single-cell sequencing. However, because single-cell sequencing data typically cover many cells, and tens of thousands of genes may be measured in each cell, the data are complex and high-dimensional.


Figure 2. Summary of methods for identifying cell subtypes from single-cell sequencing data (Andrews and Hemberg, 2018, Mol. Aspects Med.)
 

To analyze single-cell sequencing data effectively, and in particular to identify cell subtypes, we usually need to reduce the dimensionality of the data first. Methods for reducing the dimensionality of single-cell sequencing data fall into two categories (WeChat public account: AIPuFuBio):

1. Dimensionality reduction. These methods usually project the data into an optimized low-dimensional space while retaining the key features of the original high-dimensional data, so that the data can be displayed in two or three dimensions.

Commonly used dimensionality reduction methods are:

1) PCA (principal component analysis), a linear dimensionality reduction method;

2) t-SNE (t-distributed stochastic neighbor embedding), a nonlinear dimensionality reduction method;

3) UMAP (uniform manifold approximation and projection) (Becht et al., 2018, Nat. Biotechnol.);

4) scvis (Ding et al., 2018, Nat. Commun.).

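Of these, PCA and t-SNE are widely used in published single-cell sequencing papers. Note in particular that PCA and t-SNE are dimensionality reduction methods, not clustering methods.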
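To make the distinction concrete, here is a minimal sketch of a common pattern: reduce with PCA first, then embed the leading components in two dimensions with t-SNE. It uses scikit-learn on a synthetic matrix standing in for log-normalized expression data. The matrix sizes, the choice of 20 components, and perplexity 30 are illustrative assumptions, not recommendations from the cited papers.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)

# Synthetic stand-in for a log-normalized expression matrix (assumption):
# 150 cells x 500 genes, three groups of 50 cells with different mean profiles.
n_per_group, n_genes = 50, 500
centers = rng.normal(0.0, 2.0, size=(3, n_genes))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_group, n_genes)) for c in centers])

# Step 1: linear reduction with PCA, keeping the first 20 components.
pca = PCA(n_components=20, random_state=0)
X_pca = pca.fit_transform(X)

# Step 2: nonlinear embedding of the PCA scores into 2D with t-SNE.
# Perplexity must be smaller than the number of cells.
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_tsne = tsne.fit_transform(X_pca)

print(X_pca.shape, X_tsne.shape)
```

Neither step assigns cells to groups. The output is only a set of coordinates, which is why a separate clustering step is still needed.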


Figure 3. Schematic of how cells group under PCA, DM (diffusion maps), and t-SNE (perplexity=10 (C) and perplexity=50 (D)) (Andrews and Hemberg, 2018, Mol. Aspects Med.)
 

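2. Feature selection. This reduces the dimensionality of the data mainly by removing genes that carry little information and keeping the most informative genes.

Commonly used feature selection methods are:

1) Methods based on prior information (such as known cell subtypes). For example, the SCDE software can be used to identify genes differentially expressed between known cell subtypes, and clustering can then be run on those differentially expressed genes.

2) Unsupervised methods, which can be further divided into:

(i) methods based on highly variable genes (HVG);

(ii) methods based on spike-ins, such as scLVM (Buettner et al., 2015) and BASiCS (Vallejos et al., 2015);

(iii) methods based on dropout, such as M3Drop (Andrews and Hemberg, 2018).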
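As an illustration of the HVG idea in (i), the sketch below ranks genes by a simple dispersion score (variance divided by mean of log-normalized expression) and keeps the top ones. This is a simplified version written for this summary, not the procedure of any specific package. The function name `select_hvg`, the scaling factor of 10,000, and the synthetic Poisson counts are all assumptions.

```python
import numpy as np

def select_hvg(counts, n_top):
    """Return sorted indices of the n_top genes with the highest dispersion.

    counts: array of shape (cells, genes) holding raw counts.
    Simplified sketch: dispersion = variance / mean of log-normalized values.
    """
    # Normalize each cell to a common library size, then log-transform.
    lib_size = counts.sum(axis=1, keepdims=True)
    log_expr = np.log1p(counts / lib_size * 1e4)

    mean = log_expr.mean(axis=0)
    var = log_expr.var(axis=0)

    # Genes with zero mean get dispersion 0 so they are never selected first.
    dispersion = np.zeros_like(mean)
    expressed = mean > 0
    dispersion[expressed] = var[expressed] / mean[expressed]

    top = np.argsort(dispersion)[::-1][:n_top]
    return np.sort(top)

rng = np.random.default_rng(0)

# Synthetic counts (assumption): 200 cells x 1000 genes. The first 50 genes
# are expressed five times higher in the second half of the cells.
n_cells, n_genes, n_variable = 200, 1000, 50
rates = np.full((n_cells, n_genes), 20.0)
rates[n_cells // 2:, :n_variable] = 100.0
counts = rng.poisson(rates)

hvg = select_hvg(counts, n_top=n_variable)
X_hvg = counts[:, hvg]  # reduced matrix: same cells, fewer genes
print(X_hvg.shape)
```

Unlike PCA or t-SNE, the reduced matrix still has real genes as columns, so downstream results stay directly interpretable.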

 

Methods for identifying cell subtypes from single-cell sequencing data (for more, see AIPuFu: www.aipufu.com)

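1. Supervised methods. For example, clustering based on known marker genes of specific cell subtypes.

2. Unsupervised methods (unsupervised clustering), which can be further divided into:

(i) k-means, usually used in combination with PCA, t-SNE, and similar methods;

(ii) hierarchical clustering, which runs more slowly than k-means;

(iii) density-based clustering, which needs a large number of samples to achieve good clustering accuracy;

(iv) graph-based clustering, an extension of density-based clustering that can be applied to millions of cells.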
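The combination in (i) can be sketched as follows: reduce the data with PCA, then run k-means on the PCA scores. The example uses scikit-learn on synthetic data with three known groups. The group count, the number of components, and the use of the adjusted Rand index (ARI) to check the result are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(1)

# Synthetic stand-in for expression data (assumption):
# 180 cells x 300 genes, three groups of 60 cells each.
n_per_group, n_genes, n_groups = 60, 300, 3
centers = rng.normal(0.0, 1.5, size=(n_groups, n_genes))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_group, n_genes)) for c in centers])
true_labels = np.repeat(np.arange(n_groups), n_per_group)

# Dimensionality reduction first: PCA only produces coordinates.
X_pca = PCA(n_components=10, random_state=0).fit_transform(X)

# Clustering second: k-means assigns each cell to one of k groups.
# Note that k has to be chosen in advance.
kmeans = KMeans(n_clusters=n_groups, n_init=10, random_state=0)
pred_labels = kmeans.fit_predict(X_pca)

# ARI compares the predicted grouping with the known one (1.0 = identical).
ari = adjusted_rand_score(true_labels, pred_labels)
print(f"ARI = {ari:.2f}")
```

On real data the true labels are unknown, so the number of clusters and the quality of the result have to be judged by other means, such as marker genes.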

Figure 4. Some popular methods for identifying single-cell subtypes (Chen et al., Frontiers in Genetics, 2019)
 

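***Comparison of run time and performance of different clustering methods for cell subtype identification***
Figure 5. Comparison of the run times of different clustering methods (Duò et al., F1000Research, 2018)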

 

Figure 6. Comparison of run time (horizontal axis) and performance (vertical axis) of different clustering methods (Duò et al., F1000Research, 2018)


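As the figures above show, different clustering methods have different characteristics: some run quickly, while others give more accurate results. Choose software to suit the data at hand. We suggest choosing recently published software from high-quality journals (for more, see AIPuFu: www.aipufu.com).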
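A small-scale version of this kind of comparison can be run on your own data. The sketch below times two scikit-learn clustering methods on the same synthetic matrix and scores each against known labels. The data, the two methods chosen, and the use of the adjusted Rand index (ARI) as the performance measure are assumptions for illustration. It does not reproduce the benchmark in the figures.

```python
import time

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(2)

# Synthetic reduced data (assumption): 800 cells x 20 features, four groups.
n_per_group, n_features, n_groups = 200, 20, 4
centers = rng.normal(0.0, 4.0, size=(n_groups, n_features))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(n_per_group, n_features)) for c in centers])
truth = np.repeat(np.arange(n_groups), n_per_group)

methods = {
    "k-means": KMeans(n_clusters=n_groups, n_init=10, random_state=0),
    "hierarchical (Ward)": AgglomerativeClustering(n_clusters=n_groups, linkage="ward"),
}

results = {}
for name, model in methods.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start
    # Store (run time in seconds, agreement with the known labels).
    results[name] = (elapsed, adjusted_rand_score(truth, labels))
    print(f"{name}: {elapsed:.3f} s, ARI = {results[name][1]:.2f}")
```

Timings depend on hardware, data size, and implementation, so a toy run like this only shows how to set up the comparison, not which method is faster in general.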
