celaref ||单细胞细胞类型定义工具

Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.

当我们拿到单细胞转录组数据的时候,不管结果怎么样,翻来覆去发现其实就是一个基因和细胞的矩阵而已!一般基于这个矩阵,会先过滤细胞均一化,降维,聚类(最重要的一步,因为它可以发现细胞的异质性),聚类结果的差异分析(声称找到了每个亚群的marker基因),最后做一下数据的可视化。

且慢,用非监督的方式聚出来的类(一般叫做为分的类数),既然是细胞异质性的体现,那么它到底是什么细胞呢,可以叫做什么,是CD4细胞呢还是microglia细胞?因为是没有意义的,只有起了像CD4这样的名字,这类细胞才有故事可以讲。

在定义细胞类型之前,需要确定就哪种聚类结果来做,是图聚类的结果还是k-means某一类的结果。如何来确定?看看分群的tsne图是一个不错的参考。

那么呢,之前你们家大师推荐过几个数据库single cell marker 基因数据库是基于某亚群细胞的marker基因,根据已经知道marker基因的细胞类型,这些数据库其实就是一个细胞类型和marker基因大表。根据marker基因列表与分析出来的差异基因列表以及基因丰度的关系(往往是做一个热图)来推断某个cluster属于哪一种细胞类型。

另外还有一种定义细胞类型的方法是通过每个cluster与已知细胞类型的表达谱的相似性来定义细胞类型,已有的方法为SingleRcelaref。均是R包,其中celaref是2019年的新品,也是本文主要介绍的一种方法。

celaref 简介
7600498-fb99a9ab9b3672e0.png
走遍示例流程

安装celaref包的时候,就把示例数据也下载了,首先载入数据,然后观察一下示例数据是怎么组织的,以便我们以后组织自己的数据。在接触一个新包的时候,最好的学习方法就是用示例数据走一遍然后看看有哪些功能(函数以及函数的参数可以用),多用help或者?来查看细节。一个R包封装的越好也就要求我们花时间去了解他的细节。

library(celaref)

# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# Load data
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
head(toy_ref_se)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

可见我载入了两个数据:一个是toy_ref_se,reference;一个是toy_query_se,query.都是SummarizedExperiment对象:

7600498-c808e8db3d07e21f.png
7600498-b4ced2d0121f8906.png

对数据进行过滤,help一下这个函数,看看都有哪些过滤条件。

# Filter data
toy_ref_se     <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se   <- trim_small_groups_and_low_expression_genes(toy_query_se)

对参考和需要鉴定的表达谱分别做差异分析:

# Setup within-experiment differential expression
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se,    dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se,  dataset_name="query")

分别得到差异基因结果:


7600498-a8b951f13312952e.png

绘制一组小提琴图,显示每个查询组的“top”基因在参考数据库中的分布。

# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)

7600498-6b94b076da84016d.png

根据参考数据集的相似性,在查询数据集中构造一些合理的标签或组/集群。

# And get group labels
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)
7600498-09855f84ec954fa8.png

可以看到除了Group2 没有相应的similarity 类似的type之外,其余的都找到了相对应的细胞类型(基于检验的,不是生物学的)。这当然可以为我们定义细胞类型做一个统计上的参考。

准备数据

那么,如何应用我们自己的数据来做呢?官网也给出了数据准备的方法:

在准备数据时需要以下数据:

  • Counts Matrix Number of gene tags per gene per cell.
  • Cell information A sample-sheet table of cell-level information. Only two fields are essential:
    cell_sample: A unique cell identifier
    group: The cluster/group to which the cell has been assigned.
  • Gene information Optional. A table of extra gene-level information.
    ID: A unique gene identifier

输入数据:

  1. tab键分割的ounts matrix
gene Cell1 cell2 cell3 cell4 cell954
GeneA 0 1 0 1 0
GeneB 0 3 0 2 2
GeneC 1 40 1 0 0
....
  1. tab键分割的细胞分群信息
CellId Sample Cluster
cell1 Control cluster1
cell2 Control cluster7
cell954 KO cluster8

From 10X pipeline output

dataset_se <- load_dataset_10Xdata('~/path/to/data/10X_mydata', 
                                   dataset_genome = "GRCh38", 
                                   clustering_set = "kmeans_7_clusters") 
拿一个10X的例子试试吧
#Prepare 10X query dataset

先要把数据下载好,发现这应用的对象还是cellranger V2 pipeline的输出结果,注意为了和官网数据格式保持一致,我把文件夹命名为了,这个前后一致就可以了。

+--- 10X_pbmc4k
|   +--- analysis
|   |   +--- clustering
|   |   |   +--- graphclust
|   |   |   |   +--- clusters.csv
|   |   |   +--- kmeans_10_clusters
|   |   |   |   +--- clusters.csv
|   |   |   +--- kmeans_2_clusters
···
|   |   +--- diffexp
|   |   |   +--- graphclust
|   |   |   |   +--- differential_expression.csv
|   |   |   +--- kmeans_10_clusters
···
|   |   +--- pca
|   |   |   +--- 10_components
|   |   |   |   +--- components.csv
|   |   |   |   +--- dispersion.csv
|   |   |   |   +--- genes_selected.csv
|   |   |   |   +--- projection.csv
|   |   |   |   +--- variance.csv
|   |   +--- tsne
|   |   |   +--- 2_components
|   |   |   |   +--- projection.csv
|   +--- filtered_gene_bc_matrices
|   |   +--- GRCh38
|   |   |   +--- barcodes.tsv
|   |   |   +--- genes.tsv
|   |   |   +--- matrix.mtx

datasets_dir <- "D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\celaref/celaref_extra_vignette_data/datasets"
dir(datasets_dir)

dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
  dataset_path   = file.path(datasets_dir,'10X_pbmc4k'), 
  dataset_genome = "GRCh38", 
  clustering_set = "kmeans_7_clusters", 
  id_to_use      = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7)

可见我选择的是kmeans_7_clusters 的聚类结果来进行细胞定义。处理过之后:

> dataset_se.10X_pbmc4k_k7
class: SummarizedExperiment 
dim: 15407 4340 
metadata(0):
assays(1): ''
rownames(15407): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(4): ensembl_ID GeneSymbol ID total_count
colnames(4340): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
colData names(2): cell_sample group

下一步是对数据的转化和处理,也是one-to-others的DE分析,非常的慢和消耗内存,有可能跑断的。关键是这个函数做了是什么?其实就是把我们的矩阵文件整理成上面演示的格式。

de_table.10X_pbmc4k_k7   <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7)
Sorry, multithreading not supported on windows. Use linux, or set num_threads = 1 to suppress this message.Running single threaded  # 这个sorry 可以忽略 ,在Linux下才能指定运行进程数。
`fData` has no primerid.  I'll make something up.
`cData` has no wellKey.  I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
                                                                                                                                                          
Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...
                                                                                                                                                          
Done!
`fData` has no primerid.  I'll make something up.
`cData` has no wellKey.  I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
 Completed [=======>----------------------------------------------------------------------------------------------------------------]   7% with 0 failures

windows下居然跑了一上午

准备 reference数据集

来自响应的数据库,因为我们的是血细胞,选择的是haemosphere .

A suitable PBMC reference (a ‘HaemAtlas’) has been published by . They purified populations of PBMC cell types and measured gene expression via microarray. The data used here was downloaded in a normalised table from the website (Graaf et al. 2016).

7600498-fe21d50046bb8b55.png
下载页面

下载完之后由于是微阵列数据需要做一些特殊处理。

  • Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
  • Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)
library(readr)
this_dataset_dir     <- file.path(datasets_dir,     'haemosphere_datasets','watkins')
norm_expression_file <- file.path(this_dataset_dir, "watkins_expression.txt")
samples_file         <- file.path(this_dataset_dir, "watkins_samples.txt")

norm_expression_table.full <- read.table(norm_expression_file, sep="\t", header=TRUE, quote="", comment.char="", row.names=1, check.names=FALSE)

samples_table              <- read_tsv(samples_file, col_types = cols())
samples_table$description  <- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays

> samples_table$description
 [1] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult"
 [8] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X63.years.adult" "X63.years.adult"
[15] "X63.years.adult" "X63.years.adult" "X63.years.adult" "X63.years.adult" "X53.years.adult" "X53.years.adult" "X53.years.adult"
[22] "X53.years.adult" "X53.years.adult" "X53.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult"
[29] "X60.years.adult" "X60.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult"
[36] "X48.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult"
[43] "NA."             "NA."             "NA."             "NA."             "NA."             "NA."             "NA."            
[50] "NA."            

从样本表中可以看出,这个数据集包含了其他组织,但是作为PBMC的参考,我们只考虑外周血样本。与其他数据加载函数一样,要从分析中删除一个样本(或单元格),从样本表中删除它就ok了。

amples_table        <- samples_table[samples_table$tissue == "Peripheral Blood",] 
> samples_table
# A tibble: 42 x 7
   sampleId     celltype            cell_lineage       surface_markers tissue           description     notes
   <chr>        <chr>               <chr>              <lgl>           <chr>            <chr>           <lgl>
 1 1674120023_A B lymphocyte        B Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 2 1674120023_B granulocyte         Neutrophil Lineage NA              Peripheral Blood X49.years.adult NA   
 3 1674120023_C natural killer cell NK Cell Lineage    NA              Peripheral Blood X49.years.adult NA   
 4 1674120023_D Th lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 5 1674120023_E Tc lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 6 1674120023_F monocyte            Macrophage Lineage NA              Peripheral Blood X49.years.adult NA   
 7 1674120053_A B lymphocyte        B Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 8 1674120053_B monocyte            Macrophage Lineage NA              Peripheral Blood X49.years.adult NA   
 9 1674120053_C granulocyte         Neutrophil Lineage NA              Peripheral Blood X49.years.adult NA   
10 1674120053_D Tc lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
# ... with 32 more rows

通常情况下,最难的部分是格式化输入。应该使用与查询数据集相同的id,将微阵列表达式值作为规范化的、日志转换的数据。任何探头或样品级的过滤也应事先进行。在这种情况下,从网站获得的数据是正常的,但仍然需要匹配id。

library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))

# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe  > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]

convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){
  
  the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]
  colnames(the_probes_table) <- c("old_id", "new_id")
  
  # Before DE, just pick the top expresed probe to represent the gene
  # Not perfect, but this is a ranking-based analysis.
  # hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.
  probe_expression_levels <- rowSums(expression_table)
  the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]
  
  the_genes_table <-  the_probes_table %>% 
    group_by(new_id) %>%
    top_n(1, avgexpr)
  
  expression_table <- expression_table[the_genes_table$old_id,]
  rownames(expression_table) <- the_genes_table$new_id
  
  return(expression_table)
}

# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full, 
                                                            probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")



# Go...
de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(
  norm_expression_table = norm_expression_table.genes, 
  sample_sheet_table    = samples_table, 
  dataset_name          = "Watkins2009PBMCs", 
  extra_factor_name     = 'description', 
  sample_name           = "sampleId",
  group_name            = 'celltype')
> de_table.Watkins2009PBMCs
# A tibble: 113,376 x 21
   ID    logFC AveExpr     t  P.Value adj.P.Val     B     ci CI_upper CI_lower ci_inner ci_outer log2FC      fdr group sig   sig_up
   <chr> <dbl>   <dbl> <dbl>    <dbl>     <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <fct> <lgl> <lgl> 
 1 JCHA~  6.86    8.65  27.5 7.13e-26  2.70e-23  49.0 0.0952     6.96     6.77     6.77     6.96   6.86 2.70e-23 B ly~ TRUE  TRUE  
 2 FCRLA  6.41    8.77  23.7 1.13e-23  3.29e-21  44.0 0.103      6.51     6.30     6.30     6.51   6.41 3.29e-21 B ly~ TRUE  TRUE  
 3 VPRE~  6.18    8.66  40.2 1.20e-31  8.70e-29  62.0 0.0587     6.23     6.12     6.12     6.23   6.18 8.70e-29 B ly~ TRUE  TRUE  
 4 TCL1A  6.09    8.52  42.8 1.33e-32  1.14e-29  64.1 0.0544     6.14     6.04     6.04     6.14   6.09 1.14e-29 B ly~ TRUE  TRUE  
 5 CD79A  6.09    9.08  22.7 5.17e-23  1.40e-20  42.4 0.102      6.20     5.99     5.99     6.20   6.09 1.40e-20 B ly~ TRUE  TRUE  
 6 CD19   5.90    8.44  38.4 6.09e-31  4.11e-28  60.5 0.0587     5.96     5.85     5.85     5.96   5.90 4.11e-28 B ly~ TRUE  TRUE  
 7 HLA-~  5.61    8.35  47.9 2.30e-34  2.56e-31  68.0 0.0447     5.66     5.57     5.57     5.66   5.61 2.56e-31 B ly~ TRUE  TRUE  
 8 OSBP~  5.48    7.83  53.2 5.60e-36  8.82e-33  71.4 0.0394     5.52     5.45     5.45     5.52   5.48 8.82e-33 B ly~ TRUE  TRUE  
 9 CNTN~  5.38    7.95  49.3 8.12e-35  9.60e-32  68.9 0.0416     5.42     5.34     5.34     5.42   5.38 9.60e-32 B ly~ TRUE  TRUE  
10 COBL~  5.26    7.70  56.4 6.77e-37  1.28e-33  73.3 0.0356     5.30     5.23     5.23     5.30   5.26 1.28e-33 B ly~ TRUE  TRUE  
# ... with 113,366 more rows, and 4 more variables: gene_count <int>, rank <int>, rescaled_rank <dbl>, dataset <chr>

细胞类型定义(Compare 10X query PBMCs to to reference)

数据准备好就和之前的示例是一样的了。

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs)

7600498-373cfda698937f67.png

Hmm, there’s a few clusters where different the top genes are bunched near the top for a couple of different reference cell types.

Logging the plot will be more informative at the top end for this dataset.

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs, log10trans = TRUE)
7600498-7b4896325664fa00.png

可以打标签啦。

 label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)
label_table.pbmc4k_k7_vs_Watkins2009PBMCs
7600498-a8cac77eb45d909e.png

可以看出,七个群有六个找到了与之相似的群名称,并给出了pval.

make_ranking_violin_plot(de_table.test=de_table.Watkins2009PBMCs, de_table.ref=de_table.10X_pbmc4k_k7, log10trans = TRUE)
7600498-cc6333af78020738.png

行啦,人pbmc细胞的定义工作就完成了。官网也提供了小鼠细胞类型识别的教程。这里就不再复述啦。

完成了细胞类型定义,单细胞的故事才刚刚开始。所以,这不是结束,甚至不是结束的开始,而可能是开始的结束。


SingleR
celaref
Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
SummarizedExperiment

转载于:https://www.jianshu.com/p/c11ff102baa4

猜你喜欢

转载自blog.csdn.net/weixin_33860528/article/details/91323310