celaref ||单细胞细胞类型定义工具

Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.

当我们拿到单细胞转录组数据的时候，不管结果怎么样，翻来覆去发现其实就是一个基因和细胞的矩阵而已！一般基于这个矩阵，会先过滤细胞均一化，降维，聚类（最重要的一步，因为它可以发现细胞的异质性），聚类结果的差异分析（声称找到了每个亚群的marker基因），最后做一下数据的可视化。

且慢，用非监督的方式聚出来的类（一般叫做为分的类数），既然是细胞异质性的体现，那么它到底是什么细胞呢，可以叫做什么，是CD4细胞呢还是microglia细胞？因为是没有意义的，只有起了像CD4这样的名字，这类细胞才有故事可以讲。

在定义细胞类型之前，需要确定就哪种聚类结果来做，是图聚类的结果还是k-means某一类的结果。如何来确定？看看分群的tsne图是一个不错的参考。

那么呢，之前你们家大师推荐过几个数据库single cell marker 基因数据库是基于某亚群细胞的marker基因，根据已经知道marker基因的细胞类型，这些数据库其实就是一个细胞类型和marker基因大表。根据marker基因列表与分析出来的差异基因列表以及基因丰度的关系（往往是做一个热图）来推断某个cluster属于哪一种细胞类型。

另外还有一种定义细胞类型的方法是通过每个cluster与已知细胞类型的表达谱的相似性来定义细胞类型，已有的方法为SingleR、celaref。均是R包，其中celaref是2019年的新品，也是本文主要介绍的一种方法。

celaref 简介

走遍示例流程

安装celaref包的时候，就把示例数据也下载了，首先载入数据，然后观察一下示例数据是怎么组织的，以便我们以后组织自己的数据。在接触一个新包的时候，最好的学习方法就是用示例数据走一遍然后看看有哪些功能（函数以及函数的参数可以用），多用help或者？来查看细节。一个R包封装的越好也就要求我们花时间去了解他的细节。

library(celaref)

# Paths to data files.
counts_filepath.query    <- system.file("extdata", "sim_query_counts.tab",    package = "celaref")
cell_info_filepath.query <- system.file("extdata", "sim_query_cell_info.tab", package = "celaref")
counts_filepath.ref      <- system.file("extdata", "sim_ref_counts.tab",      package = "celaref")
cell_info_filepath.ref   <- system.file("extdata", "sim_ref_cell_info.tab",   package = "celaref")

# Load data
toy_ref_se   <- load_se_from_files(counts_file=counts_filepath.ref, cell_info_file=cell_info_filepath.ref)
head(toy_ref_se)
toy_query_se <- load_se_from_files(counts_file=counts_filepath.query, cell_info_file=cell_info_filepath.query)

可见我载入了两个数据：一个是toy_ref_se，reference；一个是toy_query_se，query.都是SummarizedExperiment对象:

对数据进行过滤，help一下这个函数，看看都有哪些过滤条件。

# Filter data
toy_ref_se     <- trim_small_groups_and_low_expression_genes(toy_ref_se)
toy_query_se   <- trim_small_groups_and_low_expression_genes(toy_query_se)

对参考和需要鉴定的表达谱分别做差异分析：

# Setup within-experiment differential expression
de_table.toy_ref   <- contrast_each_group_to_the_rest(toy_ref_se,    dataset_name="ref")
de_table.toy_query <- contrast_each_group_to_the_rest(toy_query_se,  dataset_name="query")

分别得到差异基因结果：

绘制一组小提琴图，显示每个查询组的“top”基因在参考数据库中的分布。

# Plot
make_ranking_violin_plot(de_table.test=de_table.toy_query, de_table.ref=de_table.toy_ref)

根据参考数据集的相似性，在查询数据集中构造一些合理的标签或组/集群。

# And get group labels
make_ref_similarity_names(de_table.toy_query, de_table.toy_ref)

可以看到除了Group2 没有相应的similarity 类似的type之外，其余的都找到了相对应的细胞类型（基于检验的，不是生物学的）。这当然可以为我们定义细胞类型做一个统计上的参考。

准备数据

那么，如何应用我们自己的数据来做呢？官网也给出了数据准备的方法：

在准备数据时需要以下数据：

Counts Matrix Number of gene tags per gene per cell.
Cell information A sample-sheet table of cell-level information. Only two fields are essential:
cell_sample: A unique cell identifier
group: The cluster/group to which the cell has been assigned.
Gene information Optional. A table of extra gene-level information.
ID: A unique gene identifier

输入数据：

tab键分割的ounts matrix

gene	Cell1	cell2	cell3	cell4	…	cell954
GeneA	0	1	0	1	…	0
GeneB	0	3	0	2	…	2
GeneC	1	40	1	0	…	0
....

tab键分割的细胞分群信息

CellId	Sample	Cluster
cell1	Control	cluster1
cell2	Control	cluster7
…	…	…
cell954	KO	cluster8

From 10X pipeline output

dataset_se <- load_dataset_10Xdata('~/path/to/data/10X_mydata', 
                                   dataset_genome = "GRCh38", 
                                   clustering_set = "kmeans_7_clusters")

拿一个10X的例子试试吧

#Prepare 10X query dataset

先要把数据下载好，发现这应用的对象还是cellranger V2 pipeline的输出结果，注意为了和官网数据格式保持一致，我把文件夹命名为了，这个前后一致就可以了。

+--- 10X_pbmc4k
|   +--- analysis
|   |   +--- clustering
|   |   |   +--- graphclust
|   |   |   |   +--- clusters.csv
|   |   |   +--- kmeans_10_clusters
|   |   |   |   +--- clusters.csv
|   |   |   +--- kmeans_2_clusters
···
|   |   +--- diffexp
|   |   |   +--- graphclust
|   |   |   |   +--- differential_expression.csv
|   |   |   +--- kmeans_10_clusters
···
|   |   +--- pca
|   |   |   +--- 10_components
|   |   |   |   +--- components.csv
|   |   |   |   +--- dispersion.csv
|   |   |   |   +--- genes_selected.csv
|   |   |   |   +--- projection.csv
|   |   |   |   +--- variance.csv
|   |   +--- tsne
|   |   |   +--- 2_components
|   |   |   |   +--- projection.csv
|   +--- filtered_gene_bc_matrices
|   |   +--- GRCh38
|   |   |   +--- barcodes.tsv
|   |   |   +--- genes.tsv
|   |   |   +--- matrix.mtx

datasets_dir <- "D:\\Users\\Administrator\\Desktop\\RStudio\\single_cell\\celaref/celaref_extra_vignette_data/datasets"
dir(datasets_dir)

dataset_se.10X_pbmc4k_k7 <- load_dataset_10Xdata(
  dataset_path   = file.path(datasets_dir,'10X_pbmc4k'), 
  dataset_genome = "GRCh38", 
  clustering_set = "kmeans_7_clusters", 
  id_to_use      = "GeneSymbol")
dataset_se.10X_pbmc4k_k7 <- trim_small_groups_and_low_expression_genes(dataset_se.10X_pbmc4k_k7)

可见我选择的是kmeans_7_clusters 的聚类结果来进行细胞定义。处理过之后：

> dataset_se.10X_pbmc4k_k7
class: SummarizedExperiment 
dim: 15407 4340 
metadata(0):
assays(1): ''
rownames(15407): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
rowData names(4): ensembl_ID GeneSymbol ID total_count
colnames(4340): AAACCTGAGAAGGCCT-1 AAACCTGAGACAGACC-1 ... TTTGTCAGTTAAGACA-1 TTTGTCATCCCAAGAT-1
colData names(2): cell_sample group

下一步是对数据的转化和处理，也是one-to-others的DE分析，非常的慢和消耗内存，有可能跑断的。关键是这个函数做了是什么？其实就是把我们的矩阵文件整理成上面演示的格式。

de_table.10X_pbmc4k_k7   <- contrast_each_group_to_the_rest(dataset_se.10X_pbmc4k_k7, dataset_name="10X_pbmc4k_k7", num_cores=7)
Sorry, multithreading not supported on windows. Use linux, or set num_threads = 1 to suppress this message.Running single threaded  # 这个sorry 可以忽略 ，在Linux下才能指定运行进程数。
`fData` has no primerid.  I'll make something up.
`cData` has no wellKey.  I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
                                                                                                                                                          
Done!
Combining coefficients and standard errors
Calculating log-fold changes
Calculating likelihood ratio tests
Refitting on reduced model...
                                                                                                                                                          
Done!
`fData` has no primerid.  I'll make something up.
`cData` has no wellKey.  I'll make something up.
Assuming data assay in position 1, with name et is log-transformed.
 Completed [=======>----------------------------------------------------------------------------------------------------------------]   7% with 0 failures

windows下居然跑了一上午

准备 reference数据集

来自响应的数据库，因为我们的是血细胞，选择的是haemosphere .

A suitable PBMC reference (a ‘HaemAtlas’) has been published by . They purified populations of PBMC cell types and measured gene expression via microarray. The data used here was downloaded in a normalised table from the website (Graaf et al. 2016).

下载页面

下载完之后由于是微阵列数据需要做一些特殊处理。

Logged, normalised expression values. Any low expression or poor quality measurements should have already been removed.
Sample information (see contrast_each_group_to_the_rest_for_norm_ma_with_limma for details)

library(readr)
this_dataset_dir     <- file.path(datasets_dir,     'haemosphere_datasets','watkins')
norm_expression_file <- file.path(this_dataset_dir, "watkins_expression.txt")
samples_file         <- file.path(this_dataset_dir, "watkins_samples.txt")

norm_expression_table.full <- read.table(norm_expression_file, sep="\t", header=TRUE, quote="", comment.char="", row.names=1, check.names=FALSE)

samples_table              <- read_tsv(samples_file, col_types = cols())
samples_table$description  <- make.names( samples_table$description) # Avoid group or extra_factor names starting with numbers, for microarrays

> samples_table$description
 [1] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult"
 [8] "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X49.years.adult" "X63.years.adult" "X63.years.adult"
[15] "X63.years.adult" "X63.years.adult" "X63.years.adult" "X63.years.adult" "X53.years.adult" "X53.years.adult" "X53.years.adult"
[22] "X53.years.adult" "X53.years.adult" "X53.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult" "X60.years.adult"
[29] "X60.years.adult" "X60.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult" "X48.years.adult"
[36] "X48.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult" "X57.years.adult"
[43] "NA."             "NA."             "NA."             "NA."             "NA."             "NA."             "NA."            
[50] "NA."

从样本表中可以看出，这个数据集包含了其他组织，但是作为PBMC的参考，我们只考虑外周血样本。与其他数据加载函数一样，要从分析中删除一个样本(或单元格)，从样本表中删除它就ok了。

amples_table        <- samples_table[samples_table$tissue == "Peripheral Blood",] 
> samples_table
# A tibble: 42 x 7
   sampleId     celltype            cell_lineage       surface_markers tissue           description     notes
   <chr>        <chr>               <chr>              <lgl>           <chr>            <chr>           <lgl>
 1 1674120023_A B lymphocyte        B Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 2 1674120023_B granulocyte         Neutrophil Lineage NA              Peripheral Blood X49.years.adult NA   
 3 1674120023_C natural killer cell NK Cell Lineage    NA              Peripheral Blood X49.years.adult NA   
 4 1674120023_D Th lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 5 1674120023_E Tc lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 6 1674120023_F monocyte            Macrophage Lineage NA              Peripheral Blood X49.years.adult NA   
 7 1674120053_A B lymphocyte        B Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
 8 1674120053_B monocyte            Macrophage Lineage NA              Peripheral Blood X49.years.adult NA   
 9 1674120053_C granulocyte         Neutrophil Lineage NA              Peripheral Blood X49.years.adult NA   
10 1674120053_D Tc lymphocyte       T Cell Lineage     NA              Peripheral Blood X49.years.adult NA   
# ... with 32 more rows

通常情况下，最难的部分是格式化输入。应该使用与查询数据集相同的id，将微阵列表达式值作为规范化的、日志转换的数据。任何探头或样品级的过滤也应事先进行。在这种情况下，从网站获得的数据是正常的，但仍然需要匹配id。

library("tidyverse")
library("illuminaHumanv2.db")
probes_with_gene_symbol_and_with_data <- intersect(keys(illuminaHumanv2SYMBOL),rownames(norm_expression_table.full))

# Get mappings - non NA
probe_to_symbol <- select(illuminaHumanv2.db, keys=rownames(norm_expression_table.full), columns=c("SYMBOL"), keytype="PROBEID")
probe_to_symbol <- unique(probe_to_symbol[! is.na(probe_to_symbol$SYMBOL),])
# no multimapping probes
genes_per_probe <- table(probe_to_symbol$PROBEID) # How many genes a probe is annotated against?
multimap_probes <- names(genes_per_probe)[genes_per_probe  > 1]
probe_to_symbol <- probe_to_symbol[!probe_to_symbol$PROBEID %in% multimap_probes, ]

convert_expression_table_ids<- function(expression_table, the_probes_table, old_id_name, new_id_name){
  
  the_probes_table <- the_probes_table[,c(old_id_name, new_id_name)]
  colnames(the_probes_table) <- c("old_id", "new_id")
  
  # Before DE, just pick the top expresed probe to represent the gene
  # Not perfect, but this is a ranking-based analysis.
  # hybridisation issues aside, would expect higher epressed probes to be more relevant to Single cell data anyway.
  probe_expression_levels <- rowSums(expression_table)
  the_probes_table$avgexpr <- probe_expression_levels[as.character(the_probes_table$old_id)]
  
  the_genes_table <-  the_probes_table %>% 
    group_by(new_id) %>%
    top_n(1, avgexpr)
  
  expression_table <- expression_table[the_genes_table$old_id,]
  rownames(expression_table) <- the_genes_table$new_id
  
  return(expression_table)
}

# Just the most highly expressed probe foreach gene.
norm_expression_table.genes <- convert_expression_table_ids(norm_expression_table.full, 
                                                            probe_to_symbol, old_id_name="PROBEID", new_id_name="SYMBOL")



# Go...
de_table.Watkins2009PBMCs <- contrast_each_group_to_the_rest_for_norm_ma_with_limma(
  norm_expression_table = norm_expression_table.genes, 
  sample_sheet_table    = samples_table, 
  dataset_name          = "Watkins2009PBMCs", 
  extra_factor_name     = 'description', 
  sample_name           = "sampleId",
  group_name            = 'celltype')
> de_table.Watkins2009PBMCs
# A tibble: 113,376 x 21
   ID    logFC AveExpr     t  P.Value adj.P.Val     B     ci CI_upper CI_lower ci_inner ci_outer log2FC      fdr group sig   sig_up
   <chr> <dbl>   <dbl> <dbl>    <dbl>     <dbl> <dbl>  <dbl>    <dbl>    <dbl>    <dbl>    <dbl>  <dbl>    <dbl> <fct> <lgl> <lgl> 
 1 JCHA~  6.86    8.65  27.5 7.13e-26  2.70e-23  49.0 0.0952     6.96     6.77     6.77     6.96   6.86 2.70e-23 B ly~ TRUE  TRUE  
 2 FCRLA  6.41    8.77  23.7 1.13e-23  3.29e-21  44.0 0.103      6.51     6.30     6.30     6.51   6.41 3.29e-21 B ly~ TRUE  TRUE  
 3 VPRE~  6.18    8.66  40.2 1.20e-31  8.70e-29  62.0 0.0587     6.23     6.12     6.12     6.23   6.18 8.70e-29 B ly~ TRUE  TRUE  
 4 TCL1A  6.09    8.52  42.8 1.33e-32  1.14e-29  64.1 0.0544     6.14     6.04     6.04     6.14   6.09 1.14e-29 B ly~ TRUE  TRUE  
 5 CD79A  6.09    9.08  22.7 5.17e-23  1.40e-20  42.4 0.102      6.20     5.99     5.99     6.20   6.09 1.40e-20 B ly~ TRUE  TRUE  
 6 CD19   5.90    8.44  38.4 6.09e-31  4.11e-28  60.5 0.0587     5.96     5.85     5.85     5.96   5.90 4.11e-28 B ly~ TRUE  TRUE  
 7 HLA-~  5.61    8.35  47.9 2.30e-34  2.56e-31  68.0 0.0447     5.66     5.57     5.57     5.66   5.61 2.56e-31 B ly~ TRUE  TRUE  
 8 OSBP~  5.48    7.83  53.2 5.60e-36  8.82e-33  71.4 0.0394     5.52     5.45     5.45     5.52   5.48 8.82e-33 B ly~ TRUE  TRUE  
 9 CNTN~  5.38    7.95  49.3 8.12e-35  9.60e-32  68.9 0.0416     5.42     5.34     5.34     5.42   5.38 9.60e-32 B ly~ TRUE  TRUE  
10 COBL~  5.26    7.70  56.4 6.77e-37  1.28e-33  73.3 0.0356     5.30     5.23     5.23     5.30   5.26 1.28e-33 B ly~ TRUE  TRUE  
# ... with 113,366 more rows, and 4 more variables: gene_count <int>, rank <int>, rescaled_rank <dbl>, dataset <chr>

细胞类型定义（Compare 10X query PBMCs to to reference）

数据准备好就和之前的示例是一样的了。

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs)

Hmm, there’s a few clusters where different the top genes are bunched near the top for a couple of different reference cell types.

Logging the plot will be more informative at the top end for this dataset.

make_ranking_violin_plot(de_table.test=de_table.10X_pbmc4k_k7, de_table.ref=de_table.Watkins2009PBMCs, log10trans = TRUE)

可以打标签啦。

 label_table.pbmc4k_k7_vs_Watkins2009PBMCs <- make_ref_similarity_names(de_table.10X_pbmc4k_k7, de_table.Watkins2009PBMCs)
label_table.pbmc4k_k7_vs_Watkins2009PBMCs

可以看出，七个群有六个找到了与之相似的群名称，并给出了pval.

make_ranking_violin_plot(de_table.test=de_table.Watkins2009PBMCs, de_table.ref=de_table.10X_pbmc4k_k7, log10trans = TRUE)

行啦，人pbmc细胞的定义工作就完成了。官网也提供了小鼠细胞类型识别的教程。这里就不再复述啦。

完成了细胞类型定义，单细胞的故事才刚刚开始。所以，这不是结束，甚至不是结束的开始，而可能是开始的结束。

SingleR
celaref
Sarah Williams (2019). celaref: Single-cell RNAseq cell cluster labelling by reference. R package version 1.0.1.
SummarizedExperiment

转载于:https://www.jianshu.com/p/c11ff102baa4