代码分析 | 单细胞转录组Normalization详解

标准化加高变基因选择

NGS系列文章包括NGS基础、转录组分析（Nature重磅综述|关于RNA-seq你想知道的全在这）、ChIP-seq分析（ChIP-seq基本分析流程）、单细胞测序分析 (重磅综述：三万字长文读懂单细胞RNA测序分析的最佳实践教程（原理、代码和评述）)、DNA甲基化分析、重测序分析、GEO数据挖掘（典型医学设计实验GEO数据分析 (step-by-step) - Limma差异分析、火山图、功能富集）等内容。

最近找到了芬兰CSC-IT科学中心主讲的生物信息课程(https://www.csc.fi/web/training/-/scrnaseq )视频，官网上还提供了练习素材以及详细代码，今天就来练习一下单细胞数据Normalization以及高变基因选择的过程。

在本练习中，我们将数据标准化，然后比较两个软件包scran和Seurat分析得到的高变基因，并使用scater软件包进行可视化（Hemberg-lab单细胞转录组数据分析（十二）- Scater单细胞表达谱tSNE可视化）。

knitr::opts_chunk$set(echo = TRUE)
library(scater)
library(scran)
library(Seurat)
options(stringsAsFactors = FALSE)
set.seed(32546)

可以使用以下命令下载数据，使用的数据是上次QC部分中使用的10X数据集-1k PBMCs using 10x v3 chemistry

system("curl -O http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_filtered_feature_bc_matrix.h5")

读入数据并设置SingleCellExperiment对象：

# setwd("~/scrna-seq2019/day1/2-normalization/")
pbmc.mat <- Read10X_h5("pbmc_1k_v3_filtered_feature_bc_matrix.h5")
pbmc.sce <- SingleCellExperiment(assays = list(counts = as.matrix(pbmc.mat)))
pbmc.sce <- pbmc.sce[rowSums(counts(pbmc.sce) > 0) > 2,]
isSpike(pbmc.sce, "MT") <- grepl("^MT-", rownames(pbmc.sce))
pbmc.sce <- calculateQCMetrics(pbmc.sce)
colnames(colData(pbmc.sce))

##  [1] "is_cell_control"
##  [2] "total_features_by_counts"
##  [3] "log10_total_features_by_counts"
##  [4] "total_counts"
##  [5] "log10_total_counts"
##  [6] "pct_counts_in_top_50_features"
##  [7] "pct_counts_in_top_100_features"
##  [8] "pct_counts_in_top_200_features"
##  [9] "pct_counts_in_top_500_features"
## [10] "total_features_by_counts_endogenous"
## [11] "log10_total_features_by_counts_endogenous"
## [12] "total_counts_endogenous"
## [13] "log10_total_counts_endogenous"
## [14] "pct_counts_endogenous"
## [15] "pct_counts_in_top_50_features_endogenous"
## [16] "pct_counts_in_top_100_features_endogenous"
## [17] "pct_counts_in_top_200_features_endogenous"
## [18] "pct_counts_in_top_500_features_endogenous"
## [19] "total_features_by_counts_feature_control"
## [20] "log10_total_features_by_counts_feature_control"
## [21] "total_counts_feature_control"
## [22] "log10_total_counts_feature_control"
## [23] "pct_counts_feature_control"
## [24] "pct_counts_in_top_50_features_feature_control"
## [25] "pct_counts_in_top_100_features_feature_control"
## [26] "pct_counts_in_top_200_features_feature_control"
## [27] "pct_counts_in_top_500_features_feature_control"
## [28] "total_features_by_counts_MT"
## [29] "log10_total_features_by_counts_MT"
## [30] "total_counts_MT"
## [31] "log10_total_counts_MT"
## [32] "pct_counts_MT"
## [33] "pct_counts_in_top_50_features_MT"
## [34] "pct_counts_in_top_100_features_MT"
## [35] "pct_counts_in_top_200_features_MT"
## [36] "pct_counts_in_top_500_features_MT"

过滤质量较差的细胞：

pbmc.sce <- filter(pbmc.sce, pct_counts_MT < 20)
pbmc.sce <- filter(pbmc.sce,
                   total_features_by_counts > 1000 &
                     total_features_by_counts < 4100)

创建一个新的assay用于比较标准化和未标准化数据。

assay(pbmc.sce, "logcounts_raw") <- log2(counts(pbmc.sce) + 1)
plotRLE(pbmc.sce[,1:50], exprs_values = "logcounts_raw", style = "full")

运行PCA的方式降维（Hemberg-lab单细胞转录组数据分析（十一）- Scater单细胞表达谱PCA可视化）并将结果保存到新对象raw.sce中：

raw.sce <- runPCA(pbmc.sce, exprs_values = "logcounts_raw")
scater::plotPCA(raw.sce, colour_by = "total_counts")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

绘制B细胞标记基因MS4A1的表达：

plotReducedDim(raw.sce, use_dimred = "PCA", by_exprs_values = "logcounts_raw",
               colour_by = "MS4A1")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

以上图例告诉了你什么？
其实可以看出来B细胞是和其他细胞混在一起的，区分度并不明显。

标准化：Log

Seurat的默认标准化方法是：每个细胞的某一count先除以该细胞的总count，然后乘以scale因子-10000，再做个对数转换（单细胞分析Seurat使用相关的10个问题答疑精选！）。

要使用Seurat的标准化方法，我们先把来自SCE对象中的过滤数据用来创建Seurat对象，实施标准化后，再将结果转换回SingleCellExperiment对象，用于绘制比较图。

pbmc.seu <- CreateSeuratObject(counts(pbmc.sce), project = "PBMC")
pbmc.seu <- NormalizeData(pbmc.seu)
pbmc.seu.sce <- as.SingleCellExperiment(pbmc.seu)
pbmc.seu.sce <- calculateQCMetrics(pbmc.seu.sce)

执行PCA（一文看懂PCA主成分分析）并使用plotRLE和plotReducedDim检查标准化结果。这次使用“logcounts”绘图（“logcounts”是默认值，可省略此设置）。检查一些标记基因，例如GNLY（NK细胞）或LYZ单核细胞）。

plotRLE(pbmc.seu.sce[,1:50], style = "full")

pbmc.seu.sce <- runPCA(pbmc.seu.sce)
scater::plotPCA(pbmc.seu.sce, colour_by = "total_counts")

plotReducedDim(pbmc.seu.sce, use_dimred = "PCA", colour_by = "MS4A1")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

标准化：scran

scran中的标准化过程是基于Lun等人（2016）的反卷积方法：先合并许多细胞的计数以避免drop-out问题，然后将基于池的量化因子“去卷积”为cell-based factors，以进行特定细胞的标准化。而标准化之前对细胞进行聚类并非必要的，不过减少cluster中细胞之间的差异表达基因数量确实可以提高标准化的准确性。

qclust <- quickCluster(pbmc.sce)
pbmc.sce <- computeSumFactors(pbmc.sce, clusters = qclust)
summary(sizeFactors(pbmc.sce))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  0.1836  0.6380  0.8207  1.0000  1.2021  2.7399

pbmc.sce <- normalize(pbmc.sce)

## Warning in .get_all_sf_sets(object): spike-in set 'MT' should have its own
## size factors

plotRLE(pbmc.sce[,1:50], exprs_values = "logcounts", exprs_logged = FALSE, 
       style = "full")

pbmc.sce <- runPCA(pbmc.sce)
scater::plotPCA(pbmc.sce, colour_by = "total_counts")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

plotReducedDim(pbmc.sce, use_dimred = "PCA", colour_by = "MS4A1")

## Warning: 'add_ticks' is deprecated.
## Use '+ geom_rug(...)' instead.

高变基因选择

高变基因选择: scran

在用于发现highly variable gene（HVG）的方法中，scran是在没有spike-ins并假设大多数基因没有可变表达的情况下，先使用整个数据将趋势拟合到技术差异中，然后通过从总方差中减去趋势的拟合值得到对于每个内源基因的方差的生物学组成。

fit <- trendVar(pbmc.sce, use.spikes = NA)
decomp <- decomposeVar(pbmc.sce, fit)
top.hvgs <- order(decomp$bio, decreasing = TRUE)
head(decomp[top.hvgs,])

## DataFrame with 6 rows and 6 columns
##                     mean            total              bio
##                <numeric>        <numeric>        <numeric>
## S100A9   2.1798394106162 9.81200838511402 8.70848113171717
## S100A8  1.94717610389369 8.74191022471459 7.73278930922898
## LYZ     2.12848173541277 8.71130204756805 7.62859300817667
## HLA-DRA 2.26709495525931 5.56946701594201 4.43065009839707
## IGKC    1.03117091153778 4.64610433899492 3.94960628330731
## CD74    2.85525470083599 4.66046377538451 3.29006358386426
##                     tech               p.value                   FDR
##                <numeric>             <numeric>             <numeric>
## S100A9  1.10352725339686                     0                     0
## S100A8  1.00912091548561                     0                     0
## LYZ     1.08270903939138                     0                     0
## HLA-DRA 1.13881691754493                     0                     0
## IGKC    0.69649805568761                     0                     0
## CD74    1.37040019152025 6.00496990101363e-272 3.15447280938074e-269

plot(decomp$mean, decomp$total, xlab = "Mean log-expression", ylab = "Variance")o <- order(decomp$mean)lines(decomp$mean[o], decomp$tech[o], col = "red", lwd = 2)

该趋势不是非常适合此数据，这可能是因为数据集很小，只有大约1000个细胞。TrendVar函数的parametric = TRUE选项可以解决此问题。

我们使用5％的false discovery rate (FDR)来选择生物学组成明显大于零的基因。

hvg.out <- decomp[which(decomp$FDR <= 0.05),]
hvg.out <- hvg.out[order(hvg.out$bio, decreasing=TRUE),]
plotExpression(pbmc.sce, features = rownames(hvg.out)[1:10])

高变基因选择: Seurat

Seurat 3中的默认方法是方差稳定变换（方差齐性的一种变换）。通过数据拟合趋势可以预测每个基因随其均值的变化。对于每个基因，在所有细胞中计算标准化值的方差，并用于对特征进行排名。默认情况下返回top2000的基因。

pbmc.seu <- FindVariableFeatures(pbmc.seu, selection.method = "vst")
top10 <- head(VariableFeatures(pbmc.seu), 10)
vplot <- VariableFeaturePlot(pbmc.seu)
LabelPoints(plot = vplot, points = top10, repel = TRUE)

## When using repel, set xnudge and ynudge to 0 for optimal results
## Warning: Transformation introduced infinite values in continuous x-axis

Seurat的VariableFeatures中包含有多少scran检测到的可变基因呢？通过以下代码我们可以看出1119个基因的重合

table(rownames(hvg.out) %in% VariableFeatures(pbmc.seu))

#### FALSE  TRUE
##  1973  1119

sessionInfo()

## R version 3.5.1 (2018-07-02)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: CentOS release 6.10 (Final)
##
## Matrix products: default
## BLAS/LAPACK: /homeappl/appl_taito/opt/mkl/11.3.0/compilers_and_libraries_2016.0.109/linux/mkl/lib/intel64_lin/libmkl_gf_lp64.so
##
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets
## [8] methods   base
##
## other attached packages:
##  [1] Seurat_3.0.0                scran_1.10.2
##  [3] scater_1.10.1               ggplot2_3.1.1
##  [5] SingleCellExperiment_1.4.1  SummarizedExperiment_1.12.0
##  [7] DelayedArray_0.8.0          BiocParallel_1.16.5
##  [9] matrixStats_0.54.0          Biobase_2.42.0
## [11] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1
## [13] IRanges_2.16.0              S4Vectors_0.20.1
## [15] BiocGenerics_0.28.0
##
## loaded via a namespace (and not attached):
##   [1] Rtsne_0.15               ggbeeswarm_0.6.0
##   [3] colorspace_1.4-0         ggridges_0.5.1
##   [5] dynamicTreeCut_1.63-1    XVector_0.22.0
##   [7] BiocNeighbors_1.0.0      listenv_0.7.0
##   [9] npsurv_0.4-0             bit64_0.9-7
##  [11] ggrepel_0.8.0            codetools_0.2-16
##  [13] splines_3.5.1            R.methodsS3_1.7.1
##  [15] lsei_1.2-0               knitr_1.22
##  [17] jsonlite_1.6             ica_1.0-2
##  [19] cluster_2.0.7-1          png_0.1-7
##  [21] R.oo_1.22.0              sctransform_0.2.0
##  [23] HDF5Array_1.10.1         httr_1.4.0
##  [25] compiler_3.5.1           assertthat_0.2.1
##  [27] Matrix_1.2-17            lazyeval_0.2.1
##  [29] limma_3.38.3             htmltools_0.3.6
##  [31] tools_3.5.1              rsvd_1.0.0
##  [33] igraph_1.2.4.1           gtable_0.3.0
##  [35] glue_1.3.1               GenomeInfoDbData_1.2.0
##  [37] RANN_2.6.1               reshape2_1.4.3
##  [39] dplyr_0.8.0.1            Rcpp_1.0.1
##  [41] gdata_2.18.0             ape_5.3
##  [43] nlme_3.1-137             DelayedMatrixStats_1.4.0
##  [45] gbRd_0.4-11              lmtest_0.9-36
##  [47] xfun_0.6                 stringr_1.4.0
##  [49] globals_0.12.4           irlba_2.3.3
##  [51] gtools_3.8.1             statmod_1.4.30
##  [53] future_1.12.0            edgeR_3.24.3
##  [55] zlibbioc_1.28.0          MASS_7.3-51.1
##  [57] zoo_1.8-4                scales_1.0.0
##  [59] rhdf5_2.26.2             RColorBrewer_1.1-2
##  [61] yaml_2.2.0               reticulate_1.12
##  [63] pbapply_1.4-0            gridExtra_2.3
##  [65] stringi_1.2.4            caTools_1.17.1.2
##  [67] bibtex_0.4.2             Rdpack_0.11-0
##  [69] SDMTools_1.1-221.1       rlang_0.3.1
##  [71] pkgconfig_2.0.2          bitops_1.0-6
##  [73] evaluate_0.12            lattice_0.20-38
##  [75] ROCR_1.0-7               purrr_0.3.2
##  [77] Rhdf5lib_1.4.2           labeling_0.3
##  [79] htmlwidgets_1.3          bit_1.1-14
##  [81] cowplot_0.9.4            tidyselect_0.2.5
##  [83] plyr_1.8.4               magrittr_1.5
##  [85] R6_2.3.0                 gplots_3.0.1
##  [87] pillar_1.3.1             withr_2.1.2
##  [89] fitdistrplus_1.0-14      survival_2.43-3
##  [91] RCurl_1.95-4.11          tsne_0.1-3
##  [93] tibble_2.0.1             future.apply_1.2.0
##  [95] hdf5r_1.2.0              crayon_1.3.4
##  [97] KernSmooth_2.23-15       plotly_4.9.0
##  [99] rmarkdown_1.11           viridis_0.5.1
## [101] locfit_1.5-9.1           grid_3.5.1
## [103] data.table_1.12.2        metap_1.1
## [105] digest_0.6.18            tidyr_0.8.2
## [107] R.utils_2.7.0            munsell_0.5.0
## [109] beeswarm_0.2.3           viridisLite_0.3.0
## [111] vipor_0.4.5

撰文：Tiger
编辑：生信宝典

生信宝典

发布了266 篇原创文章 · 获赞 182 · 访问量 64万+

他的留言板关注