Paper intensive reading (14): Batch effects in single-cell RNA-sequencing data are corrected

Paper title: Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors

Google Scholar citations: 262

Pages: 12

Published: 2 April 2018

Journal: Nature Biotechnology

Authors: Laleh Haghverdi, Aaron T L Lun, Michael D Morgan & John C Marioni

Cambridge

Abstract:

Large-scale single-cell RNA sequencing (scRNA-seq) data sets that are produced in different laboratories and at different times contain batch effects that may compromise the integration and interpretation of the data. Existing scRNA-seq analysis methods incorrectly assume that the composition of cell populations is either known or identical across batches. We present a strategy for batch correction based on the detection of mutual nearest neighbors (MNNs) in the high-dimensional expression space. Our approach does not rely on predefined or equal population compositions across batches; instead, it requires only that a subset of the population be shared between batches. We demonstrate the superiority of our approach compared with existing methods by using both simulated and real scRNA-seq data sets. Using multiple droplet-based scRNA-seq data sets, we demonstrate that our MNN batch-effect-correction method can be scaled to large numbers of cells.

Discussion:

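Existing batch-correction methods do not account for differences in cell composition between batches and fail to fully remove the batch effect in such cases. (Note: the limitation of existing methods.)
By using both simulated data and real scRNA-seq data sets, we demonstrated that our MNN method is able to successfully remove the batch effect in the presence of differences in composition. (Note: tested on both simulated and real data.)
Moreover, we demonstrated the MNN method’s scalability on large droplet-based data sets. (Note: the algorithm is scalable.)
One prerequisite for our MNN method is that each batch must contain at least one shared cell population with another batch. (Note: the prerequisite of the MNN method.)
A notable feature of our MNN correction method is that it adjusts for local variations in batch effects by using a Gaussian kernel. (Note: a notable feature of the MNN method. It uses a Gaussian kernel and is suited to high-dimensional data; linear methods cannot provide this.)
The correction vectors (provided as an output of the MNN algorithm) could potentially be examined to understand the differences between batches.
After reading this for a while, my understanding is as follows: different batches contain different cell populations, but some populations are shared between batches. Data from the same population are assumed to be identical across batches, and that assumption yields a correction vector. The method then keeps looking for such overlapping subsets until every batch has been corrected.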
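The Gaussian-kernel remark above can be made concrete with a small sketch. It assumes the MNN pairs have already been found and that each pair contributes one batch vector; every cell then receives its own correction vector as a kernel-weighted average of those pair vectors, so that nearby pairs count more. The function name `smoothed_correction`, the bandwidth parameter `sigma`, and the exact kernel form are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def smoothed_correction(cells, pair_cells, pair_vectors, sigma=1.0):
    """Cell-specific correction vectors from a Gaussian kernel (illustrative sketch).

    cells:        (n, genes) cells of the batch being corrected
    pair_cells:   (m, genes) positions of that batch's cells that sit in MNN pairs
    pair_vectors: (m, genes) batch vector estimated from each MNN pair
    """
    # Squared Euclidean distance from every cell to every paired cell.
    d2 = ((cells[:, None, :] - pair_cells[None, :, :]) ** 2).sum(axis=2)
    # Subtracting the row minimum leaves the normalized weights unchanged
    # but prevents every weight in a row from underflowing to zero.
    w = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / (2.0 * sigma ** 2))
    w /= w.sum(axis=1, keepdims=True)
    # Weighted average of the pair vectors for each cell.
    return w @ pair_vectors

# Hypothetical data: two paired cells far apart with different local batch vectors.
pair_cells = np.array([[0.0, 0.0], [100.0, 0.0]])
pair_vectors = np.array([[1.0, 0.0], [0.0, 5.0]])
cells = np.array([[0.0, 0.0], [100.0, 0.0], [50.0, 0.0]])
local = smoothed_correction(cells, pair_cells, pair_vectors, sigma=1.0)
```

In this toy input the first two cells take the batch vector of the pair they coincide with, while the cell halfway between them gets the plain average of the two. A single global coefficient, as in a linear model, could not express that kind of local variation.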

Introduction:

Such differences can mask underlying biology or introduce spurious structure in the data; thus, to avoid misleading conclusions, they must be corrected before further analysis.
Most existing methods for batch correction are based on linear regression. (Note: most methods rely on linear regression.)
The limma package provides the removeBatchEffect function, which fits a linear model containing a blocking term for the batch structure to the expression values for each gene. Subsequently, the coefficient for each blocking term is set to zero, and the expression values are computed from the remaining terms and residuals, thus yielding a new expression matrix without batch effects. (Note: this is a function in the limma package; the sentences above describe its main principle.)
The ComBat method [8] uses a similar strategy but performs an additional step involving empirical Bayes shrinkage of the blocking coefficient estimates. (Note: ComBat's strategy is similar to that of removeBatchEffect, with some refinements.)
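The linear-regression idea described above can be sketched in a few lines of NumPy. This is a minimal illustration, not limma's code: the function name `remove_batch_linear` is hypothetical, the first batch is treated as the reference level (limma's own parameterization may differ), and no empirical Bayes shrinkage as in ComBat is applied.

```python
import numpy as np

def remove_batch_linear(expr, batch):
    """Regress out a batch blocking term gene by gene (illustrative sketch).

    expr:  (genes, cells) expression matrix
    batch: (cells,) batch label of each cell
    """
    expr = np.asarray(expr, dtype=float)
    batch = np.asarray(batch)
    levels = np.unique(batch)
    # Design matrix: an intercept plus one indicator per non-reference batch.
    design = np.column_stack(
        [np.ones(batch.size)] + [(batch == b).astype(float) for b in levels[1:]]
    )
    # Least-squares fit for all genes at once; coef has shape (terms, genes).
    coef, *_ = np.linalg.lstsq(design, expr.T, rcond=None)
    # "Setting the blocking coefficients to zero" amounts to subtracting the
    # fitted batch terms while keeping the intercept and the residuals.
    batch_part = design[:, 1:] @ coef[1:, :]
    return expr - batch_part.T

# Hypothetical data: 2 genes, 4 cells, the second batch shifted by 5 and by 1.
expr = np.array([[1.0, 3.0, 6.0, 8.0],
                 [0.0, 2.0, 1.0, 3.0]])
batch = np.array([0, 0, 1, 1])
corrected = remove_batch_linear(expr, batch)
```

After the call, both batches have the same per-gene mean as the reference batch, while the within-batch differences between cells are preserved.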
Other methods, such as RUVseq [9] and svaseq [10], are also frequently used for batch correction, but their focus is primarily on identifying unknown factors of variation, for example, those due to unrecorded experimental differences in cell processing. After these factors are identified, their effects can be regressed out as described previously. (Note: these two methods focus on discovering unknown factors of variation.)
Their application to scRNA-seq data is based on the assumption that the composition of the cell population within each batch is identical. (Note: existing methods assume that every batch contains the same cell populations, which may not hold in practice.)
In practice, the population composition is usually not identical across batches in scRNA-seq studies.
Even if the same cell types are present in each batch, the abundance of each cell type in the data set can change depending upon subtle differences in procedures such as cell culture or tissue extraction, dissociation and sorting. (Note: the assumption is problematic even when the cell types are the same.)
The estimated coefficients for the batch blocking factors are not purely technical but contain a nonzero biological component because of differences in composition.
Batch correction based on these coefficients would thus yield inaccurate representations of the cellular expression profiles, and the results might potentially be worse than if no correction were performed.
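A small numeric example may help show why unequal composition contaminates the batch coefficient. All numbers below are made up for illustration: one gene, two cell types with different mean expression, and a technical offset of 1 added to the second batch.

```python
import numpy as np

# Hypothetical one-gene example: type A has mean expression 0, type B has 10.
mean_a, mean_b = 0.0, 10.0
true_batch_shift = 1.0  # purely technical offset added to batch 2

# Batch 1 is half A and half B; batch 2 is 80% A and 20% B.
batch1 = np.array([mean_a] * 5 + [mean_b] * 5)
batch2 = np.array([mean_a] * 8 + [mean_b] * 2) + true_batch_shift

# With a single blocking term, the fitted batch coefficient is simply the
# difference between the batch means.
estimated_shift = batch2.mean() - batch1.mean()   # -2.0, not 1.0

# Comparing like with like (type A cells only) recovers the technical offset.
shift_within_a = batch2[:8].mean() - batch1[:5].mean()   # 1.0
```

The fitted coefficient comes out as -2 instead of 1, with the wrong sign, because it absorbs the change in cell-type proportions. Subtracting it would push the batches further apart within each cell type, which is the "worse than no correction" scenario described above.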
An alternative approach for data merging and comparison in the presence of batch effects uses a set of landmarks from a reference data set to project new data onto the reference.
Drawback of projection methods such as PCA: if the new batches include cell types that fall outside the transcriptional space explored in the reference batch, these cell types will not be projected to an appropriate position in the space defined by the landmarks.
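The projection drawback can be illustrated with a toy example. PCA is used here only as a stand-in for a reference-defined space, following the note above; the two-gene data are hypothetical and the paper's landmark-based methods work differently in detail.

```python
import numpy as np

# Hypothetical reference batch: cells vary only along gene 1.
reference = np.array([[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
ref_mean = reference.mean(axis=0)

# First principal axis of the reference, from the SVD of the centered data.
_, _, vt = np.linalg.svd(reference - ref_mean, full_matrices=False)
axis = vt[0]

# New batch: one cell resembling the reference, and one novel cell type
# that differs only in gene 2, a direction the reference never explored.
new_cells = np.array([[0.0, 0.0], [0.0, 8.0]])
scores = (new_cells - ref_mean) @ axis
```

Both new cells receive the same score on the reference axis, so the novel cell type becomes indistinguishable from an ordinary reference-like cell after projection.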
Main contribution of this paper:
Here, we propose a new method for removal of discrepancies between biologically related batches according to the presence of MNNs between batches, which are considered to define the most similar cells of the same type across batches. (Note: the MNN method is proposed.)
The difference in expression values between cells in an MNN pair provides an estimate of the batch effect, which is made more precise by averaging across many such pairs.
A correction vector is obtained from the estimated batch effect and applied to the expression values to perform batch correction.
Our approach automatically identifies overlaps in population composition between batches and uses only the overlapping subsets for correction, thus avoiding the assumption of equal composition required by other methods.
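The steps above (find MNN pairs, average the pair differences, apply the correction vector) can be sketched as follows. This is a simplified illustration under stated assumptions: brute-force Euclidean distances, a single global correction vector, and the hypothetical function names `find_mnn_pairs` and `mnn_correct_mean`. Any preprocessing or per-cell smoothing used by the published method is left out.

```python
import numpy as np

def find_mnn_pairs(ref, target, k=20):
    """Return (ref_index, target_index) pairs that are mutual nearest neighbors."""
    # Pairwise squared Euclidean distances, shape (n_ref, n_target).
    d = ((ref[:, None, :] - target[None, :, :]) ** 2).sum(axis=2)
    k_t = min(k, target.shape[0])
    k_r = min(k, ref.shape[0])
    # For each reference cell, its k nearest target cells.
    nn_of_ref = np.argsort(d, axis=1)[:, :k_t]
    # For each target cell, its k nearest reference cells.
    nn_of_target = np.argsort(d, axis=0)[:k_r, :].T
    ref_to_target = np.zeros(d.shape, dtype=bool)
    ref_to_target[np.arange(ref.shape[0])[:, None], nn_of_ref] = True
    target_to_ref = np.zeros(d.shape, dtype=bool)
    target_to_ref[nn_of_target, np.arange(target.shape[0])[:, None]] = True
    # A pair is mutual only if each cell is in the other's neighbor list.
    return np.argwhere(ref_to_target & target_to_ref)

def mnn_correct_mean(ref, target, k=20):
    """Shift `target` by the average difference across MNN pairs."""
    pairs = find_mnn_pairs(ref, target, k)
    if len(pairs) == 0:
        raise ValueError("no mutual nearest neighbors found")
    # Each pair gives one estimate of the batch effect; averaging sharpens it.
    batch_vec = (target[pairs[:, 1]] - ref[pairs[:, 0]]).mean(axis=0)
    return target - batch_vec, pairs

# Hypothetical data: the reference has types A (x near 0) and B (x near 10);
# the target has type B and a target-only type C (x near 30), with a batch
# shift of +3 in the second gene.
ref = np.array([[0.0, 0.0], [0.5, 0.0], [10.0, 0.0], [10.5, 0.0]])
target = np.array([[10.0, 3.0], [10.5, 3.0], [30.0, 3.0], [30.5, 3.0]])
corrected, pairs = mnn_correct_mean(ref, target, k=1)
```

In this toy input only the shared type B cells form mutual pairs, so the estimated batch vector is (0, 3) even though the two batches have different compositions. The target-only cells are shifted by the same vector and stay separate from the reference populations.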
We demonstrate that our approach outperforms existing methods on a range of simulated and real scRNA-seq data sets involving different biological systems and technologies.