Paper intensive reading (19): Batch effects in a multiyear sequencing study

Paper title: Batch effects in a multiyear sequencing study: False biological trends due to changes in read lengths

Google Scholar citations: 9

Pages: 11

Publication date: 1 March 2018

Journal: Mol Ecol Resour.

Authors: D. M. Leigh1,2,3 | H. E. L. Lischer1,2 | C. Grossen1 | L. F. Kelle

University of Zurich,

Abstract:

High-throughput sequencing is a powerful tool, but suffers biases and errors that must be accounted for to prevent false biological conclusions. Such errors include batch effects; technical errors only present in subsets of data due to procedural changes within a study. If overlooked and multiple batches of data are combined, spurious biological signals can arise, particularly if batches of data are correlated with biological variables. Batch effects can be minimized through randomization of sample groups across batches. However, in long-term or multiyear studies where data are added incrementally, full randomization is impossible, and batch effects may be a common feature. Here, we present a case study where false signals of selection were detected due to a batch effect in a multiyear study of Alpine ibex (Capra ibex). The batch effect arose because sequencing read length changed over the course of the project and populations were added incrementally to the study, resulting in nonrandom distributions of populations across read lengths. The differences in read length caused small misalignments in a subset of the data, leading to false variant alleles and thus false SNPs. Pronounced allele frequency differences between populations arose at these SNPs because of the correlation between read length and population. This created highly statistically significant, but biologically spurious, signals of selection and false associations between allele frequencies and the environment. We highlight the risk of batch effects and discuss strategies to reduce the impacts of batch effects in multiyear high-throughput sequencing studies.
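The abstract traces the problem to a nonrandom distribution of populations across read lengths. A quick way to look for that kind of confounding in one's own sample sheet is to cross-tabulate population against read length. The sketch below is only an illustration: the function name, the population labels, and the read lengths (100 and 125 bp) are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def batch_confounding(samples):
    """Cross-tabulate population against read length.

    `samples` is an iterable of (population, read_length) pairs.
    Returns (table, confounded): a dict mapping each population to a
    Counter of read lengths, and the sorted list of populations that
    were sequenced at only one read length.
    """
    table = defaultdict(Counter)
    for population, read_length in samples:
        table[population][read_length] += 1
    confounded = sorted(pop for pop, counts in table.items() if len(counts) == 1)
    return dict(table), confounded


# Hypothetical sample sheet: (population label, read length in bp).
samples = [
    ("pop_A", 100), ("pop_A", 100), ("pop_A", 100),
    ("pop_B", 100), ("pop_B", 125),
    ("pop_C", 125), ("pop_C", 125),
]
table, confounded = batch_confounding(samples)
print(confounded)  # ['pop_A', 'pop_C']
```

Populations that appear at a single read length cannot be separated from the batch, so any allele frequency difference involving them deserves extra scrutiny.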

Conclusions:

  • Studies that gather sequencing data over several years should look for alleles associated with specific batches of data, for instance, by testing for a significant association between allele frequencies and batch identity (The UK10K Consortium, 2015), and filter the data accordingly or change upstream bioinformatics steps like the SNP caller. (Note: this is what should be done with datasets whose samples are collected over a long period.)
  • In addition, batch effect detection and correction tools that have been designed for microarray data (Johnson, Li, & Rabinovic, 2007; Leek, Johnson, Parker, Jaffe, & Storey, 2012; Manimaran et al., 2016), and SNP data (e.g., batchTest in GWASTOOLS; Gogarten et al., 2012), can be employed. (Note: the methods are transferable to some degree.)
  • Reporting of the most common sources of batch effects, that is differences in read length, sequencing technology, or sequencing centre should become standard procedure so that this information is available to data analysts (Leek et al., 2010).
  • Various bioinformatics steps helped to remove the batch effect reported here. Firstly, using more stringent SNP filters removed the false SNP calls, but presumably at the cost of removing many true SNPs. (Note: removal.)
  • A second, alternative approach was to employ a different SNP caller. Calling variants with GATK’s HaplotypeCaller did lead to correct SNP calls.
  • This indicates that it is important to not only evaluate SNP filters but also SNP callers in projects that combine data from multiple sequencing runs.
  • The study highlights the need for careful control of study design, bioinformatics pipelines and data analysis to prevent batch effects. (Note: study design.)
  • Only a combination of approaches will ensure that biologically spurious conclusions from batch effects in HTS data are kept to a minimum.
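The first conclusion suggests testing for an association between allele frequencies and batch identity. The sketch below shows one simple way this could be done for two batches, using a two-sided Fisher's exact test on per-batch allele counts. This is a swapped-in illustration, not the procedure used in the paper or by the UK10K Consortium; the function names, the SNP ids, the counts, and the Bonferroni correction are all assumptions made for the example.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Rows are batches; columns are (reference, alternate) allele counts.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x reference alleles in batch 1.
        return comb(row1, x) * comb(row2, col1 - x) / total

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum over all tables that are no more probable than the observed one.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= observed * (1 + 1e-9))
    return min(1.0, p)


def flag_batch_associated(snp_counts, alpha=0.05):
    """Return ids of SNPs whose allele counts differ between two batches.

    Uses a Bonferroni-corrected threshold (an assumption for this sketch).
    """
    threshold = alpha / len(snp_counts)
    return [
        snp_id
        for snp_id, (batch1, batch2) in snp_counts.items()
        if fisher_exact_two_sided(*batch1, *batch2) < threshold
    ]


# Hypothetical counts: {snp_id: ((ref, alt) in batch 1, (ref, alt) in batch 2)}.
snp_counts = {
    "snp_1": ((20, 20), (21, 19)),  # similar frequencies in both batches
    "snp_2": ((40, 0), (10, 30)),   # alternate allele seen only in batch 2
}
flagged = flag_batch_associated(snp_counts)
print(flagged)  # ['snp_2']
```

A flagged SNP is not necessarily false: if batches coincide with populations, real differentiation will also be flagged, which is why the paper pairs this kind of test with changes to filters or the SNP caller.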

Introduction:

  • genome-wide single nucleotide polymorphism (SNP)
  • Restriction-site-associated DNA sequencing (RAD-seq) is a reduced-representation genome sequencing (RRGS) technique built on second-generation sequencing.
  • RAD-seq uses restriction endonucleases to fragment the genome; after modification, the fragments are ligated to barcoded adapters to build a library, which is then sequenced. Because it is simple to perform, low in experimental cost, and high in throughput, it is used in molecular ecology, evolutionary genomics, conservation genetics, and related fields. (Note: our group may not necessarily come across this?)
  • The simplest and most widely known form of errors is false SNP calls arising from sequencing error (Note: the most prominent error in high-throughput sequencing.)
  • Batch effects are thus technical sources of variation that differ among subsets of the data
  • When batch effects are correlated with biological variables, the systematic differences among batches may lead to invalid biological conclusions.
  • Ideal solution: One way to address batch effects in HTS studies is to randomly divide samples from a population or experimental group across libraries and sequencing lanes
  • In practice: as HTS sequencing develops, an increasing number of multiyear or long-term studies will add sequencing data over time
  • In contrast, when scientific questions focus on specific SNPs, batch effects may be more problematic. (Note: the impact is greater in this case.)
  • For example, if a reduced representation sequencing method, such as restriction site-associated DNA sequencing (RAD-seq), is used in a GWAS (Yu et al., 2015), each relevant section of the genome will often only be represented by a single SNP (Lowry et al., 2017; Mckinney, Larson, Seeb, & Seeb, 2017). (Note: an example.) In such cases, batch effects can easily cause bias.
  • It should be noted, however, that in studies with a higher marker density (e.g., from whole genome sequencing), associations would be confirmed by a cluster of markers rather than a single marker, making false associations due to batch effects less likely. (Note: with whole genome sequencing the impact is smaller.)
  • The paper's discussion of batch effects centres on a single case study: We discuss the origin of this batch effect, how it was identified, and its impact on the biological conclusions. We end with a brief discussion of ways in which the consequences of batch effects can be reduced.
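The ideal solution noted above is to divide the samples of each population at random across libraries and sequencing lanes. A minimal sketch of such a stratified assignment follows, assuming all samples are available up front (which, as the notes point out, is often not the case in multiyear studies). The function name, sample ids, and lane count are hypothetical.

```python
import random
from collections import defaultdict


def assign_lanes(samples, n_lanes, seed=0):
    """Spread each population's samples as evenly as possible over lanes.

    `samples` is a list of (sample_id, population) pairs. Within each
    population the samples are shuffled, then dealt round-robin to lanes,
    starting from a random lane so that no lane is systematically favoured.
    Returns a dict mapping sample_id to a lane index in range(n_lanes).
    """
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample_id, population in samples:
        by_pop[population].append(sample_id)

    assignment = {}
    for population in sorted(by_pop):
        ids = by_pop[population]
        rng.shuffle(ids)
        start = rng.randrange(n_lanes)
        for i, sample_id in enumerate(ids):
            assignment[sample_id] = (start + i) % n_lanes
    return assignment


# Hypothetical design: two populations of six samples each, three lanes.
samples = [(f"A{i}", "pop_A") for i in range(6)] + [(f"B{i}", "pop_B") for i in range(6)]
lanes = assign_lanes(samples, n_lanes=3)
for sample_id in sorted(lanes):
    print(sample_id, lanes[sample_id])
```

With six samples per population and three lanes, every lane receives two samples from each population, so lane identity is not correlated with population.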

Structure of the main text:

1. Introduction

2. Methods

2.1 Study system

2.2 Data collection

2.3 Data processing

2.4 Selection detection

2.5 Detection of false SNPs

2.6 Preventing batch effects in selection analyses

2.7 Removing the false SNP calls

3. Results

3.1 Screening for signals of selection

3.2 Preventing batch effects in selection analysis

3.3 Removing false SNP calls

4. Discussion

4.1 Identifying batch effects

4.2 Removing false SNP calls

Excerpts from the main text:


Reposted from blog.csdn.net/wxw060709/article/details/104178299