Haplotype-aware genotyping from noisy long reads

Abstract

Motivation: Current genotyping approaches for single nucleotide variations (SNVs) rely on short, relatively accurate reads from second-generation sequencing devices. Presently, third-generation sequencing platforms able to generate much longer reads are becoming more widespread. These platforms come with the significant drawback of higher sequencing error rates, which makes them ill-suited to current genotyping algorithms. However, the longer reads make more of the genome unambiguously mappable and typically provide linkage information between neighboring variants.

Results: In this paper we introduce a novel approach for haplotype-aware genotyping from noisy long reads. We do this by considering bipartitions of the sequencing reads, corresponding to the two haplotypes. We formalize the computational problem in terms of a Hidden Markov Model and compute posterior genotype probabilities using the forward-backward algorithm. Genotype predictions can then be made by picking the most likely genotype at each site. Our experiments indicate that longer reads allow significantly more of the genome to potentially be accurately genotyped. Further, we are able to use both Oxford Nanopore and Pacific Biosciences sequencing data to independently validate millions of variants previously identified by short-read technologies in the reference NA12878 sample, including hundreds of thousands of variants that were not previously included in the high-confidence reference set.
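To make the forward-backward step concrete, here is a minimal Python sketch of posterior decoding on a toy HMM. It is not the authors' implementation. In the paper the hidden states are built from bipartitions of the reads; here, as a simplifying assumption, the three states are the genotypes themselves, and the names `forward_backward`, `call_genotypes`, `INIT`, `TRANS` and `EMIS`, as well as every number, are hypothetical placeholders chosen for illustration.

```python
from typing import List

GENOTYPES = ["0/0", "0/1", "1/1"]  # toy state space (assumption, see lead-in)


def forward_backward(init: List[float],
                     trans: List[List[float]],
                     emis: List[List[float]]) -> List[List[float]]:
    """Return posterior state probabilities for every site.

    init[s]     prior probability of state s at the first site
    trans[r][s] probability of moving from state r to state s
    emis[t][s]  likelihood of the data observed at site t given state s
    """
    n, k = len(emis), len(init)
    fwd = [[0.0] * k for _ in range(n)]
    bwd = [[1.0] * k for _ in range(n)]

    # Forward pass; each row is rescaled to sum to 1 to avoid underflow.
    for t in range(n):
        for s in range(k):
            if t == 0:
                prev = init[s]
            else:
                prev = sum(fwd[t - 1][r] * trans[r][s] for r in range(k))
            fwd[t][s] = prev * emis[t][s]
        z = sum(fwd[t])
        fwd[t] = [v / z for v in fwd[t]]

    # Backward pass, rescaled the same way.
    for t in range(n - 2, -1, -1):
        for r in range(k):
            bwd[t][r] = sum(trans[r][s] * emis[t + 1][s] * bwd[t + 1][s]
                            for s in range(k))
        z = sum(bwd[t])
        bwd[t] = [v / z for v in bwd[t]]

    # The posterior at site t is proportional to fwd[t][s] * bwd[t][s].
    post = []
    for t in range(n):
        w = [fwd[t][s] * bwd[t][s] for s in range(k)]
        z = sum(w)
        post.append([v / z for v in w])
    return post


def call_genotypes(post: List[List[float]]) -> List[str]:
    """Pick the most likely genotype at each site."""
    return [GENOTYPES[max(range(len(row)), key=row.__getitem__)]
            for row in post]


# Made-up inputs: uniform prior, arbitrary transitions, three sites.
INIT = [1 / 3, 1 / 3, 1 / 3]
TRANS = [[0.8, 0.1, 0.1],
         [0.1, 0.8, 0.1],
         [0.1, 0.1, 0.8]]
EMIS = [[0.90, 0.05, 0.05],
        [0.05, 0.90, 0.05],
        [0.05, 0.05, 0.90]]

if __name__ == "__main__":
    posteriors = forward_backward(INIT, TRANS, EMIS)
    print(call_genotypes(posteriors))  # → ['0/0', '0/1', '1/1']
```

Rescaling each row is harmless because only the ratios within a site matter for the posterior. A real genotyper would derive the emission likelihoods from the read alleles and their error probabilities rather than from fixed numbers like these.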

Reference

https://www.biorxiv.org/content/10.1101/293944v2.abstract

Reposted from blog.csdn.net/u010608296/article/details/103497969