Bioinformatics Tutorial: Multiple Sequence Alignment

Summary

All methods for phylogenetic inference require sets of homologous characters as input. When nucleotide sequences are used for phylogenetic analysis, the first step is therefore usually to infer which nucleotides in the sequences of different taxa are homologous to each other, so that differences between these nucleotides result only from changes that occurred during the evolution of the sequences. This inference of homology between nucleotides of different sequences is most often accomplished by methods that fall into the category of "multiple sequence alignment".

In this tutorial [1], I will introduce one of the fastest and most popular tools for multiple sequence alignment, the program MAFFT (Katoh and Standley 2013). I will further demonstrate how to detect and exclude alignment regions in which nucleotide homology may be questionable, how to identify additional homologous sequences in a public sequence database (NCBI's GenBank), and how to use these sequences to supplement existing datasets.

Dataset

The dataset used in this tutorial is a small subset of the data used by Matschiner et al. to estimate divergence times of African and Neotropical cichlids in relation to the breakup of the Gondwanan landmasses India, Madagascar, Africa, and South America. The dataset used here includes sequences of two genes: the mitochondrial 16S gene, which encodes 16S ribosomal RNA, and the nuclear RAG1 gene, which encodes recombination-activating protein 1.

Requirements

  • MAFFT: Installation instructions and precompiled versions of MAFFT are available on the MAFFT web page. Although installation of the program should be easy on all operating systems, all steps of this tutorial can also be performed using the server version of MAFFT; therefore, installation of the software is optional.
  • AliView: To visualize sequence alignments, the software AliView (Larsson 2014) is recommended. Installation of AliView is described at http://www.ormbunkar.se/aliview/ and should work on all operating systems.
  • BMGE: BMGE is useful for identifying and removing poorly aligned regions in sequence alignments. The latest version of BMGE is available as a Java jar file at ftp://ftp.pasteur.fr/pub/gensoft/projects/BMGE/.

Alignment and visualization

We will first use the MAFFT program to align the sequences of the mitochondrial 16S gene, and then use the software AliView to visualize and improve the alignment.

  • Download the file 16s.fasta, which contains the 16S sequences, to your analysis directory. View the file in a text editor or on the command line, for example with the less command:
less 16s.fasta

You will see that each record consists of an ID and a sequence, where the ID is always on a single line that starts with a ">" symbol, followed by the lines containing the sequence. The sequences are not aligned yet, which is why they contain no gaps and differ in length. Other naming schemes could be used instead of the 14-character IDs in this file. However, I strongly recommend short and simple IDs, because many programs and scripts used in phylogenetic analyses may fail if you use actual Latin or common species names that contain spaces or hyphens.
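The record structure described above can also be checked programmatically. The following is a minimal Python sketch, not part of the original tutorial, that parses FASTA text, reports sequence lengths, and flags IDs containing anything other than letters, digits, or underscores. The function names `read_fasta` and `risky_ids` and the example records are hypothetical, and the character whitelist is an assumption about what is safe rather than a rule imposed by any particular program.

```python
import re

def read_fasta(text):
    """Parse FASTA-formatted text into a dict mapping ID -> sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; everything after ">" is the ID.
            current_id = line[1:]
            records[current_id] = ""
        elif current_id is not None:
            # Sequences may be wrapped over several lines; join them.
            records[current_id] += line
    return records

def risky_ids(records):
    """Return IDs with characters other than letters, digits, or underscores."""
    return [seq_id for seq_id in records
            if not re.fullmatch(r"[A-Za-z0-9_]+", seq_id)]

# Made-up example records (not taken from 16s.fasta).
example = """>Taxon_A
ACGTACGTAC
>Taxon B
ACGTAC
GT
"""

records = read_fasta(example)
for seq_id, seq in records.items():
    print(seq_id, len(seq))
print("IDs that may cause trouble:", risky_ids(records))
```

To apply the sketch to the tutorial data, the contents of 16s.fasta could be read with `open("16s.fasta").read()` and passed to `read_fasta`.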

  • Open the website of the online version of MAFFT. This website provides a web interface to the MAFFT alignment program. If you have installed MAFFT successfully, you can also run it on your own computer instead of using the website.
  • On the MAFFT server website, under the "Advanced Settings" heading (scroll down to view), you will find the alignment options available. In the first gray box titled "Strategy" you can choose between global and local alignment methods. The “G-INS-i” method implements the global Needleman-Wunsch algorithm (Needleman and Wunsch 1970), and the “L-INS-i” method implements the local “Smith-Waterman” algorithm (Smith and Waterman 1981). For simplicity, leave the default "Auto" option. If you are using the command line version of MAFFT on your computer instead of the MAFFT server, the equivalent command is as follows:
mafft --auto 16s.fasta > 16s_aln.fasta
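  • In the third gray box of the "Advanced settings" section, titled "Parameters", you can change the scoring matrix. For amino-acid sequences, you can choose among several BLOSUM matrices or PAM-type matrices. For nucleotide sequences, you can choose between "1PAM / K=2", "20PAM / K=2", and "200PAM / K=2". For now, keep all default options and click the "Submit" button. Then download the alignment in Fasta format to your computer by right-clicking the "Fasta format" link at the very top of the page. Name the file 16s_aln.fasta.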
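To make the idea of global alignment mentioned above more concrete, here is a textbook pairwise Needleman-Wunsch sketch in Python. It is an illustration only and not MAFFT's implementation: MAFFT aligns many sequences at once and uses more elaborate scoring with separate gap opening and extension penalties. The function name `needleman_wunsch`, the linear gap penalty, and the match, mismatch, and gap values are assumptions chosen for readability.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for a[i-1] versus b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(score[i - 1][j - 1] + sub(i, j),
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))

# Made-up sequences, not taken from the tutorial data.
total, top, bottom = needleman_wunsch("ACGT", "AGT")
print(total)
print(top)
print(bottom)
```

Because the whole of both sequences must be aligned end to end, the shorter sequence receives a gap; a local method would instead report only the best-matching subsequences.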

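  • Repeat the same steps, this time with the gap opening penalty set to 2 instead of the default value of 1.53. Name the resulting alignment file 16s_op2_aln.fasta. If you are using the command-line version of MAFFT, the equivalent command is as follows: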

mafft --auto --op 2 16s.fasta > 16s_op2_aln.fasta
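  • Open the file 16s_aln.fasta in AliView. Without closing this AliView window, open the file 16s_op2_aln.fasta in a second AliView window. Compare the total alignment lengths shown in the status bar at the bottom right. In both AliView windows, scroll to the region between positions 1250 and 1350.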
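  • In the window for 16s_aln.fasta, identify a poorly aligned region (for example around positions 1020 to 1040) and try to realign it. To do so, select the region by clicking on the ruler at the top of the alignment.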
  • After selecting the poorly aligned region, click "Realign selected block" in AliView's "Align" menu.
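The comparison of the two alignments made in AliView above can also be expressed in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `alignment_summary` is hypothetical, and the two toy alignments merely stand in for 16s_aln.fasta and 16s_op2_aln.fasta to show how a higher gap opening penalty can yield a shorter alignment with fewer gaps.

```python
def alignment_summary(aligned):
    """Return (alignment length, total number of gap characters)."""
    lengths = {len(seq) for seq in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("sequences differ in length; input is not an alignment")
    length = lengths.pop()
    gaps = sum(seq.count("-") for seq in aligned.values())
    return length, gaps

# Toy alignments of the same two made-up sequences (ACGGT and ACTGT).
# The first mimics a low gap opening penalty, the second a higher one.
default_aln = {"tax1": "AC-GGT", "tax2": "ACTG-T"}
op2_aln = {"tax1": "ACGGT", "tax2": "ACTGT"}

for label, aln in (("default", default_aln), ("op 2", op2_aln)):
    length, gaps = alignment_summary(aln)
    print(f"{label}: length={length}, gaps={gaps}")
```

With a higher gap opening penalty, mismatches become cheaper relative to gaps, so the aligner tends to open fewer gaps and the alignment tends to become shorter.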

Automated alignment filtering with BMGE

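As you have seen, the alignment of the 16S sequences contains a mix of highly variable and conserved regions. The homology of nucleotides is therefore fairly obvious in some parts of the gene but can be ambiguous in others. To avoid problems caused by misalignment in downstream phylogenetic analyses, we will identify poorly aligned regions based on the proportion of gaps and the genetic variation found within them, and exclude these regions from the alignment.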

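  • To exclude unreliably aligned regions from the 16S alignment, use the software BMGE. To check that the program runs on your computer and to see the available options, open a command-line window (for example the Terminal application on macOS) and type the following command: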
java -jar BMGE.jar -?
  
  • If the command above works, enter the following command:
java -jar BMGE.jar -i 16s_aln.fasta -t DNA -of 16s_filtered.fasta -oh 16s_filtered.html

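With the command above, BMGE writes the filtered alignment in Fasta format to the file 16s_filtered.fasta and a visualization of the filtered alignment in HTML format to the file 16s_filtered.html. Open the file 16s_filtered.html in a browser. Scroll through the alignment and note the alignment blocks shown in black. At the very top of the alignment, you will see two values plotted for each site in light gray and black. The gap proportion is shown with light gray equals signs and ranges from 0 to 1. The black colons indicate what the authors of BMGE call a "smoothed entropy-like score" (Criscuolo and Gribaldo 2010); essentially, this is a measure of the nucleotide diversity at the site. You will notice that the black alignment blocks coincide with regions of low gap proportion and low entropy, which are the alignment positions best suited for phylogenetic inference.

The selection of alignment blocks is based on BMGE's default settings for the entropy-score cutoff (option -h), the gap-rate cutoff (-g), and the minimum block size (-b). By default, BMGE selects sites with an entropy score below 0.5 (-h 0.5) and a gap proportion below 0.2 (-g 0.2), and only if these sites form a block of at least 5 consecutive sites with these properties (-b 5).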
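The logic of this site selection can be sketched in Python. The sketch below is a deliberate simplification and not BMGE's algorithm: it uses plain Shannon entropy of the nucleotide frequencies (normalized to the range 0 to 1 with a base-4 logarithm) in place of BMGE's smoothed entropy-like score, and the function names `column_stats` and `select_sites` as well as the toy alignment are hypothetical. The default cutoffs mirror the -h, -g, and -b values quoted above.

```python
import math
from collections import Counter

def column_stats(aligned):
    """Return a list of (gap proportion, normalized entropy), one per column."""
    stats = []
    for column in zip(*aligned.values()):
        gap_rate = column.count("-") / len(column)
        bases = [c.upper() for c in column if c != "-"]
        entropy = 0.0
        if bases:
            counts = Counter(bases)
            # Shannon entropy with log base 4, so four equally frequent
            # nucleotides give a value of 1.
            entropy = sum(-(k / len(bases)) * math.log(k / len(bases), 4)
                          for k in counts.values())
        stats.append((gap_rate, entropy))
    return stats

def select_sites(stats, max_entropy=0.5, max_gap=0.2, min_block=5):
    """Keep sites below both cutoffs that form runs of at least min_block."""
    keep, run = [], []
    for index, (gap_rate, entropy) in enumerate(stats):
        if gap_rate < max_gap and entropy < max_entropy:
            run.append(index)
            continue
        if len(run) >= min_block:
            keep.extend(run)
        run = []
    if len(run) >= min_block:
        keep.extend(run)
    return keep

# Made-up toy alignment: six conserved columns, one gappy column,
# one highly variable column, and a conserved run of only three columns.
toy = {
    "tax1": "ACGTAC-AGGA",
    "tax2": "ACGTAC-CGGA",
    "tax3": "ACGTACTGGGA",
    "tax4": "ACGTACTTGGA",
}
stats = column_stats(toy)
print(select_sites(stats))
print(select_sites(stats, min_block=3))
```

In the toy alignment, the last three columns pass both cutoffs but are discarded with the default minimum block size because the run is shorter than five sites; lowering the minimum block size to 3 retains them.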

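  • Repeat the BMGE block selection with custom settings for the entropy-score cutoff, the gap-rate cutoff, and the minimum block size, and note how this changes the total number of selected sites and the distribution of selected blocks across the alignment. For example, increase the allowed proportion of gaps with -g 0.3: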
java -jar BMGE.jar -i 16s_aln.fasta -t DNA -g 0.3 -of 16s_g03_filtered.fasta -oh 16s_g03_filtered.html
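  • BMGE's standard output to the terminal tells you how many sites (characters) remain selected. Note the difference between the last two runs. In addition to the file 16s_filtered.html, open the file 16s_g03_filtered.html in a separate browser window and scroll through the alignment. You will notice that more regions are now marked in black because a higher proportion of gaps is allowed per site.
  • Open the file 16s_filtered.fasta in AliView. Note that it is now shorter than the previous alignment and appears more compact. Using the "Save as Phylip (full names & padded)" option in AliView's "File" menu, save the file in Phylip format as 16s_filtered.phy. Also save the file in Nexus format as 16s_filtered.nex, using the "Save as Nexus" option.
  • Open the Phylip and Nexus files in a text editor to see the differences between the file formats.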
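To illustrate how the two formats differ, the following Python sketch writes a tiny made-up alignment in a relaxed Phylip layout and in a basic Nexus DATA block. The function names `to_phylip` and `to_nexus` are hypothetical, and the output is meant to show the general structure of each format; it is not guaranteed to match the files written by AliView character for character.

```python
def to_phylip(aligned):
    """Return the alignment as relaxed Phylip text with padded full names."""
    n_chars = len(next(iter(aligned.values())))
    width = max(len(name) for name in aligned)
    # Header line: number of taxa, then number of characters.
    lines = [f"{len(aligned)} {n_chars}"]
    lines += [f"{name.ljust(width)}  {seq}" for name, seq in aligned.items()]
    return "\n".join(lines) + "\n"

def to_nexus(aligned):
    """Return the alignment as a minimal Nexus file with one DATA block."""
    n_chars = len(next(iter(aligned.values())))
    width = max(len(name) for name in aligned)
    lines = [
        "#NEXUS",
        "BEGIN DATA;",
        f"  DIMENSIONS NTAX={len(aligned)} NCHAR={n_chars};",
        "  FORMAT DATATYPE=DNA MISSING=? GAP=-;",
        "  MATRIX",
    ]
    lines += [f"  {name.ljust(width)}  {seq}" for name, seq in aligned.items()]
    lines += ["  ;", "END;"]
    return "\n".join(lines) + "\n"

# Made-up toy alignment, not taken from 16s_filtered.fasta.
toy_alignment = {"tax1": "ACGT--", "taxon2": "ACGTAC"}
print(to_phylip(toy_alignment))
print(to_nexus(toy_alignment))
```

Phylip carries only the dimensions and the sequence matrix, whereas Nexus wraps the matrix in a block that also declares the data type and the symbols used for gaps and missing data.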

Reference

[1] Source: https://github.com/mmatschiner/tutorials/blob/master/multiple_sequence_alignment/README.md
