vsearch 2.8.1 usage and command overview: help documentation (a free 64-bit alternative to usearch)

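Introduction

USEARCH is arguably the most convenient amplicon analysis software, but its source code is closed and the 64-bit version needed for large datasets is paid, which keeps many people with limited funding from learning and using it.

VSEARCH was created to fill this gap: it is free and open source, so amplicon analysis becomes both convenient and free of charge, and its algorithms are public and transparent. For more background, see the following article:

VSEARCH in practice: a free USEARCH without the memory limit!

The main functions of vsearch are chimera detection, clustering, dereplication and rereplication, FASTA/FASTQ file processing, masking, pairwise alignment, searching, shuffling, sorting, subsampling and taxonomic classification, for use in metagenomics, genomics and population genetics.

Since v1.0.0 was released on 28 November 2014, the software has gone through 89 releases; the latest, v2.8.1, came out on 22 June 2018. The home page is https://github.com/torognes/vsearch, and builds are available for all mainstream operating systems (Windows, Mac and Linux), which makes cross-platform use easy.

As of 28 July 2018, the latest version is still 2.8.1 (180622).

Taking the Windows build as an example, the download link is:

https://github.com/torognes/vsearch/releases/download/v2.8.1/vsearch-2.8.1-win-x86_64.zip

The archive contains the program itself and the help documentation.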

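Main functions and command-line formats

Chimera detection:

De novo chimera removal (sequences compared against each other)

The newest algorithm, uchime3, is the usual choice for de novo chimera removal; --nonchimeras names the output file for the filtered result.

Reference-based chimera removal

vsearch --uchime_ref fastafile (--chimeras | --nonchimeras | --uchimealns | --uchimeout) outputfile --db fastafile [options]

--uchime_ref can then be used for reference-based chimera removal; a large, comprehensive database such as the latest SILVA release is recommended. That said, reference-based removal is better avoided where possible: the parents in a reference database carry no abundance information, which easily leads to false negatives. De novo removal with uchime3, by contrast, requires a parent to be at least 16 times as abundant as the chimera, a strict threshold meant to limit misclassification.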
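
The abundance requirement can be illustrated with a short sketch. The function name and the sample abundances below are hypothetical, and the code shows only the ratio test described above, not the full chimera scoring performed by vsearch.

```python
def eligible_parents(query_size, candidate_sizes, abskew=16.0):
    """Return the candidate abundances that are at least `abskew` times
    the abundance of the query (the suspected chimera)."""
    return [size for size in candidate_sizes if size >= abskew * query_size]


# A query seen 10 times: only candidates seen 160 times or more qualify.
print(eligible_parents(10, [2000, 159, 160, 45]))
# With a ratio of 2, every candidate in this list is abundant enough.
print(eligible_parents(10, [2000, 159, 160, 45], abskew=2.0))
```

The stricter the ratio, the fewer sequences can act as parents, so fewer queries end up flagged as chimeras.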

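Clustering:

To cluster sequences in order of decreasing abundance, use --cluster_size (--cluster_fast processes them in order of decreasing length instead). To denoise reads into exact sequence variants rather than cluster them, use the --cluster_unoise algorithm.

Dereplication and rereplication:

vsearch (--derep_fulllength | --derep_prefix) fastafile (--output | --uc) outputfile [options]
vsearch --rereplicate fastafile --output outputfile [options]

Merged sequences are dereplicated with --derep_fulllength. The name of each unique sequence then records how many times that sequence was observed (its count).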
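
As a rough illustration of what full-length dereplication produces, the sketch below collapses identical reads and writes the count into each name as a size annotation. The `uniq` label prefix and the sample reads are made up, and comparing sequences in upper case is a simplifying assumption; this is not vsearch's implementation.

```python
from collections import Counter


def dereplicate(sequences):
    """Collapse identical full-length sequences and count occurrences.

    Returns (label, sequence) pairs sorted by decreasing count; each label
    carries the count as a ';size=N' annotation.
    """
    counts = Counter(seq.upper() for seq in sequences)
    # Sort by decreasing abundance, then alphabetically for a stable order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(f"uniq{i};size={n}", seq) for i, (seq, n) in enumerate(ranked, 1)]


reads = ["ACGT", "acgt", "ACGA", "ACGT", "TTGA", "ACGA"]
for label, seq in dereplicate(reads):
    print(f">{label}\n{seq}")
```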

FASTA/FASTQ file processing:

Quality assessment

vsearch --fastq_chars fastqfile [options]

Format conversion

For example, from phred64 to phred33:

vsearch --fastq_convert fastqfile --fastqout outputfile [options]
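
What the conversion amounts to can be shown in a few lines: each quality character encodes a score as an ASCII code with an offset of 64 or 33, so converting means shifting every character by 31. The function name and the sample string are illustrative only.

```python
def phred64_to_phred33(quality):
    """Re-encode a phred+64 quality string as phred+33."""
    scores = [ord(ch) - 64 for ch in quality]
    if min(scores, default=0) < 0:
        raise ValueError("input does not look like phred+64")
    return "".join(chr(q + 33) for q in scores)


# 'h' is Q40 in phred+64 and becomes 'I' in phred+33; '@' is Q0 and becomes '!'.
print(phred64_to_phred33("hhgfB@"))
```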

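Quality statistics

vsearch (--fastq_eestats | --fastq_eestats2) fastqfile --output outputfile [options]

Quality filtering

vsearch --fastq_filter fastqfile (--fastaout | --fastaout_discarded | --fastqout | --fastqout_discarded) outputfile [options]

Paired-end read merging

When merging paired-end reads with --fastq_mergepairs, the forward-read file is the argument of the command itself, --reverse names the reverse-read file and --fastqout names the output file. --fastaout_notmerged_fwd and --fastaout_notmerged_rev write out the reads that could not be merged, and --eetabbedout writes statistics to a file. --fastq_truncqual trims low-quality regions from the 3' end. --fastq_minlen discards short sequences (default 1), and --fastq_maxns discards sequences with too many Ns (no limit by default). Overhangs are not allowed unless --fastq_allowmergestagger is given. The minimum overlap, --fastq_minovlen, defaults to 10 and can be lowered when long amplicons leave only a short overlap. --fastq_maxdiffs sets the maximum number of mismatches in the overlap (default 10), and --fastq_maxdiffpct sets the maximum percentage of mismatches (default 100%). Amplicon libraries are often gel-purified to recover the target fragment, so when the expected size is known, merged reads can be filtered by length with --fastq_minmergelen and --fastq_maxmergelen. Other options include --fastq_ascii, --fastq_maxee, --fastq_nostagger, --fastq_qmax, --fastq_qmaxout, --fastq_qmin, --fastq_qminout and --label_suffix.

Sequence statistics

vsearch --fastq_stats fastqfile [--log logfile] [options]

Reverse complement

vsearch --fastx_revcomp fastxfile (--fastaout | --fastqout) outputfile [options]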

Masking:

Mask low-complexity regions in a FASTA/FASTQ file

vsearch --fastx_mask fastxfile (--fastaout | --fastqout) outputfile [options]

Mask low-complexity regions in a FASTA file

vsearch --maskfasta fastafile --output outputfile [options]

Pairwise alignment:

Global pairwise alignment of all sequences against each other

Searching:

Search for exact full-length matches

Global alignment search, used to build the OTU table

Shuffling and sorting:

Shuffle sequences, sort them by length, or sort them by abundance

vsearch (--shuffle | --sortbylength | --sortbysize) fastafile --output outputfile [options]

Subsampling:

Subsample a FASTA/FASTQ file

vsearch --fastx_subsample fastafile (--fastaout | --fastqout) outputfile (--sample_pct real | --sample_size N) [options]

Taxonomic classification:

Taxonomic annotation with the sintax algorithm

vsearch --sintax fastafile --db fastafile --tabbedout outputfile [--sintax_cutoff real] [options]

UDB database handling:

Build an index for the search database

This saves the time otherwise spent building the index on every run.

vsearch --makeudb_usearch fastafile --output outputfile [options]

Convert an index back to a FASTA sequence file

vsearch --udb2fasta udbfile --output outputfile [options]

Index file statistics

vsearch (--udbinfo | --udbstats) udbfile [options]

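Description

vsearch is mainly used for sequence processing in amplicon analysis, including quality control, dereplication, clustering, chimera removal and OTU table generation.

Input

Input files are in standard FASTA or FASTQ format.

When a sequence name carries an integer abundance annotation (size=integer), it is used as the abundance to assist chimera detection and the choice of representative sequences in OTU clustering and denoising.

Letter case in the file is meaningful: sequences are normally upper case, and lower case denotes soft masking.

Input can also come from a pipe: use - in place of the input filename so that several commands can be chained.

gzip- and bzip2-compressed files are supported; compressed data arriving through a pipe needs the --gzip_decompress option (or --bzip2_decompress for bzip2).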
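
The abundance annotation mentioned above follows the pattern '[>;]size=integer[;]' in the sequence header. A minimal sketch of reading it is given below; the function name is hypothetical, and returning 1 when no annotation is present is an assumption made for the example.

```python
import re

# 'size=' must start the header or follow '>' or ';', and the integer
# must be followed by ';' or the end of the header.
SIZE_RE = re.compile(r"(?:^|[>;])size=(\d+)(?:;|$)")


def abundance(header, default=1):
    """Return the abundance annotated in a FASTA header, or `default`."""
    match = SIZE_RE.search(header)
    return int(match.group(1)) if match else default


for header in (">otu1;size=42;", "read7;size=7", "read8", "read9;minsize=5"):
    print(header, abundance(header))
```

Anchoring the pattern on '>' or ';' keeps unrelated fields such as the made-up `minsize=5` from being read as an abundance.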

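Options

General options

Parameter	Description
--bzip2_decompress	Use when bzip2-compressed data arrives through a pipe; not needed when reading a compressed file directly
--fasta_width N	FASTA output is wrapped at 80 nt per line by default; this option sets the line width
--gzip_decompress	Use when gzip-compressed data arrives through a pipe; not needed when reading a compressed file directly
--help or -h	Show help and exit
--log filename	Write the log to a file
--maxseqlength N	Maximum sequence length; by default sequences longer than 50000 are discarded
--minseqlength N	Minimum sequence length; default 1 for sorting and shuffling, 32 for clustering, dereplication and searching
--no_progress	Do not show the progress indicator
--notrunclabels	Keep full sequence names; by default everything after the first space or tab is dropped
--quiet	Suppress all messages on standard output and standard error except warnings and fatal errors
--threads N	Number of computation threads, from 1 to 256. It should not exceed the number of CPU cores; by default all cores are used. Multi-threaded commands are allpairs_global, cluster_fast, cluster_size, cluster_smallmem, fastq_mergepairs, maskfasta, search_exact, uchime_ref and usearch_global
--version or -v	Print version information and exit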

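Chimera detection options

Parameter	Description
--abskew real	With --uchime_denovo, the abundance ratio used to decide which sequences can be parents of a chimera. Default 16 for --uchime3_denovo and 2 otherwise, meaning a parent must be at least twice as abundant as the chimera; the value must be greater than 1
--alignwidth N	With --uchimealns, width of the three-way alignments; default 80, 0 means unlimited
--borderline filename	Output borderline sequences: they resemble chimeras but are not distinct enough from their parents to be called
--chimeras filename	Output chimeric sequences
--db filename	With --uchime_ref, the reference database in FASTA format
--dn real	No vote pseudo-count, corresponding to the parameter n in the chimera scoring function (default value is 1.4).
--fasta_score	Include the chimera score in the output
--mindiffs N	Minimum number of differences per segment, default 3. Ignored by --uchime2_denovo and --uchime3_denovo
--mindiv real	Minimum divergence from the closest parent, default 0.8. Ignored by --uchime2_denovo and --uchime3_denovo
--minh real	Minimum score (h). Raising it reduces false positives but also lowers sensitivity. Default 0.28, range 0 to 1. Ignored by --uchime2_denovo and --uchime3_denovo
--nonchimeras filename	Output non-chimeric sequences
--relabel string	Relabel sequences; add --sizeout to keep the abundance annotation
--relabel_keep	Keep the original name when relabelling
--relabel_md5	Relabel with the MD5 digest of the sequence
--relabel_sha1	Relabel with the SHA1 digest of the sequence
--self	With --uchime_ref, ignore database sequences that have the same name as the query
--selfid	With --uchime_ref, ignore database sequences that are identical to the query
--sizeout	Add abundance annotations to the output
--uchime_denovo filename	De novo chimera detection on abundance-sorted sequences; needs no reference database and is not multi-threaded
--uchime2_denovo filename	De novo chimera removal with the UCHIME2 algorithm, ordered by abundance; intended for denoised amplicons such as the output of --cluster_unoise
--uchime3_denovo filename	Like --uchime2_denovo, but --abskew defaults to 16 instead of 2
--uchime_ref filename	Reference-based chimera detection
--uchimealns filename	Output three-way alignments of the chimeras
--uchimeout filename	Detailed chimera detection results
--uchimeout5	Use the usearch version 5 format
--xn real	No vote weight, corresponding to the parameter beta in the scoring function (default value is 8.0).
--xsize	Strip abundance annotations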

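Clustering options

vsearch uses a single-pass, greedy centroid-based clustering algorithm.

Parameter	Description
--biomout filename	Output an OTU table in biom 1.0 format (a JSON file); see http://biom-format.org/documentation/format_versions/biom-1.0.html
--centroids filename	Output the centroid sequences, used as representative sequences
--clusterout_id	Add cluster identifier information to the --consout and --profile output
--clusterout_sort	Sort the --consout, --msaout and --profile output by decreasing abundance
--cluster_fast filename	Cluster after sorting sequences by length
--cluster_size filename	Cluster after sorting sequences by abundance
--cluster_smallmem filename	Memory-saving clustering; sequences are not sorted by abundance and are expected in order of length unless --usersort is given
--cluster_unoise filename	Denoise with the unoise3 algorithm. --minsize defaults to 8 and --unoise_alpha to 2; follow it with --uchime3_denovo
--clusters string	Write each cluster to its own FASTA file, using string as the filename prefix
--consout filename	Output the consensus sequence of each cluster's alignment
--cons_truncate	This command is ignored. A warning is issued.
--id real	Similarity threshold
--iddef 0|1|2|3|4	Change the definition of pairwise identity. 0. CD-HIT definition: (matching columns) / (shortest sequence length). 1. edit distance: (matching columns) / (alignment length). 2. edit distance excluding terminal gaps (same as --id). 3. Marine Biological Lab definition counting each gap opening (internal or terminal) as a single mismatch, whether or not the gap was extended: 1.0 - [(mismatches + gap openings)/(longest sequence length)]. 4. BLAST definition, equivalent to --iddef 1 in a context of global pairwise alignment.
--minsize N	Minimum abundance for --cluster_unoise, default 8
--msaout filename	Output multiple sequence alignments
--mothur_shared_out filename	OTU table in mothur format
--otutabout filename	OTU table in classic tab-separated format
--profile filename	Frequency profile of each multiple sequence alignment
--qmask none|dust|soft	Sequence masking method
--relabel string	Relabel sequences; add --sizeout to keep the abundance annotation
--relabel_keep	Keep the original name when relabelling
--relabel_md5	Relabel with the MD5 digest of the sequence
--relabel_sha1	Relabel with the SHA1 digest of the sequence
--sizein	Take abundance annotations into account: search for the pattern '[>;]size=integer[;]' in sequence headers
--sizeorder	When an amplicon is close to several possible centroids, prefer the most abundant one
--sizeout	Include abundance information in the output
--strand plus|both	Strand(s) to check; only the plus strand by default
--uc filename	Results in uclust-like format
--unoise_alpha real	Sub-parameter of the --cluster_unoise command, default 2
--usersort	With --cluster_smallmem, accept sequences in a user-defined order
--xsize	Strip abundance information
Other options	Most searching options as well as score filtering, gap penalties and masking also apply to clustering (see the Searching section for definitions): --alnout, --blast6out, --fastapairs, --matched, --notmatched, --maxaccept, --maxreject, --samout, --userout, --userfields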
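
To make the first three --iddef definitions concrete, the sketch below evaluates them on a toy alignment. The function name and the two aligned rows are invented for the example, and definition 2 is taken here to mean matching columns divided by the alignment length without the leading and trailing gap columns, which is one reading of "edit distance excluding terminal gaps".

```python
def identity_definitions(aligned_a, aligned_b):
    """Compute --iddef 0, 1 and 2 for two rows of a pairwise alignment.

    Both rows must have the same length and use '-' for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(aligned_a.upper(), aligned_b.upper()))
    matches = sum(1 for a, b in columns if a == b and a != "-")

    # Count leading and trailing columns in which either row has a gap.
    leading = 0
    while leading < len(columns) and "-" in columns[leading]:
        leading += 1
    trailing = 0
    while trailing < len(columns) - leading and "-" in columns[-1 - trailing]:
        trailing += 1
    internal = len(columns) - leading - trailing
    if internal == 0:
        raise ValueError("every column contains a gap")

    shortest = min(len(aligned_a.replace("-", "")),
                   len(aligned_b.replace("-", "")))
    return {
        0: matches / shortest,      # CD-HIT definition
        1: matches / len(columns),  # edit distance over the full alignment
        2: matches / internal,      # terminal gap columns excluded
    }


for iddef, value in identity_definitions("ACGTACGT--", "--GTACCTAA").items():
    print(iddef, round(value, 3))
```

With 5 matching columns, two 8-nt sequences and a 10-column alignment that has 2 gap columns at each end, the three definitions give 5/8, 5/10 and 5/6 respectively, so the same pair can pass or fail a 0.8 threshold depending on the definition chosen.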

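Dereplication options

Parameter	Description
--derep_fulllength filename	Dereplicate identical full-length sequences
--derep_prefix filename	Dereplicate sequences that are identical to the prefix of a longer sequence
--maxuniquesize N	Discard sequences with an abundance above N
--minuniquesize N	Discard sequences with an abundance below N; 8 or one read per million (RPM 1) are common choices
--output filename	Output file
--relabel string	Relabel sequences; add --sizeout to keep the abundance annotation
--relabel_keep	Keep the original name when relabelling
--relabel_md5	Relabel with the MD5 digest of the sequence
--relabel_sha1	Relabel with the SHA1 digest of the sequence
--sizein	Take abundance annotations into account: search for the pattern '[>;]size=integer[;]' in sequence headers
--sizeout	Include abundance information in the output
--strand plus|both	Strand(s) to check; only the plus strand by default
--topn N	Output only the N most abundant sequences
--uc filename	Results in uclust-like format
--xsize	Strip abundance information

FASTA/FASTQ processing options

Parameter	Description
--eeout	With --fastq_filter or --fastq_mergepairs, include the number of expected errors (ee) in the sequence header of FASTQ and FASTA files. This option is a synonym of the --fastq_eeout option.
--eetabbedout filename	Write --fastq_mergepairs statistics to a file
--fastaout filename	FASTA output file
--fastaout_notmerged_fwd filename	Forward reads that could not be merged
--fastaout_notmerged_rev filename	Reverse reads that could not be merged
--fastaout_discarded filename	Sequences discarded by filtering
--fastq_allowmergestagger	Allow merging of staggered reads (with overhangs)
--fastq_ascii N	Quality encoding offset, default 33; 64 is the alternative
--fastq_asciiout N	Quality encoding offset of the output
--fastq_chars filename	Summarise the quality characters found in the file
--fastq_convert filename	Convert between the 33 and 64 encodings
--fastq_eeout	Include the number of expected errors in sequence names
--fastq_eestats filename	Expected-error statistics report
--fastq_eestats2 filename	Expected-error statistics report, second format
--fastq_filter filename	Filter FASTQ sequences by quality or length
--fastq_maxdiffs N	Maximum number of mismatches in the overlap when merging with --fastq_mergepairs, default 10; can be raised when read ends are of low quality and the overlap is long
--fastq_maxdiffpct real	Maximum percentage of mismatches in the overlap, default 100%
--fastq_maxee real	Maximum number of expected errors; used with --fastq_filter, --fastq_mergepairs or --fastx_filter
--fastq_maxee_rate real	Maximum number of expected errors per base; used with --fastq_filter, --fastq_mergepairs or --fastx_filter
--fastq_maxlen N	Maximum length
--fastq_maxmergelen N	Maximum length after merging; unlimited by default, useful for removing off-target fragments
--fastq_maxns N	Discard sequences with more than N Ns
--fastq_mergepairs filename	Merge paired-end reads; the argument is the forward-read file, --reverse names the reverse-read file and --fastqout the output file
--fastq_minlen N	Minimum length
--fastq_minmergelen N	Minimum length after merging
--fastq_minovlen N	Minimum overlap when merging
--fastq_nostagger	Forbid merging of staggered reads (with overhangs)
--fastq_qmax N	Maximum quality value, default 41
--fastq_qmaxout N	Maximum quality value written by --fastq_convert
--fastq_qmin N	Minimum quality value, default 0
--fastq_qminout N	Minimum quality value in the output file
--fastq_stats filename	Statistics on sequence length, number and quality
--fastq_stripleft N	Number of bases removed from the left end, e.g. a barcode or forward primer
--fastq_stripright N	Number of bases removed from the right end, e.g. a reverse primer
--fastq_tail N	Count k-mer frequencies at read tails; the length defaults to 4 and can be changed
--fastq_truncee real	Truncate sequences so that their total expected errors do not exceed the given value
--fastq_trunclen N	Truncate sequences to N bases; shorter sequences are discarded
--fastq_trunclen_keep N	Truncate sequences to N bases, but keep shorter sequences
--fastq_truncqual N	Truncate sequences at the first base with a quality value of N or lower
--fastx_revcomp filename	Reverse-complement the sequences
--label_suffix string	Suffix added to sequence names by --fastx_revcomp or --fastq_mergepairs
--maxsize N	Maximum abundance threshold
--minsize N	Minimum abundance threshold
--output filename	Quality statistics written by --fastq_eestats or --fastq_eestats2
--relabel_keep	Keep the original name when relabelling
--relabel_md5	Relabel with the MD5 digest of the sequence
--relabel_sha1	Relabel with the SHA1 digest of the sequence
--reverse filename	Reverse-read file used when merging
--xsize	Strip abundance information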
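
Several of the options above are expressed in expected errors. The expected number of errors in a read is the sum of its per-base error probabilities, where a quality score Q corresponds to P = 10^(-Q/10). The following sketch, with hypothetical function names and a made-up quality string, shows the calculation and the kind of test a maximum expected-error threshold implies; it is an illustration, not vsearch's code.

```python
def expected_errors(quality, offset=33):
    """Sum of per-base error probabilities, P = 10 ** (-Q / 10)."""
    return sum(10 ** (-(ord(ch) - offset) / 10) for ch in quality)


def passes_maxee(quality, maxee, offset=33):
    """True if the read has no more than `maxee` expected errors."""
    return expected_errors(quality, offset) <= maxee


read_quality = "IIII5555++"  # Q40 x4, Q20 x4, Q10 x2 in phred+33
print(f"{expected_errors(read_quality):.4f}")
print(passes_maxee(read_quality, maxee=1.0))
print(passes_maxee(read_quality, maxee=0.2))
```

The two Q10 bases contribute 0.2 of the total 0.2404 expected errors, which is why a few poor bases at the read end can sink an otherwise good read.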

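Masking options:

These are rarely needed for amplicons, so pages 18 to 19 of the manual are skipped here; see the help for details.

Pairwise alignment options:

These are rarely needed for amplicons, so pages 19 to 20 of the manual are skipped here; see the help for details.

Searching options (manual pages 20 to 27):

There are more than a hundred searching options; only the commonly used ones are listed here.

Parameter	Description
--alnout filename	Output pairwise global alignments
--biomout filename	OTU table in biom 1.0 format
--blast6out filename	Alignment results in BLAST-like tabular format
--db filename	Database or reference file
--gapext string	Gap extension penalties
--gapopen string	Gap opening penalties
--id real	Similarity threshold; 0.97 and 0.99 are common
--samout filename	Alignment results in SAM format
--search_exact filename	Search for exact full-length matches
--usearch_global filename	Global alignment search against a reference database or OTUs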

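Shuffling options:

Parameter	Description
--randseed N	Random seed, so that results are reproducible
--shuffle filename	Shuffle the order of the sequences at random

Sorting options:

Parameter	Description
--maxsize N	With --sortbysize, maximum abundance
--minsize N	With --sortbysize, minimum abundance
--output filename	Output file
--sizeout	Include abundance in the output
--sortbylength filename	Sort by length
--sortbysize filename	Sort by abundance
--topn N	Keep only the N longest or most abundant sequences

Subsampling options (manual page 29):

Parameter	Description
--fastaout filename	FASTA output file
--fastaout_discarded filename	Sequences not selected, in FASTA format
--fastqout filename	FASTQ output file
--fastqout_discarded filename	Sequences not selected, in FASTQ format
--fastx_subsample filename	Subsample a FASTA/FASTQ file
--randseed N	Random seed
--sample_pct real	Percentage of sequences to sample, 0 to 100
--sample_size N	Number of sequences to extract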

Taxonomic classification options (manual page 30):

The database is a FASTA file whose sequence names follow the format ">X80725_S000004313;tax=d:Bacteria,p:Proteobacteria,c:Gammaproteobacteria,o:Enterobacteriales,f:Enterobacteriaceae,g:Escherichia/Shigella,s:Escherichia_coli".

Parameter	Description
--db filename	Database
--sintax_cutoff real	Minimum bootstrap support
--sintax filename	Input file
--tabbedout filename	Output file, a table with 4 columns
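
The header format shown above is easy to take apart, which helps when checking a custom database before running --sintax. The parser below is a small sketch with a hypothetical function name; it handles only the layout of the example header and is not part of vsearch.

```python
def parse_tax(header):
    """Split a SINTAX-style header into (sequence id, {rank letter: name})."""
    header = header.lstrip(">")
    seq_id, _, rest = header.partition(";tax=")
    taxonomy = {}
    for field in rest.rstrip(";").split(","):
        rank, sep, name = field.strip().partition(":")
        if sep:  # skip empty or malformed fields
            taxonomy[rank] = name
    return seq_id, taxonomy


header = (">X80725_S000004313;tax=d:Bacteria,p:Proteobacteria,"
          "c:Gammaproteobacteria,o:Enterobacteriales,f:Enterobacteriaceae,"
          "g:Escherichia/Shigella,s:Escherichia_coli")
seq_id, taxonomy = parse_tax(header)
print(seq_id)
print(taxonomy["g"], taxonomy["s"])
```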

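UDB options (manual page 31):

Parameter	Description
--dbmask none|dust|soft	Sequence masking method, default none
--makeudb_usearch filename	Create an index
--output filename	Output file
--udb2fasta filename	Convert a database index to FASTA
--udbinfo filename	Show information about an index
--udbstats filename	Index statistics
--wordlength N	Word length of the index, 3 to 15, default 8

Deliberate changes relative to usearch (manual page 34):

The --cluster_size command sorts the sequences first and then clusters them.

--iddef is kept as an adjustable option for choosing how pairwise identity is computed.

--sizein reads abundance annotations in both dereplication and clustering, which is convenient when combining several batches of data.

U and T are treated as identical.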

Usage examples

Pairwise alignment of all sequences in a database

Align all sequences in a database with each other and output all pairwise alignments:

vsearch --allpairs_global database.fas --alnout results.aln --acceptall

De novo chimera detection, with parents at least 1.5 times as abundant as chimeras

Check for the presence of chimeras (de novo); parents should be at least 1.5 times more abundant than chimeras. Output non-chimeric sequences in fasta format (no wrapping):

vsearch --uchime_denovo queries.fas --abskew 1.5 --nonchimeras results.fas --fasta_width 0

Clustering at 97%, with centroids as representative sequences and uclust-like output

Cluster with a 97% similarity threshold, collect cluster centroids, and write cluster descriptions using a uclust-like format:

vsearch --cluster_fast queries.fas --id 0.97 --centroids centroids.fas --uc clusters.uc

Dereplication that uses the abundance in sequence names and keeps sequences with an abundance above 1

Dereplicate the sequences contained in queries.fas, take into account the abundance information already present, write unwrapped fasta sequences to queries_unique.fas with the new abundance information, discard all sequences with an abundance of 1:

vsearch --derep_fulllength queries.fas --sizein --fasta_width 0 --sizeout --output queries_unique.fas --minuniquesize 2

Masking low-complexity regions

Mask simple repeats and low complexity regions in the input fasta file with the DUST algorithm (masked regions are lowercased), and write the results to the output file:

vsearch --maskfasta queries.fas --qmask dust --output queries_masked.fas

Searching a database at 80% similarity, counting terminal gaps

Search queries in a reference database, with a 80%-similarity threshold, take terminal gaps into account when calculating pairwise similarities, output pairwise alignments:

vsearch --usearch_global queries.fas --db references.fas --id 0.8 --iddef 1 --alnout results.aln

Searching a dataset against itself at 60% similarity, with BLAST-like tabular output

Search a sequence dataset against itself (ignore self hits), get all matches with at least 60% similarity, and collect results in a blast-like tab-separated format. Accept an unlimited number of hits (--maxaccepts 0), and compare each query to all other sequences, including unlikely candidates (--maxrejects 0):

vsearch --usearch_global queries.fas --db queries.fas --self --id 0.6 --blast6out results.blast6 --maxaccepts 0 --maxrejects 0

Shuffling a fasta file

Shuffle the input fasta file (change the order of sequences) in a repeatable fashion (fixed seed), and write unwrapped fasta sequences to the output file:

vsearch --shuffle queries.fas --output queries_shuffled.fas --randseed 13 --fasta_width 0

Sorting by abundance

Sort by decreasing abundance the sequences contained in queries.fas (using the 'size=integer' information), relabel the sequences while preserving the abundance information (with --sizeout), keep only sequences with an abundance equal to or greater than 2:

vsearch --sortbysize queries.fas --output queries_sorted.fas --relabel sampleA_ --sizeout --minsize 2

Citation

Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584. https://doi.org/10.7717/peerj.2584
