HUMAnN2：人类微生物组统一代谢网络分析2

关于宏基因组常用的有参分析流程，主要是快速获得物种组成和功能组成，之前分享了

MetaPhlAn2一条命令获得宏基因组物种组成

今天再介绍来自同一作者的另一个软件，可以一步完成功能和代谢组成。

HUMAnN2: The HMP Unified Metabolic Analysis Network 2，它在宏基因组研究中非常有用，通过这个分析，不仅能获得微生物的物种丰度信息，还能准确有效地获得微生物代谢途径和功能模块信息。

主页： http://www.huttenhower.org/humann2

官方教程：https://bitbucket.org/biobakery/humann2/wiki/Home (版本2017-12-14)

中文版本翻译日期(2018-05-01)

HUMAnN是基于宏基因组、宏转录组数据分析微生物通路丰度的有效工具。这一过程称为功能谱，目的是描述群体成员的代谢潜能。可以回答微生物群体成员可能干什么，或在干什么的问题。

软件特点：

可对已知和末知生物分析群体功能谱
- MetaPhlAn2和ChocoPhlAn泛基因组数据库，可以更快速准确获得功能谱
- 物种包括古菌、细菌、真核生物和病毒
可获得基因组、基因和通路层面的结果
- UniRef数据库提供基因家族的定义
- MetaCyc通路基因通路的定义
- MinPath提供定义的最小通路集
简单的使用界面(单行命令工作流)
- 用户只需提供质控的宏基因组或宏转录组数据
加速序列比对
- 采用Bowtie2加速核酸水平搜索
- 采用Diamond加速翻译蛋白水平搜索

HUMAnN2工作流程图

安装

如果你安装过python，且有pip安装工具，可以轻松安装humann2

# 软件安装
pip install humann2

# 或可选手动下载安装
wget //files.pythonhosted.org/packages/43/07/ec41577c3c1f9b578875ade8ed549d14fc2944c13cb7504579d542b62a69/humann2-0.11.1.tar.gz    

# 前面仍不成功，推荐conda安装更快更好用
conda install humann2

# 测试安装
humann2_test

# 比如我使用conda安装程序至/conda/bin目录，且没有添加环境变量，可以使用绝对路径调用程序

# 下载数据库
wd=/conda/bin
$wd/humann2_databases --available
# 5.37GB
$wd/humann2_databases --download chocophlan full /data/humann2
# 5.87GB，解压后11G
$wd/humann2_databases --download uniref uniref90_diamond /data/humann2

依赖关系

# Diaomond http://ab.inf.uni-tuebingen.de/software/diamond/
wget http://github.com/bbuchfink/diamond/releases/download/v0.9.21/diamond-linux64.tar.gz
tar xzf diamond-linux64.tar.gz
sudo ln -fs `pwd`/diamond /usr/local/bin/

## 分析实战

输入文件为fastq，输出文件为指定目录中有各定量表格

cd ~/ath/jt.terpene.meta/clean_data/JT-545
# 可接受压缩文件fastq，并自建目录
$wd/humann2 --input 25/JT-545_25.rmhost.1.fq.gz --output humann2_25 &
$wd/humann2 --input 26/JT-545_26.rmhost.1.fq.gz --output humann2_26 &
$wd/humann2 --input 27/JT-545_27.rmhost.1.fq.gz --output humann2_27 &

输出文件

1. 基因家族文件

# Gene Family   $SAMPLENAME_Abundance-RPKs
UNMAPPED        187.0
UniRef50_unknown        150.0
UniRef50_unknown|g__Bacteroides.s__Bacteroides_fragilis 150.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon    67.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_fragilis 57.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_finegoldii   5.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|g__Bacteroides.s__Bacteroides_stercoris    4.0
UniRef50_A6L0N6: Conserved protein found in conjugate transposon|unclassified   1.0
UniRef50_O83668: Fructose-bisphosphate aldolase 60.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_vulgatus  31.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_thetaiotaomicron  22.0
UniRef50_O83668: Fructose-bisphosphate aldolase|g__Bacteroides.s__Bacteroides_stercoris 7.0

文件名： $OUTPUT_DIR/$ SAMPLENAME_genefamilies.tsv
群体中每个基因家族的丰度。基因家族是一组进化上相关的编码蛋白序列，通常具有相似功能。
基因家族的丰度在群体水平分级显著，显示已知和未知物种的贡献度。
使用MetaPhlAn2软件和ChocoPhlAn数据库，检索核酸翻译的蛋白数据库
基因家族的丰度采用RPK(每Kb的reads)以标准化不同的基因长度；RPK单位代表基因或转录本在群体中的拷贝数；RPK值可进一步求和标准化，用于不同样品测序深度的比较。
如果输入文件是基因表，不会创建基因家族文件
UNMAPPED是两步核酸和蛋白搜索后，仍无法比对的reads数量。
UniRef50_unknown 代表可比对ChocoPhlAn，但没有注释
1. 通路丰度文件

#Pathway   $SAMPLENAME_Abundance
UNMAPPED    140.0
UNINTEGRATED    87.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae   23.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_finegoldii   20.0
UNINTEGRATED|unclassified   12.0
PWY0-1301: melibiose degradation    57.5
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_caccae   32.5
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_finegoldii   4.5
PWY0-1301: melibiose degradation|unclassified   3.0
PWY-5484: glycolysis II (from fructose-6P)  54.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_caccae 16.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_fi

文件名：OUTPUT_DIR/$SAMPLENAME_pathabundance.tsv
代表群体中通路的丰度
即有群体水平，又有物种水平丰度
通路按丰度大小排序，物种组分也按丰度大小排序，全为0的通路不输出
通路的比例是是完整拷贝的丰度，如线性通路Gene1-4，分别为10，5，5，5。则按5计算。
与基因不同，通路的丰度并一定是群体组分的总合。A物种[5, 5, 10, 10]为5个拷贝，B物种[10, 10, 5, 5]为5个拷贝，而总体有15个拷贝；详细的计算说明见英文帮助原文
对于单个基因必须的反应步骤为零丰度时，可进行所需最低丰度填充。
MetaCyc默认定义最简通路解析群体观测的代谢通路；
用户也可以自定义通路数据库
非线性基因拷贝数、无法比对序列处理方法请参考英文原文
1. 通路覆盖度文件

# Pathway   $SAMPLENAME_Coverage
UNMAPPED    1.0
UNINTEGRATED    1.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_caccae   1.0
UNINTEGRATED|g__Bacteroides.s__Bacteroides_finegoldii   1.0
UNINTEGRATED|unclassified   1.0
PWY0-1301: melibiose degradation    1.0
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_caccae   1.0
PWY0-1301: melibiose degradation|g__Bacteroides.s__Bacteroides_finegoldii   1.0
PWY0-1301: melibiose degradation|unclassified   1.0
PWY-5484: glycolysis II (from fructose-6P)  1.0
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_caccae 0.7
PWY-5484: glycolysis II (from fructose-6P)|g__Bacteroides.s__Bacteroides_finegoldii 0.7
PWY-5484: glycolysis II (from fructose-6P)|unclassified 0.3

文件名：$OUTPUT_DIR/$SAMPLENAME_pathabundance.tsv
文件提供了一种有(1)和无(0)的群体通路计算法，而不是相对丰度
每个反应有置信得分
计算通路覆盖与相对丰度一致
通路丰度在群体水平和物种水平计算
群体水平比物种水平更可信
只输出非零丰度的通路
通路覆盖度与通路丰度顺序相同
1. 中间临时文件
Bowtie2比对结果($DIR/$SAMPLENAME_bowtie2_aligned.sam)
简化Bowtie2比对结果$DIR/$SAMPLENAME_bowtie2_aligned.tsv
Bowtie2数据库索引$DIR/$SAMPLENAME_bowtie2_index*
Bowtie2末比对的reads$DIR/$SAMPLENAME_bowtie2_unaligned.fa
自定义ChocoPhlAn数据库$DIR/$SAMPLENAME_custom_chocophlan_database.ffn
MetaPhlAn2的bowtie2比对结果$DIR/$SAMPLENAME_metaphlan_bowtie2.txt
MetaPhlAn2 bugs list$DIR/$SAMPLENAME_metaphlan_bugs_list.tsv
翻译后的比对结果$DIR/$SAMPLENAME_$TRANSLATEDALIGN_aligned.tsv
翻译后仍末比对序列$DIR/$SAMPLENAME_$TRANSLATEDALIGN_unaligned.fa
日志文件$DIR/$SAMPLENAME.log

配置软件

# 显示参数
$wd/humann2_config --print
# 修改参数格式
$wd/humann2_config --update $SECTION $NAME $VALUE
# 如修改线程数
$wd/humann2_config --update run_modes threads 12

HUMAnN2小工具

humann2_barplot

Basic usage: $ humann2_barplot --input $TABLE.tsv --feature $FEATURE --outfile $FIGURE
$TABLE.tsv = a stratified HUMAnN2 output file
$FEATURE = Feature from the table to plot (defaults to first feature)
$FIGURE = Where to save the figure
Run with -h to see additional command line options

可选择某个Feature进行柱状图可视化。–help参数可查看相关排序、标准化选项。

合并多样品结果为表格

此步非常重要，我们无法多少个样品，humann2结果仅为一列。多样品需经本步合并为矩阵，方便下游统计分析和差异比较。

Basic usage: $ humann2_join_tables --input $INPUT_DIR --output $TABLE
$INPUT_DIR = a directory containing gene/pathway tables (tsv or biom format)
$TABLE = the file to write the new single gene table (biom format if input is biom format)
Optional: --file_name $STR will only join gene tables with $STR in file name
Run with -h to see additional command line options

其它小工具

构建自定义数据库 humann2_build_custom_database
查看属水平基因家族与通路 humann2_gene_families_genus_level
增加物种分类(降低可信度) humann2_infer_taxonomy
合并MetaPhlAn2分类结果 humann2_reduce_table
对样品、通路进行合并/重分组操作humann2_regroup_table
对Feature进行重命名 humann2_rename_table
表格标准化 humann2_renorm_table
宏转录组标准化 humann2_rna_dna_norm
将结果分为分层、末分层两个文件 humann2_split_stratified_table
将多样品表分类单样品 humann2_split_table
挖掘菌株水平差异 humann2_strain_profiler
输出组成通路的基因丰度 humann2_unpack_pathways

其它说明

选择基因家族精度的水平
选择UniRfe90还是50? 推荐90
选择翻译搜索模式
1. Bypass translated search: 无法比对泛基因组的保存，再比对蛋白，结果中没有末分类的层级
2. Filtered translated search，是EC-filtered protein database(是UniRef中Level4层面国际生物化学联合会酶委员分类)的默认方案
3. Comprehensive translated search 使用最综合的蛋白数据库，但速度比2慢5倍。默认为方案2
双端序列：HUMAnN2不考虑双端序列，全当作单端
PICRUSt输出可继续用本软件分析，需要下载KEGG完整数据库，拆分预测宏基因组为每个样品单个文件，再运行humann2，再合并。详见官网
使用KEEG数据库，目前新版收费，免费版本最新为v56，可同HUMAnN1一起下载
结果可使用QIIME的core_diversity_analyses.py进行多样性分析
HUMAnN2分析宏转录组

常见问题

如何输出分析过程更多信息？添加--verbose参数
如何使用多核？--threads $CORES或修改默认设置
如何清空临时文件？--remove-temp-output
指定ChocoPhlAn数据库位置？--nucleotide-database $DIR
指定UniRef数据库位置？--protein-database $DIR
使用Metaphlan2结果继续分析？--taxonomic-profile bugs_list.tsv
修改样本名？--output-basename $NAME
去除分层的结果？--remove-stratified-output
使用unipathways databases？--pathways unipathway
输出biom格式结果--output-format biom
修改相似度阈值--identity-threshold <50.0>
修改metaphlan2参数--metaphlan-options="-t rel_ab"

报错与解决

CRITICAL ERROR: Can not call software version for diamond

diamond没有在环境变量，下载解压并确保添加到环境变量

The database file for MetaPhlAn does not exist at /mnt/bai/yongxin/software/metaphlan2/db_v20/mpa_v20_m200.pkl . Please provide the location with –metaphlan-options

没有找到metaphlan2的数据库，是metaphlan2新版本目录更改了位置，永久方法是建一个旧位置的硬链。

进入metaphlan2安装目录

mkdir db_v20
ln `pwd`/databases/mpa_v20_m200.* db_v20/

CRITICAL ERROR: Error executing: /mnt/bai/public/bin/diamond blastx –query /mnt/bai/yongxin/ath/jt.terpene.meta/clean_data/JT-545/humann2/JT-545_25.rmhost.1_humann2_temp/JT-545_25.rmhost.1_bowtie2_unaligned.fa –evalue 1.0 –threads 1 –max-target-seqs 20 –outfmt 6 –db /data/humann2/uniref/uniref90_annotated.1.1 –out /mnt/bai/yongxin/ath/jt.terpene.meta/clean_data/JT-545/humann2/JT-545_25.rmhost.1_humann2_temp/tmp2_lUDg/diamond_m8_TL5Rl_ –tmpdir /mnt/bai/yongxin/ath/jt.terpene.meta/clean_data/JT-545/humann2/JT-545_25.rmhost.1_humann2_temp/tmp2_lUDg

/mnt/bai/public/bin/diamond是个目录，不知为什么系统会找这个目录当前程序，系统我也装在了 /usr/local/bin/diamond 中。修改此目录为程序

2. 分析实战有参系列：
- MetaPhlAn2一条命令获得宏基因组物种组成

3. 分析实战De novo系列:
- 1背景知识-Shell入门与本地blast实战
- 2数据质控fastqc, Trimmomatic, MultiQC, khmer
- 3组装拼接MEGAHIT和评估quast
- 4基因注释Prokka
- 5基于Kmer比较数据集sourmash
- 6不比对快速估计基因丰度Salmon
- 7bwa序列比对, samtools查看, bedtools丰度统计
- 8分箱宏基因组binning, MaxBin, MetaBin, VizBin
- 9组装assembly和分箱bin结果可视化—Anvi’o
- 10绘制圈图-Circos安装与使用

如果基础知识体系不完善，自学存在困难的小伙伴，急时上车也是不错的选择。成为实验中不可或缺的人，赶快点击阅读原文报名我们的培训，加速你入行！http://www.ehbio.com/Training

写在后面

为鼓励读者交流、快速解决科研困难，我们建立了“宏基因组”专业讨论群，目前己有国内外1500+ 一线科研人员加入。参与讨论，获得专业解答，欢迎分享此文至朋友圈，并扫码加主编好友带你入群，务必备注“姓名-单位-研究方向-职称/年级”。技术问题寻求帮助，首先阅读《如何优雅的提问》学习解决问题思路，仍末解决群内讨论，问题不私聊，帮助同行。

学习扩增子、宏基因组科研思路和分析实战，关注“宏基因组”

点击阅读原文，跳转最新文章目录阅读
https://mp.weixin.qq.com/s/5jQspEvH5_4Xmart22gjMA

宏基因组培训班点链接或阅读原文报名 http://www.ehbio.com/Training