Gene prediction commands 2

If you use this method, please cite the following paper. Thank you!

Genome Re-Annotation and Transcriptome Analyses of Sanghuangporus sanghuang. Journal of Fungi (JoF).

Gene annotation with MAKER (advanced: AUGUSTUS model training), 徐洲更hoptop's blog, CSDN

https://www.cnblogs.com/zhanmaomao/p/12359964.html

https://www.cnblogs.com/southern-xyx/p/4497497.html

Augustus Training and Prediction | 陈连福's bioinformatics blog

Gene annotation with MAKER (advanced: SNAP model training), 徐洲更hoptop's blog, CSDN

Annotation with MAKER: training the SNAP gene model, 徐洲更hoptop's blog, CSDN

Installing and using SNAP | 陈连福's bioinformatics blog

Gene analysis and prediction for eukaryotic genomes, wangyunpeng_bio's blog, CSDN

[Genome annotation] Building a repeat-sequence database, 徐洲更hoptop's blog, CSDN

Identifying genomic MITE sequences with MITE-Hunter, Zhihu

Using MAKER | 陈连福's bioinformatics blog

MAKER3 

Gene annotation: a MAKER3 pipeline based on SNAP + AUGUSTUS + GeneMark

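# Create working directories
mkdir 10.gene_prediction && cd 10.gene_prediction
mkdir maker3 && cd maker3
# Switch to root (enter the password when prompted) and activate the conda environment
sudo su
conda activate training
# Generate the MAKER control files
maker -CTL
# This creates three control files: maker_bopts.ctl, maker_exe.ctl, and maker_opts.ctl
# Alternatively, copy the control files from an earlier run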

Round1

vi maker_opts.ctl
# Change the following parameters
genome=/media/aa/DATA/SZQ2/bj/my/genome/5.1216/03plion_primary/pilon02.fasta

est=/media/aa/DATA/SZQ2/bj/my/genome/5.1216/08.RNA-seq_analysis/zonghe/transcripts.fasta # transcriptome sequences
protein=/media/aa/DATA/SZQ2/protein.fa # homologous protein sequences downloaded from UniProt

rmlib=/media/aa/DATA/SZQ2/bj/my/genome/5.1216/04genome_feature_analysis_primary/repeat_analysis/repeatModeler/RM_630350.TueMar211317052023/consensi.fa.classified
softmask=1 # soft masking: repeats are converted to lowercase rather than N, so short repeats inside genes can still be annotated as part of the gene
est2genome=1 # build gene models directly from transcript evidence
protein2genome=1 # build gene models directly from homologous protein evidence
trna=0
cpus=12
AED_threshold=1
keep_preds=0
# Run
mpiexec -n 20 maker -fix_nucleotides -base rnd1 &> maker.log1
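Before moving on to training, it can help to confirm that every contig finished. The sketch below tallies the status column of the master datastore index log. It assumes the log has three tab-separated columns (sequence ID, output path, status); the function name `maker_status_summary` is made up for this example.

```shell
# Tally per-contig status values from a MAKER master datastore index log.
# Assumption: three tab-separated columns (sequence ID, output path, status).
maker_status_summary() {
    awk -F'\t' 'NF >= 3 { n[$3]++ } END { for (s in n) print s "\t" n[s] }' "$1" |
        LC_ALL=C sort
}

# Usage (path follows the "-base rnd1" naming used above):
# maker_status_summary rnd1.maker.output/rnd1_master_datastore_index.log
```

If the count of finished contigs is lower than the count of started ones, check maker.log1 for the contigs that did not complete.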

Error

Can't locate forks.pm in @INC (you may need to install the forks module) (@INC contains: /media/aa/DATA/SZQ2/maker/bin/../perlperl/5.30.0/lib /media/aa/DATA/SZQ2/maker/bin/../lib /media/aa/DATA/SZQ2/maker/bin/../src/inc/perl/lib /etc/perl /usr/local/lib/x86_64-linux-gnu/perl/5.30.0 /usr/local/share/ /usr/lib/x86_64-linux-gnu/perl5/5.30 /usr/share/perl5 /usr/lib/x86_64-linux-gnu/perl/5.30 /usr/share/perl/5.30 /usr/local/lib/site_perl /usr/lib/x86_64-linux-gnu/perl-base) at /media/aa/DATA/SZQ2/maker/bin/maker line 25.
BEGIN failed--compilation aborted at /media/aa/DATA/SZQ2/maker/bin/maker line 25.

Can't locate XXX/XXX.pm in @INC (you may need to install the XXX::XXX module), Augus qqq's blog, CSDN

Using antiSMASH | 陈连福's bioinformatics blog

# First attempt
cpan install forks
# This failed
# Fix: install the Perl modules listed below
cpan -i BioPerl Bit::Vector DBD::SQLite DBI Error Error::Simple File::NFSLock File::Which forks forks::shared Inline Inline::C IO::All IO::Prompt PerlIO::gzip Perl::Unsafe::Signals Proc::ProcessTable Proc::Signal threads URI::Escape

Round2

1. Train the SNAP gene model

First, train on rnd1.all.gff, the merged output of the previous round.

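# Create a working directory
mkdir SNAP1 && cd SNAP1
# Merge the per-contig GFF files from round 1
gff3_merge -d ../rnd1.maker.output/rnd1_master_datastore_index.log
# Convert to ZFF, keeping models with AED <= 0.5 (-x) and proteins of at least 50 aa (-l)
maker2zff -l 50 -x 0.5 rnd1.all.gff
# Filter
fathom -categorize 1000 genome.ann genome.dna 
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
# Assemble the HMM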
hmm-assembler.pl snap . > ../snap1.hmm
mv rnd1.all.gff ../ 
cd ..
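SNAP training is only as good as the models that survive the maker2zff filter, so it is worth checking how many there are. The sketch below summarises a ZFF annotation file such as SNAP1/genome.ann. It assumes the usual ZFF layout: `>` header lines for sequences, followed by exon lines whose fourth column is the model ID. `zff_summary` is a made-up name.

```shell
# Count sequences, gene models, and exons in a ZFF annotation file.
# Assumption: ">" lines name sequences; other lines are
# "label begin end model_id" with the model ID in column 4.
zff_summary() {
    awk '
        /^>/    { seqs++; next }
        NF >= 4 { models[$4] = 1; exons++ }
        END {
            n = 0
            for (m in models) n++
            printf "sequences=%d models=%d exons=%d\n", seqs, n, exons
        }' "$1"
}

# Usage:
# zff_summary SNAP1/genome.ann
```

If only a small number of models remain, relaxing the `-x` or `-l` thresholds in maker2zff is one option to consider.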

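2. Predict genes with SNAP

# Back up the round 1 maker_opts.ctl
cp maker_opts.ctl maker_opts.ctl_backup_rnd1
# Edit maker_opts.ctl for round 2
vi maker_opts.ctl
# Change the following
maker_gff=rnd1.all.gff
est_pass=1 # reuse the EST alignments from round 1
protein_pass=1 # reuse the protein alignments from round 1
rm_pass=1 # reuse the repeats recorded in the GFF file

est= # clear the EST file; EST alignment is not rerun in this round
protein= # clear the protein file; protein alignment is not rerun
model_org= # clear the RepeatMasker model organism; repeat masking is not rerun
rmlib= # clear the repeat library; repeat masking is not rerun
repeat_protein= # clear the repeat protein database; repeat masking is not rerun
est2genome=0 # no longer build gene models directly from EST evidence
protein2genome=0 # no longer build gene models directly from protein evidence

snaphmm=snap1.hmm

pred_stats=1 # report AED stats
alt_splice=0 # 0: keep one isoform per gene; 1: identify splice variants of the same gene
keep_preds=1 # 1: keep predictions even without evidence support; 0: drop them
# Run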
mpiexec -n 20 /media/aa/DATA/SZQ2/maker/bin/maker -fix_nucleotides -base rnd2 &> maker.log2
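Editing maker_opts.ctl by hand in every round is easy to get wrong. As an alternative to `vi`, the sketch below sets one `key=value` line while leaving the trailing `#comment` in place. `set_ctl_opt` is a made-up helper, and it assumes each option sits on its own line starting with `key=`, as in the files written by `maker -CTL`.

```shell
# Set one option in a MAKER control file, preserving the trailing comment.
# Assumption: the option line starts with "key=". Values containing "|",
# "&", or "\" are not handled by this simple sed substitution.
set_ctl_opt() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp) || return 1
    sed -E "s|^${key}=[^#]*|${key}=${value} |" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage, mirroring some of the round 2 changes above:
# set_ctl_opt maker_opts.ctl maker_gff rnd1.all.gff
# set_ctl_opt maker_opts.ctl est ""
# set_ctl_opt maker_opts.ctl est2genome 0
# set_ctl_opt maker_opts.ctl snaphmm snap1.hmm
```

Because the key is anchored at the start of the line, setting `est` does not touch `est2genome` or `est_pass`.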

Round3

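Retrain the SNAP model and run another round of SNAP gene prediction.

SNAP usually needs 2-3 rounds in total.

1. Train a new SNAP model

# Create a working directory
mkdir SNAP2 && cd SNAP2
# Merge the per-contig GFF files from round 2
gff3_merge -d ../rnd2.maker.output/rnd2_master_datastore_index.log
maker2zff -l 50 -x 0.5 rnd2.all.gff
# Filter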
fathom -categorize 1000 genome.ann genome.dna
fathom -export 1000 -plus uni.ann uni.dna
forge export.ann export.dna
hmm-assembler.pl snap . > ../snap2.hmm
mv rnd2.all.gff ../
cd ..

2. Predict genes with SNAP

# Back up the round 2 maker_opts.ctl
cp maker_opts.ctl maker_opts.ctl_backup_rnd2
# Edit maker_opts.ctl for round 3
vi maker_opts.ctl
# Change the following
maker_gff=rnd2.all.gff
snaphmm=snap2.hmm
# Run
mpiexec -n 20 /media/aa/DATA/SZQ2/maker/bin/maker -fix_nucleotides -base rnd3 &> maker.log3
# Back up the round 3 maker_opts.ctl
cp maker_opts.ctl maker_opts.ctl_backup_rnd3
# Create a working directory
mkdir SNAP3 && cd SNAP3
# Merge the round 3 GFF files without the genome sequence (-n) and rename the result
gff3_merge -n -d ../rnd3.maker.output/rnd3_master_datastore_index.log
mv rnd3.all.gff rnd3.noseq.gff
# Collect the round 3 transcript and protein FASTA files
fasta_merge -d ../rnd3.maker.output/rnd3_master_datastore_index.log
cd ..
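One way to judge whether an extra SNAP round helped is to compare how many transcripts fall under a given AED cutoff in each round's GFF. The sketch below reads the `_AED=` attribute that MAKER writes on mRNA features and reports the share at or below a threshold. `aed_fraction` is a made-up name and the 0.5 default is only an illustration.

```shell
# Report how many mRNA features have AED at or below a threshold (default 0.5).
# Assumption: mRNA lines carry an "_AED=<number>" attribute in column 9.
aed_fraction() {
    awk -F'\t' -v t="${2:-0.5}" '
        $3 == "mRNA" && match($9, /_AED=[0-9.]+/) {
            total++
            aed = substr($9, RSTART + 5, RLENGTH - 5)
            if (aed + 0 <= t + 0) ok++
        }
        END {
            if (total == 0) { print "no mRNA records with _AED found"; exit 1 }
            printf "%d/%d (%.3f)\n", ok, total, ok / total
        }' "$1"
}

# Usage, comparing two rounds:
# aed_fraction rnd2.all.gff 0.5
# aed_fraction SNAP3/rnd3.noseq.gff 0.5
```

A higher share of low-AED transcripts in the later round suggests the retrained model agrees better with the evidence; if the numbers barely move, further SNAP rounds are unlikely to add much.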

Round4

Train the AUGUSTUS model
Run MAKER with the AUGUSTUS model produced by BRAKER2

# BRAKER2 result: /media/aa/DATA/SZQ2/bj/my/genome/5.1210/10.gene_prediction/braker/species/5.1210
vi maker_exe.ctl
augustus=/root/anaconda3/envs/training/bin/augustus # already updated

vi maker_opts.ctl
# Change the following values in the file from the previous step
maker_gff=rnd1.all.gff
est_pass=1 # use est alignment from round 1
protein_pass=1 #use protein alignment from round 1
rm_pass=1 # use repeats in the gff file
snaphmm=snap2.hmm # SNAP HMM file, unchanged
augustus_species=/media/aa/DATA/SZQ2/bj/my/genome/5.1210/10.gene_prediction/braker/species/5.1210 # augustus species model you just built
est= # remove est file, do not run EST blast again
protein= # remove protein file, do not run blast again
model_org= #remove repeat mask model, so not running RM again
rmlib= # not running repeat masking again
repeat_protein= #not running repeat masking again
est2genome=0 # do not do EST evidence based gene model
protein2genome=0 # do not do protein based gene model.
pred_stats=1 #report AED stats
alt_splice=0 # 0: keep one isoform per gene; 1: identify splicing variants of the same gene
keep_preds=1 # 1: keep predictions even without evidence support; 0: drop them
# Run
mpiexec -n 20 /media/aa/DATA/SZQ2/maker/bin/maker -fix_nucleotides -base rnd4 &> maker.log4
# Back up
cp maker_opts.ctl maker_opts.ctl_backup_rnd4
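AUGUSTUS normally looks up a species by name under `$AUGUSTUS_CONFIG_PATH/species`, so if the full path given in augustus_species is not accepted on your system, one option is to copy the BRAKER2 species directory there and use its name instead. The sketch below checks whether a species is visible to AUGUSTUS in that way. `augustus_species_installed` is a made-up helper, and the species name `5.1210` in the usage line is inferred from the directory name above, not confirmed by the original post.

```shell
# Succeed if <config>/species/<name>/<name>_parameters.cfg exists.
# The config directory defaults to $AUGUSTUS_CONFIG_PATH.
augustus_species_installed() {
    local species=$1
    local config=${2:-${AUGUSTUS_CONFIG_PATH:-}}
    [ -n "$config" ] && [ -f "$config/species/$species/${species}_parameters.cfg" ]
}

# Usage (species name is an assumption based on the BRAKER2 directory name):
# augustus_species_installed 5.1210 || echo "species not found in AUGUSTUS config"
```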

Round5

Train GeneMark
Run MAKER with the GeneMark model produced by BRAKER2

# BRAKER2 result: /media/aa/DATA/SZQ2/bj/my/genome/5.1216/10.gene_prediction/braker/GeneMark-ET/gmhmm.mod
# Edit the maker_exe.ctl from the previous step as follows:
gmhmme3=/media/aa/DATA/SZQ2/gmes_linux_64_4/gmhmme3 #location of eukaryotic genemark
# Add gmhmm.mod to maker_opts.ctl
vi maker_opts.ctl
# Change the following
gmhmm=/media/aa/DATA/SZQ2/bj/my/genome/5.1216/10.gene_prediction/braker/GeneMark-ET/gmhmm.mod
# Final MAKER run (log written to maker.log5)
mpiexec -n 20 /media/aa/DATA/SZQ2/maker/bin/maker -fix_nucleotides -base rnd5 &> maker.log5
# Back up
cp maker_opts.ctl maker_opts.ctl_backup_rnd5

Final merge

This yields a GFF3 file without the genome sequence, rnd5.all.gff, plus a set of protein and transcript FASTA files.

gff3_merge -n -d rnd5.maker.output/rnd5_master_datastore_index.log
fasta_merge -d rnd5.maker.output/rnd5_master_datastore_index.log
grep -P "\tmaker\t" rnd5.all.gff > genome.maker.gff3
/media/aa/DATA/SZQ2/Zhanmengtao_bin-master/gff3_clear.pl --prefix maker genome.maker.gff3 > maker.gff3
cd ..
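As a last sanity check, the feature counts in the cleaned GFF3 give a quick picture of the final annotation. This is a minimal sketch; `gff_feature_counts` is a made-up name, and it simply counts column 3 of every nine-column, non-comment line.

```shell
# Count GFF3 features by type (column 3), skipping comment lines.
gff_feature_counts() {
    awk -F'\t' '!/^#/ && NF >= 9 { n[$3]++ } END { for (t in n) print t "\t" n[t] }' "$1" |
        LC_ALL=C sort
}

# Usage, from the directory that holds the final file:
# gff_feature_counts maker.gff3
```

With `alt_splice=0`, the gene and mRNA counts would be expected to match; a mismatch is a hint to look at the GFF more closely.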


Reposted from blog.csdn.net/weixin_58269397/article/details/130108234