Installing and using EukRep, a tool for identifying eukaryotic sequences in metagenomes

Introduction

EukRep classifies eukaryotic and prokaryotic sequences from metagenomic datasets. It works on assembled sequences (contigs or scaffolds) and predicts their origin from k-mer composition with a trained classifier, rather than from rRNA marker genes, so it helps researchers separate eukaryotic sequences from the prokaryotic ones in environmental samples.

Install

Installing with conda (Python 3) is recommended:

$ conda create -y -n eukrep-env -c bioconda scikit-learn==0.19.2 eukrep

Alternatively, install via pip (requires scikit-learn v0.19.2):

$ pip install EukRep

Example usage

Identify and output sequences predicted to be of eukaryotic origin from a FASTA file:

$ EukRep -i <sequences in FASTA format> -o <eukaryotic sequence output file>

Identify and output sequences of eukaryotic and prokaryotic origin at the same time from a FASTA file:

$ EukRep -i <sequences in FASTA format> -o <eukaryotic sequence output file> --prokarya <prokaryotic sequence output file>
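To sanity-check the split, it can help to count how many sequences ended up in each output file. The following is a minimal Bash sketch and not part of EukRep; count_fasta_records and summarize_split are hypothetical helper names introduced here for illustration.

```shell
# Count FASTA records by counting header lines (lines starting with '>').
count_fasta_records() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # so '|| true' keeps the function from reporting a failure.
    grep -c '^>' "$1" || true
}

# Print one tab-separated line per class: label, then record count.
summarize_split() {
    local euk="$1" prok="$2"
    printf 'eukaryotic\t%s\n' "$(count_fasta_records "$euk")"
    printf 'prokaryotic\t%s\n' "$(count_fasta_records "$prok")"
}
```

Called with the two output files from the command above, for example summarize_split euk_contigs.fa prok_contigs.fa (placeholder file names), it prints the number of sequences assigned to each class.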

Obtaining eukaryotic bins

EukRep is intended to be used as part of a larger analysis pipeline. For how to obtain high-quality gene predictions and bins from the identified eukaryotic contigs, see the Methods section of "Reconstructing eukaryotic genomes from complex natural microbial communities" (West et al., in review): Genome-reconstruction for eukaryotes from complex natural microbial communities | bioRxiv

Alternatively, check out the example workflow (work in progress): GitHub - patrickwest/EukRep_Pipeline

Adjusting the identification stringency

The -m parameter adjusts how stringently eukaryotic contigs are identified. The table below shows the false positive rate (FPR) and false negative rate (FNR) in strict, balanced and relaxed (lenient) modes. The default is balanced mode. Prior to version 0.6.5, the default was relaxed mode.

Sequence length    Strict mode    Balanced mode    Relaxed mode
20kb               FPR, FNR       FPR, FNR         FPR, FNR
5kb                FPR, FNR       FPR, FNR         FPR, FNR

NOTE: The above data were obtained by applying EukRep to 20kb and 5kb fragmented scaffolds from simulated novel phylum genomes.

Important Notes

In our experience, most metagenomic samples do not contain eukaryotic genomes. Because EukRep has a non-zero false positive rate, you may still get output for such samples.

Manual

All of the following steps are also collected in a sample Bash script named euk_pipeline.sh.

Requirements:

  1. Pre-assembled shotgun metagenomic sample with coverage information for each sequence
  2. EukRep
  3. CONCOCT or MetaBAT
  4. GeneMark-ES
  5. MAKER2
  6. BUSCO

Optional (but recommended):

  7. pyenv

Run EukRep

Classify the pre-assembled shotgun metagenome with EukRep:

 EukRep -i metagenome.fa -o euk_contigs.fa

If you have a very complex or fragmented metagenome, lowering the minimum contig size is recommended:

 EukRep -i metagenome.fa -o euk_contigs.fa --min 1000

Automatic binning

This step is essential for separating multiple eukaryotic genomes in a sample. Separating the genomes before gene prediction gives the highest-quality gene predictions. Coverage information for each sequence is required. To bin with CONCOCT:

concoct --coverage_file euk_contig_cov.txt --composition_file euk_contigs.fa 
mkdir clusters 
python /path/to/CONCOCT/scripts/extract_fasta_bins.py --output_path ./clusters/ euk_contigs.fa clustering_gt1000.csv 

To bin with MetaBAT:

metabat -a euk_contig_cov.txt -i euk_contigs.fa -o bin -t 6
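Both binners read a coverage table restricted to the eukaryotic contigs (euk_contig_cov.txt above). If coverage was computed for the whole metagenome, one way to derive that table is to keep only the rows whose contig ID appears in euk_contigs.fa. The sketch below is an illustration under stated assumptions: the coverage table is tab-separated, has a single header row, and carries the contig ID in its first column. The function name subset_coverage and the input name metagenome_cov.txt are hypothetical.

```shell
# Keep the header row plus the coverage rows whose first column matches
# a sequence ID found in the given FASTA file.
# Assumes a non-empty FASTA and a tab-separated table with one header row.
subset_coverage() {
    local fasta="$1" cov="$2"
    awk -F'\t' '
        NR == FNR {                        # first file: the FASTA
            if (/^>/) {
                id = substr($0, 2)         # drop the leading ">"
                sub(/[ \t].*/, "", id)     # keep the ID up to the first whitespace
                keep[id] = 1
            }
            next
        }
        FNR == 1 || ($1 in keep)           # second file: header and matching rows
    ' "$fasta" "$cov"
}
```

A call such as subset_coverage euk_contigs.fa metagenome_cov.txt > euk_contig_cov.txt would then produce the file used in the binning commands.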

Filtering by bin size

At this stage we find it useful to filter out any bins smaller than 2.5 Mbp. This filtering eliminates most false positives. It is especially helpful when CONCOCT is used, because CONCOCT assigns every sequence to a bin and therefore often produces many very small bins.
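The size filter can be scripted by summing the sequence length of each bin and keeping only the bins at or above the cutoff. This is a minimal sketch, not taken from the pipeline itself; fasta_total_bases and filter_bins_by_size are hypothetical helper names, and the location of the bin files depends on the binner used.

```shell
# Total number of bases in a FASTA file (header lines are ignored).
fasta_total_bases() {
    awk '
        !/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
        END   { printf "%.0f\n", total }
    ' "$1"
}

# Print the paths of the bins whose total size is at least min_bases.
# Usage: filter_bins_by_size MIN_BASES BIN.fa [BIN.fa ...]
filter_bins_by_size() {
    local min_bases="$1"; shift
    local bin
    for bin in "$@"; do
        if [ "$(fasta_total_bases "$bin")" -ge "$min_bases" ]; then
            printf '%s\n' "$bin"
        fi
    done
}
```

For the 2.5 Mbp cutoff described above, a call could look like filter_bins_by_size 2500000 clusters/*.fa, assuming the bins were written to ./clusters/ as FASTA files.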

Training GeneMark-ES

perl gmes_petap.pl --ES -min_contig 10000 --sequence bin_1.fa

The -min_contig option specifies the minimum length of the contigs used to train the bin's gene prediction model. You don't need to use every contig in the bin, but training may fail if too few contigs reach the threshold. Many bins from metagenomes are very fragmented, so this option may need to be lowered.
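Before training, it can be useful to check how many contigs in a bin actually reach the chosen minimum length. The sketch below counts them; count_contigs_at_least is a hypothetical helper name, and how many contigs are enough for training is not specified here.

```shell
# Count the sequences in a FASTA file whose length is at least min_len.
# Handles sequences wrapped over several lines.
count_contigs_at_least() {
    local fasta="$1" min_len="$2"
    awk -v min="$min_len" '
        /^>/ {                                  # a new record starts
            if (seen && len >= min) n++         # close the previous record
            len = 0; seen = 1
            next
        }
        { gsub(/[ \t\r]/, ""); len += length($0) }
        END {
            if (seen && len >= min) n++         # close the last record
            print n + 0
        }
    ' "$fasta"
}
```

For example, count_contigs_at_least bin_1.fa 10000 reports how many contigs of bin_1.fa would be available at the threshold used in the command above. If the number is very small, trying a lower threshold is one option.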

Predicting genes with the trained GeneMark-ES model and MAKER2

MAKER is configured through its control files. At a minimum, we recommend modifying them as follows so that genes are predicted with RepeatMasker and GeneMark-ES. In the 'maker_opts.ctl' file:

keep_preds=1 
gmhmm=/path/to/output/gmhmm.mod
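When many bins are processed, these two settings can be applied with a small script instead of by hand. The following is a sketch only: set_maker_opt is a hypothetical helper, and it assumes each option sits on its own line in the form key=value followed by an optional # comment, as in the control files generated by maker -CTL.

```shell
# Set key=value in a MAKER control file, keeping any trailing "#" comment.
# Lines that do not start with "key=" are copied unchanged.
set_maker_opt() {
    local ctl="$1" key="$2" value="$3" tmp
    tmp="$(mktemp)"
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 {
            comment = ""
            pos = index($0, "#")
            if (pos > 0) comment = " " substr($0, pos)
            print key "=" value comment
            next
        }
        { print }
    ' "$ctl" > "$tmp" && cat "$tmp" > "$ctl"    # overwrite in place, keep permissions
    rm -f "$tmp"
}
```

With this helper, the two edits above become set_maker_opt maker_opts.ctl keep_preds 1 and set_maker_opt maker_opts.ctl gmhmm /path/to/output/gmhmm.mod.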

Then, run MAKER with 6 cores using the following command:

maker -g bin_1.fa -c 6 
cd *.maker.output 
fasta_merge -d *_master_datastore_index.log -o bin_1 

To further improve gene prediction results, MAKER can integrate homologous proteins from related organisms, transcriptomic evidence, and other de novo gene predictors such as AUGUSTUS. To obtain high-quality gene predictions, it is usually best to use as many of these lines of evidence as possible.

For many metagenomic samples, performing de novo gene prediction may be the only available option.

Run BUSCO

python3 BUSCO.py -i *.maker.proteins.fasta -l eukaryota_odb9 -o bin_1 -m prot

BUSCO looks for single-copy orthologous genes (SCGs) in your bin, giving an estimate of completeness (and, from duplicated single-copy genes, a rough estimate of contamination). -l specifies the lineage set of SCGs to use. We typically use eukaryota_odb9 because it is the most general, but if you have a better idea of what type of organism your bin belongs to, you can use a more specific lineage set.
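BUSCO reports its result as a one-line summary of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:.., where C is the percentage of complete SCGs and D the percentage of duplicated ones. To compare many bins, these numbers can be pulled out of the summary line with a small helper. The sketch below is an illustration; busco_field is a hypothetical function name.

```shell
# Extract the percentage for one field (C, S, D, F or M) from a BUSCO
# one-line summary such as "C:..%[S:..%,D:..%],F:..%,M:..%,n:..".
# Prints the number without the percent sign, or nothing if not found.
busco_field() {
    local summary="$1" field="$2"
    printf '%s\n' "$summary" |
        sed -nE "s/(^|.*[^A-Za-z])${field}:([0-9.]+)%.*/\2/p"
}
```

Given a summary line stored in a shell variable line, busco_field "$line" C returns the completeness estimate and busco_field "$line" D the duplication percentage that serves as the rough contamination indicator.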


Origin blog.csdn.net/zrc_xiaoguo/article/details/135416410