Metagenomic sequence analysis tool EukRep

Article: Genome-reconstruction for eukaryotes from complex natural microbial communities | bioRxiv

Repository: patrickwest/EukRep: Classification of Eukaryotic and Prokaryotic sequences from metagenomic datasets (github.com)

It is recommended to use conda for installation:

conda create -y -n eukrep-env -c bioconda scikit-learn==0.19.2 eukrep

Or install via pip (scikit-learn version 0.19.2 needs to be pre-installed):

pip install EukRep

Example usage

  • Identify and output sequences predicted to be of eukaryotic origin from a fasta file:

    EukRep -i <Sequences in Fasta format> -o <Eukaryote sequence output file>

  • Simultaneously identify and export sequences of eukaryotic and prokaryotic origin separately from fasta files:

    EukRep -i <Sequences in Fasta format> -o <Eukaryote sequence output file> --prokarya <Prokaryote sequence output file>
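When several assemblies need to be classified, the two-output form above can be wrapped in a small loop. The following is a minimal sketch, not part of EukRep itself: the function name eukrep_batch, the .fa extension, and the output file names are assumptions made for illustration.

```shell
# Run EukRep on every *.fa file in a directory, writing eukaryotic and
# prokaryotic sequences to separate files named after each input.
# The directory layout and the output names are illustrative assumptions.
eukrep_batch() {
    local in_dir=$1 out_dir=$2 fasta base
    mkdir -p "$out_dir"
    for fasta in "$in_dir"/*.fa; do
        [ -e "$fasta" ] || continue   # skip the literal pattern if nothing matched
        base=$(basename "$fasta" .fa)
        EukRep -i "$fasta" \
               -o "$out_dir/${base}.euk.fa" \
               --prokarya "$out_dir/${base}.prok.fa"
    done
}

# Usage (hypothetical directories): eukrep_batch assemblies eukrep_out
```

Keeping both outputs side by side makes it easy to check later that every input contig ended up in one of the two files or was skipped for being too short.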

Obtaining eukaryotic bins

EukRep is designed to be used as part of a larger analysis pipeline. For high-quality gene predictions and binning of the identified eukaryotic contigs, following the method described in "Reconstructing eukaryotic genomes from complex natural microbial communities" (West et al., under review), either:

  • see the Methods section of the preprint: Genome-reconstruction for eukaryotes from complex natural microbial communities | bioRxiv
  • or check out the sample workflow provided (work in progress): https://github.com/patrickwest/EukRep_Pipeline

Adjusting the screening stringency

The -m parameter adjusts how stringently EukRep identifies eukaryotic contigs. Three modes are available: strict, balanced, and relaxed (lenient). The default is balanced mode; prior to version 0.6.5, the default was relaxed mode. False positive rates (FPR) and false negative rates (FNR) were measured for each mode.

The rates were measured by running EukRep on scaffolds from genomes chosen to simulate novel phyla, fragmented into 20 kb and 5 kb pieces.

[Tables: FPR and FNR per stringency mode for 20 kb and 5 kb fragments; the values are not reproduced here.]
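To see how the stringency setting affects a particular dataset, the same input can be classified once per mode and the outputs compared. This is a rough sketch under two assumptions: that the values accepted by -m are strict, balanced, and lenient (confirm with EukRep --help), and that counting FASTA headers is an adequate summary. The function name and the output file names are hypothetical.

```shell
# Run EukRep once per stringency mode and print, for each mode, how many
# sequences were classified as eukaryotic (tab-separated: mode, count).
# Mode names are assumed to be the values that -m accepts.
compare_stringency() {
    local fasta=$1 prefix=$2 mode n
    for mode in strict balanced lenient; do
        EukRep -i "$fasta" -o "${prefix}.${mode}.fa" -m "$mode" || return 1
        # grep -c prints 0 but exits non-zero when nothing matches
        n=$(grep -c '^>' "${prefix}.${mode}.fa" || true)
        printf '%s\t%s\n' "$mode" "$n"
    done
}

# Usage (hypothetical names): compare_stringency metagenome.fa euk_contigs
```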

Typical workflow (officially recommended)

patrickwest/EukRep_Pipeline (github.com)

EukRep_Pipeline

This is an example workflow that uses EukRep to bin eukaryotic genomes from metagenomes. A sample bash script, euk_pipeline.sh, is included and integrates all of the following steps.

Requirements:

  • Pre-assembled shotgun metagenomic sample, with coverage information for each scaffold
  • EukRep
  • A binning tool such as CONCOCT or metabat
  • GeneMark-ES
  • MAKER2
  • BUSCO
  • Optional but recommended: pyenv

Run EukRep

Classify the pre-assembled shotgun metagenomic sample with EukRep:

EukRep -i metagenome.fa -o euk_contigs.fa

If you are dealing with a highly complex or fragmented metagenome, it may help to lower the minimum contig length cutoff:


EukRep -i metagenome.fa -o euk_contigs.fa --min 1000
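Before changing the cutoff, it can be useful to know how much of the assembly each candidate value would include. The helper below is an illustrative sketch (the function name and the example cutoffs are assumptions); it only reads the FASTA file and does not call EukRep.

```shell
# For each length cutoff, print: cutoff, number of sequences at least that
# long, and the total bases they contain (tab-separated).
# Sequences may span several lines.
contigs_at_cutoffs() {
    local fasta=$1; shift
    awk -v cutoffs="$*" '
        function tally(   i) {
            if (!seen) return
            for (i = 1; i <= n; i++)
                if (len >= c[i]) { count[i]++; bases[i] += len }
        }
        BEGIN { n = split(cutoffs, c, " ") }
        /^>/  { tally(); seen = 1; len = 0; next }
              { len += length($0) }
        END   { tally()
                for (i = 1; i <= n; i++)
                    printf "%d\t%d\t%d\n", c[i], count[i], bases[i] }
    ' "$fasta"
}

# Usage: contigs_at_cutoffs metagenome.fa 1000 3000
```

If lowering the cutoff adds many sequences but few bases, the extra contigs are mostly short fragments, which are also the hardest to classify reliably.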

Automated binning

This step is critical for separating multiple eukaryotic genomes in a sample. To obtain the highest-quality gene predictions, the genomes must be separated before gene prediction. Coverage information for each scaffold is required.

With CONCOCT:

concoct --coverage_file euk_contig_cov.txt --composition_file euk_contigs.fa
mkdir clusters
python /path/to/CONCOCT/scripts/extract_fasta_bins.py --output_path ./clusters/ euk_contigs.fa clustering_gt1000.csv

With metabat:

metabat -a euk_contig_cov.txt -i euk_contigs.fa -o bin -t 6
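Both binning commands read a coverage file named euk_contig_cov.txt. If coverage was computed for the whole metagenome, it has to be reduced to the contigs that EukRep kept. The sketch below shows one way to do that; it assumes a tab-separated table with a single header line and the contig ID in the first column, and that those IDs match the FASTA headers up to the first whitespace. The function name and the input name metagenome_cov.txt are hypothetical.

```shell
# Print the header line of a tab-separated coverage table plus every row
# whose first column is a sequence ID present in the given FASTA file.
# Assumes one header line and contig IDs in column 1.
subset_coverage() {
    local fasta=$1 coverage=$2
    awk -F '\t' '
        NR == FNR {
            if (/^>/) {
                id = substr($0, 2)
                sub(/[ \t].*/, "", id)   # keep the ID up to the first whitespace
                keep[id] = 1
            }
            next
        }
        FNR == 1 || ($1 in keep)
    ' "$fasta" "$coverage"
}

# Usage: subset_coverage euk_contigs.fa metagenome_cov.txt > euk_contig_cov.txt
```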

Filtering by bin size

We find it useful at this stage to filter out bins smaller than 2.5 Mbp. This removes most false positives, especially with CONCOCT, which assigns every scaffold to a bin and so often produces many very small bins.

Training GeneMark-ES

perl gmes_petap.pl --ES -min_contig 10000 --sequence bin_1.fa

The -min_contig option specifies the minimum contig length used to train the gene prediction model for the given bin. Not every contig in the bin needs to be used for training, but training may fail if too few contigs exceed the threshold. Because bins from metagenomes are often highly fragmented, this option may need to be adjusted.
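Both the bin-size filter and the choice of -min_contig depend on simple length statistics, so it can help to tabulate them for every bin before training. The following is a sketch only: the function name is hypothetical, and the "keep"/"drop" label just applies the size threshold passed in (2.5 Mbp in the text above).

```shell
# For each bin FASTA, print (tab-separated): file name, total bases, bases
# in contigs at least min_contig long, and "keep" or "drop" depending on
# whether the total reaches min_bin.
bin_report() {
    local min_bin=$1 min_contig=$2; shift 2
    local f
    for f in "$@"; do
        awk -v file="$f" -v min_bin="$min_bin" -v min_contig="$min_contig" '
            function tally() {
                total += len
                if (len >= min_contig) trainable += len
            }
            /^>/ { tally(); len = 0; next }
                 { len += length($0) }
            END  { tally()
                   printf "%s\t%d\t%d\t%s\n", file, total, trainable,
                          (total >= min_bin ? "keep" : "drop") }
        ' "$f"
    done
}

# Usage: bin_report 2500000 10000 clusters/*.fa
```

A bin that passes the size filter but has very little sequence in contigs above min_contig is a candidate for a lower -min_contig value.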

Predicting genes with the trained GeneMark-ES model and MAKER2

MAKER is configured through its control files. At a minimum, it is recommended to modify them as follows for gene prediction with RepeatMasker and GeneMark-ES. In the 'maker_opts.ctl' file:

keep_preds=1
gmhmm=/path/to/output/gmhmm.mod
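When many bins are processed, editing the control file by hand for each one becomes tedious. The helper below is a minimal sketch of scripting the two changes above. It assumes maker_opts.ctl already exists (MAKER can generate its control files with maker -CTL) and that each option sits on its own line as key=value, optionally followed by a "#" comment. The function name is hypothetical.

```shell
# Set "key=value" in a MAKER control file, leaving any trailing "#comment"
# on that line untouched. Writes through a temporary file rather than
# "sed -i", whose syntax differs between GNU and BSD sed.
set_ctl_option() {
    local file=$1 key=$2 value=$3 tmp
    tmp=$(mktemp) || return 1
    awk -v key="$key" -v value="$value" '
        index($0, key "=") == 1 {
            rest = substr($0, length(key) + 2)
            comment = ""
            if (match(rest, /[ \t]*#.*/)) comment = substr(rest, RSTART)
            print key "=" value comment
            next
        }
        { print }
    ' "$file" > "$tmp" && cat "$tmp" > "$file"
    rm -f "$tmp"
}

# Usage:
#   set_ctl_option maker_opts.ctl keep_preds 1
#   set_ctl_option maker_opts.ctl gmhmm /path/to/output/gmhmm.mod
```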

Then run MAKER using 6 cores:

maker -g bin_1.fa -c 6
cd *.maker.output
fasta_merge -d *_master_datastore_index.log -o bin_1

To further improve gene prediction quality, MAKER can integrate homologous proteins from reference genomes of related species, transcriptome evidence, and other ab initio gene predictors such as AUGUSTUS. For high-quality gene predictions, it is usually best to use every available source of evidence.

For many metagenomic samples, however, ab initio gene prediction may be the only option.

Running BUSCO

python3 BUSCO.py -i *.maker.proteins.fasta -l eukaryota_odb9 -o bin_1 -m prot

BUSCO looks for single-copy orthologous genes (SCGs) in the bin, giving an estimate of completeness (and, from duplicated single-copy genes, a rough estimate of contamination). The -l parameter specifies the SCG lineage set to use. We usually use eukaryota_odb9 because it is the most general, but you may choose a more specific lineage set once you have a clearer idea of what kind of organism the bin belongs to.
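To compare many bins, the completeness and duplication percentages can be pulled out of each BUSCO short summary file. This sketch assumes the summary contains the usual one-line score notation of the form C:..%[S:..%,D:..%],F:..%,M:..%,n:... The function name is hypothetical, and the location and name of the summary file depend on the BUSCO version.

```shell
# Print the complete (C) and duplicated (D) BUSCO percentages found in a
# short summary file, separated by a space.
# Assumes the one-line notation "C:..%[S:..%,D:..%],F:..%,M:..%,n:...".
busco_scores() {
    sed -n 's/.*C:\([0-9.]*\)%\[S:[0-9.]*%,D:\([0-9.]*\)%\].*/\1 \2/p' "$1"
}

# Usage (the path is an assumption): busco_scores run_bin_1/short_summary_bin_1.txt
```

A high duplicated percentage is the signal mentioned above as a rough indicator of contamination, so bins with a large D value are worth inspecting before further analysis.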

Origin blog.csdn.net/zrc_xiaoguo/article/details/135389005