Introduction
EukRep classifies sequences in metagenomic datasets as eukaryotic or prokaryotic, helping researchers identify and recover the eukaryotic members of microbial communities in environmental samples. It works on assembled sequences (contigs and scaffolds) and classifies them by k-mer composition, not by rRNA marker genes.
Install
Installing with conda is recommended (Python 3):
$ conda create -y -n eukrep-env -c bioconda scikit-learn==0.19.2 eukrep
Alternatively, install via pip (requires scikit-learn 0.19.2):
$ pip install EukRep
Example usage
Identify and output sequences predicted to be of eukaryotic origin from a FASTA file:
$ EukRep -i <sequences in FASTA format> -o <eukaryotic sequence output file>
Identify and output sequences of eukaryotic and prokaryotic origin simultaneously from a FASTA file:
$ EukRep -i <sequences in FASTA format> -o <eukaryotic sequence output file> --prokarya <prokaryotic sequence output file>
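After either command, a quick sanity check is to count how many sequences ended up in each output file. The following is a minimal sketch, not part of EukRep; the helper name count_seqs is hypothetical, and it only assumes that each FASTA record starts with a '>' header line.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of EukRep): count FASTA records in a file
# by counting header lines that start with '>'.
count_seqs() {
    # grep -c exits non-zero when the count is 0; '|| true' keeps that from
    # aborting scripts that run under 'set -e'.
    grep -c '^>' "$1" || true
}
```

For example, count_seqs euk_contigs.fa prints the number of sequences written to a eukaryotic output file named euk_contigs.fa.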
Obtaining eukaryotic bins
EukRep is intended for use as part of a larger analysis pipeline. For high-quality gene prediction and binning of the identified eukaryotic contigs, see the Methods section of "Reconstructing eukaryotic genomes from complex natural microbial communities" (West et al., in review), available on bioRxiv as "Genome-reconstruction for eukaryotes from complex natural microbial communities",
or
check out the sample workflow provided on GitHub (work in progress): patrickwest/EukRep_Pipeline
Adjusting the identification stringency
The -m parameter adjusts the stringency with which eukaryotic contigs are identified. The table below shows the false positive rate (FPR) and false negative rate (FNR) in strict, balanced and relaxed modes. The default is balanced mode; prior to version 0.6.5, the default was relaxed mode.
Sequence length | Strict mode | Balanced mode | Relaxed mode |
---|---|---|---|
20kb | FPR, FNR | FPR, FNR | FPR, FNR |
5kb | FPR, FNR | FPR, FNR | FPR, FNR |
NOTE: The above data were obtained by applying EukRep to 20kb and 5kb fragmented scaffolds from simulated novel phylum genomes.
Important Notes
In our experience, most metagenomic samples do not contain eukaryotic genomes. Because EukRep has a nonzero false positive rate, it may still produce output for such samples.
Manual
A sample Bash script containing all of the following steps is provided as euk_pipeline.sh.
Requirements:
- Pre-assembled shotgun metagenome sample with coverage information for each sequence
- EukRep
- CONCOCT or MetaBAT
- GeneMark-ES
- MAKER2
- BUSCO
Optional (but recommended):
- pyenv
Run EukRep
Classify a pre-assembled shotgun metagenome sample with EukRep:
EukRep -i metagenome.fa -o euk_contigs.fa
If the metagenome sample is very complex or fragmented, lowering the minimum contig size is recommended:
EukRep -i metagenome.fa -o euk_contigs.fa --min 1000
Automatic binning
This step is essential for separating multiple eukaryotic genomes in a sample. Separating the genomes before gene prediction gives the highest-quality gene predictions. Coverage information for each sequence is required. Using CONCOCT:
concoct --coverage_file euk_contig_cov.txt --composition_file euk_contigs.fa
mkdir clusters
python /path/to/CONCOCT/scripts/extract_fasta_bins.py --output_path ./clusters/ euk_contigs.fa clustering_gt1000.csv
Using MetaBAT:
metabat -a euk_contig_cov.txt -i euk_contigs.fa -o bin -t 6
Filtering by bin size
At this stage we find it useful to filter out any bins smaller than 2.5 Mbp. This eliminates most false positives, especially when CONCOCT is used: CONCOCT assigns every sequence to a bin, which often produces many very small bins.
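The size filter can be sketched in Bash as below. This is an illustrative sketch rather than part of the EukRep pipeline: the function names fasta_bp and filter_bins are hypothetical, and it assumes that each bin is a FASTA file ending in .fa inside a single directory such as ./clusters/.

```shell
#!/usr/bin/env bash
# Hypothetical helpers, not part of EukRep, CONCOCT or MetaBAT.

# Print the total number of bases in a FASTA file (header lines excluded).
fasta_bp() {
    awk '!/^>/ { gsub(/[ \t\r]/, ""); total += length($0) }
         END   { print total + 0 }' "$1"
}

# Copy every bin with at least MIN_BP bases from IN_DIR to OUT_DIR.
# Usage: filter_bins IN_DIR OUT_DIR [MIN_BP]   (MIN_BP defaults to 2.5 Mbp)
filter_bins() {
    local in_dir=$1 out_dir=$2 min_bp=${3:-2500000}
    local bin
    mkdir -p "$out_dir"
    for bin in "$in_dir"/*.fa; do
        [ -e "$bin" ] || continue    # the glob matched no files
        if [ "$(fasta_bp "$bin")" -ge "$min_bp" ]; then
            cp "$bin" "$out_dir"/
        fi
    done
}
```

For example, filter_bins ./clusters ./clusters_filtered copies only the bins of 2.5 Mbp or more into a new directory and leaves the original bins untouched.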
Training GeneMark-ES
perl gmes_petap.pl --ES -min_contig 10000 --sequence bin_1.fa
The -min_contig option specifies the minimum length of the contigs used to train the bin's gene prediction model. Not every contig in the bin has to be used, but training may fail if too few contigs meet the threshold. Bins from metagenomes are often very fragmented, so this option may need to be adjusted.
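Before choosing a -min_contig value, it can help to check how much of a bin would pass a given threshold. The sketch below is an illustration only; the function name contigs_over is hypothetical and is not part of GeneMark-ES.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of GeneMark-ES: report how many contigs in a
# FASTA file are at least MIN_LEN bases long, and their combined length.
# Usage: contigs_over BIN_FASTA MIN_LEN
contigs_over() {
    awk -v min="$2" '
        # Add the contig just finished to the totals if it is long enough.
        function tally() { if (seen && len >= min) { n++; bp += len } }
        /^>/ { tally(); seen = 1; len = 0; next }
             { gsub(/[ \t\r]/, ""); len += length($0) }
        END  { tally(); printf "%d contigs, %d bp\n", n, bp }
    ' "$1"
}
```

Running contigs_over bin_1.fa 10000 and then again with a smaller value such as 5000 shows how much additional sequence a lower threshold would make available for training.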
Predicting genes with the trained GeneMark-ES model and MAKER2
MAKER requires control files. At a minimum, the following changes are recommended so that genes are predicted using RepeatMasker and GeneMark-ES. In the 'maker_opts.ctl' file:
keep_preds=1
gmhmm=/path/to/output/gmhmm.mod
Then, run MAKER with 6 cores using the following command:
maker -g bin_1.fa -c 6
cd *.maker.output
fasta_merge -d *_master_datastore_index.log -o bin_1
To further improve gene prediction results, MAKER can integrate homologous proteins from related organisms, transcriptomic evidence, and other de novo gene predictors such as AUGUSTUS. To obtain high-quality gene predictions, it is usually best to use as many of these lines of evidence as possible.
For many metagenomic samples, however, de novo gene prediction may be the only available option.
Run BUSCO
python3 BUSCO.py -i *.maker.proteins.fasta -l eukaryota_odb9 -o bin_1 -m prot
BUSCO searches your bin for single-copy orthologous genes (SCGs), giving an estimate of completeness and, from the number of duplicated single-copy genes, a rough estimate of contamination. The -l option specifies the lineage set of SCGs to use. We typically use eukaryota_odb9 because it is the most general, but if you have a better idea of what type of organism your bin belongs to, you can use a more specific lineage set.
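The completeness and duplication figures can be pulled out of BUSCO's short summary with a one-line parser. This is a sketch under two assumptions: that the summary contains a line in BUSCO notation such as C:89.0%[S:85.8%,D:3.2%],F:6.9%,M:4.1%,n:303 (the numbers here are made up for illustration), and that for BUSCO v3 the file is written to run_bin_1/short_summary_bin_1.txt. The function name busco_scores is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of BUSCO: extract the complete (C),
# single-copy (S) and duplicated (D) percentages from a BUSCO short summary.
# Assumes a line in BUSCO notation, e.g. C:..%[S:..%,D:..%],F:..%,M:..%,n:..
# Usage: busco_scores SHORT_SUMMARY_FILE
busco_scores() {
    sed -n 's/.*C:\([0-9.]*\)%\[S:\([0-9.]*\)%,D:\([0-9.]*\)%].*/complete=\1 single=\2 duplicated=\3/p' "$1"
}
```

A high duplicated value relative to complete is the signal described above: it suggests the bin may contain more than one genome.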