Introduction to iPHoP (Integrated Phage HOst Prediction), a bacteriophage host prediction tool that integrates multiple features, with a detailed usage walkthrough

Introduction

iPHoP (Integrated Phage HOst Prediction) is a bacteriophage host prediction method based on integrated features. It predicts the host range of bacteriophages by combining genome sequence, protein sequence and host genome information.

At a high level, the prediction process of iPHoP can be divided into three stages: feature extraction, feature selection and host prediction. In the feature extraction stage, iPHoP extracts a series of features from the phage genome and the host genomes, including genome features, protein features and host genome features.

In the feature selection stage, iPHoP uses machine learning algorithms to select the most predictive features from the extracted features. Commonly used feature selection algorithms include chi-square test, mutual information, and variance analysis.

In the host prediction stage, iPHoP uses the selected features to build a prediction model, which is then applied to unknown phages to determine their likely host range.

iPHoP has the following characteristics: it is an integrated prediction method that uses multiple features simultaneously; it is based on machine learning algorithms and can make predictions on different data sets; and it predicts the host range of bacteriophages while providing a reliability assessment for each prediction.

The prediction accuracy and reliability of iPHoP have been evaluated in the PLOS Biology article linked below, and the tool is used in studies of bacteriophage hosts.

Overview

iPHoP stands for integrated Phage Host Prediction. It is an automated command-line pipeline for predicting the host genus of novel bacteriophages and archaeoviruses based on their genome sequences.

The pipeline can be broken down into 6 main steps, described under "General usage process" below.

Repository: srouxjgi/iphop — Bitbucket

Article: iPHoP: An integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria | PLOS Biology

Download code base

git clone https://bitbucket.org/srouxjgi/iphop.git

General usage process

A: Step 1: Run the individual host prediction tools

Phage-based tool: RaFAH (https://doi.org/10.1016/j.patter.2021.100274). It generates host genus predictions and the corresponding scores, which are saved for step 6.

Host-based tools: five methods, including blast and CRISPR matches (see step 5).

B: Step 2: Collect all scores for each host-based tool and compute the distances between all hits
  • For two potential hosts (i.e., two hits for a given tool and query virus), the distance is calculated from the GTDB tree (https://doi.org/10.1093/nar/gkab776).

C: Steps 3 and 4: Compile an organized hit list for each virus-tool-candidate host combination
  • For each hit (the base host), the other top hits obtained for the same virus with the same tool are compiled and ranked by the distance between the base host and the host of each other hit (see step 2).
  • This series of hits is used as input to an automated classifier, which derives a score for the given virus-candidate host pair.
  • This allows the context of all top hits obtained for a virus to be taken into account when evaluating each potential host (each hit).

D: Step 5: Derive 3 scores from the host-based tools for each virus-candidate host combination
  • The top scores based solely on blast or CRISPR matches are retained as two separate scores, as these methods are reliable enough on their own for host prediction.
  • The third score is derived from the scores of all individual classifiers (see step 4), i.e. considering all 5 host-based methods simultaneously.

E: Step 6: Calculate a composite score for each virus-candidate host genus combination, integrating the host-based and phage-based signals
  • The 3 host-based scores (see step 5) are combined with the phage-based score (RaFAH, https://doi.org/10.1016/j.patter.2021.100274) to obtain a single composite score for each virus-candidate host genus pair.
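
To make the ranking idea in steps 2 to 4 more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the distance values, the host names and the helper names (host_distance, rank_other_hits) are hypothetical. iPHoP itself derives distances from the GTDB tree and passes the ranked series of hits to trained classifiers, and neither of those parts is reproduced here.

```python
from typing import Dict, List, Tuple

# Hypothetical pairwise distances between candidate hosts (toy values that
# stand in for the GTDB-tree distances computed in step 2).
TOY_DISTANCES: Dict[frozenset, float] = {
    frozenset({"g__Prevotella", "g__Bacteroides"}): 0.4,
    frozenset({"g__Prevotella", "g__Roseburia"}): 2.1,
    frozenset({"g__Bacteroides", "g__Roseburia"}): 2.0,
}


def host_distance(a: str, b: str) -> float:
    """Distance between two hosts; 0.0 when both hits point to the same host."""
    if a == b:
        return 0.0
    return TOY_DISTANCES[frozenset({a, b})]


def rank_other_hits(base: str, hits: List[Tuple[str, float]]) -> List[Tuple[str, float, float]]:
    """For one base hit, list the other hits of the same virus and tool,
    ordered by increasing distance to the base host.

    Each returned tuple is (host, tool_score, distance_to_base).
    """
    others = [(host, score, host_distance(base, host)) for host, score in hits if host != base]
    return sorted(others, key=lambda item: item[2])


# Toy hits for one virus and one tool: (host, tool score)
hits = [("g__Prevotella", 95.0), ("g__Roseburia", 60.0), ("g__Bacteroides", 88.0)]
ranked = rank_other_hits("g__Prevotella", hits)
for host, score, distance in ranked:
    print(host, score, distance)
```

In this toy case the closely related host is listed first, which is the kind of context a classifier could use to decide whether the base hit is supported by its neighbours.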

Conda installation

conda create -c conda-forge -n iphop_env python=3.8
conda activate iphop_env
mamba install -c conda-forge -c bioconda iphop

If you don't have mamba, you can replace "mamba" with "conda" in the last command. For setting up a basic conda environment, see the CSDN blog post:

Installation and configuration of Miniconda3 (Linux 64-bit) on CentOS 9 Stream, including offline installation

Database download

iphop download --db_dir path_to_iPHoP_db

# Verify the download

iphop download --db_dir path_to_iPHoP_db --full_verify

Manual download:

wget https://portal.nersc.gov/cfs/m342/iphop/db/iPHoP.latest_rw.tar.gz

tar -zxvf iPHoP.latest_rw.tar.gz

Getting started

Running a prediction is simple and straightforward:

iphop predict --fa_file my_input_phages.fasta --db_dir path/to/iphop_db/Sept_2021_pub/ --out_dir iphop_output/
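
If you want to launch the same run from a script, the command can be assembled in Python. This is a minimal sketch: the helper name build_predict_command is hypothetical, and it only uses options that appear in this post ("--fa_file", "--db_dir", "--out_dir" and the "-t" thread option used in the add_to_db example further down).

```python
import shlex
from typing import List, Optional


def build_predict_command(fa_file: str, db_dir: str, out_dir: str,
                          threads: Optional[int] = None) -> List[str]:
    """Assemble the argument list for "iphop predict" (hypothetical helper)."""
    cmd = ["iphop", "predict", "--fa_file", fa_file, "--db_dir", db_dir, "--out_dir", out_dir]
    if threads is not None:
        cmd += ["-t", str(threads)]
    return cmd


cmd = build_predict_command(
    "my_input_phages.fasta",
    "path/to/iphop_db/Sept_2021_pub/",
    "iphop_output/",
    threads=4,
)
print(shlex.join(cmd))  # shows the shell command that would be run

# To actually start the run (requires iPHoP to be installed and on PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when paths contain spaces.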

Main output files

Host_prediction_to_genus_mXX.csv, where XX is the minimum score cutoff selected (default: Host_prediction_to_genus_m90.csv)

This contains integrated results from host-based and phage-based tools at the host genus level:

| Virus | AAI to closest RaFAH reference | Host genus | Confidence score | List of methods |
|---|---|---|---|---|
| IMGVR_UViG_3300029435_000002 | 48.49 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella | 98.50 | RaFAH;91.30 iPHoP-RF;89.50 CRISPR;70.20 |
| IMGVR_UViG_3300029435_000003 | 53.00 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Agathobacter | 92.20 | blast;94.40 |
| IMGVR_UViG_3300029435_000003 | 53.00 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Bacteroides_F | 90.90 | CRISPR;93.30 iPHoP-RF;51.70 |
| IMGVR_UViG_3300029435_000005 | 42.95 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Oscillospirales;f__Ruminococcaceae;g__Gemmiger | 95.30 | blast;96.70 CRISPR;92.70 iPHoP-RF;82.50 |
| IMGVR_UViG_3300029435_000007 | 35.09 | d__Bacteria;p__Bacteroidota;c__Bacteroidia;o__Bacteroidales;f__Bacteroidaceae;g__Prevotella | 98.40 | CRISPR;98.80 iPHoP-RF;95.40 blast;93.60 |
| IMGVR_UViG_3300029435_000009 | 99.62 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Lachnospira | 99.00 | CRISPR;98.80 blast;92.60 iPHoP-RF;70.90 RaFAH;65.80 |
| IMGVR_UViG_3300029435_000009 | 99.62 | d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Roseburia | 95.70 | CRISPR;97.00 iPHoP-RF;56.80 |
| IMGVR_UViG_3300029435_000010 | 22.47 | d__Bacteria;p__Proteobacteria;c__Gammaproteobacteria;o__Burkholderiales;f__Burkholderiaceae;g__Sutterella | 97.60 | blast;98.30 CRISPR;80.00 iPHoP-RF;78.30 |

  • This output file lists for each prediction the virus sequence ID, the level of amino-acid similarity (AAI) between the query and the genomes in the RaFAH phage database, the predicted host genus, the confidence score calculated from all tools, and the list of scores for individual classifiers obtained for this virus-host pair.
  • In the detailed scores by classifier, "RaFAH" is the score derived from RaFAH (https://www.sciencedirect.com/science/article/pii/S2666389921001008), "iPHoP-RF" is the score derived from all host-based tools, "CRISPR" is the score derived only from CRISPR hits, and "blast" is the score derived only from blastn hits.
  • All virus-host pairs for which the confidence score is higher than the selected cutoff (default = 90) are included, so each virus may be associated with multiple predictions (e.g. IMGVR_UViG_3300029435_000003 and IMGVR_UViG_3300029435_000009).
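
Because one virus can have several predictions, it is often convenient to load this file into a script and group the rows per virus. The following Python sketch shows one way to do that. It assumes the file is comma-separated and that the columns are named "Virus", "Host genus", "Confidence score" and "List of methods"; these names are assumptions, so check the header of your own file. The helper names (parse_methods, load_predictions) are hypothetical, and the sample rows are two rows copied from the example table.

```python
import csv
import io
from collections import defaultdict

# Two rows from the example table, written as CSV (assumed layout and header).
SAMPLE = """Virus,AAI to closest RaFAH reference,Host genus,Confidence score,List of methods
IMGVR_UViG_3300029435_000003,53.00,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Agathobacter,92.20,blast;94.40
IMGVR_UViG_3300029435_000003,53.00,d__Bacteria;p__Firmicutes_A;c__Clostridia;o__Lachnospirales;f__Lachnospiraceae;g__Bacteroides_F,90.90,CRISPR;93.30 iPHoP-RF;51.70
"""


def parse_methods(field: str) -> dict:
    """Turn "CRISPR;93.30 iPHoP-RF;51.70" into {"CRISPR": 93.3, "iPHoP-RF": 51.7}."""
    scores = {}
    for item in field.split():
        name, value = item.split(";")
        scores[name] = float(value)
    return scores


def load_predictions(handle) -> dict:
    """Group predictions by virus ID, keeping genus, confidence and per-method scores."""
    by_virus = defaultdict(list)
    for row in csv.DictReader(handle):
        genus = row["Host genus"].split(";")[-1]  # last taxonomy rank, e.g. "g__Agathobacter"
        by_virus[row["Virus"]].append({
            "genus": genus,
            "confidence": float(row["Confidence score"]),
            "methods": parse_methods(row["List of methods"]),
        })
    return dict(by_virus)


predictions = load_predictions(io.StringIO(SAMPLE))

# Keep only the highest-confidence prediction for each virus.
best = {virus: max(preds, key=lambda p: p["confidence"]) for virus, preds in predictions.items()}
for virus, pred in best.items():
    print(virus, pred["genus"], pred["confidence"])
```

For a real run, replace io.StringIO(SAMPLE) with an open file handle on Host_prediction_to_genus_m90.csv. Keeping only the top prediction is a choice made for this example; depending on the analysis, all predictions above the cutoff may be worth keeping.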

Other notes:

Note: We recommend that all users who add custom MAGs first run iPHoP on the same virus sequences using the standard database. Careful screening of all MAGs (Metagenome-Assembled Genomes) to remove contamination is also highly recommended, as viral contigs wrongly binned into microbial MAGs may lead to incorrect host predictions with high confidence.

Note: For iPHoP versions earlier than 1.2.0, adding custom MAGs requires the output of GTDB-tk v1.5.0; the output of GTDB-tk v2 is not compatible with these versions. In version 1.2 and above, this problem should be fixed.

Users can add their own MAGs to the host database, e.g. MAGs obtained from the same data set or sampling site as the input phages. The "add_to_db" module in iPHoP can be used for this purpose; it takes a fasta file for each MAG and the results of GTDB-tk (the "de_novo_wf" workflow shown further down) applied to these same MAGs. A sample file set is available at https://bitbucket.org/srouxjgi/iphop/downloads/Data_test_add_to_db.tar.gz and is based on data from the study "Viral and metabolic controls on high rates of microbial sulfur and carbon cycling in wetland ecosystems" by Dalcin Martins et al.

The complete process to add MAGs to the host database is as follows:

Use wget to download the sample data package:

wget https://bitbucket.org/srouxjgi/iphop/downloads/Data_test_add_to_db.tar.gz

Unzip the downloaded data package:

tar -xvf Data_test_add_to_db.tar.gz

 View the contents of the unzipped directory:

ls Data_test_add_to_db

Among them, the "Expected_results/" folder contains the expected results files of iPHoP when using the Sept_2021_pub database or a new database containing additional MAGs. "Input_viral_contigs.fasta" is the input file. The "Wetland_MAGs/" folder contains fasta files for all MAGs. The "Wetland_MAGs_GTDB-tk_results/" folder contains the gtdb-tk results files that iPHoP will use.
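
Before building a new database, it can help to confirm that the extracted package contains the entries described above. The following Python sketch does this check. The helper names (missing_entries, count_fasta_files) are hypothetical, and the "fa" extension is an assumption taken from the "--extension fa" option in the GTDB-tk commands of this post.

```python
from pathlib import Path
from typing import List, Sequence

# Entries named in the description of the sample package.
EXPECTED_ENTRIES = (
    "Expected_results",
    "Input_viral_contigs.fasta",
    "Wetland_MAGs",
    "Wetland_MAGs_GTDB-tk_results",
)


def missing_entries(root: str, expected: Sequence[str] = EXPECTED_ENTRIES) -> List[str]:
    """Return the expected files or folders that are absent under root."""
    base = Path(root)
    return [name for name in expected if not (base / name).exists()]


def count_fasta_files(mag_dir: str, extension: str = "fa") -> int:
    """Count MAG fasta files with the given extension (assumed to be "fa")."""
    return len(list(Path(mag_dir).glob(f"*.{extension}")))


missing = missing_entries("Data_test_add_to_db")
if missing:
    print("Missing entries:", ", ".join(missing))
else:
    print("MAG fasta files found:", count_fasta_files("Data_test_add_to_db/Wetland_MAGs"))
```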

Commands for generating the GTDB-tk result files (needed for your own MAGs; the sample package already contains these results):

gtdbtk de_novo_wf --genome_dir Wetland_MAGs/ --bacteria --outgroup_taxon p__Patescibacteria --out_dir Wetland_MAGs_GTDB-tk_results/ --cpus 32 --force --extension fa
gtdbtk de_novo_wf --genome_dir Wetland_MAGs/ --archaea --outgroup_taxon p__Altarchaeota --out_dir Wetland_MAGs_GTDB-tk_results/ --cpus 32 --force --extension fa

Create a new iPHoP database that will include GTDB genomes and additional user-provided MAGs, but not GEM or IMG genomes

cd Data_test_add_to_db
iphop add_to_db --fna_dir Wetland_MAGs/ --gtdb_dir Wetland_MAGs_GTDB-tk_results/ --out_dir Sept_2021_pub_rw_w_Wetland_hosts --db_dir /path/to/iphop_db/Sept_2021_pub_rw/

Note: To avoid copying a large number of files, the new database relies partly on symbolic links to the original database. This means that if the original database (here "iphop_db/Sept_2021_pub_rw/") is modified or deleted, the new database will no longer work properly. It also means that the full path to the original database should be given as the "db_dir" parameter.
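
Since the new database depends on symbolic links, a quick way to detect a moved or deleted original database is to look for links whose targets no longer exist. The Python sketch below does that with the standard library only. The helper name find_broken_symlinks is hypothetical, and the internal layout of the iPHoP database is not assumed; the function simply walks the whole directory.

```python
import os
from typing import List


def find_broken_symlinks(db_dir: str) -> List[str]:
    """Return paths of symbolic links under db_dir whose targets no longer exist."""
    broken = []
    # os.walk does not follow symbolic links to directories by default,
    # so linked folders are inspected but not descended into.
    for dirpath, dirnames, filenames in os.walk(db_dir):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # islink() tests the link itself; exists() follows it to the target
            if os.path.islink(path) and not os.path.exists(path):
                broken.append(path)
    return sorted(broken)


broken = find_broken_symlinks("Sept_2021_pub_rw_w_Wetland_hosts")
if broken:
    print("Broken links (was the original database moved or deleted?):")
    for path in broken:
        print("  " + path)
else:
    print("No broken symbolic links found")
```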

Then, you can use the "Sept_2021_pub_rw_w_Wetland_hosts" folder as the iPHoP database for host prediction, for example:

iphop predict --fa_file Input_viral_contigs.fasta --db_dir Sept_2021_pub_rw_w_Wetland_hosts/ --out_dir test_add_db -t 4

Citation information

@article{roux_iphop_2023,
abstract = {The extraordinary diversity of viruses infecting bacteria and archaea is now primarily studied through metagenomics. While metagenomes enable high-throughput exploration of the viral sequence space, metagenome-derived sequences lack key information compared to isolated viruses, in particular host association. Different computational approaches are available to predict the host(s) of uncultivated viruses based on their genome sequences, but thus far individual approaches are limited either in precision or in recall, i.e., for a number of viruses they yield erroneous predictions or no prediction at all. Here, we describe iPHoP, a two-step framework that integrates multiple methods to reliably predict host taxonomy at the genus rank for a broad range of viruses infecting bacteria and archaea, while retaining a low false discovery rate. Based on a large dataset of metagenome-derived virus genomes from the IMG/VR database, we illustrate how iPHoP can provide extensive host prediction and guide further characterization of uncultivated viruses.},
author = {Roux, Simon and Camargo, Antonio Pedro and Coutinho, Felipe H. and Dabdoub, Shareef M. and Dutilh, Bas E. and Nayfach, Stephen and Tritt, Andrew},
doi = {10.1371/journal.pbio.3002083},
issn = {1545-7885},
journal = {PLOS Biology},
number = {4},
title = {{iPHoP}: {An} integrated machine learning framework to maximize host prediction for metagenome-derived viruses of archaea and bacteria},
volume = {21},
year = {2023},
}

Origin blog.csdn.net/zrc_xiaoguo/article/details/135386203