iMeta: Su Xiaoquan's group at Qingdao University develops PMS, a cross-platform interactive microbiome analysis suite (full-text translation, PPT, video)

Parallel-Meta Suite: A cross-platform interactive microbiome rapid analysis suite

Parallel-Meta Suite: Interactive and rapid microbiome data analysis on multiple platforms

DOI: https://doi.org/10.1002/imt2.1

Date of publication: March 6, 2022

First authors: Yuzhu Chen (Chen Yuzhu) 1, Jian Li (Li Jian) 1

Corresponding author: Xiaoquan Su (Su Xiaoquan) ([email protected]) 1,2

Co-authors: Yufeng Zhang (Zhang Yufeng), Mingqian Zhang (Zhang Mingqian), Zheng Sun (Sun Zheng), Gongchao Jing (Jing Gongchao), Shi Huang (Huang Shi)

Affiliations:

1 College of Computer Science and Technology, Qingdao University, Qingdao, Shandong, China

2 Single-Cell Center, Qingdao Institute of Bioenergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong, China

Graphical abstract

  • Parallel-Meta Suite (PMS) is an easy-to-use software package for fast, comprehensive microbiome data analysis on multiple platforms

  • PMS covers a wide range of data preprocessing and statistical methods and provides state-of-the-art visualizations

  • The entire PMS pipeline is optimized by a parallel computing scheme and can quickly process thousands of microbiome samples

Video interpretation

Bilibili: https://www.bilibili.com/video/BV12F411s7uM/

YouTube: https://youtu.be/bSdrUSpzNDg

For Chinese translation, PPT, Chinese/English video interpretation and other extended data downloads, please visit the journal's official website: http://www.imeta.science/

Summary

Rising sequencing throughput and falling sequencing costs have greatly accelerated microbiome research, producing a vast body of sequencing data that encodes the relationships between microorganisms and the phenotypes of their environments (such as host health or ecosystem status). Deciphering the biological information underlying microbiome data requires reliable, well-designed software tools. However, most current software has usability flaws that create a serious barrier for users without a computing background. At the same time, computational throughput has become an important bottleneck for many analysis platforms when processing large-scale datasets. This study developed Parallel-Meta Suite (PMS), an interactive software suite for rapid and comprehensive microbiome data analysis, visualization and annotation. PMS adopts state-of-the-art algorithms covering taxonomic and functional profiling of microbiome sequences, statistical analysis and visualization, and provides a friendly graphical interface that meets the analysis needs of a wide range of users. To keep up with rapidly growing demands on computing power, the entire PMS analysis process is optimized with a parallel computing strategy and can quickly process tens of thousands of samples. In addition, PMS is compatible with multiple operating systems, is easy to install, and runs fully automatically.

Introduction

To decipher the biological patterns hidden in microbiome big data, good bioinformatics tools are essential. These patterns let us explain the associations between microbial communities and their surroundings, such as environmental conditions or human health. Over the past decade, the capabilities of bioinformatics tools in the microbiome field have grown from basic taxonomic annotation to downstream diversity analysis and biomarker selection, greatly extending the usefulness of microbiome data mining. However, the complex command-line operation of highly integrated toolkits such as QIIME or Parallel-Meta makes it hard for users without a computing background to operate them or even get started. Meanwhile, sequencing costs have fallen sharply in the past few years. This has enabled very large-scale surveys of microbial communities across different environments, such as the Earth Microbiome Project and the American Gut Project, and has also raised the requirements for computational throughput and efficiency in data processing.

Developers have responded to this situation in several ways. One is to provide a graphical user interface (GUI) to improve usability, as q2studio does. However, such a GUI requires a specific operating system environment and many dependent libraries for installation and operation, which makes it unusable in some cases (such as remote-login servers and big data processing). Another solution is an online web service with a graphical interface, such as Galaxy or gcMeta. However, unavoidable network latency and the sharing of online computing resources limit the amount of data a user can analyze, especially for large volumes of microbiome sequencing data. In addition, data privacy and security become a concern when unpublished private datasets are analyzed on open online platforms.

To address these challenges, we developed Parallel-Meta Suite (PMS), a software suite for rapid and comprehensive microbiome analysis. PMS builds on mature marker gene analysis protocols and workflows, which have been redesigned and significantly improved. Its features include (but are not limited to) a friendly graphical interface, improved usability for a wide range of users on multiple platforms, and analysis performance optimized by a fully parallel computing scheme. In addition, to resolve the installation difficulties common to many bioinformatics tools, such as package dependencies, system settings and source code compilation, PMS integrates an automatic installer that helps users install and configure the software easily. The latest version of PMS has been released on GitHub (https://github.com/qdu-bioinfo/parallel-meta-suite) and Gitee (https://gitee.com/qdu-bioinfo/parallel-meta-suite). A demo dataset for testing is also provided in the package.

Methods

Figure 1 illustrates the analysis workflow of PMS. PMS accepts metagenomic shotgun sequences or amplicon sequences as raw input. For shotgun sequences, marker gene fragments (such as 16S rRNA or 18S rRNA genes) are identified and extracted using hidden Markov models. For amplicon sequences, PMS performs ASV denoising and chimera removal on the marker genes to reduce the interference of sequencing errors (this step is disabled by default for shotgun sequences, but can be enabled by the user). Sequences are then aligned against reference databases with the built-in VSEARCH for profiling and taxonomic annotation from kingdom to species level. The relative abundance of community members at each taxonomic level is also corrected for marker gene copy number. Afterwards, functional KEGG Orthology (KO) gene families are predicted using the PICRUSt2 algorithm, and metabolic pathways are annotated by the KEGG BRITE hierarchy. PMS also measures the accuracy of the functional prediction by the NSTI (Nearest Sequenced Taxon Index) value, calculated as the sum of the distances between OTUs and their nearest sequenced relatives in the phylogeny.
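The copy-number correction mentioned above can be illustrated with a minimal Python sketch. The taxon names, read counts and 16S copy numbers are made-up values, and `correct_copy_number` is a hypothetical helper rather than a PMS function. The idea is simply to divide each taxon's read count by its marker gene copy number and renormalize to relative abundance.

```python
def correct_copy_number(counts, copy_numbers):
    """Divide each taxon's read count by its 16S copy number, then renormalize.

    Both arguments are dicts keyed by taxon name (hypothetical inputs).
    """
    adjusted = {taxon: counts[taxon] / copy_numbers[taxon] for taxon in counts}
    total = sum(adjusted.values())
    return {taxon: value / total for taxon, value in adjusted.items()}


# Illustrative values only (not taken from the paper).
counts = {"taxon_A": 600, "taxon_B": 300, "taxon_C": 100}
copy_numbers = {"taxon_A": 6, "taxon_B": 3, "taxon_C": 1}

abundance = correct_copy_number(counts, copy_numbers)
for taxon, value in abundance.items():
    print(f"{taxon}\t{value:.3f}")
```

In this toy case the raw counts suggest that taxon_A dominates, but after correction all three taxa have equal relative abundance, because taxon_A carries six copies of the marker gene per genome.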

Figure 1 The entire analysis process and visualization workflow of PMS

Taxonomic information of the microbiome is visualized by Krona and bar charts. Microbial diversity analysis, biomarker selection, and co-occurrence network construction are then performed at the taxonomic or pathway level selected by the user. Alpha diversity analysis calculates the Shannon, Simpson and Chao1 indices for each sample. For discrete metadata (such as type, status or gender), Wilcoxon or Kruskal-Wallis rank-sum tests are performed on the alpha diversity indices; for continuous variables (such as age, BMI or pH), regression analysis is performed. For beta diversity, the distance matrix between all samples is calculated by the weighted/unweighted Meta-Storms algorithm (for taxonomy) or Hierarchical Meta-Storms (for function) and visualized as a heatmap. Beta diversity patterns are then displayed by PCoA (Principal Coordinate Analysis) and PCA (Principal Component Analysis) plots, PERMANOVA and ANOSIM tests are performed on discrete metadata, and regression analysis is performed between continuous variables and distance values. In biomarker analysis, PMS uses the Wilcoxon or Kruskal-Wallis rank-sum test to select microbes or genes that differ significantly between groups (discrete variables) as candidate markers, and then ranks them by random forest importance. Microbiome features closely related to continuous variables are also selected as biomarkers by regression analysis. In co-occurrence networks, nodes are community features (e.g., a microbial taxon) and edges represent Spearman correlations between nodes. Network properties are then quantified by computing network density, diameter, radius, and centralization.
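As a rough illustration of the three alpha diversity indices named above, the following Python sketch computes them from a vector of per-taxon read counts. It rests on assumptions: the source does not say which log base or which Simpson variant PMS uses, so the sketch uses the natural log, the 1 - sum(p^2) form of Simpson, and the bias-corrected form of Chao1. The function names are illustrative only.

```python
import math


def shannon(counts):
    """Shannon index H = -sum(p_i * ln p_i), natural log (an assumption)."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def simpson(counts):
    """Simpson index in the 1 - sum(p_i^2) form (an assumption)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


# Toy sample: four taxa with equal read counts.
sample = [10, 10, 10, 10]
print(shannon(sample), simpson(sample), chao1(sample))
```

For a perfectly even community of four taxa, the Shannon index equals ln(4), the Simpson index equals 0.75, and Chao1 equals the observed richness of 4 because there are no singletons.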

Results

Key Features of PMS

PMS provides a set of user-friendly GUIs for parameter configuration and detailed presentation of analysis results (Figure 2). Users can get started with PMS easily through a set of sample data and the parameter configuration page. The graphical display of parameters on this page also shortens the learning curve for advanced users. The GUI uses web pages as the carrier of its visualization functions, so it is compatible with different usage scenarios (such as a local system or a remote-login server) and various operating systems (such as Linux, Mac or Windows 10). As a highly integrated, fully automated analysis tool, PMS uses a variety of state-of-the-art algorithms and analysis strategies, including advanced sequence processing (such as metagenomic marker gene extraction and ASV (amplicon sequence variant) denoising), 16S-based functional profile prediction, alpha and beta diversity calculations, multivariate statistical analysis, biomarker selection and assessment, and co-occurrence network analysis. The reference sequences of marker genes have also been updated and expanded with the GreenGenes, SILVA, Oral-core, SILVA-18S and ITS databases to include full-length 16S rRNA, 18S rRNA and ITS sequences. Finally, through global parallel processing and performance tuning, PMS can complete the full analysis of 14,000 samples in just over 40 hours on one computing node.

Figure 2 GUI visualization interface of PMS

(A) Interactive configuration wizard;

(B) Results navigation page;

(C) Alpha diversity calculations and their relationship to key phenotypes;

(D) Relative abundance table of samples;

(E) β diversity based on PCoA (Principal Coordinate Analysis);

(F) Co-occurrence network analysis;

(G) Biomarker selection by the importance scores of a random forest, a supervised machine learning algorithm.

Implementation of Parallel Computing and GUI

The GUI of PMS consists of two parts: the interactive "configuration wizard" for analysis parameters (Fig. 2A) and the visual "results navigation page" (Fig. 2B). The configuration wizard is built into the software suite. All parameters are categorized and organized according to the analysis process and presented in an easy-to-understand format. Initially, all parameters are set to default values, so users only need to fill in the necessary basic parameters (such as input/output type and path) to start an analysis. The configuration wizard also provides advanced tuning options to further customize the profiling, diversity analysis, and statistics steps. Finally, the configuration wizard generates the corresponding executable command from the user's settings. After the entire analysis process is complete, a results navigation page is automatically created in the output directory. This page categorizes all analysis results and visualizes each of them with well-designed layouts and color-coded plots (Figure 2C-G), providing direct and clear interpretations of microbiome patterns. The GUI is very helpful for non-expert users who are not familiar with command-line interfaces or complex parameters, and gives them a clearer understanding of the workflow and results of the analysis. Additionally, the PMS GUI is highly compatible with multiple operating systems including Linux, Mac, and Windows, because the configuration wizard and results navigation page can be opened in any web browser.

The overall computing framework of PMS is mainly developed in C++, which runs faster and uses memory more efficiently than scripting languages. Using parameters parsed from the GUI configuration wizard, the framework invokes and manages the analysis steps in the workflow. We optimized the parallel computing scheme for the entire analysis process in two ways. (1) The calculation steps for taxonomic identification, abundance estimation, function prediction and distance matrices are written in C/C++ and parallelized directly with the OpenMP library. (2) The statistical steps for alpha and beta diversity, statistical testing, biomarker selection and plotting are written in R (https://www.R-project.org). The PMS framework assigns each R script to an independent thread, and all threads can be started at the same time for parallel computing. To make full use of hardware computing resources, the number of threads defaults to the number of CPU cores and is adjusted dynamically. It can also be set manually by the user.
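The second strategy, independent statistical tasks each running on its own thread with a default pool size equal to the CPU core count, can be sketched as follows. This is only an analogy in Python: PMS itself is written in C++ and launches R scripts, whereas `run_step` here is a hypothetical stand-in that returns a status string, and the step names are illustrative.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_step(name):
    """Hypothetical stand-in for launching one statistical script."""
    return f"{name}: done"


def run_all(steps, threads=None):
    # Default to the number of CPU cores unless the user sets a thread count.
    workers = threads or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in input order even though tasks run concurrently.
        return list(pool.map(run_step, steps))


# Illustrative step names, not the actual PMS script names.
steps = ["alpha_diversity", "beta_diversity", "biomarker_selection", "plotting"]
results = run_all(steps, threads=2)
print(results)
```

Because the steps do not depend on each other's output, they can be dispatched together and collected once all have finished. A real launcher would replace `run_step` with a subprocess call.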

Use in different scenarios

In this subsection, we present the usage and experience of PMS in three typical scenarios ( Figure 3 ) under different computing platforms and environments.

Figure 3 Three typical usages of PMS in different scenarios and platforms

(A) Configure parameters with the GUI on the local machine and run the analysis on the local machine;

(B) Configure parameters with the GUI on the local machine and run the analysis on a remote server;

(C) Configure parameters on the command line (on a local machine or a remote server).

Scenario 1: Configure parameters with the GUI locally and run the analysis locally

PMS can be installed and executed on a "local" PC (e.g., a laptop) to process a small number of samples (say, fewer than 200). The local GUI-based usage (Figure 3A) is available on Linux (with a desktop environment installed), Mac, or Windows 10+ (which requires the Windows Subsystem for Linux (WSL)). The configuration wizard is opened through the "index.html" page in the PMS-config folder. Users can use the default options directly or adjust the parameters as needed. Once configuration is done, clicking the "Generate" and "Copy" buttons at the bottom of the page generates a valid command and copies it to the clipboard. Pasting this one-line command into the local terminal runs the PMS analysis process with no additional action required. In the output directory, the visual results navigation page is also named "index.html". All raw results (such as relative abundance tables and distance matrices) are retained for further in-depth data mining or meta-analysis. In addition, analysis summaries, work logs and detailed step-by-step workflow scripts are provided in the results folder.

Scenario 2: Configure parameters with the GUI locally and run the analysis on a remote server

Processing a large number of samples (say, more than 1,000) requires more time and computing resources, so we recommend running the PMS analysis pipeline on a more powerful server. Such servers often require remote login (e.g., via SSH) and provide only a command-line terminal for operating software. In this case (Figure 3B), the user should install PMS on the server, then download the PMS-config folder from the package to the local computer and open the "index.html" file in it with a browser. The commands generated by this local configuration wizard are then run in the remote server's terminal. The results can be transferred to the local computer for viewing as in Scenario 1. The entire analysis process can therefore be configured and executed easily without extensive data transfer.

Scenario 3: Parameter configuration using the command line

PMS also supports command-line operation. This approach is usually aimed at experienced users working without a GUI (Figure 3C). To meet the growing number of user-specific requirements in microbiome analysis, the entire analysis workflow can run in highly flexible settings, for example running each step with customized parameters or executing only selected steps. This can be done in a local or remote command-line terminal. The command-line interface also provides tutorials with detailed usage, as well as brief help information for each step of the analysis process.

Case studies and results

We employ two example datasets to demonstrate the power of PMS in decoding the microbiome. Both datasets were collected from previously published studies in order to verify the accuracy and reliability of the PMS analysis results.

Case 1: Changes in indoor microbiome before and after hospital opening

Dataset 1 contains 894 16S amplicon microbiome samples from indoor environments before and after a hospital opened (Table 1). We ran the PMS analysis pipeline with all default parameters. The results show that the Shannon index of alpha diversity decreased after the hospital opened (Fig. 4A; Wilcoxon test p-value < 0.01) and that the beta diversity of the overall community shifted significantly (Fig. 4B; weighted Meta-Storms distance, PERMANOVA test p-value < 0.01), both consistent with the findings of Lax et al. Predicted functional diversity followed a similar trend to taxonomy. The microbial dynamics between the two time points are also illustrated by changes in relative abundance (Fig. 4C). Using statistical tests and machine learning, PMS also identified the most important bacteria, such as Staphylococcus, Heila reinhardtii, and Modest, that help distinguish the pre-opening from the post-opening state of hospital surfaces. At the genus level, this machine learning model achieves 95.91% accuracy (error rate = 4.09%) in distinguishing the state of the indoor samples (Fig. 4D).

Table 1 Details of the test dataset

Figure 4 Changes of indoor microbiome before and after hospital opening

(A) After the hospital opened, the Shannon index of alpha diversity decreased, and the Wilcoxon test P value was <0.01 (P value <0.05 indicates a significant difference);

(B) According to the weighted Meta-Storms distance, the overall beta diversity differed significantly between the pre-opening and post-opening states, with a PERMANOVA test P value < 0.01 (P value < 0.05 indicates a significant difference);

(C) Dynamic changes in relative abundance of genus levels between two time points;

(D) Five bacterial genera were selected as biomarkers that distinguish between the two time points. The X-axis is the importance score (mean decrease in accuracy) produced by a random forest model, which assesses the importance of each biomarker in distinguishing between the hospital states.

Case 2: Meta-analysis of the microbiome from multiple habitats

Dataset 2 includes 2,556 host-associated microbiomes (Table 1) taken from different host species and studies, on which we performed a meta-analysis to systematically study the distribution of microbes across environmental habitats. Because the 16S rRNA gene amplicon sequences were generated on different platforms (i.e., Illumina and Roche 454) and were not suitable for ASV denoising, ASV denoising was disabled and the other options were left as default. The results in Figures 5A and 5B show that PMS revealed different alpha and beta diversity of the microbiome between host sources or habitat types. This is mainly because the abundant taxa of mammalian gut and plant roots rarely overlap, whereas fish gut and plant root communities share common microbial members, for example the dominant Proteobacteria, Cyanobacteria, and Actinobacteria (Fig. 5C), which agrees well with the previous study by Hacquard et al., Cell Host & Microbe 2015. Interestingly, functional alpha and beta diversity yielded results similar to those of taxonomy. However, some metabolic pathways at KEGG BRITE level 2 were more consistent across all samples, such as protein families: genetic information processing, protein families: signaling and cellular processes, carbohydrate metabolism, amino acid metabolism and energy metabolism.

Figure 5 Meta-analysis of the microbiome from multiple habitats

(A) The Shannon index of alpha diversity differed significantly between host types. Kruskal-Wallis test P value < 0.01 (P value < 0.05 indicates a significant difference);

(B) Samples cluster by habitat in the PCoA pattern based on weighted Meta-Storms distance. PERMANOVA test P value < 0.01 (P value < 0.05 indicates a significant difference);

(C) Abundant community membership varies across habitat types.

Parallel computing and running speed

We further evaluated the performance of PMS in terms of parallel computing speed and efficiency using three datasets (Table 1). For Dataset 1 and Dataset 2, we set different numbers of CPU threads (1, 10, 20, 40 and 80), repeated the whole workflow, and compared the running times to test the efficiency of parallel computing. Datasets 1 and 3 were sequenced on the Illumina platform and were suitable for ASV-based analysis. Dataset 2 contains sequences from Illumina and Roche 454, so ASV denoising was turned off. Other parameters were left at their default configuration. All speed tests were performed on a single-node rack server supporting 80 threads (40 physical CPU cores).

Through dynamic thread scheduling and load-balancing optimization for parallel computing, PMS is able to process tens of thousands of microbiome samples. For example, the entire workflow for Dataset 2 (more than 2,500 samples) completes in 392 minutes, and even Dataset 3 (14,000 samples) completes in 43 hours. From the results in Figure 6, we observe that the running time decreases linearly with the number of threads, indicating that the parallelization and subtask scheduling strategies are computationally efficient. Furthermore, the speedup is independent of the source or sequence type of the input samples. This acceleration shows that PMS can perform taxonomic and functional analysis of input samples quickly, which is crucial for deep data mining of more than 10,000 samples from different technical backgrounds.
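The reported running times can be turned into a rough per-sample cost with a few lines of Python. The sample counts and running times below come from the text (2,556 samples in 392 minutes, and 14,000 samples in 43 hours). The `speedup` and `efficiency` helpers are the standard definitions for evaluating a thread-scaling test, and the timings passed to them are hypothetical placeholders, not measurements from the paper.

```python
def seconds_per_sample(samples, minutes):
    """Average wall-clock seconds spent per sample."""
    return minutes * 60 / samples


def speedup(t_single, t_parallel):
    """Speedup = single-thread time / multi-thread time."""
    return t_single / t_parallel


def efficiency(t_single, t_parallel, threads):
    """Parallel efficiency = speedup / number of threads (1.0 is ideal)."""
    return speedup(t_single, t_parallel) / threads


dataset2 = seconds_per_sample(2556, 392)       # about 9.2 s per sample
dataset3 = seconds_per_sample(14000, 43 * 60)  # about 11.1 s per sample
print(f"Dataset 2: {dataset2:.1f} s/sample, Dataset 3: {dataset3:.1f} s/sample")

# Hypothetical timings, for illustration only: 800 min on 1 thread, 100 min on 10.
print(speedup(800, 100), efficiency(800, 100, 10))
```

The per-sample cost stays within the same order of magnitude when the dataset grows from about 2,500 to 14,000 samples.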

Figure 6. Running time consumption of Parallel-Meta Suite on different scale datasets

Discussion

Over the past few years, data processing methods for the microbiome have matured and stabilized, and the focus of bioinformatics tools is shifting from simply adding functions to ease of use. As continuously maintained and iterated software, Parallel-Meta Suite aims to give users of different backgrounds and skill levels a pleasant working experience and to provide comprehensive, rapid solutions for large-scale microbiome analysis using the latest methods and technologies. This helps build a comprehensive microbiome knowledge base from a wide range of datasets and supports interdisciplinary collaboration.

In addition, PMS facilitates in-depth data mining through its high compatibility. First, its visualizations give a clear view of the microbial diversity patterns associated with key phenotypes and can suggest key hypotheses for downstream analysis or larger-scale studies. Second, all raw results are stored in standard or commonly used formats for downstream big data mining. For example, relative abundance tables of different microbial features (such as taxa or functional pathways) are also suitable for other microbiome analysis tools or machine learning tools. These results can be used directly and seamlessly by our previously developed tools, such as Microbiome Search Engine or Meta-Apo, which greatly facilitates data-driven science in this field.

Code and Data Availability

The package is now available on GitHub (https://github.com/qdu-bioinfo/parallel-meta-suite) and Gitee (https://gitee.com/qdu-bioinfo/parallel-meta-suite), where an installer is integrated for fully automatic installation. All datasets used in this paper have also been uploaded to the online repository. In each dataset package, the "seqs" folder contains the demultiplexed FASTA-formatted sequence file for each sample, the "seqs.list" file records the paths of these sequence files, and the "meta.txt" file contains the meta information for each sample. All supplementary materials (texts, figures, tables, Chinese translations and videos) are also available online.

Editor in charge: Ma Tengfei, Nanjing Agricultural University

Review: iMeta Journal Editorial Office

About the Author

Chen Yuzhu holds an academic master's degree in software engineering from Qingdao University and was an exchange student at Blekinge Institute of Technology in Sweden in 2019. His current research direction is microbiome big data analysis and mining, and his work has been published in iMeta, Computational and Structural Biotechnology Journal and other journals.

Li Jian holds a master's degree in electronic information from Qingdao University. A former ZTE engineer, he later enrolled at Qingdao University for his master's degree. His current research focuses on microbiome analysis tools.

About the corresponding author

Su Xiaoquan is a professor and doctoral supervisor at Qingdao University. His research direction is bioinformatics and big data science. He has published more than 40 academic papers in iMeta, mBio, mSystems, Bioinformatics and other journals. He has led projects funded by the National Natural Science Foundation of China, a sub-project of the National Key Research and Development Program, a major basic research project of the Shandong Provincial Natural Science Foundation, and a key deployment sub-project of the Chinese Academy of Sciences. Related work has obtained 8 software copyrights.

Citation

Yuzhu Chen, Jian Li, Yufeng Zhang, Mingqian Zhang, Zheng Sun, Gongchao Jing, Shi Huang, Xiaoquan Su. 2022. Parallel-Meta Suite: Interactive and rapid microbiome data analysis on multiple platforms. iMeta 1: e1. https://doi.org/10.1002/imt2.1

iMeta—a high starting point journal for microbiome/bioinformatics

Contact :

Homepage: http://www.imeta.science
Press: https://onlinelibrary.wiley.com/journal/2770596x
Submission: https://mc.manuscriptcentral.com/imeta
Email: [email protected]
WeChat public account : iMeta
