NBT: Rob Knight - A New Method for Dimensionality Reduction in Microbiome Data

Context-aware dimensionality reduction deconvolutes gut microbial community dynamics

Nature Biotechnology [IF:36.558]

https://doi.org/10.1038/s41587-020-0660-7

Release date: 2018-05-23

Chinese version update time: 2018-03-30

First authors: Cameron Martino 1,2,3, Liat Shenhav 4

Corresponding Author: Rob Knight 1,3,14,15 [email protected]

Co-authors: Clarisse A. Marotz, George Armstrong, Daniel McDonald, Yoshiki Vázquez-Baeza, James T. Morton, Lingjing Jiang, Maria Gloria Dominguez-Bello, Austin D. Swafford, Eran Halperin

Main affiliations:

1Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA

4Department of Computer Science, University of California Los Angeles, Los Angeles, CA, USA.

Summary

The interpretive power of human microbiome studies is limited by large inter-individual variability. We describe a dimensionality reduction tool, compositional tensor factorization (CTF), that combines information from multiple samples of the same host to reveal patterns driving differences in microbial composition across phenotypes. CTF identifies robust patterns in sparse compositional datasets, allowing the detection of microbial changes associated with specific phenotypes that are reproducible across datasets.

Main text

The host-associated microbiota is often host-specific, with subject identity driving most of the variation. Such host-specific variation can mask microbial changes broadly associated with a given phenotype. Collecting multiple samples from the same participant, either longitudinally or from different body sites (i.e., "repeated measures"), is an effective experimental design to control for individual differences. However, the nature of microbiome sequencing datasets poses multiple challenges in using this type of design.

A common approach to exploring microbiome sequencing data is to perform dimensionality reduction (e.g., principal coordinate analysis (PCoA)) on a distance matrix that describes relationships between samples, so that global differences across the dataset can be observed. However, when applied to repeated measures, this method does not take into account the inherent temporal or spatial correlation structure. An alternative approach to analyzing repeated-measures microbiome data is to use supervised methods built on generative models that infer community dynamics (e.g., generalized Lotka–Volterra). Although these methods account for the correlation structure caused by repeated measures, as well as sparsity and compositionality, their outputs do not directly allow phenotypes to be clustered by microbial community dynamics.
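
As a point of reference for the conventional workflow described above, the following is a minimal sketch of classical PCoA on a Bray–Curtis distance matrix, written with NumPy only. The function names (`bray_curtis`, `pcoa`) and the toy count table are illustrative assumptions, not code or data from the paper. Note that every sample is treated as independent here, which is exactly the limitation for repeated measures.

```python
import numpy as np

def bray_curtis(counts):
    """Pairwise Bray-Curtis dissimilarity between rows (samples) of a count table."""
    counts = np.asarray(counts, dtype=float)
    n = counts.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(counts[i] - counts[j]).sum()
            den = (counts[i] + counts[j]).sum()
            dist[i, j] = dist[j, i] = num / den
    return dist

def pcoa(dist, n_axes=2):
    """Classical PCoA: double-center the squared distances, then eigendecompose."""
    dist = np.asarray(dist, dtype=float)
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]   # largest eigenvalues first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale eigenvectors by sqrt(eigenvalue); negative eigenvalues are clipped to 0.
    return eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))

# Made-up toy table: 4 samples (rows) x 4 features (columns).
table = np.array([[10, 0, 5, 1],
                  [8, 1, 6, 0],
                  [0, 12, 1, 7],
                  [1, 9, 0, 8]])
coords = pcoa(bray_curtis(table))
print(coords)
```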

To address these challenges simultaneously, we developed compositional tensor factorization (CTF), which performs unsupervised dimensionality reduction on repeated-measures data and supports both traditional beta-diversity analysis and the assessment of differential feature abundance. As a first step, the two-dimensional count matrix is transformed using the robust centered log-ratio (rclr) to account for the inherent sparsity and compositional nature of next-generation sequencing datasets (Fig. 1a). Next, the transformed matrix is restructured into a three-dimensional tensor with modes for microbial features, host subjects, and time or space (Fig. 1b). Factorization (i.e., decomposition) of this tensor yields distinct loading vectors for subjects (U), microbial features (V) and time points (W) (Fig. 1c). Similar to the concept of reference frames, these vectors are unit-scaled and can therefore be ordered, with rank indicating association with the underlying phenotype groups. From here on, we refer to the ordering of these vectors as a "ranking" (e.g., "feature ranking"). Notably, CTF assumes that the data have an underlying low-rank structure in which only a few phenotypic factors explain most of the variance (Fig. 1d–g).
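
The three steps above (rclr transform, tensor construction, factorization on observed entries only) can be sketched in a few lines of NumPy. This is a simplified illustration and not the authors' implementation: the helper names, the toy table, and the rank-1 alternating least squares fit are all assumptions made for the example. Zeros are treated as missing rather than imputed, and the fit skips missing entries through a mask.

```python
import numpy as np

def rclr(counts):
    """Robust centered log-ratio: log the nonzero entries of each sample (row) and
    center them on the mean log of that sample's observed entries. Zeros become NaN."""
    counts = np.asarray(counts, dtype=float)
    logged = np.full(counts.shape, np.nan)
    observed = counts > 0
    logged[observed] = np.log(counts[observed])
    return logged - np.nanmean(logged, axis=1, keepdims=True)

def build_tensor(transformed, subjects, timepoints):
    """Arrange a samples x features matrix into a subjects x features x time tensor."""
    subject_ids, time_ids = np.unique(subjects), np.unique(timepoints)
    tensor = np.full((len(subject_ids), transformed.shape[1], len(time_ids)), np.nan)
    for row, (s, t) in enumerate(zip(subjects, timepoints)):
        i = np.searchsorted(subject_ids, s)
        k = np.searchsorted(time_ids, t)
        tensor[i, :, k] = transformed[row]
    return tensor, subject_ids, time_ids

def rank1_factor(tensor, n_iter=200, seed=0):
    """Fit tensor[i, j, k] ~ u[i] * v[j] * w[k] by alternating least squares,
    using only the observed (non-NaN) entries."""
    mask = (~np.isnan(tensor)).astype(float)
    data = np.nan_to_num(tensor, nan=0.0)
    rng = np.random.default_rng(seed)
    u = np.ones(tensor.shape[0])
    v = rng.standard_normal(tensor.shape[1])   # random start: an all-ones start
    w = rng.standard_normal(tensor.shape[2])   # would cancel on centered data
    eps = 1e-12                                # guards against division by zero
    for _ in range(n_iter):
        u = np.einsum("ijk,j,k->i", data, v, w) / (np.einsum("ijk,j,k->i", mask, v**2, w**2) + eps)
        v = np.einsum("ijk,i,k->j", data, u, w) / (np.einsum("ijk,i,k->j", mask, u**2, w**2) + eps)
        w = np.einsum("ijk,i,j->k", data, u, v) / (np.einsum("ijk,i,j->k", mask, u**2, v**2) + eps)
    return u, v, w

# Made-up toy data: 2 subjects x 2 time points, 3 features.
subjects = np.array(["s1", "s1", "s2", "s2"])
timepoints = np.array([0, 1, 0, 1])
table = np.array([[10, 0, 5],
                  [20, 4, 0],
                  [3, 3, 3],
                  [0, 8, 2]])
transformed = rclr(table)
tensor, subject_ids, time_ids = build_tensor(transformed, subjects, timepoints)
u, v, w = rank1_factor(tensor)   # subject, feature and time loadings
```

A full decomposition would extract several components and rescale the loadings; the rank-1 case is shown only because it keeps the masked update rule readable.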

Fig. 1: Overview of the CTF algorithm.

a, CTF uses feature abundance matrices for subjects over time. For each subject with a phenotype of interest, the data is represented as relative abundances of features (abundance gradient represented in grayscale) over time.
b, The matrices are concatenated, rclr transformed and structured into a tensor format with modes corresponding to subjects, features and time.
c, The resulting tensor is then factored based only on observed data into loading vectors for each dimension (that is, subject, timepoint and feature).
d, Simulated count data is plotted on the y axis for three taxa with the mean counts in bold and missing values absent from the bold line. Standard deviation of distributions are shaded behind. Two phenotypes are compared; a control unchanging in time (left) and a dynamic phenotype with a perturbation at timepoint 2 (right). Taxon 1 (blue) is highly abundant and noisy, taxon 2 (red) is lowly abundant but growing exponentially in phenotype 2, and taxon 3 (orange) is oscillatory with increasing amplitude in phenotype 2.
e–g, The first two principal component axes (that is, loadings) from CTF (PC1 (top) and PC2 (bottom)) are plotted on the y axis with the corresponding sample (e), time (f) and feature loadings (g). In PC1, phenotype 2 is linked to the unstable oscillatory waveform of highly loaded taxon 3 (orange, top). Similarly, in PC2, phenotype 2 is linked to the sigmoidal waveform of highly loaded taxon 2 (red, bottom).

To demonstrate the use of CTF, we applied it to a simulated longitudinal dataset with two phenotype groups. Simulations were generated from the distribution of real longitudinal 16S data from Halfvarson et al., while varying sequencing depth and temporal sampling density as described in Äijö et al. Diversity varied widely among subjects with or without Crohn's disease. We compared CTF with state-of-the-art beta-diversity metrics, including Jaccard, Bray–Curtis, Aitchison, unweighted UniFrac and weighted UniFrac, via PCoA. In every simulation, k-nearest neighbor (KNN) classification by disease state showed that CTF achieved higher accuracy than existing methods regardless of sequencing depth or the number of samples collected longitudinally (Fig. 2, Supplementary Table 1 and Supplementary Fig. 1). By the PERMANOVA F statistic, CTF also showed higher discriminative power at all sequencing depths and at higher sampling densities (≥3 time points; Fig. 2).
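
For readers unfamiliar with the PERMANOVA F statistic used as the measure of discriminative power, here is a minimal NumPy sketch of the pseudo-F computation and its permutation test. The function names and the six-sample toy ordination are assumptions for illustration; the benchmark in the paper was not run with this code.

```python
import numpy as np

def pseudo_f(dist, labels):
    """PERMANOVA pseudo-F from a square distance matrix and one group label per sample."""
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    n = len(labels)
    groups = np.unique(labels)
    # Total sum of squares from all pairwise squared distances.
    ss_total = (np.triu(dist, 1) ** 2).sum() / n
    # Within-group sum of squares, accumulated group by group.
    ss_within = 0.0
    for g in groups:
        idx = np.flatnonzero(labels == g)
        sub = dist[np.ix_(idx, idx)]
        ss_within += (np.triu(sub, 1) ** 2).sum() / len(idx)
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))

def permanova(dist, labels, n_perm=999, seed=0):
    """Return the observed pseudo-F and a permutation p-value."""
    rng = np.random.default_rng(seed)
    observed = pseudo_f(dist, labels)
    permuted = np.array([pseudo_f(dist, rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (1 + (permuted >= observed).sum()) / (n_perm + 1)
    return observed, p_value

# Made-up toy ordination: two well-separated groups of three samples each.
coords = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
labels = np.array(["control"] * 3 + ["crohns"] * 3)
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
f_stat, p_value = permanova(dist, labels)
print(f_stat, p_value)
```

A larger pseudo-F means that between-group distances dominate within-group distances, which is why it serves as a summary of how well an ordination separates the phenotypes.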

Fig. 2: CTF outperforms popular distance metrics in longitudinal in silico data-driven simulations.

Increasing sequencing depth (500–10,000, rows) over differing temporal sampling densities (x axis) evaluated for PERMANOVA F statistic as a measure of discriminatory power (left column), in addition to KNN-classification cross-validation by area under curve (AUC; n = 100, middle column) and average precision-recall (APR; n = 100, right column). Compared among CTF (green) and popular distance metrics Aitchison (blue), Bray–Curtis (orange), Jaccard (gray), unweighted (purple) and weighted (red) UniFrac. Error bars represent s.e.m.

Next, we applied CTF to two published datasets that track changes in the infant gut over time. The datasets, abbreviated ECAM (n = 43 subjects; ref. 14) and DIABIMMUNE (n = 39 subjects; ref. 15), followed infants for the first 2 and 3 years of life, respectively. Both studies observed that mode of delivery (i.e., vaginal or cesarean section) differentiated microbial community composition. As with the simulated data, CTF was tenfold better at distinguishing vaginally delivered from cesarean-delivered infants than state-of-the-art beta-diversity metrics (Supplementary Figs. 2a,b and 3a,b and Supplementary Table 2).

We next examined whether CTF can reproducibly identify differentially abundant microbes in an unsupervised manner. To this end, we compared the feature rankings of the ECAM and DIABIMMUNE datasets along the first axis and found a significant correlation between them (Pearson correlation, R = 0.974, P < 10^-10; Supplementary Fig. 2). Although the two datasets overlapped by <50% at the sOTU level (Supplementary Fig. 2d), the highest- and lowest-ranked sOTUs, grouped at the genus level, were similar in both datasets (Supplementary Fig. 2e). We note that although these datasets were collected and processed using different methods and in different laboratories, CTF identified the same taxa as driving the differentiation of the gut microbiome by birth mode, suggesting that the structure of the infant microbiome is highly consistent.
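
The cross-dataset comparison of feature rankings can be illustrated with a short sketch: restrict both rankings to the features they share, then compute the Pearson correlation of the loadings. The function name, the placeholder taxon labels and the loading values below are made up for the example and are not taken from ECAM or DIABIMMUNE.

```python
import numpy as np

def shared_feature_correlation(loadings_a, loadings_b):
    """Pearson correlation of first-axis loadings over features present in both datasets.

    Each argument is a dict mapping a feature ID to its loading.
    """
    shared = sorted(set(loadings_a) & set(loadings_b))
    if len(shared) < 3:
        raise ValueError("need at least three shared features")
    a = np.array([loadings_a[f] for f in shared])
    b = np.array([loadings_b[f] for f in shared])
    r = np.corrcoef(a, b)[0, 1]
    return shared, r

# Hypothetical loadings for two cohorts; only four taxa are shared.
cohort_1 = {"taxon_A": 0.9, "taxon_B": 0.4, "taxon_C": -0.3, "taxon_D": -0.8, "taxon_E": 0.1}
cohort_2 = {"taxon_A": 0.8, "taxon_B": 0.5, "taxon_C": -0.2, "taxon_D": -0.9, "taxon_F": 0.3}
shared, r = shared_feature_correlation(cohort_1, cohort_2)
print(shared, r)
```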

We constructed birth-mode log-ratios (vaginal versus cesarean delivery) from the sOTUs most associated with vaginal and cesarean delivery in each dataset (Supplementary Fig. 4 and Methods). Across time, these log-ratios largely separated samples by birth mode in both datasets (Supplementary Fig. 5 and Supplementary Table 3). We found that the microbial signatures of birth mode were not confounded by established distinguishing factors such as antibiotic use or feeding regime (Supplementary Fig. 5). However, we cannot rule out the possibility of unmeasured confounders. Next, we combined the sOTUs shared by the ECAM and DIABIMMUNE birth-mode ratios to create a "microbial birth-mode signature".
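
A log-ratio of this kind can be sketched as the log of summed numerator counts over summed denominator counts per sample. The function name, the placeholder sOTU identifiers and the counts are assumptions for illustration. Returning NaN when either sum is zero is also an assumption of this sketch, since the log-ratio is undefined in that case and the source does not say how such samples were handled.

```python
import numpy as np

def log_ratio(table, feature_ids, numerator, denominator):
    """Per-sample log(sum of numerator counts / sum of denominator counts).

    Samples where either sum is zero get NaN because the ratio is undefined.
    """
    table = np.asarray(table, dtype=float)
    index = {f: j for j, f in enumerate(feature_ids)}
    num = table[:, [index[f] for f in numerator]].sum(axis=1)
    den = table[:, [index[f] for f in denominator]].sum(axis=1)
    ratio = np.full(table.shape[0], np.nan)
    valid = (num > 0) & (den > 0)
    ratio[valid] = np.log(num[valid] / den[valid])
    return ratio

# Made-up counts: 3 samples x 4 placeholder sOTUs.
feature_ids = ["sotu_1", "sotu_2", "sotu_3", "sotu_4"]
table = np.array([[30, 10, 5, 5],
                  [2, 3, 20, 30],
                  [0, 0, 4, 6]])
ratios = log_ratio(table, feature_ids,
                   numerator=["sotu_1", "sotu_2"],     # e.g. vaginal-associated
                   denominator=["sotu_3", "sotu_4"])   # e.g. cesarean-associated
print(ratios)
```

Because the ratio uses only the selected features, it does not depend on the total read count of a sample, which is what makes it suitable for compositional data.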

To examine the robustness of this microbial birth-mode signature, we tested its discriminative power on a large cross-sectional dataset from the American Gut Project (AGP) (n = 8,099). We found that the signature significantly differentiated participants under the age of four by birth mode (t-test, P = 0.042; Supplementary Fig. 6), consistent with our previous findings. The robustness of this microbial signature across multiple datasets highlights the ability of CTF to identify differentially abundant features that are reproducibly associated with phenotypes.

In the ECAM and DIABIMMUNE datasets, we observed that differences between samples from vaginally delivered and cesarean-delivered infants became less pronounced over time throughout infant development (Supplementary Fig. 2a,b). Likewise, in the AGP dataset, the microbial birth-mode signature no longer differentiated participants over the age of four by birth mode.

CTF is the only unsupervised method that can take full advantage of repeated measures while accounting for the inherent properties of microbiome sequencing datasets, namely high dimensionality, sparsity and compositionality. CTF outperforms current state-of-the-art beta-diversity metrics on both simulated and real datasets. Although CTF can reveal robust microbial signatures, several aspects should be considered when using this tool. First, CTF relies on the assumption that the underlying data are low rank. This assumption may be violated, making CTF unsuitable, for example when the data are driven by gradients rather than discrete groupings (such as soil datasets). Our CTF implementation estimates the underlying rank and notifies users when the data do not meet this requirement. Second, CTF, like other beta-diversity metrics, does not directly account for confounding factors that could affect downstream clustering, and thus requires additional validation similar to that shown in Supplementary Fig. 5. Finally, although CTF uses repeated measures to account for individual differences and is optimal in the context of synchronous events (e.g., treatment, diet), it is permutation-invariant and does not take into account the order of longitudinal data.

In addition to the longitudinal datasets benchmarked here, CTF can also be used for spatial repeated measures. This includes studies where samples are collected simultaneously, for example from multiple body sites (e.g., skin and saliva) or from sites with different phenotypes (e.g., diseased skin versus adjacent non-lesional skin). CTF can also be used to analyze other data types with large individual differences, such as metabolomics or proteomics. In conclusion, CTF uses the power of repeated-measures study designs to elucidate biological variation while accounting for inter-individual differences. We recommend this tool for the reanalysis of existing datasets as well as for future studies of microbial communities.

Origin blog.csdn.net/woodcorpse/article/details/123492490