Microbiome data preprocessing with the microbiome package

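Instructions for manipulating microbiome data sets with tools from the phyloseq package and some extensions from the microbiome package, including subsetting, aggregating and filtering.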

Load data

library(phyloseq)
library(microbiome)
library(knitr)
data(atlas1006)   
# Rename the example data (which is a phyloseq object)
pseq <- atlas1006

Summarize the contents of the phyloseq object

summarize_phyloseq(pseq)

## Compositional = NO
## 1] Min. number of reads = 1900
## 2] Max. number of reads = 28883
## 3] Total number of reads = 13546564
## 4] Average number of reads = 11769.3866203301
## 5] Median number of reads = 11171
## 7] Sparsity = 0.209002205440086
## 6] Any OTU sum to 1 or less? NO
## 8] Number of singletons = 0
## 9] Percent of OTUs that are singletons (i.e. exactly one read detected across all samples) 0
## 10] Number of sample variables are: 10
## age, sex, nationality, DNA_extraction_method, project, diversity, bmi_group, subject, time, sample

## [[1]]
## [1] "1] Min. number of reads = 1900"
## 
## [[2]]
## [1] "2] Max. number of reads = 28883"
## 
## [[3]]
## [1] "3] Total number of reads = 13546564"
## 
## [[4]]
## [1] "4] Average number of reads = 11769.3866203301"
## 
## [[5]]
## [1] "5] Median number of reads = 11171"
## 
## [[6]]
## [1] "7] Sparsity = 0.209002205440086"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 10"
## 
## [[11]]
##  [1] "age"                   "sex"                   "nationality"          
##  [4] "DNA_extraction_method" "project"               "diversity"            
##  [7] "bmi_group"             "subject"               "time"                 
## [10] "sample"

Retrieving data elements from a phyloseq object

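A phyloseq object contains an OTU table (taxa abundances), sample metadata, a taxonomy table (the mapping between OTUs and higher-level taxonomic groups), and a phylogenetic tree (relations between the taxa). Some of these are optional.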

Pick the metadata

meta <- meta(pseq)

Taxonomy table

taxonomy <- tax_table(pseq)

Abundances for the taxonomic groups (the "OTU table") as a taxa x samples matrix.

# Absolute abundances
otu.absolute <- abundances(pseq)

# Relative abundances
otu.relative <- abundances(pseq, "compositional")
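
As a rough illustration of what the "compositional" option means, the sketch below converts a toy taxa x samples count matrix into relative abundances in base R. The matrix and its names (counts, TaxonA, S1, ...) are hypothetical and not part of the atlas1006 data; this is a conceptual sketch, not the package's implementation.

```r
# Toy taxa x samples count matrix (hypothetical values)
counts <- matrix(c(10,  0, 30,
                   90, 50, 70),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("TaxonA", "TaxonB"),
                                 c("S1", "S2", "S3")))

# Divide each column (sample) by its total read count
relative <- sweep(counts, 2, colSums(counts), "/")

# After the transformation every sample sums to 1
colSums(relative)
```

Because every sample is rescaled by its own total, relative abundances are comparable across samples with different sequencing depths.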

Total read counts per sample

reads_sample <- readcount(pseq)
# check for first 5 samples
reads_sample[1:5]

## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 
##     7593    10148     7131    10855    12000

Add the read count of each sample to the metadata of the phyloseq object.

sample_data(pseq)$reads_sample <- reads_sample

# reads_sample is added as the last column of sample_data in the pseq object.
head(meta(pseq)[,c("sample", "reads_sample")])

##            sample reads_sample
## Sample-1 Sample-1         7593
## Sample-2 Sample-2        10148
## Sample-3 Sample-3         7131
## Sample-4 Sample-4        10855
## Sample-5 Sample-5        12000
## Sample-6 Sample-6         7914

Melt the phyloseq data for easier plotting

df <- psmelt(pseq)
kable(head(df))

	OTU	Sample	Abundance	age	sex	nationality	DNA_extraction_method	project	diversity	bmi_group	subject	sample	reads_sample	Phylum	Family	Genus
110989	Prevotella melaninogenica et rel.	Sample-448	14961	54	female	CentralEurope	o	18	5.98	lean	448	Sample-448	26546	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111380	Prevotella melaninogenica et rel.	Sample-360	14296	45	female	CentralEurope	o	13	5.49	severeobese	360	Sample-360	21086	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111232	Prevotella melaninogenica et rel.	Sample-190	13676	34	female	CentralEurope	r	7	6.06	lean	190	Sample-190	23954	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111553	Prevotella melaninogenica et rel.	Sample-743	13509	52	male	US	NA	19	5.21	obese	743	Sample-743	20632	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
110590	Prevotella melaninogenica et rel.	Sample-366	13490	52	female	CentralEurope	o	15	5.63	obese	366	Sample-366	19651	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
111029	Prevotella melaninogenica et rel.	Sample-375	13384	45	female	CentralEurope	o	16	5.64	severeobese	375	Sample-375	21408	Bacteroidetes	Bacteroidetes	Prevotella melaninogenica et rel.
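
To make the idea of "melting" concrete, here is a minimal base R sketch that turns a toy taxa x samples matrix into long format, with one row per taxon/sample pair. The matrix is hypothetical. Unlike psmelt, this sketch does not attach sample metadata or taxonomy columns.

```r
# Toy taxa x samples matrix with named dimensions (hypothetical values)
counts <- matrix(c(1, 2, 3, 4), nrow = 2,
                 dimnames = list(OTU    = c("TaxonA", "TaxonB"),
                                 Sample = c("S1", "S2")))

# One row per (OTU, Sample) combination, abundance in its own column
long <- as.data.frame(as.table(counts),
                      responseName = "Abundance",
                      stringsAsFactors = FALSE)

long
```

The long format is what plotting tools such as ggplot2 typically expect, which is why melting is a common first step before visualization.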

Sample operations

Sample names and variables

head(sample_names(pseq))

## [1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"

Total OTU abundance in each sample

s <- sample_sums(pseq)

Abundance of a given taxon in each sample

head(abundances(pseq)["Akkermansia",])

## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6 
##       21       36      475       61       34       14

Select a subset by metadata fields:

pseq.subset <- subset_samples(pseq, nationality == "AFR")

Select a subset by providing sample names:

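# Check sample names for African females in this phyloseq object
# (values must match the metadata levels exactly, e.g. "female"; check with unique(meta(pseq)$sex))
s <- rownames(subset(meta(pseq), nationality == "AFR" & sex == "female"))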

# Pick the phyloseq subset with these sample names
pseq.subset2 <- prune_samples(s, pseq)

Pick samples at the baseline time points only:

pseq0 <- baseline(pseq)

Data transformations

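The microbiome package provides a wrapper for standard sample/OTU transformations. For arbitrary transformations, use the transform_sample_counts function from the phyloseq package. If the data contains zeros, the log10 transformation is computed as log10(1 + x). The "Z", "clr", "hellinger" and "shift" transformations are also available. The example below computes relative abundances (note that the input data must be on the absolute scale, not logarithmic!).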

pseq.compositional <- microbiome::transform(pseq, "compositional")
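
For intuition, the sketch below writes out two of the transformations mentioned above in base R, using their usual textbook definitions: log10(1 + x), and the Hellinger transformation taken as the square root of relative abundances. The toy matrix is hypothetical, and the package's internal implementation may differ in details.

```r
# Toy taxa x samples count matrix containing zeros (hypothetical values)
counts <- matrix(c(0,  9, 99,
                   4, 16,  0),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("TaxonA", "TaxonB"),
                                 c("S1", "S2", "S3")))

# log10(1 + x): zeros map to 0 instead of -Inf
log_counts <- log10(1 + counts)

# Hellinger (textbook definition): square root of the relative abundances
relative  <- sweep(counts, 2, colSums(counts), "/")
hellinger <- sqrt(relative)

log_counts
hellinger
```

With this definition, the squared Hellinger values of each sample sum to 1, which is a quick sanity check on the result.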

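The CLR ("clr") transformation is also available; it adds a pseudocount to avoid zeros. An alternative is to impute the zero-inflated unobserved values. Multiplicative Kaplan-Meier smoothing spline (KMSS) replacement, multiplicative lognormal replacement, or multiplicative simple replacement are sometimes used for this. They are available in the zCompositions R package (functions multKM, multLN and multRepl, respectively). In practice, use at least n.draws = 1000; fewer draws are acceptable only to speed up an example.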

data(dietswap)
x <- dietswap
# Compositional data 
x2 <- microbiome::transform(x, "compositional")
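
The centered log-ratio idea can be sketched in base R as follows. The pseudocount of 1 is an assumption chosen for illustration only; the pseudocount used inside the microbiome package may be different, and the toy matrix is hypothetical.

```r
# Toy taxa x samples count matrix containing zeros (hypothetical values)
counts <- matrix(c( 0, 12,
                   30,  5,
                   70, 83),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                 c("S1", "S2")))

pseudocount <- 1  # assumption: any small positive value avoids log(0)

# CLR: log of each value minus the mean log within the same sample
log_shifted <- log(counts + pseudocount)
clr <- sweep(log_shifted, 2, colMeans(log_shifted), "-")

# By construction each sample's CLR values sum to (numerically) zero
colSums(clr)
```

Subtracting the per-sample mean log is equivalent to dividing by the geometric mean before taking logs, which is why CLR values are centered around zero within each sample.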

Variable operations

Sample variable names

sample_variables(pseq)

##  [1] "age"                   "sex"                   "nationality"          
##  [4] "DNA_extraction_method" "project"               "diversity"            
##  [7] "bmi_group"             "subject"               "time"                 
## [10] "sample"                "reads_sample"

Pick the values of a given variable

head(get_variable(pseq, sample_variables(pseq)[1]))

## [1] 28 24 52 22 25 42

Assign new fields to the metadata

# Calculate diversity for samples (alpha() returns a data frame, one column per index)
div <- microbiome::alpha(pseq, index = "shannon")

# Assign the estimated diversity to sample metadata
sample_data(pseq)$diversity <- div$diversity_shannon

Taxa operations

Number of taxa

n <- ntaxa(pseq)

Most abundant taxa

topx <- top_taxa(pseq, n = 10)

Names

ranks <- rank_names(pseq)  # Taxonomic levels
taxa  <- taxa(pseq)        # Taxa names at the analysed level

Subset taxa

pseq.bac <- subset_taxa(pseq, Phylum == "Bacteroidetes")

Prune (select) taxa that meet given criteria

# List of genera in the Bacteroidetes phylum
taxa <- map_levels(NULL, "Phylum", "Genus", pseq)$Bacteroidetes

# With given taxon names
ex2 <- prune_taxa(taxa, pseq)

# Taxa with positive sum across samples
ex3 <- prune_taxa(taxa_sums(pseq) > 0, pseq)

Filter by a user-specified function of the values (here, variance).

f <- filter_taxa(pseq, function(x) var(x) > 1e-05, TRUE)
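
The same filtering logic can be sketched on a plain matrix in base R: compute the variance of each taxon across samples and keep the rows above the threshold. The toy matrix is hypothetical; the threshold is the one used above.

```r
# Toy taxa x samples count matrix (hypothetical values)
counts <- matrix(c(5,  5, 5,  5,
                   1, 20, 3, 40,
                   0,  0, 0,  0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                 paste0("S", 1:4)))

# Variance of each taxon (row) across samples
taxon_var <- apply(counts, 1, var)

# Keep only taxa whose variance exceeds the threshold
keep     <- taxon_var > 1e-05
filtered <- counts[keep, , drop = FALSE]

rownames(filtered)  # "TaxonB"
```

TaxonA and TaxonC are constant across samples, so their variance is zero and they are dropped; drop = FALSE keeps the result a matrix even when a single taxon remains.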

List the unique phylum-level groups.

head(get_taxa_unique(pseq, "Phylum"))

## [1] "Actinobacteria"  "Firmicutes"      "Proteobacteria"  "Verrucomicrobia"
## [5] "Bacteroidetes"   "Spirochaetes"

Pick the taxa abundances for a given sample

samplename <- sample_names(pseq)[[1]]

# Pick abundances of all taxa for a particular sample
tax.abundances <- abundances(pseq)[, samplename]

Merging operations

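Aggregate taxa to higher taxonomic levels. This is particularly useful when the phylogenetic tree is missing. (See also merge_samples, merge_taxa and tax_glom in the phyloseq package.)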

pseq2 <- aggregate_taxa(pseq, "Phylum")
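
Conceptually, aggregation sums the abundances of all taxa that share the same higher-level label. The base R sketch below shows this on a toy genus-level matrix with a hypothetical genus-to-phylum mapping; it illustrates the idea only and is not the package's implementation.

```r
# Toy genus x samples count matrix (hypothetical values)
counts <- matrix(c(1, 2,
                   3, 4,
                   5, 6),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("GenusA", "GenusB", "GenusC"),
                                 c("S1", "S2")))

# Hypothetical mapping from genus to phylum
phylum <- c(GenusA = "Firmicutes",
            GenusB = "Firmicutes",
            GenusC = "Bacteroidetes")

# Sum the rows that belong to the same phylum
by_phylum <- rowsum(counts, group = unname(phylum[rownames(counts)]))

by_phylum
```

The result has one row per phylum (sorted alphabetically by rowsum), and the per-sample totals are unchanged by the aggregation.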

Merge the desired taxa into an "Other" category. Here, we merge all Bacteroides groups into a single group named Bacteroides.

pseq2 <- merge_taxa2(pseq, pattern = "^Bacteroides", name = "Bacteroides")

Merging phyloseq objects with the phyloseq package

# pseqA and pseqB stand for any two phyloseq objects to be combined
merge_phyloseq(pseqA, pseqB)

Joining the OTU/ASV table and the taxonomy in a single data frame

library(dplyr) 
library(microbiome)
data("atlas1006") # example data from the microbiome package
x <- atlas1006

asv_tab <- as.data.frame(abundances(x)) # ASV/OTU abundance table
asv_tab$asv_id <- rownames(asv_tab) # add an ID column to join on
# tax_tab <- as.data.frame(tax_table(x)) # direct conversion; can be slow
tax_tab <- as(x@tax_table, "matrix") # extract the taxonomy table as a matrix
tax_tab <- as.data.frame(tax_tab) # convert to a data frame
tax_tab$asv_id <- rownames(tax_tab) # add an ID column to join on
asv_tax_tab <- tax_tab %>% 
  left_join(asv_tab, by="asv_id") # join to get taxonomy and asv table

head(asv_tax_tab)[,1:8]

##            Phylum          Family                        Genus
## 1  Actinobacteria  Actinobacteria             Actinomycetaceae
## 2      Firmicutes         Bacilli                   Aerococcus
## 3  Proteobacteria  Proteobacteria                    Aeromonas
## 4 Verrucomicrobia Verrucomicrobia                  Akkermansia
## 5  Proteobacteria  Proteobacteria Alcaligenes faecalis et rel.
## 6   Bacteroidetes   Bacteroidetes           Allistipes et rel.
##                         asv_id Sample-1 Sample-2 Sample-3 Sample-4
## 1             Actinomycetaceae        0        0        0        0
## 2                   Aerococcus        0        0        0        0
## 3                    Aeromonas        0        0        0        0
## 4                  Akkermansia       21       36      475       61
## 5 Alcaligenes faecalis et rel.        1        1        1        2
## 6           Allistipes et rel.       72      127       34      344

Rarefaction

pseq.rarified <- rarefy_even_depth(pseq)
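
The idea behind rarefying to even depth can be sketched in base R: every sample is randomly subsampled down to the read count of the smallest sample. This sketch draws reads without replacement and uses a hypothetical toy matrix and a hypothetical helper, rarefy_column; it is a conceptual illustration, not phyloseq's implementation (rarefy_even_depth has its own replace and rngseed arguments).

```r
set.seed(42)  # subsampling is random; fix the seed for reproducibility

# Toy taxa x samples count matrix (hypothetical values)
counts <- matrix(c(50, 30, 20,    # sample S1: 100 reads
                   10, 10, 20),   # sample S2:  40 reads
                 nrow = 3,
                 dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                 c("S1", "S2")))

depth <- min(colSums(counts))  # rarefy to the smallest library size

# Hypothetical helper: subsample one sample's reads without replacement
rarefy_column <- function(x, depth) {
  reads  <- rep(seq_along(x), times = x)            # one entry per read
  picked <- reads[sample.int(length(reads), depth)] # draw 'depth' reads
  tabulate(picked, nbins = length(x))               # count reads per taxon
}

rarefied <- apply(counts, 2, rarefy_column, depth = depth)
rownames(rarefied) <- rownames(counts)

colSums(rarefied)  # every sample now has 'depth' reads
```

Because the result depends on random draws, setting a seed (or passing one to the rarefaction function) is needed for reproducible results.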

Taxonomy

Convert between taxonomic levels (here from the genus Akkermansia to the phylum Verrucomicrobia).

m <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(pseq))
print(m)

## [1] "Verrucomicrobia"

Metadata

Visualize the frequencies of the levels of a given factor (sex) within the indicated groups (bmi_group).

p <- plot_frequencies(sample_data(pseq), "bmi_group", "sex")
print(p)

# Retrieving the actual data values:
# kable(head(p@data), digits = 2)

Custom functions are provided to cut age or BMI information into discrete classes.

group_bmi(c(22, 28, 31), "standard")

## [1] lean       overweight obese     
## Levels: underweight lean overweight obese severe morbid super

group_age(c(17, 41, 102), "decades")

## [1] [10,20)   [40,50)   [100,110]
## 10 Levels: [10,20) [20,30) [30,40) [40,50) [50,60) [60,70) [70,80) ... [100,110]
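
The decade grouping shown above can be approximated with base R's cut(). The break points below are an assumption chosen to reproduce the intervals in the output; group_age may derive its breaks differently.

```r
ages <- c(17, 41, 102)

# Left-closed decade intervals; include.lowest = TRUE closes the last one
age_group <- cut(ages,
                 breaks = seq(10, 110, by = 10),
                 right = FALSE,
                 include.lowest = TRUE)

as.character(age_group)  # "[10,20)" "[40,50)" "[100,110]"
```

With right = FALSE, an age of exactly 20 falls into [20,30) rather than [10,20), which matches the interval notation in the output above.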

Add metadata to a phyloseq object. For reproducibility, this example simply reuses the existing metadata, but it can be replaced by another data.frame (samples x fields).

# Example data
data(dietswap)
pseq <- dietswap

# Pick the existing metadata from a phyloseq object
# (or retrieve this from another source)
df <- meta(pseq)

# Merge the metadata back in the phyloseq object
pseq2 <- merge_phyloseq(pseq, sample_data(df))

https://microbiome.github.io/tutorials/Preprocessing.html