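Instructions for manipulating microbiome data sets using tools from the phyloseq package and some extensions from the microbiome package, including subsetting, aggregating, and filtering.
Load data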
library(phyloseq)
library(microbiome)
library(knitr)
data(atlas1006)
# Rename the example data (which is a phyloseq object)
pseq <- atlas1006
Summarize the contents of the phyloseq object
summarize_phyloseq(pseq)
## Compositional = NO2
## 1] Min. number of reads = 1900
## 2] Max. number of reads = 28883
## 3] Total number of reads = 13546564
## 4] Average number of reads = 11769.3866203301
## 5] Median number of reads = 11171
## 7] Sparsity = 0.209002205440086
## 6] Any OTU sum to 1 or less? NO
## 8] Number of singletons = 0
## 9] Percent of OTUs that are singletons
## (i.e. exactly one read detected across all samples) 0
## 10] Number of sample variables are: 10
## age sex nationality DNA_extraction_method project diversity bmi_group subject time sample
## [[1]]
## [1] "1] Min. number of reads = 1900"
##
## [[2]]
## [1] "2] Max. number of reads = 28883"
##
## [[3]]
## [1] "3] Total number of reads = 13546564"
##
## [[4]]
## [1] "4] Average number of reads = 11769.3866203301"
##
## [[5]]
## [1] "5] Median number of reads = 11171"
##
## [[6]]
## [1] "7] Sparsity = 0.209002205440086"
##
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
##
## [[8]]
## [1] "8] Number of singletons = 0"
##
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n (i.e. exactly one read detected across all samples)0"
##
## [[10]]
## [1] "10] Number of sample variables are: 10"
##
## [[11]]
## [1] "age" "sex" "nationality"
## [4] "DNA_extraction_method" "project" "diversity"
## [7] "bmi_group" "subject" "time"
## [10] "sample"
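Retrieving data elements from a phyloseq object
A phyloseq object contains an OTU table (taxa abundances), sample metadata, a taxonomy table (the mapping between OTUs and higher-level taxonomic groups), and a phylogenetic tree (the relations between taxa). Some of these are optional.
Pick the metadata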
meta <- meta(pseq)
Taxonomy table
taxonomy <- tax_table(pseq)
Abundances of the taxonomic groups (the "OTU table") as a taxa x samples matrix.
# Absolute abundances
otu.absolute <- abundances(pseq)
# Relative abundances
otu.relative <- abundances(pseq, "compositional")
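To make the "compositional" option concrete, the following base R sketch mimics it on a small made-up taxa x samples matrix. The object `toy_counts` is a hypothetical example, not part of atlas1006, and the code illustrates the idea only; it is not claimed to be the package's internal implementation. Each sample column is divided by its total, so that every column sums to 1.

```r
# Hypothetical toy counts: 3 taxa (rows) x 2 samples (columns)
toy_counts <- matrix(c(10, 30, 60,
                        5,  5, 10),
                     nrow = 3,
                     dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                     c("S1", "S2")))

# Divide every column by its column sum to obtain relative abundances
toy_relative <- sweep(toy_counts, 2, colSums(toy_counts), "/")

# Each sample now sums to 1
colSums(toy_relative)
```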
Total read counts per sample
reads_sample <- readcount(pseq)
# check for first 5 samples
reads_sample[1:5]
## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5
## 7593 10148 7131 10855 12000
Add the per-sample read counts to the metadata of the phyloseq object.
sample_data(pseq)$reads_sample <- reads_sample
# reads_sample is added as the last column of the sample_data of the pseq object.
head(meta(pseq)[,c("sample", "reads_sample")])
## sample reads_sample
## Sample-1 Sample-1 7593
## Sample-2 Sample-2 10148
## Sample-3 Sample-3 7131
## Sample-4 Sample-4 10855
## Sample-5 Sample-5 12000
## Sample-6 Sample-6 7914
Melting the phyloseq data for easier plotting
df <- psmelt(pseq)
kable(head(df))
Row | OTU | Sample | Abundance | age | sex | nationality | DNA_extraction_method | project | diversity | bmi_group | subject | time | sample | reads_sample | Phylum | Family | Genus |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
110989 | Prevotella melaninogenica et rel. | Sample-448 | 14961 | 54 | female | CentralEurope | o | 18 | 5.98 | lean | 448 | 0 | Sample-448 | 26546 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111380 | Prevotella melaninogenica et rel. | Sample-360 | 14296 | 45 | female | CentralEurope | o | 13 | 5.49 | severeobese | 360 | 0 | Sample-360 | 21086 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111232 | Prevotella melaninogenica et rel. | Sample-190 | 13676 | 34 | female | CentralEurope | r | 7 | 6.06 | lean | 190 | 0 | Sample-190 | 23954 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111553 | Prevotella melaninogenica et rel. | Sample-743 | 13509 | 52 | male | US | NA | 19 | 5.21 | obese | 743 | 0 | Sample-743 | 20632 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
110590 | Prevotella melaninogenica et rel. | Sample-366 | 13490 | 52 | female | CentralEurope | o | 15 | 5.63 | obese | 366 | 0 | Sample-366 | 19651 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
111029 | Prevotella melaninogenica et rel. | Sample-375 | 13384 | 45 | female | CentralEurope | o | 16 | 5.64 | severeobese | 375 | 0 | Sample-375 | 21408 | Bacteroidetes | Bacteroidetes | Prevotella melaninogenica et rel. |
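The long ("melted") layout shown above can be illustrated with base R on a tiny made-up example. The objects `toy_counts` and `toy_meta` are hypothetical, and this sketch only shows the shape of the result (one row per taxon and sample pair, with the sample metadata attached); it is not how psmelt is implemented.

```r
# Hypothetical toy counts: 2 taxa x 2 samples, with named dimensions
toy_counts <- matrix(c(10, 30, 5, 5), nrow = 2,
                     dimnames = list(OTU = c("TaxonA", "TaxonB"),
                                     Sample = c("S1", "S2")))

# One row per (OTU, Sample) pair, with the count in the "Abundance" column
toy_long <- as.data.frame(as.table(toy_counts),
                          responseName = "Abundance",
                          stringsAsFactors = FALSE)

# Hypothetical sample metadata, attached by sample name
toy_meta <- data.frame(Sample = c("S1", "S2"),
                       sex = c("female", "male"))
toy_melted <- merge(toy_long, toy_meta, by = "Sample")

toy_melted
```

A table in this layout can be passed directly to plotting functions that expect one observation per row.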
Sample operations
Sample names and variables
head(sample_names(pseq))
## [1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"
Total OTU abundance in each sample
s <- sample_sums(pseq)
Abundance of a given species in each sample
head(abundances(pseq)["Akkermansia",])
## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6
## 21 36 475 61 34 14
Select a subset by metadata fields:
pseq.subset <- subset_samples(pseq, nationality == "CentralEurope")
Select a subset by providing sample names:
# Check sample names for Central European females in this phyloseq object
s <- rownames(subset(meta(pseq), nationality == "CentralEurope" & sex == "female"))
# Pick the phyloseq subset with these sample names
pseq.subset2 <- prune_samples(s, pseq)
Pick samples at the baseline time points only:
pseq0 <- baseline(pseq)
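Data transformations
The microbiome package provides a wrapper for standard sample/OTU transformations. For arbitrary transformations, use the transform_sample_counts function in the phyloseq package. The log10 transformation is computed as log(1+x) if the data contains zeros. "Z", "clr", "hellinger", and "shift" are also available as common transformations. Relative abundances (note that the input data must be on the absolute scale, not logarithmic!):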
pseq.compositional <- microbiome::transform(pseq, "compositional")
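As a rough illustration of two of the transformations named above, the base R sketch below applies a log10(1+x) shift and a Hellinger transformation to a made-up count matrix. `toy_counts` is hypothetical, and the Hellinger step follows the common textbook definition (square root of the relative abundances), which is assumed here rather than taken from the package source.

```r
# Hypothetical toy counts containing zeros: 3 taxa x 2 samples
toy_counts <- matrix(c(0, 9, 99,
                       4, 0, 12),
                     nrow = 3,
                     dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                     c("S1", "S2")))

# log10(1 + x): zeros map to 0 instead of -Inf
toy_log10 <- log10(1 + toy_counts)

# Hellinger (textbook definition): square root of relative abundances
toy_relative <- sweep(toy_counts, 2, colSums(toy_counts), "/")
toy_hellinger <- sqrt(toy_relative)

toy_log10
toy_hellinger
```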
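The CLR ("clr") transformation is also available; it comes with a pseudocount to avoid zeros. An alternative is to impute the unobserved values behind the zeros. Multiplicative Kaplan-Meier smoothing spline (KMSS) replacement, multiplicative lognormal replacement, or multiplicative simple replacement are sometimes used. These are available in the zCompositions R package (functions multKM, multLN, and multRepl, respectively). Use at least n.draws = 1000 in practice; fewer are used here to speed up the example.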
data(dietswap)
x <- dietswap
# Compositional data
x2 <- microbiome::transform(x, "compositional")
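The idea behind a CLR transformation with a pseudocount can be sketched in base R as follows. `toy_counts` is a made-up matrix and the pseudocount of 1 is an assumption chosen for illustration; the microbiome package may pick its pseudocount differently. Within each sample, the log of every value is centered by the mean of the logs, so each transformed column sums to zero.

```r
# Hypothetical toy counts containing zeros: 3 taxa x 2 samples
toy_counts <- matrix(c(0, 9, 99,
                       4, 0, 12),
                     nrow = 3,
                     dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                     c("S1", "S2")))

# Assumed pseudocount so that log() is defined for zero counts
pseudocount <- 1
log_shifted <- log(toy_counts + pseudocount)

# Center each sample (column) by the mean of its log values
toy_clr <- sweep(log_shifted, 2, colMeans(log_shifted), "-")

# Each CLR-transformed sample sums to (numerically) zero
colSums(toy_clr)
```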
Variable operations
Sample variable names
sample_variables(pseq)
## [1] "age" "sex" "nationality"
## [4] "DNA_extraction_method" "project" "diversity"
## [7] "bmi_group" "subject" "time"
## [10] "sample" "reads_sample"
Pick the values of a given variable
head(get_variable(pseq, sample_variables(pseq)[1]))
## [1] 28 24 52 22 25 42
Assign new fields to the metadata
# Calculate diversity for samples
div <- microbiome::alpha(pseq, index = "shannon")
# Assign the estimated diversity to sample metadata
# alpha() returns a data frame; take its single column (the Shannon index)
sample_data(pseq)$diversity <- div[[1]]
Taxa operations
Number of taxa
n <- ntaxa(pseq)
Most abundant taxa
topx <- top_taxa(pseq, n = 10)
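One simple way to think about "most abundant" is to rank taxa by their total abundance across samples. The base R sketch below does this on a made-up matrix; `toy_counts` is hypothetical, and ranking by the row sum is an assumption for illustration, since the source does not state the exact criterion that top_taxa uses.

```r
# Hypothetical toy counts: 3 taxa x 2 samples
toy_counts <- matrix(c(1, 50, 7,
                       2, 40, 9),
                     nrow = 3,
                     dimnames = list(c("GenusA", "GenusB", "GenusC"),
                                     c("S1", "S2")))

# Total abundance of each taxon across all samples
totals <- rowSums(toy_counts)

# Names of the two taxa with the largest totals
toy_top <- names(sort(totals, decreasing = TRUE))[1:2]
toy_top
```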
Names
ranks <- rank_names(pseq) # Taxonomic levels
taxa <- taxa(pseq) # Taxa names at the analysed level
Subset taxa
pseq.bac <- subset_taxa(pseq, Phylum == "Bacteroidetes")
Prune (select) taxa by given criteria
# List of genera in the Bacteroidetes phylum
taxa <- map_levels(NULL, "Phylum", "Genus", pseq)$Bacteroidetes
# With given taxon names
ex2 <- prune_taxa(taxa, pseq)
# Taxa with positive sum across samples
ex3 <- prune_taxa(taxa_sums(pseq) > 0, pseq)
Filter by a user-specified function of the values (here the variance).
f <- filter_taxa(pseq, function(x) var(x) > 1e-05, TRUE)
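The logic of this kind of filter can be illustrated on a plain matrix: compute a statistic for each taxon (row) and keep the rows that pass the threshold. In the base R sketch below, `toy_rel` is a made-up matrix of relative abundances; the threshold 1e-05 is the one used in the call above.

```r
# Hypothetical relative abundances: 3 taxa x 3 samples
toy_rel <- matrix(c(0.50, 0.50, 0.50,
                    0.10, 0.40, 0.25,
                    0.40, 0.10, 0.25),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(c("Constant", "VariableA", "VariableB"),
                                  c("S1", "S2", "S3")))

# Variance of each taxon across samples
taxon_var <- apply(toy_rel, 1, var)

# Keep only taxa whose variance exceeds the threshold
keep <- taxon_var > 1e-05
toy_filtered <- toy_rel[keep, , drop = FALSE]

rownames(toy_filtered)
```

The taxon with identical abundance in every sample has zero variance and is dropped, while the two varying taxa are kept.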
List the unique phylum-level groups.
head(get_taxa_unique(pseq, "Phylum"))
## [1] "Actinobacteria" "Firmicutes" "Proteobacteria" "Verrucomicrobia"
## [5] "Bacteroidetes" "Spirochaetes"
Pick the taxa abundances for a given sample
samplename <- sample_names(pseq)[[1]]
# Pick abundances of all taxa for this particular sample
tax.abundances <- abundances(pseq)[, samplename]
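Merging operations
Aggregate taxa to higher taxonomic levels. This is particularly useful if the phylogenetic tree is missing (see also the phyloseq functions merge_samples, merge_taxa, and tax_glom).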
pseq2 <- aggregate_taxa(pseq, "Phylum")
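Conceptually, aggregating to a higher level means summing the abundances of all taxa that map to the same higher-level group. The base R sketch below shows this with made-up data; `toy_counts` and the genus-to-phylum lookup `toy_phylum` are hypothetical, and the code is an illustration rather than the implementation of aggregate_taxa.

```r
# Hypothetical toy counts: 3 genera x 2 samples
toy_counts <- matrix(c(1, 2, 3,
                       4, 5, 6),
                     nrow = 3,
                     dimnames = list(c("GenusA", "GenusB", "GenusC"),
                                     c("S1", "S2")))

# Hypothetical taxonomy: which phylum each genus belongs to
toy_phylum <- c(GenusA = "PhylumX", GenusB = "PhylumX", GenusC = "PhylumY")

# Sum the rows that share the same phylum
toy_aggregated <- rowsum(toy_counts, group = toy_phylum[rownames(toy_counts)])

toy_aggregated
```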
Merge the desired taxa into a single category. Here, we merge all Bacteroides groups into a single group named Bacteroides.
pseq2 <- merge_taxa2(pseq, pattern = "^Bacteroides", name = "Bacteroides")
Merging phyloseq objects with the phyloseq package
# pseqA and pseqB stand for any two phyloseq objects to be merged
merge_phyloseq(pseqA, pseqB)
Joining the OTU/ASV table and the taxonomy in one data frame
library(dplyr)
library(microbiome)
data("atlas1006") # example data from microbiome pkg
x <- atlas1006
asv_tab <- as.data.frame(abundances(x)) # get ASV/OTU abundances
asv_tab$asv_id <- rownames(asv_tab) # add a column of taxon IDs to join on
# tax_tab <- as.data.frame(tax_table(x)) # direct conversion; can be slow
tax_tab <- as(x@tax_table, "matrix") # get the taxonomy table as a matrix
tax_tab <- as.data.frame(tax_tab) # convert to a data frame
tax_tab$asv_id <- rownames(tax_tab) # add the same ID column
asv_tax_tab <- tax_tab %>%
  left_join(asv_tab, by = "asv_id") # join the taxonomy with the abundance table
head(asv_tax_tab)[,1:8]
## Phylum Family Genus
## 1 Actinobacteria Actinobacteria Actinomycetaceae
## 2 Firmicutes Bacilli Aerococcus
## 3 Proteobacteria Proteobacteria Aeromonas
## 4 Verrucomicrobia Verrucomicrobia Akkermansia
## 5 Proteobacteria Proteobacteria Alcaligenes faecalis et rel.
## 6 Bacteroidetes Bacteroidetes Allistipes et rel.
## asv_id Sample-1 Sample-2 Sample-3 Sample-4
## 1 Actinomycetaceae 0 0 0 0
## 2 Aerococcus 0 0 0 0
## 3 Aeromonas 0 0 0 0
## 4 Akkermansia 21 36 475 61
## 5 Alcaligenes faecalis et rel. 1 1 1 2
## 6 Allistipes et rel. 72 127 34 344
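For readers who prefer not to depend on dplyr, the same kind of left join can be sketched with base R's merge(). The two small data frames below are made up for illustration; `all.x = TRUE` keeps every row of the taxonomy table, as a left join does. Note that merge() sorts the result by the key column, so the row order can differ from that of left_join.

```r
# Hypothetical taxonomy and abundance tables sharing an "asv_id" key
toy_tax <- data.frame(asv_id = c("GenusA", "GenusB"),
                      Phylum = c("PhylumX", "PhylumY"))
toy_abund <- data.frame(asv_id = c("GenusB", "GenusA"),
                        S1 = c(7, 3))

# Left join: keep all taxonomy rows, attach the matching abundances
toy_joined <- merge(toy_tax, toy_abund, by = "asv_id", all.x = TRUE)

toy_joined
```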
Rarefaction
pseq.rarified <- rarefy_even_depth(pseq)
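The idea of rarefying to an even depth can be sketched in base R: every sample is randomly subsampled down to the read count of the smallest sample. The matrix `toy_counts` and the helper `rarefy_column` are hypothetical names introduced for this illustration. The sketch draws reads without replacement, which is an assumption; rarefy_even_depth has its own arguments controlling the depth, the random seed, and the sampling scheme, so consult its documentation for the actual behavior.

```r
set.seed(42)  # fixed seed so the subsampling is reproducible

# Hypothetical toy counts: 3 taxa x 2 samples with unequal depth
toy_counts <- matrix(c(50, 30, 20,
                        5, 10,  5),
                     nrow = 3,
                     dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                                     c("S1", "S2")))

# Target depth: the smallest sample total
depth <- min(colSums(toy_counts))

# Subsample one sample (a vector of counts) to the given depth
rarefy_column <- function(counts, depth) {
  reads <- rep(seq_along(counts), times = counts)  # one entry per read
  kept <- reads[sample.int(length(reads), depth)]  # draw without replacement
  tabulate(kept, nbins = length(counts))           # count reads per taxon
}

toy_rarefied <- apply(toy_counts, 2, rarefy_column, depth = depth)
rownames(toy_rarefied) <- rownames(toy_counts)

# All samples now have the same total
colSums(toy_rarefied)
```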
Taxonomy
Convert between taxonomic levels (here from the genus Akkermansia to the phylum Verrucomicrobia).
m <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(pseq))
print(m)
## [1] "Verrucomicrobia"
Metadata
Visualize the frequencies of the levels of a given factor (sex) within the indicated groups (here bmi_group).
p <- plot_frequencies(sample_data(pseq), "bmi_group", "sex")
print(p)
# Retrieving the actual data values:
# kable(head(p@data), digits = 2)
Custom functions are provided to cut age or BMI information into discrete classes.
group_bmi(c(22, 28, 31), "standard")
## [1] lean overweight obese
## Levels: underweight lean overweight obese severe morbid super
group_age(c(17, 41, 102), "decades")
## [1] [10,20) [40,50) [100,110]
## 10 Levels: [10,20) [20,30) [30,40) [40,50) [50,60) [60,70) [70,80) ... [100,110]
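The decade grouping shown above can be approximated with base R's cut(). This is a sketch under the assumption that the classes are half-open ten-year intervals from 10 to 110, as the printed levels suggest; it is not claimed to be how group_age is implemented, and `toy_age` is a made-up input vector.

```r
# Hypothetical ages to be binned into decades
toy_age <- c(17, 41, 102)

# Half-open intervals [10,20), [20,30), ...; the last interval is closed
age_decades <- cut(toy_age,
                   breaks = seq(10, 110, by = 10),
                   right = FALSE,
                   include.lowest = TRUE)

as.character(age_decades)
nlevels(age_decades)
```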
Add metadata to a phyloseq object. For reproducibility, this example simply reuses the existing metadata, but it can be replaced by another data.frame (samples x fields).
# Example data
data(dietswap)
pseq <- dietswap
# Pick the existing metadata from a phyloseq object
# (or retrieve this from another source)
df <- meta(pseq)
# Merge the metadata back in the phyloseq object
pseq2 <- merge_phyloseq(pseq, sample_data(df))