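Microbiome data preprocessing with the microbiome package

This post shows how to manipulate microbiome data sets with tools from the phyloseq package and some extensions from the microbiome package, including subsetting, aggregating and filtering.

Load the data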

library(phyloseq)
library(microbiome)
library(knitr)
data(atlas1006)   
# Rename the example data (which is a phyloseq object)
pseq <- atlas1006

Summarize the contents of the phyloseq object

summarize_phyloseq(pseq)
## Compositional = NO
## 1] Min. number of reads = 1900
## 2] Max. number of reads = 28883
## 3] Total number of reads = 13546564
## 4] Average number of reads = 11769.3866203301
## 5] Median number of reads = 11171
## 7] Sparsity = 0.209002205440086
## 6] Any OTU sum to 1 or less? NO
## 8] Number of singletons = 0
## 9] Percent of OTUs that are singletons
##         (i.e. exactly one read detected across all samples) 0
## 10] Number of sample variables are: 10
## age sex nationality DNA_extraction_method project diversity bmi_group subject time sample
## [[1]]
## [1] "1] Min. number of reads = 1900"
## 
## [[2]]
## [1] "2] Max. number of reads = 28883"
## 
## [[3]]
## [1] "3] Total number of reads = 13546564"
## 
## [[4]]
## [1] "4] Average number of reads = 11769.3866203301"
## 
## [[5]]
## [1] "5] Median number of reads = 11171"
## 
## [[6]]
## [1] "7] Sparsity = 0.209002205440086"
## 
## [[7]]
## [1] "6] Any OTU sum to 1 or less? NO"
## 
## [[8]]
## [1] "8] Number of singletons = 0"
## 
## [[9]]
## [1] "9] Percent of OTUs that are singletons \n        (i.e. exactly one read detected across all samples)0"
## 
## [[10]]
## [1] "10] Number of sample variables are: 10"
## 
## [[11]]
##  [1] "age"                   "sex"                   "nationality"          
##  [4] "DNA_extraction_method" "project"               "diversity"            
##  [7] "bmi_group"             "subject"               "time"                 
## [10] "sample"
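
The headline numbers in this summary can be reproduced by hand from a taxa x samples count matrix. The sketch below uses a small made-up matrix (`counts` is hypothetical, not atlas1006). It assumes that sparsity means the fraction of zero cells and that a singleton is a taxon with exactly one read across all samples, as the wording of the output suggests; the package may compute these differently.

```r
# Toy taxa x samples count matrix (hypothetical values, not atlas1006)
counts <- matrix(
  c(10, 0, 5,
     0, 0, 1,
     3, 7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("TaxonA", "TaxonB", "TaxonC"),
                  c("S1", "S2", "S3"))
)

reads <- colSums(counts)        # total reads per sample
min_reads    <- min(reads)
max_reads    <- max(reads)
total_reads  <- sum(reads)
mean_reads   <- mean(reads)
median_reads <- median(reads)

# Assumption: sparsity = share of zero cells in the matrix
sparsity <- mean(counts == 0)

# Assumption: singleton = taxon whose reads sum to exactly 1
n_singletons <- sum(rowSums(counts) == 1)
```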

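Retrieving data elements from a phyloseq object

A phyloseq object contains an OTU table (taxa abundances), sample metadata, a taxonomy table (the mapping between OTUs and higher-level taxonomic groups) and a phylogenetic tree (the relations between taxa). Some of these components are optional.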
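
To make these components concrete, here is a minimal sketch that assembles a phyloseq object from three toy inputs. All names and values (`otu_mat`, `tax_mat`, `meta_df`, `toy`) are hypothetical, and the optional phylogenetic tree is left out.

```r
library(phyloseq)

# Hypothetical toy inputs: 3 taxa measured in 2 samples
otu_mat <- matrix(c(5, 0, 2, 8, 1, 4), nrow = 3,
                  dimnames = list(c("T1", "T2", "T3"), c("S1", "S2")))
tax_mat <- matrix(c("Firmicutes", "Firmicutes", "Bacteroidetes",
                    "GenusA", "GenusB", "GenusC"),
                  nrow = 3,
                  dimnames = list(c("T1", "T2", "T3"), c("Phylum", "Genus")))
meta_df <- data.frame(group = c("case", "control"),
                      row.names = c("S1", "S2"))

# Combine OTU table, taxonomy table and sample metadata
toy <- phyloseq(otu_table(otu_mat, taxa_are_rows = TRUE),
                tax_table(tax_mat),
                sample_data(meta_df))

ntaxa(toy)      # number of taxa
nsamples(toy)   # number of samples
```

Taxa names must match between the OTU and taxonomy tables, and sample names must match between the OTU table and the metadata.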

Pick the metadata

meta <- meta(pseq)

Taxonomy table

taxonomy <- tax_table(pseq)

The abundances of the taxonomic groups (the "OTU table") as a taxa x samples matrix.

# Absolute abundances
otu.absolute <- abundances(pseq)

# Relative abundances
otu.relative <- abundances(pseq, "compositional")

Total read counts per sample

reads_sample <- readcount(pseq)
# check for first 5 samples
reads_sample[1:5]
## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 
##     7593    10148     7131    10855    12000

Add the per-sample read counts to the metadata of the phyloseq object.

sample_data(pseq)$reads_sample <- reads_sample

# reads_sample is added as the last column in sample_data of the pseq object.
head(meta(pseq)[,c("sample", "reads_sample")])
##            sample reads_sample
## Sample-1 Sample-1         7593
## Sample-2 Sample-2        10148
## Sample-3 Sample-3         7131
## Sample-4 Sample-4        10855
## Sample-5 Sample-5        12000
## Sample-6 Sample-6         7914

Melt the phyloseq data for easier plotting

df <- psmelt(pseq)
kable(head(df))
|       |OTU                               |Sample     | Abundance| age|sex    |nationality   |DNA_extraction_method |project | diversity|bmi_group   |subject | time|sample     | reads_sample|Phylum        |Family        |Genus                             |
|:------|:---------------------------------|:----------|---------:|---:|:------|:-------------|:---------------------|:-------|---------:|:-----------|:-------|----:|:----------|------------:|:-------------|:-------------|:---------------------------------|
|110989 |Prevotella melaninogenica et rel. |Sample-448 |     14961|  54|female |CentralEurope |o                     |18      |      5.98|lean        |448     |    0|Sample-448 |        26546|Bacteroidetes |Bacteroidetes |Prevotella melaninogenica et rel. |
|111380 |Prevotella melaninogenica et rel. |Sample-360 |     14296|  45|female |CentralEurope |o                     |13      |      5.49|severeobese |360     |    0|Sample-360 |        21086|Bacteroidetes |Bacteroidetes |Prevotella melaninogenica et rel. |
|111232 |Prevotella melaninogenica et rel. |Sample-190 |     13676|  34|female |CentralEurope |r                     |7       |      6.06|lean        |190     |    0|Sample-190 |        23954|Bacteroidetes |Bacteroidetes |Prevotella melaninogenica et rel. |
|111553 |Prevotella melaninogenica et rel. |Sample-743 |     13509|  52|male   |US            |NA                    |19      |      5.21|obese       |743     |    0|Sample-743 |        20632|Bacteroidetes |Bacteroidetes |Prevotella melaninogenica et rel. |
|110590 |Prevotella melaninogenica et rel. |Sample-366 |     13490|  52|female |CentralEurope |o                     |15      |      5.63|obese       |366     |    0|Sample-366 |        19651|Bacteroidetes |Bacteroidetes |Prevotella melaninogenica et rel. |
|111029 |Prevotella melaninogenica et rel. |Sample-375 |     13384|  45|female |CentralEurope |o                     |16      |      5.64|severeobese |375     |    0|Sample-375 |        21408|Bacteroidetes |Bacteroidetes |Prevotella melaninogenica et rel. |

Sample operations

Sample names and variables

head(sample_names(pseq))
## [1] "Sample-1" "Sample-2" "Sample-3" "Sample-4" "Sample-5" "Sample-6"

Total OTU abundance in each sample

s <- sample_sums(pseq)

Abundance of a given taxon (here the genus Akkermansia) in each sample

head(abundances(pseq)["Akkermansia",])
## Sample-1 Sample-2 Sample-3 Sample-4 Sample-5 Sample-6 
##       21       36      475       61       34       14

Select a subset by metadata fields:

# Use a value that occurs in the atlas1006 metadata, e.g. "CentralEurope"
pseq.subset <- subset_samples(pseq, nationality == "CentralEurope")

Select a subset by providing sample names:

# Check sample names for Central European females in this phyloseq object
# (atlas1006 codes sex in lowercase: "female" / "male")
s <- rownames(subset(meta(pseq), nationality == "CentralEurope" & sex == "female"))

# Pick the phyloseq subset with these sample names
pseq.subset2 <- prune_samples(s, pseq)

Pick samples at the baseline time points only:

pseq0 <- baseline(pseq)

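Data transformations

The microbiome package provides a wrapper for standard sample/OTU transformations. For arbitrary transformations, use the transform_sample_counts function from the phyloseq package. The log10 transformation is applied as log10(1 + x) when the data contain zeros. "Z", "clr", "hellinger" and "shift" are also available as common transformations. To compute relative abundances (note that the input data must be on the absolute scale, not logarithmic):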

pseq.compositional <- microbiome::transform(pseq, "compositional")
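
To show what these transformations do numerically, here is a base R sketch on a toy taxa x samples matrix. The matrix `counts` is hypothetical, and the formulas are the textbook definitions (relative abundance, log10(1 + x), and Hellinger as the square root of relative abundance); they are assumed, not verified, to match what microbiome::transform computes internally.

```r
# Toy taxa x samples counts (hypothetical values)
counts <- matrix(c(10, 0, 30, 5, 5, 0), nrow = 3,
                 dimnames = list(c("T1", "T2", "T3"), c("S1", "S2")))

# Compositional: divide each sample (column) by its total
rel <- sweep(counts, 2, colSums(counts), "/")

# Log10 with a shift of 1 so that zeros stay finite
log10p <- log10(1 + counts)

# Hellinger (textbook definition): square root of relative abundance
hell <- sqrt(rel)

colSums(rel)   # every sample sums to 1
```

For a phyloseq object, an equivalent per-sample function such as function(x) x / sum(x) could be passed to transform_sample_counts.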

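The CLR ("clr") transformation is also available; it comes with a pseudocount to avoid zeros. An alternative is to impute the zero-inflated unobserved values. Commonly used options are multiplicative Kaplan-Meier smoothing spline (KMSS) replacement, multiplicative lognormal replacement and multiplicative simple replacement. These are available in the zCompositions R package (functions multKM, multLN and multRepl, respectively). When using them, set n.draws to at least 1000 in practice; smaller values are only appropriate for speeding up examples.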

data(dietswap)
x <- dietswap
# Compositional data 
x2 <- microbiome::transform(x, "compositional")
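
The CLR idea mentioned above can be sketched in base R: add a pseudocount, take logs, and subtract each sample's mean log value. The matrix `counts`, the helper `clr_one` and the pseudocount of 1 are illustrative assumptions; the package may choose its pseudocount differently.

```r
# Toy taxa x samples counts (hypothetical values)
counts <- matrix(c(10, 0, 30, 5, 5, 0), nrow = 3,
                 dimnames = list(c("T1", "T2", "T3"), c("S1", "S2")))

pseudo <- 1  # assumed pseudocount, chosen only for illustration

# CLR for one sample: log values centred on their mean
clr_one <- function(x) {
  lx <- log(x + pseudo)
  lx - mean(lx)
}

# Apply per sample (column); result keeps the taxa x samples layout
clr <- apply(counts, 2, clr_one)

colSums(clr)  # zero for every sample, up to rounding error
```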

Variable operations

Variable names

sample_variables(pseq)
##  [1] "age"                   "sex"                   "nationality"          
##  [4] "DNA_extraction_method" "project"               "diversity"            
##  [7] "bmi_group"             "subject"               "time"                 
## [10] "sample"                "reads_sample"

Pick a specific variable

head(get_variable(pseq, sample_variables(pseq)[1]))
## [1] 28 24 52 22 25 42

Assign new fields to the metadata

# Calculate diversity for samples
div <- microbiome::alpha(pseq, index = "shannon")

# Assign the estimated diversity to sample metadata
# (alpha() returns a data.frame, so pick the Shannon column as a vector)
sample_data(pseq)$diversity <- div$diversity_shannon

Taxa operations

Number of taxa

n <- ntaxa(pseq)

Most abundant taxa

topx <- top_taxa(pseq, n = 10)

Names

ranks <- rank_names(pseq)  # Taxonomic levels
taxa  <- taxa(pseq)        # Taxa names at the analysed level

Subset

pseq.bac <- subset_taxa(pseq, Phylum == "Bacteroidetes")

Pick taxa that meet given criteria

# List of genera in the Bacteroidetes phylum
taxa <- map_levels(NULL, "Phylum", "Genus", pseq)$Bacteroidetes

# With given taxon names
ex2 <- prune_taxa(taxa, pseq)

# Taxa with positive sum across samples
ex3 <- prune_taxa(taxa_sums(pseq) > 0, pseq)

Filter by the value of a user-specified function (here, variance).

f <- filter_taxa(pseq, function(x) var(x) > 1e-05, TRUE)
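
The same filtering logic can be followed on a plain matrix: compute the statistic per taxon (row) and keep the rows that pass. The matrix `counts` below is hypothetical. Note that a threshold as small as 1e-05 separates taxa more usefully on relative abundances than on raw counts, where almost any non-constant taxon passes.

```r
# Toy taxa x samples counts (hypothetical values)
counts <- matrix(
  c(4,  4, 4, 4,
    0, 10, 2, 8,
    1,  1, 1, 2),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Flat", "Variable", "NearlyFlat"), paste0("S", 1:4))
)

# Variance of each taxon across samples, compared with the threshold
keep <- apply(counts, 1, var) > 1e-05

# Keep only taxa that pass; drop = FALSE preserves the matrix shape
filtered <- counts[keep, , drop = FALSE]
rownames(filtered)
```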

List the unique phylum-level groups.

head(get_taxa_unique(pseq, "Phylum"))
## [1] "Actinobacteria"  "Firmicutes"      "Proteobacteria"  "Verrucomicrobia"
## [5] "Bacteroidetes"   "Spirochaetes"

Pick the taxa abundances for a given sample

samplename <- sample_names(pseq)[[1]]

# Pick abundances of all taxa for this particular sample
tax.abundances <- abundances(pseq)[, samplename]
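
Because the abundance matrix is taxa x samples, the row index selects a taxon and the column index selects a sample. A small sketch with a hypothetical matrix makes the two directions explicit:

```r
# Toy taxa x samples matrix (hypothetical values)
counts <- matrix(1:6, nrow = 2,
                 dimnames = list(c("Akkermansia", "Dialister"),
                                 c("Sample-1", "Sample-2", "Sample-3")))

# One taxon across all samples: index the row
per_taxon <- counts["Akkermansia", ]

# All taxa in one sample: index the column
per_sample <- counts[, "Sample-1"]
```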

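Merging operations

Aggregate taxa to higher taxonomic levels. This is particularly useful when the phylogenetic tree is missing. Related phyloseq functions are merge_samples, merge_taxa and tax_glom.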

pseq2 <- aggregate_taxa(pseq, "Phylum") 
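
Conceptually, aggregating to a higher level means summing the counts of all taxa that share the same label at that level. A base R sketch with hypothetical genera and phylum assignments (`counts` and `phylum` are made up for illustration):

```r
# Toy genus-level counts (hypothetical values)
counts <- matrix(
  c(5, 1,
    3, 2,
    7, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("GenusA", "GenusB", "GenusC"), c("S1", "S2"))
)

# Hypothetical genus -> phylum mapping
phylum <- c(GenusA = "Firmicutes",
            GenusB = "Firmicutes",
            GenusC = "Bacteroidetes")

# Sum the rows that belong to the same phylum
by_phylum <- rowsum(counts, group = phylum[rownames(counts)])
by_phylum
```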

Merge selected taxa into a single category, such as "Other". Here we merge all Bacteroides groups into a single group named Bacteroides.

pseq2 <- merge_taxa2(pseq, pattern = "^Bacteroides", name = "Bacteroides") 
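
The pattern-based merge can be pictured on a plain matrix: find the rows whose names match the regular expression, sum them, and replace them with one row. The taxon names and counts below are illustrative only.

```r
# Toy taxa x samples counts (illustrative names and values)
counts <- matrix(
  c(2, 4,
    1, 1,
    9, 3),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Bacteroides fragilis et rel.",
                    "Bacteroides ovatus et rel.",
                    "Akkermansia"),
                  c("S1", "S2"))
)

# Rows whose name starts with "Bacteroides"
hit <- grepl("^Bacteroides", rownames(counts))

# Keep the other rows and append the summed group as a single row
merged <- rbind(counts[!hit, , drop = FALSE],
                Bacteroides = colSums(counts[hit, , drop = FALSE]))
merged
```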

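Merge phyloseq objects with the phyloseq package

# pseqA and pseqB are placeholders for two phyloseq objects to be combined
merge_phyloseq(pseqA, pseqB)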

Join the OTU/ASV table and the taxonomy in a single data frame

library(dplyr) 
library(microbiome)
data("atlas1006") # example data from microbiome pkg
x <- atlas1006

asv_tab <- as.data.frame(abundances(x)) # get asvs/otus
asv_tab$asv_id <- rownames(asv_tab) # add a new column for ids
#tax_tab <- as.data.frame(tax_table(x)) # get taxonomy note: can be slow
tax_tab <- as(x@tax_table,"matrix") # get taxonomy note as matrix
tax_tab <- as.data.frame(tax_tab) # convert to data frame
tax_tab$asv_id <- rownames(tax_tab) # add a new column for ids
asv_tax_tab <- tax_tab %>% 
  left_join(asv_tab, by="asv_id") # join to get taxonomy and asv table

head(asv_tax_tab)[,1:8]
##            Phylum          Family                        Genus
## 1  Actinobacteria  Actinobacteria             Actinomycetaceae
## 2      Firmicutes         Bacilli                   Aerococcus
## 3  Proteobacteria  Proteobacteria                    Aeromonas
## 4 Verrucomicrobia Verrucomicrobia                  Akkermansia
## 5  Proteobacteria  Proteobacteria Alcaligenes faecalis et rel.
## 6   Bacteroidetes   Bacteroidetes           Allistipes et rel.
##                         asv_id Sample-1 Sample-2 Sample-3 Sample-4
## 1             Actinomycetaceae        0        0        0        0
## 2                   Aerococcus        0        0        0        0
## 3                    Aeromonas        0        0        0        0
## 4                  Akkermansia       21       36      475       61
## 5 Alcaligenes faecalis et rel.        1        1        1        2
## 6           Allistipes et rel.       72      127       34      344

Rarefaction

pseq.rarified <- rarefy_even_depth(pseq)
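
Rarefaction subsamples every sample down to the same number of reads. The base R sketch below illustrates the idea by drawing reads without replacement to the smallest library size. The matrix `counts` and the helper `rarefy_one` are hypothetical, and rarefy_even_depth has its own options, so its sampling details may differ (for instance whether reads are drawn with replacement).

```r
set.seed(1)  # make the random subsampling reproducible

# Toy taxa x samples counts (hypothetical values)
counts <- matrix(c(50, 30, 20, 500, 300, 200), nrow = 3,
                 dimnames = list(c("T1", "T2", "T3"), c("S1", "S2")))

depth <- min(colSums(counts))  # rarefy to the smallest library size

# Subsample one sample to `depth` reads
rarefy_one <- function(x, depth) {
  reads  <- rep(seq_along(x), times = x)  # one entry per read, labelled by taxon index
  picked <- sample(reads, size = depth)   # draw without replacement
  tabulate(picked, nbins = length(x))     # count the picked reads per taxon
}

rarefied <- apply(counts, 2, rarefy_one, depth = depth)
rownames(rarefied) <- rownames(counts)

colSums(rarefied)  # every sample now has `depth` reads
```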

Taxonomy

Convert between taxonomic levels (here from the genus Akkermansia to the phylum Verrucomicrobia).

m <- map_levels("Akkermansia", "Genus", "Phylum", tax_table(pseq))
print(m)
## [1] "Verrucomicrobia"
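
The conversion is essentially a lookup in the taxonomy table. A base R sketch with a three-row excerpt (values taken from the taxonomy shown earlier in this post; `tax` and `genus_to_phylum` are hypothetical names):

```r
# Small excerpt of a taxonomy table as a data frame
tax <- data.frame(
  Phylum = c("Verrucomicrobia", "Firmicutes", "Bacteroidetes"),
  Genus  = c("Akkermansia", "Aerococcus", "Allistipes et rel."),
  stringsAsFactors = FALSE
)

# Named vector: names are genera, values are phyla
genus_to_phylum <- setNames(tax$Phylum, tax$Genus)

genus_to_phylum[["Akkermansia"]]
```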

Metadata

Visualize the frequencies of the levels of a given factor (sex) within the specified groups (BMI group).

p <- plot_frequencies(sample_data(pseq), "bmi_group", "sex")
print(p)

# Retrieving the actual data values:
# kable(head(p@data), digits = 2)

Custom functions are provided to cut age or BMI information into discrete classes.

group_bmi(c(22, 28, 31), "standard")
## [1] lean       overweight obese     
## Levels: underweight lean overweight obese severe morbid super
group_age(c(17, 41, 102), "decades")
## [1] [10,20)   [40,50)   [100,110]
## 10 Levels: [10,20) [20,30) [30,40) [40,50) [50,60) [60,70) [70,80) ... [100,110]
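
Binning of this kind can be sketched with base R's cut(). The BMI break points below are an assumption (common cut-offs at 18.5, 25, 30, 35, 40 and 45) and are not taken from the package source; they reproduce the classes shown above for these three values only.

```r
bmi <- c(22, 28, 31)

# Assumed break points and labels, for illustration only
bmi_breaks <- c(0, 18.5, 25, 30, 35, 40, 45, Inf)
bmi_labels <- c("underweight", "lean", "overweight",
                "obese", "severe", "morbid", "super")

# right = FALSE makes intervals closed on the left: [18.5, 25)
bmi_group <- cut(bmi, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)

# Ages in decades: [10,20), [20,30), ..., [100,110]
age <- c(17, 41, 102)
age_group <- cut(age, breaks = seq(10, 110, by = 10),
                 right = FALSE, include.lowest = TRUE)
```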

Add metadata to a phyloseq object. For reproducibility, this example simply reuses the existing metadata, but it could be replaced by another data.frame (samples x fields).

# Example data
data(dietswap)
pseq <- dietswap

# Pick the existing metadata from a phyloseq object
# (or retrieve this from another source)
df <- meta(pseq)

# Merge the metadata back in the phyloseq object
pseq2 <- merge_phyloseq(pseq, sample_data(df))

https://microbiome.github.io/tutorials/Preprocessing.html

Reposted from blog.csdn.net/qq_42458954/article/details/115719759