Principal Component Analysis (PCA) is an unsupervised dimensionality reduction method that handles high-dimensional data effectively. However, PCA is insensitive to variables with low correlation, a problem that the supervised method PLS-DA (Partial Least Squares Discriminant Analysis) addresses. OPLS-DA (Orthogonal Partial Least Squares Discriminant Analysis) goes one step further and combines orthogonal signal correction with PLS-DA to screen for differential variables.
This analysis is mainly used to screen for differential metabolites in metabolomics.
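To make the idea of an "orthogonal" component concrete, here is a minimal base R sketch of a single OPLS step on simulated data. It is an illustration under stated assumptions, not the code that ropls runs internally: the data, the variable names (`w`, `t_o`, `X_filtered`, and so on) and the choice to centre without scaling are all made up for this example.

```r
set.seed(42)
n <- 60
y <- rep(c(0, 1), each = n / 2)          # dummy-coded class labels (two groups)
X <- matrix(rnorm(n * 5), nrow = n)      # 5 simulated variables
X[, 1] <- X[, 1] + 2 * y                 # variable 1 carries the class signal
X[, 2] <- X[, 2] + 3 * rnorm(n)          # variable 2 carries large class-unrelated variation

# centre X and y (assumption: no unit-variance scaling in this sketch)
Xc <- scale(X, center = TRUE, scale = FALSE)
yc <- y - mean(y)

# predictive weight, score and loading
w <- crossprod(Xc, yc)
w <- w / sqrt(sum(w^2))
t_pred <- Xc %*% w
p <- crossprod(Xc, t_pred) / sum(t_pred^2)

# orthogonal weight: the part of the loading that is not aligned with w
w_o <- p - sum(w * p) * w
w_o <- w_o / sqrt(sum(w_o^2))
t_o <- Xc %*% w_o                        # orthogonal score
p_o <- crossprod(Xc, t_o) / sum(t_o^2)   # orthogonal loading

# remove the orthogonal (class-unrelated) variation from X
X_filtered <- Xc - t_o %*% t(p_o)

# the orthogonal score is uncorrelated with the class labels
cor_orth <- cor(t_o, yc)
print(cor_orth)
```

The orthogonal score captures variation in X that is unrelated to the class labels, so removing it leaves a model whose single predictive component is easier to interpret.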
Data set
Urine samples from 183 adults were analyzed by liquid chromatography coupled to high-resolution mass spectrometry (LTQ Orbitrap).
The sacurine list contains three data matrices:
dataMatrix
The sample-by-metabolite intensity matrix (log10 transformed), which records the level of each metabolite in each sample: 183 samples (rows) and 109 metabolites (columns).
sampleMetadata
The age, body mass index (BMI), and gender of the individuals from whom the 183 samples were taken.
variableMetadata
Annotation details for the 109 metabolites: MSI level, HMDB identifier, and chemical class.
rm(list = ls())
# load packages
library(ropls)
# load data
data(sacurine)
# inspect the data set
head(sacurine$dataMatrix[, 1:2])
head(sacurine$sampleMetadata)
head(sacurine$variableMetadata)
# extract the gender factor used as the class label
genderFc = sacurine$sampleMetadata[, "gender"]
> head(sacurine$dataMatrix[ ,1:2])
(2-methoxyethoxy)propanoic acid isomer (gamma)Glu-Leu/Ile
HU_011 3.019766 3.888479
HU_014 3.814339 4.277149
HU_015 3.519691 4.195649
HU_017 2.562183 4.323760
HU_018 3.781922 4.629329
HU_019 4.161074 4.412266
> head(sacurine$sampleMetadata)
age bmi gender
HU_011 29 19.75 M
HU_014 59 22.64 F
HU_015 42 22.72 M
HU_017 41 23.03 M
HU_018 34 20.96 M
HU_019 35 23.41 M
OPLS-DA
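# use gender as the grouping variable
# orthoI sets the number of orthogonal components
# with orthoI = NA, OPLS is run and a suitable number of orthogonal components is chosen automatically by cross-validation
oplsda = opls(sacurine$dataMatrix, genderFc, predI = 1, orthoI = NA)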
OPLS-DA
183 samples x 109 variables and 1 response
standard scaling of predictors and response(s)
R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y pQ2
Total 0.275 0.73 0.602 0.262 1 2 0.05 0.05
In the results, R2X and R2Y represent the proportion of the X and Y matrices explained by the model, respectively, and Q2 represents the predictive ability of the model. The closer these values are to 1, the better the model fits and the more accurately the samples in the training set are assigned to their original classes. The pre and ort columns give the number of predictive and orthogonal components, and pR2Y and pQ2 are the p-values of R2Y and Q2 from the permutation test.
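The difference between R2 (fit) and Q2 (prediction) can be sketched in base R. This example swaps the PLS model for ordinary least squares (`lm`) on simulated data, so it only illustrates the two formulas, R2 = 1 - RSS/TSS and Q2 = 1 - PRESS/TSS; the data, the variable names and the choice of 7 folds are assumptions made for the illustration.

```r
set.seed(1)
n <- 70
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- 1.5 * x1 - 0.5 * x2 + rnorm(n)
dat <- data.frame(y, x1, x2)

# total sum of squares around the mean
tss <- sum((dat$y - mean(dat$y))^2)

# R2: fit and evaluate on the same samples
fit <- lm(y ~ x1 + x2, data = dat)
r2 <- 1 - sum(residuals(fit)^2) / tss

# Q2: predict each sample from a model that did not see it (k-fold cross-validation)
k <- 7
fold <- rep(seq_len(k), length.out = n)
press <- 0
for (i in seq_len(k)) {
  train <- dat[fold != i, ]
  test  <- dat[fold == i, ]
  fit_i <- lm(y ~ x1 + x2, data = train)
  press <- press + sum((test$y - predict(fit_i, newdata = test))^2)
}
q2 <- 1 - press / tss

print(c(R2 = r2, Q2 = q2))
```

Q2 cannot exceed R2 here because held-out prediction errors are never smaller than in-sample residuals; a large gap between the two is a common sign of overfitting.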
Inertia bar chart (upper left)
R2Y and Q2Y are shown for each of the 3 components (1 predictive and 2 orthogonal in this model). The cumulative values help assess whether the number of orthogonal components is adequate.
Significance diagnostic (top right)
A scatter plot of the R2Y and Q2Y of the actual model and of models fitted after randomly permuting the response. When the R2Y and Q2Y of the permuted models (points) exceed the true values (horizontal lines), the model is overfitted [2].
Outlier display (bottom left)
The score distance of each sample within the projection plane and its orthogonal distance to that plane are shown. Samples with high values are labeled by name, indicating that they differ more strongly from the other samples. Colors represent gender groups.
x-score plot (bottom right)
The coordinates of each sample on the OPLS-DA axis, with colors representing gender groupings.
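The permutation diagnostic in the top-right panel can be illustrated with a small base R sketch. It uses `lm` on simulated data rather than an OPLS-DA model, and the helper `r2_of`, the number of permutations and the "+1" correction in the p-value are choices made for this example, not a description of how ropls computes pR2Y and pQ2.

```r
set.seed(123)
n <- 40
group <- rep(c(0, 1), each = n / 2)      # dummy-coded class labels
x <- matrix(rnorm(n * 3), nrow = n)      # 3 simulated variables
x[, 1] <- x[, 1] + 1.5 * group           # one variable differs between groups

# hypothetical helper: R2 of a linear model of y on the columns of x
r2_of <- function(y, x) summary(lm(y ~ x))$r.squared

# R2 of the model fitted with the true labels
r2_obs <- r2_of(group, x)

# R2 of models fitted after shuffling the labels
n_perm <- 200
r2_perm <- replicate(n_perm, r2_of(sample(group), x))

# share of permuted models that fit at least as well as the real one
p_r2 <- (1 + sum(r2_perm >= r2_obs)) / (1 + n_perm)
print(p_r2)
```

If shuffled labels can be fitted about as well as the real ones, the apparent fit is not evidence of a genuine group difference.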
Visualization
library(ggplot2)
library(ggsci)
library(tidyverse)
# extract the position of each sample on the OPLS-DA axes
sample.score = oplsda@scoreMN %>% # predictive score matrix
  as.data.frame() %>%
  mutate(gender = sacurine[["sampleMetadata"]][["gender"]],
         o1 = oplsda@orthoScoreMN[, 1]) # first orthogonal score
head(sample.score) # inspect the result
> head(sample.score)
p1 gender o1
HU_011 -1.582933 M -4.9806037
HU_014 1.372806 F -1.7443382
HU_015 -3.341370 M -3.4372771
HU_017 -3.590063 M -0.9794960
HU_018 -1.662716 M 0.3155845
HU_019 -2.312923 M 0.6561281
p <- ggplot(sample.score, aes(p1, o1, color = gender)) +
geom_hline(yintercept = 0, linetype = 'dashed', size = 0.5) + # horizontal dashed line
geom_vline(xintercept = 0, linetype = 'dashed', size = 0.5) +
geom_point() +
#geom_point(aes(-10,-10), color = 'white') +
labs(x = 'P1(5.0%)',y = 'to1') +
stat_ellipse(level = 0.95, linetype = 'solid',
size = 1, show.legend = FALSE) + # add 95% confidence ellipses
scale_color_manual(values = c('#008000','#FFA74F')) +
theme_bw() +
theme(legend.position = c(0.1,0.85),
legend.title = element_blank(),
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
axis.ticks = element_line(color = 'black'))
p
Differential Metabolite Screening
# VIP values help identify the important metabolites
vip <- getVipVn(oplsda)
vip_select <- vip[vip > 1] # VIP > 1 is the usual screening threshold
head(vip_select)
vip_select <- cbind(sacurine$variableMetadata[names(vip_select), ], vip_select)
names(vip_select)[4] <- 'VIP'
vip_select <- vip_select[order(vip_select$VIP, decreasing = TRUE), ]
head(vip_select) # annotated metabolites, filtered by VIP > 1 and sorted by decreasing VIP
> head(vip_select)
msiLevel hmdb chemicalClass
p-Anisic acid 1 HMDB01101 AroHoM
Malic acid 1 HMDB00156 Organi
Testosterone glucuronide 2 HMDB03193 Lipids:Steroi
Pantothenic acid 1 HMDB00210 AliAcy
Acetylphenylalanine 1 HMDB00512 AA-pep
alpha-N-Phenylacetyl-glutamine 1 HMDB06344 AA-pep
VIP
p-Anisic acid 2.533220
Malic acid 2.479289
Testosterone glucuronide 2.421591
Pantothenic acid 2.165296
Acetylphenylalanine 1.988311
alpha-N-Phenylacetyl-glutamine 1.965807
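Why is 1 a natural threshold for VIP? The sketch below computes VIP from a weight matrix using the usual textbook formula, in which the squared VIP values average to exactly 1, so VIP > 1 marks variables with above-average importance. The function name `vip_from_weights`, the random weights `W` and the per-component explained sums of squares `ss` are hypothetical inputs for illustration; this is not the internal code behind `getVipVn`.

```r
# VIP_j = sqrt( p * sum_a( ss_a * w_ja^2 / ||w_a||^2 ) / sum_a(ss_a) )
# W:  p x A matrix of variable weights (one column per component)
# ss: length-A vector, sum of squares of Y explained by each component
vip_from_weights <- function(W, ss) {
  p <- nrow(W)
  W2 <- sweep(W^2, 2, colSums(W^2), "/")   # squared weights, normalized per component
  sqrt(p * as.vector(W2 %*% ss) / sum(ss))
}

set.seed(1)
W <- matrix(rnorm(5 * 2), nrow = 5)        # 5 variables, 2 components (simulated)
ss <- c(3, 1)                              # first component explains more of Y
vip <- vip_from_weights(W, ss)

print(round(vip, 3))
print(mean(vip^2))                         # equals 1 up to rounding error
```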
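# lollipop chart of the differential metabolites
# the metabolite names are too long for the axis, so replace them with short labels
# (a factor keeps the labels in VIP order instead of alphabetical order)
vip_select$cat = paste('A', 1:nrow(vip_select), sep = '')
vip_select$cat = factor(vip_select$cat, levels = vip_select$cat)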
p2 <- ggplot(vip_select, aes(cat, VIP)) +
geom_segment(aes(x = cat, xend = cat,
y = 0, yend = VIP)) +
geom_point(shape = 21, size = 5, color = '#008000' ,fill = '#008000') +
geom_point(aes(1,2.5), color = 'white') +
geom_hline(yintercept = 1, linetype = 'dashed') +
scale_y_continuous(expand = c(0,0)) +
labs(x = '', y = 'VIP value') +
theme_bw() +
theme(legend.position = 'none',
legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
panel.background = element_blank(),
panel.grid = element_blank(),
axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
axis.text.x = element_text(angle = 90),
axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
axis.ticks = element_line(color = 'black'),
axis.ticks.x = element_blank())
p2
References
1. Implementation of OPLS-DA in R Language | Little Blue's Knowledge Wasteland (blog4xiang.world)
2. Partial Least Squares Discriminant Analysis (PLS-DA) and Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) for the R package ropls
3. Analysis of metabolomic data with PLS and OPLS
4. ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data (bioconductor.org)