R in practice | OPLS-DA (Orthogonal Partial Least Squares Discriminant Analysis): screening differential variables by VIP, with visualization

Principal Component Analysis (PCA) is an unsupervised dimensionality reduction method that handles high-dimensional data well. However, PCA is insensitive to variables that are only weakly correlated with the main sources of variance, even when those variables separate the groups. PLS-DA (Partial Least Squares Discriminant Analysis) addresses this by using the class labels during fitting. OPLS-DA (Orthogonal Partial Least Squares Discriminant Analysis) combines orthogonal signal correction with PLS-DA: variation in X that is unrelated to the class labels is moved into orthogonal components, which makes it easier to screen for differential variables.

This analysis is mainly used to screen for differential metabolites in metabolomics.

Data set

Urine samples from 183 adults were analyzed by liquid chromatography high-resolution mass spectrometry (LTQ Orbitrap).

The sacurine list contains three tables:

dataMatrix: the sample-by-metabolite intensity matrix (log10 transformed), recording the level of each metabolite in each sample. It has 183 samples (rows) and 109 metabolites (columns).

sampleMetadata: the age, body mass index (BMI), and gender of the individuals from whom the 183 samples were taken.

variableMetadata: annotation details for the 109 metabolites, including the MSI level.

rm(list = ls())
# load packages
library(ropls)
# load data
data(sacurine)
# inspect the data set
head(sacurine$dataMatrix[ ,1:2])
head(sacurine$sampleMetadata)
head(sacurine$variableMetadata)
# extract the gender factor used as the class label
genderFc = sacurine$sampleMetadata[, "gender"]
> head(sacurine$dataMatrix[ ,1:2])
       (2-methoxyethoxy)propanoic acid isomer (gamma)Glu-Leu/Ile
HU_011                               3.019766           3.888479
HU_014                               3.814339           4.277149
HU_015                               3.519691           4.195649
HU_017                               2.562183           4.323760
HU_018                               3.781922           4.629329
HU_019                               4.161074           4.412266
> head(sacurine$sampleMetadata)
       age   bmi gender
HU_011  29 19.75      M
HU_014  59 22.64      F
HU_015  42 22.72      M
HU_017  41 23.03      M
HU_018  34 20.96      M
HU_019  35 23.41      M

OPLS-DA

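# use gender as the grouping variable
# orthoI sets the number of orthogonal components
# with orthoI = NA, OPLS is run and a suitable number of orthogonal components is chosen automatically by cross-validation
oplsda = opls(sacurine$dataMatrix, genderFc, predI = 1, orthoI = NA)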
OPLS-DA
183 samples x 109 variables and 1 response
standard scaling of predictors and response(s)
      R2X(cum) R2Y(cum) Q2(cum) RMSEE pre ort pR2Y  pQ2
Total    0.275     0.73   0.602 0.262   1   2 0.05 0.05
[Figure: four-panel summary plot produced by opls() for the OPLS-DA model]

In the results, R2X and R2Y are the fractions of variance in the X and Y matrices explained by the model, and Q2 is the model's predictive ability estimated by cross-validation. The closer these values are to 1, the better the model fits and the more accurately the training samples are assigned to their original classes.
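To make the difference between a fit statistic (R2) and a cross-validated one (Q2) concrete, here is a minimal sketch on simulated data. The data, the plain linear model, and the fold assignment are all assumptions for illustration; ropls computes these quantities for (O)PLS components, and the 7 folds only mirror the default crossvalI of opls().

```r
# Toy illustration of R2 (fit) versus Q2 (cross-validated prediction).
# Simulated data and lm() are stand-ins; this is not what ropls runs internally.
set.seed(1)
n <- 60
x <- rnorm(n)
y <- 0.8 * x + rnorm(n, sd = 0.5)

# R2 = 1 - RSS / TSS, using the model fitted on all samples
fit <- lm(y ~ x)
tss <- sum((y - mean(y))^2)
r2  <- 1 - sum(residuals(fit)^2) / tss

# Q2 = 1 - PRESS / TSS, where each sample is predicted by a model
# that did not see it (7-fold cross-validation)
fold  <- sample(rep(1:7, length.out = n))
press <- 0
for (k in 1:7) {
  test  <- fold == k
  fit_k <- lm(y ~ x, data = data.frame(x = x[!test], y = y[!test]))
  pred  <- predict(fit_k, newdata = data.frame(x = x[test]))
  press <- press + sum((y[test] - pred)^2)
}
q2 <- 1 - press / tss

c(R2 = r2, Q2 = q2)
```

Q2 cannot exceed R2 in this setup, because held-out residuals are never smaller than fitted ones. A large gap between the two is a common warning sign of overfitting.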

  • Inertia bar chart (upper left)

    Shows R2Y and Q2Y for the three components of the model (one predictive and two orthogonal). The cumulative values indicate whether the number of orthogonal components is adequate.

  • Significance diagnostic (top right)

    Scatter plot comparing R2Y and Q2Y of the actual model (horizontal lines) with the values obtained after randomly permuting the response labels (points). If the permuted models reach R2Y and Q2Y values above those of the actual model, the model is overfitting (see reference 2).

  • Outlier diagnostics (bottom left)

    Shows each sample's distance within the projection plane (score distance) and orthogonal to it (orthogonal distance). Samples with high values are labeled by name, indicating that they differ markedly from the other samples. Colors represent gender.

  • x-score plot (bottom right)

    The coordinates of each sample on the OPLS-DA axis, with colors representing gender groupings.
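The idea behind the significance diagnostic can be sketched with a toy permutation test. Everything here is an assumption for illustration: the simulated two-class data, the single variable, and the use of lm() in place of OPLS-DA. In ropls the number of permutations is set through the permI argument of opls().

```r
# Toy permutation test: does the observed R2 exceed what shuffled labels give?
set.seed(42)
n     <- 50
group <- rep(c(0, 1), each = n / 2)   # hypothetical two-class response
x     <- group + rnorm(n)             # one variable that differs between classes

# R2 of a simple model predicting the (possibly shuffled) labels from x
r2_of <- function(y) summary(lm(y ~ x))$r.squared

r2_obs  <- r2_of(group)                          # actual labels
r2_perm <- replicate(200, r2_of(sample(group)))  # shuffle labels, refit

# Empirical p-value: share of permuted models doing at least as well
p_value <- (1 + sum(r2_perm >= r2_obs)) / (1 + length(r2_perm))
p_value
```

The smallest p-value such a test can return is limited by the number of permutations, so a handful of permutations can only ever give a coarse estimate.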

Visualization

library(ggplot2)
library(ggsci)
library(tidyverse)
# extract the sample coordinates on the OPLS-DA axes
sample.score = oplsda@scoreMN %>%  # predictive score matrix
  as.data.frame() %>%
  mutate(gender = sacurine[["sampleMetadata"]][["gender"]],
         o1 = oplsda@orthoScoreMN[,1]) # first orthogonal score
head(sample.score) # inspect
> head(sample.score)
              p1 gender         o1
HU_011 -1.582933      M -4.9806037
HU_014  1.372806      F -1.7443382
HU_015 -3.341370      M -3.4372771
HU_017 -3.590063      M -0.9794960
HU_018 -1.662716      M  0.3155845
HU_019 -2.312923      M  0.6561281
p <- ggplot(sample.score, aes(p1, o1, color = gender)) +
  geom_hline(yintercept = 0, linetype = 'dashed', size = 0.5) + # horizontal dashed line
  geom_vline(xintercept = 0, linetype = 'dashed', size = 0.5) +
  geom_point() +
  #geom_point(aes(-10,-10), color = 'white') +
  labs(x = 'P1(5.0%)',y = 'to1') +
  stat_ellipse(level = 0.95, linetype = 'solid', 
               size = 1, show.legend = FALSE) + # add 95% ellipse per group
  scale_color_manual(values = c('#008000','#FFA74F')) +
  theme_bw() +
  theme(legend.position = c(0.1,0.85),
        legend.title = element_blank(),
        legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
        panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
        axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
        axis.ticks = element_line(color = 'black'))
p
[Figure: OPLS-DA score plot, predictive component p1 against orthogonal component o1, colored by gender with 95% ellipses]
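The x-axis label in the plot code is hard-coded as 'P1(5.0%)'. A small helper can build such labels from a number instead, so the label stays correct if the model changes. The helper axis_label is hypothetical (it is not part of ropls or ggplot2), and the comment about where ropls stores the per-component R2X is an assumption to check against your installed version.

```r
# Hypothetical helper: build an axis label from a component name and the
# fraction of X variance that component explains.
axis_label <- function(component, r2x) {
  sprintf("%s (%.1f%%)", component, 100 * r2x)
}

# 0.05 is the value implied by the hard-coded 'P1(5.0%)' label.
# Assumption: with a fitted model, the fraction could be read from the
# model summary instead, e.g. oplsda@modelDF["p1", "R2X"].
x_lab <- axis_label("P1", 0.05)
x_lab
```

The result can then be passed to labs(x = x_lab) in place of the literal string.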

Differential metabolite screening

# VIP values help identify the important metabolites
vip <- getVipVn(oplsda)
vip_select <- vip[vip > 1]    # VIP > 1 is the usual selection threshold
head(vip_select)

vip_select <- cbind(sacurine$variableMetadata[names(vip_select), ], vip_select)
names(vip_select)[4] <- 'VIP'
vip_select <- vip_select[order(vip_select$VIP, decreasing = TRUE), ]
head(vip_select)    # annotated metabolites with VIP > 1, sorted by decreasing VIP
> head(vip_select)   
                               msiLevel      hmdb chemicalClass
p-Anisic acid                         1 HMDB01101        AroHoM
Malic acid                            1 HMDB00156        Organi
Testosterone glucuronide              2 HMDB03193 Lipids:Steroi
Pantothenic acid                      1 HMDB00210        AliAcy
Acetylphenylalanine                   1 HMDB00512        AA-pep
alpha-N-Phenylacetyl-glutamine        1 HMDB06344        AA-pep
                                    VIP
p-Anisic acid                  2.533220
Malic acid                     2.479289
Testosterone glucuronide       2.421591
Pantothenic acid               2.165296
Acetylphenylalanine            1.988311
alpha-N-Phenylacetyl-glutamine 1.965807
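Why is 1 the usual cut-off? In the classical single-component PLS definition, VIP values are scaled so that their mean square equals 1, so VIP > 1 means "more influential than average". The sketch below shows this on simulated data. It is an assumption-laden illustration: the data are simulated, the variable names m1 to m6 are made up, and the formula is the one-component PLS VIP, not the OPLS variant that getVipVn computes.

```r
# One-component PLS VIP on simulated data (illustration only).
set.seed(7)
n <- 40; p <- 6
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("m", 1:p)))
y <- rep(c(-1, 1), each = n / 2)   # hypothetical two-class response
X[, 1] <- X[, 1] + 1.5 * y         # m1 and m2 carry class signal
X[, 2] <- X[, 2] - 1.0 * y

Xs <- scale(X)                     # unit-variance ("standard") scaling
w  <- crossprod(Xs, y - mean(y))   # first PLS weight vector (p x 1)
w  <- w / sqrt(sum(w^2))           # normalise to unit length

vip <- sqrt(p) * abs(w[, 1])       # VIP for a single component
mean(vip^2)                        # 1 by construction

# Variables above the threshold, most influential first
sort(vip[vip > 1], decreasing = TRUE)
```

Because the mean square is fixed at 1, not every variable can pass the threshold: some must fall below it whenever others rise above.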
# lollipop chart of the differential metabolites
# the metabolite names are too long, so replace them with short labels
vip_select$cat = paste('A',1:nrow(vip_select), sep = '')
p2 <- ggplot(vip_select, aes(cat, VIP)) +
  geom_segment(aes(x = cat, xend = cat,
                   y = 0, yend = VIP)) +
  geom_point(shape = 21, size = 5, color = '#008000' ,fill = '#008000') +
  geom_point(aes(1,2.5), color = 'white') +
  geom_hline(yintercept = 1, linetype = 'dashed') +
  scale_y_continuous(expand = c(0,0)) +
  labs(x = '', y = 'VIP value') +
  theme_bw() +
  theme(legend.position = 'none',
        legend.text = element_text(color = 'black',size = 12, family = 'Arial', face = 'plain'),
        panel.background = element_blank(),
        panel.grid = element_blank(),
        axis.text = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
        axis.text.x = element_text(angle = 90),
        axis.title = element_text(color = 'black',size = 15, family = 'Arial', face = 'plain'),
        axis.ticks = element_line(color = 'black'),
        axis.ticks.x = element_blank())
p2
[Figure: lollipop chart of VIP values for the metabolites with VIP > 1]
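Two practical points about short labels such as A1, A2, ...: ggplot2 orders a character x-axis alphabetically, so A10 lands before A2, and readers need a way to map labels back to metabolite names. The sketch below shows one way to handle both, using the six metabolites and VIP values printed above. The object names top and legend_tbl are made up for this example.

```r
# Character labels sort alphabetically, which breaks the rank order
labels <- paste0("A", 1:12)
sort(labels)[1:4]   # "A1" "A10" "A11" "A12"

# The six top metabolites and VIP values from the output shown earlier
top <- data.frame(
  VIP = c(2.533220, 2.479289, 2.421591, 2.165296, 1.988311, 1.965807),
  row.names = c("p-Anisic acid", "Malic acid", "Testosterone glucuronide",
                "Pantothenic acid", "Acetylphenylalanine",
                "alpha-N-Phenylacetyl-glutamine")
)

# Lock the order with explicit factor levels so the axis follows VIP rank
top$cat <- paste0("A", seq_len(nrow(top)))
top$cat <- factor(top$cat, levels = top$cat)

# Lookup table from short label back to metabolite name
legend_tbl <- data.frame(label = top$cat,
                         metabolite = rownames(top),
                         VIP = top$VIP)
legend_tbl
```

Applying the same factor() step to vip_select$cat before plotting would keep the lollipops in descending VIP order; printing the lookup table next to the figure serves as its legend.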

References

  1. Implementation of OPLS-DA in R Language | Little Blue's Knowledge Wasteland (blog4xiang.world)

  2. Partial Least Squares Discriminant Analysis (PLS-DA) and Orthogonal Partial Least Squares Discriminant Analysis (OPLS-DA) for the R package ropls

  3. Analysis of metabolomic data with PLS and OPLS

  4. ropls: PCA, PLS(-DA) and OPLS(-DA) for multivariate analysis and feature selection of omics data (bioconductor.org)

Previous posts in this series

  1. Multivariate Analysis of Single Omics | 1. PCA and PLS-DA

  2. Multivariate Analysis of Single Omics | 2. Sparse Partial Least Squares Discriminant Analysis (sPLS-DA)


Origin blog.csdn.net/weixin_45822007/article/details/121045882