R plotting practice | GSEA enrichment plots


Preface

It is rare to get a private message from a reader. One asked whether the GSEA plot from the article shown in the picture below can be drawn with R. Today I will write a simple tutorial.

[Figure: figure_gsea]

GSEA (Gene Set Enrichment Analysis) works on predefined gene sets. Its basic idea is to rank all genes by their degree of differential expression between two groups of samples, and then test whether a predefined gene set is enriched at the top or the bottom of that ranked list.
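To make the "enriched at the top or bottom" idea concrete, here is a minimal base-R sketch of the running-sum statistic behind GSEA. It is a simplified illustration, not the code that clusterProfiler or fgsea actually runs: there is no permutation test and no normalization, and the function name `running_es` as well as all gene names and scores are made up for the example.

```r
# Toy illustration of the GSEA running-sum statistic (base R only).
# running_es, the gene names and the scores are hypothetical.
running_es <- function(ranked_stats, gene_set, exponent = 1) {
  # ranked_stats: named numeric vector, already sorted in decreasing order
  in_set <- names(ranked_stats) %in% gene_set
  n_miss <- sum(!in_set)
  # Genes in the set ("hits") are weighted by |score|^exponent;
  # genes outside the set ("misses") each subtract the same small amount
  hit_weight <- abs(ranked_stats)^exponent * in_set
  step <- ifelse(in_set, hit_weight / sum(hit_weight), -1 / n_miss)
  running <- cumsum(step)
  # The enrichment score is the running value furthest from zero
  list(running = running, es = running[which.max(abs(running))])
}

stats <- c(g1 = 3.0, g2 = 2.0, g3 = 1.0, g4 = -0.5, g5 = -1.5, g6 = -2.5)
res_top    <- running_es(stats, c("g1", "g2"))  # set sits at the top of the list
res_bottom <- running_es(stats, c("g5", "g6"))  # set sits at the bottom of the list
res_top$es
res_bottom$es
```

A set concentrated at the top of the ranking gives a positive score and a set concentrated at the bottom gives a negative one, which is the sign convention seen in the enrichmentScore column later on.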

GSEA differs from GO and KEGG pathway enrichment in that the latter two set a threshold in advance and only look at genes with large differences (like paying attention only to the top class in a school). This makes it easy to miss genes whose differential expression is not significant but which are biologically important (students with average grades but real talent). GSEA is therefore better suited to data sets where screening with traditional analysis methods leaves too few genes to work with.
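For contrast, the threshold-based approach can be sketched as a hypergeometric test in base R. The counts below are invented purely for illustration. The point is that only counts enter the test: how strongly each gene changed, and every gene below the cutoff, play no role.

```r
# Threshold-based (over-representation) test on made-up counts, base R only.
n_background <- 5000  # genes measured (hypothetical)
n_in_pathway <- 100   # measured genes annotated to the pathway (hypothetical)
n_selected   <- 200   # genes that passed the cutoff (hypothetical)
n_overlap    <- 12    # selected genes that belong to the pathway (hypothetical)

# Probability of seeing n_overlap or more pathway genes among the selected
# genes if they had been drawn at random from the background
p_ora <- phyper(n_overlap - 1,
                n_in_pathway,
                n_background - n_in_pathway,
                n_selected,
                lower.tail = FALSE)
p_ora
```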

Data preparation

The demonstration uses the output of an earlier transcriptome differential expression analysis. The data format is as follows:

[Figure: gene_diff]

There are many ways to do the differential expression analysis itself; commonly used packages are DESeq2, edgeR and limma. If you have questions, feel free to send a private message. Because GSEA only needs two columns, SYMBOL (gene name) and foldchange (or logFC), you can delete the others. (This can be done in Excel, or in R with tidyverse; data frame processing is covered in a separate article.)

[Figure: gene_arrage]
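If you would rather trim the table in R than in Excel, a base-R sketch could look like the following. The toy table and the `pvalue` column are assumptions for illustration; only the SYMBOL and logFC column names come from this tutorial.

```r
# Toy differential-expression table; values and the pvalue column are made up
de_table <- data.frame(
  SYMBOL = c("CD74", "MALAT1", "A2M", NA),
  logFC  = c(41.99, 16.64, -0.71, 0.20),
  pvalue = c(0.001, 0.01, 0.3, 0.5),
  stringsAsFactors = FALSE
)

# Keep only the two columns GSEA needs
df_toy <- de_table[, c("SYMBOL", "logFC")]
# Drop rows with a missing gene name or a missing logFC
df_toy <- df_toy[!is.na(df_toy$SYMBOL) & !is.na(df_toy$logFC), ]
df_toy
```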

Start the analysis

Install and import the R packages to be used

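if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") # BiocManager itself is installed from CRAN
BiocManager::install("clusterProfiler") # thanks to Y叔 for the clusterProfiler package
BiocManager::install("enrichplot")  # needed for plotting
BiocManager::install("org.Hs.eg.db") # needed for gene annotation
library(org.Hs.eg.db)
library(clusterProfiler)
library(enrichplot)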

Import Data

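setwd("D:/Note/MZBJ/Q_A") # set the directory that contains the file
df <- read.table("gene_diff.txt", header = TRUE) # read a txt file
# df <- read.csv("gene_diff.csv", header = TRUE) # or read a csv file
head(df) # view the first few rows
dim(df) # number of rows and columns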

> head(df) # view the first few rows
        SYMBOL    logFC
1         CD74 41.99218
2      MAB21L3 35.00852
3     KCNQ1OT1 22.78417
4 RP3-323A16.1 22.25173
5    LINC00504 16.82801
6       MALAT1 16.64222
> dim(df) # number of rows and columns
[1] 5057    2

Convert gene ID

If the gene names are symbols, they need to be converted to Entrez IDs. An Entrez ID is the Entrez Gene ID, which corresponds to a gene locus on a chromosome. Every discovered gene is assigned a uniform number, and the Entrez ID is the number NCBI uses in its Entrez Gene database. Because Entrez IDs are unique, they are better suited to the downstream analysis.

df_id <- bitr(df$SYMBOL, # the column to convert: SYMBOL in the df data frame
              fromType = "SYMBOL", # ID type to convert from
              toType = "ENTREZID", # ID type to convert to
              OrgDb = "org.Hs.eg.db") # species database; for mouse use org.Mm.eg.db
>'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(df$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db") :
  7.87% of input gene IDs are fail to map...  # 7.87% could not be mapped, i.e. were not converted

Merge the two data frames df and df_id by the SYMBOL column.

df_all <- merge(df, df_id, by = "SYMBOL", all = FALSE) # merge the two data frames
head(df_all) # look at the data again
dim(df_all) # some symbols were not converted, so there are fewer rows

> head(df_all)
    SYMBOL        logFC ENTREZID
1      A2M -0.713519723        2
2     AAK1 -0.089497971    22848
3     AAMP -0.014536797       14
4    AARS2  0.077105219    57505
5 AASDHPPT -0.000560858    60496
6    ABCA1  0.436678052       19
> dim(df_all)
[1] 4660    3
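The "1:many mapping" message from bitr means that the merged table may contain the same ID more than once. One possible way to handle this, shown here on a made-up table, is to keep a single row per ENTREZID, namely the one with the largest absolute logFC. This rule is an assumption for illustration and not part of the original workflow; note that it only removes repeated Entrez IDs, so a symbol that maps to two different Entrez IDs still appears twice.

```r
# Toy merged table with duplicates; gene names, IDs and values are made up
df_all_toy <- data.frame(
  SYMBOL   = c("GENE_A", "GENE_A", "GENE_B", "GENE_C"),
  logFC    = c(1.2, 1.2, -0.4, 2.5),
  ENTREZID = c("101", "102", "103", "103"),
  stringsAsFactors = FALSE
)

# Put the rows with the largest |logFC| first ...
ord <- order(abs(df_all_toy$logFC), decreasing = TRUE)
df_sorted <- df_all_toy[ord, ]
# ... so that dropping later duplicates keeps the strongest row per ENTREZID
df_unique <- df_sorted[!duplicated(df_sorted$ENTREZID), ]
df_unique
```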

GSEA

df_all_sort <- df_all[order(df_all$logFC, decreasing = TRUE), ] # first sort by logFC in decreasing order
gene_fc <- df_all_sort$logFC # extract the fold changes, largest first
head(gene_fc)
names(gene_fc) <- df_all_sort$ENTREZID # name each fold change with its ENTREZID
head(gene_fc)

> head(gene_fc)
[1] 41.99218 35.00852 22.78417 16.82801 16.64222 15.33221
> head(gene_fc)
     972   126868    10984   201853   378938     3514 
41.99218 35.00852 22.78417 16.82801 16.64222 15.33221 
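The same preparation can be wrapped in a small helper that also checks the result before it is passed on. This is only a sketch: `make_gene_list` is a hypothetical name, and the toy values are three rounded rows from the merged table shown earlier.

```r
# Build a decreasingly sorted, named numeric vector from a data frame.
# make_gene_list is a hypothetical helper, not a clusterProfiler function.
make_gene_list <- function(df, id_col = "ENTREZID", score_col = "logFC") {
  scores <- df[[score_col]]
  names(scores) <- as.character(df[[id_col]])
  # Drop missing scores, then sort from largest to smallest (names are kept)
  scores <- sort(scores[!is.na(scores)], decreasing = TRUE)
  # Stop early if the scores are not numeric or an ID appears twice
  stopifnot(is.numeric(scores), anyDuplicated(names(scores)) == 0)
  scores
}

toy <- data.frame(
  ENTREZID = c("2", "22848", "19"),
  logFC    = c(-0.7135, -0.0895, 0.4367),
  stringsAsFactors = FALSE
)
gene_fc_toy <- make_gene_list(toy)
gene_fc_toy
```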

With the above prepared, a single line of code runs the analysis.

# KEGG pathways as the example
KEGG <- gseKEGG(gene_fc, organism = "hsa") # the full argument list is shown further below

> KEGG <- gseKEGG(gene_fc, organism = "hsa")
Reading KEGG annotation online:

Reading KEGG annotation online:

preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam,  :
  There are ties in the preranked stats (0.13% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In serialize(data, node$con) : 'package:stats' may not be available when loading
3: In serialize(data, node$con) : 'package:stats' may not be available when loading
4: In serialize(data, node$con) : 'package:stats' may not be available when loading
5: In serialize(data, node$con) : 'package:stats' may not be available when loading
6: In serialize(data, node$con) : 'package:stats' may not be available when loading
7: In serialize(data, node$con) : 'package:stats' may not be available when loading
8: In fgseaMultilevel(...) :
  For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.
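The first warning is about tied values in the ranked vector. A quick way to see how many values are affected is sketched below in base R. `tie_fraction` is a hypothetical helper and its definition of a tie (any value that occurs more than once) may not be exactly the one fgsea uses for the percentage in its message.

```r
# Fraction of entries whose value occurs more than once in the vector.
# tie_fraction is a hypothetical helper; the toy values are made up.
tie_fraction <- function(stats) {
  tied <- duplicated(stats) | duplicated(stats, fromLast = TRUE)
  mean(tied)
}

toy_stats <- c(a = 2.0, b = 1.5, c = 1.5, d = 0.3, e = -1.0)
tie_fraction(toy_stats)
```

For the last warning, the message itself suggests setting the `eps` argument to zero; `eps` appears in the gseKEGG argument list with a default of 1e-10, so a call such as `gseKEGG(gene_fc, organism = "hsa", eps = 0)` should give better estimates for very small p-values.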

What about GO enrichment?

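# GO enrichment
GO <- gseGO(
  gene_fc, # the ranked, named logFC vector
  ont = "BP", # "BP", "MF", "CC" or "ALL"
  OrgDb = org.Hs.eg.db, # human gene annotation
  keyType = "ENTREZID",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH" # p-value adjustment method
)
# KEGG enrichment: usage of gseKEGG with its default arguments
# (shown for reference; this is the signature, not a call to run as-is)
gseKEGG(
  geneList,
  organism = "hsa",
  keyType = "kegg",
  exponent = 1,
  minGSSize = 10,
  maxGSSize = 500,
  eps = 1e-10,
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH",
  verbose = TRUE,
  use_internal_data = FALSE,
  seed = FALSE,
  by = "fgsea",
  ...
)
head(KEGG) # take a look at the result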
> head(KEGG)
               ID                    Description setSize enrichmentScore       NES
hsa03010 hsa03010                       Ribosome      99      -0.8707285 -2.370839
hsa05152 hsa05152                   Tuberculosis      87       0.8678558  1.786981
hsa05171 hsa05171 Coronavirus disease - COVID-19     142      -0.5976011 -1.704522
hsa04512 hsa04512       ECM-receptor interaction      19      -0.8866402 -1.913989
               pvalue     p.adjust      qvalues rank                  leading_edge
hsa03010 0.0000000001 0.0000000257 2.431579e-08  289 tags=65%, list=6%, signal=62%
hsa05152 0.0002124294 0.0272971804 2.582695e-02  279 tags=30%, list=6%, signal=29%
hsa05171 0.0004376904 0.0290749106 2.750893e-02  289 tags=46%, list=6%, signal=45%
hsa04512 0.0004525278 0.0290749106 2.750893e-02  250 tags=58%, list=5%, signal=55%
                                                                                                                                                                                                                                                                                                                                        core_enrichment
hsa03010           6231/6193/4736/6235/2197/6218/6166/6167/6157/3921/6129/140801/6152/6125/6169/6124/9349/6141/6138/6187/6228/6144/6135/6202/6155/6154/6132/6160/6159/6147/6156/6210/6230/6175/6122/6128/11224/23521/9045/25873/6161/6201/6208/6189/6181/6188/6133/6165/6194/6139/6168/6224/6143/6142/6222/6164/6176/6232/6206/6223/6171/6233/6134/6137
hsa05152                                                        

A brief explanation of what the result columns mean:

  • ID: the KEGG pathway ID
  • Description: the description of the pathway
  • setSize: the number of genes from this pathway that are present in the ranked list
  • enrichmentScore: the enrichment score, also called ES
  • NES: the normalized enrichment score, i.e. ES after normalization
  • qvalues: the FDR q-value (false discovery rate)
  • rank: the position in the ranked gene list at which the ES is reached
  • core_enrichment: the core enriched genes, i.e. the genes that drive the enrichment of the pathway

sortKEGG <- KEGG[order(KEGG$enrichmentScore, decreasing = TRUE), ] # sort by enrichment score, highest first
head(sortKEGG)
dim(sortKEGG)
write.table(sortKEGG, "gsea_sortKEGG.txt") # save the result
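As a small illustration of how these columns are typically read, the sketch below rebuilds a few columns of the result shown above as a plain data frame and splits the significant pathways by the sign of NES. The ID, NES and p.adjust values are copied from the output above; the 0.05 cutoff, the variable names, and the choice to parse only the first four IDs of the hsa03010 core_enrichment string are illustrative assumptions.

```r
# A few columns of the head(KEGG) output above, rebuilt as a plain data frame
res_toy <- data.frame(
  ID       = c("hsa03010", "hsa05152", "hsa05171", "hsa04512"),
  NES      = c(-2.370839, 1.786981, -1.704522, -1.913989),
  p.adjust = c(0.0000000257, 0.0272971804, 0.0290749106, 0.0290749106),
  stringsAsFactors = FALSE
)

# Keep pathways with adjusted p < 0.05 (illustrative cutoff)
sig <- res_toy[res_toy$p.adjust < 0.05, ]
# Positive NES: gene set enriched at the top of the ranked list;
# negative NES: gene set enriched at the bottom
up   <- sig$ID[sig$NES > 0]
down <- sig$ID[sig$NES < 0]

# core_enrichment is a "/"-separated string of gene IDs;
# here only the first four IDs of the hsa03010 row are used
core_ids <- strsplit("6231/6193/4736/6235", "/", fixed = TRUE)[[1]]
```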

Results visualization

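# gseaplot2 usage
gseaplot2(
  x, # gseaResult object, i.e. the GSEA result
  geneSetID, # ID of the enriched gene set
  title = "", # plot title
  color = "green", # color of the GSEA line
  base_size = 11, # base font size
  rel_heights = c(1.5, 0.5, 1), # relative heights of the subplots
  subplots = 1:3, # which subplots to show, e.g. subplots = c(1, 3) for the first and third only, subplots = 1 for the first only
  pvalue_table = FALSE, # whether to add a p-value table
  ES_geom = "line" # draw the running enrichment score as a line, or as dots with ES_geom = "dot"
)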

A basic example first:

gseaplot2(KEGG, "hsa05152", color = "firebrick", rel_heights=c(1, .2, .6))
[Figure: hsa05152]

And one in the style seen in published articles:

paths <- c("hsa03010", "hsa05152", "hsa05171", "hsa04512") # choose the pathway IDs you want to show
gseaplot2(KEGG, paths, pvalue_table = TRUE)
[Figure: pathsGSEA]
gseaplot2(KEGG, paths, color = colorspace::rainbow_hcl(4), subplots = c(1, 2), pvalue_table = TRUE)
# change the colors and show only the top two subplots
[Figure: pathsGSEA_rainbow]

The remaining details can be polished in AI (Adobe Illustrator)~

The tutorial above is just my own rough notes. If you find any mistakes, please point them out. If you find it useful, I hope you will like and share it.

Getting the code: send the private message "GSEA" to the official account to receive all of the code.



Origin blog.csdn.net/weixin_45822007/article/details/114989817