Preface
It is rare to get a private message from a reader in the official account backend. The reader asked whether the GSEA plot from the article in the picture below can be drawn with R, so today I will write a simple tutorial.
GSEA stands for Gene Set Enrichment Analysis. The basic idea is to take predefined gene sets, rank all genes by their degree of differential expression between the two groups of samples, and then test whether a given gene set is enriched at the top or the bottom of that ranked list.
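To make the "enriched at the top or bottom of the ranked list" idea concrete, here is a minimal base R sketch of a running enrichment score on toy data. The gene names, values, and the helper `running_es()` are all made up for illustration; this is a simplified version of the weighted running-sum statistic, not the code that clusterProfiler runs internally.

```r
# Toy ranked list: gene names are hypothetical, values play the role of logFC
ranked <- c(g1 = 3.0, g2 = 2.5, g3 = 1.2, g4 = 0.4,
            g5 = -0.3, g6 = -0.9, g7 = -1.5, g8 = -2.2)
gene_set <- c("g1", "g2", "g4")  # hypothetical predefined gene set

running_es <- function(ranked, gene_set, exponent = 1) {
  ranked <- sort(ranked, decreasing = TRUE)
  in_set <- names(ranked) %in% gene_set
  # Genes in the set push the score up, weighted by |statistic|^exponent
  hit <- ifelse(in_set, abs(ranked)^exponent, 0)
  # Genes outside the set pull it down by an equal step each
  miss <- ifelse(in_set, 0, 1)
  rs <- cumsum(hit / sum(hit)) - cumsum(miss / sum(miss))
  setNames(rs, names(ranked))
}

rs <- running_es(ranked, gene_set)
es <- rs[which.max(abs(rs))]  # enrichment score: largest deviation from zero
print(round(rs, 3))
print(es)
```

Because the toy gene set sits near the top of the list, the running score climbs early and the ES is positive; a set concentrated at the bottom would give a negative ES.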
The difference between GSEA and GO / KEGG pathway enrichment is that the latter two set a threshold in advance and only look at genes with large differences (like only looking at the elite class). This makes it easy to miss genes that are not significantly differentially expressed but are biologically important (average grades, but very talented). GSEA is therefore better suited to data sets where screening with the traditional methods leaves too few genes to work with.
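A small base R sketch of that point, on deterministic made-up numbers: a gene set whose members are all shifted up a little mostly fails a hard cutoff, yet a rank-based comparison picks up the shift easily. The cutoff of 2, the shift of 0.8, and the use of a Wilcoxon rank-sum test as a stand-in are my own assumptions for illustration; the Wilcoxon test is not the GSEA statistic.

```r
# Deterministic toy data: 950 background genes centred on 0,
# 50 gene-set genes shifted upward by a modest 0.8 (all values hypothetical)
background <- qnorm(ppoints(950))
in_set <- qnorm(ppoints(50)) + 0.8

# Threshold approach: how many gene-set members pass |logFC| > 2?
n_pass <- sum(abs(in_set) > 2)

# Rank-based approach: do gene-set members rank higher than the rest?
# (Wilcoxon rank-sum test used here only as a simple stand-in for GSEA)
p_rank <- wilcox.test(in_set, background, alternative = "greater")$p.value

cat("set genes passing the cutoff:", n_pass, "of", length(in_set), "\n")
cat("rank-based p-value:", signif(p_rank, 3), "\n")
```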
Data preparation
I will demonstrate directly with the results of an earlier transcriptome differential expression analysis. The data format is as follows:
There are many ways to do differential expression analysis; commonly used tools are DESeq2, edgeR, and limma. If you have any questions, you can send a private message~ Because GSEA only needs two columns, SYMBOL (gene name) and foldchange (or logFC), you can delete the columns you do not need. (This can be done in Excel, or in R with the tidyverse; for data frame processing, please refer to the article: ).
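If you would rather trim the table in R than in Excel, a base R sketch could look like this. The `de_result` table and its `baseMean` and `padj` columns are hypothetical stand-ins for whatever your differential analysis produced; only `SYMBOL` and `logFC` come from this tutorial. The file is written to a temporary folder here so the example does not touch your real data.

```r
# Hypothetical differential-expression table; columns other than
# SYMBOL and logFC are made up for illustration
de_result <- data.frame(
  SYMBOL = c("CD74", "MALAT1", "A2M"),
  baseMean = c(1520.3, 880.1, 95.7),
  logFC = c(41.99218, 16.64222, -0.713519723),
  padj = c(1e-8, 3e-5, 0.21),
  stringsAsFactors = FALSE
)

# Keep only the two columns GSEA needs
df_min <- de_result[, c("SYMBOL", "logFC")]

# Save as a tab-separated text file (temporary location for this demo)
out_file <- file.path(tempdir(), "gene_diff.txt")
write.table(df_min, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Reading it back gives the two-column format used in the rest of the post
df_check <- read.table(out_file, header = TRUE, sep = "\t")
print(df_check)
```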
Start the analysis
Install and import the R packages to be used
BiocManager::install("clusterProfiler") # thanks to Y叔 for the clusterProfiler package
BiocManager::install("enrichplot") # needed for plotting
BiocManager::install("org.Hs.eg.db") # needed for gene annotation
library(org.Hs.eg.db)
library(clusterProfiler)
library(enrichplot)
Import Data
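setwd("D:/Note/MZBJ/Q_A") # set the working directory to where the file is
df <- read.table("gene_diff.txt", header = TRUE) # read the txt file
# df <- read.csv("gene_diff.csv", header = TRUE) # or read a csv file
head(df) # view the first few rows
dim(df) # number of rows and columns
> head(df) # view the first few rows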
SYMBOL logFC
1 CD74 41.99218
2 MAB21L3 35.00852
3 KCNQ1OT1 22.78417
4 RP3-323A16.1 22.25173
5 LINC00504 16.82801
6 MALAT1 16.64222
> dim(df) # number of rows and columns
[1] 5057 2
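Before going further it can be worth a quick sanity check on the imported table, since duplicated symbols or missing values tend to cause trouble later. The sketch below runs on a small made-up data frame (the duplicated CD74 row and the missing values are invented for the demo) and simply keeps the first occurrence of each symbol, which is only one of several reasonable choices.

```r
# Toy input with deliberate problems (the extra rows are hypothetical)
df <- data.frame(
  SYMBOL = c("CD74", "MAB21L3", "CD74", "MALAT1", NA),
  logFC = c(41.99218, 35.00852, 40.2, NA, 1.1),
  stringsAsFactors = FALSE
)

n_dup <- sum(duplicated(df$SYMBOL))             # repeated gene symbols
bad <- is.na(df$SYMBOL) | !is.finite(df$logFC)  # missing symbol or logFC

cat("duplicated symbols:", n_dup, "\n")
cat("rows with missing values:", sum(bad), "\n")

# Drop problem rows and keep the first occurrence of each symbol
df_clean <- df[!bad & !duplicated(df$SYMBOL), ]
print(df_clean)
```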
Convert gene ID
If the gene names are symbols, they need to be converted to Entrez ID format. An Entrez ID is the Entrez gene ID, which corresponds to a gene location on the chromosome. Every discovered gene is assigned a uniform number, and the Entrez ID is the number NCBI uses in its Entrez Gene database. Because Entrez IDs are specific, they are better suited to the downstream analysis.
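df_id <- bitr(df$SYMBOL, # the column to convert: SYMBOL in the df data frame
  fromType = "SYMBOL", # the ID type to convert from
  toType = "ENTREZID", # the ID type to convert to
  OrgDb = "org.Hs.eg.db") # the species database; for mouse use org.Mm.eg.db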
>'select()' returned 1:many mapping between keys and columns
Warning message:
In bitr(df$SYMBOL, fromType = "SYMBOL", toType = "ENTREZID", OrgDb = "org.Hs.eg.db") :
7.87% of input gene IDs are fail to map... # 7.87% had no match, i.e. were not converted
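The two messages above (a 1:many mapping, and some IDs failing to map) can be inspected with plain base R once you have the conversion table. The sketch below uses a hand-made `df_id` in the same two-column shape that `bitr()` returns; `GENE_X` and its two IDs are hypothetical, invented only to show a one-to-many case.

```r
# Symbols we asked to convert (GENE_X is hypothetical)
symbols <- c("CD74", "MAB21L3", "RP3-323A16.1", "GENE_X")

# Hand-made conversion table in the SYMBOL / ENTREZID shape of bitr() output;
# the GENE_X rows and their IDs are made up to illustrate a 1:many mapping
df_id <- data.frame(
  SYMBOL = c("CD74", "MAB21L3", "GENE_X", "GENE_X"),
  ENTREZID = c("972", "126868", "9000001", "9000002"),
  stringsAsFactors = FALSE
)

# Symbols that did not get any Entrez ID
unmapped <- setdiff(symbols, df_id$SYMBOL)
pct_unmapped <- 100 * length(unmapped) / length(unique(symbols))

# Symbols that received more than one Entrez ID
multi <- names(which(table(df_id$SYMBOL) > 1))

cat("unmapped:", unmapped, sprintf("(%.1f%%)", pct_unmapped), "\n")
cat("one-to-many:", multi, "\n")
```

Names such as RP3-323A16.1 are clone-based identifiers rather than official symbols, which is one common reason for a failed mapping.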
Merge the two data frames df and df_id by the SYMBOL column.
df_all <- merge(df, df_id, by = "SYMBOL", all = FALSE) # merge the two tables
head(df_all) # look at the data again
dim(df_all) # some symbols were not converted, so there are fewer rows
> head(df_all)
SYMBOL logFC ENTREZID
1 A2M -0.713519723 2
2 AAK1 -0.089497971 22848
3 AAMP -0.014536797 14
4 AARS2 0.077105219 57505
5 AASDHPPT -0.000560858 60496
6 ABCA1 0.436678052 19
> dim(df_all)
[1] 4660 3
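A small toy example of what the merge does, plus one optional clean-up step. With `all = FALSE` the merge is an inner join, so symbols without an Entrez ID drop out. If two symbols end up on the same Entrez ID, the ranked list would carry a duplicated name, so here I keep the row with the largest absolute logFC; that rule is my own assumption, not something the tutorial prescribes. `OLDNAME`, `NEWNAME`, `NOHIT`, and the ID 9000001 are hypothetical.

```r
# Toy inputs (OLDNAME / NEWNAME / NOHIT and ID 9000001 are hypothetical)
df <- data.frame(
  SYMBOL = c("A2M", "AAK1", "OLDNAME", "NEWNAME", "NOHIT"),
  logFC = c(-0.71, -0.09, 1.20, 0.30, 2.00),
  stringsAsFactors = FALSE
)
df_id <- data.frame(
  SYMBOL = c("A2M", "AAK1", "OLDNAME", "NEWNAME"),
  ENTREZID = c("2", "22848", "9000001", "9000001"),
  stringsAsFactors = FALSE
)

# Inner join: NOHIT has no Entrez ID, so it is dropped
df_all <- merge(df, df_id, by = "SYMBOL", all = FALSE)

# For a repeated Entrez ID, keep the row with the largest |logFC|
df_all <- df_all[order(abs(df_all$logFC), decreasing = TRUE), ]
df_unique <- df_all[!duplicated(df_all$ENTREZID), ]

print(df_unique)
```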
GSEA
df_all_sort <- df_all[order(df_all$logFC, decreasing = TRUE), ] # sort by logFC in descending order first
gene_fc <- df_all_sort$logFC # extract the fold changes, from largest to smallest
head(gene_fc)
names(gene_fc) <- df_all_sort$ENTREZID # name each fold change with its ENTREZID
head(gene_fc)
> head(gene_fc)
[1] 41.99218 35.00852 22.78417 16.82801 16.64222 15.33221
> head(gene_fc)
972 126868 10984 201853 378938 3514
41.99218 35.00852 22.78417 16.82801 16.64222 15.33221
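The same preparation can be wrapped into a small helper that also checks the three things the input needs to be: numeric, named by Entrez ID, and sorted in decreasing order. `make_ranked_list()` is a hypothetical helper name of my own, shown here on three rows taken from the merged table above.

```r
# Hypothetical helper: build a named, decreasing logFC vector from a table
make_ranked_list <- function(df_all) {
  stopifnot(all(c("ENTREZID", "logFC") %in% names(df_all)))
  stopifnot(is.numeric(df_all$logFC))
  stopifnot(!anyDuplicated(df_all$ENTREZID))  # each ID should appear once
  ord <- order(df_all$logFC, decreasing = TRUE)
  setNames(df_all$logFC[ord], df_all$ENTREZID[ord])
}

# Three rows from the merged table shown earlier
toy <- data.frame(
  SYMBOL = c("A2M", "AAK1", "ABCA1"),
  logFC = c(-0.713519723, -0.089497971, 0.436678052),
  ENTREZID = c("2", "22848", "19"),
  stringsAsFactors = FALSE
)

gene_fc <- make_ranked_list(toy)
print(gene_fc)
```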
With the above prepared, a single line of code does the rest.
# KEGG pathway as an example
KEGG <- gseKEGG(gene_fc, organism = "hsa") # the full parameter list is shown below
> KEGG <- gseKEGG(gene_fc, organism = "hsa")
Reading KEGG annotation online:
Reading KEGG annotation online:
preparing geneSet collections...
GSEA analysis...
leading edge analysis...
done...
Warning messages:
1: In preparePathwaysAndStats(pathways, stats, minSize, maxSize, gseaParam, :
There are ties in the preranked stats (0.13% of the list).
The order of those tied genes will be arbitrary, which may produce unexpected results.
2: In serialize(data, node$con) : 'package:stats' may not be available when loading
3: In serialize(data, node$con) : 'package:stats' may not be available when loading
4: In serialize(data, node$con) : 'package:stats' may not be available when loading
5: In serialize(data, node$con) : 'package:stats' may not be available when loading
6: In serialize(data, node$con) : 'package:stats' may not be available when loading
7: In serialize(data, node$con) : 'package:stats' may not be available when loading
8: In fgseaMultilevel(...) :
For some pathways, in reality P-values are less than 1e-10. You can set the `eps` argument to zero for better estimation.
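The first warning is about ties, that is, genes sharing exactly the same logFC, whose relative order in the ranking is then arbitrary. A quick base R way to see which genes are tied is sketched below on a toy vector (the two 0.5 values are made up). The percentage computed here is a simple estimate and may not match the number in the warning exactly.

```r
# Toy ranked vector; the two tied values of 0.5 are hypothetical
gene_fc <- c("972" = 41.99218, "126868" = 35.00852,
             "2" = 0.5, "14" = 0.5, "19" = -0.2)

# Genes whose value occurs more than once
tied <- gene_fc %in% gene_fc[duplicated(gene_fc)]
pct_tied <- 100 * mean(duplicated(gene_fc))

cat("tied genes:", names(gene_fc)[tied], "\n")
cat(sprintf("share of repeated values: %.1f%%\n", pct_tied))

# For the last warning, the message itself suggests setting eps to zero,
# e.g. gseKEGG(gene_fc, organism = "hsa", eps = 0) (not run here)
```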
What about GO enrichment?
# GO enrichment
GO <- gseGO(
  gene_fc, # the ranked, named logFC vector built above
  ont = "BP", # "BP", "MF", "CC", or "ALL"
  OrgDb = org.Hs.eg.db, # human gene annotation
  keyType = "ENTREZID",
  pvalueCutoff = 0.05,
  pAdjustMethod = "BH" # p-value adjustment method
)
# KEGG enrichment: the full gseKEGG usage with its parameters
gseKEGG(
geneList,
organism = "hsa",
keyType = "kegg",
exponent = 1,
minGSSize = 10,
maxGSSize = 500,
eps = 1e-10,
pvalueCutoff = 0.05,
pAdjustMethod = "BH",
verbose = TRUE,
use_internal_data = FALSE,
seed = FALSE,
by = "fgsea",
...
)
head(KEGG) # take a look at the result
> head(KEGG)
ID Description setSize enrichmentScore NES
hsa03010 hsa03010 Ribosome 99 -0.8707285 -2.370839
hsa05152 hsa05152 Tuberculosis 87 0.8678558 1.786981
hsa05171 hsa05171 Coronavirus disease - COVID-19 142 -0.5976011 -1.704522
hsa04512 hsa04512 ECM-receptor interaction 19 -0.8866402 -1.913989
pvalue p.adjust qvalues rank leading_edge
hsa03010 0.0000000001 0.0000000257 2.431579e-08 289 tags=65%, list=6%, signal=62%
hsa05152 0.0002124294 0.0272971804 2.582695e-02 279 tags=30%, list=6%, signal=29%
hsa05171 0.0004376904 0.0290749106 2.750893e-02 289 tags=46%, list=6%, signal=45%
hsa04512 0.0004525278 0.0290749106 2.750893e-02 250 tags=58%, list=5%, signal=55%
core_enrichment
hsa03010 6231/6193/4736/6235/2197/6218/6166/6167/6157/3921/6129/140801/6152/6125/6169/6124/9349/6141/6138/6187/6228/6144/6135/6202/6155/6154/6132/6160/6159/6147/6156/6210/6230/6175/6122/6128/11224/23521/9045/25873/6161/6201/6208/6189/6181/6188/6133/6165/6194/6139/6168/6224/6143/6142/6222/6164/6176/6232/6206/6223/6171/6233/6134/6137
hsa05152
Briefly, what the result columns mean:
ID: the KEGG pathway identifier
Description: the description of the pathway
setSize: the number of genes in this pathway
enrichmentScore: the enrichment score, aka ES
NES: the ES after normalization; the full name is normalized enrichment score
qvalues: the FDR q-value (false discovery rate)
rank: the position in the ranked gene list at which the running score reaches its peak
core_enrichment: the list of genes that contribute to the enrichment of the target pathway
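The core_enrichment column stores the gene IDs as one string separated by slashes, so a first step in working with it is usually to split it. The sketch below does that in base R on the first few IDs from the hsa03010 row shown above.

```r
# First few IDs from the hsa03010 core_enrichment string shown above
core <- "6231/6193/4736/6235/2197"

# Split the slash-separated string into a character vector of Entrez IDs
core_ids <- strsplit(core, "/", fixed = TRUE)[[1]]

print(core_ids)
cat("number of core genes in this excerpt:", length(core_ids), "\n")

# To turn these back into symbols, bitr() can be run in the other direction,
# e.g. fromType = "ENTREZID", toType = "SYMBOL" (not run here)
```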
sortKEGG <- KEGG[order(KEGG$enrichmentScore, decreasing = TRUE), ] # sort by enrichment score, highest first
head(sortKEGG)
dim(sortKEGG)
write.table(sortKEGG, "gsea_sortKEGG.txt") # save the result
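A variation on the saving step, sketched on a toy table built from the four rows shown above: sorting by NES instead of the raw enrichment score (NES is the normalized value, so it is the easier one to compare across pathways of different sizes), and writing a tab-separated file without row names so it opens cleanly in Excel. Writing to a temporary folder is only for the demo.

```r
# Toy result table using values from the output shown above
res <- data.frame(
  ID = c("hsa03010", "hsa05152", "hsa05171", "hsa04512"),
  NES = c(-2.370839, 1.786981, -1.704522, -1.913989),
  p.adjust = c(0.0000000257, 0.0272971804, 0.0290749106, 0.0290749106),
  stringsAsFactors = FALSE
)

# Sort by normalized enrichment score, highest first
res_sorted <- res[order(res$NES, decreasing = TRUE), ]

# Tab-separated, no quotes, no row names (temporary location for this demo)
out_file <- file.path(tempdir(), "gsea_sortKEGG.txt")
write.table(res_sorted, out_file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back to confirm the file round-trips
back <- read.delim(out_file)
print(back)
```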
Results visualization
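# gseaplot2 usage
gseaplot2(
  x, # gseaResult object, i.e. the GSEA result
  geneSetID, # ID of the enriched gene set
  title = "", # title
  color = "green", # color of the GSEA line
  base_size = 11, # base font size
  rel_heights = c(1.5, 0.5, 1), # relative heights of the subplots
  subplots = 1:3, # which subplots to show, e.g. subplots = c(1, 3) for only the first and third, subplots = 1 for only the first
  pvalue_table = FALSE, # whether to add a p-value table
  ES_geom = "line" # draw the running enrichment score as a line, or as dots with ES_geom = "dot"
)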
First, a basic one:
gseaplot2(KEGG, "hsa05152", color = "firebrick", rel_heights=c(1, .2, .6))
And another one, in the style seen in articles:
paths <- c("hsa03010", "hsa05152", "hsa05171", "hsa04512") # pick the pathway IDs you want to display
gseaplot2(KEGG, paths, pvalue_table = TRUE)
gseaplot2(KEGG, paths, color = colorspace::rainbow_hcl(4), subplots = c(1, 2), pvalue_table = TRUE)
# different colors, showing only the top two subplots
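If you are curious what the three subplots actually contain, here is a rough base R imitation on toy data: the running enrichment score on top, the positions of the gene-set members in the middle, and the ranking metric at the bottom. This is only a sketch to show the structure, with made-up gene names and values; it is not how gseaplot2 draws its figure, and the output goes to a temporary PDF.

```r
# Toy data: hypothetical gene names, values play the role of logFC
ranked <- sort(c(g1 = 3.0, g2 = 2.5, g3 = 1.2, g4 = 0.4,
                 g5 = -0.3, g6 = -0.9, g7 = -1.5, g8 = -2.2),
               decreasing = TRUE)
in_set <- names(ranked) %in% c("g1", "g2", "g4")  # hypothetical gene set

# Running score: weighted steps up at set members, equal steps down elsewhere
hit <- ifelse(in_set, abs(ranked), 0)
miss <- as.numeric(!in_set)
rs <- cumsum(hit / sum(hit)) - cumsum(miss / sum(miss))

out_file <- file.path(tempdir(), "toy_gsea_plot.pdf")
pdf(out_file, width = 5, height = 7)
layout(matrix(1:3, ncol = 1), heights = c(1.5, 0.5, 1))
par(mar = c(2, 4, 1, 1))
# Panel 1: running enrichment score
plot(seq_along(rs), rs, type = "l", col = "firebrick",
     xlab = "", ylab = "Running ES")
abline(h = 0, lty = 2)
# Panel 2: where the gene-set members fall in the ranked list
plot(which(in_set), rep(1, sum(in_set)), type = "h",
     xlim = c(1, length(rs)), ylim = c(0, 1),
     xlab = "", ylab = "", yaxt = "n")
# Panel 3: the ranking metric itself
plot(seq_along(ranked), ranked, type = "h",
     xlab = "Rank", ylab = "logFC")
dev.off()
```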
The remaining details can be polished in AI~
The tutorial above is just my own rough notes. If you find any mistakes, please correct me. If you find it useful, I hope you will like and share it.
Getting the code: send the private message "GSEA" in the official account backend to receive all of the code.