WGCNA Concise Guide | 1. Gene Coexpression Network Construction and Module Identification
reference
Introduction
Data import, cleaning and preprocessing
data import
Check for Excessive Missing Values and Outlier Samples
Load clinical feature data
Automatically build networks and identify modules
Determining suitable soft thresholds: network topology analysis
Build network and identify modules in one step
Past
reference
This article mainly refers to the official guide Tutorials for WGCNA R package (ucla.edu). For details, please refer to the official documentation.
Other information:
WGCNA - Anthology - Jianshu (jianshu.com)
WGCNA analysis, the latest simple and comprehensive tutorial - short book (jianshu.com)
Introduction
Weighted gene co-expression network analysis ( WGCNA , Weighted correlation network analysis) is a systems biology method used to describe gene association patterns between different samples, which can be used to identify gene sets with highly coordinated changes , and based on the interconnectivity of gene sets and associations between gene sets and phenotypes to identify candidate biomarker genes or therapeutic targets. In short, it divides genes into several modules, explores the correlation between phenotype data and gene modules, and finds the hub genes in the modules .
Data import, cleaning and preprocessing
data import
Download sample data: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-Data.zip
# BiocManager::install("WGCNA")
library('WGCNA')
# 在读入数据时,遇到字符串之后,不将其转换为factors,仍然保留为字符串格式
options(stringsAsFactors = FALSE)
# 导入示例数据
femData = read.csv("LiverFemale3600.csv")
# 查看数据
dim(femData)
names(femData)
> dim(femData)
[1] 3600 143
> names(femData)
[1] "substanceBXH" "gene_symbol" "LocusLinkID" "ProteomeID"
[5] "cytogeneticLoc" "CHROMOSOME" "StartPosition" "EndPosition"
[9] "F2_2" "F2_3" "F2_14" "F2_15"
...
# 提取样本-基因表达矩阵
datExpr0 = as.data.frame(t(femData[, -c(1:8)]))
names(datExpr0) = femData$substanceBXH
rownames(datExpr0) = names(femData)[-c(1:8)]
> datExpr0[1:6,1:6]
MMT00000044 MMT00000046 MMT00000051 MMT00000076 MMT00000080 MMT00000102
F2_2 -0.0181000 -0.0773 -0.02260000 -0.00924 -0.04870000 0.17600000
F2_3 0.0642000 -0.0297 0.06170000 -0.14500 0.05820000 -0.18900000
F2_14 0.0000644 0.1120 -0.12900000 0.02870 -0.04830000 -0.06500000
F2_15 -0.0580000 -0.0589 0.08710000 -0.04390 -0.03710000 -0.00846000
F2_19 0.0483000 0.0443 -0.11500000 0.00425 0.02510000 -0.00574000
F2_20 -0.1519741 -0.0938 -0.06502607 -0.23610 0.08504274 -0.01807182
Check for Excessive Missing Values and Outlier Samples
# 检查缺失值太多的基因和样本
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK
If the last statement returns TRUE
, all genes have passed the check. If not, we remove genes and samples that do not meet the requirements from the data.
if(!gsg$allOK)
{
#(可选)打印被删除的基因和样本名称:
if(sum(!gsg$goodGenes)>0)
printFlush(paste("Removinggenes:",paste(names(datExpr0)[!gsg$goodGenes], collapse =",")));
if(sum(!gsg$goodSamples)>0)
printFlush(paste("Removingsamples:",paste(rownames(datExpr0)[!gsg$goodSamples], collapse =",")));
#删除不满足要求的基因和样本:
datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
Next, we cluster the samples (as opposed to clustering genes later) to see if there are any obvious outliers .
sampleTree = hclust(dist(datExpr0), method ="average");
# 绘制样本树:打开一个尺寸为12 * 9英寸的图形输出窗口
# 可对窗口大小进行调整
sizeGrWindow(12,9)
# 如要保存可运行下面语句
# pdf(file="Plots/sampleClustering.pdf",width=12,height=9);
par(cex = 0.6)
par(mar =c(0,4,2,0))
plot(sampleTree, main ="Sampleclusteringtodetectoutliers",sub="", xlab="", cex.lab = 1.5,
cex.axis= 1.5, cex.main = 2)
Fig1a shows an outlier. It can be removed manually or using automatic methods. Selecting a height to cut will remove anomalous samples, such as 15 (Fig1b), and use branch cuts at that height
# 绘制阈值切割线
abline(h = 15,col="red");
# 确定阈值线下的集群
clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
# clust1包含想要留下的样本.
keepSamples = (clust==1)
datExpr = datExpr0[keepSamples, ]
nGenes =ncol(datExpr)
nSamples =nrow(datExpr)
datExpr
Contains expression data for network analysis.
Load clinical feature data
Match sample information with clinical features.
traitData =read.csv("ClinicalTraits.csv")
dim(traitData)
names(traitData)
# 删除不必要的列.
allTraits = traitData[, -c(31, 16)]
allTraits = allTraits[,c(2, 11:36) ]
dim(allTraits)
names(allTraits)
# 形成一个包含临床特征的数据框
femaleSamples =rownames(datExpr)
traitRows =match(femaleSamples, allTraits$Mice)
datTraits = allTraits[traitRows, -1]
rownames(datTraits) = allTraits[traitRows, 1]
collectGarbage() # 释放内存
We now variabledatExpr
have expression data in , datTraits
and corresponding clinical features in variables. Visualize the relationship between clinical features and sample dendrograms before network construction and module detection.
# 重新聚类样本
sampleTree2 = hclust(dist(datExpr), method ="average")
# 将临床特征值转换为连续颜色:白色表示低,红色表示高,灰色表示缺失
traitColors = numbers2colors(datTraits, signed = FALSE);
# 在样本聚类图的基础上,增加临床特征值热图
plotDendroAndColors(sampleTree2, traitColors,
groupLabels =names(datTraits),
main ="Sample dendrogramand trait heatmap")
Automatically build networks and identify modules
Determining suitable soft thresholds: network topology analysis
The soft thresholding, is a value used to power the correlation of the genes to that threshold. The assumption on that by raising the correlation to a power will reduce the noise of the correlations in the adjacency matrix. To pick up one threshold use the pickSoftThreshold function, which calculates for each power if the network resembles to a scale-free graph. The power which produce a higher similarity with a scale-free network is the one you should use.
WGCNA: What is
soft thresholding
? (bioconductor.org)
# 设置软阈值调参范围
powers =c(c(1:10),seq(from = 12, to=20,by=2))
# 网络拓扑分析
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# 绘图
sizeGrWindow(9, 5)
# 1行2列排列
par(mfrow =c(1,2));
cex1 = 0.9;
# 无标度拓扑拟合指数与软阈值的函数(左图)
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
xlab="SoftThreshold(power)",ylab="ScaleFreeTopologyModelFit,signedR^2",type="n",
main =paste("Scaleindependence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
labels=powers,cex=cex1,col="red");
# 这条线对应于h的R^2截止点
abline(h=0.90,col="red")
# Mean Connectivity与软阈值的函数(右图)
plot(sft$fitIndices[,1], sft$fitIndices[,5],
xlab="SoftThreshold(power)",ylab="MeanConnectivity", type="n",
main =paste("Meanconnectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5],labels=powers, cex=cex1,col="red")
We choose power 6 , which is the lowest power at which the scale-free topology fit exponential curve flattens out at higher values.
Build network and identify modules in one step
net = blockwiseModules(datExpr,power= 6,
TOMType ="unsigned", minModuleSize = 30,
reassignThreshold = 0, mergeCutHeight = 0.25,
numericLabels = TRUE, pamRespectsDendro = FALSE,
saveTOMs = TRUE,
saveTOMFileBase ="femaleMouseTOM",
verbose = 3)
deepSplit
The parameter adjusts the sensitivity of the divided modules. The larger the value, the more sensitive it is, and the more modules will be obtained. The default is 2;minModuleSize
The parameter sets the number of genes in the smallest module, the smaller the value, the smaller the module will be retained;mergeCutHeight
Set the distance for merging similarity modules. The smaller the value, the less likely it will be merged, and the more modules will remain.
# 查看识别了多少模块以及模块大小
table(net$colors)
> table(net$colors)
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
99 609 460 409 316 312 221 211 157 123 106 100 94 91 77 76 58 47 34
Indicates that there are 18 modules, labeled 1 to 18 in decreasing order of size, ranging in size from 609 to 34 genes. Label 0 is genes outside of all modules.
# 可视化模块
sizeGrWindow(12, 9)
# 将标签转换为颜色
mergedColors = labels2colors(net$colors)
# 绘制树状图和模块颜色图
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
"Modulecolors",
dendroLabels = FALSE, hang = 0.03,
addGuide = TRUE, guideHang = 0.05)
Save module assignments and module eigengene information for subsequent analysis.
moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
MEs = net$MEs;
geneTree = net$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree,
file="FemaleLiver-02-networkConstruction-auto.RData")
Past
Mapping with Nature | Paired Dumbbell Plot + Grouped Fitting Curve + Categorical Variable Heat Map
(Free Tutorial + Code Collection)|Follow Cell to Learn Drawing Series Collection
Follow Nat Commun to learn to draw | 1. Batch boxplot + scatter + difference analysis
Follow Nat Commun to learn to map | 3. Species abundance stacking histogram
Follow Nat Commun to learn to draw | 4. Paired boxplot + difference analysis