WGCNA Concise Guide | 1. Gene Coexpression Network Construction and Module Identification

  • References

  • Introduction

  • Data import, cleaning and preprocessing

    • Data import

    • Check for Excessive Missing Values and Outlier Samples

    • Load clinical feature data

  • Automatically build networks and identify modules

    • Determining suitable soft thresholds: network topology analysis

    • Build network and identify modules in one step

References

This article mainly follows the official guide, Tutorials for the WGCNA R package (ucla.edu). For details, please refer to the official documentation.

Other resources:

  1. WGCNA - Anthology - Jianshu (jianshu.com)

  2. WGCNA analysis, the latest simple and comprehensive tutorial - Jianshu (jianshu.com)

Introduction

Weighted gene co-expression network analysis (WGCNA, also called weighted correlation network analysis) is a systems biology method for describing patterns of gene association across samples. It can identify sets of genes whose expression changes in a highly coordinated way, and it can nominate candidate biomarker genes or therapeutic targets based on the interconnectivity within gene sets and the associations between gene sets and phenotypes. In short, it groups genes into modules, explores the correlation between phenotype data and gene modules, and finds the hub genes within the modules.

Data import, cleaning and preprocessing

Data import

Download sample data: https://horvath.genetics.ucla.edu/html/CoexpressionNetwork/Rpackages/WGCNA/Tutorials/FemaleLiver-Data.zip

# BiocManager::install("WGCNA")
library('WGCNA')
# When reading data, keep strings as character vectors instead of converting them to factors
options(stringsAsFactors = FALSE)
# Import the example data
femData = read.csv("LiverFemale3600.csv")
# Inspect the data
dim(femData)
names(femData)
> dim(femData)
[1] 3600  143

> names(femData)
  [1] "substanceBXH"   "gene_symbol"    "LocusLinkID"    "ProteomeID"    
  [5] "cytogeneticLoc" "CHROMOSOME"     "StartPosition"  "EndPosition"   
  [9] "F2_2"           "F2_3"           "F2_14"          "F2_15"  
  ...
# Extract the sample-by-gene expression matrix (drop the 8 annotation columns, then transpose)
datExpr0 = as.data.frame(t(femData[, -c(1:8)]))
names(datExpr0) = femData$substanceBXH
rownames(datExpr0) = names(femData)[-c(1:8)]
> datExpr0[1:6,1:6]
      MMT00000044 MMT00000046 MMT00000051 MMT00000076 MMT00000080 MMT00000102
F2_2   -0.0181000     -0.0773 -0.02260000    -0.00924 -0.04870000  0.17600000
F2_3    0.0642000     -0.0297  0.06170000    -0.14500  0.05820000 -0.18900000
F2_14   0.0000644      0.1120 -0.12900000     0.02870 -0.04830000 -0.06500000
F2_15  -0.0580000     -0.0589  0.08710000    -0.04390 -0.03710000 -0.00846000
F2_19   0.0483000      0.0443 -0.11500000     0.00425  0.02510000 -0.00574000
F2_20  -0.1519741     -0.0938 -0.06502607    -0.23610  0.08504274 -0.01807182
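
The reshaping step above can be tried on a small table first. The following is a minimal sketch in base R, assuming a toy data frame with two annotation columns; the column names and values are made up and only stand in for femData.

```r
# Toy stand-in for femData: 2 annotation columns, then 3 sample columns.
# All names and values here are made up for illustration.
toy <- data.frame(
  probe_id    = c("g1", "g2", "g3", "g4"),
  gene_symbol = c("A", "B", "C", "D"),
  S1 = c(0.1, 0.2, 0.3, 0.4),
  S2 = c(1.1, 1.2, 1.3, 1.4),
  S3 = c(2.1, 2.2, 2.3, 2.4),
  stringsAsFactors = FALSE
)

annot_cols <- 1:2                            # columns that are not expression values
expr <- as.data.frame(t(toy[, -annot_cols])) # transpose: samples become rows
names(expr) <- toy$probe_id                  # columns are now genes
rownames(expr) <- names(toy)[-annot_cols]    # rows are now samples

# WGCNA expects samples in rows and genes in columns, all numeric
stopifnot(nrow(expr) == 3, ncol(expr) == 4)
stopifnot(all(sapply(expr, is.numeric)))
print(expr)
```

Checking the dimensions and column types right after transposing catches the common mistake of leaving an annotation column in the matrix, which would turn every value into a string.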

Check for Excessive Missing Values and Outlier Samples

# Check for genes and samples with too many missing values
gsg = goodSamplesGenes(datExpr0, verbose = 3);
gsg$allOK

If the last statement returns TRUE, all genes and samples have passed the check. If not, we remove the genes and samples that do not meet the requirements.

if (!gsg$allOK)
{
  # (Optional) print the names of the genes and samples being removed:
  if (sum(!gsg$goodGenes) > 0)
    printFlush(paste("Removing genes:", paste(names(datExpr0)[!gsg$goodGenes], collapse = ", ")));
  if (sum(!gsg$goodSamples) > 0)
    printFlush(paste("Removing samples:", paste(rownames(datExpr0)[!gsg$goodSamples], collapse = ", ")));
  # Remove the genes and samples that failed the check:
  datExpr0 = datExpr0[gsg$goodSamples, gsg$goodGenes]
}
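
To get a feel for what a missing-value filter does, here is a simplified base-R sketch. It is not the algorithm used by goodSamplesGenes (which applies further checks of its own); the 50% cutoff and the toy matrix are assumptions made for illustration.

```r
# Toy expression matrix: 5 samples (rows) x 4 genes (columns), made-up values
set.seed(1)
x <- matrix(rnorm(20), nrow = 5,
            dimnames = list(paste0("S", 1:5), paste0("g", 1:4)))
x[1:4, "g2"] <- NA        # gene g2 is missing in 4 of 5 samples
x["S3", "g4"] <- NA       # a single missing value, which should be tolerated

max_missing <- 0.5        # hypothetical cutoff: drop if more than 50% missing

# Fraction of missing values per gene, then per sample among the kept genes
gene_missing   <- colMeans(is.na(x))
good_genes     <- gene_missing <= max_missing
sample_missing <- rowMeans(is.na(x[, good_genes, drop = FALSE]))
good_samples   <- sample_missing <= max_missing

x_clean <- x[good_samples, good_genes, drop = FALSE]
print(dim(x_clean))
```

Filtering genes before samples means a sample is not penalized for missing values in genes that are dropped anyway.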

Next, we cluster the samples (as opposed to clustering genes, which comes later) to see whether there are any obvious outliers.

sampleTree = hclust(dist(datExpr0), method = "average");
# Plot the sample tree: open a graphics window of 12 x 9 inches
# The window size can be adjusted
sizeGrWindow(12, 9)
# To save the plot to a file, run the following line instead
# pdf(file = "Plots/sampleClustering.pdf", width = 12, height = 9);
par(cex = 0.6)
par(mar = c(0, 4, 2, 0))
plot(sampleTree, main = "Sample clustering to detect outliers", sub = "", xlab = "", cex.lab = 1.5,
     cex.axis = 1.5, cex.main = 2)
Fig1a. Sample dendrogram

Fig1a shows one outlier. It can be removed by hand or automatically: choose a cut height that separates the outlier from the other samples, such as 15 (Fig1b), and cut the branches at that height.

# Draw the cut-height line
abline(h = 15, col = "red");
# Determine the clusters below the line
clust = cutreeStatic(sampleTree, cutHeight = 15, minSize = 10)
table(clust)
# Cluster 1 contains the samples we want to keep
keepSamples = (clust == 1)
datExpr = datExpr0[keepSamples, ]
nGenes = ncol(datExpr)
nSamples = nrow(datExpr)
Fig1b. The red line indicates the cutting height

The variable datExpr now contains the expression data ready for network analysis.
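
The cut-height idea can be reproduced with base R alone. The sketch below uses hclust and cutree on made-up data with one deliberately shifted sample; it approximates the behavior of cutreeStatic by keeping the largest cluster, and the cut height of 5 is an assumption chosen for this toy data.

```r
set.seed(42)
# 12 ordinary samples plus one shifted sample, 50 "genes"; values are made up
x <- matrix(rnorm(13 * 50, sd = 0.1), nrow = 13,
            dimnames = list(paste0("S", 1:13), NULL))
x["S13", ] <- x["S13", ] + 3          # make S13 an obvious outlier

tree <- hclust(dist(x), method = "average")
cut_height <- 5                        # hypothetical, read off the dendrogram
clust <- cutree(tree, h = cut_height)

# Keep the largest cluster; the remaining small clusters are treated as outliers
main_cluster <- as.integer(names(which.max(table(clust))))
keep <- clust == main_cluster
print(rownames(x)[!keep])              # samples flagged as outliers
x_kept <- x[keep, , drop = FALSE]
```

In real data the cut height should always be chosen by looking at the dendrogram, since a height that is too low also discards ordinary samples.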

Load clinical feature data

Match sample information with clinical features.

traitData = read.csv("ClinicalTraits.csv")
dim(traitData)
names(traitData)
# Remove columns that are not needed
allTraits = traitData[, -c(31, 16)]
allTraits = allTraits[, c(2, 11:36)]
dim(allTraits)
names(allTraits)
# Build a data frame of clinical traits whose rows match the expression samples
femaleSamples = rownames(datExpr)
traitRows = match(femaleSamples, allTraits$Mice)
datTraits = allTraits[traitRows, -1]
rownames(datTraits) = allTraits[traitRows, 1]
collectGarbage() # free memory
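
The key step in the code above is match(), which reorders the trait table to follow the sample order of the expression data. Here is a small sketch with a made-up trait table; the trait column names and values are hypothetical, and only the sample IDs resemble those in the example data.

```r
# Made-up trait table whose row order differs from the expression samples
traits <- data.frame(
  Mice      = c("F2_3", "F2_14", "F2_2", "F2_99"),
  weight_g  = c(30.1, 28.4, 35.2, 31.0),
  length_cm = c(9.8, 9.5, 10.2, 9.9),
  stringsAsFactors = FALSE
)
samples <- c("F2_2", "F2_3", "F2_14")   # plays the role of rownames(datExpr)

rows <- match(samples, traits$Mice)     # position of each sample in the trait table
if (anyNA(rows)) stop("some samples have no trait record")

aligned <- traits[rows, -1]             # drop the ID column, keep the traits
rownames(aligned) <- traits$Mice[rows]

# Row i of the traits now describes row i of the expression data
stopifnot(identical(rownames(aligned), samples))
print(aligned)
```

Checking for NA in the result of match() is worth the extra line: a sample without a trait record would otherwise produce a silent row of missing values.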

We now have the expression data in the variable datExpr and the corresponding clinical features in the variable datTraits. Before network construction and module detection, we visualize how the clinical features relate to the sample dendrogram.

# Re-cluster the samples
sampleTree2 = hclust(dist(datExpr), method = "average")
# Convert the clinical traits to a continuous color scale: white is low, red is high, grey is missing
traitColors = numbers2colors(datTraits, signed = FALSE);
# Plot the sample dendrogram with a heatmap of the clinical traits underneath
plotDendroAndColors(sampleTree2, traitColors,
                    groupLabels = names(datTraits),
                    main = "Sample dendrogram and trait heatmap")
Figure 2: Clustering dendrogram of samples based on their Euclidean distance.

Automatically build networks and identify modules

Determining suitable soft thresholds: network topology analysis

The soft threshold is the power to which the correlations between genes are raised. The assumption is that raising the correlations to a power reduces the noise of weak correlations in the adjacency matrix. To pick a threshold, use the pickSoftThreshold function, which calculates for each candidate power how closely the resulting network resembles a scale-free graph. The power to use is one for which the network fits a scale-free topology well.

WGCNA: What is soft thresholding? (bioconductor.org)
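
To make "raising the correlation to a power" concrete, the following base-R sketch builds an unsigned adjacency matrix as the absolute correlation raised to a power and shows how the mean connectivity shrinks as the power grows. The data are simulated and the helper soft_adjacency is a hypothetical name introduced here; it illustrates the idea and is not a substitute for pickSoftThreshold.

```r
set.seed(7)
# Made-up data: 30 samples x 200 genes, with one block of co-expressed genes
n_samples <- 30
n_genes <- 200
x <- matrix(rnorm(n_samples * n_genes), nrow = n_samples)
driver <- rnorm(n_samples)
x[, 1:40] <- x[, 1:40] + 2 * driver     # genes 1-40 share a common signal

# Unsigned adjacency: absolute correlation raised to the power beta
soft_adjacency <- function(expr, beta) {
  a <- abs(cor(expr))^beta
  diag(a) <- 0                           # ignore self-connections
  a
}

# A weak and a strong correlation after raising to the power 6
print(c(weak = 0.2, strong = 0.9)^6)

# Mean connectivity (average row sum of the adjacency) for a few powers
betas <- c(1, 6, 12)
mean_k <- sapply(betas, function(beta) mean(rowSums(soft_adjacency(x, beta))))
print(data.frame(beta = betas, mean_connectivity = mean_k))
```

Weak correlations are pushed toward zero much faster than strong ones, which is why higher powers suppress noise but also lower the overall connectivity.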

# Candidate soft-thresholding powers
powers = c(c(1:10), seq(from = 12, to = 20, by = 2))
# Network topology analysis
sft = pickSoftThreshold(datExpr, powerVector = powers, verbose = 5)
# Plot the results
sizeGrWindow(9, 5)
# Arrange the panels in 1 row and 2 columns
par(mfrow = c(1, 2));
cex1 = 0.9;
# Scale-free topology fit index as a function of the soft-thresholding power (left panel)
plot(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     xlab = "Soft Threshold (power)", ylab = "Scale Free Topology Model Fit, signed R^2", type = "n",
     main = paste("Scale independence"));
text(sft$fitIndices[,1], -sign(sft$fitIndices[,3])*sft$fitIndices[,2],
     labels = powers, cex = cex1, col = "red");
# This line marks the R^2 cutoff h
abline(h = 0.90, col = "red")
# Mean connectivity as a function of the soft-thresholding power (right panel)
plot(sft$fitIndices[,1], sft$fitIndices[,5],
     xlab = "Soft Threshold (power)", ylab = "Mean Connectivity", type = "n",
     main = paste("Mean connectivity"))
text(sft$fitIndices[,1], sft$fitIndices[,5], labels = powers, cex = cex1, col = "red")
Figure 3: Network topology analysis for various soft thresholds

We choose power 6, the lowest power at which the scale-free topology fit index curve flattens out upon reaching a high value (here, roughly 0.90).
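
The same choice can be made in code rather than by eye. This sketch assumes a table shaped like sft$fitIndices (columns Power, SFT.R.sq and slope); the numbers are made up for illustration and are not the values from this dataset, and lowest_power_above is a hypothetical helper.

```r
# Illustrative stand-in for sft$fitIndices; the numbers are made up
fit <- data.frame(
  Power    = c(1, 2, 3, 4, 5, 6, 7, 8),
  SFT.R.sq = c(0.03, 0.25, 0.60, 0.78, 0.87, 0.91, 0.92, 0.93),
  slope    = c(0.4, -0.6, -1.1, -1.4, -1.5, -1.6, -1.6, -1.7)
)

# Signed R^2, as plotted on the y axis of the left panel
signed_r2 <- -sign(fit$slope) * fit$SFT.R.sq

# Hypothetical helper: lowest power whose signed R^2 reaches the cutoff
lowest_power_above <- function(power, r2, cutoff = 0.90) {
  ok <- which(r2 >= cutoff)
  if (length(ok) == 0) return(NA_real_)  # no power reaches the cutoff
  min(power[ok])
}

chosen <- lowest_power_above(fit$Power, signed_r2)
print(chosen)
```

pickSoftThreshold also returns its own suggestion in sft$powerEstimate, so a helper like this is mainly useful when you want to apply a different cutoff or inspect the decision explicitly.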

Build network and identify modules in one step

net = blockwiseModules(datExpr, power = 6,
                       TOMType = "unsigned", minModuleSize = 30,
                       reassignThreshold = 0, mergeCutHeight = 0.25,
                       numericLabels = TRUE, pamRespectsDendro = FALSE,
                       saveTOMs = TRUE,
                       saveTOMFileBase = "femaleMouseTOM",
                       verbose = 3)

  • deepSplit: adjusts the sensitivity of module splitting. The larger the value, the more sensitive the splitting and the more modules are obtained. The default is 2;

  • minModuleSize: sets the minimum number of genes in a module. The smaller the value, the smaller the modules that are retained;

  • mergeCutHeight: sets the cut height for merging similar modules. The smaller the value, the less likely modules are to be merged, and the more modules remain.
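
As a rough illustration of mergeCutHeight, the sketch below clusters three made-up module eigengenes by the dissimilarity 1 - correlation and cuts the tree at 0.25, which corresponds to merging modules whose eigengenes correlate at 0.75 or more. This is a simplified picture under those assumptions, not the merging procedure WGCNA runs internally.

```r
set.seed(3)
n <- 40
# Made-up module eigengenes: ME1 and ME2 carry nearly the same signal, ME3 is independent
base <- rnorm(n)
MEs_toy <- data.frame(
  ME1 = base,
  ME2 = base + rnorm(n, sd = 0.2),
  ME3 = rnorm(n)
)

merge_cut_height <- 0.25
me_diss <- 1 - cor(MEs_toy)                       # eigengene dissimilarity
me_tree <- hclust(as.dist(me_diss), method = "average")
merged  <- cutree(me_tree, h = merge_cut_height)  # equal labels = merged together
print(merged)
```

Lowering merge_cut_height demands a higher eigengene correlation before two modules are merged, so more modules survive.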

# See how many modules were identified and how large they are
table(net$colors)
> table(net$colors)

  0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
 99 609 460 409 316 312 221 211 157 123 106 100  94  91  77  76  58  47  34

The output shows 18 modules, labeled 1 to 18 in decreasing order of size, ranging from 609 to 34 genes. Label 0 is reserved for genes that fall outside all modules.

# Visualize the modules
sizeGrWindow(12, 9)
# Convert the numeric labels to colors
mergedColors = labels2colors(net$colors)
# Plot the dendrogram with the module colors underneath
plotDendroAndColors(net$dendrograms[[1]], mergedColors[net$blockGenes[[1]]],
                    "Module colors",
                    dendroLabels = FALSE, hang = 0.03,
                    addGuide = TRUE, guideHang = 0.05)
Figure 4: Clustering dendrogram of genes, with dissimilarity based on topological overlap, together with the assigned module colors

Save module assignments and module eigengene information for subsequent analysis.

moduleLabels = net$colors
moduleColors = labels2colors(net$colors)
MEs = net$MEs;
geneTree = net$dendrograms[[1]];
save(MEs, moduleLabels, moduleColors, geneTree,
     file="FemaleLiver-02-networkConstruction-auto.RData")
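
For readers wondering what a module eigengene is, it can be thought of as the first principal component of a module's expression profiles. The base-R sketch below computes one for a simulated module; it is a simplified illustration under that assumption and does not reproduce every detail of how net$MEs is calculated.

```r
set.seed(11)
n_samples <- 20
signal <- rnorm(n_samples)
# Made-up module: 15 genes that all follow the same signal plus noise
# (rows are samples, columns are genes)
module_expr <- sapply(1:15, function(i) signal + rnorm(n_samples, sd = 0.3))

# First left singular vector of the scaled data: one value per sample
scaled <- scale(module_expr)
eigengene <- svd(scaled)$u[, 1]

# The sign of a singular vector is arbitrary; align it with the mean expression
if (cor(eigengene, rowMeans(scaled)) < 0) eigengene <- -eigengene

print(round(cor(eigengene, signal), 2))
```

Because the eigengene summarizes a whole module in one value per sample, it is the quantity that later gets correlated with the clinical traits.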

Origin blog.csdn.net/weixin_45822007/article/details/121965497