Differential gene analysis of RNA-seq data with the DESeq2 package

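`## Input: a matrix of raw, un-normalized counts, not FPKM, RPKM, TPM or other normalized values.
## DESeq2 stores its data in a DESeqDataSet object.
## A DESeqDataSet can be built in two ways: from a count matrix you import yourself, or from a SummarizedExperiment object.
#### Build from a SummarizedExperiment
library("airway")
data("airway")
se <- airway
library("DESeq2")
dds <- DESeqDataSet(se, design = ~ cell + dex)  ## dex is the condition (levels trt and untrt)
dds
#### Build from a count matrix you import yourself
condition <- factor(c(rep("liver",5),rep("lung",8),rep("bone",8),rep("kidney",4)))  ## define the group labels first
condition
## R uses the first factor level as the reference, so the reference group should come first
condition <- relevel(condition,"kidney")  ### make kidney the reference level`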
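To make the level handling concrete, here is a minimal base-R sketch using the same group sizes as above. The `sub` variable is a hypothetical subset introduced only to show what `droplevels` does; nothing here needs DESeq2.

```r
# Group labels with the same sizes as in the notes above
condition <- factor(c(rep("liver", 5), rep("lung", 8), rep("bone", 8), rep("kidney", 4)))

# Without explicit levels, R sorts them alphabetically
levels(condition)

# relevel() moves the chosen reference to the first position
condition <- relevel(condition, ref = "kidney")
levels(condition)

# Hypothetical subset: removing all bone samples leaves an unused level behind
sub <- condition[condition != "bone"]
levels(sub)              # "bone" is still listed
levels(droplevels(sub))  # unused level removed
```

The first level is what a `~ condition` design treats as the baseline, which is why the reference group is moved to the front before building the dataset.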

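 `## Build the dds object (exprSet is your own raw count matrix, genes in rows and samples in columns)
colData <- data.frame(row.names=colnames(exprSet),
                      condition=condition)
colData
dds <- DESeqDataSetFromMatrix(countData = exprSet,
                              colData = colData,
                              design = ~ condition)
dds
## The design is stored in the object and can be inspected or changed with design(dds)

#### Pre-filtering: drop rows that have only 0 or 1 reads in total
dds <- dds[rowSums(counts(dds)) > 1, ]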

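#### Mind the factor levels
## Generic example for a two-level untreated/treated design; with the tissue data above, use its own level names instead
dds$condition <- factor(dds$condition, levels=c("untreated","treated"))
dds$condition <- relevel(dds$condition, ref="untreated")
## If some samples were removed, drop the now-unused levels with droplevels
dds$condition <- droplevels(dds$condition)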

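#### Differential expression analysis with the DESeq function
dds <- DESeq(dds)
res <- results(dds)
res  ## the header shows which comparison was tested
## res1 is assumed here to be the liver vs kidney comparison (inferred from the export file name further down)
res1 <- results(dds, contrast=c("condition","liver","kidney"))
mcols(res1, use.names=TRUE)  ## column metadata, including the comparison tested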

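## Extract the differential expression results
resOrdered <- res[order(res$padj),]  ## sort by adjusted p-value
summary(res)  ## overview of the results
sum(res$padj < 0.05, na.rm=TRUE)  ## number of genes with adjusted p-value below 0.05
res05 <- results(dds, alpha=0.05)  ## alpha is the adjusted p-value (FDR) cutoff that independent filtering is tuned for
summary(res05)
## Filter by p-value
res1.05 <- results(dds, alpha=0.05)
table(res1.05$padj < 0.05)
table(res1.05$pvalue < 0.05)
## Filter by log2 fold change
res1LFC <- results(dds, lfcThreshold=1)
table(res1LFC$padj < 0.05)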

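#### Convert Ensembl IDs to gene symbols
### Annotation: map Ensembl IDs to another identifier type
library("AnnotationDbi")
library("org.Hs.eg.db")
columns(org.Hs.eg.db)
res1$symbol <- mapIds(org.Hs.eg.db,
                      keys=row.names(res1),
                      column="SYMBOL",
                      keytype="ENSEMBL",
                      multiVals="first")
### Export the differential genes
write.csv(as.data.frame(res1), file="liver_kidney01.csv")
### Open question: filtering through results() in DESeq2 does not give the same gene set as filtering the exported table in Excel.
### results() re-runs independent filtering (and, with lfcThreshold, a different test), whereas Excel only subsets the exported columns.`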
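One likely contributor to the mismatch noted above, offered as an assumption rather than a diagnosis of this dataset, is that adjusted p-values depend on which genes enter the multiple-testing correction. The sketch below uses made-up p-values and base R's `p.adjust` to show that dropping a few genes before Benjamini-Hochberg adjustment changes how many pass the same cutoff.

```r
# Hypothetical raw p-values for eight genes (illustrative numbers only)
p <- c(0.001, 0.008, 0.02, 0.045, 0.2, 0.5, 0.7, 0.9)

# Adjust over all eight genes
padj_all <- p.adjust(p, method = "BH")

# Pretend the last three genes were filtered out before adjustment,
# so only five tests are corrected
padj_kept <- p.adjust(p[1:5], method = "BH")

n_all  <- sum(padj_all  < 0.05)   # genes passing when all eight are adjusted
n_kept <- sum(padj_kept < 0.05)   # genes passing when only five are adjusted
c(all = n_all, kept = n_kept)
```

Subsetting an already-exported `padj` column in a spreadsheet never redoes this adjustment, so the two routes need not agree.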

<footer class="entry-meta" style="display: block; color: rgb(102, 102, 102); clear: both; font-size: 12px; line-height: 18px;">Posted in [RNA-seq](https://medliu.wordpress.com/category/%e7%94%9f%e4%bf%a1%e7%ac%94%e8%ae%b0/rna-seq/) | Tagged [limma](https://medliu.wordpress.com/tag/limma/), [RNA-seq](https://medliu.wordpress.com/tag/rna-seq/), [differential genes](https://medliu.wordpress.com/tag/%e5%b7%ae%e5%bc%82%e5%9f%ba%e5%9b%a0/)</footer> </article>

<article id="post-46" class="post-46 post type-post status-publish format-standard hentry category-rna-seq" style="display: block; border-bottom: 1px solid rgb(221, 221, 221); margin: 0px 0px 1.625em; padding: 0px 0px 1.625em; position: relative;">

<header class="entry-header" style="display: block;">

# [Analyzing RNA-seq data with the limma package](https://medliu.wordpress.com/2018/01/19/%e8%bf%90%e7%94%a8limma%e5%8c%85%e5%88%86%e6%9e%90rna-seq%e6%95%b0%e6%8d%ae/)

Posted on [<time class="entry-date" datetime="2018-01-19T18:21:23+00:00">January 19, 2018</time>](https://medliu.wordpress.com/2018/01/19/%e8%bf%90%e7%94%a8limma%e5%8c%85%e5%88%86%e6%9e%90rna-seq%e6%95%b0%e6%8d%ae/ "6:21 PM")

</header>

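`### Reference: limma User's Guide. In essence, the sequencing data are handled much like microarray data.
#### Note: read counts are converted to log2-counts-per-million (logCPM)
#### Two approaches: "voom" and "limma-trend"
## 2 - filtering and normalization
library(limma)
library(edgeR)
## Assumed filter (the original line was truncated): keep genes with CPM > 10 in at least 2 samples
exprSet <- exprSet[rowSums(cpm(exprSet) > 10) >= 2,]
exprSet <- DGEList(counts = exprSet)
exprSet <- calcNormFactors(exprSet)`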

## Grouping factor
condition <- factor(c(rep("liver",5),rep("lung",8),rep("bone",8),rep("kidney",4)))
condition
design <- model.matrix(~ 0 + condition)
design
colnames(design) <- levels(condition)  ## columns follow the factor levels (alphabetical: bone, kidney, liver, lung)
contrast.matrix <- makeContrasts(liver-kidney,lung-kidney,bone-kidney,levels = design)
contrast.matrix
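Because `model.matrix` orders the columns by factor level rather than by order of appearance, it is worth checking that each column label matches the samples it marks. The following base-R sketch uses the group sizes from these notes; the `liver_vs_kidney` vector is a hand-written illustration of what one contrast looks like, not output from `makeContrasts`.

```r
# Same grouping as above: 5 liver, 8 lung, 8 bone, 4 kidney samples
condition <- factor(c(rep("liver", 5), rep("lung", 8), rep("bone", 8), rep("kidney", 4)))

# One indicator column per group, no intercept
design <- model.matrix(~ 0 + condition)

# Take the labels from the factor itself so they cannot get out of order
colnames(design) <- levels(condition)

# Each column sum should equal the number of samples in that group
group_sizes <- colSums(design)
group_sizes

# A liver-minus-kidney contrast written by hand, in the column order of design
liver_vs_kidney <- c(bone = 0, kidney = -1, liver = 1, lung = 0)
```

If the labels were typed in sample order instead (liver, lung, bone, kidney), the column marking the 8 bone samples would be called "liver" and every contrast would compare the wrong groups.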

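## 3-1 differential expression (limma-trend approach)
#### Note: recommended when sequencing depth is similar across samples (largest library no more than about 3 times the smallest)
logCPM <- cpm(exprSet, log=TRUE, prior.count=3)
fit <- lmFit(logCPM, design)
fit <- contrasts.fit(fit, contrast.matrix)  ## apply the contrasts defined earlier
fit <- eBayes(fit, trend=TRUE)

## 3-2 differential expression (voom)
#### Note: recommended when sequencing depth differs substantially between samples
v <- voom(exprSet, design, plot=TRUE)  ## voom transformation
fit <- lmFit(v, design)
fit <- contrasts.fit(fit, contrast.matrix)  ## apply the contrasts defined earlier
fit <- eBayes(fit)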
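The rule of thumb in the two notes above (limma-trend when library sizes differ by no more than about 3-fold, voom otherwise) can be checked directly from the count matrix. This is a small sketch with hypothetical helper names (`lib_size_ratio`, `suggest_approach`) and toy matrices; the 3-fold threshold is the one quoted in the notes, not a hard limit.

```r
# Ratio of the largest to the smallest library size (column sum)
lib_size_ratio <- function(counts) {
  lib <- colSums(counts)
  max(lib) / min(lib)
}

# Hypothetical helper applying the rule of thumb from the notes
suggest_approach <- function(counts, max_ratio = 3) {
  if (lib_size_ratio(counts) <= max_ratio) "limma-trend" else "voom"
}

# Toy count matrices: 2 genes x 3 samples
even   <- matrix(c(100, 200, 150, 250, 120, 230), nrow = 2)    # library sizes 300, 400, 350
uneven <- matrix(c(100, 200, 1000, 2000, 120, 230), nrow = 2)  # library sizes 300, 3000, 350

suggest_approach(even)
suggest_approach(uneven)
```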

## 4 extract differentially expressed genes
dif1 <- topTable(fit, coef=1, number=Inf, adjust.method="BH")  ## coef=1 selects the first coefficient or contrast in fit


</article>

<article id="post-42" class="post-42 post type-post status-publish format-standard hentry category-654023889 category-rna-seq tag-603246210 tag-limma tag-rna-seq tag-603246208" style="display: block; border-bottom: none; margin: 0px 0px 1.625em; padding: 0px 0px 1.625em; position: relative;">

<header class="entry-header" style="display: block;">

# [RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR](https://medliu.wordpress.com/2018/01/19/rna-seq-analysis-is-easy-as-1-2-3-with-limma-glimma-and-edger/)

Posted on [<time class="entry-date" datetime="2018-01-19T18:06:00+00:00">January 19, 2018</time>](https://medliu.wordpress.com/2018/01/19/rna-seq-analysis-is-easy-as-1-2-3-with-limma-glimma-and-edger/ "6:06 PM")

</header>

#### Learning to use the limma, Glimma and edgeR packages together
#### 2017-09-30 RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR
#### edgeR-limma workflow, with Glimma for interactive graphics
#### Data source: GSE63310, Illumina HiSeq 2000 (Mus musculus)
#### Data information: raw count data, 11 samples; only 3 basal, 3 LP and 3 ML samples are used
#### Note: only these 9 samples are used in the analysis below

#source("http://bioconductor.org/biocLite.R")
#options(BioC_mirror="http://mirrors.ustc.edu.cn/bioc/")  ## use the USTC mirror
#biocLite("org.Mm.eg.db", lib="D:/R/R-3.4.1/library")
## org.Mm.eg.db: mouse annotation package
library(limma)
library(edgeR)
setwd("D:/R/GSE63310_RAW")

#### 1 - Data import (the files must be decompressed first)
files <- c("GSM1545535_10_6_5_11.txt", "GSM1545536_9_6_5_11.txt",
"GSM1545538_purep53.txt","GSM1545539_JMS8-2.txt",
"GSM1545540_JMS8-3.txt","GSM1545541_JMS8-4.txt",
"GSM1545542_JMS8-5.txt","GSM1545544_JMS9-P7c.txt",
"GSM1545545_JMS9-P8c.txt")
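read.delim(files[1], nrow=5)  ## read the first file to inspect its structure
x <- readDGE(files, columns=c(1,3))  ## columns gives the gene ID column and the count column
class(x)  ## a DGEList object has been created
dim(x)  ## check the dimensions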

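#### 2 - Organising sample information
#### Consider every experimental variable that may matter, both biological and technical, e.g. cell type (basal, LP, ML),
#### sample phenotype (sex, age, disease status), treatment, and batch information
samplenames <- substring(colnames(x), 12, nchar(colnames(x)))  ## keep characters from position 12 onward; nchar() is vectorized
samplenames
colnames(x) <- samplenames  ## use the sample names as column names
## Group labels follow; their order must match the sample order
group <- as.factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
group  ## without explicit levels, factor levels are sorted alphabetically
x$samples$group <- group
lane <- as.factor(rep(c("L004","L006","L008"), c(3,4,2)))  ## note this way of building a factor with rep()
x$samples$lane <- lane
x$samples
library(Mus.musculus)  ## mouse gene annotation package
geneid <- rownames(x)
genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),
                keytype="ENTREZID")
head(genes)  ## the annotation has been retrieved
dim(genes)
#### Note: many ID-to-gene mappings are not one-to-one, so checking for duplicated gene IDs is important
genes <- genes[!duplicated(genes$ENTREZID),]  ## remove duplicated gene IDs
dim(genes)  ## shows that more than 100 duplicated IDs were removed
x$genes <- genes
#### Note: a DGEList object has a genes data frame for storing annotation
x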

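#### 3 - Data preprocessing
#### Raw counts are usually transformed before differential expression analysis so that values are comparable across samples
#### Common transformations are CPM, log-CPM, RPKM and FPKM; CPM and log-CPM adjust for sequencing depth, while RPKM and FPKM also adjust for gene length
cpm <- cpm(x)
lcpm <- cpm(x, log=TRUE)
keep <- rowSums(cpm > 1) >= 3  ## CPM > 1 separates expressed from unexpressed genes; require it in at least 3 samples
x <- x[keep,, keep.lib.sizes=FALSE]
dim(x)  ## about half of the genes have been removed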
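To see what the CPM filter above is doing, here is a base-R sketch that computes counts per million by hand on a made-up 4-gene, 4-sample matrix and applies the same "CPM > 1 in at least 3 samples" rule. It ignores normalization factors, so it mirrors `cpm()` only for unnormalized data; all values and names are illustrative.

```r
# Toy count matrix: genes in rows, samples in columns (made-up numbers)
counts <- matrix(c(  0,   0,   0,   1,
                    10,  12,   0,   9,
                   500, 480, 510, 495,
                     0,   0,   3,   0),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(paste0("gene", 1:4), paste0("s", 1:4)))

# Library size = total counts per sample
lib.size <- colSums(counts)

# CPM: divide each column by its library size, then scale to one million
cpm_manual <- t(t(counts) / lib.size) * 1e6

# Keep genes with CPM > 1 in at least 3 samples
keep <- rowSums(cpm_manual > 1) >= 3
keep
counts[keep, , drop = FALSE]
```

With real data the library sizes are in the millions, so CPM > 1 corresponds to a handful of reads; in this toy matrix the libraries are tiny, so any nonzero count passes and the filter reduces to "nonzero in at least 3 samples".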
## Plot the distributions of the raw and the filtered data
##################################################################
library(RColorBrewer)
nsamples <- ncol(x)
col <- brewer.pal(nsamples, "Paired")  ## choose a color palette
par(mfrow=c(1,2))
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2,
main="", xlab="")
title(main="A. Raw data", xlab="Log-cpm")
abline(v=0, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n")
lcpm <- cpm(x, log=TRUE)
plot(density(lcpm[,1]), col=col[1], lwd=2, ylim=c(0,0.21), las=2,
main="", xlab="")
title(main="B. Filtered data", xlab="Log-cpm")
abline(v=0, lty=3)
for (i in 2:nsamples){
den <- density(lcpm[,i])
lines(den$x, den$y, col=col[i], lwd=2)
}
legend("topright", samplenames, text.col=col, bty="n")
###################################################################

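#### 4 - Normalising gene expression distributions
#### External factors during sample processing, such as batch effects, can shift the expression distribution of individual samples
#### Normalization aims to make the expression distributions of all samples similar
#### For a DGEList object, normalize with the TMM method;
#### the normalization factors are stored automatically in x$samples$norm.factors
x <- calcNormFactors(x, method = "TMM")
x$samples$norm.factors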
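The normalization factors do not change the counts themselves; edgeR uses them to rescale each library size into an effective library size (library size times normalization factor) when computing CPM. The sketch below works through that arithmetic in base R. The counts and the two factors are hypothetical values chosen for easy arithmetic, not output from `calcNormFactors`.

```r
# Toy counts: 4 genes x 2 samples (made-up numbers)
counts <- matrix(c(10, 200, 0,  790,
                   30, 400, 5, 1565),
                 nrow = 4,
                 dimnames = list(paste0("gene", 1:4), c("s1", "s2")))

lib.size <- colSums(counts)               # raw library sizes: 1000 and 2000

# Hypothetical TMM-style factors; real factors are scaled so their product is 1
norm.factors <- c(s1 = 1.25, s2 = 0.8)

# Effective library size used in place of the raw library size
eff.lib.size <- lib.size * norm.factors   # 1250 and 1600

# CPM computed against the effective library sizes
cpm_norm <- t(t(counts) / eff.lib.size) * 1e6
cpm_norm
```

A factor above 1 enlarges the effective library and so lowers that sample's CPM values; a factor below 1 does the opposite.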

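#### 5 - Unsupervised clustering of samples
#### Unsupervised clustering with a multidimensional scaling (MDS) plot, used for quality control
#### Ideally, similar samples cluster together; outlying samples can be spotted before the formal differential expression analysis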

############################################################################
par(mfrow=c(1,2))
col.group <- group
levels(col.group) <- brewer.pal(nlevels(col.group), "Set1")  ## choose a color palette
col.group <- as.character(col.group)
col.lane <- lane
levels(col.lane) <- brewer.pal(nlevels(col.lane), "Set2")
col.lane <- as.character(col.lane)
plotMDS(lcpm, labels=group, col=col.group)  ## colored by group
title(main="A. Sample groups")
##
plotMDS(lcpm, labels=lane, col=col.lane, dim=c(3,4))  ## colored by sequencing lane (batch)
title(main="B. Sequencing lanes")
library(Glimma)
glMDSPlot(lcpm, labels=paste(group, lane, sep="_"), groups=x$samples[,c(2,5)],
launch=FALSE)
##############################################################################
#### 6 - Differential expression analysis
## Build the design matrix
design <- model.matrix(~0+group+lane)
design
colnames(design) <- gsub("group", "", colnames(design))  ## strip the "group" prefix
design
## Contrast matrix
contrast.matrix <- makeContrasts(BasalvsLP = Basal - LP,
                                 BasalvsML = Basal - ML,
                                 LPvsML = LP - ML,
                                 levels = colnames(design))
contrast.matrix
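For reference, the column layout that `~0+group+lane` produces can be reproduced in base R with the same `group` and `lane` factors, without loading limma. The `basal_vs_lp` vector is a hand-written illustration of the first contrast; it is only meant to show that the lane columns get a weight of zero, so lane is adjusted for but not compared.

```r
# Same sample annotation as in the notes above
group <- factor(c("LP", "ML", "Basal", "Basal", "ML", "LP", "Basal", "ML", "LP"))
lane  <- factor(rep(c("L004", "L006", "L008"), c(3, 4, 2)))

# No intercept: one column per group, plus lane effects relative to L004
design <- model.matrix(~ 0 + group + lane)
colnames(design) <- gsub("group", "", colnames(design))
colnames(design)

# Basal minus LP written as a weight per design column
basal_vs_lp <- c(Basal = 1, LP = -1, ML = 0, laneL006 = 0, laneL008 = 0)
```

The first lane level (L004) has no column of its own because it is absorbed into the group columns as the baseline.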

#### Removing heteroscedasticity from count data
## voom produces an EList object
v <- voom(x, design, plot=TRUE)
v
####Fitting linear models for comparisons of interest
vfit <- lmFit(v, design)
vfit <- contrasts.fit(vfit, contrasts=contrast.matrix)
efit <- eBayes(vfit)
plotSA(efit)

#### Examining the number of DEGs
## The default adjusted p-value cutoff is 0.05
summary(decideTests(efit))
## treat method: test against a minimum log-fold-change for a stricter selection
tfit <- treat(vfit, lfc=1)
dt <- decideTests(tfit)
summary(dt)  ## 0 marks genes that are not differentially expressed
de.common <- which(dt[,1]!=0 & dt[,2]!=0)  ## genes DE in both of the first two comparisons
length(de.common)
head(tfit$genes$SYMBOL[de.common], n=20)
vennDiagram(dt[,1:2], circle.col=c("turquoise", "salmon"))
write.fit(tfit, dt, file="results.txt")
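The decision matrix returned by `decideTests` codes each gene as -1 (down), 0 (not significant) or 1 (up) per comparison, which is what makes the intersection above a one-liner. Here is a base-R sketch on a made-up 5-gene matrix with the same column names; the values are purely illustrative.

```r
# Toy decision matrix in the -1 / 0 / 1 coding (made-up values)
dt <- matrix(c( 1,  1,  0,
               -1, -1,  1,
                0,  1,  1,
                1,  0, -1,
                0,  0,  0),
             ncol = 3, byrow = TRUE,
             dimnames = list(paste0("gene", 1:5),
                             c("BasalvsLP", "BasalvsML", "LPvsML")))

# Genes that are DE (in either direction) in both of the first two comparisons
de.common <- which(dt[, 1] != 0 & dt[, 2] != 0)
names(de.common)

# Up and down counts per comparison, similar in spirit to summary(dt)
up   <- colSums(dt == 1)
down <- colSums(dt == -1)
rbind(down, up)
```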

####
basal.vs.lp <- topTreat(tfit, coef=1, n=Inf)  # n=Inf returns all genes
basal.vs.ml <- topTreat(tfit, coef=2, n=Inf)  # topTreat returns the genes already sorted
head(basal.vs.lp)

#### Draw a heatmap (something I still need to study further)
library(gplots)
dif1.topgenes <- basal.vs.lp$ENTREZID[1:20]  ## number of top genes to plot
i <- which(v$genes$ENTREZID %in% dif1.topgenes)
mycol <- colorpanel(1000,"blue","white","red")
heatmap.2(v$E[i,], scale="row",
          labRow=v$genes$SYMBOL[i], labCol=group,
          col=mycol, trace="none", density.info="none",
          margins=c(8,6), lhei=c(2,10), dendrogram="column")
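The `scale="row"` argument means each gene is centered and scaled across samples before coloring, so the heatmap shows relative rather than absolute expression. A minimal base-R sketch of that row-wise z-score on a made-up 2-gene matrix:

```r
# Two toy genes with very different absolute levels but the same pattern
m <- matrix(c( 1,  2,  3,
              10, 20, 30),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), c("s1", "s2", "s3")))

# scale() works on columns, so transpose, scale, and transpose back
z <- t(scale(t(m)))

# After row scaling both genes have mean 0 and standard deviation 1
zA <- as.numeric(z["geneA", ])
zB <- as.numeric(z["geneB", ])
zA
zB
```

Both rows end up identical after scaling, which is why a row-scaled heatmap groups genes by expression pattern and not by how highly they are expressed.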

Reposted from blog.csdn.net/weixin_34025051/article/details/86875605