GEO data mining: a full workflow analysis

Disclaimer: The following notes were compiled while studying the free online tutorial series from "Bioinformatics Skill Tree"; the code comes from the GitHub of jimmy, who heads "Bioinformatics Skill Tree". The GEO database mining series is a knowledge-sharing program that started in 2016 on the Bioinformatics Rookie Group blog. The accompanying instructional videos are on Bilibili. All of these are gratefully acknowledged.

Preface: About GEO data

Our goal is to learn GEO data mining by reproducing the analyses in published papers. Start by reading widely. For each paper, work out the context: which GSE data set (or sets) it uses and how the data were processed. Once that is clear, download the corresponding data set, obtain the expression matrix, and run the downstream analyses such as differential expression analysis and annotation.
An article may have one or more GSE data sets, and a GSE contains one or more GSM samples. Several GSM samples can be curated by study subject into a GDS. Each data set has a corresponding chip platform (GPL), and a single GSE may contain data measured on more than one platform.
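To make the accession hierarchy concrete, the relationships can be sketched as a small R table. This is only an illustration: the series and platform accessions are the ones used in this tutorial, but the GSM names are made-up placeholders, not the real sample accessions of GSE42872.

```r
# One row per sample (GSM); each sample belongs to a series (GSE)
# and was measured on a platform (GPL).
# NOTE: the GSM names are hypothetical placeholders for illustration.
samples <- data.frame(
  gsm = paste0("GSM_placeholder_", 1:6),
  gse = "GSE42872",
  gpl = "GPL6244",
  stringsAsFactors = FALSE
)

# Number of samples per series
samples_per_gse <- table(samples$gse)

# Number of distinct platforms per series (a series can have more than one)
platforms_per_gse <- tapply(samples$gpl, samples$gse,
                            function(x) length(unique(x)))

print(samples_per_gse)
print(platforms_per_gse)
```

A series measured on two platforms would simply show two distinct gpl values for the same gse.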
This analysis is carried out in R, so some basic knowledge of the R language is required.
Learn about GEO, expression chips, and R

Part I: Downloading and tidying GEO chip data

GEO's official website
This walkthrough uses the GSE42872 data set as a worked example of GEO data mining, starting from the associated paper.
Find the appropriate GSE data set by reading the literature; the data set's page on the official website gives its details and background:
https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE42872
This data set uses the GPL6244 chip platform and consists of six samples: the first three are controls and the last three are treated. Once you understand the background of the data set, the next step is to download the data. There are several ways to do this; here we use the R package GEOquery. The download code is as follows:

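rm(list = ls())  
## Clear the workspace: removes every object in the current environment
options(stringsAsFactors = F)
# Keep character columns as character instead of converting them to factors when data frames are created (already the default in R >= 4.0)
f='GSE42872_eSet.Rdata'
# Store the file name in f so the rest of the workflow can reuse it
## Change the GSE accession here to match your own data set

library(GEOquery)
# Two options matter for this package; the automatic defaults are usually sufficient.
#Setting options('download.file.method.GEOquery'='auto')
#Setting options('GEOquery.inmemory.gpl'=FALSE)

if(!file.exists(f)){
  gset <- getGEO('GSE42872', destdir=".",
                 AnnotGPL = F,     ## do not download the annotation file
                 getGPL = F)       ## do not download the platform file
  save(gset,file=f)   ## save locally
}
## getGEO() downloads the expression data for the series and assigns it to gset, without the annotation or platform information; the result is then saved locally under the file name stored in f.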

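load('GSE42872_eSet.Rdata') 
 ## Load the saved data
class(gset) 
 # Check the class of the object
length(gset) 
 ## Check how many elements it contains
gset[[1]]
# Extract the first element
class(gset[[1]])
 # Check the class of that element
# This GEO series uses only one GPL platform, so the download is a list containing a single element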

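a=gset[[1]] 
## Assign the first element to an object a
dat=exprs(a) 
# a is an ExpressionSet object; per the documentation, exprs() extracts its expression matrix
# dat is now an expression matrix whose row IDs are still probe names
dim(dat)
# Check the dimensions of dat
dat[1:5,1:5] 
# Show rows 1 to 5 and columns 1 to 5 of dat; rows come before the comma, columns after it
# This expression matrix is already log-transformed: log2-scale values usually lie roughly between 0 and 16. Raw chip signal intensities typically run from the thousands into the tens of thousands and would need a log transform first.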
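Whether a matrix still needs a log transform can also be checked numerically rather than by eye. The sketch below applies a simple quantile heuristic to simulated data; the helper name needs_log2 and the cutoff of 100 are assumptions chosen for illustration and are not part of GEOquery.

```r
# Hypothetical helper: guess whether an expression matrix is still on the
# raw intensity scale. The cutoff (99th percentile > 100) is an assumption.
needs_log2 <- function(mat, cutoff = 100) {
  q99 <- quantile(mat, 0.99, na.rm = TRUE)
  unname(q99 > cutoff)
}

set.seed(1)
# Simulated raw intensities (tens up to tens of thousands)
raw_mat <- matrix(2^runif(60, min = 5, max = 14), nrow = 10)
# The same data on the log2 scale
log_mat <- log2(raw_mat)

needs_log2(raw_mat)  # raw scale: transform needed
needs_log2(log_mat)  # already log2: leave as is

# Apply the transform only when the heuristic says so
dat_checked <- if (needs_log2(raw_mat)) log2(raw_mat + 1) else raw_mat
```

On real data the same check would be run on dat before any downstream analysis.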

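boxplot(dat,las=2) 
# Plot the per-sample distributions to check for batch effects; the medians should be roughly equal across samples. las=2 draws the sample labels perpendicular to the axis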

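pd=pData(a) 
# Per the documentation, pData() extracts the sample (phenotype) information stored in a
View(pd)
## Browse it and pick the phenotype fields of interest; here we want the group information in the title column.
library(stringr)
# Load the stringr package for string splitting
group_list=str_split(pd$title,' ',simplify = T)[,4]
# Split the title column on spaces and take the fourth field, i.e. Control or Vemurafenib
table(group_list)
# Count the samples in each group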
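The splitting step is easy to get wrong when titles follow a different pattern, so it helps to try it on a few strings first. The sketch below uses base R only; the titles are hypothetical strings written to match the "group is the fourth space-separated field" layout described above, so check the real pd$title values before relying on the index 4.

```r
# Hypothetical sample titles; the real ones are in pd$title
titles <- c("A375 cells 24h Control rep1",
            "A375 cells 24h Control rep2",
            "A375 cells 24h Control rep3",
            "A375 cells 24h Vemurafenib rep1",
            "A375 cells 24h Vemurafenib rep2",
            "A375 cells 24h Vemurafenib rep3")

# Base R equivalent of str_split(titles, ' ', simplify = TRUE)[, 4]
fields <- strsplit(titles, " ", fixed = TRUE)
group_demo <- vapply(fields, function(x) x[4], character(1))

# Make the control group the reference level for later modelling
group_demo <- factor(group_demo, levels = c("Control", "Vemurafenib"))
table(group_demo)
```

Setting the factor levels explicitly is a design choice: it fixes which group is treated as the baseline in later comparisons.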

GPL6244 is the chip platform for this data set. First check whether a matching annotation package exists for the platform; if it does, install that package directly and skip the GPL download code further down.
## Platform-to-package lookup table: http://www.bio-info-trainee.com/1399.html
## The package corresponding to GPL6244 is hugene10sttranscriptcluster (loaded in R as hugene10sttranscriptcluster.db)
## We therefore prefer to use this package to map probes to gene names; install it from Bioconductor.
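A minimal installation sketch, assuming a working internet connection and that BiocManager is the preferred installer on your system:

```r
# Install the annotation package from Bioconductor only if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE)) {
  install.packages("BiocManager")
}
if (!requireNamespace("hugene10sttranscriptcluster.db", quietly = TRUE)) {
  BiocManager::install("hugene10sttranscriptcluster.db")
}
```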
If you have installed the annotation package for the chip platform, run the following:

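library(hugene10sttranscriptcluster.db)
## Load the annotation package
ls("package:hugene10sttranscriptcluster.db")
# List the objects the package provides
## The object whose name ends in SYMBOL holds the mapping we need
## For example: [34] "hugene10sttranscriptclusterSYMBOL"    
ids=toTable(hugene10sttranscriptclusterSYMBOL) 
# toTable(): per the annotation package documentation, this converts the probe_id (probe name) to symbol (gene name) mapping into a data frame, which we assign to ids
## This gives gene names for 19825 probes.
### The expression matrix has 33297 probes, so many probes have no gene symbol, and several probes may map to the same gene. The data therefore need further filtering and processing.
head(ids) 
# head() shows the first six rows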

If no R annotation package can be found for the chip platform, download the platform's annotation information from GEO instead, as follows:

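if(F){
  ## This block is switched off; change F to T to run the download once
  library(GEOquery)
  #Download GPL file, put it in the current directory, and load it:
  gpl <- getGEO('GPL6244', destdir=".")
  ## Change the platform accession as needed; the platform information is assigned to gpl
  colnames(Table(gpl))  
  head(Table(gpl)[,c(1,15)]) 
  ## Check which columns you need: the indices depend on the platform table
  probe2gene=Table(gpl)[,c(1,15)]
  head(probe2gene)
  library(stringr)  
  save(probe2gene,file='probe2gene.Rdata')
}
load(file='probe2gene.Rdata')
# Requires probe2gene.Rdata, which the block above saves
ids=probe2gene 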

We now have the probe-to-gene mapping for the GPL platform, obtained by one of the two routes above.
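The GPL route picks columns by position (c(1,15)), which silently breaks when a platform table is laid out differently. One possible alternative is to select the columns by name. The sketch below runs on a toy stand-in for Table(gpl); its column names are hypothetical, so inspect colnames(Table(gpl)) on the real table first.

```r
# Toy stand-in for Table(gpl); the column names are hypothetical and will
# differ between platforms.
gpl_table <- data.frame(
  ID = c("p1", "p2", "p3"),
  Description = c("x", "y", "z"),
  GeneSymbol = c("TP53", "", "BRAF"),
  stringsAsFactors = FALSE
)

# Find the symbol column by pattern instead of by a fixed index
symbol_col <- grep("symbol", colnames(gpl_table),
                   ignore.case = TRUE, value = TRUE)
stopifnot(length(symbol_col) == 1)  # fail loudly if the match is ambiguous

# Build the two-column probe-to-gene table with standard names
probe2gene_demo <- gpl_table[, c("ID", symbol_col)]
colnames(probe2gene_demo) <- c("probe_id", "symbol")
probe2gene_demo
```

Some platforms store the symbol inside a longer annotation string, in which case an extra parsing step is needed after selecting the column.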

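head(ids)
# Show the first six rows
colnames(ids)=c('probe_id','symbol')  
# Rename the columns to 'probe_id' and 'symbol' so the later steps work the same way for both routes.

length(unique(ids$symbol)) 
# [1] 18832 unique gene symbols, so some of the 19825 probes map to the same gene
tail(sort(table(ids$symbol)))
# The gene symbols with the most probes
table(sort(table(ids$symbol)))
# How many gene symbols have 1, 2, 3, ... probes
plot(table(sort(table(ids$symbol))))
# Plot that distribution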

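ids=ids[ids$symbol != '',]
# Drop probes with an empty gene symbol
ids=ids[ids$probe_id %in%  rownames(dat),]
## %in% tests membership: keep only the rows of ids whose probe is present in dat and drop the rest.

dat[1:5,1:5]   
dat=dat[ids$probe_id,] 
# Keep only the rows of the expression matrix whose probe has a gene symbol; this leaves the 19825 annotated probes, and the rest are discarded.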

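ids$median=apply(dat,1,median) 
# Add a column named median to ids: apply() works row by row on dat and returns the median expression of each probe
ids=ids[order(ids$symbol,ids$median,decreasing = T),]
# Sort ids by symbol and, within each symbol, by median from largest to smallest
## so that the first row for each symbol is the probe with the highest median, which is the one kept below.
## Indexing ids with the result of order() reorders the rows, so order() here does the job of sort()
ids=ids[!duplicated(ids$symbol),]
# duplicated() flags repeated symbols and '!' negates it, so only the first row per symbol is kept: the probe with the highest median expression for each gene. This leaves 18832 genes.
dat=dat[ids$probe_id,] 
# Subset dat with the probe_id column of the new ids, giving one row per gene
rownames(dat)=ids$symbol
# Use the symbol column of ids as the row names of dat
head(dat)
## This is the final expression matrix for the data set; save the results.
save(dat,group_list,file = 'step1-output.Rdata')
write.csv(dat,file="expressionmetrix_GSE.csv")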
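The collapsing step (several probes per gene reduced to the probe with the highest median) is easier to follow on a tiny example. The sketch below repeats the same sequence of operations on made-up data; all probe, sample, and gene names are hypothetical.

```r
# Toy expression matrix: 5 probes x 3 samples
dat_demo <- matrix(c(1, 2, 3,
                     7, 8, 9,
                     4, 5, 6,
                     2, 2, 2,
                     5, 5, 5),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(paste0("p", 1:5), paste0("s", 1:3)))

# Toy probe-to-symbol map: GENE_A has three probes, p5 is unannotated
ids_demo <- data.frame(probe_id = c("p1", "p2", "p3", "p4"),
                       symbol   = c("GENE_A", "GENE_A", "GENE_A", "GENE_B"),
                       stringsAsFactors = FALSE)

# Keep annotated probes only, in the same order as ids_demo
dat_demo <- dat_demo[ids_demo$probe_id, ]

# Per-probe median, then sort so the highest-median probe comes first per gene
ids_demo$median <- apply(dat_demo, 1, median)
ids_demo <- ids_demo[order(ids_demo$symbol, ids_demo$median,
                           decreasing = TRUE), ]
ids_demo <- ids_demo[!duplicated(ids_demo$symbol), ]

# Rebuild the matrix with one row per gene
dat_demo <- dat_demo[ids_demo$probe_id, , drop = FALSE]
rownames(dat_demo) <- ids_demo$symbol
dat_demo
```

Here GENE_A is represented by probe p2, the probe with the largest median among its three probes, and the unannotated probe p5 is dropped.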

Part II: Analysis of differential expression

Unfinished, to be updated

Origin blog.csdn.net/Eric_blog/article/details/104580174