The GTEx database: a good helper for TCGA data mining

When mining the TCGA database, we usually find that the project includes very few normal tissue sequencing results, which means that many patients have no transcriptome data from their own normal tissue. For breast cancer, for example, there are about 1,200 transcriptome profiles, of which about 1,100 are from tumor tissue and only about 100 are normal controls.

At this point, we need to find ways to increase the number of normal tissue samples. Since TCGA itself cannot supply them, we turn to other databases.

The one highly recommended here is the GTEx database, Genotype-Tissue Expression (GTEx).

Background knowledge

Phase I

In 2015, GTEx released its first-phase results, publishing three papers in Science at once; they were also featured on the cover. The GTEx study collected 1,641 autopsy samples from 175 deceased donors. These samples came from 54 different body sites, and the study examined the expression patterns of almost all transcribed genes to identify the specific regions of the genome that affect gene expression. Of the other two articles, one describes gene expression profiles across human tissues, showing that certain tissue-specific genes often determine the regulation of tissue-specific gene expression; the other explains how protein-truncating variants affect gene expression in tissues.

Phase II

In 2017, four papers were published in Nature at once. The GTEx Consortium collected and studied more than 7,000 autopsy samples from 449 previously healthy donors, covering 44 tissues (42 distinct tissue types), including 31 solid-organ tissues, 10 brain subregions, whole blood, and two cell lines derived from donor blood and skin. The authors used these samples to study how gene expression differs among tissues and individuals. The papers titled "Landscape of X chromosome inactivation across human tissues" and "Dynamic landscape and regulation of RNA editing in mammals" used GTEx data to explore how genetic variants associated with gene expression can regulate RNA editing and X chromosome inactivation.

Introduction to database content

Usually you go directly to https://gtexportal.org/ to find the downloadable data sets, as follows:

Among them, the most important for us is the expression matrix. You can download the 496M gene read counts file shown in the figure. The sample IDs in the expression matrix are defined by the database maintainers, so we also need to find the annotation file for the sample IDs.

The portal offers much more than this, but we bioinformatics engineers usually don't need it, so I won't go into details.

Note the version information of the database:

The current release is V7 including 11,688 samples, 53 tissues and 714 donors

First look at the annotation information of the database

The key points are:

 # SMTS Tissue Type, area from which the tissue sample was taken. 
 # SMTSD Tissue Type, more specific detail of tissue type

You can see which tissue each sample belongs to, which makes it easy to extract the relevant samples to support your own research.
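As a minimal sketch of how these two columns are used, the toy table below stands in for the annotation file. The sample IDs and rows are invented for illustration; only the column names SAMPID, SMTS and SMTSD follow the annotation file.

```r
# Toy stand-in for the sample annotation table (sample IDs are made up)
annot <- data.frame(
  SAMPID = c("GTEX-AAAA-0001", "GTEX-AAAA-0002", "GTEX-BBBB-0001", "GTEX-CCCC-0001"),
  SMTS   = c("Breast", "Brain", "Breast", "Blood"),
  SMTSD  = c("Breast - Mammary Tissue", "Brain - Cortex",
             "Breast - Mammary Tissue", "Whole Blood"),
  stringsAsFactors = FALSE
)

# Number of samples per broad tissue type
tissue_counts <- table(annot$SMTS)
print(tissue_counts)

# Sample IDs belonging to one tissue of interest
breast_ids <- annot$SAMPID[annot$SMTS == "Breast"]
print(breast_ids)
```

With the real annotation file the same two lines, table() on SMTS and a logical filter on SMTS, give the per-tissue sample counts and the IDs to extract.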

Import the 496M expression matrix of gene read counts into R:

if(F){
 options(stringsAsFactors = F)
 GTEx=read.table('~/Downloads/GTEx_Analysis_2016-01-15_v7_RNASeQCv1.1.8_gene_reads.gct.gz'
 ,header = T,sep = '\t',skip = 2)
 GTEx[1:4,1:4]
 h=head(GTEx)
 save(h,file = 'GTEx_head.Rdata')
}
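The skip = 2 argument is there because a GCT file starts with a version line and a dimensions line before the real header. Here is a small sketch on a made-up file (the gene IDs, symbols and counts are invented) that shows the effect:

```r
# Build a tiny GCT-style file: version line, dimensions line, then the table
gct_lines <- c(
  "#1.2",
  "2\t2",
  "Name\tDescription\tGTEX-AAAA-0001\tGTEX-BBBB-0001",
  "ENSG_FAKE_1\tGENE1\t10\t20",
  "ENSG_FAKE_2\tGENE2\t0\t5"
)
gct_path <- tempfile(fileext = ".gct")
writeLines(gct_lines, gct_path)

# skip = 2 drops the first two lines so that the third line is read as the header
toy <- read.table(gct_path, header = TRUE, sep = "\t", skip = 2,
                  stringsAsFactors = FALSE)
print(toy)
print(dim(toy))
```

The first two columns, Name and Description, carry the gene ID and the gene symbol; every remaining column is one sample.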

Pick the expression matrix of the tissue of interest

Now that we understand in detail which tissue each sample is annotated with, the code is very simple:

 load('~/Desktop/GTEx_all.Rdata')
 a[1:4,1:4]
 colnames(a)
 # SMTS Tissue Type, area from which the tissue sample was taken.
 # SMTSD Tissue Type, more specific detail of tissue type
 b=read.table('GTEx_v7_Annotations_SampleAttributesDS.txt',
              header = T,sep = '\t',quote = '')
 table(b$SMTS)
 breat_gtex=a[,gsub('[.]','-',colnames(a)) %in% b[b$SMTS=='Breast',1]]
 rownames(breat_gtex)=a[,1]
 dat=breat_gtex

This selects the sample names belonging to breast tissue and takes the corresponding subset of the expression matrix above.
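The gsub('[.]','-',...) step is needed because read.table, with its default check.names = TRUE, turns the dashes in the sample IDs into dots, so the column names no longer match the annotation file. A small sketch with invented sample IDs:

```r
# Sample IDs as they would appear in the annotation file (made up here)
ids_in_file <- c("GTEX-AAAA-0001", "GTEX-BBBB-0001")

# read.table(check.names = TRUE) passes header names through make.names(),
# which replaces characters that are invalid in R names (such as "-") with "."
ids_in_r <- make.names(ids_in_file)
print(ids_in_r)

# Turning the dots back into dashes restores the original IDs
restored <- gsub("[.]", "-", ids_in_r)
print(restored)

# Toy expression table with the mangled column names
expr <- data.frame(Name = c("g1", "g2"), Description = c("A", "B"),
                   GTEX.AAAA.0001 = c(1, 2), GTEX.BBBB.0001 = c(3, 4))
wanted <- "GTEX-BBBB-0001"
keep <- gsub("[.]", "-", colnames(expr)) %in% wanted
sub_expr <- expr[, keep, drop = FALSE]
print(sub_expr)
```

This round trip only works when the original IDs contain no dots of their own. Reading the file with check.names = FALSE would avoid the conversion altogether.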

Note that the row names of the expression matrix at this point are not gene symbols, so ID conversion is required. The code is as follows:

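dat=breat_gtex
 # the first two columns of a hold the gene ID and the gene symbol
 ids=a[,1:2]
 head(ids)
 colnames(ids)=c('probe_id','symbol')
 dat=dat[ids$probe_id,]
 dat[1:4,1:4] 
 # median expression of each gene across samples
 ids$median=apply(dat,1,median)
 # sort so that, within each symbol, the row with the highest median comes first
 ids=ids[order(ids$symbol,ids$median,decreasing = T),]
 # keep only that first row for each symbol
 ids=ids[!duplicated(ids$symbol),]
 dat=dat[ids$probe_id,]
 rownames(dat)=ids$symbol
 dat[1:4,1:4] 
 breat_gtex=dat
 save(breat_gtex,file = 'breat_gtex_counts.Rdata')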
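To see what the order plus !duplicated combination does, here is the same idea on a toy matrix. The IDs, symbols and values are invented; two IDs share the symbol GENEA and only the one with the higher median survives.

```r
# Toy expression matrix: 3 gene IDs x 3 samples
expr <- matrix(c(1, 2, 3,
                 10, 20, 30,
                 5, 5, 5),
               nrow = 3, byrow = TRUE,
               dimnames = list(c("id1", "id2", "id3"), c("s1", "s2", "s3")))

# id1 and id2 map to the same symbol
ids <- data.frame(probe_id = c("id1", "id2", "id3"),
                  symbol   = c("GENEA", "GENEA", "GENEB"),
                  stringsAsFactors = FALSE)

# Median per row, then sort by symbol and median (both decreasing)
ids$median <- apply(expr[ids$probe_id, ], 1, median)
ids <- ids[order(ids$symbol, ids$median, decreasing = TRUE), ]

# The first row seen for each symbol is the one with the highest median
ids <- ids[!duplicated(ids$symbol), ]

dedup <- expr[ids$probe_id, ]
rownames(dedup) <- ids$symbol
print(dedup)
```

Keeping the row with the highest median is one common convention; summing or averaging the duplicated rows would be another reasonable choice.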

The expression matrix is as follows:

Analysis of expression matrices of normal breast tissue samples

Normally these data would be analyzed together with the tumor data, and that kind of analysis can take many forms. Here is a simple one, PAM50 classification:

if(T){
 ddata=t(dat)
 ddata[1:4,1:4]
 s=colnames(ddata);head(s)
 library(org.Hs.eg.db)
 s2g=toTable(org.Hs.egSYMBOL)
 g=s2g[match(s,s2g$symbol),1];head(g)
 # probe Gene.symbol Gene.ID
 dannot=data.frame(probe=s,
                   "Gene.Symbol" =s,
                   "EntrezGene.ID"=g)
 ddata=ddata[,!is.na(dannot$EntrezGene.ID)]
 dannot=dannot[!is.na(dannot$EntrezGene.ID),]
 head(dannot)
 library(genefu)
 # c("scmgene", "scmod1", "scmod2","pam50", "ssp2006", "ssp2003", "intClust", "AIMS","claudinLow")
 s<-molecular.subtyping(sbt.model = "pam50",data=ddata,
                        annot=dannot,do.mapping=TRUE)
 table(s$subtype)
 tmp=as.data.frame(s$subtype)
 subtypes=as.character(s$subtype)
}
library(genefu)
pam50genes=pam50$centroids.map[c(1,3)]
pam50genes[pam50genes$probe=='CDCA1',1]='NUF2'
pam50genes[pam50genes$probe=='KNTC2',1]='NDC80'
pam50genes[pam50genes$probe=='ORC6L',1]='ORC6'
x=dat
x=x[pam50genes$probe[pam50genes$probe %in% rownames(x)] ,]
tmp=data.frame(subtypes=subtypes)
rownames(tmp)=colnames(x)
library(pheatmap)
pheatmap(x,show_rownames = T,show_colnames = F,
         annotation_col = tmp,
         filename = 'ht_by_pam50_raw.png')
x=t(scale(t(x)))
x[x>1.6]=1.6
x[x< -1.6]= -1.6
pheatmap(x,show_rownames = T,show_colnames = F,
         annotation_col = tmp,
         filename = 'ht_by_pam50_scale.png')

Take the expression matrix of the 50 genes in PAM50 separately for heat map clustering:

The figure above shows that the expression levels of different genes differ greatly. Generally, we do not compare expression levels between different genes; we only compare the expression of the same gene across samples.

Since the differences between genes are not of interest, we can standardize each gene across samples (row-wise z-scores, clipped at ±1.6) and redraw the heat map.
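The standardization in the code above is t(scale(t(x))) followed by clipping. A small sketch on an invented two-gene matrix shows why it helps: the two genes sit on very different scales before scaling and on the same scale afterwards.

```r
# Toy matrix: two genes with very different expression ranges, five samples
x <- matrix(c(1, 2, 3, 4, 100,
              1000, 2000, 3000, 4000, 5000),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("geneA", "geneB"), paste0("s", 1:5)))

# scale() works on columns, so transpose, scale, and transpose back
# to get one z-score per gene (row) across samples
z_raw <- t(scale(t(x)))

# Clip extreme values so a single outlier does not dominate the colour scale
z <- z_raw
z[z > 1.6] <- 1.6
z[z < -1.6] <- -1.6
print(round(z, 2))
```

The cut-off of 1.6 is simply the value used in the code above; it is a plotting choice rather than a statistical threshold.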

It is clear that even transcriptome data from normal tissue yield a variety of subtypes when classified by PAM50.

However, PAM50 is a model trained on microarray expression matrices from breast cancer patients, so we have applied it in the wrong setting here. For comparison, look at the classification results in METABRIC:

The upper classification is the result of the PAM50 algorithm, and the lower one is the clinical information.

The basal results are still very consistent and fit the definition of TNBC well, that is, the expression levels of PGR, ESR1, and ERBB2 are all very low.

Remove batch effects

If you really want to compare GTEx expression matrices with TCGA, you also need to remove batch effects to some extent.
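The post does not name a method, so the following is only a rough sketch of the idea using a deliberately simple stand-in: centering each gene within each batch on a log scale. The data, the batch labels and the helper center_by_batch are all hypothetical. Dedicated tools such as ComBat in the sva package or removeBatchEffect in limma are the usual choice in practice.

```r
set.seed(1)

# Hypothetical log2 expression: 3 genes x 6 samples,
# the first three labelled "GTEx" and the last three "TCGA"
logexpr <- matrix(rnorm(18, mean = 5), nrow = 3,
                  dimnames = list(paste0("gene", 1:3), paste0("s", 1:6)))
batch <- rep(c("GTEx", "TCGA"), each = 3)

# Simulate a batch shift of +2 in the second batch
logexpr[, batch == "TCGA"] <- logexpr[, batch == "TCGA"] + 2

# Hypothetical helper: move every batch to the overall per-gene mean
center_by_batch <- function(mat, batch) {
  out <- mat
  overall <- rowMeans(mat)
  for (b in unique(batch)) {
    cols <- batch == b
    within <- rowMeans(mat[, cols, drop = FALSE])
    # subtract the gene's mean within this batch, add back its overall mean
    out[, cols] <- mat[, cols, drop = FALSE] - within + overall
  }
  out
}

corrected <- center_by_batch(logexpr, batch)
print(round(rowMeans(corrected[, batch == "GTEx"]) -
            rowMeans(corrected[, batch == "TCGA"]), 10))
```

This kind of centering is only sensible when the batches contain the same kind of samples, for example normal tissue in both. If one batch is all normal and the other all tumor, it would remove the biological difference along with the batch effect.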

I have explained this many times on the Shengxin Skill Tree (Bioinformatics Skill Tree) blog, so I won't go into details here.

Origin www.cnblogs.com/xiaojikuaipao/p/12681561.html