GEO and TCGA

Introduction to GEO database

1. What exactly is the GEO database?

GEO stands for Gene Expression Omnibus, a gene expression database created and maintained by the National Center for Biotechnology Information (NCBI). It was founded in 2000 and holds high-throughput gene expression data submitted by research institutions around the world. In practice, if a published paper reports high-throughput expression data, there is a good chance the underlying data can be found in this database.

The key point is that this data is free! It's free! It's free! The world looks a lot brighter when you think about it that way.

2. How do you get into the GEO database?

There are two commonly used entry points: go directly to the website http://www.ncbi.nlm.nih.gov/geo, or enter through PubMed.

3. Article source and usage tutorial

Introduction to TCGA

1. What is TCGA? What data is in TCGA?

TCGA stands for The Cancer Genome Atlas. The project started in 2005 and aims to catalog cancer-related gene mutations using genome sequencing and bioinformatics. It applies high-throughput genome analysis technology to help us better understand the genetic basis of cancer, and thereby improve our ability to diagnose, treat, and prevent it.

TCGA is jointly supervised by the Center for Cancer Genomics at the National Cancer Institute (NCI) and the National Human Genome Research Institute (NHGRI).

TCGA includes Genome Characterization Centers (GCCs), which mainly perform the sequencing, and Genome Data Analysis Centers (GDACs), which are responsible for analyzing the sequencing data. So far, TCGA lists a total of 39 cancer-related sequencing projects, covering 29 primary organ sites, more than 10,000 tumor samples, and more than 270,000 files.

2. So what types of data can be downloaded from TCGA?

TCGA's data types mainly include the following:
(1) Clinical: the patient's general condition, diagnosis and treatment, TNM staging, tumor pathology, survival, etc.
(2) mRNA expression: mRNA expression measured by mRNA microarray or RNA-Seq
(3) microRNA: microRNA expression measured by microRNA microarray or microRNA-Seq
(4) Copy number variation: the copy-number ratio of chromosome segments in tumor tissue relative to normal tissue, obtained by SNP array
(5) Mutation: nucleotide changes in the tumor tissue sequencing results relative to the reference genome, including insertions and deletions
(6) Protein: expression of about 200 common cancer-related proteins, measured by protein array
(7) Methylation: DNA methylation data measured by methylation arrays, mainly the 27K and 450K arrays

Among them, mRNA-Seq, miRNA-Seq and Methylation Array are widely used.

3. There are 3 types of mRNA-Seq data:
HTSeq-Counts; HTSeq-FPKM; HTSeq-FPKM-UQ.

The first two are easy to understand: raw read counts, and counts normalized to FPKM. The third differs from the second only in the normalization method: FPKM-UQ scales by the upper-quartile (75th percentile) gene read count instead of the total read count. For the formulas, please refer to https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/Expression_mRNA_Pipeline/
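To make the difference concrete, here is a minimal sketch of the two normalizations, following the formulas on the GDC page linked above (FPKM divides by the total read count of protein-coding genes, FPKM-UQ by the 75th-percentile read count). The function names and the toy numbers are made up for illustration, and the sketch assumes every gene passed in is protein-coding and that the quartile is taken over that same set; it is not the GDC pipeline itself.

```python
import statistics

def fpkm(counts, lengths):
    """FPKM = RC_g * 1e9 / (RC_pc * L), with RC_pc the summed read count.

    Assumption: every gene in `counts` is protein-coding.
    """
    rc_pc = sum(counts)
    return [c * 1e9 / (rc_pc * l) for c, l in zip(counts, lengths)]

def fpkm_uq(counts, lengths):
    """FPKM-UQ = RC_g * 1e9 / (RC_g75 * L), with RC_g75 the 75th percentile count.

    Assumption: the percentile is computed over the same gene set,
    using linear interpolation between neighbouring values.
    """
    rc_g75 = statistics.quantiles(counts, n=4, method="inclusive")[2]
    return [c * 1e9 / (rc_g75 * l) for c, l in zip(counts, lengths)]

# Toy sample: four genes with raw read counts and gene lengths in bp.
toy_counts = [10, 20, 30, 40]
toy_lengths = [1000, 2000, 1000, 4000]

toy_fpkm = fpkm(toy_counts, toy_lengths)
toy_fpkm_uq = fpkm_uq(toy_counts, toy_lengths)
```

Within one sample the two results differ only by a constant factor (total count divided by upper-quartile count), so the ranking of genes inside a sample is the same. What changes is how comparable the values are between samples.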

4. TCGA data level:

level1: raw data
level2: processed data
level3: segmented and interpreted data
level4: regions of interest or summary data

All in all, the first two levels of data are generally not openly available and require authorization to access. From what I have heard, usually only PIs abroad can apply. The open data we normally get is data that has already been processed and normalized.

5. TCGA sample classification:
Besides the data levels, we also need to understand how TCGA samples are classified, for example which samples are normal and which are tumor.

A typical sample name looks like this: TCGA-19-2619-10A. The part to pay attention to is the last field, 10A, whose first two digits are the sample type code. Most commonly, 01 is a primary tumor sample and 11 is an adjacent normal tissue sample. More generally, 01-09 are tumor samples, 10-19 are normal samples, and 20-29 are control samples; the individual codes just split these groups more finely. For details, please refer to the official website.
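The sample type rule can be sketched in a few lines of Python. This is only an illustration: the function name `classify_tcga_sample` is made up here, and it assumes the barcode has at least four dash-separated fields, with the fourth holding the two-digit sample type code followed by a vial letter.

```python
def classify_tcga_sample(barcode):
    """Return (group, sample_type_code, vial) for a TCGA sample barcode.

    Assumes the layout TCGA-<site>-<participant>-<sample><vial>...,
    e.g. "TCGA-19-2619-10A".
    """
    parts = barcode.split("-")
    if len(parts) < 4 or parts[0] != "TCGA" or len(parts[3]) < 2:
        raise ValueError(f"not a TCGA sample barcode: {barcode!r}")

    code = int(parts[3][:2])   # two-digit sample type, e.g. 10
    vial = parts[3][2:]        # vial letter, e.g. "A" (may be empty)

    if 1 <= code <= 9:
        group = "tumor"
    elif 10 <= code <= 19:
        group = "normal"
    elif 20 <= code <= 29:
        group = "control"
    else:
        group = "unknown"
    return group, code, vial

example = classify_tcga_sample("TCGA-19-2619-10A")
```

For the barcode in the text, the code is 10, so the sample falls in the normal group rather than the tumor group.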

6. TCGA data download method

There are three main ways to download TCGA data: the first is the official GDC download tool, the second is cBioPortal, and the third is TCGA-Assembler 2.
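For the GDC route, the files you want have to be identified first. One option, not covered in the original article, is the GDC REST API. The sketch below only builds a query for the files endpoint and sends nothing over the network. The endpoint URL, the filter layout, and the field names are written from my reading of the GDC API documentation and should be treated as assumptions to check there. The project ID `TCGA-BRCA` is just an example, and the workflow value `HTSeq - Counts` may no longer match current data releases.

```python
import json
from urllib.parse import urlencode

# Assumed public GDC files endpoint; verify against the GDC API docs.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"

def build_files_query(project_id, workflow_type, size=100):
    """Build query parameters selecting expression files for one project."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in", "content": {
                "field": "cases.project.project_id", "value": [project_id]}},
            {"op": "in", "content": {
                "field": "files.data_category",
                "value": ["Transcriptome Profiling"]}},
            {"op": "in", "content": {
                "field": "files.analysis.workflow_type",
                "value": [workflow_type]}},
        ],
    }
    return {
        "filters": json.dumps(filters),   # filters travel as a JSON string
        "fields": "file_id,file_name",    # columns to return
        "format": "JSON",
        "size": str(size),                # maximum number of hits
    }

# Example values only; nothing is requested here.
query_params = build_files_query("TCGA-BRCA", "HTSeq - Counts")
query_url = GDC_FILES_ENDPOINT + "?" + urlencode(query_params)
```

The file IDs returned by such a query could then be handed to the official download tool. Building the query separately from sending it makes it easy to inspect before anything is downloaded.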

Article Source

Origin blog.csdn.net/weixin_47542175/article/details/113817140