Downloading GDC TCGA data from R (note: without an R package)

This article briefly introduces the TCGA database and shows how, as a newcomer, you can find the data you want in it. In this post, I will walk you step by step through downloading TCGA data.

There are three main ways to download data from TCGA:

1. Use the official GDC tool gdc-client: This method ensures that you always get the latest files, but the steps are a bit cumbersome. You have to merge the per-sample files yourself, which is not friendly to beginners.

2. Use the TCGAbiolinks package in R: This is the method I recommend first, especially when a gdc-client download from the official website gets stuck or fails. The package downloads quickly and provides a merge function.

3. Download from the UCSC Xena browser: This is the simplest, most foolproof method. Beginners are advised to start here when exploring TCGA.

(All roads lead to Rome; downloading the data is easy, isn't it?)

Recently, though, I found that files downloaded from the GDC TCGA web page always failed to decompress, perhaps because of a poor network connection or some other problem. Looking for another way out, I found that running the download from R is a reliable and fast method!

The TCGA data on GDC was revised and updated in April 2022, going from Data Release 18.0 to Data Release 33.0 (see Release Notes - GDC Docs, cancer.gov), and of course it will continue to be updated in the future.

Most of these updates do not affect the TCGA RNA-seq data for the tumors we want to study; those data do not change when the database is updated. So what is updated? According to the notes for each release, GDC keeps adding new data. For example, a new project, the Exceptional Responders Initiative, was added. What we need to watch is the updated data that is relevant to us: clinical data! Follow-up and survival data used for prognostic analysis, for example, are updated continually.

Here, I first cover how to obtain the updated GDC TCGA transcriptome data from R... scripts included~

After the update, STAR is used as the alignment tool, and Workflow Type has only one option, STAR-Counts. This is actually good news, a sign that TCGA has "found its conscience", because integrating the expression matrix is now easier. The file for each sample provides gene_name and gene_type, along with counts, FPKM, and TPM values, all ready for us to extract.

Getting started

Open the official GDC TCGA website (GDC) and click Repository to enter the data repository. Download a small number of samples first to test the waters!

Then, under Files, select the type of data to download, for example transcriptome data. You only need to make a selection where several options appear; where there is only one option, it does not matter whether you tick it.

For example, there is currently only one data format, tsv, so it does not matter if you leave it unticked.

Next, add the selected files to the cart. The number on the cart changes to the number of files, currently 40. You may wonder why there are clearly 33 cases but 40 files in the cart. This is normal: a case is one cancer patient, and a single patient may contribute several tissue samples (primary tumor, metastatic tumor, blood control, and so on), so several files can map to one case ID and there can be more files than cases.
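The difference between cases and files can be sketched in R. This is only a small illustration on made-up barcodes, assuming the standard TCGA barcode layout, in which the first 12 characters identify the patient (case) and characters 14-15 hold the sample type code (for example 01 primary tumor, 06 metastatic, 11 solid tissue normal). The barcodes themselves are hypothetical.

```r
# Hypothetical TCGA sample barcodes (made up for illustration)
barcodes <- c(
  "TCGA-AA-0001-01A-11R-A000-07",  # primary tumor
  "TCGA-AA-0001-11A-11R-A000-07",  # solid tissue normal, same patient
  "TCGA-AA-0002-01A-11R-A000-07",
  "TCGA-AA-0003-01A-11R-A000-07",
  "TCGA-AA-0003-06A-11R-A000-07"   # metastatic, same patient as the line above
)

case_id     <- substr(barcodes, 1, 12)   # patient-level (case) ID
sample_type <- substr(barcodes, 14, 15)  # two-digit sample type code

n_files <- length(barcodes)
n_cases <- length(unique(case_id))
files_per_case <- table(case_id)

cat("files:", n_files, " cases:", n_cases, "\n")
print(files_per_case[files_per_case > 1])  # cases with more than one file
```

With five files from three patients, the file count is larger than the case count, which is the same situation as 40 files for 33 cases.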

Now click the cart to open the download page. When the amount of data is large, downloading directly is not recommended; use the officially recommended GDC Data Transfer Tool instead. If you don't have it, here is a link: https://pan.baidu.com/s/1K8fPxi3R1bW79_KLvF5etA
Extraction code: aaaa

I have already prepared it, so help yourself, no thanks needed! This is the Windows 64-bit build.

To use this tool, you need to download two files from GDC TCGA: the Manifest and the metadata. Both file names include a date stamp based on the download time.

Put the GDC Data Transfer Tool and these two files in the same working directory, and then create a few folders. The folders are just my habit and are not required.
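Before downloading, it can help to peek at the manifest from R. The sketch below assumes the manifest is a tab-separated text file with the columns id, filename, md5, size and state; it writes a tiny made-up manifest to a temporary file so that it runs on its own. For real use, point manifest_path at your own manifest file.

```r
# Write a tiny stand-in manifest so the example is self-contained;
# the IDs, file names and checksums below are made up.
manifest_path <- tempfile(fileext = ".txt")
writeLines(c(
  "id\tfilename\tmd5\tsize\tstate",
  "aaaa1111\tsample1.rna_seq.augmented_star_gene_counts.tsv\t0123abcd\t4200000\treleased",
  "bbbb2222\tsample2.rna_seq.augmented_star_gene_counts.tsv\t4567ef01\t4300000\treleased"
), manifest_path)

# Read the manifest: one row per file to download
manifest_df <- read.delim(manifest_path, stringsAsFactors = FALSE)

n_manifest_files <- nrow(manifest_df)
total_mb <- sum(manifest_df$size) / 1e6  # size column is assumed to be in bytes

cat("files in manifest:", n_manifest_files, "\n")
cat("total size (MB):", total_mb, "\n")
```

The number of rows should match the number shown on the cart, which is a quick sanity check before a long download.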

In the current directory, start downloading data!

rm(list = ls())  # clear the workspace

setwd("D:/R.result/2.He/ww2023.8.3_breast cancer/04.symbol")  # set your own path

# Create the folder that will hold the raw downloads
if (!dir.exists("rawdata")) dir.create("rawdata")
manifest <- "gdc_manifest_20230817_063606.txt"  # manifest file downloaded from the GDC cart
rawdata <- "rawdata"

# Build the gdc-client call: -m is the manifest, -d is the output directory
command <- sprintf("./gdc-client download -m %s -d %s",
                   manifest,
                   rawdata)
system(command = command)

Then wait a moment and a download progress bar will appear. Download speed varies from person to person; mine was very slow.

Of course, you may run into problems while downloading. If that happens, just run the command again. When the download is complete, the console output shows that it has finished.
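If a run is interrupted, one way to see what is still missing is to compare the manifest with the download folder. This is a rough sketch, assuming gdc-client stores each file as <output folder>/<id>/<filename>; find_missing is a hypothetical helper written for this post, and the manifest and folder below are made up so the example runs on its own.

```r
# Return the manifest rows whose file is not yet present on disk.
# Assumes each file is stored as <out_dir>/<id>/<filename>.
find_missing <- function(manifest_df, out_dir) {
  paths <- file.path(out_dir, manifest_df$id, manifest_df$filename)
  manifest_df[!file.exists(paths), , drop = FALSE]
}

# Self-contained demo with a made-up manifest and a temporary folder
demo_dir <- tempfile("rawdata_")
dir.create(demo_dir)
demo_manifest <- data.frame(
  id       = c("aaaa1111", "bbbb2222", "cccc3333"),
  filename = c("s1.tsv", "s2.tsv", "s3.tsv"),
  stringsAsFactors = FALSE
)

# Pretend that only the first file was downloaded
dir.create(file.path(demo_dir, "aaaa1111"))
file.create(file.path(demo_dir, "aaaa1111", "s1.tsv"))

missing_df <- find_missing(demo_manifest, demo_dir)
print(missing_df$id)

# Write a reduced manifest that could be passed to gdc-client with -m
retry_manifest <- file.path(demo_dir, "manifest_retry.txt")
write.table(missing_df, retry_manifest, sep = "\t",
            quote = FALSE, row.names = FALSE)
```

Passing the reduced manifest to the download command retries only the files that did not arrive, instead of the whole list.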

At this point, all the data is in the rawdata folder: 40 folders in total, each containing one file.
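Since damaged downloads were the original problem, it may be worth checking the files against the checksums in the manifest. The sketch below assumes the manifest has an md5 column and that files sit in <output folder>/<id>/<filename>; check_md5 is a hypothetical helper, and the demo data is made up. It uses tools::md5sum, which ships with R.

```r
# Compare the MD5 of each downloaded file with the value in the manifest.
check_md5 <- function(manifest_df, out_dir) {
  paths  <- file.path(out_dir, manifest_df$id, manifest_df$filename)
  actual <- unname(tools::md5sum(paths))  # NA for files that do not exist
  data.frame(id = manifest_df$id,
             ok = !is.na(actual) & actual == manifest_df$md5,
             stringsAsFactors = FALSE)
}

# Self-contained demo: one real temporary file and one missing file
md5_dir <- tempfile("rawdata_")
dir.create(file.path(md5_dir, "aaaa1111"), recursive = TRUE)
good_file <- file.path(md5_dir, "aaaa1111", "s1.tsv")
writeLines("gene_id\tunstranded", good_file)

md5_manifest <- data.frame(
  id       = c("aaaa1111", "bbbb2222"),
  filename = c("s1.tsv", "s2.tsv"),
  md5      = c(unname(tools::md5sum(good_file)), "not-a-real-checksum"),
  stringsAsFactors = FALSE
)

md5_result <- check_md5(md5_manifest, md5_dir)
print(md5_result)
```

Rows where ok is FALSE are files that are missing or differ from the manifest and are candidates for downloading again.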

 

If this is your first time, be sure to open a file to see what it contains and what we need. We need the first column, gene_id, and the fourth column, unstranded, which holds the traditional counts data. The earlier TCGA data provided only two columns.

The great thing about the revised TCGA is that it provides the converted gene_name and gene_type, covering protein_coding and non-coding genes, which can be kept for gene annotation.
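Reading one of these files into R might look like the sketch below. It assumes the layout described above (gene_id in the first column, unstranded in the fourth), and additionally assumes a leading comment line starting with "#" and a few STAR summary rows whose gene_id starts with "N_". The toy file is written to a temporary path and all values are made up.

```r
# Write a tiny stand-in for one STAR-Counts file (values are made up)
counts_path <- tempfile(fileext = ".tsv")
writeLines(c(
  "# gene-model: GENCODE v36",
  paste("gene_id", "gene_name", "gene_type", "unstranded", "stranded_first",
        "stranded_second", "tpm_unstranded", "fpkm_unstranded",
        "fpkm_uq_unstranded", sep = "\t"),
  "N_unmapped\t\t\t1000\t1000\t1000\t\t\t",
  "N_multimapping\t\t\t500\t500\t500\t\t\t",
  "ENSG00000000003.15\tTSPAN6\tprotein_coding\t2500\t1200\t1300\t35.1\t12.4\t13.0",
  "ENSG00000243485.5\tMIR1302-2HG\tlncRNA\t3\t1\t2\t0.2\t0.1\t0.1"
), counts_path)

# Lines starting with "#" are skipped as comments
star <- read.delim(counts_path, comment.char = "#", stringsAsFactors = FALSE)

# Drop the summary rows (N_unmapped, ...) and keep the real genes
genes <- star[!grepl("^N_", star$gene_id), ]

# Raw counts: gene_id plus the unstranded column
counts <- genes[, c("gene_id", "unstranded")]

# Annotation table that can be kept for later gene annotation
annotation <- genes[, c("gene_id", "gene_name", "gene_type")]
print(counts)
```

Repeating this for every folder under rawdata and joining the unstranded columns by gene_id is one way to build the expression matrix later.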

OK! At this point, we have finished downloading the data from R. In follow-up posts I may also cover downloading the clinical data and merging and processing the downloaded files~~ To find out what happens next, stay tuned for the next post.


Origin blog.csdn.net/Queen_yu/article/details/132341075