GDC database for medical image processing

1. Open the GDC database:

  • Open the GDC Data Portal, which hosts the TCGA data: https://portal.gdc.cancer.gov/

    TCGA GDC interface

  • First, make sure the Cart holds no files left over from a previous session. If the file count shown on the Cart is not 0, click Cart to open the Cart page and remove all files.

    Check that the Cart is empty

    Empty Cart

2. Select the sample type and attributes:

  • Click Repository to open the data repository, then click Cases to filter by sample type and attributes:

    Click Cases

  • First choose the sample site. This walkthrough uses prostate cancer as the example:

    Select sample site

  • Select the sample source. If you are analyzing TCGA samples only, select only TCGA:

    Select project source

  • Each selection narrows the options that remain. After the choices above, only TCGA-PRAD is left under Project, so there is no need to click it: leaving a filter unselected means that everything under it is included.
    Disease Type: select according to the needs of your analysis. I selected one here to keep the pathological type uniform.
    Gender: leave unselected unless you have a specific need.
    Vital Status: for survival analysis, select the alive and dead patients. Patients marked not reported have incomplete survival data and can be excluded.
    Age at Diagnosis and Days to Death: set according to the needs of your own study. By default no filter is applied.

    Finer filtering

  • Race and Ethnicity: generally leave these unfiltered. Many samples are marked not reported here, so filtering on them would discard too many samples.

    Race and Ethnicity selection

3. Select the omics data type and format:

  • Click Files to select the data type and format.
  • Data Category: this example uses the most common case, transcriptome data, so select Transcriptome Profiling.
  • Data Type: select Gene Expression Quantification, which covers the expression data for protein-coding genes and long non-coding RNA genes. miRNA expression data is not included; for that, select miRNA Expression Quantification instead.
  • Experimental Strategy: only one option remains, so it can be left unselected. Workflow Type: choose according to your own needs. Counts or FPKM data are the most commonly used.
    * After these selections you usually do not need to set any other filter, because the remaining filters typically have only one option left.
  • Access: the data permission level. Ordinary users can only use open data, so if controlled-access data is listed, select only open here.

    Select data type and format
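
The selections made in steps 2 and 3 can also be written down as a query against the GDC API, which is handy for recording exactly which filters were used. The following is a minimal sketch, not an official client: it assumes the public files endpoint at https://api.gdc.cancer.gov/files and the field names cases.project.project_id, data_category, data_type and access, so check them against the GDC API documentation before relying on them. The helper names (in_filter, build_filters, build_query_url, fetch_hits) are made up for this illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumed public endpoint of the GDC API for file records.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def in_filter(field, values):
    """Build one 'in' clause: the field must match one of the given values."""
    return {"op": "in", "content": {"field": field, "value": list(values)}}


def build_filters(project_id="TCGA-PRAD"):
    """Combine the portal selections from this walkthrough with a logical AND."""
    return {
        "op": "and",
        "content": [
            in_filter("cases.project.project_id", [project_id]),
            in_filter("data_category", ["Transcriptome Profiling"]),
            in_filter("data_type", ["Gene Expression Quantification"]),
            in_filter("access", ["open"]),
        ],
    }


def build_query_url(filters, size=10):
    """Encode the filters as URL parameters for a GET request."""
    params = {
        "filters": json.dumps(filters),
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(size),
    }
    return GDC_FILES_ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_hits(url):
    """Send the request and return the matching file records (needs network)."""
    with urllib.request.urlopen(url) as response:
        payload = json.load(response)
    return payload["data"]["hits"]
```

Calling fetch_hits(build_query_url(build_filters())) would send the request. Nothing is sent when the block is merely imported, so the filter can be inspected offline first.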

     

4. Download the selected data:

  • Add the selected files to the Cart, then click Cart to open the Cart page.

    Add the selected data to the Cart

  • On the Cart page, click Metadata to download the annotation file and Download to download the data. Download offers two options:
    Manifest: download a manifest file first, then fetch the data with the gdc-client software (see section 6). This suits large downloads.
    Cart: download directly through the browser. This is more convenient, but not suitable for very large files.

    Two ways to download data

  • This completes the TCGA data download.
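
The Metadata file from the Cart is what links each downloaded data file back to its sample. Below is a minimal Python sketch of that lookup. It assumes the metadata JSON is a list of records, each with a file_name key and an associated_entities list whose entries carry an entity_submitter_id (the TCGA barcode). These key names reflect the files I have seen and should be checked against your own download. The function names are made up for this illustration.

```python
import json


def load_metadata(path):
    """Read the metadata JSON downloaded from the Cart page."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def file_to_barcode(metadata_records):
    """Map each file name to the barcode of its first associated entity.

    Records without any associated entity are skipped.
    Key names are assumptions; verify them against your own metadata file.
    """
    mapping = {}
    for record in metadata_records:
        entities = record.get("associated_entities") or []
        if entities:
            mapping[record["file_name"]] = entities[0]["entity_submitter_id"]
    return mapping
```

With a real file this would be used as file_to_barcode(load_metadata("metadata.json")), where metadata.json stands in for whatever name your downloaded file has.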

5. Naming rules for TCGA files

A TCGA barcode is a series of fields joined by hyphens. The fields below are listed from left to right and together form TCGA-02-0001-01A-01D-0182-07:

TCGA : Project name. All TCGA sample names start with this.

02 : Tissue source site (TSS) code. Full code table: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tissue-source-site-codes

0001 : Participant number. One patient may correspond to several samples. For example, TCGA-A6-6650 has 3 samples: TCGA-A6-6650-01A-11R-1774-07, TCGA-A6-6650-01A-11R-A278-07, TCGA-A6-6650-01B-02R-A277-07

01 : Sample type, the key field. Codes 01~09 are tumors and 10~19 are normal controls.

A : Vial, the order of the sample within a series taken from the patient's tissue. Most samples are coded A at this position; very few are B.

01 : Portion, the sequential number of the portion taken from the same patient's tissue.

D : Analyte, the type of molecule analyzed (for example, D for DNA and R for RNA).

0182 : Plate, the order of the plate within a series of 96-well plates. A larger value means the plate was prepared later.

07 : Center, the code of the sequencing or characterization center.

Naming Rules for GDC Database Samples
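
The naming rules above translate directly into a small parser. The sketch below splits a full seven-field barcode into its parts and classifies the sample code using only the two ranges given in this section (01~09 tumor, 10~19 normal control); anything else is reported as "other". The names TcgaBarcode, parse_barcode and sample_group are made up for this illustration.

```python
from typing import NamedTuple


class TcgaBarcode(NamedTuple):
    project: str
    tss: str
    participant: str
    sample: str
    vial: str
    portion: str
    analyte: str
    plate: str
    center: str


def parse_barcode(barcode):
    """Split a full barcode such as TCGA-A6-6650-01A-11R-1774-07 into fields."""
    parts = barcode.strip().split("-")
    if len(parts) != 7:
        raise ValueError(f"expected 7 hyphen-separated fields: {barcode!r}")
    project, tss, participant, sample_vial, portion_analyte, plate, center = parts
    if len(sample_vial) != 3 or len(portion_analyte) != 3:
        raise ValueError(f"unexpected sample or portion field: {barcode!r}")
    return TcgaBarcode(
        project, tss, participant,
        sample_vial[:2], sample_vial[2],          # two-digit sample code + vial letter
        portion_analyte[:2], portion_analyte[2],  # two-digit portion + analyte letter
        plate, center,
    )


def sample_group(sample_code):
    """Classify a two-digit sample code using the ranges from this section."""
    code = int(sample_code)
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    return "other"
```

For example, sample_group(parse_barcode("TCGA-A6-6650-01A-11R-1774-07").sample) classifies that sample as a tumor, which is the usual first step before splitting an expression matrix into tumor and normal columns.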

6. Download data with the GDC Data Transfer Tool

① Original method:

  • Decompress the downloaded archive to get gdc-client.exe, then put the MANIFEST.txt file and gdc-client.exe in the same folder.
  • Open a cmd command window in that folder.

  • Enter gdc-client download -m MANIFEST.txt and press Enter to start the download. The argument after -m is the manifest file you downloaded, so change it to your own file name. You can also append --latest to download the latest version of each file, which is convenient when downloading clinical data.

    gdc-client download -m MANIFEST.txt
    #or
    gdc-client download -m MANIFEST.txt --latest

    Download page
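
After a long download it is worth checking that every file in the manifest actually arrived intact. The following Python sketch does that under two assumptions that you should verify against your own files: the manifest is a tab-separated table with id, filename and md5 columns, and gdc-client saves each file in a subfolder named after its id. The function names are made up for this illustration.

```python
import csv
import hashlib
from pathlib import Path


def read_manifest(manifest_path):
    """Read the tab-separated manifest into a list of row dictionaries."""
    with open(manifest_path, newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 checksum of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_downloads(manifest_path, download_dir):
    """Return {file id: 'ok' | 'missing' | 'md5 mismatch'} for each manifest row.

    Assumes the layout <download_dir>/<id>/<filename>.
    """
    report = {}
    for row in read_manifest(manifest_path):
        path = Path(download_dir) / row["id"] / row["filename"]
        if not path.is_file():
            report[row["id"]] = "missing"
        elif md5_of(path) != row["md5"]:
            report[row["id"]] = "md5 mismatch"
        else:
            report[row["id"]] = "ok"
    return report
```

Running check_downloads("MANIFEST.txt", ".") in the download folder lists which entries need to be fetched again.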

② Download data + preprocess data:

The GitHub repository MarvinLer/tcga_segmentation (Whole Slide Image segmentation with weakly supervised multiple instance learning on TCGA, MICCAI 2020, https://arxiv.org/abs/2004.05024) provides a method to preprocess the downloaded data.

Downloading TCGA cohorts + WSI pre-processing

  1. Download the GDC Data Transfer Tool executable (not included here for license issues)
  2. Build any cohort on the TCGA GDC Data Portal, then download the associated manifest file and place it in a source_folder
  3. Launch the download and pre-processing pipeline with

    python -m code.data_processing.main --gdc gdc_executable_path source_folder

This script first downloads all files listed in the manifest, then tiles each WSI, extracts tiles at a given magnification, removes background tiles, and finally tries to extract a per-slide binary label from each slide's name.
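
To make the "removes background tiles" step concrete, here is a deliberately simple illustration in plain Python. It is not the method used in tcga_segmentation, whose implementation is not described here. It assumes a tile is a list of rows of (R, G, B) tuples with values from 0 to 255, and treats a tile as background when most of its pixels are near white. The threshold values and function names are illustrative choices only.

```python
def background_fraction(tile, white_threshold=220):
    """Fraction of pixels whose R, G and B values are all >= white_threshold."""
    total = 0
    bright = 0
    for row in tile:
        for red, green, blue in row:
            total += 1
            if min(red, green, blue) >= white_threshold:
                bright += 1
    if total == 0:
        raise ValueError("tile has no pixels")
    return bright / total


def is_background(tile, white_threshold=220, max_background=0.8):
    """A tile counts as background when more than max_background of it is near white."""
    return background_fraction(tile, white_threshold) > max_background


def keep_tissue_tiles(tiles, white_threshold=220, max_background=0.8):
    """Drop background tiles and return the remaining ones in their original order."""
    return [
        tile for tile in tiles
        if not is_background(tile, white_threshold, max_background)
    ]
```

Real pipelines usually work on arrays read from the slide file and often use saturation or Otsu thresholding instead of a fixed brightness cut-off, but the filtering logic has the same shape.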


Origin blog.csdn.net/weixin_45958695/article/details/127782355