TCGA data download program


Today I studied how to write scripts that batch-download .svs files from TCGA. There are two parts: a Python script using requests, and a shell script using wget. The shell script works better in practice; the Python version can get stuck.

Every download tool is written by people following certain rules; there is no magic. Thinking + practice = a better tool.

requests

The script below is a simple modification of the code from the CSDN post linked in its docstring. The main additions are saving each output file under its .svs file name and displaying the download speed.

# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool simplifies the steps needed to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the folder where the downloaded files are saved (it is best to create a new folder for the downloaded data).
This tool supports resuming: after the program is interrupted, it can be restarted and will continue from the last downloaded file. Note that instead of the one-folder-per-file layout, each download is saved directly as a single file whose name is the file's UUID in the original TCGA. If necessary, press Ctrl+C to terminate the program.
author: chenwi
date: 2018/07/10
mail: [email protected]

@cp
* save each file under its original name, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs,
instead of the UUID name 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time

print(__doc__)

requests.packages.urllib3.disable_warnings()  # silence the InsecureRequestWarning caused by verify=False


def download(url, file_path):
    # Stream the response so a large .svs file is not held in memory all at once
    r = requests.get(url, stream=True, verify=False)
    total_size = int(r.headers['content-length'])
    print(f"{total_size / 1024 / 1024:.2f}MB")
    temp_size = 0
    size = 0

    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                done = int(50 * temp_size / total_size)
                # Redraw the progress bar and average speed roughly every 5 seconds
                if time.time() - time1 > 5:
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done), 100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()


def get_UUID_list(manifest_path):
    # The GDC manifest is tab-separated with 'id' and 'filename' columns; read it once
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    UUID_list = list(manifest['id'])
    UUFN_list = list(manifest['filename'])
    return UUID_list, UUFN_list

def get_last_UUFN(file_path):
    # Return the most recently modified file in the save folder, or None if it is empty
    dir_list = os.listdir(file_path)
    if not dir_list:
        return
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]


def get_lastUUFN_index(UUFN_list, last_UUFN):
    # Find where the last downloaded file sits in the manifest so downloading can
    # resume from it (it may be incomplete, so it is downloaded again)
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0


def quit(signum, frame):
    # Ctrl+C / SIGTERM handler: exit cleanly
    print()
    print('You choose to stop me.')
    sys.exit()


if __name__ == '__main__':

    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="Which folder is the download file saved to?")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'

    # args
    manifest_path = args.M
    save_path = args.S

    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last download file {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # plain concatenation; os.path.join is for filesystem paths, not URLs
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
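
To run the script, save it and point it at a manifest exported from the GDC Data Portal (the file name tcga_download.py and the save folder ./svs_data here are just examples):

python tcga_download.py -m gdc_manifest.txt -s ./svs_data

If the connection drops, run the same command again and the script continues from the last (possibly partial) file.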

wget

The advantage of wget is that it supports unlimited automatic retries and resuming over HTTP, which makes it very friendly for large files: you don't have to worry about the network dropping halfway through a download. The following script batch-downloads the .svs files; it has been continuously updated for some time.
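
Stripped of the loop, the core command the script below issues for each file looks like this (the UUID placeholder comes from the manifest):

wget -c -t 0 -O output.svs https://api.gdc.cancer.gov/data/<UUID>

Here -c resumes a partially downloaded file, -t 0 retries an unlimited number of times, and -O sets the output file name.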

#!/usr/bin/env bash
# Filename: wget_download.sh

# download and re-download
#location="cervix_uteri"
location=$1
if [ ! -d "${location}" ]
then
    mkdir "${location}"
fi

echo run script: $0 $*  > download_${location}.log  2>&1

manifest_file="gdc_manifest_${location}.txt"
row_num=`awk 'END{print NR}' ${manifest_file}`
file_num=$((${row_num}-1))    # subtract the manifest header line
echo file_num is ${file_num} >> download_${location}.log  2>&1
uuid_array=($(awk '{print $1}' ${manifest_file}))
uufn_array=($(awk '{print $2}' ${manifest_file}))
# resume: index 0 of each array is the header row, and the number of files
# already in the folder tells us which row to restart from (wget -c finishes
# any partially downloaded file)
start=`ls  ./${location} | wc -l`
if [ ${start} -eq 0 ];then
    start=1
fi

echo start from ${start} >> download_${location}.log  2>&1
for k in `seq ${start} ${file_num}`
do
    echo ${uuid_array[$k]} ${uufn_array[$k]} >> download_${location}.log  2>&1
    wget -c -t 0 -O ./${location}/${uufn_array[$k]} https://api.gdc.cancer.gov/data/${uuid_array[$k]}
done
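
Because start is simply the number of files already present, you can check where a rerun will resume by counting them yourself, e.g. for the cervix_uteri example below:

ls ./cervix_uteri | wc -l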

To use it, run:

chmod +x wget_download.sh
./wget_download.sh cervix_uteri

Going further

Think about how to download even faster:
Axel, a multi-connection download tool
mwget, a multi-threaded version of wget
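
As a sketch (assuming axel is installed), the wget line in the loop above could be swapped for a multi-connection download, where -n sets the number of connections and -o the output path:

axel -n 10 -o ./${location}/${uufn_array[$k]} https://api.gdc.cancer.gov/data/${uuid_array[$k]}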


Origin: blog.csdn.net/math_computer/article/details/103907655