Today I studied how to write a script to batch-download .svs files from TCGA, in two parts: a Python script using requests, and a shell script using wget. The shell script turns out to be the better choice; the Python version can get stuck.
Every download tool is just something a person wrote following certain rules; there is no magic to it. Thinking + practice = an optimized tool.
requests
The code below is a small modification of the script from the blog post linked in the docstring. The main additions are saving each file under its original .svs name and displaying the download speed.
# coding:utf-8
'''
from https://blog.csdn.net/qq_35203425/article/details/80992727
This tool simplifies the steps to download TCGA data. It has two main parameters:
-m is the manifest file path.
-s is the location where downloaded files are saved (it is best to create a new folder for the downloaded data).
The tool supports resuming: after the program is interrupted it can be restarted, and it will continue from the last downloaded file. Note that the original tool saved each file directly as a .txt file named by its TCGA UUID. Press Ctrl+C to terminate the program if necessary.
author: chenwi
date: 2018/07/10
mail: [email protected]
@cp
* save files under their original name, e.g. TCGA-J8-A3YE-01Z-00-DX1.83286B2F-6D9C-4C11-8224-24D86BF517FA.svs,
  instead of the UUID name, e.g. 66fab868-0b7e-4eb1-885f-6c62d1e80936.txt
* show the download speed
'''
import os
import pandas as pd
import requests
import sys
import argparse
import signal
import time

print(__doc__)
requests.packages.urllib3.disable_warnings()


def download(url, file_path):
    r = requests.get(url, stream=True, verify=False)
    total_size = int(r.headers['content-length'])
    print(f"{total_size/1024/1024}MB")
    temp_size = 0
    size = 0
    with open(file_path, "wb") as f:
        time1 = time.time()
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                temp_size += len(chunk)
                f.write(chunk)
                done = int(50 * temp_size / total_size)
                # refresh the progress bar and speed every 5 seconds
                if time.time() - time1 > 5:
                    speed = (temp_size - size) / 1024 / 1024 / 5
                    sys.stdout.write("\r[%s%s] %d%%%s%.2fMB/s" %
                                     ('#' * done, ' ' * (50 - done),
                                      100 * temp_size / total_size, ' ' * 10, speed))
                    sys.stdout.flush()
                    size = temp_size
                    time1 = time.time()
    print()


def get_UUID_list(manifest_path):
    # the manifest is tab-separated with 'id' (UUID) and 'filename' columns
    manifest = pd.read_csv(manifest_path, sep='\t', encoding='utf-8')
    return list(manifest['id']), list(manifest['filename'])


def get_last_UUFN(file_path):
    # the most recently modified file in the save folder is the last (possibly partial) download
    dir_list = os.listdir(file_path)
    if not dir_list:
        return None
    dir_list = sorted(dir_list, key=lambda x: os.path.getmtime(os.path.join(file_path, x)))
    return dir_list[-1]


def get_lastUUFN_index(UUFN_list, last_UUFN):
    for i, UUFN in enumerate(UUFN_list):
        if UUFN == last_UUFN:
            return i
    return 0


def quit(signum, frame):
    # Ctrl+C quit
    print()
    print('You chose to stop me.')
    exit()


if __name__ == '__main__':
    signal.signal(signal.SIGINT, quit)
    signal.signal(signal.SIGTERM, quit)

    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--manifest", dest="M", type=str, default="gdc_manifest.txt",
                        help="gdc_manifest.txt file path")
    parser.add_argument("-s", "--save", dest="S", type=str, default=os.curdir,
                        help="folder the downloaded files are saved to")
    args = parser.parse_args()

    link = r'https://api.gdc.cancer.gov/data/'
    manifest_path = args.M
    save_path = args.S
    print("Save file to {}".format(save_path))

    UUID_list, UUFN_list = get_UUID_list(manifest_path)
    last_UUFN = get_last_UUFN(save_path)
    print("Last downloaded file: {}".format(last_UUFN))
    last_UUFN_index = get_lastUUFN_index(UUFN_list, last_UUFN)
    print("last_UUFN_index:", last_UUFN_index)

    # restart from the last (possibly incomplete) file onward
    for UUID, UUFN in zip(UUID_list[last_UUFN_index:], UUFN_list[last_UUFN_index:]):
        url = link + UUID  # plain concatenation; os.path.join would break URLs on Windows
        file_path = os.path.join(save_path, UUFN)
        download(url, file_path)
        print(f'{UUFN} has been downloaded')
wget
The advantage of wget is that it supports unlimited automatic retries and resuming over HTTP, which makes it very friendly for downloading large files; there is no need to worry about the network dropping out halfway. The following script batch-downloads the .svs files and was refined over some time.
#!/usr/bin/env bash
# filename: wget_download.sh
# download and re-download (resume-friendly)

#location="cervix_uteri"
location=$1

if [ ! -d "${location}" ]; then
    mkdir "${location}"
fi

echo "run script: $0 $*" > download_${location}.log 2>&1

manifest_file="gdc_manifest_${location}.txt"
row_num=$(awk 'END{print NR}' ${manifest_file})
file_num=$((row_num - 1))   # the first row of the manifest is the header
echo "file_num is ${file_num}" >> download_${location}.log 2>&1

# column 1 is the UUID, column 2 the file name; index 0 holds the header row
uuid_array=($(awk '{print $1}' ${manifest_file}))
uufn_array=($(awk '{print $2}' ${manifest_file}))

# resume from the last (possibly partial) file already in the folder
start=$(ls ./${location} | wc -l)
if [ ${start} -eq 0 ]; then
    start=1
fi
echo "start from ${start}" >> download_${location}.log 2>&1

for k in $(seq ${start} ${file_num})
do
    echo "${uuid_array[$k]} ${uufn_array[$k]}" >> download_${location}.log 2>&1
    # -c resume, -t 0 retry forever, -O save under the original file name
    wget -c -t 0 -O ./${location}/${uufn_array[$k]} https://api.gdc.cancer.gov/data/${uuid_array[$k]}
done
To use it, run:
chmod +x wget_download.sh
./wget_download.sh cervix_uteri
going faster
Think about how you could download even faster:
axel: a multi-threaded download tool
mwget: a multi-threaded version of wget
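The idea behind axel and mwget, several connections at once, can also be approximated in the Python script with a thread pool, since per-file downloading is I/O-bound. A hedged sketch: download_fn stands in for a function like the download(url, file_path) above, and the helper simply fans the (url, path) pairs out over a few workers:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def download_all(pairs, download_fn, workers=4):
    """Run download_fn(url, path) concurrently over (url, path) pairs.

    Returns the list of paths whose download raised an exception, so the
    caller can retry them or hand them to the resume logic."""
    failed = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(download_fn, url, path): path
                   for url, path in pairs}
        for fut in as_completed(futures):
            try:
                fut.result()
            except Exception:
                failed.append(futures[fut])
    return failed
```

A handful of workers is usually enough for whole-slide images; past that, the per-connection rate of the server becomes the bottleneck rather than the client.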