How to download raw data from a GSE/SRA/SRR accession number

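Preface:

(I have been preparing for the TOEFL lately, so this post has a lot of English in it, all typed by hand as practice. Wish me luck~)
----------------------------------------------------divider------------------------------------------------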
  When we read a paper that relies on bioinformatic analysis, we often feel the urge to reproduce its results by re-analysing the raw data. This is usually possible, because authors generally upload their raw data to a public database.
  For example, I am working through a ChIP-seq practice project whose data come from a paper published in Cell in 2008. How do we get those data (the .fastq files)?

Chen X , Xu H , Yuan P , et al. Integration of External Signaling Pathways with the Core Transcriptional Network in Embryonic Stem Cells[J]. Cell, 2008, 133(6):0-1117.

  The practice project already provides the raw data as a .zip archive, which contains the raw fq files and the reference genome index. The archive looks like this:

[Figure: contents of the zip archive]

  But how do we download the raw data ourselves when all we have is a sentence like this in the article?
[Figure: the GSE number as given in the article]

-------------------------------------------------------divider-----------------------------------------------
  That sentence gives us a GSE number, which we can look up in NCBI GEO to see the basic information about the dataset (the search is shown in Figure 1).

1: choose GEO DataSets

2: enter the GSE number

3: click Search

[Figure 1: searching for the GSE number in GEO DataSets]

4: click the search result to see the basic information about the dataset

[Figure: GEO Accession viewer]
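
If you already know the accession, you can skip the search form, because GEO Accession viewer pages follow a fixed URL pattern. The following is only a minimal sketch: the helper name geo_url and the accession GSE00000 are placeholders for illustration, not values taken from the article.

```shell
# Build the GEO Accession viewer URL for a series accession (GSE + digits).
# geo_url is a hypothetical helper name used only for illustration.
geo_url() {
    acc="$1"
    case "$acc" in
        # Reject a bare "GSE" or anything with a non-digit after the prefix.
        GSE|GSE*[!0-9]*)
            echo "not a GSE accession: $acc" >&2
            return 1
            ;;
        GSE[0-9]*)
            echo "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=${acc}"
            ;;
        *)
            echo "not a GSE accession: $acc" >&2
            return 1
            ;;
    esac
}

# Example (placeholder accession; substitute the one cited in the paper):
# geo_url GSE00000
```

Opening the printed URL in a browser should land on the same page as steps 1 to 4.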

5: find the SRA number

 5.1 the first arrow points to the SRA number; once you have it, you can generally download the raw fq files
 5.2 the second arrow points to the analysis results uploaded by the authors, which include .bed and .txt files
 5.3 the third arrow points to the information about the RAW.tar archive


[Figure: the SRA number on the GEO page]

6: search for the SRA number in NCBI SRA

 6.1 select SRA as the database
 6.2 enter the SRP number
 6.3 click Search
 6.4 click "Send results to Run selector"

[Figure: searching for the SRP number in SRA]

7: download the SRR runs

 7.1 find the SRR numbers you want to download; here we use Oct4 as the example


[Figure: NCBI SRA Run Selector]
[Figure: SRR numbers]
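
When many runs are needed, copying SRR numbers by hand is error-prone. The sketch below assumes the run accessions have been saved one per line in a plain-text file (the Run Selector can export such a list). The file name SRR_Acc_List.txt and the helper name read_accessions are assumptions made for this example.

```shell
# Print the well-formed SRR accessions found in a one-per-line text file.
# read_accessions is a hypothetical helper name.
read_accessions() {
    list_file="$1"
    # Strip Windows carriage returns, then keep only lines that are
    # exactly "SRR" followed by digits (blank or malformed lines are dropped).
    tr -d '\r' < "$list_file" | grep -E '^SRR[0-9]+$'
}

# Example: loop over the accessions and show what would be downloaded.
# read_accessions SRR_Acc_List.txt | while read -r srr; do
#     echo "would fetch $srr"
# done
```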

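 7.2 then write a loop to download them

A few points to note here:

ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/ is the base address
SRR002 is the base number of the project: SRR plus the first three digits of the run number
SRR0020$i is the specific SRR number of a run
SRR0020$i.sra is the file inside the folder named after that SRR number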

# Download SRR002012 through SRR002015 (i runs from 12 to 15)
for ((i = 12; i <= 15; i++))
do
    wget "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra/SRR002/SRR0020$i/SRR0020$i.sra"
done
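
The path rule in the notes above can also be written as a small function, so that the URL is derived from the accession instead of being typed out. This is only a sketch: sra_url is a hypothetical helper name, and it assumes the server layout is exactly as described in this post.

```shell
# Build the .sra download URL for one run, following the layout in the notes:
#   <base>/<first 6 characters of the accession>/<accession>/<accession>.sra
# sra_url is a hypothetical helper; the base address is the one used in this post.
SRA_BASE="ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByExp/sra"

sra_url() {
    srr="$1"
    # The first six characters (e.g. SRR002) name the project folder.
    prefix=$(printf '%s' "$srr" | cut -c1-6)
    echo "${SRA_BASE}/${prefix}/${srr}/${srr}.sra"
}

# Example: print the URLs for SRR002012 through SRR002015.
# for i in 12 13 14 15; do sra_url "SRR0020$i"; done
```

Each printed URL can then be passed to wget in the same way as in the loop above.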

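8: convert the SRA files to fastq

Note: install fastq-dump first
fastq-dump --split-3 -O ChIP_seq/ SRR***.sra 
rm SRR***.sra

The --split-3 option splits the fastq output of a paired-end (PE) .sra file into _1.fastq and _2.fastq; if the example dataset is single-end (SE), no splitting takes place.
Then delete the .sra file.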
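
With several runs it is safer to delete each .sra file only after its conversion has succeeded. The sketch below wraps the two commands above in a loop. The function name convert_all and the default output directory are assumptions for illustration; only the fastq-dump options already used in this post appear in it.

```shell
# Convert every .sra file in the current directory to fastq, deleting each
# .sra only after fastq-dump reports success.
# convert_all and the default output directory are illustrative assumptions.
convert_all() {
    out_dir="${1:-ChIP_seq}"
    # Stop early if fastq-dump is not on PATH.
    if ! command -v fastq-dump >/dev/null 2>&1; then
        echo "fastq-dump not found; install it first" >&2
        return 1
    fi
    mkdir -p "$out_dir"
    for sra in ./*.sra; do
        # Skip the unexpanded pattern when no .sra files exist.
        [ -e "$sra" ] || continue
        if fastq-dump --split-3 -O "$out_dir" "$sra"; then
            rm -- "$sra"
        else
            echo "conversion failed, keeping $sra" >&2
        fi
    done
}

# Example: convert everything into ChIP_seq/
# convert_all ChIP_seq
```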

[Figure: result]

Appendix:
[Figure: fastq-dump help output]

Reposted from blog.csdn.net/weixin_34293141/article/details/88238824