0079-【生信软件】-人类基因组hg19、hg38构建bwa索引

版权声明:本文为博主原创文章,未经博主允许不得转载。 https://blog.csdn.net/leadingsci/article/details/82947869

在临床数据分析时,使用的人类参考基因组为UCSC上面的hg19,hg39版本,且常常将除1-22,X,Y,M,以外的染色体去除,避免数据的干扰。

hg19索引构建

  1. 进入UCSC的官网,hg19的ftp网址 http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/
  2. 参考基因组版本说明,与NCBI的对应版本
This directory contains the Feb. 2009 assembly of the human genome (hg19,
GRCh37 Genome Reference Consortium Human Reference 37 (GCA_000001405.1)),
as well as repeat annotations and GenBank sequences.

The Feb. 2009 human reference sequence (GRCh37) was produced by the
Genome Reference Consortium:
	http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/
  1. 需要下载的文件
chromFa.tar.gz - The assembly sequence in one file per chromosome.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

md5sum.txt - checksums of files in this directory

hg19.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.
  1. 检测文件完整性
# 打开码文件
cat md5sum.txt
## 将chromFa.tar.gz的码写入一个新文件
 echo "ec3c974949f87e6c88795c17985141d3  chromFa.tar.gz" > check_md5sum.txt

## 检测
 md5sum -c check_md5sum.txt
 ##显示 chromFa.tar.gz: OK
  1. 查看染色体,看染色体命名规律
cat hg19.chrom.sizes |head -n 25
## 显示
chr1    249250621
chr2    243199373
chr3    198022430
chr4    191154276
chr5    180915260
chr6    171115067
chr7    159138663
chrX    155270560
chr8    146364022
chr9    141213431
chr10   135534747
chr11   135006516
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
chr16   90354753
chr17   81195210
chr18   78077248
chr20   63025520
chrY    59373566
chr19   59128983
chr22   51304566
chr21   48129895
chr6_ssto_hap7  4928567
  1. 解压参考基因组,显示所有染色体单个文件
tar -zxf chromFa.tar.gz
  1. 将不要的染色体,移动到暂存的文件夹
mv chrUn_gl0002* chr*_random.fa  chr*ctg* chr*apd* chr*cox* chr*hap* temp

#mkdir chrfa
#mv chromFa.tar.gz chrfa/

当面目录只剩下

chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
  1. 将需要的染色体进行合并,并移动到单个fa文件暂存文件夹
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
do
	echo $i;
	#cat $i >> hg19.fa
	mv $i chrfa
done
  1. bwa进行索引构建
bwa index -a bwtsw -p hg19.fa hg19.fa

索引构建完后显示

[BWTIncConstructFromPacked] 670 iterations done. 6154206942 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6177547390 characters processed.
[bwt_gen] Finished constructing BWT in 687 iterations.
[bwa_index] 2423.30 seconds elapse.
[bwa_index] Update BWT... 19.05 sec
[bwa_index] Pack forward-only FASTA... 16.77 sec
[bwa_index] Construct SA from BWT and Occ... 1192.17 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p hg19.fa hg19.fa
[main] Real time: 3814.250 sec; CPU: 3676.297 sec

文件生成查看

toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg19$ ls -lht
total 8.0G
-rwxrwxrwx 1 toucan toucan 3.0G Sep 27 09:53 hg19.fa

-rwxrwxrwx 1 toucan toucan 1.5G Sep 27 11:17 hg19.fa.sa
-rwxrwxrwx 1 toucan toucan 6.6K Sep 27 10:57 hg19.fa.amb
-rwxrwxrwx 1 toucan toucan  944 Sep 27 10:57 hg19.fa.ann
-rwxrwxrwx 1 toucan toucan 739M Sep 27 10:57 hg19.fa.pac
-rwxrwxrwx 1 toucan toucan 2.9G Sep 27 10:57 hg19.fa.bwt

-rwxrwxrwx 1 toucan toucan 3.5K Sep 27 13:16 work_index.log
-rwxrwxrwx 1 toucan toucan  685 Sep 27 10:06 work_index.sh
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:58 chrfa
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:17 temp
drwxrwxrwx 1 toucan toucan 4.0K Sep 27 09:00 download
  1. 索引构建完成


hg38构建索引

  1. 进入UCSC官网 http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/
  2. 查看下载文件
hg38.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

 hg38.fa.gz - "Soft-masked" assembly sequence in one file.
    Repeats from RepeatMasker and Tandem Repeats Finder (with period
    of 12 or less) are shown in lower case; non-repeating sequence is
    shown in upper case.

md5sum.txt - checksums of files in this directory

hg38.chrom.sizes - Two-column tab-separated text file containing assembly
    sequence names and sizes.
  1. 检查文件完整性
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "1c9dcaddfa41027f17cd8f7a82c7293b  hg38.fa.gz" >> check_md5sum.txt
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ echo "a5aa5da14ccf3d259c4308f7b2c18cb0  hg38.chromFa.tar.gz" >> check_md5sum.txt
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat check_md5sum.txt
1c9dcaddfa41027f17cd8f7a82c7293b  hg38.fa.gz
a5aa5da14ccf3d259c4308f7b2c18cb0  hg38.chromFa.tar.gz

toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ md5sum -c check_md5sum.txt
hg38.fa.gz: OK
hg38.chromFa.tar.gz: OK
  1. 查看染色体命名规则
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38/genome/downloads$ cat hg38.chrom.sizes |head -n 25
chr1    248956422
chr2    242193529
chr3    198295559
chr4    190214555
chr5    181538259
chr6    170805979
chr7    159345973
chrX    156040895
chr8    145138636
chr9    138394717
chr11   135086622
chr10   133797422
chr12   133275309
chr13   114364328
chr14   107043718
chr15   101991189
chr16   90338345
chr17   83257441
chr18   80373285
chr20   64444167
chr19   58617616
chrY    57227415
chr22   50818468
chr21   46709983
  1. 解压染色体单个文件
tar -zxf hg38.chromFa.tar.gz
cd chroms/
  1. 合并需要的染色体文件
 mkdir chrfa

 cat work.sh
for i in chr1.fa chr2.fa chr3.fa chr4.fa chr5.fa chr6.fa chr7.fa chr8.fa chr9.fa chr10.fa chr11.fa chr12.fa chr13.fa chr14.fa chr15.fa chr16.fa chr17.fa chr18.fa chr19.fa chr20.fa chr21.fa chr22.fa chrX.fa chrY.fa chrM.fa
do
        echo $i;
        cat $i >> ../hg38.fa
        mv $i ../chrfa
done

toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ rm -rf chroms/
toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ mv hg38.chromFa.tar.gz chrfa/

查看结果

toucan@DELL5577:/mnt/j/BIO001-bioinfos-refseq-doc/UCSC/hg38$ cat hg38.fa |grep ">"|uniq -c
      1 >chr1
      1 >chr2
      1 >chr3
      1 >chr4
      1 >chr5
      1 >chr6
      1 >chr7
      1 >chr8
      1 >chr9
      1 >chr10
      1 >chr11
      1 >chr12
      1 >chr13
      1 >chr14
      1 >chr15
      1 >chr16
      1 >chr17
      1 >chr18
      1 >chr19
      1 >chr20
      1 >chr21
      1 >chr22
      1 >chrX
      1 >chrY
      1 >chrM

  1. bwa构建索引
bwa index -a bwtsw -p hg38.fa hg38.fa

输出显示:

[BWTIncConstructFromPacked] 630 iterations done. 6018708242 characters processed.
[BWTIncConstructFromPacked] 640 iterations done. 6055482898 characters processed.
[BWTIncConstructFromPacked] 650 iterations done. 6088164258 characters processed.
[BWTIncConstructFromPacked] 660 iterations done. 6117207522 characters processed.
[BWTIncConstructFromPacked] 670 iterations done. 6143017202 characters processed.
[BWTIncConstructFromPacked] 680 iterations done. 6165952882 characters processed.
[bwt_gen] Finished constructing BWT in 686 iterations.
[bwa_index] 2564.67 seconds elapse.
[bwa_index] Update BWT... 19.27 sec
[bwa_index] Pack forward-only FASTA... 16.36 sec
[bwa_index] Construct SA from BWT and Occ... 1230.34 sec
[main] Version: 0.7.17-r1188
[main] CMD: bwa index -a bwtsw -p hg38.fa hg38.fa
[main] Real time: 3969.074 sec; CPU: 3855.406 sec

输出文件

-rwxrwxrwx 1 toucan toucan 3150052305 Oct  5 15:44 hg38.fa*
-rwxrwxrwx 1 toucan toucan      16843 Oct  5 16:34 hg38.fa.amb*
-rwxrwxrwx 1 toucan toucan        954 Oct  5 16:34 hg38.fa.ann*
-rwxrwxrwx 1 toucan toucan 3088286508 Oct  5 16:33 hg38.fa.bwt*
-rwxrwxrwx 1 toucan toucan  772071602 Oct  5 16:34 hg38.fa.pac*
-rwxrwxrwx 1 toucan toucan 1544143256 Oct  5 16:54 hg38.fa.sa*


其他索引构建

fai结尾

samtools faidx hg19.fa

## 显示生成
hg19.fa.fai

samtools faidx hg38.fa

dict结尾

gatk CreateSequenceDictionary -R hg19.fa -O hg19.dict

猜你喜欢

转载自blog.csdn.net/leadingsci/article/details/82947869