Genome processing

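Remove unplaced scaffolds, contigs, and the mitochondrial sequence from a genome FASTA file. The script mainly targets the RefSeq genome sequences provided by NCBI and should work for most species assembled to chromosome level. It renames each retained sequence to "chr" plus its chromosome name, converts all bases to uppercase, computes the length of each chromosome, and wraps the sequence at 80 characters per line.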

The processing script is as follows:

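use strict;
use warnings;

my $help=<<USAGE;
Usage: perl $0 genome.fa new.fa chrlen.list

USAGE
# Check the arguments before opening any file.
die $help unless @ARGV==3;

open my $in,"<",$ARGV[0] or die "Cannot read $ARGV[0]: $!\n";
open my $fa,">",$ARGV[1] or die "Cannot write $ARGV[1]: $!\n";
open my $list,">",$ARGV[2] or die "Cannot write $ARGV[2]: $!\n";

# Use ">" as the record separator so each read returns one FASTA record.
$/=">";
<$in>;	# discard the empty chunk before the first ">"
while(my $record=<$in>){
	chomp $record;	# strip the trailing ">"
	my @line=split /[\r\n]+/,$record;
	my $seqName=shift @line;
	next unless defined $seqName;
	# Chromosome name = last word before the first comma of the header.
	my $chr=(split /\s+/,((split /,/,$seqName)[0]))[-1];
	# Skip unplaced scaffolds, contigs and the mitochondrial genome.
	next if $chr=~ /scaffold/;
	next if $chr=~ /Contig/;
	next if $chr=~ /mitochondrion/;
	$chr="chr".$chr;
	my $seq=uc(join "",@line);
	my $len=length($seq);
	# Wrap at 80 characters; every line, including the last, ends in "\n".
	$seq=~ s/(.{1,80})/$1\n/g;
	print $fa ">$chr\n$seq";
	print $list "$chr\t$len\n";
}
close $in;
close $fa;
close $list;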
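
The two rules the script depends on, header parsing and 80-column wrapping, can be tried separately on small inputs. The sketch below is only an illustration: the helper names `chr_name` and `wrap_seq` are hypothetical and do not appear in the script, and the headers are written in the RefSeq style for demonstration, not copied from a real file.

```perl
use strict;
use warnings;

# Same rule as the script: take the last word before the first comma of the
# header. Returns "chr" + word, or nothing if the record would be skipped.
# (chr_name is a hypothetical helper name used only in this sketch.)
sub chr_name {
    my ($header) = @_;
    my $word = (split /\s+/, ((split /,/, $header)[0]))[-1];
    return if $word =~ /scaffold|Contig|mitochondrion/;
    return "chr$word";
}

# Uppercase a sequence and wrap it at $width characters per line
# (the script uses 80). Every output line ends with a newline.
sub wrap_seq {
    my ($seq, $width) = @_;
    $seq = uc $seq;
    $seq =~ s/(.{1,$width})/$1\n/g;
    return $seq;
}

# Illustrative RefSeq-style headers (assumed examples, not real file content).
my @headers = (
    'NC_000001.11 Homo sapiens chromosome 1, GRCh38.p14 Primary Assembly',
    'NC_000023.11 Homo sapiens chromosome X, GRCh38.p14 Primary Assembly',
    'NW_000000000.1 Homo sapiens chromosome 1 unlocalized genomic scaffold, GRCh38.p14 Primary Assembly',
    'NC_012920.1 Homo sapiens mitochondrion, complete genome',
);

for my $header (@headers) {
    my $name = chr_name($header);
    printf "%-10s <- %s\n", defined $name ? $name : '(skipped)', $header;
}

# A width of 4 keeps the demo short; prints ACGT, NACG and TN on three lines.
print wrap_seq('acgtnACGTN', 4);
```

Because the filter only inspects the last word before the first comma, a header with a different layout may be kept or dropped unexpectedly, so it is worth checking a few headers from a new assembly before running the full script.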

Reposted from www.cnblogs.com/mmtinfo/p/12165080.html