Genome processing

    Remove the non-target sequences (scaffolds, contigs, and the mitochondrial genome) from the genome FASTA file. This is aimed mainly at RefSeq genome sequences from NCBI, where common species generally have chromosome-level assemblies. All bases are converted to uppercase, the length of each chromosome is calculated, and sequence lines are wrapped at 80 characters.

The processing script is as follows:

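use strict;
use warnings;

my $help=<<USAGE;
Usage: perl $0 genome.fa new.fa chrlen.list

USAGE
# Check the argument count before opening any file.
die $help unless @ARGV==3;

open my $in,"<",$ARGV[0] or die "Cannot open $ARGV[0]: $!\n";
open my $fa,">",$ARGV[1] or die "Cannot write $ARGV[1]: $!\n";
open my $list,">",$ARGV[2] or die "Cannot write $ARGV[2]: $!\n";

# Read one FASTA record at a time: records are separated by ">".
$/=">";
<$in>;	# discard the empty chunk before the first ">"
my %chrlen;
while(<$in>){
	chomp;	# remove the trailing ">"
	my @line=split /\n+/,$_;
	my $seqName=shift @line;
	# Chromosome name: last word of the header text before the first comma.
	my $chr=(split /\s+/,((split /,/,$seqName)[0]))[-1];
	# Skip scaffolds, contigs and the mitochondrial genome.
	next if $chr=~ /scaffold|Contig|mitochondrion/;
	$chr="chr".$chr;
	my $seq=join "",@line;
	$seq=~s/\s//g;	# also removes "\r" from Windows line endings
	$seq=uc($seq);
	my $len=length($seq);
	$chrlen{$chr}=$len;
	# Wrap at 80 characters; add a final newline only if the last line is short.
	$seq=~ s/(.{80})/$1\n/g;
	$seq.="\n" if $len % 80;
	print $fa ">$chr\n$seq";
	print $list "$chr\t$len\n";
}
close $_ for $in,$fa,$list;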
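
The two least obvious steps in the script, deriving the chromosome name from the FASTA header and wrapping the sequence at 80 characters, can be tried in isolation with the sketch below. The helper names `chr_name` and `wrap80` are hypothetical and not part of the original script, and the sample headers are made-up strings in the RefSeq style, not real accessions.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the header parsing in the script above.
# Returns the "chr"-prefixed name, or nothing for records the script skips.
sub chr_name {
    my ($header) = @_;
    # Keep the text before the first comma ...
    my ($before_comma) = split /,/, $header;
    return unless defined $before_comma;
    # ... and take its last whitespace-separated word.
    my $token = (split /\s+/, $before_comma)[-1];
    return unless defined $token;
    return if $token =~ /scaffold|Contig|mitochondrion/;
    return "chr$token";
}

# Hypothetical helper mirroring the uppercase and 80-column wrapping step.
# Assumes $seq contains no whitespace.
sub wrap80 {
    my ($seq) = @_;
    $seq = uc $seq;
    my $len = length $seq;
    $seq =~ s/(.{80})/$1\n/g;
    # A final newline is only missing when the last line is shorter than 80.
    $seq .= "\n" if $len % 80;
    return $seq;
}

# Made-up headers for illustration only.
my @headers = (
    "NC_000000.1 Example species chromosome 1, ExampleAssembly Primary Assembly",
    "NW_000000.1 Example species chromosome 1 unlocalized genomic scaffold, ExampleAssembly",
    "NC_000001.1 Example species mitochondrion, complete genome",
);
for my $h (@headers) {
    my $name = chr_name($h);
    print defined $name ? "$name\n" : "skipped\n";
}
print wrap80("acgt" x 25);   # 100 bases: one line of 80, one line of 20
```

Under these assumptions, only the first header is kept (as `chr1`); the other two are skipped because the last word before the comma is `scaffold` or `mitochondrion`.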

Origin www.cnblogs.com/mmtinfo/p/12165080.html