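Much of the sequencing data downloaded from NCBI has already had its adapters trimmed and is provided in a reads-count format: the first column of each line is the read sequence and the second column is the number of times that read occurs. We need to convert it to FASTA format, with every individual read written out as its own sequence record.
Original file: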
cat GSM3124755_WTB_PARE.csv | head
GATCTTTCGAACTTTCCCAAC,1
ACTCTCTGCACTAAACAAAA,1
TTTTGTCATTGATTTTTGTA,4
GCAATCGAAATTCTCTGACG,1
GTAGTGACGAAAGCTGGCTCC,1
ATTACAGCTTCTGATGTCTT,4
CATCTTGGTCATGTCTTTGA,1
CATACAATATGGAGATGAAG,1
CCGACTTTGAGGGAGTTCGT,1
TACATTGGTGTTGGTACTGT,1
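Python script:
# Expand each "sequence,count" row into `count` FASTA records.
s = 0  # 1-based row number, used as the first part of each FASTA header
with open('GSM3124755_WTB_PARE.csv', 'r') as fr, open('GSM3124755_WTB_PARE.fas', 'w') as fw:
    for line in fr:
        s += 1
        # Column 1 is the read sequence, column 2 is how many times it was observed.
        seq, count = line.strip().split(',')[:2]
        # Write one record per copy; the header is <row number>_<copy number>.
        for i in range(int(count)):
            fw.write('>' + str(s) + '_' + str(i + 1) + '\n' + seq + '\n')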
Output:
cat GSM3124755_WTB_PARE.fas | head -n 32
>1_1
GATCTTTCGAACTTTCCCAAC
>2_1
ACTCTCTGCACTAAACAAAA
>3_1
TTTTGTCATTGATTTTTGTA
>3_2
TTTTGTCATTGATTTTTGTA
>3_3
TTTTGTCATTGATTTTTGTA
>3_4
TTTTGTCATTGATTTTTGTA
>4_1
GCAATCGAAATTCTCTGACG
>5_1
GTAGTGACGAAAGCTGGCTCC
>6_1
ATTACAGCTTCTGATGTCTT
>6_2
ATTACAGCTTCTGATGTCTT
>6_3
ATTACAGCTTCTGATGTCTT
>6_4
ATTACAGCTTCTGATGTCTT
>7_1
CATCTTGGTCATGTCTTTGA
>8_1
CATACAATATGGAGATGAAG
>9_1
CCGACTTTGAGGGAGTTCGT
>10_1
TACATTGGTGTTGGTACTGT
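
The same conversion can also be sketched as a pair of small reusable functions. This is only an illustrative variant, not a replacement for the script above: the names `expand_counts` and `write_fasta` are hypothetical, and skipping blank or malformed rows (for example a header line) is an assumption about how such input should be treated. Row numbers still count every row, so the headers match the `<row number>_<copy number>` scheme shown in the output.

```python
import csv
import io


def expand_counts(rows):
    """Yield (header, sequence) pairs, one pair per copy of each read."""
    for index, row in enumerate(rows, start=1):
        # Assumption: rows without a numeric second column (blank lines,
        # a header line, truncated rows) are skipped rather than raising.
        if len(row) < 2 or not row[1].strip().isdigit():
            continue
        seq, count = row[0].strip(), int(row[1])
        for copy in range(1, count + 1):
            yield f"{index}_{copy}", seq


def write_fasta(csv_handle, fasta_handle):
    """Stream a reads-count CSV into FASTA; return the number of records written."""
    written = 0
    for header, seq in expand_counts(csv.reader(csv_handle)):
        fasta_handle.write(f">{header}\n{seq}\n")
        written += 1
    return written


# In-memory demo using the first three rows of the example file.
demo_in = io.StringIO(
    "GATCTTTCGAACTTTCCCAAC,1\n"
    "ACTCTCTGCACTAAACAAAA,1\n"
    "TTTTGTCATTGATTTTTGTA,4\n"
)
demo_out = io.StringIO()
n_records = write_fasta(demo_in, demo_out)
print(n_records)
```

For real files, the two handles would come from `open('GSM3124755_WTB_PARE.csv', newline='')` and `open('GSM3124755_WTB_PARE.fas', 'w')`. The returned record count should equal the sum of the second column, which gives a quick sanity check on the conversion.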