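Python for bioinformatics: practice problems

The full problem statements are at http://rosalind.info/problems/list-view/.

The community scripts I learned from are on the Biotrainee (生信技能树) forum: http://www.biotrainee.com/forum-59-1.html.

Each problem may have several solutions, labelled ## 1, ## 2, ## 3. Usually ## 1 is my own script and the others are adapted from other users' scripts. If, like me, you have only just started using Python on bioinformatics data, please work through each problem yourself before reading the code here.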

1. Count the nucleotides in a sequence

The test.txt file:

GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT

#!/usr/bin/env python3
nts = {c:0 for c in 'ATGC'}
with open('./test.txt','r') as f:
  for a in f:
    a = a.upper()
    for nt in a.rstrip():
      nts[nt] += 1
print (nts)

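2. Transcribe DNA into RNA

## 1
import re
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    RnaSeq = re.sub('T','U',line.rstrip())
    # print inside the loop so every line is reported, not only the last one
    print(RnaSeq)
## 2
with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    # strip the trailing newline so print() does not emit blank lines
    print(line.rstrip().replace('T','U'))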

3. Get the reverse complement of a sequence

## 1
trans = {'A':'T','T':'A','G':'C','C':'G'}
with open('./test.txt','r') as f:
  for line in f:
    seq = ''
    line = line.upper()
    for nt in line.rstrip():
      seq += trans[nt]
    # reverse once the whole line has been complemented
    print(seq[::-1])
## 2
def reverse_complement(seq):
  ntComplement = {'A':'T','T':'A','G':'C','C':'G'}
  RevSeqList = list(reversed(seq))
  RevComSeqList = [ntComplement[k] for k in RevSeqList]
  RevComSeq = ''.join(RevComSeqList)
  return RevComSeq

with open('./test.txt','r') as f:
  for line in f:
    line = line.upper()
    # call inside the loop so every sequence in the file is handled
    print(reverse_complement(line.rstrip()))

Output:

ATCCCGTTAGTAGGCTC
AGGCCGTCATTACGATG
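
A more compact route to the same result, as a rough sketch: Python's built-in str.maketrans and str.translate can complement every base in one pass. The names COMPLEMENT and reverse_complement_translate are my own, chosen only for illustration, and the sketch assumes the input contains only A, T, G and C.

```python
# Translation table mapping each base to its complement (A<->T, G<->C).
COMPLEMENT = str.maketrans('ATGC', 'TACG')

def reverse_complement_translate(seq):
    """Return the reverse complement of a DNA string (hypothetical helper)."""
    # Upper-case first, complement every base, then reverse the string.
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement_translate('GAGCCTACTAACGGGAT'))
```

Characters outside ATGC pass through str.translate unchanged, so validate the input first if that matters for your data.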

4. Find the sequence with the highest GC content in a FASTA file

## 1
Seq = {}
seqGC = {}
with open('./test.fa','r') as f:
        for line in f:
                if line.startswith('>'):
                        # strip the newline so the name is a clean dictionary key
                        SeqName = line[1:].rstrip()
                        Seq[SeqName] = ''
                        seqGC[SeqName] = 0
                else:
                        line = line.upper()
                        line = line.rstrip()
                        Seq[SeqName] += line
                        seqGC[SeqName] += line.count('G')
                        seqGC[SeqName] += line.count('C')
maxGC = -1
tmp = None
for key , value in Seq.items():
        # GC content as a percentage of the sequence length
        gc = seqGC[key] / len(value) * 100
        if maxGC < gc:
                maxGC = gc
                tmp = key
print ('>'+tmp+'\n'+Seq[tmp])
## 2
from operator import itemgetter
from collections import OrderedDict
SeqTest = OrderedDict()
gcContent = OrderedDict()
with open('./test.fa','r') as f:
        for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                        SeqName = line[1:]
                        SeqTest[SeqName] = ''
                        continue
                SeqTest[SeqName] += line.upper()

for key, value in SeqTest.items():
        totalLength = len(value)
        gcNum = value.count('G') + value.count('C')
        gcContent[key] = float(gcNum/totalLength)*100
# sort ascending by GC percentage; the last item has the highest value
sortedGC = sorted(gcContent.items(),key = itemgetter(1))
largeName = sortedGC[-1][0]
largeGCContent = sortedGC[-1][1]
print ('The sequence with the highest GC content is %s (%s%%)' %(largeName,largeGCContent))
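
If the sequences are already in a dictionary, the selection step can be a single call to max(). The sketch below assumes that; gc_percent and the three example records are made up for illustration and are not real data.

```python
def gc_percent(seq):
    """GC content of a sequence as a percentage; 0.0 for an empty string."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count('G') + seq.count('C')) / len(seq) * 100

# Made-up example records (name -> sequence), only for demonstration.
records = {'seq_a': 'ATATAT', 'seq_b': 'GGCCAT', 'seq_c': 'GCATAT'}

# max() with a key function picks the name whose sequence has the highest GC%.
best = max(records, key=lambda name: gc_percent(records[name]))
print(best, round(gc_percent(records[best]), 2))
```

Guarding the empty-string case avoids the ZeroDivisionError that a header without a sequence would otherwise trigger.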

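5. Count point mutations

Given two sequences s and t of equal length, the Hamming distance between them, written dH(s,t), is the number of positions at which the two sequences differ. The task is to return the Hamming distance of the two sequences.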

## 1
fh = open('./test.txt','r')
lst = []
for line in fh:
        lst.append(line.rstrip())
hamming_dis = 0
for i in range(len(lst[0])):
        if lst[0][i] == lst[1][i]:
                continue
        hamming_dis += 1
print (hamming_dis)
## 2
with open('./test.txt','r') as fh:
        seq = fh.readlines()
seq1, seq2 = seq[0].strip(), seq[1].strip()
# collect every index at which the two sequences disagree
mutation = [i for i in range(len(seq1)) if seq1[i] != seq2[i]]
print (len(mutation))
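
The definition above translates almost word for word into a zip-based one-liner. This is only a sketch: hamming_distance is a name I picked, and raising an error on unequal lengths is my own choice, since the definition only covers sequences of equal length.

```python
def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError('sequences must have the same length')
    # zip pairs up the characters position by position; True counts as 1.
    return sum(a != b for a, b in zip(s, t))

print(hamming_distance('GAGCCTACTAACGGGAT', 'CATCGTAATGACGGCCT'))
```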

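6. Mendel's First Law

A population contains organisms of three genotypes: k homozygous dominant, m heterozygous, and n homozygous recessive. Assume the trait is controlled by a single pair of alleles and that any two individuals chosen at random can mate. Find the probability that the offspring of two randomly chosen individuals carries a dominant allele.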

## 1
k = int(input("enter the number of homozygous dominant: "))
m = int(input("enter the number of heterozygous: "))
n = int(input("enter the number of homozygous recessive: "))

num = int(k + m + n)
choice = num*(num-1)/2.0
p = 1 - (n*(n-1)/2 + 0.25*m*(m-1)/2 + m*n*0.5)/choice
print(p)
## 2
# comb lives in scipy.special; scipy.misc.comb was removed from SciPy
from scipy.special import comb
num = input("Number of individuals(k,m,n): ")
[k,m,n] = map(int,num.split(','))
t = k + m + n
rr = comb(n,2)/comb(t,2)
hh = comb(m,2)/comb(t,2)
hr = comb(n,1)*comb(m,1)/comb(t,2)

p = 1 - (rr+hh*1/4+hr*1/2)
print(p)
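
The same calculation wrapped in a function, so it can be checked without typing at a prompt. This is a sketch of the complement approach used above (one minus the probability of a recessive child); dominant_probability is a name I made up, and it assumes the population has at least two individuals.

```python
def dominant_probability(k, m, n):
    """Probability that a child of two random individuals has a dominant allele.

    k: homozygous dominant, m: heterozygous, n: homozygous recessive.
    Assumes k + m + n >= 2 so that at least one pair can be formed.
    """
    total = k + m + n
    pairs = total * (total - 1) / 2
    # Weighted number of pairs that produce a homozygous recessive child:
    #   aa x aa always, Aa x Aa a quarter of the time, Aa x aa half the time.
    recessive = n * (n - 1) / 2 + 0.25 * m * (m - 1) / 2 + 0.5 * m * n
    return 1 - recessive / pairs

print(round(dominant_probability(2, 2, 2), 5))
```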

7. Translate RNA into protein

## 1
def translate_rna(sequence):
    codonTable = {
    'AUA':'I', 'AUC':'I', 'AUU':'I', 'AUG':'M',
    'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACU':'T',
    'AAC':'N', 'AAU':'N', 'AAA':'K', 'AAG':'K',
    'AGC':'S', 'AGU':'S', 'AGA':'R', 'AGG':'R',
    'CUA':'L', 'CUC':'L', 'CUG':'L', 'CUU':'L',
    'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCU':'P',
    'CAC':'H', 'CAU':'H', 'CAA':'Q', 'CAG':'Q',
    'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGU':'R',
    'GUA':'V', 'GUC':'V', 'GUG':'V', 'GUU':'V',
    'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCU':'A',
    'GAC':'D', 'GAU':'D', 'GAA':'E', 'GAG':'E',
    'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGU':'G',
    'UCA':'S', 'UCC':'S', 'UCG':'S', 'UCU':'S',
    'UUC':'F', 'UUU':'F', 'UUA':'L', 'UUG':'L',
    'UAC':'Y', 'UAU':'Y', 'UAA':'', 'UAG':'',
    'UGC':'C', 'UGU':'C', 'UGA':'', 'UGG':'W',
    }
    proteinsequence = ''
    for n in range(0,len(sequence),3):
        if sequence[n:n+3] in codonTable.keys():
            proteinsequence += codonTable[sequence[n:n+3]]
    return proteinsequence


# open both files in one with-statement so the output file is flushed and closed
with open('./rna.txt','r') as f, open('./protein.txt','w') as protein_fh:
        for line in f:
                protein_fh.write(translate_rna(line.strip('\n')))
## 2
# a plain dict is enough here: the table is only used for lookups
codonTable={
'AUA':'I','AUC':'I','AUU':'I','AUG':'M',
'ACA':'T','ACC':'T','ACG':'T','ACU':'T',
'AAC':'N','AAU':'N','AAA':'K','AAG':'K',
'AGC':'S','AGU':'S','AGA':'R','AGG':'R',
'CUA':'L','CUC':'L','CUG':'L','CUU':'L',
'CCA':'P','CCC':'P','CCG':'P','CCU':'P',
'CAC':'H','CAU':'H','CAA':'Q','CAG':'Q',
'CGA':'R','CGC':'R','CGG':'R','CGU':'R',
'GUA':'V','GUC':'V','GUG':'V','GUU':'V',
'GCA':'A','GCC':'A','GCG':'A','GCU':'A',
'GAC':'D','GAU':'D','GAA':'E','GAG':'E',
'GGA':'G','GGC':'G','GGG':'G','GGU':'G',
'UCA':'S','UCC':'S','UCG':'S','UCU':'S',
'UUC':'F','UUU':'F','UUA':'L','UUG':'L',
'UAC':'Y','UAU':'Y','UAA':'','UAG':'',
'UGC':'C','UGU':'C','UGA':'','UGG':'W',
}

rnaseq = ''
with open('./rna.txt','r') as f:
        for line in f:
                line = line.rstrip()
                # join all lines of the file into one RNA string
                rnaseq += line.upper()

aminoAcids = []
i = 0
# stop before a trailing incomplete codon, which has no entry in the table
while i + 3 <= len(rnaseq):
        codon = rnaseq[i:i+3]
        if codonTable[codon] != '':
                aminoAcids.append(codonTable[codon])
        i += 3

peptide = ''.join(aminoAcids)
print(peptide)
## 3
from Bio.Seq import Seq

# translate (Bio.Alphabet was removed in Biopython 1.78, so Seq takes only the string)
rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG")
print(rna.translate())
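
Solutions ## 1 and ## 2 map stop codons to an empty string, so they keep translating past a stop. If you want translation to end at the first stop codon, one possible sketch is below. It builds the standard genetic code from the usual UCAG ordering instead of typing out all 64 entries; BASES, AMINO_ACIDS, CODON_TABLE and translate_to_stop are names I chose for illustration.

```python
from itertools import product

# Standard genetic code listed in UCAG order; '*' marks a stop codon.
BASES = 'UCAG'
AMINO_ACIDS = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG'
CODON_TABLE = {
    ''.join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_to_stop(rna):
    """Translate an RNA string codon by codon, ending at the first stop codon."""
    rna = rna.upper()
    protein = []
    # len(rna) - 2 keeps the loop from reading an incomplete trailing codon.
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == '*':
            break
        protein.append(aa)
    return ''.join(protein)

print(translate_to_stop('AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'))
```

A codon containing anything other than A, C, G or U raises a KeyError here, which I find easier to debug than silently skipping it.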

8. Find a DNA motif

## 1 
seq = 'GATATATGCATATACTT'
motif = 'ATAT'
motif_len = len(motif)
position = []
# +1 so that the last window, ending at the final base, is also checked
for i in range(len(seq)-motif_len+1):
        if seq[i:i+motif_len] == motif:
                position.append(i+1)
print(position)
## 2
import re
seq = 'GATATATGCATATACTT'
print([i.start()+1 for i in re.finditer('(?=ATAT)',seq)])
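
A third option that needs neither slicing nor regular expressions, sketched with str.find. Restarting the search one character after each hit is what lets overlapping matches through. find_motif is a name I made up for this example.

```python
def find_motif(seq, motif):
    """Return the 1-based start positions of motif in seq, overlaps included."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)          # convert to 1-based position
        # Resume one character later so overlapping hits are not skipped.
        start = seq.find(motif, start + 1)
    return positions

print(find_motif('GATATATGCATATACTT', 'ATAT'))
```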

9. Consensus sequence of multiple equal-length sequences

For example, given the following sequences:

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Base counts at each position:

A   5 1 0 0 5 5 0 0
C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6
Consensus   A T G C A A C T

## 1
def seq_list(fasta):
        seq_list = []
        for line in fasta.readlines():
                if not line.startswith('>'):
                        seq = list(line.rstrip())
                        seq_list.append(seq)
        return seq_list
def statistic_base(seq_list):
        for base in 'ATGC':
                base_total = []
                for sit in range(len(seq_list[0])):
                        col = [x[sit] for x in seq_list]
                        num = col.count(base)
                        base_total.append(num)
                print('%s:%s'%(base,base_total))
fh =  open('./test.fa','r')
sequence_list = seq_list(fh)
statistic_base(sequence_list)
## 2
from collections import Counter
from collections import OrderedDict
seq = OrderedDict()
seqLength = 0
fh = open('./test.consensus.txt','wt')

with open('./test.fa','r') as f:
        for line in f:
                if line.startswith('>'):
                        seq_name = line.rstrip()
                        seq[seq_name] = ''
                        continue
                seq[seq_name] += line.upper().rstrip()
        seqLength = len(seq[seq_name])

a,t,g,c = [],[],[],[]
consensus = ''
for i in range(seqLength):
        sequence = ''
        for j in seq.keys():
                sequence += seq[j][i]
        a.append(sequence.count('A'))
        t.append(sequence.count('T'))
        g.append(sequence.count('G'))
        c.append(sequence.count('C'))
        counts = Counter(sequence)
        consensus += counts.most_common()[0][0]
fh.write(consensus+'\n')
fh.write('\n'.join(['A:\t'+'\t'.join(map(str,a)),'C:\t'+'\t'.join(map(str,c)),'G:\t'+'\t'.join(map(str,g)),'T:\t'+'\t'.join(map(str,t))])+'\n')
fh.close()
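
For comparison, a sketch that works on a list of sequences already in memory and returns both the count table and the consensus. zip(*seqs) walks the alignment column by column. profile_and_consensus is my own name, and breaking ties in A, C, G, T order is an assumption of mine; the problem accepts any of the tied bases.

```python
from collections import Counter

def profile_and_consensus(seqs):
    """Return (profile, consensus) for a list of equal-length DNA strings."""
    profile = {base: [] for base in 'ACGT'}
    consensus = []
    # zip(*seqs) yields one tuple per alignment column.
    for column in zip(*seqs):
        counts = Counter(column)
        for base in 'ACGT':
            profile[base].append(counts[base])
        # On a tie, max() keeps the first base in 'ACGT' order.
        consensus.append(max('ACGT', key=lambda b: counts[b]))
    return profile, ''.join(consensus)

seqs = ['ATCCAGCT', 'GGGCAACT', 'ATGGATCT', 'AAGCAACC',
        'TTGGAACT', 'ATGCCATT', 'ATGGCACT']
profile, consensus = profile_and_consensus(seqs)
print(consensus)
for base in 'ACGT':
    print(base + ':', ' '.join(map(str, profile[base])))
```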

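10. Mortal Fibonacci rabbits

The Fibonacci sequence is defined by the recurrence Fn = Fn-1 + Fn-2, with starting values F1 = F2 = 1.
Suppose each rabbit lives for m months. How many rabbits are there after n months?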

## 1
def fib(n,m):
        f= [0,1,1]
        for i in range(3,n+1):
                if i <= m:
                        total = f[i-1] + f[i-2]
                elif i == m+1:
                        total = f[i-1] + f[i-2] - 1
                else:
                        total = f[i-1] + f[i-2] - f[i-m-1]
                f.append(total)
        return(f[n])

inp = input('months to simulate n (n<=100) and rabbit lifespan in months m (m<=20); input(n,m): ')
[n,m]=map(int,inp.split(','))

print(fib(n,m))
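
If the three-branch recurrence feels hard to trust, a direct simulation that tracks how many rabbits there are of each age makes a useful cross-check. This is only a sketch under the same rules as above (a rabbit breeds from its second month and dies after m months); mortal_rabbits is a name I made up.

```python
def mortal_rabbits(n, m):
    """Rabbits alive in month n when each rabbit lives for m months."""
    # ages[i] = number of rabbits that are i months old (0 = newborn).
    ages = [1] + [0] * (m - 1)
    for _ in range(n - 1):
        # Every rabbit at least one month old produces one newborn.
        newborns = sum(ages[1:])
        # Everyone ages by a month; the oldest bucket drops off the end (dies).
        ages = [newborns] + ages[:-1]
    return sum(ages)

print(mortal_rabbits(6, 3))
```

With m larger than n nobody dies, so the result should fall back to the ordinary Fibonacci numbers, which is an easy sanity check.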

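11. Graph Theory

The problem description is long, so read it yourself: http://rosalind.info/problems/grph/
In short, two sequences are connected whenever the last three bases of one match the first three bases of the other.
Input file: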

>Rosalind_0498
AAATAAA
>Rosalind_2391
AAATTTT
>Rosalind_2323
TTTTCCC
>Rosalind_0442
AAATCCC
>Rosalind_5013
GGGTGGG

Output:

Rosalind_0498 Rosalind_2391
Rosalind_0498 Rosalind_0442
Rosalind_2391 Rosalind_2323

seq = {}
with open('./overlap.fa','r') as f:
        for line in f:
                line = line.rstrip()
                if line.startswith('>'):
                        seqname = line[1:]
                        seq[seqname] = ''
                        continue
                seq[seqname] += line.upper()

for key , value in seq.items():
        for key2 ,value2 in seq.items():
                if key != key2 and value[-3:] == value2[:3]:
                        # space-separated, matching the sample output
                        print(key+' '+key2)
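
The same suffix/prefix comparison as a function that returns the edges instead of printing them, which makes it easy to test against the sample above. overlap_edges and the parameter k are names I introduced; k=3 matches the overlap length used in this problem.

```python
def overlap_edges(seqs, k=3):
    """List (tail, head) pairs where the last k bases of tail start head."""
    return [
        (name_a, name_b)
        for name_a, seq_a in seqs.items()
        for name_b, seq_b in seqs.items()
        # A sequence is never connected to itself.
        if name_a != name_b and seq_a[-k:] == seq_b[:k]
    ]

seqs = {
    'Rosalind_0498': 'AAATAAA',
    'Rosalind_2391': 'AAATTTT',
    'Rosalind_2323': 'TTTTCCC',
    'Rosalind_0442': 'AAATCCC',
    'Rosalind_5013': 'GGGTGGG',
}
for tail, head in overlap_edges(seqs):
    print(tail, head)
```

This compares every pair, so it is quadratic in the number of sequences. That is fine for inputs of this size.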

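12. Calculating expected offspring

Again, I will not explain the theory here; for details see http://rosalind.info/problems/iev/
There are six possible genotype pairings for a couple: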

AA-AA
AA-Aa
AA-aa
Aa-Aa
Aa-aa
aa-aa

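Given six non-negative integers, the numbers of couples with each of the six genotype pairings, find the expected number of offspring showing the dominant phenotype in the next generation, assuming each couple has two children.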

def expected(a,b,c,d,f,g):
        AA_AA = 1
        AA_Aa = 1
        AA_aa = 1
        Aa_Aa = 0.75
        Aa_aa = 0.5
        aa_aa = 0
        p = (AA_AA*a + AA_Aa*b + AA_aa*c + Aa_Aa*d + Aa_aa*f + aa_aa*g)*2
        return (p)

inp = input('input(a,b,c,d,f,g): ')
[a,b,c,d,f,g] = map(int,inp.split(','))
print(expected(a,b,c,d,f,g))
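
A slightly more general sketch of the same expectation: keep the six probabilities in one tuple, in the order of the list above, and pair them with the counts using zip. DOMINANT_PROB, expected_dominant and children_per_couple are names I chose for illustration.

```python
# Probability that one child shows the dominant phenotype, in the order
# AA-AA, AA-Aa, AA-aa, Aa-Aa, Aa-aa, aa-aa.
DOMINANT_PROB = (1.0, 1.0, 1.0, 0.75, 0.5, 0.0)

def expected_dominant(couples, children_per_couple=2):
    """Expected number of dominant-phenotype children for six couple counts."""
    if len(couples) != len(DOMINANT_PROB):
        raise ValueError('expected exactly six couple counts')
    # Linearity of expectation: sum probability * count, then scale by children.
    per_child = sum(p * c for p, c in zip(DOMINANT_PROB, couples))
    return children_per_couple * per_child

print(expected_dominant([1, 0, 0, 1, 0, 1]))
```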