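The Needleman-Wunsch Algorithm
Nearly all alignment software in use today relies on algorithms derived from this classic one.
We use the Needleman-Wunsch algorithm to build a global alignment of sequence p and sequence q. Besides the two sequences, the input includes a substitution scoring matrix, which gives the similarity score for each pair of letters, and a gap penalty (Figure 1). The gap penalty is the score assigned when a letter is aligned with a gap. We want identical or similar letters to line up as often as possible. A letter aligned with a gap, like two dissimilar letters aligned with each other, is something we want to see as rarely as possible, so a letter aligned with a gap usually receives a negative score, and that negative score is called the gap penalty. Here we set the gap penalty, i.e. the gap value, to -5. A gap is never aligned with another gap. That is all of the input.
Implementing the substitution scoring matrix in Python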
def matchBase(base1, base2):
    # Return the substitution score for aligning base1 with base2.
    base = base1 + base2
    if base1 == base2:
        # Scores for identical bases.
        match_score = {"AA": 10, "GG": 7, "CC": 9, "TT": 8}
        if base1 == 'A':
            return match_score['AA']
        elif base1 == 'G':
            return match_score['GG']
        elif base1 == 'C':
            return match_score['CC']
        else:
            return match_score['TT']
    else:
        # Scores for mismatched bases; the matrix is symmetric.
        mismatch = {"AG":-1,"GA":-1,"AC":-3,"CA":-3,"AT":-4,"TA":-4,"GC":-5,"CG":-5,"GT":-3,"TG":-3,"CT":0,"TC":0}
        if base == "AG" or base == "GA":
            return mismatch['AG']
        elif base == "AC" or base == "CA":
            return mismatch['AC']
        elif base == "AT" or base == "TA":
            return mismatch['AT']
        elif base == "GC" or base == "CG":
            return mismatch['GC']
        elif base == "GT" or base == "TG":
            return mismatch['GT']
        elif base == "CT" or base == "TC":
            return mismatch['CT']
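The if/elif chain above can also be written as a single table lookup. This is only a minimal sketch using the same scores as matchBase. The names SCORES and match_base_compact are hypothetical and not part of the original code. Unlike matchBase, it raises KeyError for letters other than A, C, G and T.

```python
# Hypothetical compact variant of matchBase; scores copied from the listing above.
SCORES = {
    "AA": 10, "GG": 7, "CC": 9, "TT": 8,
    "AG": -1, "AC": -3, "AT": -4, "GC": -5, "GT": -3, "CT": 0,
}


def match_base_compact(base1, base2):
    """Return the substitution score; the matrix is symmetric, so try both orders."""
    pair = base1 + base2
    if pair in SCORES:
        return SCORES[pair]
    return SCORES[base2 + base1]  # KeyError for letters outside A, C, G, T


if __name__ == "__main__":
    print(match_base_compact("A", "A"), match_base_compact("C", "A"))
```

Storing each unordered pair once keeps the table half as long and makes it harder for the two directions of a pair to drift apart.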
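Next we create a score matrix and fill it in according to the formula. Once it is filled, the global alignment practically reveals itself. Sequence p runs along the first row of the score matrix and sequence q down the first column, much as in the dot-plot method. Note, however, that one empty column and one empty row are left in front of p and q: these are column 0 and row 0.
According to the formula:
s(0,0) is the initial value, 0
Row 0: s(0,j) = gap * j
j runs from 1 to m, where m is the length of sequence p. So s(0,1)=gap*1=-5, s(0,2)=gap*2=-10, and so on. Row 0 really represents an extreme case: the score when every letter of p is aligned with a gap. A against a gap scores -5, AC both against gaps accumulate to -10, ACG against gaps accumulate to -15, and if all of p is aligned with gaps the final cumulative score is -25.
Column 0: s(i,0) = gap * i
Column 0 works like row 0: it gives the cumulative score when all of sequence q is aligned with gaps. One gap accumulates gap*1=-5, two gaps gap*2=-10, three gaps gap*3=-15, and four gaps gap*4=-20.
Python code for the values in row 0 and column 0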
def matrix(seq1, seq2):
    # Both sequences already carry a one-character placeholder at index 0,
    # so range(len(seq)) runs from 0 up to the real sequence length.
    input_1 = []
    input_2 = []
    for i in range(len(seq1)):
        input_1.append(i * -5)  # row 0: gap * i
    for i in range(len(seq2)):
        input_2.append(i * -5)  # column 0: gap * i
    return input_1, input_2

def run(seq1, seq2):
    # Store the border cells in a dict keyed by "(row,column)".
    score = {}
    input_1, input_2 = matrix(seq1, seq2)
    for i in range(len(input_1)):
        s = "(0," + str(i) + ")"
        score[s] = input_1[i]
    for i in range(len(input_2)):
        s = "(" + str(i) + ",0)"
        score[s] = input_2[i]
    print(score)

if __name__ == "__main__":
    flag = True
    while flag:
        seq1 = input("Please input long sequence:")
        seq2 = input("Please input short sequence:")
        # Prepend a placeholder so that index 0 is the empty row/column.
        seq1 = "-" + seq1.upper()
        seq2 = "-" + seq2.upper()
        run(seq1, seq2)
        tmp = input("press n to exit:[y/n]")
        if tmp.strip() == "n":
            flag = False
        else:
            flag = True
Output
Please input long sequence:ACGTC
Please input short sequence:AATC
{'(0,0)': 0, '(0,1)': -5, '(0,2)': -10, '(0,3)': -15, '(0,4)': -20, '(0,5)': -25, '(1,0)': -5, '(2,0)': -10, '(3,0)': -15, '(4,0)': -20}
press n to exit:[y/n]
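Row 0 and column 0 are relatively simple; the other cells are a little more involved. Next we fill in s(1,1). Its value is the maximum of three values. Which three? The first is the value in the cell above, s(0,1), plus gap. The second is the value in the cell to the left, s(1,0), plus gap. The third is the value in the diagonal cell, s(0,0), plus w(i,j), the substitution-matrix score for the pair of letters at the current position. What does this mean? Having accumulated scores up to this position, which is higher: letter against letter, a letter of p against a gap, or a letter of q against a gap? These are the only three possibilities, and we want the highest-scoring one. Taking them one at a time: cell above, s(0,1)+gap = -5+(-5) = -10; cell to the left, s(1,0)+gap = -5+(-5) = -10; diagonal cell, s(0,0)+w(1,1) = 0+10 = 10. max(-10,-10,10) = 10, so the current cell s(1,1) gets the score 10. We also need an arrow to record where this 10 came from. It came from the diagonal cell, so we draw an arrow pointing diagonally up and to the left.
Next comes cell s(1,2) (Figure 5), again the maximum of three values. Cell above: s(0,2)+gap = -10+(-5) = -15. Cell to the left: s(1,1)+gap = 10+(-5) = 5. Diagonal cell: s(0,1)+w(1,2) = -5+(-3) = -8. max(-15,5,-8) = 5. The maximum is 5 and it comes from the cell to the left, s(1,1), so we draw an arrow pointing left.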
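The rules used so far can be collected into a single recurrence. In the block below, g stands for the gap penalty (-5 in this example) and w(i,j) for the substitution score of the i-th letter of q against the j-th letter of p. The symbol g and the name n for the length of q are shorthand introduced here as assumptions; the text itself writes gap and only names m.

```latex
\begin{aligned}
s(0,0) &= 0,\\
s(0,j) &= g \cdot j, \qquad 1 \le j \le m,\\
s(i,0) &= g \cdot i, \qquad 1 \le i \le n,\\
s(i,j) &= \max\bigl\{\, s(i-1,j-1) + w(i,j),\;\; s(i-1,j) + g,\;\; s(i,j-1) + g \,\bigr\}, \qquad i, j \ge 1.
\end{aligned}
```

With g = -5 this reproduces the two cells worked out above: s(1,1) = max(10, -10, -10) = 10 and s(1,2) = max(-8, -15, 5) = 5.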
Python implementation of the steps above
def matrix(seq1, seq2):
    # Border values: row 0 for seq1, column 0 for seq2 (gap = -5).
    input_1 = []
    input_2 = []
    for i in range(len(seq1)):
        input_1.append(i * -5)
    for i in range(len(seq2)):
        input_2.append(i * -5)
    return input_1, input_2

def matchBase(base1, base2):
    # Return the substitution score for aligning base1 with base2.
    base = base1 + base2
    if base1 == base2:
        match_score = {"AA": 10, "GG": 7, "CC": 9, "TT": 8}
        if base1 == 'A':
            return match_score['AA']
        elif base1 == 'G':
            return match_score['GG']
        elif base1 == 'C':
            return match_score['CC']
        else:
            return match_score['TT']
    else:
        mismatch = {"AG":-1,"GA":-1,"AC":-3,"CA":-3,"AT":-4,"TA":-4,"GC":-5,"CG":-5,"GT":-3,"TG":-3,"CT":0,"TC":0}
        if base == "AG" or base == "GA":
            return mismatch['AG']
        elif base == "AC" or base == "CA":
            return mismatch['AC']
        elif base == "AT" or base == "TA":
            return mismatch['AT']
        elif base == "GC" or base == "CG":
            return mismatch['GC']
        elif base == "GT" or base == "TG":
            return mismatch['GT']
        elif base == "CT" or base == "TC":
            return mismatch['CT']

def run(seq1, seq2):
    score = {}
    input_1, input_2 = matrix(seq1, seq2)
    for i in range(len(input_1)):
        s = "(0," + str(i) + ")"
        score[s] = input_1[i]
    for i in range(len(input_2)):
        s = "(" + str(i) + ",0)"
        score[s] = input_2[i]
    # Compute only the first interior cell, s(1,1).
    result_match = matchBase(seq1[1], seq2[1])
    getScore(1, 1, result_match, score)

def getScore(i, j, result_match, score):
    # i is the column (position in seq1), j is the row (position in seq2).
    num_s1 = "(" + str(j - 1) + "," + str(i - 1) + ")"  # diagonal cell
    num_si = "(" + str(j - 1) + "," + str(i) + ")"      # cell above
    num_sj = "(" + str(j) + "," + str(i - 1) + ")"      # cell to the left
    score_1 = score[num_s1] + result_match
    score_i = score[num_si] - 5  # gap penalty
    score_j = score[num_sj] - 5  # gap penalty
    score_max = max(score_1, score_i, score_j)
    print(score_max)

if __name__ == "__main__":
    flag = True
    while flag:
        seq1 = input("Please input long sequence:")
        seq2 = input("Please input short sequence:")
        seq1 = "-" + seq1.upper()
        seq2 = "-" + seq2.upper()
        run(seq1, seq2)
        tmp = input("press n to exit:[y/n]")
        if tmp.strip() == "n":
            flag = False
        else:
            flag = True
Output:
Please input long sequence:ACGTC
Please input short sequence:AATC
10
press n to exit:[y/n]
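The run above computes only s(1,1). As a sketch of how the same rule extends to every cell, the following fills the whole matrix as a list of lists and prints it. The names fill_matrix, w, W and GAP are hypothetical and not part of the original code. Rows follow q and columns follow p, as in the text, and only the letters A, C, G and T are supported.

```python
GAP = -5
# Substitution scores from the text; each unordered pair is stored once.
W = {"AA": 10, "GG": 7, "CC": 9, "TT": 8,
     "AG": -1, "AC": -3, "AT": -4, "GC": -5, "GT": -3, "CT": 0}


def w(a, b):
    """Score for aligning letter a with letter b (either order)."""
    return W[a + b] if a + b in W else W[b + a]


def fill_matrix(p, q):
    """Return the score matrix s, with rows indexed by q and columns by p."""
    n, m = len(q), len(p)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        s[0][j] = GAP * j  # row 0: p aligned entirely with gaps
    for i in range(1, n + 1):
        s[i][0] = GAP * i  # column 0: q aligned entirely with gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + w(q[i - 1], p[j - 1]),  # diagonal cell
                s[i - 1][j] + GAP,                        # cell above
                s[i][j - 1] + GAP,                        # cell to the left
            )
    return s


if __name__ == "__main__":
    for row in fill_matrix("ACGTC", "AATC"):
        print(" ".join("%4d" % v for v in row))
```

For p = ACGTC and q = AATC, row 1 starts -5, 10, 5, matching the two cells computed by hand, and the bottom-right cell s(4,5) holds 21.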
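The red arrows marked in the figure are the sole basis for writing out the global alignment. The arrows are traced from the bottom-right corner back to the top-left corner, but the alignment itself is written starting from the top-left. A diagonal arrow means letter against letter; a horizontal or vertical arrow means letter against gap, and the gap goes in the sequence the arrow points toward. The first arrow is diagonal, letter against letter, so A is aligned with A. The second is horizontal, letter against gap; the arrow points toward the sequence that takes the gap, so C is aligned with a gap. Then come a diagonal arrow, G against A; a diagonal arrow, T against T; and a diagonal arrow, C against C. Writing on until the bottom-right corner gives the global alignment. The only gap is inserted between the A and the A of sequence q, which yields the highest final alignment score. If you doubt it, try for yourself: no other way of inserting gaps gives an alignment that scores more than 21 (10 - 5 - 1 + 8 + 9). This is because, while building the score matrix, every step takes the best result available given the optimal results of the previous steps.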
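The claim that no other placement of gaps beats 21 can be checked directly for sequences this short. The sketch below scores a finished alignment column by column and then enumerates every possible global alignment by brute force. The names score_alignment, all_alignments, w, W and GAP are hypothetical helpers, not part of the original code, and the enumeration is only practical for very short sequences.

```python
GAP = -5
# Substitution scores from the text; each unordered pair is stored once.
W = {"AA": 10, "GG": 7, "CC": 9, "TT": 8,
     "AG": -1, "AC": -3, "AT": -4, "GC": -5, "GT": -3, "CT": 0}


def w(a, b):
    """Score for aligning letter a with letter b (either order)."""
    return W[a + b] if a + b in W else W[b + a]


def score_alignment(row_p, row_q):
    """Sum the column scores of two equal-length aligned strings ('-' is a gap)."""
    total = 0
    for a, b in zip(row_p, row_q):
        total += GAP if "-" in (a, b) else w(a, b)
    return total


def all_alignments(p, q):
    """Yield every global alignment of p and q as a pair of gapped strings."""
    if not p and not q:
        yield "", ""
        return
    if p and q:  # letter against letter
        for a, b in all_alignments(p[1:], q[1:]):
            yield p[0] + a, q[0] + b
    if p:  # letter of p against a gap
        for a, b in all_alignments(p[1:], q):
            yield p[0] + a, "-" + b
    if q:  # letter of q against a gap
        for a, b in all_alignments(p, q[1:]):
            yield "-" + a, q[0] + b


if __name__ == "__main__":
    print(score_alignment("ACGTC", "A-ATC"))
    print(max(score_alignment(a, b) for a, b in all_alignments("ACGTC", "AATC")))
```

Both lines print 21: the alignment read off the arrows scores 21, and the maximum over all enumerated alignments is also 21.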
Python implementation of the Needleman-Wunsch algorithm
def introduce():
    print("*********************************************")
    print("Welcome to use short sequence alignment tool!")
    print("Author : sunchengquan")
    print("input1: long sequence!!!!!")
    print("input2: short sequence!!!!!")
    print("*********************************************")
    print('\n')

def matrix(seq1, seq2):
    # Border values: row 0 for seq1, column 0 for seq2 (gap = -5).
    input_1 = []
    input_2 = []
    for i in range(len(seq1)):
        input_1.append(i * -5)
    for i in range(len(seq2)):
        input_2.append(i * -5)
    return input_1, input_2

def matchBase(base1, base2):
    # Return the substitution score for aligning base1 with base2.
    base = base1 + base2
    if base1 == base2:
        match_score = {"AA": 10, "GG": 7, "CC": 9, "TT": 8}
        if base1 == 'A':
            return match_score['AA']
        elif base1 == 'G':
            return match_score['GG']
        elif base1 == 'C':
            return match_score['CC']
        else:
            return match_score['TT']
    else:
        mismatch = {"AG":-1,"GA":-1,"AC":-3,"CA":-3,"AT":-4,"TA":-4,"GC":-5,"CG":-5,"GT":-3,"TG":-3,"CT":0,"TC":0}
        if base == "AG" or base == "GA":
            return mismatch['AG']
        elif base == "AC" or base == "CA":
            return mismatch['AC']
        elif base == "AT" or base == "TA":
            return mismatch['AT']
        elif base == "GC" or base == "CG":
            return mismatch['GC']
        elif base == "GT" or base == "TG":
            return mismatch['GT']
        elif base == "CT" or base == "TC":
            return mismatch['CT']

def getScore(i, j, result_match, score):
    # i is the column (position in seq1), j is the row (position in seq2).
    num_s1 = "(" + str(j - 1) + "," + str(i - 1) + ")"  # diagonal cell
    num_si = "(" + str(j - 1) + "," + str(i) + ")"      # cell above
    num_sj = "(" + str(j) + "," + str(i - 1) + ")"      # cell to the left
    score_1 = score[num_s1] + result_match
    score_i = score[num_si] - 5  # gap penalty
    score_j = score[num_sj] - 5  # gap penalty
    score_max = max(score_1, score_i, score_j)
    a = "(" + str(j) + "," + str(i) + ")"
    # Record the arrow: which neighbour produced the maximum.
    if score_max == score_1:
        con[a] = score_max   # diagonal
    elif score_max == score_i:
        a_i[a] = score_max   # from above
    else:
        b_j[a] = score_max   # from the left
    score[a] = score_max

def getPath(j, i, seq1, seq2, flag):
    # Trace back from cell (j, i) to (0, 0), appending the aligned characters
    # to the global lists res1 (seq1) and res2 (seq2) in reverse order.
    # flag is unused and kept only so that existing calls stay valid.
    a = "(" + str(j) + "," + str(i) + ")"
    if j == 0 and i == 0:
        # Origin: emit the leading '>' markers and stop.
        res1.append(seq1[i])
        res2.append(seq2[j])
        return res2
    if con.get(a) is not None:
        # Diagonal arrow: letter against letter.
        res1.append(seq1[i])
        res2.append(seq2[j])
        return getPath(j - 1, i - 1, seq1, seq2, flag)
    if a_i.get(a) is not None or i == 0:
        # Vertical arrow (or column 0): letter of seq2 against a gap.
        # "is not None" matters here because a stored score of 0 is falsy.
        res1.append("-")
        res2.append(seq2[j])
        return getPath(j - 1, i, seq1, seq2, flag)
    # Horizontal arrow (or row 0): letter of seq1 against a gap.
    res2.append("-")
    res1.append(seq1[i])
    return getPath(j, i - 1, seq1, seq2, flag)

def run(seq1, seq2):
    score = {}
    input_1, input_2 = matrix(seq1, seq2)
    for i in range(len(input_1)):
        s = "(0," + str(i) + ")"
        score[s] = input_1[i]
    for i in range(len(input_2)):
        s = "(" + str(i) + ",0)"
        score[s] = input_2[i]
    # Fill the interior cells row by row.
    for j in range(1, len(seq2)):
        for i in range(1, len(seq1)):
            result_match = matchBase(seq1[i], seq2[j])
            getScore(i, j, result_match, score)
    flag = True
    res_j = getPath(len(seq2) - 1, len(seq1) - 1, seq1, seq2, flag)
    return res_j, score

if __name__ == "__main__":
    flag = True
    while flag:
        introduce()
        seq1 = input("Please input long sequence:")
        seq2 = input("Please input short sequence:")
        seq1 = ">" + seq1.upper()
        seq2 = ">" + seq2.upper()
        # Global state shared by getScore and getPath; reset for every run.
        a_i = {}
        b_j = {}
        con = {}
        res1 = []
        res2 = []
        res_j, score = run(seq1, seq2)
        res_j.reverse()
        res1.reverse()
        print(" ".join(res1))
        print(" ".join(res_j))
        # The global alignment score is the bottom-right cell of the matrix,
        # which is not necessarily the largest value in the matrix.
        last = "(" + str(len(seq2) - 1) + "," + str(len(seq1) - 1) + ")"
        print("Alignment score: %s" % score[last])
        tmp = input("press n to exit:[y/n]")
        if tmp.strip() == "n":
            flag = False
        else:
            flag = True
Output:
*********************************************
Welcome to use short sequence alignment tool!
Author : sunchengquan
input1: long sequence!!!!!
input2: short sequence!!!!!
*********************************************
Please input long sequence:ACGTC
Please input short sequence:AATC
> A C G T C
> A - A T C
Alignment score: 21
press n to exit:[y/n]
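The program above keeps its arrows in three global dictionaries and traces back recursively. For comparison, here is a compact sketch that stores the arrows in a second matrix and traces back in a loop, with no global state. The names needleman_wunsch, w, W and GAP are hypothetical. Ties are broken in the order diagonal, up, left, which is an assumption chosen to mirror the order of the checks in getScore, and only the letters A, C, G and T are supported.

```python
GAP = -5
# Substitution scores from the text; each unordered pair is stored once.
W = {"AA": 10, "GG": 7, "CC": 9, "TT": 8,
     "AG": -1, "AC": -3, "AT": -4, "GC": -5, "GT": -3, "CT": 0}


def w(a, b):
    """Score for aligning letter a with letter b (either order)."""
    return W[a + b] if a + b in W else W[b + a]


def needleman_wunsch(p, q):
    """Return (aligned_p, aligned_q, score) for a global alignment of p and q."""
    n, m = len(q), len(p)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    arrow = [[""] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):
        s[0][j], arrow[0][j] = GAP * j, "left"
    for i in range(1, n + 1):
        s[i][0], arrow[i][0] = GAP * i, "up"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = [
                (s[i - 1][j - 1] + w(q[i - 1], p[j - 1]), "diag"),
                (s[i - 1][j] + GAP, "up"),
                (s[i][j - 1] + GAP, "left"),
            ]
            # max() returns the first maximal item, so ties prefer diag, then up.
            s[i][j], arrow[i][j] = max(choices, key=lambda c: c[0])
    out_p, out_q = [], []
    i, j = n, m
    while i > 0 or j > 0:  # follow the arrows back to the top-left corner
        if arrow[i][j] == "diag":
            out_p.append(p[j - 1])
            out_q.append(q[i - 1])
            i, j = i - 1, j - 1
        elif arrow[i][j] == "up":
            out_p.append("-")
            out_q.append(q[i - 1])
            i -= 1
        else:
            out_p.append(p[j - 1])
            out_q.append("-")
            j -= 1
    return "".join(reversed(out_p)), "".join(reversed(out_q)), s[n][m]


if __name__ == "__main__":
    print(needleman_wunsch("ACGTC", "AATC"))
```

For ACGTC and AATC this returns ('ACGTC', 'A-ATC', 21), the same alignment and score as the run above. Keeping the arrows next to the scores also makes the border cells explicit: row 0 always points left and column 0 always points up.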
Partly based on: bioinformatics course materials, School of Basic Medical Sciences, Shandong University, http://www.crc.sdu.edu.cn/bioinfo