Chapter 7: Mutations and Randomization

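Mutation of DNA is a fundamental phenomenon in biology. Most DNA mutations are benign and do not affect the function of proteins; a minority lead to diseases such as cancer. Mutations in DNA can arise from radiation, chemical agents, replication errors, and other causes. We will simulate mutation using Python's random number generator.

Randomization is a computing technique that turns up regularly in everyday programs. Its most common use is in cryptography, for example when you want to generate a password that is hard to guess. Many algorithms also use randomization to run faster.

The ability to simulate mutation with a computer program can help in studying evolution, disease, and basic cellular processes such as cell division and DNA repair mechanisms.

This chapter covers the following topics:

- Randomly selecting an index into a list or string: this is the basic technique for choosing a random position in DNA (or any other data).
- Modeling mutation with random numbers: learning how to select a random nucleotide in DNA and then mutate it to another random nucleotide.
- Generating DNA sequence data sets with random numbers, which can be used to study how random an actual genome is.
- Repeatedly mutating DNA to study the effect of mutations that accumulate over time during evolution.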

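1. Random Number Generators

The numbers produced by a random number generator are not truly random. They are computed by an algorithm, so the output is deterministic and predictable, which is why they are called pseudo-random numbers. A computer derives its pseudo-random numbers from a random seed using a fixed calculation. Given the same calculation and the same seed, the resulting sequence of numbers is therefore always the same.

The examples below use Python's random module to demonstrate random number generation.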
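
The determinism described above can be sketched as follows. This is a minimal illustration that assumes only the standard random module; the helper name draw_numbers and the seed value 42 are arbitrary choices made for this sketch.

```python
import random

def draw_numbers(seed, count=5):
    """Return `count` pseudo-random integers in 0..99 generated from `seed`."""
    random.seed(seed)  # fix the seed so the sequence is reproducible
    return [random.randrange(100) for _ in range(count)]

first = draw_numbers(42)
second = draw_numbers(42)

print(first)
print(second)
print(first == second)  # True: same seed, same sequence
```

If random.seed is never called, Python seeds the generator automatically, so each run of a program normally produces a different sequence.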

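2. A Program Using Randomization

Example 7-1 demonstrates randomization in a simple program that randomly combines parts of sentences to build a story. Although the example is not bioinformatics, it is an effective way to learn the basics of randomization. You will learn how to randomly select elements from a list (an array), a technique that the later examples use to alter DNA.

The example declares several lists filled with sentence fragments; the program then combines randomly chosen fragments into complete sentences.

Example 7-1. Children's game with random numbers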

#!/usr/bin/env python
# Children's game, demonstrating primitive artificial intelligence,
#  using a random number generator to randomly select parts of sentences.

import random


# Here are the arrays of parts of sentences:
nouns = [
'Dad',
'TV',
'Mom',
'Groucho',
'Rebecca',
'Harpo',
'Robin Hood',
'Joe and Moe',
]

verbs = [
'ran to',
'giggled with',
'put hot sauce into the orange juice of',
'exploded',
'dissolved',
'sang stupid songs with',
'jumped with',
]

prepositions = [
'at the store',
'over the rainbow',
'just for the fun of it',
'at the beach',
'before dinner',
'in New York City',
'in a dream',
'around the world',
]



# This loop composes six-sentence "stories"
#  until the user types "quit".
while True:
    # (Re)set story to the empty string each time through the loop
    story = ''

    # Make 6 sentences per story.
    for count in range(6):

        #  Notes on the following statements:
        #  1) len(list) gives the number of elements in the list.
        #  2) random.choice(range(n)) returns a random integer
        #     from 0 to n-1, which is used as a list index.
        #  3) + joins two strings together.
        sentence   = nouns[random.choice(range(len(nouns)))] \
                    + " "  \
                    + verbs[random.choice(range(len(verbs)))] \
                    + " " \
                    + nouns[random.choice(range(len(nouns)))] \
                    + " " \
                    + prepositions[random.choice(range(len(prepositions)))]  \
                    + '. ' 

        story += sentence


    # Print the story.
    print("\n%s\n" % story)

    # Get user input.
    print('\nType \"quit\" to quit, or press Enter to continue: ')

    input_value = input().rstrip()
    
    # Exit loop at user's request
    if input_value == 'quit': break


exit()

Here is some sample output from Example 7-1:

Joe and Moe jumped with Rebecca in New York City. Rebecca exploded Groucho
in a dream. Mom ran to Harpo over the rainbow. TV giggled with Joe and Moe
over the rainbow. Harpo exploded Joe and Moe at the beach. Robin Hood giggled
with Harpo at the beach. 

Type "quit" to quit, or press Enter to continue: 

Harpo put hot sauce into the orange juice of TV before dinner. Dad ran to
Groucho in a dream. Joe and Moe put hot sauce into the orange juice of TV
in New York City. Joe and Moe giggled with Joe and Moe over the rainbow. TV
put hot sauce into the orange juice of Mom just for the fun of it. Robin Hood
ran to Robin Hood at the beach. 

Type "quit" to quit, or press Enter to continue: quit

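3. A Program to Simulate DNA Mutation

Example 7-1 gives you the tools you need to mutate DNA. In the following examples, DNA is represented, as usual, by a string of the characters A, C, G, and T. One way to change a character is to convert the DNA string into a list of letters, change the letter at a chosen index, and then convert the list back into a DNA string with the join method. The functions in this section achieve the same effect with string slicing.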
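
The list-based steps described above can be sketched as follows. This is a minimal illustration, not the code used in the rest of the chapter; the function name mutate_with_list is hypothetical.

```python
import random

def mutate_with_list(dna, position, newbase):
    """Return a copy of dna with the base at position replaced by newbase."""
    bases = list(dna)          # strings are immutable, so work on a list
    bases[position] = newbase  # change one character by its index
    return ''.join(bases)      # convert the list back into a string

dna = 'AAAAAAAAAA'
position = random.randrange(len(dna))  # random index from 0 to len(dna)-1
newbase = random.choice('ACGT')        # random nucleotide
print(mutate_with_list(dna, position, newbase))
```

Slicing and the list approach give the same result; the list form is mainly convenient when many positions are changed before the string is rebuilt.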

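This time, let me vary the approach a little and first write some of the functions that will be needed before showing the whole program.

3.1 Pseudocode

Starting with simple pseudocode, here is a design for a function that mutates a random position in DNA to a random nucleotide:

Select a random position in the DNA string

Choose a random nucleotide

Substitute the random nucleotide into the random position in the DNA string

3.1.1 Selecting a Random Position in the DNA String

How do you randomly select a position in a DNA string? Recall that the built-in function len returns the length of a string or list. You can therefore use the same general idea as in Example 7-1 and write a function: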

# randomposition
#
# A function to randomly select a position in a string.
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomposition(dna):
    return random.choice(range(len(dna)))

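The randomposition function is so simple that it needs no further commentary.

3.1.2 Selecting a Random Nucleotide

Next, let's write a function that randomly selects one of the four nucleotides: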

# randomnucleotide
#
# A function to randomly select a nucleotide
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomnucleotide(nucs):
    return random.choice(nucs)

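Again, this is a short function.

3.1.3 Placing a Random Nucleotide at a Random Position

Now write the third and final function, the one that actually performs the mutation. Here is the code: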

# mutate
#
# A subroutine to perform a mutation in a string of DNA
#

def mutate(dna):
    nucleotides = ['A', 'C', 'G', 'T']

    # Pick a random position in the DNA
    position = randomposition(dna)

    # Pick a random nucleotide
    newbase = randomnucleotide(nucleotides)

    # Insert the random nucleotide into the random position in the DNA.
    dna = dna[ : position] + newbase + dna[position+1 :]
    
    return dna

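Here again is a short function. As you study it, note how easy it is to read and understand. You mutate by picking a random position, then picking a random nucleotide, and substituting that nucleotide at that position in the string. (If you have forgotten how string slicing works, consult the Python documentation.)

Also notice how this function is built from other functions.

3.2 Combining the Functions to Simulate Mutation

Now that you have written all the functions you need, write the main program of Example 7-2 and see whether your new program works.

Example 7-2. Mutating DNA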

#!/usr/bin/env python

import random
################################################################################
# Subroutines for Example 7-2
################################################################################

#  Notice, now that we have a fair number of subroutines, we
#  list them alphabetically

# A subroutine to perform a mutation in a string of DNA
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def mutate(dna):

    # Pick a random position in the DNA
    position = randomposition(dna)

    # Pick a random nucleotide (randomnucleotide takes no arguments here)
    newbase = randomnucleotide()

    # Insert the random nucleotide into the random position in the DNA.
    # The slices mean the following:
    #  take everything in dna before position, then newbase,
    #  then everything in dna after position
    dna = dna[:position] + newbase + dna[position+1:]

    return dna

# A subroutine to randomly select an element from an array
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomelement(array):

    return random.choice(array)


# randomnucleotide
#
# A subroutine to select at random one of the four nucleotides
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomnucleotide():

    nucleotides = ['A', 'C', 'G', 'T']

    return randomelement(nucleotides)


# randomposition
#
# A subroutine to randomly select a position in a string.
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomposition(dna):

    return random.choice(range(len(dna)))


# Mutate DNA
#  using a random number generator to randomly select bases to mutate


# Declare the variables

# The DNA is chosen to make it easy to see mutations:
DNA = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'


# Let's test it, shall we?
mutant = mutate(DNA)

print("\nMutate DNA\n\n")

print("\nHere is the original DNA:\n\n")
print("%s\n" % DNA)

print("\nHere is the mutant DNA:\n\n")
print("%s\n" % mutant)

# Let's put it in a loop and watch that bad boy accumulate mutations:
print("\nHere are 10 more successive mutations:\n\n")

for i in range(10):
    mutant = mutate(mutant)
    print("%s\n" % mutant)


exit()

Here is some sample output from Example 7-2:

Mutate DNA

Here is the original DNA:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Here is the mutant DNA:

AAAAAAAAAAAAAAAAAAAAGAAAAAAAAA

Here are 10 more successive mutations:

AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAAAAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAAAAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAAAAAACAACAAGACAAAAAAA
AAAAAAAAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATAAGAAAACAACAAGACAAAAAAA
AAAAAATTAGAAAACAACAAGACAAAAAAA
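
Several consecutive lines in the output above are identical. That happens whenever the randomly chosen base equals the base already at the chosen position, which occurs one time in four on average when the four bases are equally likely. If every call should change the sequence, one possible variant restricts the choice to the three other bases. The name mutate_always is hypothetical and not part of the examples above, and the sketch assumes a non-empty DNA string.

```python
import random

def mutate_always(dna):
    """Return dna with one randomly chosen position changed to a different base."""
    position = random.randrange(len(dna))
    # Offer only the bases that differ from the one currently at position.
    candidates = [base for base in 'ACGT' if base != dna[position]]
    newbase = random.choice(candidates)
    return dna[:position] + newbase + dna[position + 1:]

mutant = 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA'
for _ in range(5):
    mutant = mutate_always(mutant)
    print(mutant)
```

Whether a "mutation" that leaves the base unchanged should count is a modeling decision; the version in Example 7-2 allows it, this sketch does not.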

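4. Generating Random Sequences

Random DNA can be useful for studying actual DNA from organisms. In this section we will write programs that generate random DNA sequences.

Suppose we need a set of random DNA fragments of varying lengths. The function must then be told the maximum and minimum lengths and the number of fragments to generate.

4.1 Bottom-Up Versus Top-Down

In Example 7-2 you wrote the basic functions, then a function that called the basic functions, and finally the main program. If you ignore the pseudocode, this is an example of bottom-up design: you start with the building blocks and assemble them into larger structures.

Now let's see what it is like to start with the main program and its function calls, and to write the functions afterward as the need for them arises. This is called top-down design.

4.2 A Function to Generate a Set of Random DNA

Given that our goal is to generate random DNA, what you want is perhaps a data-generating function: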

random_DNA = make_random_DNA_set(minimum_length, maximum_length, size_of_set )

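This looks fine, but it raises the question of how the overall task is actually accomplished. (That is top-down design at work!) So you need to move down a level and write pseudocode for the make_random_DNA_set function: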

repeat size_of_set times:

    length = random number between minimum and maximum length

    dna = make_random_DNA ( length )

    add dna to set

return set

Now, continuing the top-down design, you need some pseudocode for the make_random_DNA function:

dna = empty string

from 1 to length

    base = randomnucleotide

    dna = dna + base

return dna

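There is no need to go any further: Example 7-2 already contains a randomnucleotide function.

4.3 Turning the Pseudocode into Functions

Now that we have a top-down design, how do we code it? Because Python executes a file sequentially, a function must be defined before the main program calls it. Example 7-3 therefore begins with the function definitions, in the same top-down order as the pseudocode, and ends with the main program.

Example 7-3. Generating random DNA

#!/usr/bin/env python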

import random
################################################################################
# Subroutines
################################################################################

# make_random_DNA_set
#
# Make a set of random DNA
#
#   Accept parameters setting the maximum and minimum length of
#     each string of DNA, and the number of DNA strings to make
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def make_random_DNA_set(minimum_length, maximum_length, size_of_set):

    # set of DNA fragments
    dna_set = []

    # Create set of random DNA
    for i in range(size_of_set):

        # find a random length between min and max
        length = randomlength (minimum_length, maximum_length)

        # make a random DNA fragment
        dna = make_random_DNA ( length )

        # add dna fragment to dna_set
        dna_set.append(dna)

    return dna_set


# Notice that we've just discovered a new subroutine that's
# needed: randomlength, which will return a random
# number between (or including) the min and max values.
# Let's write that first, then do make_random_DNA

# randomlength
#
# A subroutine that will pick a random number from
# minlength to maxlength, inclusive.
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomlength(minlength, maxlength):

    # Calculate and return a random number within the
    #  desired interval.
    # Notice how we need to add one to make the endpoints inclusive,
    #  and how we first subtract, then add back, minlength to
    #  get the random number in the correct interval.
    return random.choice(range(maxlength - minlength + 1)) + minlength


# make_random_DNA
#
# Make a string of random DNA of specified length.
#
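# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def make_random_DNA(length):

    # Start with an empty string and append one random base at a time
    dna = ''

    for i in range(length):

        dna += randomnucleotide()

    return dna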


# We also need to include the previous subroutine
# randomnucleotide.
# Here it is again for completeness.

# randomnucleotide
#
# Select at random one of the four nucleotides
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomnucleotide():

    nucleotides = ['A', 'C', 'G', 'T']

    return randomelement(nucleotides)


# randomelement
#
# randomly select an element from an array
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomelement(array):

    return random.choice(array)


# Generate random DNA
#  using a random number generator to randomly select bases


# Declare and initialize the variables
size_of_set = 12
maximum_length = 30
minimum_length = 15


# And here's the subroutine call to do the real work
random_DNA = make_random_DNA_set(minimum_length, maximum_length, size_of_set )

# Print the results, one per line
print("Here is an array of %s randomly generated DNA sequences\n" % size_of_set)
print("  with lengths between %s and %s:\n\n" % (minimum_length, maximum_length))

for dna in random_DNA:

    print("%s\n" % dna)


print("\n")

exit()

Here is sample output: a set of 12 randomly generated DNA sequences with lengths between 15 and 30:

TACGCTTGTGTTTTCGGGGGAC
GGGGTGTGGTAAGGCTGTCTCAGATGTGC
TGAACGACAACCTCCTGGACTTTACT
ATCTATGCTTTGCCATGCTAGT
CCGCTCATTCCTCTTCCTCGGC
TGTACCCCTAATACACTTTAGCCGAATTTA
ATAGGTCGGGGCGACAGCGCCGG
GATTGACCTCTGTAA
AAAATCTCTAGGATCGAGC
GTATGTGCTTGGGTAAAT
ATGGAGTTGCGAGGAAGTAGCTGAGT
GGCCCATGACCAGCATCCAGACAGCA
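
As an aside, the standard library functions random.randint (which includes both endpoints) and random.choices (available since Python 3.6) can express the same generation more compactly. The following is a sketch of an alternative, not a replacement for Example 7-3; the lowercase function names are chosen here only to keep it distinct from the example's functions.

```python
import random

def make_random_dna(length):
    """Return a random DNA string of the given length."""
    # random.choices draws `length` bases with replacement.
    return ''.join(random.choices('ACGT', k=length))

def make_random_dna_set(minimum_length, maximum_length, size_of_set):
    """Return a list of random DNA strings with lengths in the inclusive range."""
    return [
        make_random_dna(random.randint(minimum_length, maximum_length))
        for _ in range(size_of_set)
    ]

for dna in make_random_dna_set(15, 30, 12):
    print(dna)
```

Using random.randint avoids the manual "subtract, add one, add back" arithmetic that randomlength performs, because the inclusive range is handled by the library call.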

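5. Analyzing DNA

In this final example dealing with randomization, you will collect some statistics about DNA to answer the question: on average, what percentage of bases are identical between two random DNA sequences? The purpose of the program is to show that you now have the programming ability needed to ask and answer questions about DNA sequences.

As usual, let's try to outline the idea of the program in pseudocode: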

Generate a set of random DNA sequences, all the same length

For each pair of DNA sequences

    How many positions in the two sequences are identical as a fraction?

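Report the average of the preceding calculations as a percentage

Clearly, to write this code you can reuse at least some of the work you have already done. You also need a function that compares the bases of two sequences position by position. Let's write some pseudocode that compares each nucleotide in one sequence with the nucleotide at the same position in the other sequence: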

assuming DNA1 is the same length as DNA2,

count = 0

for each position from 1 to length(DNA1)

    if the character at that position is the same in DNA1 and DNA2

        count = count + 1

return count / length(DNA1)
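
The position-by-position comparison in this pseudocode can also be written with the built-in zip function, which pairs up the characters of the two strings. This is a minimal sketch under the stated equal-length assumption; the function name matching_fraction and the explicit length checks are additions made for this illustration.

```python
def matching_fraction(dna1, dna2):
    """Return the fraction of positions at which two equal-length strings agree."""
    if len(dna1) != len(dna2):
        raise ValueError('sequences must have the same length')
    if not dna1:
        raise ValueError('sequences must not be empty')
    # Each comparison yields True (1) or False (0); sum counts the matches.
    matches = sum(base1 == base2 for base1, base2 in zip(dna1, dna2))
    return matches / len(dna1)

print(matching_fraction('ACGT', 'ACGA'))  # 3 of 4 positions match: 0.75
```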

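You must also write the code that picks each pair of sequences, collects the results, and finally takes the average of the results and reports it as a percentage. Example 7-4 gives the complete code.

Example 7-4. Calculating the average percent identity between pairs of random DNA sequences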

#!/usr/bin/env python
import random

################################################################################
# Subroutines
################################################################################

# matching_percentage
#
# Subroutine to calculate the percentage of identical bases in two
# equal length DNA sequences

def matching_percentage(string1, string2):

    # we assume that the strings have the same length
    length = len(string1)

    count = 0

    for position in range(length):
        if string1[position] == string2[position]:
            count += 1

    return count / length

# make_random_DNA_set
#
# Subroutine to make a set of random DNA
#
#   Accept parameters setting the maximum and minimum length of
#     each string of DNA, and the number of DNA strings to make
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def make_random_DNA_set(minimum_length, maximum_length, size_of_set):

    # set of DNA fragments
    dna_set = []

    # Create set of random DNA
    for i in range(size_of_set):

        # find a random length between min and max
        length = randomlength (minimum_length, maximum_length)

        # make a random DNA fragment
        dna = make_random_DNA ( length )

        # add dna fragment to dna_set
        dna_set.append(dna)

    return dna_set


# randomlength
#
# A subroutine that will pick a random number from
# minlength to maxlength, inclusive.
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomlength(minlength, maxlength):

    # Calculate and return a random number within the
    #  desired interval.
    # Notice how we need to add one to make the endpoints inclusive,
    #  and how we first subtract, then add back, minlength to
    #  get the random number in the correct interval.
    return random.choice(range(maxlength - minlength + 1)) + minlength 


# make_random_DNA
#
# Make a string of random DNA of specified length.
#
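# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def make_random_DNA(length):

    # Start with an empty string and append one random base at a time
    dna = ''

    for i in range(length):
        dna += randomnucleotide()

    return dna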


# randomnucleotide
#
# Select at random one of the four nucleotides
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomnucleotide():

    nucleotides = ['A', 'C', 'G', 'T']

    return randomelement(nucleotides)


# randomelement
#
# randomly select an element from an array
#
# NOTE: Python's random module seeds itself automatically. Call
#  random.seed() with a fixed value first only for reproducible results.

def randomelement(array):

    return random.choice(array)


# Calculate the average percentage of positions that are the same
# between two random DNA sequences, in a set of 10 sequences.

percentages = []

#  Generate the data set of 10 DNA sequences.
random_DNA = make_random_DNA_set(10, 10, 10)

# Iterate through all pairs of sequences
for k in range(len(random_DNA) -1):
    for i in range(k+1, len(random_DNA)):

        # Calculate and save the matching percentage
        percent = matching_percentage(random_DNA[k], random_DNA[i])
        percentages.append(percent )



# Finally, the average result:
result = 0

for percent in percentages:
    result += percent


result = result / len(percentages)
# Turn result into a true percentage
result = int(result * 100)

print("In this run of the experiment, the average percentage of \n")
print("matching positions is %s%%\n\n" % result)

exit()

Here is sample output from Example 7-4:

In this run of the experiment, the average percentage of 
matching positions is 24%
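
For bases drawn uniformly at random, two positions match with probability 4 × (1/4)² = 1/4, so the average should hover around 25%. With only 10 sequences of length 10, a single run can drift noticeably from that value. The sketch below enumerates the pairs with itertools.combinations and uses a larger data set; the function names and the sizes 50 and 100 are arbitrary choices for this illustration.

```python
import itertools
import random

def random_dna(length):
    """Return a random DNA string of the given length."""
    return ''.join(random.choice('ACGT') for _ in range(length))

def average_identity(sequences):
    """Average fraction of matching positions over all unordered pairs.

    Assumes at least two sequences, all of the same non-zero length.
    """
    fractions = []
    for seq1, seq2 in itertools.combinations(sequences, 2):
        matches = sum(a == b for a, b in zip(seq1, seq2))
        fractions.append(matches / len(seq1))
    return sum(fractions) / len(fractions)

sequences = [random_dna(100) for _ in range(50)]
print("average identity: %.1f%%" % (100 * average_identity(sequences)))
```

itertools.combinations(sequences, 2) visits each unordered pair exactly once, which is what the nested loops over k and i in Example 7-4 do by hand.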

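6. Exercises

1. Write a program that asks you to pick an amino acid and then keeps (randomly) guessing which amino acid you picked.

2. Write a program that picks one of the four nucleotides and then keeps prompting until you correctly guess the nucleotide it picked.

3. Write a function that randomly shuffles the elements of a list. The function should take a list as its argument and return a list with the same elements in a randomly shuffled order. Each element of the original list should appear exactly once in the output list, just like shuffling a deck of cards.

4. Write a program to mutate protein sequences, similar to the code in Example 7-2 that mutates DNA.

5. Write a function that, given a codon (a fragment of DNA of length 3), returns a random mutation in the codon.

6. Some choices are not made uniformly at random. Write a function that randomly returns a nucleotide, in which the probability of each nucleotide can be specified. Pass the function four numbers as arguments, representing the probabilities of each nucleotide; if each probability is 0.25, the function is equally likely to pick each nucleotide. As error checking, have the function make sure that the four probabilities sum to 1.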