Counting Point Mutations

Problem

Given two strings $s$ and $t$ of equal length, the Hamming distance between $s$ and $t$, denoted $d_H(s,t)$, is the number of corresponding symbols that differ in $s$ and $t$. See Figure 2.

Given: Two DNA strings $s$ and $t$ of equal length (not exceeding 1 kbp).

Return: The Hamming distance $d_H(s,t)$.

Figure 2. The Hamming distance between these two strings is 7. Mismatched symbols are colored red.

Sample Dataset

GAGCCTACTAACGGGAT
CATCGTAATGACGGCCT

Sample Output

7

The Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ.

s = 'GAGCCTACTAACGGGAT'
t = 'CATCGTAATGACGGCCT'
# Collect every index at which the two strings differ, then count them.
hamm = [i for i in range(len(s)) if s[i] != t[i]]
print(len(hamm))
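The same count can be wrapped in a reusable function. The sketch below is one possible variant, not part of the original solution: the name `hamming_distance` and the length check are assumptions added for illustration.

```python
def hamming_distance(s, t):
    """Count positions at which two equal-length strings differ.

    The length check is an added safeguard (an assumption, not part of the
    problem statement, which guarantees equal lengths).
    """
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    # zip pairs up corresponding symbols; True counts as 1 when summed.
    return sum(a != b for a, b in zip(s, t))


print(hamming_distance("GAGCCTACTAACGGGAT", "CATCGTAATGACGGCCT"))  # 7
```

Pairing symbols with `zip` avoids manual indexing, and the explicit check guards against `zip` silently truncating the longer string.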

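Mendel's First Law / Law of Segregation

Problem

First, a brief review of some probability theory.

Definition of probability: extensive experimentation shows that as the number of repetitions n of a trial grows, the frequency with which event A occurs settles toward a constant p. This p is the probability of event A; it measures how likely A is to occur in a single trial and is written P(A).
Example 1: A student is drawn at random from a high school. The probability of drawing a girl is 0.5, the probability of drawing a second-year student is 0.3, and the probability of drawing a second-year girl is 0.2. What is the probability of drawing a second-year student or a girl?

Use the formula P(A∪B)=P(A)+P(B)-P(AB). Let event A = {a girl is drawn} and event B = {a second-year student is drawn}; then P(A)=0.5, P(B)=0.3, and P(AB)=0.2.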
P(A∪B)=P(A)+P(B)-P(AB)=0.5+0.3-0.2=0.6
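As a quick check of the arithmetic in Example 1, the inclusion-exclusion formula can be evaluated with exact fractions. This is only an illustrative sketch; the function name `union_probability` is hypothetical.

```python
from fractions import Fraction


def union_probability(p_a, p_b, p_ab):
    """Inclusion-exclusion for two events: P(A or B) = P(A) + P(B) - P(A and B)."""
    return p_a + p_b - p_ab


# Example 1: P(girl) = 0.5, P(second-year) = 0.3, P(second-year girl) = 0.2
result = union_probability(Fraction(1, 2), Fraction(3, 10), Fraction(1, 5))
print(result, float(result))  # 3/5 0.6
```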

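Permutation: selecting m elements from n distinct elements and arranging them in a particular order is called a permutation of m elements from n distinct elements. Order matters.
Combination: selecting m elements from n distinct elements and grouping them together is called a combination of m elements from n distinct elements. Order does not matter.
Example 2: The number of ways to select m elements from n distinct elements and arrange them in order is: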

$A_{n}^m =n(n-1)(n-2)...(n-m+1) = \frac{n!}{(n-m)!}$

Example 3: The number of ways to select m elements from n distinct elements, ignoring order, is:

$C_{n}^m = \frac{n(n-1)(n-2)...(n-m+1)}{m!} = \frac{n!}{(n-m)!m!}$

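Example 4: If one ticket is drawn at random from 1,000 lottery tickets, how many elementary events are there? How many if two tickets are drawn?

Drawing one ticket gives 1,000 elementary events. Drawing two gives the number of combinations of 2 items chosen from 1,000:

$C_{1000}^2 = \frac{1000 \times 999}{2} = 499500$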
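The permutation and combination counts from Examples 2 to 4 can be checked with Python's standard library. This sketch assumes Python 3.8 or later, where `math.perm` and `math.comb` are available; the values n = 5 and m = 2 are arbitrary illustration values.

```python
from math import comb, factorial, perm

n, m = 5, 2  # arbitrary illustration values

# Permutations: n! / (n - m)!
print(perm(n, m), factorial(n) // factorial(n - m))  # 20 20

# Combinations: n! / ((n - m)! * m!)
print(comb(n, m), factorial(n) // (factorial(n - m) * factorial(m)))  # 10 10

# Example 4: ways to draw 2 tickets out of 1,000
print(comb(1000, 2))  # 499500
```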

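Here is an example of classical probability.

A bag contains 5 balls: 3 red and 2 blue. Two balls are drawn from the bag, one at a time and at random. Consider two ways of drawing:

(a) Draw one ball, note its color, return it to the bag, mix, and draw again. This is called sampling with replacement.

(b) Draw one ball and do not return it; then draw a second ball from those remaining. This is called sampling without replacement.

For each of the two cases, find:

(1) the probability that both balls are red

(2) the probability that the two balls have the same color

(3) the probability that at least one of the two balls is red

Solution:

Let $A$, $B$, and $C$ denote the events "both balls are red", "both balls are blue", and "at least one of the two balls is red", respectively. The event "both balls have the same color" is then $A\cup B$, and $C = \overline B$.

Sampling with replacement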

$P(A) = \frac{3}{5} \times \frac{3}{5} = \frac{9}{25}$

$P(B) = \frac{2}{5} \times \frac{2}{5} = \frac{4}{25}$

Since $AB=\varnothing$, we have

$P(A\cup B) = P(A) + P(B) - P(AB) = \frac{13}{25}$

$P(C) = 1 - P(B) = \frac{21}{25}$

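Sampling without replacement

The first red ball can be any of 3 balls; the second red ball can be any of the remaining 2.
$P(A) = \frac {C_3^1 C_2^1} {C_5^1 C_4^1} = \frac {3}{10}$

The first blue ball can be either of 2 balls; the second blue ball must be the 1 that remains.

$P(B) = \frac {C_2^1 C_1^1} {C_5^1 C_4^1} = \frac {1}{10}$

$P(A\cup B) = P(A) + P(B) = \frac {2}{5}$

$P(C) = 1 - P(B) = \frac {9}{10}$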
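Both sampling schemes can be verified by brute force: list every equally likely ordered pair of draws and count the favourable ones. The sketch below is an illustration only; the function name `urn_probabilities` and the `BALLS` list are hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

BALLS = ["R", "R", "R", "B", "B"]  # 3 red, 2 blue


def urn_probabilities(with_replacement):
    """Return (P(both red), P(same color), P(at least one red)) for two draws."""
    indices = range(len(BALLS))
    if with_replacement:
        draws = list(product(indices, repeat=2))  # 5 * 5 ordered pairs
    else:
        draws = list(permutations(indices, 2))    # 5 * 4 ordered pairs
    total = len(draws)
    both_red = sum(1 for i, j in draws if BALLS[i] == "R" and BALLS[j] == "R")
    both_blue = sum(1 for i, j in draws if BALLS[i] == "B" and BALLS[j] == "B")
    p_a = Fraction(both_red, total)
    p_b = Fraction(both_blue, total)
    # A and B are disjoint, and "at least one red" is the complement of B.
    return p_a, p_a + p_b, 1 - p_b


print(urn_probabilities(True))   # 9/25, 13/25, 21/25
print(urn_probabilities(False))  # 3/10, 2/5, 9/10
```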

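Description

Given three integers $k,m,n$ representing a population of $k+m+n$ organisms: $k$ are homozygous dominant, $m$ are heterozygous, and $n$ are homozygous recessive.

Return: the probability that two randomly chosen organisms, when mated, produce offspring with the dominant phenotype.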

$P = 1 - \frac{(C_n^2 + \frac{1}{4}C_m^2 + \frac{1}{2}C_m^1\times C_n^1 )} {C_{k+m+n}^2}$
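Each term of the formula corresponds to one kind of parent pair that can produce a recessive offspring: two recessive homozygotes (always recessive), two heterozygotes (recessive 1/4 of the time), and a heterozygote with a recessive homozygote (recessive 1/2 of the time). The sketch below spells this out with exact fractions using only the standard library; the function name `dominant_probability` is hypothetical.

```python
from fractions import Fraction
from math import comb


def dominant_probability(k, m, n):
    """Probability that a random pair from k HH, m Hr, n rr has a dominant offspring."""
    total_pairs = comb(k + m + n, 2)
    # (number of such parent pairs, probability their offspring is rr)
    recessive_cases = [
        (comb(n, 2), Fraction(1)),     # rr x rr
        (comb(m, 2), Fraction(1, 4)),  # Hr x Hr
        (m * n, Fraction(1, 2)),       # Hr x rr
    ]
    p_recessive = sum(count * p for count, p in recessive_cases) / total_pairs
    return 1 - p_recessive


p = dominant_probability(2, 2, 2)
print(p, "%.5f" % float(p))  # 47/60 0.78333
```

For $k = m = n = 2$ the recessive terms are 1, 1/4 and 2, which sum to 13/4; dividing by the 15 possible pairs gives 13/60, so the dominant probability is 47/60.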

Sample Dataset

2 2 2

Sample Output

0.78333

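from scipy.special import comb


def mendel_law(k, m, n):
    """Return the probability that a random mating yields dominant offspring."""
    s = k + m + n
    # Probability of each parent pairing that can produce recessive offspring
    rr = comb(n, 2) / comb(s, 2)               # rr x rr
    hh = comb(m, 2) / comb(s, 2)               # Hr x Hr
    hr = comb(n, 1) * comb(m, 1) / comb(s, 2)  # Hr x rr
    # These pairings give recessive offspring with probability 1, 1/4 and 1/2
    probability = 1 - (rr + hh * 1 / 4 + hr * 1 / 2)
    return probability


print("%.5f" % mendel_law(2, 2, 2))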

0.78333


def character_list(parent_number):
    '''Given a list with the counts of homozygous dominant, heterozygous and homozygous recessive individuals, return the genotype of every individual'''
    all_character_list = ['HH'] * parent_number[0] + ['Hr'] * parent_number[1] + ['rr'] * parent_number[2]
    return all_character_list

def character_probability(character_a, character_b):
    '''Count the genotypes of all possible offspring of two parents'''
    total = {'HH':0, 'Hr':0, 'rr':0, 'rH':0}
    for base_a in character_a:
        for base_b in character_b:
            later = base_a + base_b
            total[later] +=1
    HH = total['HH']
    Hr = total['Hr'] + total['rH']
    rr = total['rr']
    return HH, Hr, rr

def main(parent_number):
    '''Count the genotypes of all possible offspring over every pair of parents, stored in a dictionary'''
    total_number = {'HH':0, 'Hr':0, 'rr':0}
    all_character_list = character_list(parent_number)
    for i in range(len(all_character_list)-1):
        character_A = all_character_list[i]
        for j in range(i + 1, len(all_character_list)):
            character_B = all_character_list[j]
            HH, Hr, rr = character_probability(character_A, character_B)
            total_number['HH'] += HH
            total_number['Hr'] += Hr
            total_number['rr'] += rr
    dominance = float((total_number['HH'] + total_number['Hr'])) / sum(total_number.values())
    print('Probability of dominant offspring: %.5f' % dominance)



parent ='2,2,2' #input('Enter the numbers of homozygous dominant, heterozygous and homozygous recessive individuals (k,m,n): ')
parent_number = list(map(int, parent.split(',')))
main(parent_number)

Probability of dominant offspring: 0.78333

Translating RNA into Protein

Problem

The 20 commonly occurring amino acids are abbreviated by using 20 letters from the English alphabet (all letters except for B, J, O, U, X, and Z). Protein strings are constructed from these 20 symbols. Henceforth, the term genetic string will incorporate protein strings along with DNA strings and RNA strings.

The RNA codon table dictates the details regarding the encoding of specific codons into the amino acid alphabet.

Given: An RNA string $s$ corresponding to a strand of mRNA (of length at most 10 kbp).

Return: The protein string encoded by $s$.

Sample Dataset

AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA

Sample Output

MAMAPRTEINSTRING

codon_table = {
    'GCU':'A', 'GCC':'A', 'GCA':'A', 'GCG':'A', 'CGU':'R', 'CGC':'R',
    'CGA':'R', 'CGG':'R', 'AGA':'R', 'AGG':'R', 'UCU':'S', 'UCC':'S',
    'UCA':'S', 'UCG':'S', 'AGU':'S', 'AGC':'S', 'AUU':'I', 'AUC':'I',
    'AUA':'I', 'UUA':'L', 'UUG':'L', 'CUU':'L', 'CUC':'L', 'CUA':'L',
    'CUG':'L', 'GGU':'G', 'GGC':'G', 'GGA':'G', 'GGG':'G', 'GUU':'V',
    'GUC':'V', 'GUA':'V', 'GUG':'V', 'ACU':'T', 'ACC':'T', 'ACA':'T',
    'ACG':'T', 'CCU':'P', 'CCC':'P', 'CCA':'P', 'CCG':'P', 'AAU':'N',
    'AAC':'N', 'GAU':'D', 'GAC':'D', 'UGU':'C', 'UGC':'C', 'CAA':'Q',
    'CAG':'Q', 'GAA':'E', 'GAG':'E', 'CAU':'H', 'CAC':'H', 'AAA':'K',
    'AAG':'K', 'UUU':'F', 'UUC':'F', 'UAU':'Y', 'UAC':'Y', 'AUG':'M',
    'UGG':'W',
    'UAG':'STOP', 'UGA':'STOP', 'UAA':'STOP'
    }


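def translate_rna(sequence, start, length):
    # Translate codon by codon, beginning at the 1-based position `start`
    prot = ''
    for i in range(start-1, len(sequence), 3):
        codon = sequence[i:i + 3]
        if codon in codon_table:
            if codon_table[codon] == 'STOP':
                prot = prot + '*'  # mark stop codons with '*'
            else:
                prot = prot + codon_table[codon]
        else:
            prot = prot + '-'  # incomplete or unrecognized codon

    # Print the protein string in rows of `length` characters
    i = 0
    while i < len(prot):
        print(prot[i:i + int(length)])
        i = i + int(length)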

RNA = 'AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA'
translate_rna(RNA, 1, 4)
print("-" * 50)
translate_rna(RNA, 1, 20)

MAMA
PRTE
INST
RING
*
--------------------------------------------------
MAMAPRTEINSTRING*
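The output above keeps a `*` for the stop codon, whereas the sample output ends at `MAMAPRTEINSTRING`. A minimal sketch of a variant that stops at the first stop codon follows. It assumes the standard genetic code and builds the table compactly from the conventional U, C, A, G ordering; the names `CODON_TABLE` and `translate_until_stop` are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in U, C, A, G order for each codon position;
# '*' marks the three stop codons (UAA, UAG, UGA).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_until_stop(rna):
    """Translate from the first base, stopping at the first stop codon."""
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE[rna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


rna = "AUGGCCAUGGCGCCCAGAACUGAGAUCAAUAGUACCCGUAUUAACGGGUGA"
print(translate_until_stop(rna))  # MAMAPRTEINSTRING
```

Unlike the function above, this sketch raises a `KeyError` on codons containing symbols other than A, C, G and U instead of emitting `-`.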

Finding a Motif in DNA

Problem

Given two strings $s$ and $t$, $t$ is a substring of $s$ if $t$ is contained as a contiguous collection of symbols in $s$ (as a result, $t$ must be no longer than $s$).

The position of a symbol in a string is the total number of symbols found to its left, including itself (e.g., the positions of all occurrences of $U$ in $AUGCUUCAGAAAGGUCUUACG$ are 2, 5, 6, 15, 17, and 18). The symbol at position $i$ of $s$ is denoted by $s[i]$ .

A substring of $s$ can be represented as $s[j:k]$ , where $j$ and $k$ represent the starting and ending positions of the substring in $s$ ; for example, if $s = AUGCUUCAGAAAGGUCUUACG$ , then $s[2:5] = UGCU$ .

The location of a substring $s[j:k]$ is its beginning position $j$ ; note that $t$ will have multiple locations in $s$ if it occurs more than once as a substring of $s$ (see the Sample below).
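These definitions count positions from 1 and include both endpoints, while Python indexes from 0 and excludes the end of a slice. The sketch below shows one way to convert between the two conventions; the helper names `symbol_positions` and `substring` are hypothetical.

```python
def symbol_positions(s, symbol):
    """Return the 1-based positions of every occurrence of `symbol` in `s`."""
    return [i for i, c in enumerate(s, start=1) if c == symbol]


def substring(s, j, k):
    """Return s[j:k] in the 1-based, inclusive notation used by the problem."""
    # Shift the start down by one; Python's exclusive end already matches k.
    return s[j - 1:k]


s = "AUGCUUCAGAAAGGUCUUACG"
print(symbol_positions(s, "U"))  # [2, 5, 6, 15, 17, 18]
print(substring(s, 2, 5))        # UGCU
```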

Given: Two DNA strings $s$ and $t$ (each of length at most 1 kbp).

Return: All locations of $t$ as a substring of $s$ .

Sample Dataset

GATATATGCATATACTT
ATAT

Sample Output

2 4 10

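A motif is a structural component of a protein molecule that has a specific spatial conformation and a specific function.
For DNA, a motif is a short, characteristic sequence that occurs repeatedly and is generally considered biologically significant. The word motif itself describes a recurring pattern; a sequence motif is a recurring pattern in DNA that is presumed to have a biological function. Motifs are often binding sites for sequence-specific proteins (such as transcription factors) or are involved in important biological processes (such as RNA initiation, RNA termination, and RNA splicing).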

seq = 'GATATATGCATATACTT'
pattern = 'ATAT'
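def find_motif_1(seq, pattern):
    position = []
    # + 1 so that a match ending at the last symbol of seq is also checked
    for i in range(len(seq) - len(pattern) + 1):
        if seq[i:i + len(pattern)] == pattern:
            position.append(str(i + 1))  # report 1-based locations
    return ' '.join(position)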

print(find_motif_1(seq, pattern))

2 4 10

%%time
for i in range(16):
    seq += seq

find_motif_1(seq, pattern)

CPU times: user 955 ms, sys: 13.9 ms, total: 969 ms
Wall time: 980 ms

import re

seq = 'GATATATGCATATACTT'
pattern = 'ATAT'
def find_motif_2(seq, pattern):
    # The zero-width lookahead (?=...) lets overlapping matches be found
    for i in re.finditer('(?=' + pattern + ')', seq):
        yield i.start() + 1

print(" ".join(map(str, list(find_motif_2(seq, pattern)))))

2 4 10

%%time
for i in range(16):
    seq += seq
def find_motif_2(seq, pattern):
    for i in re.finditer('(?=' + pattern + ')', seq):
        p =  i.start() + 1

find_motif_2(seq, pattern)

CPU times: user 215 ms, sys: 1.96 ms, total: 217 ms
Wall time: 225 ms

seq = 'GATATATGCATATACTT'
pattern = 'ATAT'
def find_motif_3(seq, pattern):
    n = 0
    while True:
        p = seq.find(pattern, n)
        if p == -1:
            break
        yield p + 1
        n = p + 1

print(" ".join(map(str, list(find_motif_3(seq, pattern)))))

2 4 10

%%time
for i in range(16):
    seq += seq

def find_motif_3(seq, pattern):
    n = 0
    while 1:
        p = seq.find(pattern, n)
        if p == -1:break
        n = p + 1
find_motif_3(seq, pattern)

CPU times: user 141 ms, sys: 0 ns, total: 141 ms
Wall time: 149 ms

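Speed, fastest first: find_motif_3 > find_motif_2 > find_motif_1 (about 141 ms, 217 ms, and 969 ms of total CPU time, respectively, in the runs above).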
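Outside a notebook, where `%%time` is unavailable, a comparison like this could be run with the standard `timeit` module. The sketch below is an illustration under assumptions: the three functions are list-returning rewrites of the approaches above with hypothetical names, the input size is arbitrary, and timings will vary by machine.

```python
import re
import timeit


def find_loop(seq, pattern):
    """Compare every window of len(pattern) against the pattern."""
    size = len(pattern)
    return [i + 1 for i in range(len(seq) - size + 1) if seq[i:i + size] == pattern]


def find_regex(seq, pattern):
    """Use a zero-width lookahead so overlapping matches are reported."""
    return [m.start() + 1 for m in re.finditer("(?=" + re.escape(pattern) + ")", seq)]


def find_str(seq, pattern):
    """Call str.find repeatedly, restarting one symbol after each hit."""
    hits, start = [], 0
    while True:
        p = seq.find(pattern, start)
        if p == -1:
            return hits
        hits.append(p + 1)
        start = p + 1


seq = "GATATATGCATATACTT" * 2000  # arbitrary test size
pattern = "ATAT"

for func in (find_loop, find_regex, find_str):
    seconds = timeit.timeit(lambda: func(seq, pattern), number=5)
    print(f"{func.__name__}: {seconds:.4f} s for 5 runs")
```

Returning lists rather than generators ensures that each call does all of its work inside the timed region.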

Consensus and Profile

Problem

A matrix is a rectangular table of values divided into rows and columns. An $m \times n$ matrix has $m$ rows and $n$ columns. Given a matrix $A$, we write $A_{i,j}$ to indicate the value found at the intersection of row $i$ and column $j$.

Say that we have a collection of DNA strings, all having the same length $n$. Their profile matrix is a $4 \times n$ matrix $P$ in which $P_{1,j}$ represents the number of times that A occurs in the $j$th position of one of the strings, $P_{2,j}$ represents the number of times that C occurs in the $j$th position, and so on (see below).

A consensus string $c$ is a string of length $n$ formed from our collection by taking the most common symbol at each position; the $j$ th symbol of $c$ therefore corresponds to the symbol having the maximum value in the $j$ -th column of the profile matrix. Of course, there may be more than one most common symbol, leading to multiple possible consensus strings.

DNA Strings

A T C C A G C T
G G G C A A C T
A T G G A T C T
A A G C A A C C
T T G G A A C T
A T G C C A T T
A T G G C A C T

Profile

A   5 1 0 0 5 5 0 0
C   0 0 1 4 2 0 6 1
G   1 1 6 3 0 1 0 0
T   1 5 0 0 0 1 1 6

Consensus A T G C A A C T
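The profile and consensus in the example above can be reproduced with the standard library alone by transposing the strings into columns. This is a sketch, not the reference solution; the function name `profile_and_consensus` is hypothetical, and ties are broken in A, C, G, T order, which is one of the allowed choices.

```python
from collections import Counter


def profile_and_consensus(strings):
    """Return (profile, consensus) for equal-length DNA strings."""
    # zip(*strings) yields one tuple of symbols per column position.
    columns = [Counter(column) for column in zip(*strings)]
    profile = {base: [c[base] for c in columns] for base in "ACGT"}
    # max returns the first base with the highest count, so ties go to
    # whichever base comes first in "ACGT".
    consensus = "".join(max("ACGT", key=lambda base: c[base]) for c in columns)
    return profile, consensus


dna = ["ATCCAGCT", "GGGCAACT", "ATGGATCT", "AAGCAACC",
       "TTGGAACT", "ATGCCATT", "ATGGCACT"]
profile, consensus = profile_and_consensus(dna)
print(consensus)  # ATGCAACT
for base in "ACGT":
    print(base + ":", " ".join(map(str, profile[base])))
```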

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

Sample Dataset

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

Sample Output

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

import numpy as np
from collections import Counter


# Read every non-header line as one sequence (assumes each sequence sits on a single line)
with open('./data/test3.fa') as handle:
    fasta_list = [list(i.strip()) for i in handle if not i.startswith('>')]
arr = np.array(fasta_list)

consensus = ''
# One row per base (A, C, G, T), one column per sequence position
result = np.zeros((4, arr.shape[1]), dtype=int)
for i in range(arr.shape[1]):
    col = arr[:, i]
    result[0, i] = col[col == "A"].size
    result[1, i] = col[col == "C"].size
    result[2, i] = col[col == "G"].size
    result[3, i] = col[col == "T"].size
    # The most common symbol in the column becomes the consensus symbol
    consensus = consensus + Counter(col).most_common()[0][0]

print(consensus)
print("A:", " ".join(map(str, result[0])), "\nC:", " ".join(map(str, result[1])))
print("G:", " ".join(map(str, result[2])), "\nT:", " ".join(map(str, result[3])))

ATGCAACT
A: 5 1 0 0 5 5 0 0 
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0 
T: 1 5 0 0 0 1 1 6
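The file-reading step above treats every non-header line as a separate sequence, which works only when each record's sequence sits on a single line. FASTA files often wrap long sequences across several lines; a minimal parser sketch that joins wrapped lines follows. The function name `parse_fasta` and the wrapped example input are hypothetical.

```python
def parse_fasta(lines):
    """Map each FASTA header (without '>') to its full sequence."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            # Sequence lines belong to the most recent header.
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


# Hypothetical input in which the first sequence is wrapped over two lines
example = """>Rosalind_1
ATCC
AGCT
>Rosalind_2
GGGCAACT
""".splitlines()

print(parse_fasta(example))
# {'Rosalind_1': 'ATCCAGCT', 'Rosalind_2': 'GGGCAACT'}
```

Because the function accepts any iterable of lines, an open file handle can be passed to it directly.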
