Basics of sequence alignment (2): the substitution (scoring) matrix series

  This series mainly introduces the BLOSUM and PAM matrices. Disclaimer: the book covers this topic only briefly, so I went online and looked up relevant literature and courseware from several foreign universities (these are the notes of a graduate-student blogger).
This article starts with the BLOSUM matrix.
  BLOck SUbstitution Matrix: the BLOSUM matrix. In detail, the BLOSUM matrices are derived from aligned, ungapped regions of protein families, and these protein families come from the BLOCKS database. 1

The BLOSUM62 matrix is widely used in pairwise sequence alignment programs and is the default scoring matrix that BLAST uses.
What does the 62 mean?
 "BLOSUM62 is derived from Blocks containing >62% identity in ungapped sequence alignment." 2

  That is, the BLOSUM62 matrix comes from blocks in which the fraction of identical residues between sequences is above 62%. A block is a region of the aligned sequences that contains no gaps.
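To make "fraction of identical residues" concrete, here is a minimal sketch of how the identity between two rows of an ungapped block could be computed. The function name `percent_identity` is my own, not from the book or the courseware, and the sketch assumes both rows have the same length and contain no gap characters.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of aligned positions that hold the same residue.

    Assumes two rows of an ungapped block: equal length, no gap symbols.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("rows of an ungapped block must have equal length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count the columns in which both rows carry the same residue.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)
```

With a threshold of 62, two rows would count as "more than 62% identical" when `percent_identity(row_a, row_b) > 62`.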
A short summary:
1. First fix a threshold L, for example L = 62 if the matrix you want in the end is BLOSUM62.
2. In the protein sequence database, group together into one class the sequences whose fraction of identical residues with each other is greater than L.
3. Align the sequences inside each class (a multiple sequence alignment, performed using a PAM matrix).
4. In the alignment, cut the conserved regions into ungapped blocks. Count the frequencies inside the blocks; a block corresponds to the match model. The log-odds ratio then gives s(a, b).

For example: [Figure: a multiple sequence alignment in which the block is shaded dark]

 The dark part of the figure is a block. As an example, we compute the value s(a, b) for aligned residues inside the block.
 The core of the computation is the log-odds ratio mentioned in the previous article, that is, $s(a, b) = \log\left(\frac{p_{ab}}{q_a q_b}\right)$. The probabilities are represented by normalized frequencies obtained from counting.
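As a quick illustration of the log-odds ratio, here is a minimal sketch. The function name `log_odds_score` and the input numbers are made up for illustration, and base 2 is an assumption chosen to match the base-2 logarithm used in the BLOSUM formula later in this article.

```python
from math import log2


def log_odds_score(p_ab: float, q_a: float, q_b: float) -> float:
    """s(a, b) = log2(p_ab / (q_a * q_b)).

    p_ab: probability of seeing a aligned with b under the match model.
    q_a, q_b: background probabilities of a and b (random model).
    """
    return log2(p_ab / (q_a * q_b))


# Made-up numbers: the pair is seen twice as often as chance predicts,
# so the score is about +1 bit.
score = log_odds_score(p_ab=0.02, q_a=0.1, q_b=0.1)
```

A positive score means the pair shows up more often in real alignments than chance would predict, and a negative score means it shows up less often.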
  Calculation process:

$$c_{ii}^{(k)} = \binom{n_i}{2}$$
$$c_{ij}^{(k)} = n_i \cdot n_j \quad (i \neq j)$$
$$c_{ij} = \sum_{k} c_{ij}^{(k)}$$

$c_{ij}^{(k)}$: the number of residue pairs (i, j) observed in the k-th column.
$n_i$: the number of times residue i is observed in that column.
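The per-column counting can be sketched as follows. This is only my own illustration of the two formulas above: the function name `column_pair_counts` is hypothetical, and a column is assumed to be given as a string with one residue per row of the block.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb


def column_pair_counts(column: str) -> dict:
    """Pair counts c_ij^(k) for one column of an ungapped block."""
    n = Counter(column)  # n[i]: how often residue i occurs in this column
    counts = {}
    for i, j in combinations_with_replacement(sorted(n), 2):
        if i == j:
            counts[(i, j)] = comb(n[i], 2)  # identical pairs: C(n_i, 2)
        else:
            counts[(i, j)] = n[i] * n[j]    # mixed pairs: n_i * n_j
    return counts
```

A useful sanity check: for a column with N rows, the counts over all pairs add up to C(N, 2), the number of ways to pick two rows.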

$$T = W \cdot \binom{N}{2}$$
$$q_{ij} = \frac{c_{ij}}{T}$$

  W: the number of columns in the block; N: the number of rows (sequences). The normalized frequency $q_{ij}$ is used as the probability.

$$p_i = q_{ii} + \sum_{j \neq i} \frac{q_{ij}}{2}$$

$p_i$: the probability that residue i occurs in the block.

$$e_{ii} = p_i^{2}$$
$$e_{ij} = 2 p_i p_j \quad (i \neq j)$$

$e_{ij}$: the probability that the residue pair (i, j) appears by chance.

$$s(i, j) = \log_{2} \frac{q_{ij}}{e_{ij}}$$

 Finally, the BLOSUM matrix entry [i, j] is 2 · s(i, j), rounded to the nearest integer.
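Putting the whole calculation together, here is a minimal sketch that goes from one toy block to integer scores. It is my own illustration, not the code used to build the real BLOSUM matrices: the function name `blosum_scores` is hypothetical, a single block is used instead of the many blocks of the BLOCKS database, no clustering by the threshold L is done, and pairs that are never observed are simply skipped because their log-odds is undefined.

```python
from collections import Counter
from math import comb, log2


def blosum_scores(block: list) -> dict:
    """Integer scores round(2 * log2(q_ij / e_ij)) from one ungapped block.

    block: list of equal-length strings, one string per sequence (row).
    Returns {(i, j): score} with i <= j, only for pairs actually observed.
    """
    n_rows, n_cols = len(block), len(block[0])
    if any(len(row) != n_cols for row in block):
        raise ValueError("all rows of a block must have the same length")

    # c[(i, j)]: pair counts summed over all columns.
    c = Counter()
    for k in range(n_cols):
        n = Counter(row[k] for row in block)
        residues = sorted(n)
        for x, i in enumerate(residues):
            c[(i, i)] += comb(n[i], 2)
            for j in residues[x + 1:]:
                c[(i, j)] += n[i] * n[j]

    total = n_cols * comb(n_rows, 2)  # T = W * C(N, 2)
    q = {pair: cnt / total for pair, cnt in c.items() if cnt > 0}

    # p_i = q_ii + sum over j != i of q_ij / 2
    p = Counter()
    for (i, j), q_ij in q.items():
        if i == j:
            p[i] += q_ij
        else:
            p[i] += q_ij / 2
            p[j] += q_ij / 2

    scores = {}
    for (i, j), q_ij in q.items():
        e_ij = p[i] ** 2 if i == j else 2 * p[i] * p[j]
        # Note: Python's round() sends exact .5 ties to the even integer.
        scores[(i, j)] = round(2 * log2(q_ij / e_ij))
    return scores


toy_block = ["LI", "LI", "LI", "LI", "IL"]  # made-up block: 5 rows, 2 columns
toy_scores = blosum_scores(toy_block)
```

In this toy block most rows agree in every column, so the identical pairs (L, L) and (I, I) come out with positive scores and the mixed pair (I, L) with a negative one, which is the behaviour a substitution matrix is expected to show.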
Summary of this article:
  Having read this far, I believe you now have some idea of how a scoring matrix is built. But if you think about it, a question comes up. We directly used the observed normalized frequencies as the probability parameters of the match model M (the probabilities with which residues are associated with each other in nature), that is, we estimated the population probability parameters directly from the observations of one sample. Shouldn't population parameters be estimated by maximum likelihood? Was maximum likelihood estimation not used here?
Perhaps the next few articles will answer this.
This is the BLOSUM62 matrix: [Figure: the BLOSUM62 matrix]


  1. "Biological Sequence Analysis"

  2. Courseware: http://www.cs.columbia.edu/4761/assignments/assignment1/reference1.pdf

Origin blog.csdn.net/weixin_43770577/article/details/104023846