A First Look at the BLAST Algorithm (Basic Local Alignment Search Tool)

Sequence Database Search

BLAST is a heuristic algorithm: it does not guarantee finding the optimal solution, but tries to find a good-enough solution in less time.

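Global alignment (dynamic programming):
$\begin{array}{l} F(0,0)=0 \\ F(i, j)=\max \left\{\begin{array}{l} F(i-1, j-1)+s\left(x_{i}, y_{j}\right) \\ F(i-1, j)+d \\ F(i, j-1)+d \end{array}\right. \end{array}$
Here $s(x_i, y_j)$ is the substitution score for aligning $x_i$ with $y_j$, and $d$ is the gap penalty (a negative number).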
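The global recurrence above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name and the match/mismatch/gap values are assumptions chosen for the example, and the boundary cells are filled with the usual `i * gap` initialization, which the recurrence above does not spell out.

```python
def global_alignment_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global (Needleman-Wunsch style) alignment score.

    The scoring values are illustrative assumptions, not BLAST defaults.
    """
    n, m = len(x), len(y)
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap          # x[:i] aligned against gaps only
    for j in range(1, m + 1):
        F[0][j] = j * gap          # y[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,   # align x_i with y_j
                F[i - 1][j] + gap,     # gap in y
                F[i][j - 1] + gap,     # gap in x
            )
    return F[n][m]
```

The whole matrix is filled, so the cost is proportional to the product of the two sequence lengths, which is what makes plain dynamic programming too slow for database search.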

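Why local alignment was introduced:

Functionally related proteins can differ greatly over their full sequences yet share the same functional domain. Such sequence segments can carry out a biological function on their own, and global alignment fails to detect them.

Introns: a spliced transcript matches genomic DNA only in separate exon-sized pieces, which global alignment handles poorly.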

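Local alignment (the basis of BLAST): the same dynamic programming is used to compute the optimal alignment path, but the goal is the local optimum. Every cell score is bounded below by 0, so the result measures local similarity; BLAST additionally shrinks the part of the DP matrix that has to be computed.
$\begin{array}{l} F(0,0)=0 \\ F(i, j)=\max \left\{\begin{array}{l} F(i-1, j-1)+s\left(x_{i}, y_{j}\right) \\ F(i-1, j)+d \\ F(i, j-1)+d \\ 0 \end{array}\right. \end{array}$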

Traceback: Decode the Local Alignment.

Traceback begins at the highest score in the matrix and continues until it reaches a cell with score 0.

There may be more than one local alignment.
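The local recurrence and the traceback rule can be sketched together as follows. This is a minimal illustration under assumed scoring values (`match`, `mismatch` and `gap` are not taken from the notes), and it reports only one best-scoring local alignment even when several exist.

```python
def local_alignment(x, y, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman style local alignment with traceback.

    Returns (best_score, aligned_x, aligned_y). Scoring values are
    illustrative assumptions.
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # The extra 0 lets the alignment restart anywhere.
            F[i][j] = max(0, F[i - 1][j - 1] + s,
                          F[i - 1][j] + gap, F[i][j - 1] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Traceback: start at the highest score, stop when a 0 is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))
```

For example, `local_alignment("TTTTACGGTTTT", "CCCACGGCCC")` picks out the shared `ACGG` segment and ignores the unrelated flanks, which a global alignment would be forced to align.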

BLAST idea: seeding and extending

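Find matches (seeds) between the query and the subject, i.e. locate the seeds first.

Extend each seed into a High-scoring Segment Pair (HSP), running the Smith-Waterman algorithm on the specified region only: starting from the seed, extend in both directions and build the alignment.

Assess the reliability of the alignment, i.e. compute its statistical significance.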

Seeding

For a given word length w (usually 3 for proteins and 11 for nucleotides), slice the query sequence into consecutive, overlapping "seed words" of length w.
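The slicing step is short enough to show directly. A minimal sketch, assuming one word is taken at every query position; the function name `seed_words` is made up for this example.

```python
def seed_words(query, w=3):
    """Return (position, word) for every length-w window of the query."""
    return [(i, query[i:i + w]) for i in range(len(query) - w + 1)]


# A 6-residue protein query yields 6 - 3 + 1 = 4 seed words.
words = seed_words("MDKTAV", w=3)
```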

Speedup: Index database

The database is pre-indexed so that all positions of a given seed in the database can be located quickly (in approximately constant to linear time).

Seeds are then matched against the pre-indexed sequences.
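The indexing idea can be illustrated with a plain dictionary from word to positions. This is a sketch only: real BLAST databases use their own on-disk formats, and the names `build_index` and `find_hits` and the toy database are assumptions made for the example.

```python
from collections import defaultdict


def build_index(database, w=3):
    """Map every length-w word to the (sequence id, position) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - w + 1):
            index[seq[pos:pos + w]].append((seq_id, pos))
    return index


def find_hits(query, index, w=3):
    """Look up each query word in the index; one dictionary access per word."""
    hits = []
    for qpos in range(len(query) - w + 1):
        for seq_id, spos in index.get(query[qpos:qpos + w], []):
            hits.append((seq_id, qpos, spos))
    return hits


db = {"s1": "AADKTAA", "s2": "GGGG"}   # toy database (hypothetical)
hits = find_hits("DKT", build_index(db))
```

The index is built once per database, so each search only pays for the lookups, not for a scan of every sequence.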

Diagonal and Two-hits

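The optimal alignment path should run roughly parallel to the main diagonal (where the score is highest), so scattered hits can be discarded and only runs of two or more hits along the same diagonal are kept, which shrinks the search space.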
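A simplified version of the diagonal filter is sketched below. Hits are grouped by diagonal (subject position minus query position), and a hit survives only if another hit lies close to it on the same diagonal. The `window` value and the function name are assumptions for illustration, and the sketch ignores the non-overlap requirement that a production implementation would also check.

```python
from collections import defaultdict


def two_hit_filter(hits, window=40):
    """Keep (query_pos, subject_pos) hits that have a neighbour on the same diagonal.

    `window` is the maximum distance between two hits; its value is an assumption.
    """
    by_diag = defaultdict(list)
    for qpos, spos in hits:
        by_diag[spos - qpos].append(qpos)

    kept = []
    for diag, qs in by_diag.items():
        qs.sort()
        for k, q in enumerate(qs):
            near_prev = k > 0 and q - qs[k - 1] <= window
            near_next = k + 1 < len(qs) and qs[k + 1] - q <= window
            if near_prev or near_next:
                kept.append((q, q + diag))
    return sorted(kept)


# Two nearby hits on diagonal 5 survive; the isolated ones are dropped.
kept = two_hit_filter([(0, 5), (10, 15), (3, 50), (100, 105)])
```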

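Starting from each hit cluster, the alignment is extended to the left and to the right until the total score has dropped by more than X from the best score seen so far.

The extended region can then be aligned with dynamic programming restricted to that region.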
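The stopping rule can be sketched as an ungapped extension in both directions. This is a simplified illustration: the scoring values, the `x_drop` default and the function name are assumptions, and gapped extension is left out.

```python
def extend_hit(query, subject, qpos, spos, w, x_drop=5, match=2, mismatch=-3):
    """Ungapped extension of a seed of length w at (qpos, spos).

    Returns (score, query_start, query_end) with query_end exclusive.
    All numeric defaults are illustrative assumptions.
    """
    def score(a, b):
        return match if a == b else mismatch

    def walk(qi, si, step):
        # Extend one residue at a time; remember the best running total
        # and stop once the total falls more than x_drop below it.
        total = best = best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            total += score(query[qi], subject[si])
            length += 1
            if total > best:
                best, best_len = total, length
            elif best - total > x_drop:
                break
            qi += step
            si += step
        return best, best_len

    seed = sum(score(query[qpos + k], subject[spos + k]) for k in range(w))
    right, rlen = walk(qpos + w, spos + w, +1)
    left, llen = walk(qpos - 1, spos - 1, -1)
    return seed + left + right, qpos - llen, qpos + w + rlen


# Seed "GGT" at position 5 in both sequences grows into "AAGGTT".
result = extend_hit("CCCAAGGTTCCC", "TTTAAGGTTAAA", 5, 5, 3)
```

Only the best-scoring prefix of each walk is kept, so residues examined after the peak do not end up in the reported segment.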

Speedup: mask low-complexity regions

Low-complexity sequences yield false positives.

CACACACACACACACA, K=0.36
KLKLKLKLKLKL

Repetitive low-complexity regions are masked so that they do not produce too many false-positive hits.

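Whether masking is appropriate cannot be decided in advance; it has to be checked in practice, since it may not suit a particular search.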

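Neighbourhood words

To improve sensitivity, in addition to the seed word itself, BLAST also uses highly similar "neighbourhood words" (based on the substitution matrix) for seeding.

Specifically, every possible variant of the seed word is scored against it with the substitution matrix.

Example: for the seed DKT, the variant DRT scores 6 + 2 + 5 = 13, and so on. In the current version, only variants scoring >= 11 are included (to reduce false positives).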
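The DKT example can be reproduced with a small script. To keep it self-contained, the sketch below hard-codes an excerpt of BLOSUM62 for five residues only (D, E, K, R, T), which is an assumption of this example: a real search enumerates all 20 amino acids with the full matrix, so the true neighbourhood of DKT is larger than what is printed here.

```python
from itertools import product

# Excerpt of BLOSUM62 restricted to five residues (illustrative subset).
ALPHABET = "DEKRT"
_PAIRS = {
    ("D", "D"): 6, ("E", "E"): 5, ("K", "K"): 5, ("R", "R"): 5, ("T", "T"): 5,
    ("D", "E"): 2, ("D", "K"): -1, ("D", "R"): -2, ("D", "T"): -1,
    ("E", "K"): 1, ("E", "R"): 0, ("E", "T"): -1,
    ("K", "R"): 2, ("K", "T"): -1, ("R", "T"): -1,
}


def s(a, b):
    """Symmetric lookup in the excerpted substitution matrix."""
    return _PAIRS.get((a, b), _PAIRS.get((b, a)))


def neighbourhood(seed, threshold=11):
    """Score every candidate word against the seed; keep those >= threshold."""
    result = {}
    for cand in product(ALPHABET, repeat=len(seed)):
        total = sum(s(a, b) for a, b in zip(seed, cand))
        if total >= threshold:
            result["".join(cand)] = total
    return result


neighbours = neighbourhood("DKT")
```

Within this reduced alphabet, the seed itself scores 16 and three variants pass the threshold, including DRT with the score of 13 worked out above.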

Assessing statistical significance (Quality Assessment)

Given the large data volume, it’s critical to provide some measures for assessing the statistical significance of a given hit.

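Quality assessment is carried out once the final alignments have been obtained.

Its purpose is to make sure an alignment did not arise by chance (when the database is large enough, even a randomly generated sequence will produce matches).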

E-value: how likely a match is to arise by chance

The E-value is the number of alignments with a given score that are expected to occur at random in the database that was searched.

In other words, it is the number of alignments expected by chance with a score equal to or higher than that of the current alignment.

e.g. if E=10, 10 matches with scores this high are expected to be found by chance

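$E = K m n e^{-\lambda S}$

E is an expectation (an expected count), not a probability, so it can be greater than 1.
m is the query sequence length.
n is the database size.
S is the alignment score.
K and λ depend on the scoring matrix and act as normalization factors.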

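The larger the database (n), the more likely a random match becomes. E is also proportional to the query length m, because BLAST performs local alignment and does not require a full-length match. E decreases as S increases: the higher the score, the less likely the match arose by chance. K and λ balance out the effect that different scoring matrices and search spaces have on the result.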
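These dependencies are easy to check numerically. The sketch below assumes the usual form $E = K m n e^{-\lambda S}$; the default values of `K` and `lam` are placeholders for illustration only, since the real values depend on the scoring system in use.

```python
import math


def e_value(score, m, n, K=0.134, lam=0.3176):
    """Expected number of chance alignments scoring at least `score`.

    m: query length, n: database size. K and lam are placeholder
    values (assumptions), not parameters to rely on.
    """
    return K * m * n * math.exp(-lam * score)


small_db = e_value(50, m=100, n=1_000_000)
large_db = e_value(50, m=100, n=2_000_000)   # twice the database, twice the E-value
higher_score = e_value(60, m=100, n=1_000_000)  # higher score, smaller E-value
```

Doubling either m or n doubles E, while every extra unit of score shrinks E by a constant factor of $e^{-\lambda}$.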

Converting between the E-value and the p-value

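For ease of interpretation, the p-value and the E-value can be converted into each other via $p = 1 - e^{-E}$. Below about 0.1, the E-value and the p-value (a probability) are almost equal. In particular, p = 0.05 corresponds to an E-value of 0.0513, which is why 0.05 is often used as an E-value cut-off.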
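The conversion can be checked with a few lines of Python. This sketch assumes the standard relation $p = 1 - e^{-E}$ (the probability of at least one chance hit when the number of chance hits is Poisson-distributed with mean E); the function names are made up for the example.

```python
import math


def p_from_e(e):
    """Probability of at least one chance alignment, given E expected ones."""
    return -math.expm1(-e)       # 1 - exp(-e), computed accurately for small e


def e_from_p(p):
    """Inverse conversion: E = -ln(1 - p)."""
    return -math.log1p(-p)


cutoff = e_from_p(0.05)          # E-value matching p = 0.05
```

For small values the two numbers agree to within a few percent, and they drift apart as E grows: an E-value of 10 still corresponds to a p-value just below 1.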

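Unlike dynamic-programming algorithms such as Needleman-Wunsch and Smith-Waterman, BLAST is a heuristic algorithm: it does not guarantee finding the optimal solution, but tries to find a good-enough solution in less time. Specifically, BLAST applies the seeding-and-extending strategy and runs dynamic programming only in limited regions, which effectively reduces the amount of computation and increases speed. However, the gain in speed comes at the cost of reduced sensitivity, a trade-off shared by heuristic algorithms in general.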

Reference: Bioinformatics course, Peking University