Attention principle + vector inner product + Scaled Dot-Product Attention in Transformer

1. Attention principle


  Imagine the constituent elements of Source as a series of <Key, Value> data pairs. Given an element Query in Target, attention is computed by measuring the similarity or correlation between the Query and each Key to obtain a weight coefficient for each Key's corresponding Value, and then taking a weighted sum over the Values to produce the final Attention value. So, essentially, the Attention mechanism is a weighted sum over the Value elements in Source, where Query and Key are used to compute the weight coefficient of the corresponding Value. Its essential idea can therefore be written as the following formula:

$$Attention(Query, Source) = \sum_{i=1}^{L_{x}} Similarity(Query, Key_{i}) * Value_{i}$$
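  As a minimal sketch of this weighted sum (in NumPy, assuming a plain dot product as the Similarity function and a softmax as the normalization, neither of which is fixed by the formula above):

```python
import numpy as np

def attention(query, keys, values):
    """Weighted-sum view of attention for a single query.

    query:  (d,)       the Query vector
    keys:   (L_x, d)   one Key per Source element
    values: (L_x, d_v) one Value per Source element
    """
    # Similarity(Query, Key_i): a dot product is assumed here
    scores = keys @ query                  # shape (L_x,)
    # Normalize the similarities into weight coefficients (softmax)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # shape (L_x,)
    # Weighted sum of the Values gives the final Attention value
    return weights @ values                # shape (d_v,)

# Example: a Source of 4 elements, dimension 3
rng = np.random.default_rng(0)
q = rng.normal(size=3)
K = rng.normal(size=(4, 3))
V = rng.normal(size=(4, 3))
print(attention(q, K, V))
```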

2. Vector inner product

  The vector inner product is also called the vector dot product; the formula is as follows:

$$\vec{a} \cdot \vec{c} = \parallel\vec{a}\parallel \times \parallel\vec{c}\parallel \times \cos\theta$$
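  A quick NumPy check of this identity, using two vectors whose angle is known to be 45°:

```python
import numpy as np

# Two vectors with a known angle of 45 degrees between them
a = np.array([1.0, 0.0])
c = np.array([1.0, 1.0])

lhs = np.dot(a, c)                                   # 1*1 + 0*1 = 1.0
rhs = (np.linalg.norm(a) * np.linalg.norm(c)
       * np.cos(np.deg2rad(45)))                     # 1 * sqrt(2) * cos(45°) = 1.0

print(lhs, rhs)   # both ~1.0
```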


  The derivative of the vector inner product with respect to $\bar{w}$ is as follows:

$$\frac{\partial(\bar{x} \cdot \bar{w})}{\partial \bar{w}} = \bar{x}^{T}$$
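  A small finite-difference check of this gradient (a NumPy sketch; each component of the gradient should equal the corresponding component of $\bar{x}$):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)
w = rng.normal(size=4)

# Analytic gradient of f(w) = x · w with respect to w is x (written as x^T)
analytic = x

# Central finite differences along each coordinate direction
eps = 1e-6
numeric = np.array([
    (np.dot(x, w + eps * e) - np.dot(x, w - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

print(np.allclose(analytic, numeric))   # True
```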

3. Scaled Dot-Product Attention in Transformer

  The formula is as follows:

$$Attention(Q, K, V) = softmax\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$

  For a set of key-value pairs and n queries, the outputs for all queries can be computed in parallel with just two matrix multiplications.
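  The following NumPy sketch (without masking or batching, which a full Transformer layer would also handle) makes those two matrix multiplications explicit: $QK^{T}$ produces all of the query-key scores at once, and the softmax-weighted product with $V$ produces all of the outputs at once:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for n queries and m key-value pairs.

    Q: (n, d_k)   queries
    K: (m, d_k)   keys
    V: (m, d_v)   values
    Returns: (n, d_v)
    """
    d_k = Q.shape[-1]
    # First matrix multiplication: all n*m similarity scores at once, scaled
    scores = Q @ K.T / np.sqrt(d_k)                          # (n, m)
    # Row-wise softmax turns scores into weight coefficients
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)           # (n, m)
    # Second matrix multiplication: weighted sums of the values
    return weights @ V                                       # (n, d_v)

rng = np.random.default_rng(0)
n, m, d_k, d_v = 2, 5, 8, 8
out = scaled_dot_product_attention(rng.normal(size=(n, d_k)),
                                   rng.normal(size=(m, d_k)),
                                   rng.normal(size=(m, d_v)))
print(out.shape)   # (2, 8)
```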

