Vision Transformer (ViT): Analysis of Image Patching, Patch Embedding, the Class Token, the QKV Matrices, and the Self-Attention Mechanism

Author: CSDN @ _Yakult_

This article introduces the key components of the Vision Transformer (ViT): image patching (Image Patching), patch embedding (Patch Embedding), the class token (class_token), the QKV matrix computation, cosine similarity, Softmax, and the self-attention mechanism. The main focus is the computation process of the Q, K, and V matrices.



1. Image Patching

The process of dividing an image into small blocks is called "Image Patching", or simply "Patching". The image is split into a series of blocks of the same (or sometimes different) sizes; these blocks are commonly called "Image Patches" or simply "Patches".

The image patching process is shown in the figure.

[Figure: an image divided into a grid of patches]

A "Patch" is a small region or segment of an image. The concept is used to decompose a large image into smaller parts so that each block can be processed, analyzed, or used for feature extraction individually.

The advantages of dividing the image into small blocks (i.e., patches):

  • Feature extraction: In some tasks, information from specific regions is more useful than the entire image. By extracting the features of each Patch, more fine-grained information can be obtained, which helps to better understand the image content.

  • Handling large images: For very large images, you may run into computational and storage constraints. Dividing an image into small patches can help reduce computational complexity and make it easier to process these small-sized patches.

  • Adaptability: In some adaptive processing algorithms, it is common to adopt different strategies for different image regions. Dividing the image into patches can make the algorithm more flexible and adaptive in the local area.
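
To make the patching step concrete, here is a minimal NumPy sketch (not from the original article) that cuts a 224x224 RGB image into non-overlapping 16x16 patches; the image size and patch size are assumptions that match the 196-patch example used later in this article.

# Minimal NumPy sketch (assumed sizes): split a 224x224 RGB image into 16x16 patches.
import numpy as np

H = W = 224   # image height and width (assumption)
P = 16        # patch size (assumption)
C = 3         # colour channels

image = np.random.rand(H, W, C)                       # stand-in for a real image

# (H, W, C) -> (H/P, P, W/P, P, C) -> (num_patches, P*P*C)
patches = image.reshape(H // P, P, W // P, P, C)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C)

print(patches.shape)   # (196, 768): 14 x 14 = 196 patches, each flattened to 16*16*3 values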

2. Patch Embedding

"Patch Embedding" is a concept in the field of computer vision that is related to Convolutional Neural Networks (CNN) in image processing and deep learning.

A traditional convolutional neural network operates at the pixel level, extracting features by sliding a convolution kernel over the image. Patch Embedding introduces a different feature representation: the input image is divided into small patches, and each patch is converted into a low-dimensional vector representation. These vectors then serve as the input for subsequent processing.

The purpose of Patch Embedding is to reduce computational complexity and improve the efficiency of feature extraction. In traditional convolution, neighbouring receptive fields overlap heavily; by dividing the image into blocks, Patch Embedding reduces redundant computation while retaining the important feature information.

[Figure: Patch Embedding of image patches into a sequence of vectors]
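
As an illustration of this step, here is a minimal NumPy sketch (an assumption-level example, not the article's code) that projects each flattened patch to a D-dimensional embedding with a single weight matrix; in practice this projection is typically implemented as a convolution whose kernel size and stride equal the patch size.

# Minimal NumPy sketch (assumed sizes): linear projection of flattened patches.
import numpy as np

num_patches, patch_dim, D = 196, 16 * 16 * 3, 768     # sizes are assumptions

patches = np.random.rand(num_patches, patch_dim)      # output of the patching step
W_embed = np.random.randn(patch_dim, D) * 0.02        # stand-in for learned weights
b_embed = np.zeros(D)

patch_embeddings = patches @ W_embed + b_embed        # (196, 768)
print(patch_embeddings.shape)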

3. Class token

"Class token" is a special token used to represent the class information of the whole image. Usually, it will be added to a certain position in the vector sequence obtained after Patch Embedding, so that the model can use this category information for classification or generation tasks.

3.1 Adding the Class token

In the Transformer model, the class token is usually added at the beginning of the input sequence; through training and the attention mechanism, the model learns to encode and make use of category information in this token.

After the Patch Embedding operation, the class token is prepended to the sequence of patch embedding vectors. It represents the category information of the entire image and assists subsequent image classification or generation tasks.

[Figure: the class token prepended to the patch embedding sequence]

The following example illustrates the class token. Suppose the task is to classify whether an image shows Satomi Ishihara, and we use one-hot encoding to represent the category information. There are two categories, "yes" and "no": the vector [1, 0] represents "yes" and [0, 1] represents "no". The class_token is then [1, 0] or [0, 1].

Now, we prepend this class token to the sequence of patch embedding vectors to get the final input sequence. Assume the 196 patch embedding vectors are:

[v1, v2, v3, ..., v196]

Then, the final input sequence after adding "Class token" is:

[Class_token, v1, v2, v3, ..., v196]

In this way, the first vector in the input sequence is the class token, which carries the category information of the entire image, that is, whether the image shows Satomi Ishihara. The model can use this category information during training to help with the image classification task.

To be more specific, suppose v1 is a 2-dimensional vector, expressed as:

v1 = [0.2, 0.7]

This vector represents the features of the first patch. Now, we concatenate the "Class token" and v1 to get the final input sequence:

[Class_token, v1]

Assuming the class token indicates that the image belongs to the Satomi Ishihara category, its one-hot encoding is:

[1, 0]

Then the final input sequence is:

[[1, 0], [0.2, 0.7]]

This input sequence contains the category information of the entire image (the probability of belonging to Satomi Ishihara is 1, and the probability of not being Satomi Ishihara is 0) and the feature vector of the first small block [0.2, 0.7].
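
The concatenation can be sketched in a few lines of NumPy (an illustration only, not the article's code). Note that in the actual ViT the class token is a learnable D-dimensional parameter rather than a hand-written one-hot vector; the toy 2-dimensional values below simply mirror the example above.

# Minimal NumPy sketch: prepend a class token to the patch embedding sequence.
import numpy as np

class_token = np.array([[1.0, 0.0]])      # toy class token from the example
v = np.array([[0.2, 0.7],                 # v1, the first patch embedding
              [0.5, 0.1]])                # v2, a made-up second patch embedding

sequence = np.concatenate([class_token, v], axis=0)   # [class_token, v1, v2]
print(sequence.shape)                                  # (3, 2)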

3.2 Positional Encoding

After understanding the class token, let's look at how positional encoding is used in ViT.

In the Vision Transformer (ViT) model, "PE" stands for Positional Encoding. It associates each patch embedding vector with its position in the image, introducing the global position information of the image into the Transformer model.

Positional encoding provides the Transformer with the positions of elements in the input sequence, because unlike a convolutional neural network, the Transformer does not implicitly preserve position information. In natural language processing tasks the input is a sequence of words, and positional encoding is added to preserve word order. Similarly, in ViT the input is the sequence of patch embeddings, and positional encoding is added to preserve the positions of the patches.

In this explanation, PE(pos, 2i) and PE(pos, 2i + 1) are the sinusoidal positional encoding formulas used to compute the encoding for the class token and for each patch. They are built from the sin and cos functions as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i + 1) = cos(pos / 10000^(2i / d_model))

The positional encoding uses sine and cosine functions: PE(pos, 2i) gives the values for even dimensions and PE(pos, 2i + 1) the values for odd dimensions. Here pos is the position of the patch in the sequence, i is the dimension index of the encoding (starting from 0), and d_model is the hidden-layer dimension (also called the feature dimension) of the Transformer model.

This form of positional encoding is common in Transformers. It gives patch embedding vectors at different positions different offsets in feature space, so the model can take their relative positions into account when processing the sequence.


To better illustrate the calculation of the positional encoding, let's take a simplified example. Suppose we divide an image into a 4x4 grid of patches, 16 patches in total, and each patch is represented by a 4-dimensional vector; that is, the hidden size (d_model) is 4.

Now, let's calculate the "Class token" and the positional encoding of each small block.

First, the class token represents the whole image, so we assign it the virtual position pos = 0. Then we calculate the positional encoding of the class token:

d_model = 4
i = 0

PE(pos=0, 2i) = sin(0 / 10000^(2*0 / 4)) = sin(0) = 0
PE(pos=0, 2i + 1) = cos(0 / 10000^(2*0 / 4)) = cos(0) = 1

So the first two dimensions of the class token's positional encoding are [0, 1] (repeating the calculation for i = 1 also gives sin(0) = 0 and cos(0) = 1, so the full 4-dimensional encoding is [0, 1, 0, 1]).

Next, we compute the positional encoding for each patch. Assume the patches occupy positions 1 to 16. Using the same formula:

d_model = 4
i = 0, 1

pos = 1
PE(pos=1, 2*0) = sin(1 / 10000^(2*0 / 4)) = sin(1) ≈ 0.8415
PE(pos=1, 2*0 + 1) = cos(1 / 10000^(2*0 / 4)) = cos(1) ≈ 0.5403

pos = 2
PE(pos=2, 2*0) = sin(2 / 10000^(2*0 / 4)) = sin(2) ≈ 0.9093
PE(pos=2, 2*0 + 1) = cos(2 / 10000^(2*0 / 4)) = cos(2) ≈ -0.4161

...
and so on, computing the positional encoding for each patch until all 16 patches (and the class token) have their encodings.

Note that this is only a simplified example; the hidden size (d_model) and the number of positions vary with the actual model. In practice, ViT models use much higher-dimensional hidden layers and longer sequences. The purpose here is simply to demonstrate how the positional encoding is computed.
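
The simplified example can be reproduced with a short NumPy sketch (illustration only; real ViT implementations may instead use learned positional embeddings and much larger dimensions):

# Minimal NumPy sketch of the sinusoidal encoding from the example above:
# 1 class token (pos = 0) plus 16 patches (pos = 1..16), d_model = 4.
import numpy as np

d_model = 4
num_positions = 17

pe = np.zeros((num_positions, d_model))
for pos in range(num_positions):
    for i in range(d_model // 2):
        angle = pos / 10000 ** (2 * i / d_model)
        pe[pos, 2 * i] = np.sin(angle)        # even dimensions
        pe[pos, 2 * i + 1] = np.cos(angle)    # odd dimensions

print(pe[0])   # class token: [0. 1. 0. 1.]
print(pe[1])   # patch 1: approximately [0.8415 0.5403 0.0100 1.0000]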

4. QKV

[Figure: computing the Q, K, and V matrices from the input sequence]

As shown in the figure above, Q, K, and V are the three matrices used to calculate the attention weights in the self-attention mechanism. They are usually obtained by applying three linear transformations to the input sequence. They are, respectively:

  • Q matrix (Query Matrix): the Q matrix generates the query vectors. Each query vector represents the query of one patch in the attention mechanism and is used to look for information related to that patch.

  • K matrix (Key Matrix): the K matrix generates the key vectors. Each key vector represents the key of one patch and is used to express the relationship between that patch and the other patches.

  • V matrix (Value Matrix): the V matrix generates the value vectors. Each value vector represents the value of one patch and carries that patch's feature information.

The first thing to know is that the X (input) matrix and the Y (output) matrix have the same dimensions: the output dimensions of the attention block match its input dimensions.

Specifically, in the self-attention mechanism, the input sequence first passes through three different linear transformations to obtain the query matrix Q, key matrix K, and value matrix V, respectively. These three matrices will be used to calculate the attention weights, so that the input sequence is weighted and summed to obtain the final representation.

The matrix obtained from the dot product of Q and K is the attention weight matrix A. If there were only the V matrix, without the Q and K step, this would simply be an ordinary network with no attention mechanism.

Whatever linear transformation is used (the choice essentially comes down to the hidden dimension of the projection; the details are easy to look up), we now have the Q, K, and V matrices with the class token added, as shown below.

[Figure: the Q, K, and V matrices with the class token added]

Of course, in the actual computation Q, K, and V are flattened into rows of matrices; the drawing keeps the rectangular, image-like layout only for ease of illustration.
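
A minimal NumPy sketch of the three projections (illustration only; the sequence length and dimensions are assumptions):

# Minimal NumPy sketch: produce Q, K, V from the input sequence X.
import numpy as np

seq_len, d_model = 197, 768            # 1 class token + 196 patches (assumed)
X = np.random.rand(seq_len, d_model)   # input sequence of embeddings

W_q = np.random.randn(d_model, d_model) * 0.02   # stand-ins for learned weights
W_k = np.random.randn(d_model, d_model) * 0.02
W_v = np.random.randn(d_model, d_model) * 0.02

Q, K, V = X @ W_q, X @ W_k, X @ W_v
print(Q.shape, K.shape, V.shape)       # (197, 768) each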

4.1 Cosine similarity

Before looking at the dot product of Q and K, you need to understand the concept of cosine similarity, because the dot product of Q and K compares vectors in the same spirit as cosine similarity: when the first patch vector in Q is dotted with every patch vector in K, each result reflects how similar those two vectors are.

The greater the cosine similarity, the greater the self-attention weight.

The concept and calculation of cosine similarity are as follows.

Cosine similarity is a measure of the similarity between two vectors, often used to judge whether two vectors point in similar directions. The lengths of the vectors do not affect the result, so cosine similarity focuses on direction rather than magnitude.

Suppose there are two vectors A and B, they can be expressed as:

A = [a₁, a₂, a₃, ..., aₙ]
B = [b₁, b₂, b₃, ..., bₙ]

where a₁, a₂, …, aₙ and b₁, b₂, …, bₙ are the elements of the two vectors, respectively.

The calculation formula of cosine similarity is as follows:

cosine_similarity = (A · B) / (||A|| * ||B||)

where:

  • A·B means the dot product (inner product) of vector A and vector B, that is, a₁ * b₁ + a₂ * b₂ + … + aₙ * bₙ.
  • ||A|| represents the norm (or length) of the vector A, ie √(a₁² + a₂² + … + aₙ²).
  • ||B|| represents the norm of the vector B, ie √(b₁² + b₂² + … + bₙ²).

When calculating cosine similarity, first calculate the dot product of vector A and vector B, and then calculate their norms respectively. Finally divide the dot product by the product of the norms of the two vectors to get the cosine similarity value. The value range of cosine similarity is between -1 and 1,

  • When the cosine similarity is 1, it means that the directions of the two vectors are exactly the same, that is, they point to the same direction in space.
  • When the cosine similarity is -1, it means that the directions of the two vectors are completely opposite, that is, they point in opposite directions in space.
  • When the cosine similarity is 0, it means the two vectors are orthogonal, that is, they are perpendicular to each other in space.
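
A minimal NumPy sketch of this formula (illustration only):

# Cosine similarity: cos_sim = (A · B) / (||A|| * ||B||)
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(np.array([1.0, 0.0]), np.array([2.0, 0.0])))    #  1.0, same direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-3.0, 0.0])))   # -1.0, opposite direction
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 5.0])))    #  0.0, orthogonal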

4.2 Q @ K^T

Let's take a look at the process of calculating the weight matrix A by Q and K, as shown in the red box in the figure.

[Figure: the Q @ K^T step highlighted (red box) in the self-attention diagram]

[Figure: multiplying rows of Q (yellow) by columns of K^T (blue) to form the result matrix (green)]

As shown in the figure above, suppose the yellow rectangle represents the elements of the Q matrix, the blue rectangle represents the elements of the K^T matrix, and the green rectangle represents the elements of the result matrix obtained by multiplying Q by K^T. Here q0 denotes a row, k0 denotes a column, and q0·k0 is the single number obtained from the dot product of a yellow row and a blue column.

Here q0 is the class_token flattened into a one-dimensional vector and q1 is the first patch vector of the Q matrix (from the Satomi Ishihara image); k0 is a column of the transposed K matrix, which also corresponds to the flattened class_token, and k1 is the first patch vector of the K matrix.
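
This step is just a matrix product, as in the following NumPy sketch (illustration only, with assumed toy sizes); entry (i, j) of the result is the dot product of query i with key j.

# Raw attention scores: scores[i, j] = q_i · k_j
import numpy as np

seq_len, d_k = 5, 8                    # tiny example: 5 tokens, head dimension 8
Q = np.random.rand(seq_len, d_k)
K = np.random.rand(seq_len, d_k)

scores = Q @ K.T                       # (5, 5); scores[0, 0] corresponds to q0·k0
print(scores.shape)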

4.3 softmax((Q @ K^T) / √d_k)

First, let's understand the Softmax function. Softmax is a function used to convert the elements of a vector into a probability distribution. Given an input vector z = [z₁, z₂, …, zₙ], the Softmax function converts each element zᵢ into a probability value pᵢ such that the sum of all probability values equals one.

[Figure: the Softmax formula, pᵢ = exp(zᵢ) / Σⱼ exp(zⱼ)]

For example, here the values q0·k0, q0·k1, ..., q0·kn are converted into probability values whose sum is 1.

In the self-attention mechanism, dividing by √d_k scales the attention scores, avoiding the gradient explosion problem caused by excessively large attention values in a deep Transformer model.

Here d_k is the dimension of the attention head. A dot product sums d_k terms, so its magnitude grows with d_k, and the dot products between different positions can differ greatly. Without scaling, large dot products become dominant after Softmax while small ones are pushed close to 0. This produces extreme differences in the attention weights, making some positions over- or under-influence others and hurting the model's ability to learn and generalize.

Dividing by √d_k scales the dot products so that their range stays stable, neither too large nor too small. The attention weights obtained after Softmax are then relatively balanced, which helps the model learn effective global relations and representations.
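
Putting the two pieces together, here is a minimal NumPy sketch of A = softmax((Q @ K^T) / √d_k), applied row by row (illustration only):

# Scaled attention weights: each row of A sums to 1.
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

seq_len, d_k = 5, 8
Q = np.random.rand(seq_len, d_k)
K = np.random.rand(seq_len, d_k)

A = softmax((Q @ K.T) / np.sqrt(d_k))         # (5, 5) attention weight matrix
print(A.sum(axis=1))                          # each entry is 1.0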

4.4 A @ V

[Figure: multiplying the attention weight matrix A by V to obtain the output Y]

As shown in the figure, the previous steps produced the weight matrix A; multiplying A by the Value matrix applies the attention weights to V. The yellow rectangle in the figure is the output matrix Y computed by the attention mechanism. Y has exactly the same dimensions as the input matrix X, which is why the Transformer attention block can be used as a plug-and-play module.

Here qk0 is a row of the weight matrix A, v0 is a column of the Value matrix, and qk0·v0 is the single number obtained from their dot product (that is, q0k0·v00 + q0k1·v10 + q0k2·v20 + ...).
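
To summarize the whole section, here is a minimal NumPy sketch of the complete single-head self-attention computation, Y = softmax((Q @ K^T) / √d_k) @ V (illustration only; all sizes and weights are assumptions). Because Y has the same shape as X, the block can be dropped into a larger network and stacked freely.

# Full single-head self-attention: Y = softmax((Q @ K^T) / sqrt(d_k)) @ V
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # scaled dot-product scores
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)                  # row-wise softmax
    return A @ V                                           # apply the weights to V

seq_len, d_model = 197, 64                                 # assumed toy sizes
X = np.random.rand(seq_len, d_model)
W_q, W_k, W_v = (np.random.randn(d_model, d_model) * 0.02 for _ in range(3))

Y = self_attention(X, W_q, W_k, W_v)
print(Y.shape == X.shape)                                  # True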

Disclaimer:
As an author, I attach great importance to my own works and intellectual property rights. I hereby declare that all my original articles are protected by copyright law, and no one may publish them publicly without my authorization.
My articles have been paid for publication on some well-known platforms. I hope readers can respect intellectual property rights and refrain from infringement. Any free or paid (including commercial) publishing of paid articles on the Internet without my authorization will be regarded as a violation of my copyright, and I reserve the right to pursue legal responsibility.
Thank you readers for your attention and support to my article!

Origin: blog.csdn.net/qq_35591253/article/details/131994377