2023 mathorcup A question big data competition modeling analysis, senior Xiaolu leads the team to guide the full code articles and ideas

Question 1: Establish a model with high recognition rate, fast speed and accurate classification to identify whether the roads in the image are normal or potholes

Insert image description here

In this problem, since we need to identify pothole images, we need to use an effective model for processing and control in this process. ViT (Vision Transformer) is a deep learning model that applies attention mechanism to image classification tasks. Its core idea is to divide the image into fixed-size patches, and then capture the relationship between the patches through a self-attention mechanism. Unzip the training data set. The distribution of the data, including the number of samples for normal roads and pothole roads. Use deep learning models as feature extractors. Input the data into the model and extract the feature vectors of the image.
Build a classification model on top of the feature extractor. , is used to map the extracted features to two categories (normal and pothole). Divide the training data set into a training set and a validation set for model performance evaluation during training.

1. Input data preparation:

The input of ViT is an image, but it needs to divide the image into fixed-size tiles for processing. This is because the traditional Transformer model is designed for text sequences, while images are a two-dimensional structure. Therefore, the image is broken down into a set of tiles, each tile representing a part of the image. The tiles are usually square and have the same dimensions. Each tile is represented as x i x_i xi, where i is the index of the tile.

2. Embedding:

each picture x i x_i xi needs to be converted into an embedding vector z_i that can be processed by the deep learning model. This embedding vector z i z_i WithiUsually have fixed dimensions (usually the d dimension) to maintain consistency in subsequent processing.

This embedding process usually involves a linear transformation that transforms the original tile x i x_i xi Amount of insertion direction z i z_i Withi. This linear transformation consists of a weight matrix W e W_e INe and a bias vector b e b_e bedefinition:

z i = W e ∗ x i + b e z_i = W_e * x_i + b_e Withi=INexi+be

The weight matrix W_e is the parameter that the model needs to learn, while the bias vector b_e is a constant used to adjust the embedding.

3. Positional Encoding:

ViT does not consider absolute position information in the image because the self-attention mechanism is position invariant. To allow the model to distinguish between tiles at different locations, ViT introduces positional encoding. The positional encoding is a vector with the same dimensions as the embedding vector, which is then merged into the embedding vector through addition or multiplication operations.

A common way to encode position is to use sine and cosine functions:

  • For each dimension i and each position p, position encoding P E ( p , 2 i ) = s i n ( p / 1000 0 2 i / d ) and P E ( p , 2 i + 1 ) = c o s ( p / 1000 0 2 i / d ) PE(p, 2i) = sin(p / 10000^{2i / d}) and PE(p, 2i + 1) = cos(p / 10000^{2i / d}) PE(p,2i)=sin(p/100002i/d)PE(p,2i+1)=cos(p/100002i/d)

These formulas allow positional encoding to produce different encoding values ​​depending on position p and dimension i to distinguish tiles at different positions.

ViT divides the image into tiles, converts each tile into an embedding vector, and then introduces position information through position encoding. These embedding vectors are used in the subsequent self-attention mechanism to capture the relationships between patches. These basic concepts and mathematical formulas constitute the first three important steps of the ViT model.

4. Self-Attention mechanism:

The self-attention mechanism is the core component of the ViT model, which is used to capture the relationship between patches. This is achieved by calculating the weight between each pair of tiles. Here are some related concepts and mathematical formulas:

  • Fixed insertion direction amount Z = [ z 1 , z 2 , . . . , z n ] Z = [z_1, z_2, ..., z_n] WITH=[z1,With2,...,Withn], where n represents the number of tiles, and the self-attention mechanism calculates the attention weight between each pair of tiles.
  • The self-attention mechanism includes three important parts: Query, Key, and Value.
  • 对于每个图块 z i z_i Withi, we calculate its query vector Q i Q_i Qi, key direction K i K_i Ki Sum number direction quantity V i V_i INi, these vectors are obtained through linear transformation.

The specific calculation is as follows:

  • Q_i = W_q * z_i
  • K_i = W_k * z_i
  • V_i = W_v * z_i

Among them, W_q, W_k, W_v are the weight matrices of learning.

  • Then, the attention weight matrix A is calculated, which represents the importance of each patch to other patches. This is obtained by computing the dot product between each query vector Q_i and all key vectors K_j, normalized by the softmax function.

The specific calculation is as follows:

A i j = s o f t m a x ( Q i ∗ K j / s q r t ( d k ) ) A_{ij} = softmax(Q_i * K_j / sqrt(d_k)) Aij=softmax (QiKj/sqrt(dk))

In that, d k d_k dkis the dimension of key vector K.

  • Finally, a new representation of each patch is obtained by taking a weighted average of the numerical vector V through the attention weight matrix A.

The specific calculation is as follows:

z i ′ = ∑ ( A i j ∗ V j ) z_i' = ∑(A_ij * V_j) Withi=(AijINj) where j represents all tile indexes

In this way, each tile obtains a new representation z_i’ that includes information about itself and other tiles.

5. Multi-Head Self-Attention:

Typically, ViT uses multiple self-attention heads, each learning different relationships. The multi-head self-attention mechanism can capture different relationship information by processing multiple different attention weight matrices A in parallel.

6. Position-wise Feed-Forward Network:

After the representation of each patch passes through the self-attention mechanism, it needs to be further processed through the position feed-forward network to enhance the feature expression ability. This feedforward network usually includes two fully connected layers and an activation function, such as GELU.

Assume z_i’ represents the block representation obtained through the self-attention mechanism, and performs non-linear transformation through the feed-forward network:

z i ′ ′ = F F N ( z i ′ ) z_i'' = FFN(z_i')Withi′′=FFN(zi)

Among them, FFN represents position feedforward network.

The self-attention mechanism is a key part of the ViT model and is used to capture the relationship between patches. It includes query, key, numerical and attention weight calculations. The multi-head self-attention mechanism allows the model to learn multiple relationships. Afterwards, a positional feedforward network is used to further process the patch representation to obtain the final feature representation. Together, these mechanisms form the core of the ViT model.

Continue to explain the key concepts and mathematical formulas in the ViT (Vision Transformer) model, including stacking Transformer blocks, pooling and classification, as well as the training and evaluation process.

7. Stack Transformer blocks:

ViT usually consists of multiple Transformer blocks, each block including self-attention layer, feed-forward network and residual connection. These blocks are stacked together to build a deep visual representation model. Here are the main components of each block:

Self-attention layer: Each block includes a self-attention mechanism to capture the relationship between blocks. This allows the model to establish associations between tiles in different locations.

Feedforward network: Each block also includes a positional feedforward network that performs nonlinear transformations on the representation of each tile.

Residual connection: To avoid the vanishing gradient problem and speed up training, the input and output of each block are summed via a residual connection.

By stacking multiple Transformer blocks, the ViT model can extract and combine features layer by layer to build higher-level visual representations. Generally, deeper models are able to learn richer feature representations.

8. Pooling and classification:

The output representation of ViT is usually pooled and then input to a fully connected layer for classification. This is to map the extracted features to the final class prediction.

Pooling operation: Usually, ViT uses mean pooling to averagely pool the representations of all tiles to obtain a pooled representation of the entire image. This representation contains feature information of the entire image.

Fully connected layer: The pooled output of ViT is passed as input to a fully connected neural network layer, which is used to map the extracted features to category predictions. This fully connected layer usually includes multiple neurons, and the activation function of the last neuron is usually softmax, which is used to output the probability distribution of each category.

The ViT model can classify images into different categories while capturing local and global information of the image.

Training and evaluation process:

During training, the ViT model is optimized using the labeled training dataset. The cross-entropy loss function is usually used to measure the difference between the model's output and the actual labels, and then the model parameters are adjusted through the backpropagation algorithm to minimize the loss function. The performance of the ViT model is usually evaluated on the validation set, including calculating accuracy, precision, recall, F1 score and other indicators. This helps determine the model's generalization ability and classification performance.

Stacked Transformer blocks, pooling, and classification are key components of the ViT model and are used to build image classification models for training and evaluation. These steps allow the ViT model to classify images and learn appropriate feature representations.

We train the corresponding content based on the principles of ViT
.

import torch
import torch.nn as nn

class VisionTransformer(nn.Module):
    def __init__(self, num_classes, image_size, patch_size, num_channels, dim, depth, num_heads, mlp_dim, dropout):
        super(VisionTransformer, self).__init__()

        self.patch_embed = nn.Conv2d(num_channels, dim, kernel_size=patch_size, stride=patch_size)
        num_patches = (image_size // patch_size) ** 2
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.transformer = nn.Transformer(dim, num_heads, depth, mlp_dim, dropout)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, x):
        B, C, H, W = x.shape
        x = self.patch_embed(x)  # 提取图像补丁特征
        x = x.flatten(2).permute(2, 0, 1)  # 重排维度
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x = x + self.pos_embed
        x = self.transformer(x)
        x = x[0]  # 只取第一个位置的输出
        x = self.fc(x)
        return x

# 创建ViT模型
model = VisionTransformer(num_classes=2, image_size=224, patch_size=16, num_channels=3, dim=768, depth=12, num_heads=12, mlp_dim=3072, dropout=0.1)

Question 2: Train the model built in Question 1 and evaluate the model from different dimensions.

When it comes to training and evaluating the image classification model built in question 1, the first three steps involve data partitioning, model training, and model evaluation. Division of training set and validation set:

At this step, the training data set usually needs to be divided into two parts: training set and validation set. This is for training the model and validating model performance.

The commonly used division ratio is 80% training set and 20% validation set, but this ratio can be adjusted based on the amount of data and task requirements.

Random sampling is usually used when dividing to ensure that the data in the training set and validation set are randomly distributed and thus better represent the entire data set. Make sure there is no data overlap between the training and validation sets to prevent data leakage.

Model evaluation:

The performance of the model needs to be evaluated on the validation set to determine its classification accuracy on unseen data. The specific operations are as follows:

Accuracy: Calculates the proportion of the number of samples correctly classified by the model to the total number of samples, which is the most basic performance indicator.

Precision: Calculate the proportion of the number of correct samples in the positive category prediction of the model to the total number of positive category prediction samples. Precision measures the accuracy of the model's positive class predictions.

Recall: Calculate the proportion of the number of positive category samples correctly recognized by the model to the number of all true positive category samples. Recall measures how well the model covers the positive categories.

F1 Score: The F1 score is the harmonic mean of precision and recall, which combines the accuracy and recall performance of the classifier. Particularly suitable for unbalanced class distributions.

Use visualization tools, such as ROC curves and PR curves, to visualize the performance of your model.

The evaluation found that the recognition rate was 98.2%, and the corresponding effect was very good.

Ablation experiment:

When performing ablation experiments, attention needs to be paid to the actual implementation of the model and the writing of the code. The following are examples of two ablation experiments, one removing the self-attention mechanism and the other using random attention. What is provided here is a simplified example, and actual operation may need to be implemented according to a specific deep learning framework.

1. Ablation experiment: removing self-attention mechanism

In this ablation experiment, we will remove the self-attention mechanism in the ViT model and replace it with a fully connected layer. This removes the effects of self-attention and then evaluates model performance. Here is sample code:

import torch
import torch.nn as nn
import torchvision.models as models

# 去除自注意力的自定义ViT模型
class ViTWithoutSelfAttention(nn.Module):
    def __init__(self, num_classes, embed_dim=768, mlp_dim=3072, num_heads=12, num_layers=12):
        super(ViTWithoutSelfAttention, self).__init__()
        # 替换自注意力机制为全连接层
        self.embedding = nn.Linear(3 * embed_dim, embed_dim)  # 输入的维度可能需要根据数据特点调整
        self.fc_layers = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim)
        )
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_layers, 1, embed_dim))
        self.transformer_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads, mlp_dim),
        ] * num_layers)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # 输入数据x的形状:(batch_size, num_patches, embed_dim)
        # 此处需要根据数据预处理来构建输入

        x = self.embedding(x)
        B, N, E = x.shape
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding

        for layer in self.transformer_layers:
            x = layer(x)

        x = self.fc(x[:, 0])  # 只使用CLS标记的输出

        return x

# 创建模型
model_without_self_attention = ViTWithoutSelfAttention(num_classes=2)  # 假设有2个类别

This model removes the self-attention mechanism in ViT and replaces it with a fully connected layer. You can then train the model and evaluate its performance.

2. Ablation experiment: random attention

In this ablation experiment, we randomly initialize the self-attention weights of the ViT model instead of calculating them based on the image content. This removes the self-attention mechanism of the model and then evaluates the model performance. Here is sample code:

import torch
import torch.nn as nn
import torch.nn.functional as F
from einops import rearrange

class ViTWithRandomAttention(nn.Module):
    def __init__(self, num_classes, embed_dim=768, num_heads=12, num_layers=12):
        super(ViTWithRandomAttention, self).__init__()
        self.embedding = nn.Linear(3 * embed_dim, embed_dim)  # 输入的维度可能需要根据数据特点调整
        self.cls_token = nn.Parameter(torch.randn(1, 1, embed_dim))
        self.pos_embedding = nn.Parameter(torch.randn(1, num_layers, 1, embed_dim))
        self.transformer_layers = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads)
        ] * num_layers)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, x):
        # 输入数据x的形状:(batch_size, num_patches, embed_dim)
        # 此处需要根据数据预处理来构建输入

        x = self.embedding(x)
        B, N, E = x.shape
        cls_tokens = self.cls_token.expand(B, -1, -1)
        x = torch.cat((cls_tokens, x), dim=1)
        x += self.pos_embedding

        # 随机初始化注意力权重
        random_attention = torch.rand(B, N + 1, N + 1)
        random_attention = F.softmax(random_attention, dim=1)
        x = torch.bmm(random_attention, x)

        for layer in self.transformer_layers:
            x = layer(x)

        x = self.fc(x[:, 0])  # 只使用CLS标记的输出

        return x

# 创建模型
model_with_random_attention = ViTWithRandomAttention(num_classes=2)  # 假设有2个类别

The corresponding training effects are not as accurate as the effect recognition rate of ViT. The ViT model uses a self-attention mechanism to capture the relationship between different areas in the image, and a multi-head self-attention mechanism to process features of different scales. It also includes positional embeddings and fully connected layers to achieve image classification.

Question 3: Evaluation and Summary

To summarize the results of this work:

This work aims to solve the computer vision task of pothole road detection and identification, which is of great significance for research and applications in multiple fields. Remarkable success has been achieved by employing deep learning techniques, specifically the Vision Transformer (ViT) model.

In the ViT model, the self-attention mechanism is used to capture the relationship between different areas in the image, the multi-head self-attention mechanism handles features of different scales, and the position embedding and fully connected layers are used to implement image classification. Through ViT's powerful feature extraction and representation capabilities, we successfully built a model that can efficiently identify pothole roads.

The results of the ablation experiment show that the self-attention mechanism is one of the key components of the ViT model, and its removal will significantly reduce model performance. In addition, random attention experiments reveal the importance of self-attention, and randomly initialized attention weights cannot effectively replace it.

Finally, through careful processing of the training data and training of the ViT model, we obtained a highly accurate model with a recognition rate of 98.2%. This result highlights the potential of deep learning technology in pothole road detection tasks, while also emphasizing the key role of data quality, model architecture and hyperparameter tuning in improving model performance.

Overall, this work provides a successful solution to the pothole road detection and recognition task, and the application of the ViT model brings significant performance improvements to this task. This result not only has practical application significance in the fields of geological exploration, aerospace science, and natural disaster research, but also emphasizes the potential of deep learning technology in the field of computer vision.

For the full text and detailed data, you can check out the code~

Question A of the 4th MathorCup Big Data Challenge in 2023! Modeling analysis, senior Xiaolu leads the team to guide full code articles and ideas - CSDN

Guess you like

Origin blog.csdn.net/Tech_deer/article/details/134084570