Application of Computer Vision 11 - Street House Number Recognition with a Convolutional Neural Network and Attention Mechanism Based on the PyTorch Framework

Hello everyone, I am Weixue AI. Today I will introduce application of computer vision 11: using a convolutional neural network and an attention mechanism based on the PyTorch framework to recognize street house numbers. In this article, we use PyTorch to quickly build and train a convolutional neural network (CNN) model that accurately identifies street house numbers. We also introduce the attention mechanism, a method that imitates human visual attention and is widely used in image processing tasks. By introducing an attention mechanism, the model can automatically focus on the regions of the image related to the house number, improving the accuracy and robustness of recognition.

1. Project introduction

Street house number recognition is an important task in computer vision. By automatically recognizing street house numbers, street images can be better understood and analyzed. This article will introduce how to use the PyTorch framework and an attention mechanism, combined with the SVHN dataset, to classify and recognize street house numbers.

2. The SVHN dataset

SVHN (Street View House Numbers) is a publicly available, large-scale dataset of street digit images. It contains house number images taken from Google Street View and can be used to train and test machine learning models that automatically recognize street house numbers.

2.1 Dataset download and loading

First, we need to download and load the SVHN dataset. In PyTorch, we can do this with the datasets module of the torchvision library.

Download the dataset and view a few samples:

import numpy as np
import matplotlib.pyplot as plt
from torchvision import datasets

# Download the SVHN training split
train_dataset = datasets.SVHN(root='./data', split='train', download=True)

# Take the first ten images and labels
images = train_dataset.data[:10]   # shape: (10, 3, 32, 32)
labels = train_dataset.labels[:10]

# Convert from (N, C, H, W) to (N, H, W, C) for plotting
images = np.transpose(images, (0, 2, 3, 1))

# Plot the images
fig, axs = plt.subplots(2, 5, figsize=(12, 6))
axs = axs.ravel()

for i in range(10):
    axs[i].imshow(images[i])
    axs[i].set_title(f"Label: {labels[i]}")
    axs[i].axis('off')

plt.tight_layout()
plt.show()

(Figure: the first ten SVHN training images with their labels)

Loading and preprocessing the dataset so it can be fed directly into model training:

import torch
from torchvision import datasets, transforms

# Data preprocessing: convert to tensor and normalize each channel to roughly [-1, 1]
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

# Download and load the SVHN dataset
trainset = datasets.SVHN(root='./data', split='train', download=True, transform=transform)
testset = datasets.SVHN(root='./data', split='test', download=True, transform=transform)

trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)
testloader = torch.utils.data.DataLoader(testset, batch_size=64, shuffle=False)
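
As a quick sanity check, we can pull one batch from the training loader and confirm that the shapes match what the network below expects (a minimal sketch using the loaders defined above):

# Sanity check (a sketch): pull one batch from the loader and inspect its shapes
images, labels = next(iter(trainloader))
print(images.shape)   # torch.Size([64, 3, 32, 32]) -- a batch of normalized RGB images
print(labels.shape)   # torch.Size([64]) -- one digit class (0-9) per image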

3. Building the convolutional network

We use PyTorch to build a convolutional neural network. A convolutional neural network (CNN) is a neural network mainly used to process data with a grid-like structure, such as images (a 2D grid of pixels) or text (a 1D grid of words).

3.1 Definition of network structure

The following is a basic convolutional neural network model, which contains two convolutional layers, two max pooling layers and two fully connected layers.

from torch import nn

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2))
        self.drop_out = nn.Dropout()
        self.fc1 = nn.Linear(8 * 8 * 64, 1000)  # two 2x2 poolings turn 32x32 inputs into 8x8 feature maps
        self.fc2 = nn.Linear(1000, 10)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.drop_out(out)
        out = self.fc1(out)
        return self.fc2(out)
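
To verify the flattened feature size, we can push a random batch through the model (a small sketch, assuming 32x32 SVHN inputs): each 2x2 max pooling halves the spatial resolution, so 32x32 becomes 16x16 and then 8x8, which is why fc1 expects 8 * 8 * 64 = 4096 input features.

# Shape check (a sketch): SVHN images are 3-channel 32x32
model = ConvNet()
dummy = torch.randn(4, 3, 32, 32)   # a fake batch of 4 images
print(model(dummy).shape)           # torch.Size([4, 10]) -- one score per digit class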

4. Adding the attention mechanism

The attention mechanism is a technique that can improve model performance. In our model, we will add attention layers to help the network focus on the important parts of the input image.

4.1 Definition of Attention Layer

We will implement a basic channel attention layer. It computes one weight per channel, expands these weights to the same shape as the input, and multiplies the input element-wise by them, so that informative channels are emphasized and less useful ones are suppressed.

The mathematical principle of the attention layer can be expressed by the following formulas.

Given an input tensor $x \in \mathbb{R}^{b \times c \times h \times w}$, where $b$ is the batch size, $c$ is the number of channels, $h$ is the height, and $w$ is the width, the attention mechanism works in two stages: feature extraction and feature weighting.

1. Feature extraction stage: $x$ is passed through an adaptive average pooling layer (AdaptiveAvgPool2d), which averages over the height and width dimensions and yields a tensor $y$ of shape $b \times c \times 1 \times 1$. Adaptive average pooling is used here so that $y$ has the same shape for inputs of different spatial sizes.

2. Feature weighting stage: $y$ is then transformed by a fully connected layer (Linear) followed by a ReLU activation, which reduces the number of channels while preserving the important features. A second fully connected layer followed by a Sigmoid activation produces the weight tensor $y' \in \mathbb{R}^{b \times c \times 1 \times 1}$, giving one weight per channel. Each weight lies between 0 and 1 and controls how much that channel contributes to subsequent computations. The weight tensor $y'$ is expanded to the same shape as the input $x$ and multiplied with it element-wise, yielding the attention-weighted feature tensor. This achieves adaptive channel-wise weighting of the input.

Mathematically:

$$
y = \text{AdaptiveAvgPool2d}(x), \qquad
y' = \text{Sigmoid}(\text{Linear}(\text{ReLU}(\text{Linear}(y)))), \qquad
\text{output} = x \odot y'
$$

where $\odot$ denotes element-wise multiplication.

The code for the attention layer:

class AttentionLayer(nn.Module):
    def __init__(self, channel, reduction=16):
        super(AttentionLayer, self).__init__()
        # Squeeze: global average pooling over the spatial dimensions
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Excitation: two fully connected layers produce one weight per channel
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel, bias=False),
            nn.Sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x).view(b, c)    # (b, c): one value per channel
        y = self.fc(y).view(b, c, 1, 1)    # (b, c, 1, 1): per-channel weights in [0, 1]
        return x * y.expand_as(x)          # reweight each channel of the input
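
A small sketch of how the layer behaves: the output keeps the input's shape, with every channel rescaled by its learned weight in [0, 1]. The channel count and feature-map size below are only illustrative.

# Sketch: the attention layer preserves the shape of its input
att = AttentionLayer(channel=32, reduction=16)
feat = torch.randn(4, 32, 16, 16)   # e.g. features after the first conv block
print(att(feat).shape)              # torch.Size([4, 32, 16, 16])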

4.2 Adding an attention layer to the network

We add the attention layer to the ConvNet model:

class ConvNet(nn.Module):
    def __init__(self):
        super(ConvNet, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            AttentionLayer(32))
        self.layer2 = nn.Sequential(
            nn.Conv2d(32, 64, kernel_size=5, stride=1, padding=2),
            nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            AttentionLayer(64))
        self.drop_out = nn.Dropout()
        self.fc1 = nn.Linear(8 * 8 * 64, 1000)
        self.fc2 = nn.Linear(1000, 10)

    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.drop_out(out)
        out = self.fc1(out)
        return self.fc2(out)
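
Since the attention layers only rescale channels and do not change the spatial size, the 8 * 8 * 64 flatten is unchanged. A quick sketch to confirm the attention-augmented model still maps a batch of 32x32 images to 10 class scores, and to count its parameters:

# Sketch: forward pass and parameter count of the attention-augmented model
model = ConvNet()
print(model(torch.randn(2, 3, 32, 32)).shape)       # torch.Size([2, 10])
print(sum(p.numel() for p in model.parameters()))   # total number of trainable parameters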

5. Model training and testing

Next, we will train and test the model.

5.1 Model Training

import torch.optim as optim

model = ConvNet()
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for epoch in range(10):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        # forward + backward + optimize
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        # print statistics
        running_loss += loss.item()
        if i % 20 == 19:    # print every 20 mini-batches
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 20))
            running_loss = 0.0

print('Finished Training')
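
The loop above runs on the CPU. If a CUDA GPU is available, a common variant (just a sketch, not used in the rest of the article; the test loop in 5.2 would then also need its batches moved to the same device) moves the model and every batch to the GPU:

# Optional GPU variant of one training epoch (a sketch)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = ConvNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

for inputs, labels in trainloader:
    inputs, labels = inputs.to(device), labels.to(device)
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()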

5.2 Model testing

model.eval()   # switch off dropout for evaluation
correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = model(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the test images: %.2f %%' % (
    100 * correct / total))
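
After evaluation, the trained weights can be saved and reloaded later. A minimal sketch (the file name svhn_attention_cnn.pth is only an example):

# Save the trained weights (file name is only an example)
torch.save(model.state_dict(), 'svhn_attention_cnn.pth')

# Later: rebuild the model and load the weights back before inference
restored = ConvNet()
restored.load_state_dict(torch.load('svhn_attention_cnn.pth'))
restored.eval()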

6. Conclusion

This article is like a fantastic map that leads you into the amazing world of computer vision tasks. In this world, you will go hand in hand with two powerful partners, PyTorch and the attention mechanism, to explore the mysteries of street house number recognition.

Imagine you're on a busy street, with house numbers challenging your eyesight. But you have a magical vision and can easily identify every number. This extraordinary ability is the magic of computer vision tasks.

We work with PyTorch, a powerful tool that is like a clever magic wand, helping us build powerful neural network models. With PyTorch, we can flexibly define the model structure, set various parameters, and perform efficient training and inference.

We encounter the attention mechanism, which is like a bright beacon illuminating the direction we are going. The attention mechanism enables the neural network to focus on important regions of the image, thereby improving recognition accuracy. With this mechanism, the model pays closer attention to the location and details of the street house number and recognizes it better.

The SVHN dataset is the guide for our expedition: it contains a large number of real-world street house number images. By feeding in this data, we let the model learn from it and improve its recognition ability. These images lead us through the corners of the city, through the challenges and changes of different scenes.

Through this article, we not only gain a deeper understanding of the nature of computer vision tasks, but also get inspired. Like a fantastic adventure, we have learned how to use PyTorch and the attention mechanism to recognize street house numbers. Let's follow this fascinating journey together, broaden our horizons, and pursue new possibilities!
