Research and application of fine-grained classification based on deep learning

This paper introduces the classic network structures and development history of deep learning image classification, and summarizes the attention mechanisms used in fine-grained image classification. It then presents the model and algorithms the Autohome team used in the CVPR 2022 fine-grained classification competition, along with competition experience, and describes the application of this model in Autohome's car series recognition business. It should be a useful reference for readers who want to understand image classification tasks, related competition techniques, and business applications.

Neural Networks for Image Classification Based on Deep Learning

Since the birth of AlexNet[1], which achieved a top-1 accuracy of 62.5% in the ImageNet[2] competition, surpassing traditional algorithms such as SIFT+FVs[3] by 8.2 percentage points, deep neural networks have been the leading approach to image classification. The major architectures appeared in succession: VGG[4], ResNet[5], Inception[6-8], DenseNet[9], and others. In 2018, NASNet[10] from Google Brain pushed network structure design from manual craft into the era of automated search. In 2020, Google proposed the Vision Transformer (ViT)[13], which introduced the Transformer structure from natural language processing into image classification, bringing the field into the Transformer era.

VGG[4] was jointly developed by researchers from Google DeepMind and the University of Oxford. It replaces 7x7 convolution kernels with stacks of cascaded 3x3 kernels, which greatly reduces the network's parameter count while preserving the receptive field. Another contribution of VGG[4] is improving classification accuracy by deepening the network: its 19-layer variant reached a top-1 accuracy of 74.5% on the ImageNet[2] dataset.

In 2015, Kaiming He, Jian Sun, and colleagues, then researchers at Microsoft, proposed ResNet[5]. By introducing the residual structure shown in Figure 1, it effectively alleviated the vanishing and exploding gradients encountered when training deep neural networks, and solved the "degradation" problem in which classification accuracy worsens as the network deepens. For the first time, a 152-layer ultra-deep network was trained on the ImageNet[2] dataset, achieving a top-1 accuracy of 78.57% and winning first place in the classification track of the 2015 ImageNet[2] competition.

Figure 1: Residual module
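
To make the residual idea concrete, here is a minimal PyTorch sketch of the identity-shortcut block in Figure 1 (simplified: the real ResNet[5] basic block also handles stride and channel changes with a projection shortcut):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A minimal residual block: output = F(x) + x (see Figure 1)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x                       # the shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)   # add the residual, then activate
```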

While researchers such as Kaiming He improved classification by deepening networks, researchers at Google made great progress on network width, proposing the InceptionV1-V4 structures between 2014 and 2016. The design idea of InceptionV1[6] is to approximate a sparse network structure with dense components. To this end, Google researchers designed the basic Inception structure shown in Figure 2, which uses multiple parallel convolutions and max pooling; while approximating a sparse structure, it also introduces multi-scale features. Besides replacing 5x5 convolutions with cascaded 3x3 convolutions, following VGG[4], InceptionV2[7] added Batch Normalization (BN) to normalize the data, reaching a top-1 accuracy of 74.8%. InceptionV3[7] proposed asymmetric factorization, an effective way to reduce the parameter count: an nxn convolution is decomposed into a cascade of 1xn and nx1 convolutions, bringing top-1 accuracy to 78.8%. InceptionV4[8] integrates the residual structure of ResNet[5] into the Inception module, which greatly speeds up training, with top-1 accuracy reaching 80.10%.

Figure 2: Inception module
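
As an illustration of asymmetric factorization, the sketch below (a hedged PyTorch example, not code from the papers) decomposes an nxn convolution into a 1xn followed by an nx1 convolution, the trick InceptionV3 uses to cut parameters:

```python
import torch
import torch.nn as nn

class AsymmetricConv(nn.Module):
    """Factorizes an n x n convolution into 1 x n followed by n x 1."""
    def __init__(self, channels: int, n: int = 7):
        super().__init__()
        self.conv_1xn = nn.Conv2d(channels, channels, kernel_size=(1, n), padding=(0, n // 2))
        self.conv_nx1 = nn.Conv2d(channels, channels, kernel_size=(n, 1), padding=(n // 2, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two cascaded asymmetric convolutions cover the same n x n
        # receptive field with roughly 2*n*C^2 weights instead of n*n*C^2.
        return self.conv_nx1(self.conv_1xn(x))
```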

After great progress on network depth and width, researchers began to consider reusing network features to improve classification. A typical example is DenseNet[9], the best paper of CVPR 2017. ResNet[5] proved that residual shortcut connections effectively address vanishing gradients and network degradation. As shown in Figure 3, DenseNet borrows this idea and uses shortcut connections between all layers: in an L-layer network, the Nth layer fuses the features of all N-1 preceding layers, and its own features are in turn passed on to the subsequent L-N layers for fusion. Feature reuse avoids repeatedly extracting redundant features, improving classification accuracy while effectively reducing the number of parameters. DenseNet[9] reached a top-1 accuracy of 79.2% on the ImageNet[2] dataset.
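
The dense connectivity pattern can be sketched in a few lines of PyTorch. This is a simplified illustration (the actual DenseNet[9] layer uses a BN-ReLU-1x1-conv bottleneck before the 3x3 convolution):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer of a dense block: consumes all earlier features, emits k new ones."""
    def __init__(self, in_channels: int, growth_rate: int):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1, bias=False)

    def forward(self, features: list) -> torch.Tensor:
        x = torch.cat(features, dim=1)           # fuse features of all preceding layers
        return self.conv(torch.relu(self.bn(x)))

class DenseBlock(nn.Module):
    def __init__(self, in_channels: int, num_layers: int, growth_rate: int = 32):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(in_channels + i * growth_rate, growth_rate)
            for i in range(num_layers)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            features.append(layer(features))     # each new layer sees every earlier one
        return torch.cat(features, dim=1)
```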

At this point, work on hand-designed network structures had entered a period of full bloom. Meanwhile, researchers at Google Brain proposed Neural Architecture Search (NAS), and neural network design entered the era of automation. Because NAS requires large computing resources, early NAS searched for a basic convolutional structural unit (cell) on a small dataset such as CIFAR-10, then replicated these cells and "migrated" them to large datasets such as ImageNet[2]. As shown in Figure 4, the search process is controlled by an RNN: to build a cell, the controller selects two states from a "hidden state" list, which initially contains the outputs h_i and h_{i-1} of the previous two cells (gray boxes in Figure 4); selects two operations from those shown in Figure 5 (yellow boxes in Figure 4) to apply to the selected states; and finally fuses the results with addition (add) or concatenation (concat) (green box in Figure 4). New units are added iteratively until their number reaches a preset N. Using this search algorithm, NASNet[10] reached a top-1 accuracy of 82.7% on the ImageNet[2] dataset, matching and surpassing hand-designed network structures.
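
The cell-construction loop can be illustrated with plain Python. Note that this sketch samples choices uniformly at random purely to show the structure of the search space; in NASNet[10] an RNN controller trained with reinforcement learning makes these choices:

```python
import random

OPS = ["sep_conv_3x3", "sep_conv_5x5", "avg_pool_3x3", "max_pool_3x3", "identity"]
COMBINE = ["add", "concat"]

def sample_cell(num_units: int = 5) -> list:
    """Samples a NASNet-style cell description (illustration only)."""
    hidden_states = ["h_prev", "h_prev_prev"]    # the two inputs to the cell
    cell = []
    for _ in range(num_units):
        in1, in2 = random.choice(hidden_states), random.choice(hidden_states)
        op1, op2 = random.choice(OPS), random.choice(OPS)
        combine = random.choice(COMBINE)
        new_state = f"{combine}({op1}({in1}), {op2}({in2}))"
        hidden_states.append(new_state)          # the new state becomes selectable
        cell.append(new_state)
    return cell
```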

NASNet[10] opened the era of neural architecture search and automated network structure design, but its shortcomings are also obvious: the search space of NASNet[10] is still set by hand, and the algorithm only searches within that given space. Addressing this problem, Kaiming He's team at FAIR proposed designing the search space itself in 2020. In the RegNet[12] paper, the search space is treated as part of network design: as shown in Figure 6, by continuously refining the design space, one simultaneously obtains an optimal search space and the optimal network structure within it.

In 2020, Google proposed the Vision Transformer (ViT)[13], which introduced the Transformer from NLP (natural language processing) into the field of vision. The image is divided into patches of equal resolution, and each patch is processed as a token, analogous to a word in NLP. The Transformer's self-attention mechanism greatly improved classification performance, reaching a top-1 accuracy of 88.55% on the ImageNet[2] dataset.
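
The patch-to-token step can be sketched compactly in PyTorch. A strided convolution whose kernel size equals its stride is equivalent to cutting the image into non-overlapping patches and linearly projecting each one (the default sizes below follow the common ViT-Base configuration):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into fixed-size patches and projects each into a token."""
    def __init__(self, patch_size: int = 16, in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)                      # (B, D, H/P, W/P)
        return x.flatten(2).transpose(1, 2)   # (B, num_patches, D): one token per patch
```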

Figure 3: DenseNet

Figure 4: RNN controller

Figure 5: NASNet basic convolutional structural unit

Figure 6: Optimizing the RegNet design space

Fine-grained image classification based on deep learning

Over the past decade, deep learning has made great progress in image classification, but the category granularity in common datasets such as ImageNet[2] is still relatively coarse. For example, the category "dog" can be further subdivided into subcategories such as Labrador, Golden Retriever, and Border Collie. Coarse-grained classification is increasingly unable to meet the needs of real production and life, and academia and industry alike hope that deep learning can play an important role in fine-grained classification tasks. Unlike coarse-grained classification, fine-grained classification depends on differences in object details and requires the model to focus on them; to this end, the research community has proposed the "attention" mechanism.

In recent years, attention mechanisms have been widely adopted in fine-grained classification. Modules such as SE[14], GE[15], CBAM[16], and SK[17] have appeared and been integrated into various network structures, effectively improving classification performance.

The SE module appeared relatively early: it was proposed by Momenta in 2017, and SENet, built from SE modules, won the final ImageNet[2] classification competition that year. A convolutional neural network (CNN) fuses both spatial and channel information, while the SE module focuses on the channel dimension. As shown in Figure 7, a Squeeze operation is first applied to the feature map U to obtain a channel descriptor, which summarizes the response distribution of each channel. An Excitation operation on this descriptor then produces a weight for each channel; weighting the feature map with this vector strengthens high-weight channels and suppresses low-weight ones, implementing a channel attention mechanism.
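
A minimal PyTorch sketch of the SE[14] module follows; the reduction ratio of 16 is the commonly used default:

```python
import torch
import torch.nn as nn

class SEModule(nn.Module):
    """Squeeze-and-Excitation: reweight channels by their global response."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        s = x.mean(dim=(2, 3))             # squeeze: one descriptor per channel
        w = self.fc(s).view(b, c, 1, 1)    # excitation: per-channel weights in (0, 1)
        return x * w                       # emphasize informative channels
```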

After the SE[14] module implemented channel attention, Momenta proposed the spatial attention module GE[15] in 2018. As shown in Figure 8, GE[15] implements attention over spatial receptive-field regions using custom Gather and Excite operators.
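
As a rough illustration, the sketch below implements the parameter-free flavor of this idea: gather spatial context by pooling over a neighborhood, then excite by upsampling the context and gating the original features. The extent value is an illustrative assumption, and GE[15] also defines learnable variants:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatherExciteNoParam(nn.Module):
    """Parameter-free Gather-Excite sketch: pool context, then gate spatially."""
    def __init__(self, extent: int = 8):
        super().__init__()
        self.extent = extent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Gather: aggregate each extent x extent spatial neighbourhood.
        context = F.avg_pool2d(x, kernel_size=self.extent, stride=self.extent)
        # Excite: broadcast the context back and gate the original features.
        gate = torch.sigmoid(F.interpolate(context, size=x.shape[-2:], mode="nearest"))
        return x * gate
```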

In 2018, another module appeared that fuses channel and spatial attention simultaneously: CBAM[16]. As shown in Figure 9, for any feature map, the CBAM module extracts channel attention and then spatial attention, weighting the feature map with each in turn, thereby realizing both channel and spatial attention.
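
A condensed PyTorch sketch of this two-step scheme (channel attention from average- and max-pooled descriptors through a shared MLP, then spatial attention from a 7x7 convolution over channel-wise statistics):

```python
import torch
import torch.nn as nn

class CBAM(nn.Module):
    """CBAM sketch: channel attention followed by spatial attention."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial_conv = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        # Channel attention: shared MLP over avg- and max-pooled descriptors.
        avg = self.channel_mlp(x.mean(dim=(2, 3)))
        mx = self.channel_mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # Spatial attention: avg and max across channels, then a 7x7 conv.
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial_conv(s))
```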

After channel and spatial attention had been introduced, the SK[17] module brought multi-scale features, a common tool in computer vision, into the attention mechanism. As shown in Figure 10, the SK module first processes the feature map with two convolution kernels of different sizes and adds the results; after a series of operations it obtains weights a and b for the two paths, weights each path's feature map accordingly, and sums them to obtain the final feature map.
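
A two-branch PyTorch sketch of the SK idea follows (simplified: the published module uses grouped/dilated convolutions and supports more than two branches):

```python
import torch
import torch.nn as nn

class SKModule(nn.Module):
    """Selective Kernel sketch with a 3x3 and a 5x5 branch."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv5 = nn.Conv2d(channels, channels, 5, padding=2)
        d = max(channels // reduction, 8)
        self.fc = nn.Linear(channels, d)
        self.fc_a = nn.Linear(d, channels)   # logits for the 3x3 branch
        self.fc_b = nn.Linear(d, channels)   # logits for the 5x5 branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        u3, u5 = self.conv3(x), self.conv5(x)
        z = self.fc((u3 + u5).mean(dim=(2, 3)))      # fuse branches, then squeeze
        logits = torch.stack([self.fc_a(z), self.fc_b(z)], dim=0)
        a, bw = torch.softmax(logits, dim=0)         # per-channel weights, a + bw = 1
        return u3 * a.view(b, c, 1, 1) + u5 * bw.view(b, c, 1, 1)
```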

Figure 7: SE (Squeeze-and-Excitation) module

Figure 8: GE (Gather-Excite) module

Figure 9: CBAM module

Figure 10: SK module

Application of deep learning fine-grained image classification in the CVPR competition

On June 19, CVPR 2022 opened in the United States. As one of the world's three top computer vision conferences, CVPR is known as the "Oscars" of the field. In the CVPR 2022 Sorghum-100 Cultivar Identification-FGVC 9 challenge (fine-grained image classification for sorghum variety identification), held on Kaggle as part of the conference workshops, the Autohome team won second place, a breakthrough in the company's history.

Fine-grained image classification has long been a research hotspot in computer vision. The main difficulty is that fine-grained labeled images have small inter-class distances and large intra-class distances, making some images hard to distinguish even by eye. For example, in this FGVC9 competition, the sorghum variety identification and herbarium identification tracks both require strong domain knowledge to determine an image's category. As shown in Figure 11, the distance between samples of the same color within a circle is the intra-class distance, and the distance between samples of different colors is the inter-class distance.

Figure 11: Inter-class and intra-class distance

In this competition, RegNetY-16.0GF was used as the backbone network. Large-resolution images played a major role in improving accuracy: when the image resolution was increased from 512 to 960, accuracy on the private leaderboard improved from 84.1 to 91.9. We therefore believe that large-resolution input is of great help to fine-grained classification performance.

As mentioned above, attention mechanisms can greatly improve the accuracy of fine-grained classification models. In addition to the SE[14] modules inside the RegNetY-16.0GF backbone, we devised a novel attention-region cropping strategy for this competition. Attention-region cropping is a common technique in fine-grained classification: as shown in Figure 12, SCDA[18] crops out the attention region using the largest connected component, avoiding the influence of irrelevant regions on training and making the model focus on the attention region. The largest-connected-component method works well when the attention region is clearly delineated, as with the bird in Figure 12, but it is hard to apply to the Sorghum-100 dataset. As shown in Figure 13, the attention regions of Sorghum-100 are scattered; cropping with the largest connected component would keep one good region but discard part of the attention area, reducing classification accuracy. We therefore propose random cropping of attention regions. The flow chart of the method is shown in Figure 14: after each epoch of training we obtain a model, use it to predict all training images, crop the original training images accordingly, and use the cropped results as the training data for the next epoch, repeating until training ends. The random cropping process is shown in Figure 15. The model trained in the nth epoch predicts each training image and yields the attention map output before the fully connected layer, as in Figure 13. The attention map is binarized with a threshold T to obtain a black-and-white map G. With crop width and height w and h, N random crops are sampled on G, and the region (x, y, w, h) containing the most white pixels becomes the training crop for the (n+1)th epoch.

This random cropping of attention regions ensures, on the one hand, that the model focuses on the attention regions, and on the other hand avoids the information loss caused by their scattered layout.
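
The core of the per-epoch cropping step could look like the following sketch. This is our reconstruction of the procedure described above, not the exact competition code; the function and variable names, and the default number of trial crops, are illustrative:

```python
import numpy as np

def random_attention_crop(attention: np.ndarray, w: int, h: int,
                          threshold: float, num_trials: int = 10):
    """Picks the random crop of size (w, h) containing the most attention.

    `attention` is the spatial attention map produced by the model trained
    in the previous epoch (taken before the fully connected layer), resized
    to the input image resolution.
    """
    g = (attention > threshold).astype(np.uint8)   # binarize to black/white map G
    H, W = g.shape
    best_box, best_score = None, -1
    for _ in range(num_trials):                    # N candidate crops
        x = np.random.randint(0, W - w + 1)
        y = np.random.randint(0, H - h + 1)
        score = g[y:y + h, x:x + w].sum()          # white pixels inside the candidate
        if score > best_score:
            best_box, best_score = (x, y, w, h), score
    return best_box   # crop the original image with this box for the next epoch
```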

Figure 12: SCDA

Figure 13: Random clipping of the attention area

Figure 14: Random clipping of the attention area

Figure 15: Flow chart of random cropping

For data augmentation, in addition to common horizontal flipping and random cropping, we used AutoAugment[19], proposed by Google in a CVPR 2019 paper, which searches for the best augmentation strategy for a given dataset.
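
In torchvision this can be expressed as a standard transform pipeline; the resolution below and the choice of the ImageNet policy are illustrative assumptions, not the exact competition configuration:

```python
from torchvision import transforms

# Training pipeline combining the common augmentations mentioned above
# with torchvision's AutoAugment implementation.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(512),
    transforms.RandomHorizontalFlip(),
    transforms.AutoAugment(transforms.AutoAugmentPolicy.IMAGENET),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
```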

Pseudo-labeling, a commonly used semi-supervised learning method, is also widely applied in image classification. After each round of training, the best model is used to predict the test set; the predictions are taken as label information and added to the training set, and the cycle repeats until test accuracy no longer improves significantly. In this competition, adding pseudo-labels raised accuracy on the private leaderboard from 91.9 to 95.1.
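
One round of the loop might look like the sketch below. The `model.predict` interface and the confidence filter are assumptions for illustration (the description above does not specify a threshold), but filtering by confidence is a common safeguard when pseudo-labeling:

```python
def pseudo_label_round(model, unlabeled_images, confidence_threshold=0.9):
    """One round of pseudo-labeling: keep confident test-set predictions."""
    pseudo_labeled = []
    for image in unlabeled_images:
        probs = model.predict(image)          # assumed to return class probabilities
        label, confidence = probs.argmax(), probs.max()
        if confidence >= confidence_threshold:
            pseudo_labeled.append((image, label))
    return pseudo_labeled   # merge into the training set, then retrain
```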

Test-Time Augmentation (TTA), a common testing technique, was also applied in this competition. Besides reducing overfitting and improving generalization during training, data augmentation can also effectively improve model accuracy at test time.
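
A minimal TTA sketch, averaging softmax outputs over an image and its horizontal flip (the actual augmentations used in the competition may differ):

```python
import torch

@torch.no_grad()
def predict_with_tta(model, batch: torch.Tensor) -> torch.Tensor:
    """Average predictions over the original batch and its horizontal flip."""
    probs = model(batch).softmax(dim=1)
    probs_flipped = model(torch.flip(batch, dims=[3])).softmax(dim=1)  # flip width axis
    return (probs + probs_flipped) / 2
```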

Dropout is an effective way to prevent overfitting. In the final stage of the competition, adding dropout raised accuracy on the private leaderboard from 95.1 to 95.3.

Ensembling is another common competition technique: weighting the embeddings predicted by different models and using the weighted embeddings for prediction can also effectively improve accuracy. In the final stage of the competition, ensembling raised accuracy on the private leaderboard from 95.3 to 95.9.
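
In code, the weighted-embedding fusion might look like this sketch (it assumes all models produce embeddings of the same dimension):

```python
import torch

def ensemble_embeddings(embeddings: list, weights: list) -> torch.Tensor:
    """Weighted fusion of per-model embeddings of shape (batch, dim).

    The fused embedding is then fed to the classifier for the final prediction.
    """
    fused = sum(w * e for w, e in zip(weights, embeddings))
    return fused / sum(weights)   # normalize so the weights need not sum to 1
```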

Application of deep learning fine-grained image classification in the car series recognition business

As a leader in the automotive internet vertical, Autohome has long invested in artificial intelligence algorithms for the automotive domain, such as vehicle recognition. Car recognition currently supports more than 4,000 car series, covering most common series such as Mercedes-Benz, BMW, and Audi.

After the competition, the car series recognition model was also switched to the RegNetY-16.0GF used in the competition, and its accuracy increased by 3.25%. As shown in Figure 17, the model's attention is concentrated mainly on the front of the car. As a result, within a car series, recognition accuracy is poor for models whose front ends differ greatly; likewise, different car series with similar front ends are easily confused. This is exactly the situation illustrated in Figure 11 and common in fine-grained classification: small inter-class distance and large intra-class distance.

Summary and Outlook

In recent years, the development of deep learning has greatly promoted the deployment of fine-grained classification in transportation, medicine, industry, agriculture, e-commerce, and other fields. Competitions built around industry needs have also attracted many practitioners, such as the iNat Challenge 2021[20] on natural species classification, Fisheries Monitoring[21] on fishery resource protection, and the AliProducts Challenge[22] sponsored by Alibaba. As with general image classification, the development of fine-grained classification still faces many challenges:

  1. Data annotation: Annotating fine-grained images often requires domain expertise (e.g., medicine), which makes labeling difficult. Self-supervised learning is therefore a major future trend. MAE[23], the self-supervised framework recently proposed by Kaiming He's team at FAIR, achieved SOTA (state-of-the-art) results on ImageNet[2] classification, and self-supervised learning can be expected to achieve impressive results on fine-grained classification tasks in the near future.
  2. Recognition robustness: Image classification is, as is well known, strongly affected by image quality; problems such as low light, overexposure, and blur hurt accuracy, and the impact is especially severe for fine-grained classification. Improving the robustness of fine-grained models is a major challenge for practitioners in this field.
  3. Categories not in the training set: A model trained on a classification dataset often struggles with images outside the dataset and sometimes misrecognizes them as training-set categories; this is the OOD (out-of-distribution) problem. It often requires a detection or segmentation model to filter out images of categories the model cannot recognize. Adding an "other" category to the training set usually works poorly because that category is too broad. Solving this problem is another major challenge in fine-grained classification.
  4. Small-sample recognition (long tail): Data for many fine-grained categories is hard to collect, leading to imbalanced training/test samples, the "long tail" problem often mentioned in industry. The model then recognizes data-rich categories well and data-poor categories poorly.

Figure 16: Photo-based car recognition in the Autohome main APP

Figure 17: Attention map of the car series recognition model

References

[1]. Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012
[2]. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proc. CVPR, 2009
[3]. J. Sánchez and F. Perronnin. High-dimensional signature compression for large-scale image classification. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1665–1672. IEEE, 2011
[4]. K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015
[5]. K. He, X. Zhang, S. Ren and J. Sun, “Deep Residual Learning for Image Recognition,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778, doi: 10.1109/CVPR.2016.90.
[6]. C. Szegedy et al., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1-9, doi: 10.1109/CVPR.2015.7298594.
[7]. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens and Z. Wojna, “Rethinking the Inception Architecture for Computer Vision,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2818-2826, doi: 10.1109/CVPR.2016.308.
[8]. Szegedy, C., Ioffe, S., Vanhoucke, V., et al. Inception-v4, Inception-Resnet and the Impact of Residual Connections on Learning. Thirty-First AAAI Conference on Artificial Intelligence, San Francisco, 4-9 February 2017, 4278-4284.2017
[9]. G. Huang, Z. Liu, L. Van Der Maaten and K. Q. Weinberger, “Densely Connected Convolutional Networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2261-2269, doi: 10.1109/CVPR.2017.243.
[10]. B. Zoph, V. Vasudevan, J. Shlens and Q. V. Le, “Learning Transferable Architectures for Scalable Image Recognition,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 8697-8710, doi: 10.1109/CVPR.2018.00907.
[11]. R. Doon, T. Kumar Rawat and S. Gautam, “Cifar-10 Classification using Deep Convolutional Neural Network,” 2018 IEEE Punecon, 2018, pp. 1-5, doi: 10.1109/PUNECON.2018.8745428.
[12]. I. Radosavovic, R. P. Kosaraju, R. Girshick, K. He and P. Dollár, "Designing Network Design Spaces," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[13]. Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. ICLR 2021
[14]. J. Hu, L. Shen and G. Sun, “Squeeze-and-Excitation Networks,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132-7141, doi: 10.1109/CVPR.2018.00745.
[15]. Jie Hu and Li Shen and Samuel Albanie and Gang Sun and Andrea Vedaldi, Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks,NIPS 2018
[16]. Woo, S., Park, J., Lee, JY., Kweon, I.S. (2018). CBAM: Convolutional Block Attention Module. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds) Computer Vision – ECCV 2018.
[17]. X. Li, W. Wang, X. Hu and J. Yang, “Selective Kernel Networks,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 510-519, doi: 10.1109/CVPR.2019.00060.
[18]. X. Wei, J. Luo, J. Wu and Z. Zhou, “Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval,” in IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2868-2881, June 2017, doi: 10.1109/TIP.2017.2688133.
[19]. E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan and Q. V. Le, “AutoAugment: Learning Augmentation Strategies From Data,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition(CVPR), 2019, pp. 113-123, doi: 10.1109/CVPR.2019.00020.
[20]. iNat Challenge 2021 https://www.kaggle.com/c/inaturalist-2021
[21]. Fisheries Monitoring https://www.kaggle.com/competitions/the-nature-conservancy-fisheries-monitoring/
[22]. AliProducts Challenge https://tianchi.aliyun.com/competition/entrance/531884/introduction
[23]. Kaiming He and Xinlei Chen and Saining Xie and Yanghao Li and Piotr Dollar and Ross Girshick, Masked Autoencoders Are Scalable Vision Learners, IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
