LS-CNN: Characterizing Local Patches at Multiple Scales for Face Recognition--Qiangchang Wang

0. Summary

The problem addressed in this paper: face recognition suffers from large intra-class variation caused by varying lighting and poses, yet few works learn local and multi-scale representations.

Why consider multiple scales: because some distinguishing features may exist at different scales.
How it is solved: use convolution kernels of different sizes to extract features within a single layer (Inception-like), and aggregate features from different layers (DenseNet-like).

Why consider local patches: because even when global or salient features are lost, some local patches can still help with identification or classification.
How it is solved: different regions have different importance, so a spatial attention module is designed.

Why add channel attention: one problem with the multi-scale design is that when features produced by different convolution kernels or coming from different layers are aggregated, they carry different information and differ in importance, so they should not be aggregated directly. (Low-level channels describe local details or small-scale parts, while high-level channels represent abstractions or large-scale parts; channel attention decides whether local details or abstract features matter more.)
Solution: design a channel attention module.

1. Introduction

1.1 Learning multi-scale information

Discriminative face regions may appear in features at different scales. For example, in Figure 1, the first and second rows are different photos of the same person. In face recognition, the salient feature for some people is the mouth, for others the eyes, and sometimes even a single small eye. The problem is that features of different sizes live at different depths: features covering larger areas may still be present in higher layers, while features covering smaller areas may only appear in lower layers. To take features of different sizes into account simultaneously, they must be aggregated across multiple scales.
[Figure 1: different photos of the same person, whose discriminative regions appear at different scales]
When considering multi-scale features, other works generally aggregate only the last two layers, whereas this paper applies "multi-scale" in two ways to perform large-scale aggregation:

1) Inception-style aggregation: within the same layer, convolution kernels of different sizes are used for feature extraction.
2) DenseNet-style aggregation: DenseNet reuses features from earlier layers, and its dense connections preserve gradient flow, giving it advantages similar to ResNet.

The module responsible for this is called the Harmonious multi-Scale Network (HSNet).

1.2 Spatial attention

There are two reasons for spatial attention:

1) Some background is useless. For example, in face recognition, pixels belonging to the background or hair carry little value, so spatial attention is used to suppress them.
2) Within the face, some regions are more discriminative than others, so spatial attention is used to highlight the discriminative local patches.

So, as shown in Figure 2, the background is largely filtered out, and attention within the face is focused on the discriminative regions.
[Figure 2: spatial attention suppresses the background and focuses on discriminative face regions]
The spatial attention module in this paper is named the Local Aggregation Network (LANet).

1.3 Channel Attention

In the multi-scale aggregation of Section 1.1, the feature channels produced by different convolution kernel sizes within the same layer, and those coming from different layers, are not equally important: for one input the channels carrying low-level local details matter more, while for another the higher-level abstract features do. Channel attention is therefore needed to redistribute the importance of these channels before aggregation.
[Figure 3: channel attention example; channels rich in eye features should be emphasized]
For example, in Figure 3 the eyes are an important feature of this image, so the channels rich in eye features should be enhanced when the features are aggregated.

This article uses SENet as a channel attention module.

1.4 Summary

In summary,

| Module | Role | Remark |
| --- | --- | --- |
| HSNet | Multi-scale feature aggregation | Combines Inception + DenseNet |
| DFA | Channel and spatial attention | Channel attention (SENet) + spatial attention (LANet) |
| LS-CNN | Overall network | Combines HSNet + DFA, i.e. Local and multi-Scale Convolutional Neural Networks |

The design has a modular flavor, with components packaged and named layer by layer...

2. Detailed structure

2.1 Inception Module

The Inception module is classic. In this paper the author designs two kinds of Inception module: one for feature extraction, LS-CNN-D, in which the input and output feature sizes are unchanged; the other for transition, LS-CNN-T, in which the output feature size is halved so that it acts as a pooling layer.

  • LS-CNN-D
    If you know DenseNet, you know that the layers inside a DenseNet block are used for feature extraction, while the layers between blocks are used for transition. LS-CNN-D is used for feature extraction inside a block, and LS-CNN-T serves as the transition layer between blocks.

As shown in Figure 4 (and in the sketch after it), LS-CNN-D is composed of LANet (spatial attention) + SENet (channel attention) + Inception.
It should be noted that:
1) The DFA attention module does not change the feature size.
2) The first 1x1 convolution of each Inception branch is used to reduce the amount of computation (compared with using a 3x3 convolution directly to increase the number of features from m x k to 4k).
3) Two stacked 3x3 convolutions are used instead of a 5x5 convolution, obtaining the same enlarged receptive field with less computation.
[Figure 4: structure of LS-CNN-D (LANet + SENet + Inception)]
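To make this concrete, here is a minimal PyTorch sketch of an LS-CNN-D style block: DFA attention (left as a pluggable argument) followed by an Inception-style multi-branch convolution whose output keeps the spatial size. Class and argument names, branch widths, and the placement of activations are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LSCNND(nn.Module):
    """Sketch of an LS-CNN-D block: attention, then Inception-style branches."""
    def __init__(self, in_channels, growth_rate, attention=None):
        super().__init__()
        # DFA (LANet + SENet) would plug in here; nn.Identity() keeps the sketch runnable.
        self.dfa = attention if attention is not None else nn.Identity()
        mid = growth_rate  # assumed 1x1 bottleneck width
        # Branch 1: plain 1x1 convolution
        self.b1 = nn.Sequential(nn.Conv2d(in_channels, growth_rate, 1), nn.ReLU(inplace=True))
        # Branch 2: 1x1 bottleneck (cuts computation) then 3x3
        self.b2 = nn.Sequential(
            nn.Conv2d(in_channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth_rate, 3, padding=1), nn.ReLU(inplace=True))
        # Branch 3: 1x1 bottleneck then two 3x3 convs (same receptive field as a 5x5)
        self.b3 = nn.Sequential(
            nn.Conv2d(in_channels, mid, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid, growth_rate, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.dfa(x)  # attention does not change the feature size
        # Concatenate the multi-scale branches; spatial size is preserved.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)

out = LSCNND(64, 16)(torch.randn(1, 64, 28, 28))  # -> (1, 48, 28, 28)
```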

  • LS-CNN-T
    The transition layer also uses an Inception structure, as shown in Figure 5 and in the sketch that follows it.

It should be noted that:
1) Its structure is similar to LS-CNN-D, but the 1x1 branch is replaced by max pooling.
2) The spatial size is reduced by using stride = 2, and the channel count is also halved.
3) The size changes as $h \times w \times c \rightarrow \frac{h}{2} \times \frac{w}{2} \times \frac{c}{2}$.
[Figure 5: structure of LS-CNN-T]
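Following the same pattern, a hedged sketch of an LS-CNN-T style transition block: the 1x1 branch becomes max pooling, the convolution branches use stride 2, and the total output channels are half of the input. How the halved channel budget is split across branches, and the 1x1 projection after the pooling branch, are assumptions made only to keep the shapes consistent.

```python
import torch
import torch.nn as nn

class LSCNNT(nn.Module):
    """Sketch of an LS-CNN-T transition block: h x w x c -> h/2 x w/2 x c/2."""
    def __init__(self, in_channels, attention=None):
        super().__init__()
        self.dfa = attention if attention is not None else nn.Identity()
        out = in_channels // 2
        b = out // 3  # assumed per-branch share of the halved channel budget
        # Pooling branch replaces the 1x1 branch; a 1x1 projection trims channels (assumption).
        self.b1 = nn.Sequential(nn.MaxPool2d(3, stride=2, padding=1),
                                nn.Conv2d(in_channels, b, 1))
        # Strided 3x3 branch with a 1x1 bottleneck
        self.b2 = nn.Sequential(
            nn.Conv2d(in_channels, b, 1), nn.ReLU(inplace=True),
            nn.Conv2d(b, b, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        # Double 3x3 branch (5x5-equivalent receptive field), stride 2 on the last conv
        self.b3 = nn.Sequential(
            nn.Conv2d(in_channels, b, 1), nn.ReLU(inplace=True),
            nn.Conv2d(b, b, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(b, out - 2 * b, 3, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        x = self.dfa(x)
        return torch.cat([self.b1(x), self.b2(x), self.b3(x)], dim=1)

out = LSCNNT(96)(torch.randn(1, 96, 28, 28))  # -> (1, 48, 14, 14)
```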

2.2 DenseNet Module

To improve information and gradient flow, dense connections are used; a minimal sketch of the wiring is given below.
For details on DenseNet, see the DenseNet paper or the original text of this article, both of which introduce the DenseNet structure.
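A minimal sketch of the dense wiring, with plain 3x3 convolutions standing in for LS-CNN-D layers: each layer receives the concatenation of the block input and all earlier outputs, which is the DenseNet-style feature reuse the paper relies on. Names and widths below are hypothetical.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer sees the concatenation of all preceding feature maps."""
    def __init__(self, in_channels, layer_out, num_layers, make_layer):
        super().__init__()
        self.layers = nn.ModuleList(
            make_layer(in_channels + i * layer_out, layer_out) for i in range(num_layers))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier features
            features.append(out)
        return torch.cat(features, dim=1)

# Plain conv layers stand in for LS-CNN-D here (hypothetical widths):
conv3x3 = lambda c_in, c_out: nn.Sequential(
    nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))
block = DenseBlock(in_channels=64, layer_out=32, num_layers=4, make_layer=conv3x3)
y = block(torch.randn(1, 64, 56, 56))  # -> (1, 64 + 4 * 32, 56, 56)
```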

2.3 LANet Module

It is easy to understand for anyone familiar with attention. The difference is that there is a channel compression ratio r (presumably borrowed from SENet).
[Figure: structure of the LANet module]
The specific structure is: conv + ReLU + conv + sigmoid, sketched below.
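A minimal sketch of such a spatial attention module under these assumptions: two 1x1 convolutions with a channel reduction ratio r in between, a sigmoid producing a single H x W mask, and the mask applied as a multiplicative gate over all channels. The kernel sizes and the gating choice are assumptions, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

class LANet(nn.Module):
    """Spatial attention sketch: conv(C -> C/r) + ReLU + conv(C/r -> 1) + sigmoid."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.attn(x)  # broadcast the single-channel mask over all channels

y = LANet(64)(torch.randn(1, 64, 28, 28))  # same shape as the input
```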

Channel attention is SENet:
[Figure: SENet (squeeze-and-excitation) structure]
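For completeness, a standard squeeze-and-excitation block as in the SENet paper: global average pooling squeezes each channel to a scalar, a two-layer bottleneck MLP with a sigmoid produces per-channel weights, and the input is rescaled channel-wise.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Standard SENet channel attention (squeeze-and-excitation)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # squeeze + excitation
        return x * w                                      # reweight channels

y = SEBlock(64)(torch.randn(1, 64, 28, 28))  # same shape as the input
```

In DFA the two gates would then be combined, for example `nn.Sequential(LANet(channels), SEBlock(channels))` using the LANet sketch above, following the LANet + SENet order of Figure 4; treating the combination as a simple sequential composition is an assumption of this sketch.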

2.4 Local and multi-scale convolutional neural networks (LS-CNN, overall structure)

After introducing the basic components of LS-CNN (LANet, SENet, LS-CNN-D, LS-CNN-T), let's see how they are combined into the overall structure.
[Figure: overall LS-CNN architecture]

1) The initial conv and pooling layers act as a stem that extracts preliminary features.
2) Multiple LS-CNN-D modules inside a dense block are linked by dense connections for feature extraction.
3) LS-CNN-T is used for downsampling; like pooling, it reduces the feature size.
4) LANet and SENet are embedded in LS-CNN-D and LS-CNN-T as shown in Figures 4 and 5.

See Table 1 for the specific configuration; a high-level assembly sketch follows the table.
[Table 1: detailed LS-CNN configuration]
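Here is a high-level assembly sketch following points 1) to 4) above: a stem, alternating feature blocks and transition blocks, and an embedding head. Plain convolutions stand in for LS-CNN-D and LS-CNN-T, the dense wiring is omitted for brevity (see the DenseBlock sketch in 2.2), and all block counts and widths are placeholders rather than the Table 1 configuration.

```python
import torch
import torch.nn as nn

def feature_block(channels, num_layers):
    # Stand-in for a dense block of LS-CNN-D layers (feature size preserved).
    # With real dense connections the channel count would grow inside the block.
    return nn.Sequential(*[
        nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        for _ in range(num_layers)])

def transition_block(channels):
    # Stand-in for LS-CNN-T: spatial size and channel count are both halved.
    return nn.Sequential(nn.Conv2d(channels, channels // 2, 1),
                         nn.ReLU(inplace=True),
                         nn.AvgPool2d(2))

class LSCNNSketch(nn.Module):
    def __init__(self, channels=128, blocks=(4, 4, 4), embed_dim=512):
        super().__init__()
        # 1) stem: initial conv + pooling for preliminary feature extraction
        self.stem = nn.Sequential(nn.Conv2d(3, channels, 7, stride=2, padding=3),
                                  nn.ReLU(inplace=True),
                                  nn.MaxPool2d(3, stride=2, padding=1))
        stages = []
        for i, n in enumerate(blocks):
            stages.append(feature_block(channels, n))      # 2) dense block of LS-CNN-D
            if i < len(blocks) - 1:
                stages.append(transition_block(channels))  # 3) LS-CNN-T downsampling
                channels //= 2
        self.stages = nn.Sequential(*stages)
        # 4) LANet/SENet live inside LS-CNN-D and LS-CNN-T, so they are omitted here.
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(channels, embed_dim))  # face embedding

    def forward(self, x):
        return self.head(self.stages(self.stem(x)))

emb = LSCNNSketch()(torch.randn(2, 3, 112, 112))  # -> (2, 512)
```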

3. Experiment

3.1 Ablation experiment

[Table: ablation study results]
1) Inception alone performs much worse than DenseNet. Although Inception has multi-scale capability, it has relatively few layers and channels, so its representational power is limited, whereas DenseNet, through multi-scale aggregation and feature reuse, has strong representational power.
2) HSNet, the combination of Inception and DenseNet, performs even better. On one hand, the Inception module learns features with convolution kernels of different sizes within a single layer; on the other hand, dense connections allow the DenseNet module to combine features from multiple layers. In other words, plain DenseNet can only reuse features extracted with a fixed kernel size at each layer, while HSNet lets it reuse features extracted with different kernel sizes at each layer, strengthening and optimizing DenseNet's feature reuse.

3) For the attention modules, effectiveness can be studied through visualization. As shown in Figures 9 and 10, SENet+HSNet and LS-CNN (LANet+SENet+HSNet) are compared. Adding LANet (columns 3 and 6) effectively strengthens some local features and improves the representational ability.
[Figures 9 and 10: class activation maps comparing SENet+HSNet with LS-CNN (LANet+SENet+HSNet)]

3.2 Different attention

DFA consists of LANet + SENet. Here DFA is replaced with the following attention modules to explore their effect:

SENet: channel attention only; replacing DFA with SENet means only channel attention is applied.
CBAM: applies channel attention and spatial attention sequentially.
BAM: applies channel attention and spatial attention in parallel, and uses dilated convolution in the spatial attention part to enlarge the receptive field.
[Table: comparison of different attention modules]
1) SENet only considers channel attention, so on all datasets it is worse than DFA, which considers both spatial and channel attention.
2) CBAM is also worse than DFA, even though both consider spatial and channel attention. Why is CBAM worse? Since CBAM uses both max pooling and average pooling, the author changed the average pooling in DFA (in fact the average pooling of the SENet part of DFA) to combined max and average pooling, called DFA (Max & Avg pool), and found that the results degraded, so max pooling is not appropriate for this task (a sketch of this variant follows the list). In fact, considering the backgrounds in the dataset, max pooling may retain background noise, whereas average pooling still preserves channel information while diluting the noise. Conversely, for a task without cluttered backgrounds, would avg + max pooling work better?
3) As for why BAM is worse, the author replaced the LANet spatial attention in DFA with dilated convolution, i.e., DFA (Dilated conv), and found the result worse than DFA. One explanation is that, after many layers are stacked, the receptive field in later stages is already very large; at that point every pixel carries rich information, and dilated convolution skips some pixels, causing a slight degradation. (If so, would a DFA using larger ordinary convolutions, which captures every pixel while also enlarging the receptive field, perform even better?)
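For reference, a sketch of what the channel-attention part of the "DFA (Max & Avg pool)" variant might look like: the squeeze step uses both global average pooling and global max pooling, passed through a shared excitation MLP (CBAM-style) before the sigmoid. Whether the two descriptors are summed or concatenated is an assumption here.

```python
import torch
import torch.nn as nn

class SEMaxAvg(nn.Module):
    """SE-style channel attention with both avg- and max-pooled descriptors."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels))
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.fc(x.mean(dim=(2, 3)))   # average-pooled channel descriptor
        mx = self.fc(x.amax(dim=(2, 3)))    # max-pooled channel descriptor
        w = self.sigmoid(avg + mx).view(b, c, 1, 1)  # summing the two is an assumption
        return x * w

y = SEMaxAvg(64)(torch.randn(1, 64, 14, 14))  # same shape as the input
```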

3.3 Experiments on face recognition in different poses

This part uses the CFP face dataset, a dataset for multi-pose face recognition.

[Tables: face recognition results on the CFP dataset]
We notice that, due to occlusion caused by head rotation in profile faces, the discriminative local patches are smaller than the area available on a frontal face. HSNet (the multi-scale module built from Inception and DenseNet) can capture this rich multi-scale information.

However, as the network goes deeper, local face regions encoded in lower-level channels may fail to propagate to deeper layers (for example, regions that are too small). To alleviate this problem, SENet is added to the multi-scale feature fusion part of HSNet to emphasize the important lower-level channels, as shown in Figure 9.

In addition, the LANet module is introduced to mitigate the effect of background inconsistency (faces of the same person may appear against different backgrounds, which can hurt recognition). As shown in Figure 9, compared with the SENet-HSNet model, the class activation maps (CAM) generated by the LS-CNN model tend to localize more discriminative parts on frontal faces (column 3 vs. column 2) and suppress uninformative regions on profile faces (column 6 vs. column 5). Finally, compared with state-of-the-art methods that require complex data augmentation (DR-GAN, DR-GANAM, PIM, DA-GAN), multi-task training (p-CNN), noise-tolerant paradigms (NoiseFace), or more advanced loss functions (ArcFace), our method is simple and effective.
