【Paper Reading】Semi-Supervised Group Emotion Recognition Based on Contrastive Learning

Summary

This post summarizes the paper *Semi-Supervised Group Emotion Recognition Based on Contrastive Learning*, published in MDPI Electronics in 2022, in order to deepen understanding and memory.

1. Introduction

1) GER

In addition to problems such as face occlusion and low resolution in crowd images, the performance of GER is also affected by the interactions between individuals and the crowd, as well as between the environment and the crowd. These factors make GER more challenging than individual emotion recognition. Research in social psychology shows that group emotions contain **"bottom-up" and "top-down" components**: the bottom-up component refers to the combination of individual emotions, such as human expressions and behavior, while the top-down component refers to the influence exerted on individuals at the group or scene level [1]. How to extract and fuse the features of these components and improve recognition accuracy is the main problem of GER research.

In recent years, GER has begun to be used in many important application scenarios, including image retrieval [2], human depression detection [3], image memorability prediction [4,5], public security [6], human-computer interaction [7], and so on.

2) Semi-supervised

Manually annotating emotion labels is a labor-intensive and costly process. The annotation process typically requires each group image to be evaluated by three to five annotators, and the labels require another round of proofreading before release.

Semi-supervised learning has been shown to be a promising approach to exploiting large amounts of unlabeled data to improve the performance of learning-based networks [8-11]. However, the quality or reliability of the learned features may be compromised by the efficiency of the semi-supervised learning strategy. Designing an effective semi-supervised learning strategy and improving the reliability of learned features remains a challenging task.

2. Related work

1) Group emotion recognition

Many GER studies focus on facial features and scene features because they are the most important factors affecting group mood. Some GER studies also introduce the influence of other factors, such as objects [12] and human bones [13].

  • Dhall et al. proposed a GER framework that extracts facial features from facial action units and uses GIST and CENTRIST descriptors to characterize emotional cues from scenes.
  • Tan et al. [15] constructed three CNN models from aligned faces, unaligned faces, and the whole image to learn emotional features. Since group emotion is regarded as the superposition of individual emotions, an average fusion strategy is used to combine the outputs of these three models.
  • Surace et al. proposed a GER method consisting of a neural network and a Bayesian classifier, in which the neural network analyzes individual emotions in a bottom-up manner, and the Bayesian classifier estimates scene expressions in a top-down manner.

Among many fusion methods to improve model performance, attention mechanism is one of the most popular fusion techniques.

  • Fujii et al. [17] used a visual attention mechanism to focus on the main facial features within the group while suppressing the facial features of other subjects.
  • Khan et al. proposed a region-focusing mechanism to attend to the more important people.

In order to improve the efficiency of feature fusion, some new methods are also proposed.

  • Long Short-Term Memory (LSTM) networks are used to aggregate scene and face features [18-20].
  • Graph neural networks are also used to fuse different emotional cues and to mine potential relationships and interactions between them [21].

2) Contrastive learning

Contrastive learning is a promising method for pre-training deep models. It helps the backbone network learn effective representations from unlabeled samples that can serve downstream tasks.

  • Chen et al. [22] proposed a contrastive learning framework to capture similar features from paired augmented views of input images, facilitating pre-trained networks for image classification tasks.
  • He et al. [23] proposed a momentum contrast method to minimize the distance between features learned from different augmented views of the same image, and to maximize the distance between features learned from augmented views of different images.
  • Thanks to the stop-gradient operation, a simple contrastive method called SimSiam [24] was also proposed, which can significantly reduce the batch size and the number of training epochs compared with existing methods. SimSiam also shows that effective visual representations can be learned without semantic labels.
  • Contrastive learning has also been used for face recognition [25] and face generation [26] tasks, achieving impressive performance.
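To make the contrastive idea concrete, here is a minimal PyTorch sketch of a SimSiam-style objective with stop-gradient; the function and variable names are illustrative, not taken from the papers above.

```python
# Minimal sketch of a SimSiam-style contrastive objective (stop-gradient,
# no negative pairs). p1/p2 are predictor outputs and z1/z2 are encoder
# outputs for two augmented views of the same image; names are illustrative.
import torch
import torch.nn.functional as F

def simsiam_loss(p1, z1, p2, z2):
    # Stop-gradient on the target branch is the trick that lets SimSiam
    # avoid representation collapse without negatives or large batches.
    z1, z2 = z1.detach(), z2.detach()
    # Negative cosine similarity, symmetrized over the two views.
    return -(F.cosine_similarity(p1, z2, dim=-1).mean()
             + F.cosine_similarity(p2, z1, dim=-1).mean()) / 2
```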

3) Semi-supervised learning

The most widely used GER datasets are still limited in size, much smaller than some classic image classification datasets.

Semi-supervised learning is a technique that exploits information in unlabeled samples to improve the performance of learning-based recognition models. It uses large amounts of unlabeled data, with the help of a limited amount of labeled data, to improve the performance of learning-based networks. Labeling unlabeled data with pseudo-labels for training is one of the most popular strategies in semi-supervised learning. These pseudo-label-based methods first use the limited labeled data to train a labeler that provides pseudo-labels for unlabeled data [32]. Then, the data with real labels and pseudo-labels are used together to update the parameters of the network. Pseudo-label-based methods have been widely used to improve the performance of recognition networks.
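As a concrete illustration of this strategy, the following is a hedged PyTorch sketch in which a model trained on the limited labeled data assigns pseudo-labels to unlabeled images; the confidence threshold and all names are assumptions for illustration, not details from the paper.

```python
# Sketch of generic pseudo-label generation: a classifier trained on limited
# labeled data labels the unlabeled set, keeping only confident predictions.
# The threshold and loader/model names are illustrative assumptions.
import torch

@torch.no_grad()
def generate_pseudo_labels(model, unlabeled_loader, threshold=0.9, device="cpu"):
    model.eval()
    kept_images, kept_labels = [], []
    for images in unlabeled_loader:
        images = images.to(device)
        probs = torch.softmax(model(images), dim=1)   # class probabilities
        conf, preds = probs.max(dim=1)                # confidence, predicted class
        keep = conf >= threshold                      # drop unreliable predictions
        kept_images.append(images[keep].cpu())
        kept_labels.append(preds[keep].cpu())
    return torch.cat(kept_images), torch.cat(kept_labels)
```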

  • Xie et al. [8] proposed an iterative method to generate pseudo-labels for unlabeled data and use them to improve the accuracy and robustness of ImageNet models.
  • Sohn et al. [9] proposed an ensemble classifier to give unlabeled images with pseudo-labels and used them to improve the performance of the model on image classification tasks.
  • Hao et al. [33] infer pseudo-labels for unlabeled data and improve image change detection through graph-based label propagation.

Apart from classification, semi-supervised learning is used in many other applications, e.g., object detection [10,11], motion analysis [34], and multi-view models [35]. By assigning pseudo-labels to unlabeled samples, all of the above methods [8-11,32-35] utilize unlabeled data to help update learning-based networks and improve recognition performance. However, the reliability or uncertainty of the assigned pseudo-labels may affect the efficiency of learning-based networks. How to compensate for the uncertainty of pseudo-labels remains an open problem for semi-supervised learning methods.

3. Method

The framework of SSGER consists of two networks: SFNet and FusionNet.

  • Two inputs are provided to SFNet: face images cropped from the group image and a scene image obtained from the group image. SFNet extracts preliminary emotional information from the face and scene images.
  • FusionNet is used to fuse the emotional features extracted from the face and scene images to generate more comprehensive group emotion features.

1) SFNet

Using the ResNet-50 network as the backbone of SFNet, features are captured from face images and scene images as semantic emotion features for groups [36,37]. All face regions are segmented from the group image and denoted as face images; a region is randomly cropped from the group image and denoted as the scene image. Each face image and the scene image form an image pair, which is then fed into SFNet. The feature extraction operation can be expressed by equations (1) and (2):
$$x_i^s = \varphi(I_i^s) \tag{1}$$

$$x_{ij}^f = \varphi(I_{ij}^f) \tag{2}$$

where $\varphi$ denotes the SFNet feature extractor, $I_i^s$ is the scene image of the $i$-th group image, and $I_{ij}^f$ is the $j$-th face image cropped from the $i$-th group image.
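For intuition, a minimal PyTorch sketch of this step, assuming a torchvision ResNet-50 whose classification head is replaced by an identity so that it returns pooled features; whether faces and scene share one backbone, and the tensor shapes, are assumptions here.

```python
# Sketch of the SFNet feature extraction in equations (1)-(2), using a
# torchvision ResNet-50 backbone; sharing one backbone for faces and scene
# is a simplifying assumption.
import torch
import torchvision

backbone = torchvision.models.resnet50(weights=None)
backbone.fc = torch.nn.Identity()      # drop the classifier, keep 2048-d features

scene = torch.randn(1, 3, 224, 224)    # I_i^s: scene crop of the i-th group image
faces = torch.randn(16, 3, 224, 224)   # I_ij^f: up to 16 faces from the same image

with torch.no_grad():
    x_s = backbone(scene)   # x_i^s,  shape (1, 2048)   -- equation (1)
    x_f = backbone(faces)   # x_ij^f, shape (16, 2048)  -- equation (2)
```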

2) FusionNet

FusionNet, which consists of an attention mechanism module and a prediction fusion module, is used to fuse the emotional features extracted from face and scene images. It takes the scene and facial features as input.

Figure: the structure and training process of FusionNet. (a) The training process of SFNet and FusionNet; (b) the attention mechanism module in FusionNet.

  • In the attention mechanism module, the scene feature is concatenated with each face feature separately; the concatenated features are fed into a fully connected layer, and a Sigmoid function is used to learn the attention weights:

$$\sigma_{ij} = \mathrm{Sigmoid}(W_f x_{ij}^c + b_f)$$

where $x_{ij}^c$ is the concatenation of the scene feature and the $j$-th face feature, and $\sigma_{ij}$ is the attention weight of the $j$-th facial feature in the $i$-th image.

  • Aggregate facial features:

$$x_i^f = \frac{\sum_{j=1}^{N} \sigma_{ij}\, x_{ij}^f}{\sum_{j=1}^{N} \sigma_{ij}}$$

where $x_i^f$ denotes the aggregated facial feature of the $i$-th image.

  • The aggregated facial feature and the scene feature are fed into fully connected layers to obtain group emotion predictions $y_i^f$ and $y_i^s$ from the facial and scene emotional information, respectively. Face and scene information are then fused through the prediction fusion module of FusionNet (see the sketch after these formulas):

$$y_i' = \sigma_i^f y_i^f + \sigma_i^s y_i^s$$

where $\sigma_i^f$ and $\sigma_i^s$ are the fusion weights of the face and scene features, which must satisfy the constraints $\sigma_i^f \ge 0$, $\sigma_i^s \ge 0$, and $\sigma_i^f + \sigma_i^s \le 1$. The fusion weights are learned as follows:

$$[\sigma_i^f, \sigma_i^s] = \mathrm{Softmax}(W_g x_i^g + b_g)$$
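Putting these formulas together, here is a hedged PyTorch sketch of a FusionNet-like module: sigmoid attention over concatenated scene-face features, weighted aggregation, and softmax-learned fusion of the two predictions. The layer sizes, the number of emotion classes, and the choice of $x_i^g$ (here the concatenation of the aggregated face feature and the scene feature) are assumptions not specified in this summary.

```python
# Hedged sketch of a FusionNet-like module following the formulas above.
# feat_dim, num_classes, and the definition of x_i^g are assumptions.
import torch
import torch.nn as nn

class FusionNetSketch(nn.Module):
    def __init__(self, feat_dim=2048, num_classes=3):
        super().__init__()
        self.attn = nn.Linear(2 * feat_dim, 1)       # W_f, b_f
        self.face_head = nn.Linear(feat_dim, num_classes)
        self.scene_head = nn.Linear(feat_dim, num_classes)
        self.fusion = nn.Linear(2 * feat_dim, 2)     # W_g, b_g

    def forward(self, x_s, x_f):
        # x_s: (1, D) scene feature of one image; x_f: (N, D) its face features.
        x_c = torch.cat([x_s.expand(x_f.size(0), -1), x_f], dim=1)
        sigma = torch.sigmoid(self.attn(x_c))             # (N, 1) attention weights
        x_face = (sigma * x_f).sum(dim=0) / sigma.sum()   # aggregated face feature x_i^f
        y_f = self.face_head(x_face)                      # prediction from faces, y_i^f
        y_s = self.scene_head(x_s.squeeze(0))             # prediction from scene, y_i^s
        g = torch.cat([x_face, x_s.squeeze(0)])           # stand-in for x_i^g (assumption)
        w = torch.softmax(self.fusion(g), dim=-1)         # [sigma_i^f, sigma_i^s]
        return w[0] * y_f + w[1] * y_s                    # fused prediction y'_i
```

For example, `FusionNetSketch()(torch.randn(1, 2048), torch.randn(16, 2048))` returns one fused 3-class prediction for the group image.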

3) Training process (see the original text for the specific process and formula)

  • Stage 1: pre-train SFNet with a contrastive learning method to extract semantic emotion information from unlabeled scene and face images.
  • Stage 2: train SFNet and FusionNet with the limited labeled images.
  • Stage 3: freeze the parameters of SFNet and FusionNet and use them to provide pseudo-labels for unlabeled images.
  • Stage 4: further train SFNet and FusionNet with both pseudo-labeled images and images with real labels.

To suppress the influence of samples with unreliable pseudo-labels, the authors propose a weighted cross-entropy loss $L_{WCE}$ for the backpropagation process in stage 4.
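The exact form of $L_{WCE}$ is given in the original paper; the sketch below only illustrates the general idea of a confidence-weighted cross-entropy, where a per-sample weight (for example, the labeler's confidence for pseudo-labeled samples and 1.0 for samples with real labels) down-weights unreliable pseudo-labels. The names and weighting scheme here are assumptions.

```python
# Hedged sketch of a confidence-weighted cross-entropy in the spirit of L_WCE.
# The per-sample weight is an assumption: e.g., the labeler's softmax
# confidence for pseudo-labeled samples and 1.0 for ground-truth labels.
import torch
import torch.nn.functional as F

def weighted_cross_entropy(logits: torch.Tensor,
                           targets: torch.Tensor,
                           weights: torch.Tensor) -> torch.Tensor:
    # logits: (B, C); targets: (B,) class indices; weights: (B,) in [0, 1]
    per_sample = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_sample).sum() / weights.sum().clamp_min(1e-8)
```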

4. Experiment

1) Datasets: GAF2, GAF3, GroupEmoW

2) Data preprocessing

To evaluate the effectiveness of the proposed method in semi-supervised scenarios, the authors randomly select labeled samples at a given labeling rate (the ratio of the number of labeled samples to the total number of training samples) and treat the remaining samples in the training set as unlabeled data.
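For concreteness, a minimal sketch of such a split, assuming an indexable training set; the function name, seed handling, and dataset size are illustrative assumptions.

```python
# Minimal sketch: split training indices into labeled/unlabeled subsets at a
# given labeling rate (ratio of labeled samples to all training samples).
import random

def split_by_labeling_rate(num_train: int, labeling_rate: float, seed: int = 0):
    indices = list(range(num_train))
    random.Random(seed).shuffle(indices)          # reproducible random selection
    n_labeled = int(num_train * labeling_rate)
    return indices[:n_labeled], indices[n_labeled:]   # labeled, unlabeled

labeled_idx, unlabeled_idx = split_by_labeling_rate(10000, 0.05)  # e.g., 5% labeled
```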

The maximum number of faces extracted from an image is set to 16. The cropped faces are resized to 224×224 pixels, and data augmentation operations, including random horizontal flip and Gaussian blur, are performed on the resized faces.

The whole image is also used as the scene input, cropped and resized to 224×224 pixels. The same data augmentation as performed on the face images is applied to the whole image.
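One possible torchvision pipeline matching this preprocessing; the Gaussian-blur kernel size and the flip probability are assumptions, since the blog does not specify them.

```python
# A possible preprocessing pipeline for face and scene inputs: resize to
# 224x224, random horizontal flip, Gaussian blur. Blur/flip parameters are
# assumptions not stated in the blog.
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.GaussianBlur(kernel_size=3),
    transforms.ToTensor(),
])
```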

3) Performance metrics on GroupEmoW

4) Ablation experiment

The ablation study examines three components: contrastive learning for SFNet pre-training, the process of labeling unlabeled samples with pseudo-labels, and the WCE loss introduced to compensate for the uncertainty of pseudo-labels.

Source: blog.csdn.net/qq_44930244/article/details/130304755