【Paper Reading】Non-volume preserving-based fusion to group-level emotion recognition on crowd videos

Summary

This post covers the paper "Non-volume preserving-based fusion to group-level emotion recognition on crowd videos", published in Pattern Recognition in 2022, and summarizes its main content to deepen understanding and memory.

1 Introduction

1) Emotion Recognition (ER)

Emotion recognition (ER) from facial expressions (i.e., movements of facial muscles, described by the Facial Action Coding System, FACS) has been studied for many years in affective computing, e-learning, healthcare, virtual reality, and human-computer interaction (HCI). Technically, ER methods can be divided into two types: individual ER and group-level ER.

Although the study of individual ER is quite mature, research on group-level ER is still in its infancy. One challenge of group-level ER is to detect all faces in a group and aggregate the emotional content of the whole scene (image or video). Crowd emotion recognition is a growing research field, as security applications and social media increasingly demand assessments of groups of people of all sizes.

Traditional ER methods are based on handcrafted features, e.g., Shan et al. [1] and Kahou et al. [2]. With the advent of deep learning, rich large-scale datasets, and the computational power of GPUs, the performance of computer vision tasks has improved enormously, and so has individual ER. The best deep learning models extract deeper, more discriminative features than traditional hand-crafted ones. These deep-feature-based ER solutions can classify emotion not only at the group level on single images, but also on videos of individuals or groups.

2) Contributions

Multiple emotions presented by groups: the paper addresses the problem of ambiguous emotions caused by low-resolution faces (faces too small to convey any emotion).

  • A high-performance, low-cost facial expression recognition network, EmoNet, is proposed for extracting individual facial expression features
  • A new fusion mechanism, called Non-Volume Preserving Fusion (NVPF), is proposed to model the feature-level spatial relationships among a set of facial expressions. Unlike previous work that predicts only a single emotion, multiple emotion regions can be clustered
  • For the group-level ER problem on crowd videos, a new dataset, GECV, is introduced

2. Related work

1) Image-based group emotion recognition

  • Scene features are extracted from the whole image as a global representation [12, 13], and facial features are extracted from a given image as a local representation [14-17]. [6, 7, 18, 19] propose hybrid networks based on faces, scene, skeletons, bodies and visual attention to recognize group emotions. Most state-of-the-art methods employ “naive” mechanisms such as averaging [4, 19, 20], concatenation [3], or weighting [4, 5, 18] to combine the global and local representations.

  • Guo et al. introduced a correlation/weighting scheme that uses 7 different CNN-based models trained on different parts of the scene, background, faces and skeletons, and optimizes the combined prediction.

  • Tan et al. [4] constructed three CNN models [21-26] for aligned faces, unaligned faces and the whole image, respectively; each CNN generates a score for each class, and the scores are combined by averaging to obtain the final score.

  • Wei et al. used an LSTM network to model the spatial information between faces. The local information of each face was represented by VGGFace-LSTM and DCNN-LSTM, the global information was extracted with PHOG, CENTRIST, DCNN and VGG features, and score fusion was used to combine the local and global features.

  • Rassadin et al. [5] used a CNN trained for face recognition to extract feature vectors from the detected faces, and a random forest classifier to predict the emotion scores.

  • [19] proposes to use three cues (face, body and global image) and averages the scores produced by three CNNs, one per cue, to obtain the final score.

  • Abbas et al. [20] employ a densely connected network to combine 1×3 subvectors from the scene and 1×3 subvectors from facial features

  • Gupta et al. proposed different weighted fusion mechanisms for local and global information; their attention model operates at the feature level or the score level.

  • Khan et al. proposed to use ResNet-18 and ResNet-34 for small and large faces, respectively, designed as a four-stream hybrid network

In addition to identifying group emotions, group cohesion [27–29], the tendency of a group to come together for a common goal or emotion, can also be predicted . In the EmotiW 2019 challenge, the organizers presented their research on group cohesion prediction in static images [30, 31]. They extended the Group Affect (GAF) database [32] with group cohesion labels and introduced a new GAF cohesion database.

  • In their paper, they used Inception V3 [33] and CapsNet [34] to extract image-level (global) and face-level (local) features to predict group cohesion, respectively.
  • Recently, Mou et al. proposed a framework to predict contextual information from individuals and groups in different environments, using facial and body behavioral cues, with multimodal fusion and long short-term memory (LSTM) networks for temporal modeling of videos.
  • Recently, the emergence of large-scale datasets for crowd counting and localization, such as NWPU-Crowd [36], has helped advance crowd scene understanding, as shown by Wang et al. [37], [38].

2) Video-based individual emotion recognition

  • Kahou et al. [39] combined multiple deep neural networks at EmotiW2013, including deep CNNs, deep belief networks, deep autoencoders and shallow networks for different data modalities, and fused temporal information between frames by averaging score-level decisions.
  • Liu et al. [40] used three types of image-set models at EmotiW2014 (linear subspace, covariance matrix and Gaussian distribution) and studied classification methods such as logistic regression and partial least squares for video sets; temporal information between frames is fused by averaging decisions.
  • Ebrahimi Kahou et al. [41] used an RNN to model temporal information (instead of averaging) at EmotiW2015; an MLP with a separate hidden layer for each modality combines the modalities.
  • Li et al. [42] proposed a common framework to predict emotion using two streams of information from videos: images and audio. For the image stream, a CNN-based network extracts spatio-temporal features from cropped faces and image sequences. For the audio stream, audio features (i.e., low-level descriptors and spectrograms) are extracted to compute fused audio scores; finally, all scores are combined by a weighted sum.
  • Bargal et al. [43] adopted a spatial method for video classification, in which the feature encoding module is based on SSR (Signed Square Root) and L2 normalization; the FC5 features of VGG13, the FC7 features of VGG16 and the pool features of ResNet are concatenated, and the result is finally classified by an SVM.
  • Fan et al. [44] proposed a video-based ER system whose core module is a hybrid network combining an RNN and a 3D-CNN: the two encode appearance and motion information in different ways, with the RNN encoding the motion information at a later stage.
  • Hu et al. [8] proposed the supervised scoring ensemble (SSE), which adds supervision to deep, shallow and intermediate layers. Through a new fusion structure, the classification score activations of different complementary feature layers are concatenated and used as the input of a second level of supervision, which acts as a deep feature ensemble within this single CNN architecture.
  • Recently, Wang et al. [45] proposed to use OpenFace [46], OpenPose [47] and Convolution3D [48] to extract multimodal features from image sequences, i.e., eye gaze, head pose, body pose and motion features, and then to ensemble these models with averaged weights.
  • Recently, Sheng and Li [49] proposed a multi-task network that recognizes identity and emotion from gait.

3) Group emotion recognition based on video

There are only a few studies on crowd analysis and violence analysis. Favaretto et al. [11] proposed a method for predicting the personality and mood of crowds in videos: they detect and track each person, then identify and classify the five personality dimensions (OCEAN) and the emotions in the videos according to the OCC emotion model.

According to the literature review of this paper, most previous works only address group ER and individual ER on videos with simple ensemble/fusion methods. Furthermore, most previous work using facial features can handle neither faces at multiple resolutions nor the presence of multiple group emotions in an image.

3. Method

Pipeline: EmoNet performs expression recognition on each single face → the detected faces are clustered into groups according to their relative spatial distance → the features of each group of faces are vectorized and structured to obtain group-level facial expression features, which serve as the input of NVPF → the fused features of each frame and of the entire video are obtained through the spatio-temporal fusion method Temporal NVPF (this last step is video-specific, so I do not cover it in detail here). A minimal sketch of this flow is given below.
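As a rough orientation, here is a minimal Python sketch of the per-frame flow described above. Every callable (detect_faces, emonet, propose_groups, nvpf, classify) is a hypothetical placeholder, not the authors' code.

```python
import numpy as np

def group_emotion_for_frame(frame, detect_faces, emonet, propose_groups, nvpf, classify):
    """Hypothetical end-to-end flow for one frame; all callables are placeholders."""
    faces = detect_faces(frame)                        # detect + align faces (RetinaFace in the paper)
    feats = [emonet(f) for f in faces]                 # per-face emotion features from EmoNet
    groups = propose_groups(frame)                     # RPN proposals: lists of face indices per region
    predictions = []
    for idx in groups:
        S = np.stack([feats[i] for i in idx], axis=1)  # grouping function G: stack features into S (M x N)
        H = nvpf(S)                                    # NVPF maps the local features S to a group-level feature H
        predictions.append(classify(H))                # match H against per-class Gaussian priors
    return predictions
```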

1) EmoNet

EmoNet is a lightweight, high-performance deep neural network for effective and accurate recognition of individual facial expressions. (In group ER there are a large number of faces to process in one image, so extracting their representations in the feature space with a very deep network would be expensive and inefficient.)

The main design principles of EmoNet:

  • Convolutions are implemented as depthwise separable convolutions [51], which are faster and more memory-efficient (see the small sketch after this list)
  • Network capacity for embedding emotion features is increased via residual-connected bottleneck blocks
  • The spatial dimensions are reduced rapidly in the first few layers, while the number of channels is expanded layer by layer
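To illustrate the first point, the following back-of-the-envelope comparison shows why a depthwise separable convolution is cheaper than a standard convolution; the channel counts are generic examples, not EmoNet's actual layer sizes.

```python
def conv_params(c_in, c_out, k=3):
    """Parameters of a standard k x k convolution."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """Depthwise k x k conv (one filter per input channel) + 1 x 1 pointwise conv."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise

print(conv_params(64, 128))                 # 73728
print(depthwise_separable_params(64, 128))  # 8768, roughly 8.4x fewer parameters
```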

The input is a 112 × 112 × 3 face image, which has been cropped and aligned to remove unnecessary information such as background, hair, etc.

(Figure: EmoNet network architecture table, listing input size, number of blocks (B), operators, stride (S), number of output channels (C) and residual connections (R).)

The composition of the bottleneck block (bottleNet):

  • B1: 1×1 convolutional layer with ReLU activation
  • B2: 3×3 depthwise convolutional layer with stride s and ReLU activation
  • B3: 1×1 convolutional layer

Suppose the size of the input x is w×h×c, then the bottleNet operator can be defined mathematically as:
$$
B(x) = B_3(B_2(B_1(x))) \\
B_1: \mathbb{R}^{w \times h \times c} \rightarrow \mathbb{R}^{w \times h \times tc} \\
B_2: \mathbb{R}^{w \times h \times tc} \rightarrow \mathbb{R}^{\frac{w}{s} \times \frac{h}{s} \times tc} \\
B_3: \mathbb{R}^{\frac{w}{s} \times \frac{h}{s} \times tc} \rightarrow \mathbb{R}^{\frac{w}{s} \times \frac{h}{s} \times c_1}
$$
where t is the expansion factor. The difference between bottleneck blocks with and without a residual connection lies in the stride s: in a block with a residual connection, s is set to 1 so that residual features can be learned; in a block without a residual connection, s is set to 2 to reduce the spatial dimensions.
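A minimal PyTorch-style sketch of such a bottleneck block is given below. The normalization layers and the default expansion factor are assumptions made for illustration; the summary only specifies the 1×1 expansion with ReLU, the 3×3 depthwise convolution with stride s and ReLU, the 1×1 projection, and the residual connection when s = 1.

```python
import torch
import torch.nn as nn

class BottleneckBlock(nn.Module):
    """Sketch of an EmoNet-style bottleneck: 1x1 expand (ReLU) -> 3x3 depthwise (stride s, ReLU)
    -> 1x1 project, with a residual connection only when stride == 1 and shapes match."""
    def __init__(self, c_in, c_out, stride=1, t=6):
        super().__init__()
        hidden = c_in * t                                    # expansion factor t
        self.use_residual = (stride == 1 and c_in == c_out)
        self.block = nn.Sequential(
            nn.Conv2d(c_in, hidden, 1, bias=False),          # B1: 1x1 expansion
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, hidden, 3, stride=stride, padding=1,
                      groups=hidden, bias=False),            # B2: 3x3 depthwise, stride s
            nn.BatchNorm2d(hidden), nn.ReLU(inplace=True),
            nn.Conv2d(hidden, c_out, 1, bias=False),         # B3: 1x1 projection
            nn.BatchNorm2d(c_out),
        )

    def forward(self, x):
        y = self.block(x)
        return x + y if self.use_residual else y
```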

2) Group emotion recognition based on Non-Volume Preserving Fusion (NVPF)

New fusion mechanism: a group of faces is fused nonlinearly through CNN-based multi-layer fusion units. The ultimate goal of this structure is to obtain group-level features in the form of a probability density distribution for emotion recognition. In this way, instead of simply concatenating or linearly weighting them, the separate facial features are naturally embedded into a unified group-level feature by NVPF, thereby improving the performance of emotion recognition in the subsequent steps.

Formally, given a set of N faces $\{f_1, f_2, ..., f_N\}$, the EmoNet structure is first used to obtain their feature representations $x_i = EmoNet(f_i), i = 1..N$. These features are then stacked into a group feature S, $S = G(x_1, x_2, ..., x_N)$, where G is a grouping function (there are many possible choices for G; stacking the emotion features into a matrix $S \in R^{M \times N}$ is one option, and any other choice can easily be adapted to this structure). The grouping function G still treats each $x_i$ separately, so directly using S for emotion recognition amounts to a trivial solution that does not exploit the relationships between the faces in the group. Therefore, to effectively take these relationships into account, S is modeled in the form of a density distribution in a higher-level feature domain H, so that not only the features $x_i$ are modeled, but the relationships between them are also naturally embedded in the distribution represented by H. This mapping from the feature domain S to the new feature domain H is defined as the fusion process; S and H can be regarded as the local and group-level features, respectively. Let F be a nonlinear function that maps $S \in R^{M \times N}$ to $H \in R^{M \times N}$:
$$F: S \rightarrow H, \quad H = F(S; \theta_F)$$
The probability distribution of S can be expressed as:
$$P_S(S; \theta_F) = P_H(H) \left| \det \frac{\partial F(S; \theta_F)}{\partial S} \right|$$
With this formula, computing the density of S is equivalent to estimating the density of H together with the associated Jacobian; because this Jacobian is a triangular matrix, its determinant can be computed efficiently, without explicitly computing the full Jacobian between the two features S and H [53]. By learning such a mapping function F, we obtain a transformation from the local feature S to an embedding H with density $p_H(H)$. This property leads to the following conclusion: if we take $p_H(H)$ as a prior density distribution and choose $p_H(H)$ to be a Gaussian distribution, then F naturally becomes a mapping function from S to a latent variable H with a Gaussian distribution. Therefore, through F, the local features can be fused into a single Gaussian-distributed feature that embeds the information contained in each $x_i$ and in every pair $x_i$, $x_j$.
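As a quick sanity check of the formula above, if $p_H$ is taken to be a standard Gaussian over $H \in \mathbb{R}^{M \times N}$ (the paper later uses class-specific means and standard deviations), the log-density of S unfolds as:

$$\log P_S(S; \theta_F) = -\frac{1}{2}\lVert F(S; \theta_F)\rVert_2^2 - \frac{MN}{2}\log(2\pi) + \log\left|\det \frac{\partial F(S; \theta_F)}{\partial S}\right|$$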

In order to strengthen the nonlinear properties, F is constructed as a composition of nonlinear units $U_{F_i}$, where each unit exploits a certain degree of correlation (i.e., emotional similarity, connection, or interaction) between the facial emotion features of a group of people.
$$F(S) = (U_{F_1} \circ U_{F_2} \circ ... \circ U_{F_N})(S)$$
As shown in the network structure diagram above, by expressing S as a feature map, convolution operations can very effectively exploit the spatial relationships between the $x_i$, and longer-range relationships (e.g., between $x_1$ and $x_N$) can easily be extracted by stacking multiple convolutional layers. Therefore, each mapping unit is built as a combination of multiple convolutional layers, and F becomes a deep CNN with the ability to capture the embedded nonlinear relationships between the faces in a group. Note that, unlike other types of CNNs, the NVPF network is trained based on $p_S(S; \theta_F)$, and its output is the fused group-level feature H. In addition, in order to make the determinant of each unit $U_{F_i}$ easy to compute, the nonlinear unit structure of Duong et al. [54] is adopted, as follows:
$$Y = (1 - b) \odot \left[ S \odot \exp(r_1(S')) + r_2(S') \right] + S'$$
where Y is the output of the fusion unit $U_{F_i}$, $S' = b \odot S$, and b is a binary mask whose first half is all ones and the rest zeros; ⊙ is the Hadamard product. The scale and the shift are referred to as transformations T1 and T2, respectively. In practice, the T1 and T2 functions can be realized by residual blocks with skip connections, similar to the building blocks of residual networks (ResNet) [55]. By stacking the fusion units $U_{F_i}$, the output Y of one unit becomes the input of the next, and so on, which finally yields the mapping F defined above. A small sketch of one such unit follows.
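Below is a minimal PyTorch-style sketch of one such coupling unit. It splits the flattened feature into two halves instead of applying an explicit binary mask b (which is equivalent), and it implements the scale r1 and shift r2 as small MLPs rather than the residual blocks mentioned in the paper; both choices are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

class FusionUnit(nn.Module):
    """Sketch of one NVPF-style coupling unit: keep the first half of the feature,
    scale and shift the second half using functions of the first half."""
    def __init__(self, dim):
        super().__init__()
        assert dim % 2 == 0
        half = dim // 2
        self.r1 = nn.Sequential(nn.Linear(half, half), nn.ReLU(), nn.Linear(half, half))  # scale (T1)
        self.r2 = nn.Sequential(nn.Linear(half, half), nn.ReLU(), nn.Linear(half, half))  # shift (T2)

    def forward(self, s):
        s1, s2 = s.chunk(2, dim=-1)            # s1 plays the role of S' = b ⊙ S
        scale = self.r1(s1)
        shift = self.r2(s1)
        y2 = s2 * torch.exp(scale) + shift     # affine transform of the second half
        log_det = scale.sum(dim=-1)            # triangular Jacobian: log|det| is the sum of the scales
        return torch.cat([s1, y2], dim=-1), log_det
```

Stacking several such units and summing their log-determinants yields the log-determinant term needed in the loss below.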

The parameter θF of the NVPF can be learned by maximizing the log-likelihood or minimizing the negative log-likelihood as follows:
$$\theta_F^* = \arg\min_{\theta_F} \mathcal{L} = \arg\min_{\theta_F} \left[ -\log P_S(S) \right] = \arg\min_{\theta_F} \left[ -\log P_H(H) - \log\left|\det \frac{\partial F(S; \theta_F)}{\partial S}\right| \right]$$
To further enhance the discriminative power of the feature H, a different Gaussian distribution (i.e., a different mean and standard deviation) is chosen for each emotion class during training. After optimizing the parameters $\theta_F$, F is able to transform local features into group-level features and to force these features to match the distribution of the corresponding emotion class. By matching distributions, an emotion classification can be produced for the corresponding group-level feature. For simplicity, only the distributions of three classes (positive, negative and neutral) are considered; however, any number of classes can easily be adopted by changing the class distributions. A hedged sketch of such a class-conditional objective follows.
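The training loss could then look roughly as follows; the per-class means (and the unit variance) are illustrative placeholders, not the values used by the authors.

```python
import math
import torch

def nvpf_nll(H, log_det, labels, class_means):
    """Negative log-likelihood under per-class Gaussian priors (unit variance assumed).
    H: (B, D) fused features, log_det: (B,) summed log|det| over the coupling units,
    labels: (B,) emotion class indices, class_means: (C, D) one prior mean per class."""
    mu = class_means[labels]                                               # (B, D)
    d = H.shape[-1]
    log_p_h = -0.5 * ((H - mu) ** 2).sum(dim=-1) - 0.5 * d * math.log(2 * math.pi)
    return -(log_p_h + log_det).mean()                                     # minimize -log p_S(S)
```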

3) Spatio-temporal group emotion recognition based on Temporal Non-Volume Preserving Fusion (TNVPF)

4. Dataset GECV

5. Experiment

1) Experimental setup

  • Data preprocessing: RetinaFace [57] is first used to detect all faces, and then a similarity transformation to a fixed size of 112 × 112 is used to align all faces to a predefined template based on 5 landmark points (the eyes, the nose and the mouth corners), which are provided by the RetinaFace [57] detector.

  • Each cropped face is passed through EmoNet to obtain the emotion features of that face and to provide an individual ER output.

  • To further classify the emotions of a group of people, we first train a Region Proposal Network (RPN) to provide clustered regions of faces:

    A similar RPN structure to the one in Faster-RCNN [59] is used to propose candidate sub-windows containing a set of faces. The backbone is ResNet-18, of which only the convolutional layers are used to compute 512-d feature maps (the average pooling and FC layers are removed). These feature maps are then fed to the RPN, which consists of a 3 × 3 convolutional layer with ReLU followed by two parallel 1 × 1 convolutional layers, for box regression (reg) and class scores (cls), respectively. The RPN simultaneously predicts k sub-window proposals at each position of the conv feature map: the reg layer provides 4k outputs corresponding to the coordinates of the k sub-windows, and the cls layer gives 2k scores representing the face/non-face probability of each sub-window. Instead of directly predicting the coordinates of the k sub-windows, the k proposals are parameterized relative to k template sub-windows, called anchors.

    At each feature-map position, the template sub-windows have a scale and an aspect ratio; 3 different scales and 3 aspect ratios give a total of a = 9 anchors per position, i.e., W × H × a anchors for a W × H feature map (see the small sketch after this list). The RPN is trained on the collected database described in Section 4, with a training procedure similar to that of Faster-RCNN [59]. The RPN achieves 86.4% mAP on the validation set.

  • For the fusion network NVPF, a ResNet-like [60] architecture is used to implement the nonlinear mapping function F with 10 fusion units $U_{F_i}$. The two transformations T1 and T2 in each fusion unit $U_{F_i}$ are implemented by two residual network (ResNet) blocks with rectified nonlinearities and skip connections. The filter size of the convolutional layers is set to 3 × 3, and the number of filters/feature maps is set to 32. TNVPF has 4096 memory and hidden units. TNVPF is first trained with two time steps and then further extended to five time steps.

  • In the training phase, the batch sizes of EmoNet, RPN, NVPF and TNVPF are set to 512, 256, 64 and 64, respectively. The learning rate starts at 0.1 and the momentum is 0.9. All models are trained with the Adam optimizer [61] in the MXNET environment on a machine with a Core i7-6850K @ 3.6 GHz CPU, 64 GB RAM and four P6000 GPUs. Inference is run on an Nvidia GTX 1080Ti GPU: face detection takes 8 ms, EmoNet takes 4 ms to extract the features of each face, NVPF takes about 0.2 s on average to compute the fused features of each frame/image, and TNVPF takes about 0.5 s on average to predict the emotion category of the whole video; in total, a 10-second full-HD (1280 × 720) video takes about 50 seconds.
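The anchor bookkeeping in the RPN description above can be sanity-checked with a few lines; the scale and aspect-ratio values here are placeholders, not the ones used in the paper.

```python
import itertools

scales = [64, 128, 256]          # placeholder anchor scales in pixels
ratios = [0.5, 1.0, 2.0]         # placeholder aspect ratios (height / width)

anchors_per_position = [
    (s * r ** 0.5, s / r ** 0.5)              # (height, width): area ~ s^2, height/width = r
    for s, r in itertools.product(scales, ratios)
]
print(len(anchors_per_position))              # a = 3 scales x 3 ratios = 9 anchors per position

W, H = 40, 23                                 # example conv feature-map size (placeholder)
print(W * H * len(anchors_per_position))      # W x H x a = 8280 anchors in total
```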

2) Image-based single-person facial expression recognition

AffectNet dataset [64]

3) Image-based group expression recognition

EmotiW

4) Group emotion recognition based on video

Origin blog.csdn.net/qq_44930244/article/details/130277208