【Paper Reading】ConGNN: Context-consistent cross-graph neural network for group emotion recognition in the wild

1. Summary

This blog summarizes the paper ConGNN: Context-consistent cross-graph neural network for group emotion recognition in the wild, published in Information Sciences in 2022, in order to deepen understanding and memory.

2. Group emotion recognition

1) Introduction

Group Emotion Recognition (GER) is a sub-challenge of emotion recognition in the wild, and it has attracted much attention in affective computing and computer vision in recent years. Efficient and robust GER plays an important role in understanding human emotions and analyzing human intentions.

2) Application

GER can be used in a variety of application fields, such as human-computer interaction, behavior and event prediction, smart city construction, etc.

3) Differences from personal emotion recognition (challenges)

GER does not recognize the expression of a single face, but focuses on the emotional state of a group of people in a complex scene, aiming to classify the overall emotion of the group into three categories: positive, neutral and negative. This requires not only an understanding of individual facial expressions, but also contextual information about the image content and scene.

The figure below shows the differences and challenges between traditional facial expression recognition (FER) and GER in the wild. Compared with traditional FER tasks, GER in the wild faces additional challenges, such as undefined multiple emotional cues, complex facial expressions, crowd relations, and emotional bias among different emotional cues.

4) Related work (existing solutions)

① Facial expression-based method

Facial expression-based methods recognize group-level emotions only through the facial expressions of individuals in images, without considering the background.

  • Due to the considerable challenges of multi-person expression recognition in a crowd, early GER methods only analyzed the intensity of positive emotion, i.e., happiness. Hernández et al. [20] calculated and averaged the smile intensity of each individual in the crowd to obtain group-level happiness.
  • Considering the impact of human behavior, Dhall et al. [21] estimated the happiness intensity based on the group structure and local attributes such as occlusion, and achieved a mean absolute error (MAE) of 0.379 on the HAPPEI dataset.
  • Vonikakis et al. [15] used geometric facial features, the distribution of individual expressions, and the importance of each face in the crowd for group-level emotion prediction.

However, the above studies only considered the face-related information for GER, ignoring the rich scene information, which is insufficient to effectively analyze and recognize group emotions.

② Methods based on multiple cues

In recent years, due to the development of deep learning and group emotion datasets, many studies have begun to combine facial expression information with scene context information for GER.

  • In [22], group sentiment is estimated from facial expressions and overall image semantic features.
  • Ghosh et al. [23] utilized facial expression information, scene information, and high-level facial visual attributes for GER.
  • Recently, Guo et al. [24] utilized face and full-scene features coupled with deep CNN for group-level emotion prediction.
  • Huang et al. [25] proposed an information aggregation method to generate face, upper-body, and scene feature descriptions for GER in the wild.
  • Guo et al. [4] proposed a GNN-based model for extracting and fusing multiple emotional information, including scene, facial and object features.

However, despite the positive results of multi-cue strategies, research on multi-cue extraction and fusion in the wild and on the emotional bias among multiple cues is still ongoing.

③ Relational Learning

Relational learning frameworks have been widely used in computer vision and image recognition, such as image re-ranking [26] and sentiment classification [27, 28], which can effectively represent the relationship between objects and models [29, 30].

Currently commonly used relational learning models can be divided into two categories, namely, attention-based methods and graph-based methods [31].

  • [32] proposed a cascaded attention network to exploit the importance of each face in an image and generate a global representation for GER. Since graphs can model the relationships between nodes, GNN-based relational learning has received increasing attention [33, 34].
  • In recent years, more and more GNN-based methods have been used to improve the performance of GER.
  • Guo et al. [4] used GNNs to understand image emotion based on multiple cues. This GNN achieves good GER performance because it models the relationships among the faces, objects and the emotion of the scene.

Although the above methods can help model and learn the relationships between multiple features, they mainly focus on the feature relationships within a single branch. How to fully exploit both inter-branch and intra-branch relationships is still an open research problem.

5) Dataset

In order to develop GER technology, many in-the-wild group emotion datasets have been proposed and constructed in recent years, such as HAPPEI [13], GAF [18], GAF 2.0 [7], GAF 3.0 [2], GroupEmoW [4], etc.

These datasets were crawled from websites such as Google, Baidu, Bing, and Flickr using emotion-related keywords. Due to the difficulty of labeling and acquisition, most of these datasets do not consider geographical location and scene differences, which may greatly limit the practical application of GER technology. Therefore, creating a new GER dataset with geographic variance and in-the-wild information, and developing a more robust and favorable benchmark, is highly necessary for the GER task.

3. ConGNN

1) Using the MFE (multi-cue feature extractor) to extract multi-cue emotional features from different information branches

To obtain multi-cue emotional information from crowd scenes, we introduce three parallel feature extraction branches to extract multiple faces, local objects (including bodies and items in the scene), and global scene features, respectively. Pre-trained DNNs are used as the feature extractors: ResNet50 [36] followed by an LSTM [37] for faces, and SE-ResNet50 [38] for objects and scenes.

  • Facial feature extraction

In the facial feature extraction branch, the standard face detector RetinaFace [39] is first used to detect and crop facial regions (resized to 112 × 112) to construct the face-stream input. These face regions are then passed through the pre-trained ResNet50 [36], fine-tuned on the corresponding GroupEmoW and SiteGroEmo datasets, to extract facial features, and a two-layer LSTM network is further used to learn the dependencies among faces. Formally, suppose the input image is $p$ and the number of detected face regions is $N_1$; the extracted facial expression features $X_1 \in \mathbb{R}^{L_1 \times N_1}$ are given by $X_1 = [x_{11}, x_{12}, ..., x_{1N_1}]$, where $L_1$ is the dimensionality of each facial expression feature.

  • Object feature extraction

For local object extraction, a bottom-up attention model, the ResNet50-FPN detector [40], is first applied to each image to obtain the salient objects (such as human bodies, flowers, and cups) most relevant to group emotion. Local object features are then extracted using SE-ResNet50, which is pre-trained on the ImageNet-1K database and fine-tuned on the corresponding GroupEmoW and SiteGroEmo datasets. Formally, given an image $p$ as input, with $N_2$ detected objects, the object emotional features $X_2 \in \mathbb{R}^{L_2 \times N_2}$ can be written as $X_2 = [x_{21}, x_{22}, ..., x_{2N_2}]$, where $L_2$ is the dimensionality of each object feature.

  • Scene feature extraction

In the global scene extraction branch, we use the pre-trained SE-ResNet50 [38] to extract whole-scene semantic features. The pre-trained model is also fine-tuned on the corresponding GroupEmoW and SiteGroEmo datasets. We obtain the extracted global scene features $X_3 \in \mathbb{R}^{L_3 \times 1}$, where $L_3$ is the dimensionality of the scene semantic features.

After multi-cue feature extraction, the multi-cue emotional representation $X = \{X_1, X_2, X_3\}$ can be passed to the following C-GNN for intra-branch and inter-branch emotional relationship modeling.
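To make the branch outputs and their shapes concrete, here is a minimal PyTorch-style sketch of the three-branch extraction under simplifying assumptions: the class names (`FaceBranch`, `GlobalBranch`, `MultiCueExtractor`) and the 512-dimensional feature size are illustrative, and a plain torchvision `resnet50` stands in for the paper's fine-tuned ResNet50 / SE-ResNet50 backbones.

```python
# Minimal sketch of the three-branch multi-cue feature extraction (MFE).
# Assumptions: a plain torchvision resnet50 stands in for the paper's fine-tuned
# ResNet50 / SE-ResNet50 backbones, and the feature size L1 = L2 = L3 = 512 is illustrative.
import torch
import torch.nn as nn
from torchvision import models


class FaceBranch(nn.Module):
    """Per-face ResNet50 features followed by a two-layer LSTM over the faces."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)  # pre-trained and fine-tuned in the paper
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone
        self.lstm = nn.LSTM(feat_dim, feat_dim, num_layers=2, batch_first=True)

    def forward(self, face_crops: torch.Tensor) -> torch.Tensor:
        # face_crops: (N1, 3, 112, 112), cropped by a face detector such as RetinaFace
        feats = self.backbone(face_crops)          # (N1, L1)
        feats, _ = self.lstm(feats.unsqueeze(0))   # learn dependencies among the faces
        return feats.squeeze(0).t()                # X1: (L1, N1)


class GlobalBranch(nn.Module):
    """Stand-in for the SE-ResNet50 object / scene feature extractor."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        backbone = models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, feat_dim)
        self.backbone = backbone

    def forward(self, crops: torch.Tensor) -> torch.Tensor:
        # crops: (N, 3, H, W) detected object regions, or (1, 3, H, W) for the whole scene
        return self.backbone(crops).t()            # X: (L, N)


class MultiCueExtractor(nn.Module):
    """Return the multi-cue representation X = {X1, X2, X3}."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.face_branch = FaceBranch(feat_dim)
        self.object_branch = GlobalBranch(feat_dim)
        self.scene_branch = GlobalBranch(feat_dim)

    def forward(self, face_crops, object_crops, scene_image):
        X1 = self.face_branch(face_crops)      # faces:   (L1, N1)
        X2 = self.object_branch(object_crops)  # objects: (L2, N2)
        X3 = self.scene_branch(scene_image)    # scene:   (L3, 1)
        return X1, X2, X3
```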

2) C-GNN for emotional relationship learning

Based on the multi-cue emotion representation X, a C-GNN is proposed for emotion relation learning to achieve a robust, comprehensive emotion representation. C-GNN consists of two stages: cross-branch graph construction and group relation learning.

  • Construction of cross-branch graphs

Using the multi-cue emotional features X, we first construct three complete cross-branch graphs for emotional relationship learning, namely the face graph (to learn the relationships among faces), the object-context graph (to establish the local relationships between objects and the global scene), and the scene-context graph (to learn the relationships and interactions among all cues in the scene, including faces, objects and the scene).

Cross-branch graph construction consists of three steps, namely node tensor definition, message aggregation initialization, and graph construction:

Ⅰ. Node tensor definition: Given each feature vector $x_{ij} \in X$ as input, we first use a ReLU function to normalize the input and project it to an initialized node vector $h^0_{ij}$, and then all node vectors of the $i$-th branch are concatenated to form the node tensor $H^0_i$:

$$h^0_{ij} = \mathrm{ReLU}(W_i x_{ij} + b_i), \qquad H^0_i = [h^0_{i1}, h^0_{i2}, ..., h^0_{iN_i}] \in \mathbb{R}^{L_h \times N_i}, \quad i = 1, 2, 3$$

It is worth noting that $W_i$ and $b_i$ are shared among nodes of the same cue type.
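As a hedged sketch of step Ⅰ, the following shows per-branch node initialization with one shared linear projection per cue type; the hidden size (Lh = 256) and the class name `NodeInit` are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class NodeInit(nn.Module):
    """Project every cue feature x_ij of branch i to an initial node vector h0_ij.
    W_i and b_i are shared by all nodes of the same cue type (one Linear per branch)."""
    def __init__(self, in_dims=(512, 512, 512), hidden=256):  # hidden = Lh, illustrative
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(L, hidden) for L in in_dims])

    def forward(self, X):
        # X = (X1, X2, X3), each Xi of shape (Li, Ni)
        H0 = []
        for i, Xi in enumerate(X):
            h = torch.relu(self.proj[i](Xi.t()))  # (Ni, Lh): ReLU(W_i x_ij + b_i) per node
            H0.append(h.t())                      # node tensor H0_i: (Lh, Ni)
        return H0
```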

Ⅱ. Message aggregation initialization

For the node tensors, the message passing (one-way edges) between two arbitrary nodes $a$ and $b$ is expressed as $r(a,b) = \{r^0_{a \leftarrow b}, r^0_{b \leftarrow a}\}$, where $a, b \in \{j\}$ and $a \neq b$, and can be calculated as:

$$r^0_{a \leftarrow b} = W_b h^0_{ib}, \qquad r^0_{b \leftarrow a} = W_a h^0_{ia}$$

Then, the messages passed by all neighboring nodes to a given node $j$ are aggregated as $m^0_{ij}$, forming an initialized message-aggregation tensor $M^0_i = \{m^0_{ij}\}$ with $m^0_{ij} = \sum_l r^0_{j \leftarrow l}$, where $l$ ranges over all neighbor nodes of node $j$ in the graph.
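A minimal sketch of step Ⅱ over a complete graph is given below; note that it assumes a single shared message weight `W_m` for all sender nodes, which simplifies the per-node weights in the formula above.

```python
import torch
import torch.nn as nn


class MessageInit(nn.Module):
    """For every node j of a complete graph, aggregate the messages sent by all other nodes."""
    def __init__(self, hidden=256):
        super().__init__()
        self.W_m = nn.Linear(hidden, hidden, bias=False)  # shared message transform (assumption)

    def forward(self, H0: torch.Tensor) -> torch.Tensor:
        # H0: (Lh, N) node tensor of one graph
        msgs = self.W_m(H0.t())                    # (N, Lh): r_{. <- l} = W h0_l for every sender l
        total = msgs.sum(dim=0, keepdim=True)      # sum over all senders
        M0 = (total - msgs).t()                    # exclude self: m_j = sum_{l != j} r_{j <- l}
        return M0                                  # (Lh, N)
```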

Ⅲ. Construction of cross-branch graph

Based on the node tensors and message aggregation, the face graph, object-context graph, and scene-context graph are constructed in a cross-branch manner:

Face graph: the face node tensor $H^0_1$ and the message-aggregation tensor $M^0_1$ are used to construct the face graph $G(H^0_f, M^0_f)$ with $N_1$ nodes, where $H^0_f = H^0_1$ and $M^0_f = M^0_1$.

Object-context graph: Considering that integrating global scene and local object features helps to suppress the emotional bias between different emotional cues, the local-object node tensor $H^0_2$ is combined with the global-scene node tensor $H^0_3$ to build an enriched object-context node tensor $H^0_c = \{H^0_2, H^0_3\}$. Through message aggregation, the object and global-scene node tensors are aggregated into $M^0_c$, thus constructing the object-context graph $G(H^0_c, M^0_c)$ with $N_2 + N_3$ nodes.

Scene-context graph: The node tensors of the three branches are combined as $H^0_w = \{H^0_1, H^0_2, H^0_3\}$ and aggregated into $M^0_w$, thus constructing the scene-context graph $G(H^0_w, M^0_w)$ with $N_1 + N_2 + N_3$ nodes.
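Putting the three steps together, the sketch below assembles the three cross-branch graphs by concatenating the branch node tensors and re-running message aggregation on each combined graph; the function name `build_cross_branch_graphs` and the re-aggregation on the merged graphs are our interpretation, not a confirmed detail of the paper.

```python
import torch


def build_cross_branch_graphs(H0_list, msg_fn):
    """H0_list = [H0_1, H0_2, H0_3]: node tensors of the face, object and scene
    branches, each of shape (Lh, Ni). msg_fn is a message-aggregation module such
    as MessageInit above; re-running it on each combined graph is an assumption."""
    H_f = H0_list[0]                                  # face graph: N1 nodes
    H_c = torch.cat([H0_list[1], H0_list[2]], dim=1)  # object-context graph: N2 + N3 nodes
    H_w = torch.cat(H0_list, dim=1)                   # scene-context graph: N1 + N2 + N3 nodes
    return {
        "face":           (H_f, msg_fn(H_f)),
        "object_context": (H_c, msg_fn(H_c)),
        "scene_context":  (H_w, msg_fn(H_w)),
    }


# Example usage with the sketches above:
# graphs = build_cross_branch_graphs(NodeInit()((X1, X2, X3)), MessageInit())
```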

  • Group relation learning

Visual relationships have been shown to be key to many computer vision tasks [41]. To obtain the group emotional relationships in complex scenes, a more comprehensive emotional representation of large scenes must be achieved by interpreting and modeling the relationships between different emotional cues in the image. Motivated by this goal, we capture and model the intra- and inter-relationships of different emotional cues in a group via the C-GNN.

① First, a K-layer GRU is used to model the relationships between graph nodes, and the features of each node in each cross-branch emotion graph are updated until learning converges. After K iterations of the GRU (empirically, the number of GRU iterations is set to K = 4), we obtain the updated graph node features $H^K_f$, $H^K_c$, $H^K_w$.
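A sketch of this GRU-based node update is shown below, assuming an `nn.GRUCell` whose input is the aggregated neighbor message and whose hidden state is the node feature; recomputing messages at every one of the K = 4 iterations is our assumption about the update schedule.

```python
import torch
import torch.nn as nn


class GraphGRUUpdate(nn.Module):
    """Update the node features of one complete graph for K GRU iterations."""
    def __init__(self, hidden=256, K=4):
        super().__init__()
        self.K = K
        self.W_m = nn.Linear(hidden, hidden, bias=False)  # shared message transform (assumption)
        self.gru = nn.GRUCell(hidden, hidden)             # input: aggregated message, state: node feature

    def forward(self, H0: torch.Tensor) -> torch.Tensor:
        H = H0.t()                                        # (N, Lh): one hidden state per node
        for _ in range(self.K):
            msgs = self.W_m(H)                            # messages sent by every node
            M = msgs.sum(dim=0, keepdim=True) - msgs      # each node aggregates the other nodes' messages
            H = self.gru(M, H)                            # h^{k+1}_j = GRU(m^k_j, h^k_j)
        return H.t()                                      # updated node tensor H^K: (Lh, N)
```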

② Then, three parallel MLPs are used to learn comprehensive emotional features $O = \{O_f, O_c, O_w\}$ from the updated graph node features.

First, the node features of the face graph and the object-context graph are integrated into the whole scene-context graph through a splicing (concatenation) operation, i.e., $H^k_w = \mathrm{Concatenate}(H^k_f, H^k_c)$.

Then, three multilayer perceptrons, $MLP_f$, $MLP_c$ and $MLP_w$, are used as cross-branch emotion encoders.

We denote the extracted face-branch, context-branch and fused cross-branch emotion representations as $O_f = MLP_f(H^K_f)$, $O_c = MLP_c(H^K_c)$ and $O_w = MLP_w(H^K_w)$, respectively.
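The three parallel encoders might be sketched as follows; mean-pooling over the graph nodes before each MLP is an assumption made here to obtain a fixed-size vector per graph, and the layer sizes are illustrative.

```python
import torch
import torch.nn as nn


def make_mlp(hidden=256, out_dim=128):
    # illustrative two-layer MLP emotion encoder
    return nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class CrossBranchEncoders(nn.Module):
    """MLP_f, MLP_c and MLP_w applied to the updated graph node features."""
    def __init__(self, hidden=256, out_dim=128):
        super().__init__()
        self.mlp_f = make_mlp(hidden, out_dim)
        self.mlp_c = make_mlp(hidden, out_dim)
        self.mlp_w = make_mlp(hidden, out_dim)

    def forward(self, H_f: torch.Tensor, H_c: torch.Tensor):
        # H_f, H_c: (Lh, N) updated node tensors of the face and object-context graphs
        H_w = torch.cat([H_f, H_c], dim=1)      # splice both into the whole scene-context graph
        O_f = self.mlp_f(H_f.mean(dim=1))       # face-branch emotion representation
        O_c = self.mlp_c(H_c.mean(dim=1))       # context-branch emotion representation
        O_w = self.mlp_w(H_w.mean(dim=1))       # fused cross-branch emotion representation
        return O_f, O_c, O_w
```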

③ In addition, we introduce ECL (emotion-consistent learning) with an affective BPF (back-propagation factor) to further let these cross-branch graphs interact, helping the C-GNN alleviate emotion bias and achieve emotion-consistent learning.

Using the multi-cue features X and the C-GNN, we can estimate the emotion of groups in the wild. However, we observe that the C-GNN focuses on modeling the relationships between branches to obtain a comprehensive emotional representation, but ignores the emotional bias between different branches; e.g., the facial expressions and the scene context in the same image may have opposite emotional polarities. This neglect can easily lead to misclassification of emotions in GER.

To this end, we propose a novel ECL mechanism and a corresponding affective BPF to further make these branches interact and help the network achieve consistent learning, thereby alleviating the impact of emotional bias on GER. ECL with the affective BPF includes three graph losses:

  • Face graph loss: $L_f = -\frac{1}{N_f} \sum_{i=1}^{N_f} \sum_{c=1}^{C} \mathbb{1}[c = y_i] \log P_{f_i,c}$
  • Object-context graph loss: $L_c = -\frac{1}{N_c} \sum_{i=1}^{N_c} \sum_{c=1}^{C} \mathbb{1}[c = y_i] \log P_{c_i,c}$
  • Whole scene-context graph loss: $L_w = -\frac{1}{N_w} \sum_{i=1}^{N_w} \sum_{c=1}^{C} \mathbb{1}[c = y_i] \log P_{w_i,c}$

Here, C is the number of categories (emotion categories, object categories, ...), N is the number of detected instances (faces, objects, ...), $\mathbb{1}[c = y_i]$ is a binary indicator, and P is the probability that the faces, objects, and scene are related to the group emotion.

In order to optimize the above three losses in a consistent direction during learning, ECL introduces an affective BPF that constrains graph losses learned in opposite directions, so as to achieve context-consistent learning:

$$BPF = (1 + \lambda \cdot f(y^f_i, y^c_i)) \cdot (L_f + L_c + L_w), \qquad f(y^f_i, y^c_i) = \begin{cases} 0 & \text{if } y^f_i = y^c_i \\ 1 & \text{if } y^f_i \neq y^c_i \end{cases}$$
Here λ is the penalty coefficient, which controls the degree of penalty during learning; f is the penalty indicator function, indicating whether the penalty should be applied; and $y^f_i$, $y^c_i$ are the prediction results (positive, neutral, negative) of the face graph and the context graph. BPF is an adaptive, consistency-oriented learning objective that effectively constrains and guides the face, object-context and whole scene-context graph losses. In summary, training the C-GNN with ECL to recognize group emotion helps ensure that the information from each graph branch is properly attended to and fully learned, resulting in consistent and robust GER.
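A minimal sketch of the ECL objective with the BPF penalty follows; it assumes per-graph, group-level logits of shape (batch, C) and an illustrative λ = 0.5, which simplifies the per-instance graph losses defined above.

```python
import torch
import torch.nn.functional as F


def ecl_bpf_loss(logits_f, logits_c, logits_w, labels, lam=0.5):
    """logits_*: (B, C) emotion logits of the face, object-context and scene-context graphs;
    labels: (B,) group-emotion labels. lam is the penalty coefficient (illustrative value)."""
    l_f = F.cross_entropy(logits_f, labels, reduction="none")  # face graph loss
    l_c = F.cross_entropy(logits_c, labels, reduction="none")  # object-context graph loss
    l_w = F.cross_entropy(logits_w, labels, reduction="none")  # scene-context graph loss

    # f(y_f, y_c): 1 when the face graph and the context graph predict different emotions
    disagree = (logits_f.argmax(dim=1) != logits_c.argmax(dim=1)).float()

    # BPF = (1 + lambda * f) * (L_f + L_c + L_w), averaged over the batch
    return ((1.0 + lam * disagree) * (l_f + l_c + l_w)).mean()
```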

④ Prediction

For prediction, we only use the fused cross-branch emotion representation $O_w$ to predict the group emotion. The cross fusion incorporates all emotional cues and can serve as the comprehensive emotional representation of a group for prediction. We use the Softmax operation to predict the emotion class probabilities:

$$P_c = \frac{e^{W_c \cdot O_w + b_c}}{\sum_{c=1}^{C} e^{W_c \cdot O_w + b_c}}$$

where $P_c$ is the predicted probability of emotion class $c$, $C$ is the number of emotion classes, $W_c$ is the $c$-th row of the network weight matrix $W$, and $b_c$ is the $c$-th element of the bias vector $b$.
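For completeness, a sketch of the prediction head on $O_w$: the linear layer plus Softmax corresponds directly to the formula above, while the dimensions and class name `EmotionHead` are illustrative.

```python
import torch
import torch.nn as nn


class EmotionHead(nn.Module):
    """Predict the group-emotion probabilities P_c from the fused representation O_w."""
    def __init__(self, in_dim=128, num_classes=3):  # positive / neutral / negative
        super().__init__()
        self.fc = nn.Linear(in_dim, num_classes)    # rows of W and entries of b are W_c, b_c

    def forward(self, O_w: torch.Tensor) -> torch.Tensor:
        # P_c = exp(W_c . O_w + b_c) / sum_c exp(W_c . O_w + b_c)
        return torch.softmax(self.fc(O_w), dim=-1)
```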

4. Experimental datasets

To comprehensively evaluate the proposed ConGNN approach, extensive experiments are conducted on two challenging group emotion datasets, GroupEmoW [4] and SiteGroEmo. SiteGroEmo is a new, more realistic benchmark collected and labeled by the authors of the paper.

1) SiteGroEmo: New GER dataset

The newly established SiteGroEmo is a group-level emotion dataset consisting of 10,034 in-the-wild images collected from different tourist attractions around the world. The dataset contains rich geographical information and variation, and can be used for several downstream tasks and real-world applications, such as GER, place-emotion extraction and travel recommendation. Each image in the dataset is labeled with one of the negative, neutral and positive emotion categories; the numbers of negative, neutral and positive images are 1,019, 4,355 and 4,660, respectively.

  • Data collection

To build a group-level emotion dataset in the wild, we collect a large number of user-generated images from social networking sites, namely the Flickr and Weibo platforms. These images, depicting various human emotions, come from travel destinations in China, Japan, Korea, Thailand, the United States and elsewhere. We also developed a crawler program to collect these high-definition images from the Internet as a source of in-the-wild facial expression samples. After scraping the data, we manually removed images containing fewer than two people, retaining group emotion images for hundreds of attractions around the world. In the end, we collected about 15,000 images from hundreds of travel sites, covering various locations, social contexts, and events.

  • Data annotation

In the SiteGroEmo dataset, each photo was labeled as negative, neutral, or positive by five annotators. We developed a software tool called ExpreLabelTool to help the annotators label efficiently. To ensure the professionalism of the annotation, we selected five annotators trained in emotional knowledge to annotate the collected images. If more than three annotators gave the same emotion annotation to an image, the image was kept with that annotation; otherwise, the image was eliminated. Finally, the dataset contains 10,034 images. For evaluation, the SiteGroEmo dataset is divided into training, validation and test sets with 6,096, 1,972 and 1,966 images, respectively. Figure 7(a) of the paper shows some example images from different travel sites in the SiteGroEmo dataset.
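The retention rule can be sketched as a simple agreement filter; the exact agreement threshold is our reading of the description above and may differ from the paper.

```python
from collections import Counter


def keep_label(votes, min_agree=4):
    """votes: the five annotators' labels for one image, e.g.
    ["positive", "positive", "neutral", "positive", "positive"].
    Keep the image with the agreed label only when at least min_agree annotators
    chose it (min_agree=4 reads "more than three" literally; a simple majority of
    three is also plausible), otherwise discard the image."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


# Example: this image is kept with the label "positive"
print(keep_label(["positive", "positive", "neutral", "positive", "positive"]))
```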

2) GroupEmoW database

GroupEmoW [4] is a public GER dataset containing 15,894 images, divided into training, validation and test sets with 11,127, 3,178 and 1,589 images, respectively. The photos were collected from Google, Baidu, Bing, and Flickr by searching keywords related to social events such as funerals, birthdays, protests, conferences, and weddings. The collective emotion of each image is labeled as negative, neutral or positive valence.

5. Experiment and result analysis
