【Paper Reading】Automatic emotion recognition for groups: a review

Summary

This blog post summarizes the main content of the paper Automatic emotion recognition for groups: a review, published by IEEE in 2021, in order to deepen understanding and memory.

1. Introduction

Applications of crowd sentiment monitoring include surveillance, automatic image and video annotation, and event detection [1], [2]. Current crowd monitoring methods are often resource-intensive and rely on human attention [3], [4]. Emotion underlies group behavior [4], [5], and is preferably monitored in real time, so that it can be used for prediction and, when necessary, intervention [6].

When defining group sentiment: in general, group sentiment consists of two parts, a bottom-up part (individuals and their emotions) and a top-down part (environmental and group-level information [1], [8]-[10]) [7]. Group sentiment can then be defined as a shared state within a group that results from bringing these two parts together [11]. In affective computing, the goal is to classify these group sentiments as accurately as possible.

Detecting group sentiment is more complex than detecting individual sentiment, since estimating a group's sentiment is not necessarily the same as simply combining the sentiments of all individuals in the group [12]-[14]. A user survey [9] showed that both local and global features play a role in the decision-making process of human annotators. It follows that, besides simply averaging individual sentiments, other techniques should be considered for sentiment detection at the group level [15], [16].

2. Article selection method

Initial article search (exploration) → dense query extraction (DQE) among the found articles → filtering

3. Crowd Types and Sentiment Models

1) Group type

  • A study investigating the emotions of students in a classroom [23]

  • The study in [24] combined social-gathering images with images of people in explicitly uncontrolled environments

  • [25] contains videos that also show groups in uncontrolled settings

Compared to studies using only images of social gatherings (which contain many posed shots), this does not change the relationships between group members, but introducing uncontrolled environments may lead to more dynamic settings with more defined interactions.

  • [26] considers outdoor videos, and [27] considers both outdoor and indoor video clips. These videos show people in natural environments, such as walking down the street or exercising
  • In [4], the emotions during a stampede after a boxing match were studied
  • [28] and [29] look at sporting events and other events that gather crowds, such as riots
  • [30] investigates different city events, from riots to celebrations

2) Emotional model

① Discrete emotion labels [31]

  • In 1971, Ekman and Friesen described six basic, universal emotions [31], namely happiness, sadness, anger, surprise, disgust, and fear
  • In [4], four emotions are selected: anger, fear, happiness, and sadness. The authors refer to [33] to justify the omission of disgust and surprise
  • In [26] (the MED dataset), surprise and disgust are also omitted, replaced by excitement and neutral
  • [23] uses Ekman's classification, omits disgust, and adds a neutral category
  • [20] considered smiling, surprised and neutral
  • [28] considered joy, anger and neutrality
  • [29] considered pro, con and neutral
  • [16] considered interest (and indirectly boredom, considered the opposite of interest)

② Arousal-valence emotional plane [32]

  • The arousal-valence (AV) emotional plane was first proposed by Russell in 1980; emotional states are represented by continuous values along two axes. Arousal ranges from low to high and valence from negative to positive, with a neutral value in between on each axis.
  • [34] predicted arousal and valence, each discretized into three values: arousal can take high, medium, and low values, while valence can take positive, neutral, and negative values (see the sketch after this list)
  • [27] also considered arousal and valence, and used the full arousal valence plane (with 10 steps on each axis) to plot crowd sentiment curves
  • In [21] and [22], only arousal is considered and the valence dimension is ignored. This can be explained by their recognition method, which relies on the audience's physical responses to (audiovisual) stimuli
  • [30] and [25] are two studies that use the valence dimension without using the GAFF dataset
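To make the three-level discretization concrete, below is a minimal Python sketch in the spirit of [34]. The [-1, 1] value range and the ±0.33 cut points are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of discretizing the continuous arousal-valence plane
# into three levels per axis, in the spirit of [34]. The [-1, 1] range and
# the +/-0.33 thresholds are assumptions for illustration only.

def discretize_av(arousal: float, valence: float) -> tuple[str, str]:
    """Map continuous AV coordinates in [-1, 1] to discrete labels."""
    def bin3(x: float, labels: tuple[str, str, str]) -> str:
        if x < -0.33:
            return labels[0]   # low end of the axis
        if x > 0.33:
            return labels[2]   # high end of the axis
        return labels[1]       # neutral band in between

    return (bin3(arousal, ("low", "medium", "high")),
            bin3(valence, ("negative", "neutral", "positive")))

# Example: a calm but mildly pleasant crowd state.
print(discretize_av(-0.5, 0.4))  # -> ('low', 'positive')
```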

③ A measure of emotion intensity (happiness)

Studies using this measure predicted happiness intensity scores ranging from 0 (neutral) to 5 (excited).

4. Dataset

1) Group sentiment dataset

2) Single-person facial expression recognition data set

3) Video dataset

4) Other modality datasets

5. Method

In [7], Barsade and Gibson describe bottom-up and top-down components (sometimes referred to as local context and global context, respectively) as the constituents of group sentiment. We first consider bottom-up research methods, then top-down methods, and finally methods that combine the two, known as hybrid methods.

1) Bottom-up approach

  • Huang et al. [62] proposed a Riesz-based volumetric local binary pattern (RVLBP) as a face descriptor and used continuous conditional random fields to model group emotion, incorporating the sizes of faces and their relative distances
  • In some studies, individual facial features are fed directly into a classifier to predict group sentiment, rather than predicting individual sentiments first. The classifiers employed are either a combination of non-neural classifiers [63] or a combination of long short-term memory (LSTM) and dense layers [64]
  • In [65], neural networks are used for emotion prediction on individual faces, and the individual predictions are fused to form a group prediction (a minimal sketch of this scheme follows this list)
  • In [67], individual faces are fed to multiple CNNs for individual emotion prediction. These predictions are then combined into heatmap images (one heatmap per face), which are in turn fed to a CNN for the final group sentiment prediction
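To make the per-face fusion idea of [65] concrete, here is a minimal sketch: each detected face yields an emotion distribution from some face-level classifier, and the individual predictions are fused (here by simple averaging) into a group prediction. The emotion set and all probability values are made up for illustration; the actual detector and classifier from [65] are not reproduced.

```python
# A minimal sketch of bottom-up fusion as in [65]: per-face emotion
# distributions (softmax outputs of some face-level classifier) are
# averaged into a single group-level prediction.
# The emotion set and all probabilities below are illustrative assumptions.
import numpy as np

EMOTIONS = ["positive", "neutral", "negative"]

def predict_group_emotion(face_probs: np.ndarray) -> str:
    """face_probs: (num_faces, num_emotions), one softmax row per face."""
    group_probs = face_probs.mean(axis=0)  # fuse the individual predictions
    return EMOTIONS[int(group_probs.argmax())]

# Example: three faces, two clearly positive, one ambiguous.
faces = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.2, 0.5, 0.3]])
print(predict_group_emotion(faces))  # -> positive
```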

2) Top-down approach

  • The study in [43] gives the baseline for the group-level emotion recognition (GReco) sub-challenge of EmotiW 2016. Features are extracted from images using the CENTRIST descriptor, and these are then used for classification via support vector regression.
  • Video data was used in [26], which proposed a 3D CNN for learning high-level spatio-temporal features for emotion detection. The third dimension of the CNN, the temporal dimension, implicitly captures motion (a minimal sketch follows)
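To illustrate how a third, temporal convolution dimension captures motion, here is a minimal PyTorch sketch of a 3D CNN. The layer sizes, clip length, and class count are assumptions for demonstration, not the architecture from [26].

```python
# A minimal 3D CNN sketch: 3D convolutions slide over (time, height, width),
# so the temporal dimension implicitly captures motion across frames.
# All sizes below are illustrative assumptions, not the model of [26].
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    def __init__(self, num_classes: int = 3):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1),   # RGB clip -> 16 maps
            nn.ReLU(),
            nn.MaxPool3d(2),                              # halve T, H, W
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),                      # global spatio-temporal pooling
        )
        self.classifier = nn.Linear(32, num_classes)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, channels, frames, height, width)
        x = self.features(clip).flatten(1)
        return self.classifier(x)

# Example: one RGB clip of 16 frames at 64x64 resolution.
logits = Tiny3DCNN()(torch.randn(1, 3, 16, 64, 64))
print(logits.shape)  # torch.Size([1, 3])
```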

3) Hybrid approach

Most studies combine face hierarchy analysis with scene hierarchy analysis.

  • [9] studies traditional (non-deep) methods for both faces and scenes
  • Rassadin et al. [71] combined CNN-based face feature extraction with CNN-based scene-level analysis
  • Surace et al. [72] used scene descriptors as nodes in a Bayesian network, in addition to CNNs for the faces (whose outputs are also used as inputs to the Bayesian network)
  • A study [73] used CNNs for both faces and scenes, but added two classical descriptors (besides the CNNs) for the whole image when no faces were detected
  • Studies [1], [15], [74], [75], [76], and [77] combine faces and scenes using neural networks
  • Khan et al. [78] distribute these two aspects (face and scene) over four different streams. They trained the network on faces, on images with an attention heatmap for each face, on images with the faces blurred, and on entire images without additions
  • In [10], separate neural networks are trained for facial features and facial expressions, followed by CNN scene analysis.
  • In the work of [79], not all images are fed to a scene-level CNN. First, face-level CNNs and SVMs distinguish between positive and non-positive (combined neutral and negative) images; according to the authors, based on a survey of the dataset, positive emotions are the easiest to distinguish. Only the non-positive images are then fed to a scene-level CNN, which classifies them more carefully as neutral, negative, or (the harder-to-distinguish) positive. A sketch of this cascade follows.
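Below is a minimal sketch of such a two-stage cascade. The face-level and scene-level models are stand-in callables (not the actual SVM and CNN from [79]), and the 0.5 threshold is an illustrative assumption.

```python
# A minimal sketch of the cascade described for [79]: a face-level classifier
# first separates out positive images; only non-positive images go to a
# scene-level classifier for the finer decision. Both models are stand-ins.
from typing import Callable, Dict

def cascade_predict(image,
                    face_level: Callable[[object], float],
                    scene_level: Callable[[object], Dict[str, float]]) -> str:
    # Stage 1: face-level check for the easy-to-distinguish class.
    if face_level(image) > 0.5:          # assumed P(positive) threshold
        return "positive"
    # Stage 2: scene-level model refines the remaining, harder images.
    scores = scene_level(image)          # e.g. {"neutral": .., "negative": .., "positive": ..}
    return max(scores, key=scores.get)

# Example with dummy stand-in models:
pred = cascade_predict(
    image=None,
    face_level=lambda img: 0.2,
    scene_level=lambda img: {"neutral": 0.5, "negative": 0.4, "positive": 0.1},
)
print(pred)  # -> neutral
```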

Some hybrid studies incorporated additional information alongside or in place of faces and scenes:

  • In the studies of [35] and [34], upper body information was also exploited, using traditional methods
  • Body analysis is also utilized in [80], and skeleton analysis is added in [81], using deep learning methods
  • Faces, scenes, and skeletons are also analyzed using CNNs in [82], where at the face level the CNN output is fed to an LSTM, and at the scene level an attention mask is placed on the image
  • Attention is also applied in [2], alongside faces, scenes, and skeletons, by feeding a CNN and LSTM with 16 salient regions, including visual attention (a salient region found by a neural attention mechanism to be important for emotion detection)
  • In the work of [?], the same aspects are considered, with visual attention replaced by objects. Each aspect is first fed to a CNN, and the resulting features are used as nodes in a fully connected graph. This graph is then updated over a number of time steps, enabling features from different aspects to interact.
  • The work of [83] also presents an analysis of objects for group sentiment detection (via CNN)
  • [24] conducted a comparative study in which neural networks were used to extract group emotions from faces, scenes, and places. These results are not fused but compared, to investigate the performance of face-based methods (face, scene) versus non-face methods (place recognition for the scene)

4) Fusion method

① Fusion of different aspects (face, scene, skeleton...)

  • [79] first analyzes the image based on its individual faces and, depending on those predictions, conditionally feeds it to a scene-level network
  • Nagarajan and Oruganti [24] aim to compare different modalities rather than fuse them together
  • Individual predictions are fused in a weighted manner in [81], [2], [78], [71], and [77] (see the sketch after this list)
  • Combinations of individual predictions are also used in [34], [80], and [42], the latter employing majority voting
  • In [1], [69] and [10], one or more fully connected layers for fusing various aspects are employed
  • LSTM for feature fusion is adopted in [15] and [70]
  • [83], [82], and [35] used SVMs for fusion, the latter with a modified localized multiple-kernel learning (MKL)
  • MKL is also proposed for fusion (and prediction) in [9]. Other fusion methods used are a cascaded KNN [75], a fusion network [73], a Bayesian network [72], and a combination of weighted feature fusion (fusing local information with global information) and a random forest (fusing the fused features from different CNNs) [76]
  • In [74], features are concatenated, but the steps leading to the common sentiment prediction are not described
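As an illustration of weighted decision-level fusion across aspects (the approach used in, e.g., [81], [2], [78], [71], [77]), here is a minimal sketch. The aspect names, probability vectors, and weights are made-up values; in practice the weights would be tuned on validation data or learned.

```python
# A minimal sketch of weighted fusion of per-aspect emotion predictions.
# All numbers below are illustrative assumptions, not values from any paper.
import numpy as np

def fuse_aspects(aspect_probs: dict[str, np.ndarray],
                 weights: dict[str, float]) -> np.ndarray:
    """Weighted sum of per-aspect emotion distributions, renormalized."""
    fused = sum(weights[name] * probs for name, probs in aspect_probs.items())
    return fused / fused.sum()

probs = {
    "face":     np.array([0.6, 0.3, 0.1]),  # positive / neutral / negative
    "scene":    np.array([0.2, 0.5, 0.3]),
    "skeleton": np.array([0.4, 0.4, 0.2]),
}
weights = {"face": 0.5, "scene": 0.3, "skeleton": 0.2}  # assumed weights
print(fuse_aspects(probs, weights))
```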

② Fusion of different instances of the same aspect (e.g., fusion across faces)

A rough distinction can be made between feature-level fusion and decision-level fusion. With feature-level fusion, the features of different individuals are fused before classification; with decision-level fusion, each individual gets its own classification, and these results are then combined. A schematic contrast follows.
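The sketch below contrasts the two levels under simple assumptions: a stand-in classifier, one feature vector per face, and plain averaging as the fusion operator in both cases.

```python
# A schematic contrast of feature-level vs. decision-level fusion.
# The classifier is a stand-in and the face features are dummy data.
import numpy as np

def classify(x: np.ndarray) -> np.ndarray:
    """Stand-in classifier: returns a (fake) 3-class emotion distribution."""
    logits = np.array([x.sum(), x.mean(), -x.sum()])
    e = np.exp(logits - logits.max())
    return e / e.sum()

face_feats = [np.random.rand(8) for _ in range(4)]  # 4 faces, 8-dim features

# Feature-level fusion: fuse the features first, classify once.
feature_level = classify(np.mean(face_feats, axis=0))

# Decision-level fusion: classify each face, then fuse the decisions.
decision_level = np.mean([classify(f) for f in face_feats], axis=0)

# The two results generally differ, since classification is nonlinear.
print("feature-level :", feature_level)
print("decision-level:", decision_level)
```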

Decision level fusion:

Some methods take the mean of all individuals

  • as in [62], where individual weighted happiness intensities (the label with the highest probability for each face) are averaged
  • Averages are also taken in [2], [72] and [77], where the sentiment category with the highest probability is chosen as the group sentiment
  • In [66], each face gets a confidence score for each happiness intensity. To obtain the final happiness prediction, the sum of each possible intensity multiplied by its corresponding confidence score is rounded to the nearest intensity (a minimal sketch follows)
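A minimal sketch of this rounding scheme, using the 0-5 happiness intensity scale described earlier; the confidence scores below are made up for illustration.

```python
# A minimal sketch of the rounding scheme described for [66]: the expected
# intensity (sum of intensity * confidence) is rounded to the nearest
# integer intensity on the 0-5 happiness scale. Scores are made up.
import numpy as np

def predict_intensity(confidences: np.ndarray) -> int:
    """confidences: length-6 scores over intensities 0..5, summing to 1."""
    intensities = np.arange(6)
    expected = float(np.dot(intensities, confidences))
    return int(round(expected))

face_conf = np.array([0.05, 0.10, 0.20, 0.40, 0.20, 0.05])
print(predict_intensity(face_conf))  # expected value 2.75 -> 3
```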

  • Ghosh et al. [10] perform group-level pooling on faces
  • Guo et al. [42] adopted majority voting
  • Other studies typically weight faces according to their importance in the overall image; such weighting schemes are used in [38], [81], [78], [61], and [65]

Other research takes a machine-learning approach to fusion:

  • In the work of [60], different fusion techniques were tried; feeding the mean and the distribution of the individual sentiments in an image into an MLP to produce the final group sentiment led to the best performance.
  • In [67], a heatmap is generated for each face by multiple neural networks, indicating face size and emotional intensity. The outputs are averaged for each face, and the heatmaps are then superimposed per image (each heatmap at the original face location). The superimposed image, along with a heatmap of all individuals, is fed to a CNN for the final group sentiment prediction.
  • Cerekovic [69] feeds individual sentiment predictions, along with information about their location and size, to an LSTM for group sentiment classification. Since LSTMs take inputs sequentially, this also solves the problem of a varying number of individuals per image (see the LSTM sketch after this list).
  • A Bayesian approach is adopted in [20], where individual emotions (combined from facial and voice features) affect group emotions through a Bayesian network.
  • Decision-level fusion can also use weights produced by a neural network: in [79], [74], and [80], each emotion is assigned such a weight. In studies using social media messages, features are likewise extracted at the individual level and must therefore be pooled to obtain group sentiment.
  • Gong et al. [30] did not incorporate individual sentiments directly, but computed the error of group sentiment estimates derived from individual sentiments.
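To illustrate the LSTM-based fusion in the spirit of [69], here is a minimal PyTorch sketch. The per-face feature layout (three sentiment probabilities plus location and size) and all layer sizes are assumptions for illustration, not the configuration from the paper.

```python
# A minimal sketch in the spirit of [69]: per-face inputs (sentiment
# prediction plus location and size) are fed sequentially to an LSTM,
# whose final hidden state classifies the group. Sizes are assumptions.
import torch
import torch.nn as nn

class GroupLSTM(nn.Module):
    def __init__(self, face_dim: int = 7, hidden: int = 32, num_classes: int = 3):
        super().__init__()
        self.lstm = nn.LSTM(face_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, faces: torch.Tensor) -> torch.Tensor:
        # faces: (batch, num_faces, face_dim); the sequential LSTM naturally
        # handles a varying number of faces per image.
        _, (h_n, _) = self.lstm(faces)
        return self.head(h_n[-1])

# Example: one image with 5 faces, each described by a 7-dim vector
# (assumed layout: 3 sentiment probabilities + x, y, width, height).
logits = GroupLSTM()(torch.randn(1, 5, 7))
print(logits.shape)  # torch.Size([1, 3])
```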

Feature level fusion:

Features are combined in some way before being fed to the classifier

  • Liu et al. [75] used a simple average of the face features (see the pooling sketch after this list)
  • [63] average individual embeddings to obtain an image feature
  • In [21] and [22], group arousal is calculated as the ratio of responding individuals to all individuals, which is similar to taking the average of a binary response variable (given some threshold determining whether an individual responds)
  • Rassadin et al. [71] construct feature vectors by taking the median of all individual face features
  • [76] takes a weighted average, with weights depending on face size
  • A visual vocabulary is used in [9]: each face in the image is represented by words from a dictionary of fixed size, which solves the problem of a variable number of faces per image
  • Huang et al. [35] proposed Information Aggregation (INFO) to stack individual features
  • Balaji and Oruganti [83] experimented with VLAD [84] encoding and Fisher Vector encoding [85]
  • In [36], an SVM with a combined global alignment kernel is proposed for classification, making it independent of the number of faces
  • In [15], an LSTM is used for the face and scene features; the five largest faces are used, with zero padding when fewer than five faces are detected.
  • Other studies exploit the sequential nature of LSTMs to acquire variable numbers of faces, such as [70], [82], [64] and [73]
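A minimal sketch of the simple pooling variants from this list: mean ([75], [63]), median ([71]), and a face-size-weighted mean ([76]). The embeddings, embedding dimension, and face sizes are dummy data.

```python
# Minimal feature-level pooling sketches: mean, median, and size-weighted
# mean of per-face embeddings. All data below are illustrative dummies.
import numpy as np

face_embeddings = np.random.rand(6, 128)  # 6 faces, 128-dim embeddings
face_sizes = np.array([40, 90, 30, 120, 60, 50], dtype=float)  # e.g. bbox areas

mean_feat = face_embeddings.mean(axis=0)            # simple average ([75], [63])
median_feat = np.median(face_embeddings, axis=0)    # per-dimension median ([71])
w = face_sizes / face_sizes.sum()
weighted_feat = (w[:, None] * face_embeddings).sum(axis=0)  # size-weighted ([76])

# Each variant yields a fixed-length vector regardless of the face count,
# ready for a group-level classifier.
print(mean_feat.shape, median_feat.shape, weighted_feat.shape)
```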

6. Performance

1) GAFF

2) An interesting finding is that most of the best performances (where comparisons are possible) are achieved by hybrid approaches. Hybrid methods have existed since 2015 and grew in popularity over the following years. Until 2019 they were used more frequently than bottom-up and top-down approaches; in 2019 they were used as frequently as bottom-up approaches.

7. Applicability

1) Groups of people in the real world have the following characteristics:

  • The composition of a group may change, as people join or leave the group being considered.
  • The emotions of a group may be heterogeneous, containing different emotional subpopulations.
  • The mood of a group changes over time.
  • The behavior of group members may change for reasons other than emotional ones. For example, voices emerging from a group may appear or disappear without emotional reason.

2) The method used to identify the sentiment of small groups is different from the method used to analyze large groups:

  • Techniques that focus on individuals through face or dialogue analysis may not work at a larger scale, where individuals (their faces or their voices) get lost in the crowd. Another factor to consider is the computational complexity of such individual-based approaches when applied to large crowds.

  • Conversely, the opposite pitfall can arise when large gatherings are considered: a model trained to recognize the movement of large crowds, or the sounds crowds make, will likely fail to recognize the movements and sounds of small gatherings.

3) A recent review by Dudzik et al. [87] raises the problem that annotators (perceivers of emotional data) can be biased when interpreting others' emotions.

8. Future work

  • An issue that has not been adequately addressed in research is robustness to changes in the data that are not caused by emotional changes, a characteristic of real-world groups
  • In order for research to progress towards a more realistically applicable framework, datasets should be created so as to reduce any bias arising from the data collection itself
  • Explore hybrid approaches and multimodality further
  • Improve the real-world applicability of current methods, i.e., develop systems that can cope with real-world group characteristics

Suggestions:

  • Changing group sizes: a challenge for future work lies in establishing methods that are flexible with respect to group size, such as automatically detecting the group size and adapting the analysis network accordingly
  • A real-world feature not yet incorporated into current studies is the presence of distinct emotional subgroups. In a real-world group, several smaller subgroups may each share a common emotion that differs from that of the other subgroups. Future work can therefore focus on finding and analyzing these emotional subgroups.
  • Only three studies in this review analyzed changes in emotion over time, so temporal analysis and the challenges it poses can be addressed in future work. Temporal analysis helps detect patterns that cannot be found in data from a single point in time, which in turn helps predict sentiment.
