
[Paper Reading] An intelligent system for monitoring students' engagement in large classroom teaching through facial expression recognition

Summary

This blog post summarizes the paper An intelligent system for monitoring students' engagement in large classroom teaching through facial expression recognition, published in Expert Systems (Wiley) in 2021, in order to deepen understanding and memory of its main content.

1. Introduction

1) Large-scale offline classroom management systems can relieve teachers of heavy activities such as attendance tracking, collecting classroom feedback, and monitoring student participation or attention, thereby improving teaching effectiveness. This has become an active and challenging research area in recent years.

2) Student participation in classroom learning is essential, as it improves overall classroom learning quality and academic progress (De Villiers & Werner, 2016). Student disengagement is a growing problem, caused by factors such as short attention spans, lack of teacher-student interaction, and imperfect teaching methods (Bradbury, 2016; Lamba et al., 2014). Large offline classrooms (with more than 60 students) can exacerbate this problem. Experienced teachers can monitor student engagement by observing student behavior and interactions in small classes. However, even experienced teachers face difficulties as class sizes increase and, due to human limitations, cannot scale their observation beyond a threshold number of students (Exeter et al., 2010). In addition, in many universities (especially higher education institutions), not all teachers are experienced teaching experts; they often have little or no training or allocated time to improve the teaching skills needed to increase student engagement. For teachers who want to improve their teaching, several challenges remain, including a lack of opportunities for adequate feedback on their teaching skills. Currently, the most effective practice for this type of professional development is to hire human experts to observe one or more lectures and provide individualized formative feedback to the lecturer. This is, of course, expensive, not scalable, and, more importantly, precludes a continuous learning feedback loop for the teacher. Therefore, the automated approach for student engagement monitoring proposed in this study can support the professional development of novice teachers at scale, and can also potentially help experienced teachers assess and improve student engagement and the overall teaching process in large-class teaching.

3) In the educational research literature, student engagement is defined as having multiple dimensions and components. Fredricks et al. (2004) defined it along three dimensions: behavioral, affective (emotional), and cognitive engagement. Behavioral engagement describes observable behaviors during learning, such as maintaining a correct body posture and taking notes. Emotional engagement describes positive and negative emotional responses to learning, such as attention, boredom, and frustration. Cognitive engagement leads to learning that enhances cognitive abilities, including problem solving, knowledge, and creative thinking. According to Li and Lerner (2013), behavioral and emotional engagement are bidirectionally related. Furthermore, behavioral engagement influences cognitive engagement, an important outcome of the learning process. Sathik and Jonathan (2013) statistically demonstrated that students' facial expressions are significantly correlated with their behavior and emotional state, which helps to identify their level of engagement in a lecture.

4) Whitehill et al. (2014) divided student engagement evaluation methods into three categories: manual, semi-automatic, and automatic methods.

① Manual methods include paper- or computer-based self-reports (Haddad, 2014), survey-based methods such as the National Survey of Student Engagement (NSSE) (Kuh, 2003) and the Student Engagement Instrument (SEI) (Appleton et al., 2006), and observation checklist and rating scale methods (Odiri Amatari, 2015; Dzelzkaleja & Kapenieks, 2016). These methods remain laborious, tedious, intermittent, and susceptible to bias.

② Semi-automatic methods include knowledge-tracking and physiology-based methods.

  • In knowledge tracking, teachers assess student engagement by evaluating students' responses to questions asked during instruction; automated systems (Griol et al., 2017; Mogwe, 2018) have been used to perform this method with less effort.
  • In the physiology-based approach, student engagement is estimated by processing physiological signals, such as brain signals (electroencephalogram, EEG) and heart signals (electrocardiogram, ECG), captured by wearable devices such as fitness wristbands and electrodermal activity sensors (Di Lascio et al., 2018). These semi-automatic methods have limitations, such as being susceptible to human intervention and being expensive.

Furthermore, to measure physiological signals, different wearable electrodermal activity sensors are attached to the user via cables, which makes it difficult to wear them for long periods of time (Dirican & Göktürk, 2011).

③ Automatic methods include vision-based methods that measure student engagement by analyzing nonverbal cues, such as facial expressions and head gaze, in video captured by high-definition (HD) surveillance cameras. This automated approach is a non-intrusive, effective, simple, unbiased, and inexpensive way to measure student engagement in any learning environment, such as online or offline classroom learning.

5) Advances in artificial intelligence technologies, such as affective computing, computer vision, and deep learning, are used to develop automated engagement monitoring systems (AEMS). AEMS automatically monitors and reports student engagement by analyzing nonverbal cues without human intervention. Analysis of students' academic affective states (e.g., emotions and moods) has the potential to create smart classrooms that autonomously monitor and analyze student engagement and behavior in real time. In the recent literature, many works have been proposed to exploit students' behavioral and emotional cues to develop AEMS in the field of education.

  • Among them, most works address e-learning environments with a single student in a single video frame (Bosch et al., 2016; Krithika & Lakshmi Priya, 2016; Ruipérez-Valiente et al., 2018; Sharma et al., 2019; Mukhopadhyay et al., 2020; Bhardwaj et al., 2021).
  • Some works support offline classroom environments with multiple students in one video frame (Zaletelj & Košir, 2017; Klein & Celik, 2017; Thomas & Jayagopi, 2017; Soloviev, 2018; Ashwin & Guddeti, 2019; Zheng, R., et al., 2020; Luo, Z., et al., 2020; Vanneste et al., 2021; Peng, S., et al., 2021).

For large offline classroom environments, most of these works suffer from scalability issues and cannot estimate student population engagement in real time. Furthermore, these works use basic emotions such as happiness, anger, fear, sadness, and surprise (Ekman, 1992) as emotional cues for attentiveness estimation, which are not suitable for academic settings. Earlier studies have demonstrated that there are some distinctions between academic emotions and basic emotions (Pekrun, 2000).

6) The authors' work and the problems addressed

Work:

① Rather than relying on basic emotions (Wei, Q., et al., 2017), this study used six meaningful academic affective states related to the learning environment, namely 'boredom', 'confusion', 'concentration', 'frustration', 'yawning', and 'sleepiness' (D'Mello, 2013; Tonguç & Ozkara, 2020).

② Created a facial expression dataset for extracting academic emotions from students' faces in classroom lecture videos. To increase the number of samples, similar facial expression samples from three public datasets were added: BAUM-1 (Zhalehpour et al., 2016), DAiSEE (Gupta et al., 2016; Kamath et al., 2016), and YawDD (Abtahi et al., 2014).

③ Discussed how to address the ethical and student data privacy issues that need to be considered in this type of work.

Problems addressed:

① Can we detect each student's face in each frame of a large offline classroom video?

② Can we recognize the academic emotional state of students through facial expressions?

③ Can we compute individual student group participation scores for each video frame?

④ Can we estimate the overall participation of students in real time with sufficient computing resources?

⑤ Can we verify the correlation between the engagement estimated by the AEMS model and students' self-reported engagement?

2. Related work

1) Single student, single video frame

  • Whitehill et al. (2014) proposed a machine learning-based system to classify student engagement in an e-learning environment by analyzing their facial expressions and behavioral patterns. Their experiments concluded that SVM with Gabor features performed best in classifying student engagement with an area under the curve (AUC) value of 0.729.
  • This work (Bosch et al., 2016) used computer vision and machine learning algorithms to detect the affective states of students from their facial expressions and body movements while they interacted with an educational game on a computer. They built 14 different machine learning models for this, such as SVM and decision trees. The classification performance for each affective state, measured by AUC, was: bored (0.61), confused (0.65), happy (0.87), engaged (0.68), and frustrated (0.63).
  • Krithika and Lakshmi Priya (2016) developed a system that can identify and monitor students' emotions in an e-learning environment and provide real-time feedback on students' concentration levels. They used emotions such as excitement, boredom, yawning, and sleepiness, together with abnormal head and eye movement patterns, to predict concentration.
  • This work (Sharma et al., 2019) proposes a real-time estimation system for student engagement in e-learning environments by analyzing students' basic facial expressions. They trained a CNN-based emotion recognition model with a validation accuracy of 70%.
  • Zhang, H. et al. (2019) proposed a binary classification model for student engagement recognition in online learning environments based on Inflated 3D Convolutional Networks (I3D), using the DAiSEE dataset. For binary classification (engaged vs. not engaged), the model achieved an accuracy of about 98%.
  • Mukhopadhyay et al. (2020) proposed a method for assessing the emotional state of students in online learning by combining basic facial expressions. They proposed and trained a convolutional neural network (CNN) based model using the FER2013 dataset and achieved a classification accuracy of 62%.
  • Bhardwaj, P. et al. (2021) proposed a deep learning based approach for real-time student engagement classification in an online learning environment by analyzing basic facial expressions.

All of the above approaches address the problem of automatic student engagement monitoring of a single student in a single video frame in an e-learning environment. Therefore, these works are infeasible for solving the problem of automatic estimation of student group participation in large offline classroom environments with multiple students in a single video frame.

2) Multiple students per frame

  • Zaletelj and Košir (2017) attempted to automatically estimate student attention in an offline classroom setting using nonverbal cues. Using machine learning algorithms such as decision trees and k-nearest neighbors, they developed a model based on 2D and 3D features extracted from a Kinect One camera. Their system achieved a test accuracy of 0.753, evaluated by comparing the predicted attention with ground-truth attention given by human annotators. Due to technical limitations of the Kinect camera, the analysis was limited to 6 students rather than the entire classroom.

  • Klein and Celik (2017) developed the Wits Intelligent Teaching System (WITS), a CNN-based approach that provides teachers with real-time feedback on student engagement using positive and negative behavioral cues in large offline classroom settings. Using a student classroom behavior dataset they created, they trained a model based on the AlexNet architecture (Krizhevsky et al., 2012), achieving a validation accuracy of 89.60%. The study did not use emotional cues to estimate student engagement, and it involved computational overhead.

  • This work (Thomas & Jayagopi, 2017) used computer vision and machine learning algorithms to classify students' engagement from nonverbal facial cues. They used OpenFace, an open-source real-time facial analysis toolbox (Baltrušaitis et al., 2016), to create a dataset of 27-dimensional feature vectors. They trained models on this dataset using machine learning algorithms such as SVM and logistic regression, reaching classification accuracies of 0.89 and 0.76, respectively. This study was not conducted in a large offline classroom setting, and it was not tested for assessing the real-time engagement of an entire class of students.

  • Soloviev (2018) proposed a system that continuously analyzes streams of visual data from classroom cameras by classifying basic facial expressions of students as positive or negative emotions. They trained a model with a two-class boosted decision tree (Adaboost) method and achieved a classification accuracy of 84.80%. This study did not consider students' academic emotions to classify their level of engagement.

  • This work (Ashwin & Guddeti, 2019) developed a CNN-based system that analyzes nonverbal cues and classifies student engagement into four levels: 'not at all engaged', 'nominally engaged', 'engaged in task', and 'very engaged'. Their system was trained and tested on faces, hand gestures, and body poses in a large offline classroom setting, and was able to classify them with 71% accuracy. The method required 2153 milliseconds (2.153 seconds) to process a single image frame, indicating a high computational overhead; therefore, it cannot be used in real-time implementations.

  • This work (Zheng, R. et al., 2020) designed an intelligent student behavior monitoring framework that can detect behaviors such as raising hands, standing, and sleeping in a classroom environment. They trained the model using a modified Faster R-CNN object detection algorithm and identified the aforementioned behaviors with a mean average precision (mAP) of 57.6%. Since the model only detects student behaviors, academic emotional cues could not be used to predict overall student group engagement.

  • Luo, Z. et al. (2020) presented a model that combines hierarchical and conditional random forest algorithms with an interaction platform, using head pose, facial expressions, and smartphone interaction to estimate students' interest in the classroom environment. The model achieved a classification accuracy of 87.5%.

  • Peng, S. et al. (2021) proposed a multimodal fusion of facial cues, heart rate, and auditory features to monitor students' mental states.

A set of machine learning algorithms (SVM, random forest, and multilayer perceptron) was trained using various multimodal fusion techniques. Both of the above studies required multiple physical devices to measure students' multimodal data, which is expensive in a large offline classroom environment.

  • This study (Vanneste et al., 2021) presents a technique for assessing student engagement in a classroom environment by recognizing student behaviors such as raising hands and taking notes. They trained a deep learning model to recognize these behaviors, achieving a recall of 63% and a precision of 45%. The study did not conduct experiments in a large classroom setting for real-time engagement assessment, and it does not consider the academic-emotional state of the students.

None of the above works analyzed students' academic emotional states through facial expressions in a large-scale offline classroom environment or developed an AEMS for real-time student engagement monitoring.

3) Summary of related work on student engagement monitoring in offline classrooms

3. Research significance and technical background

1) AEMS

Implementing AEMS in the field of education can have a wide range of applications:

  • In a distance learning setting, human teachers can receive real-time feedback on student engagement levels (low, medium, high) (Whitehill et al., 2014)
  • Instructional video content can be automatically identified and modified based on students' responses when viewers start to lose interest (Whitehill et al., 2014)
  • Educational analysts have access to vast amounts of data to mine the factors and variables that affect student engagement. These data will have a higher temporal resolution than self-report and questionnaire results (Whitehill et al., 2014)
  • Analysis of student engagement can be used as instant feedback to adjust instructional strategies to enhance the learning process of students (Ashwin & Guddeti, 2019)
  • Daily feedback on teaching strategies is beneficial for novice teachers to quickly improve their teaching experience (Ashwin & Guddeti, 2019)
  • In the era of smart campuses and smart universities, campus learning environments are diverse, including classrooms, webinars, e-learning environments, etc.; manually monitoring students in all of them is difficult, which AEMS can address (Al-Nawaashi et al., 2017; Ashwin & Guddeti, 2019)

Besides the field of education, AEMS can also be used in many other fields, such as entertainment (Wang, S. & Ji, Q.), healthcare (Singh & Goyal, 2021), and shopping (Yolcu et al., 2020). Because AEMS can be applied in various fields, each field needs to design its own set of contextual features according to its dimensions of engagement to obtain better predictions. Processing people's visual data is a sensitive matter: the development and use of such autonomous systems, based on emotional artificial intelligence and affective computing technology, introduces a new set of ethical issues that require responsible behavior, such as careful system design, ethical data use, transparency, and privacy (Gretchen Greene, 2020; Robin Murdoch, 2020).

2) Affective Computing

Affective computing (AC) is a field that researches and develops systems and devices that can sense, recognize, and process human emotions. It is an interdisciplinary field that includes computing, psychology, and cognitive science. With the help of artificial intelligence, we can transform computing machines into emotionally intelligent machines that can understand human emotions and respond accordingly. AC has a wide range of applications in education, healthcare, smart home, entertainment, and many other fields. According to AC researchers, human communication relies not only on verbal communication such as voice and text, but also on non-verbal communication such as facial expressions, eye gaze, head gaze, gestures, and body postures (Poria et al., 2017).

Research (Sathik & Jonathan, 2013) has demonstrated that the nonverbal communication channel most frequently used by students listening in the classroom is facial expression. Regardless of classroom seating arrangement and size, facial expressions are less occluded than other nonverbal parameters. Furthermore, processing this parameter is less computationally intensive than processing other nonverbal parameters such as body pose estimation.

4. Method

The framework of the method includes two modules, offline and online, as shown in Figure 3. The offline module trains a CNN-based facial expression recognition (FER) model, and the online module runs in real time, using the CNN model trained by the offline module to estimate student engagement.

1) Privacy Protection

2) Offline module

The offline module is executed once to develop a CNN-based FER model that accepts face images as input and predicts appropriate emotional state labels as output. As part of the offline module, a dataset is also constructed to train the CNN architecture.

① Data set construction; ② Academic emotional state definition; ③ Data collection and participants; ④ Facial data annotation;

⑥ Proposed CNN model
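
For item ⑥, the paper trains a CNN-based FER model that takes aligned face crops (resized to 48 × 48 pixels in the online preprocessing stage) and predicts one of the six academic affective states. Below is a minimal Keras sketch of such a classifier; the layer sizes, optimizer, and other hyperparameters are illustrative assumptions, not the authors' exact architecture.

```python
from tensorflow.keras import layers, models

# The six academic affective states used in the paper (noun forms assumed as class names).
ACADEMIC_STATES = ["boredom", "confusion", "concentration",
                   "frustration", "yawning", "sleepiness"]

def build_fer_model(num_classes: int = len(ACADEMIC_STATES)) -> models.Model:
    """Small CNN over 48x48 grayscale face crops; architecture details are assumed."""
    model = models.Sequential([
        layers.Conv2D(32, 3, activation="relu", padding="same", input_shape=(48, 48, 1)),
        layers.Conv2D(32, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

A model of this kind would be trained once in the offline module on the constructed dataset and then loaded by the online module for inference.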

3) Online module

It includes five stages, namely: video acquisition stage, preprocessing stage, student emotion classification stage, postprocessing stage and visualization stage.

  • The module first takes a sequence of video frames and sets a frame counter to zero. The frame counter is incremented by 1 as each video frame is passed to the preprocessing stage. The preprocessing stage returns aligned frontal faces, and the affective states of these face images are identified by the FER model trained in the offline module.
  • Once the frame counter equals a predefined threshold, a real-time engagement graph is drawn for the processed video segment by applying the post-processing steps described in Section 4.2.4 to the identified affective state labels (steps 1 to 4).
  • After the lecture, the maximum cumulative group engagement level label is returned as the students' overall engagement feedback for the entire lecture (a loop sketch of this procedure follows the list).
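
The control flow above can be summarized in a short loop. This is a minimal sketch, assuming a trained FER model, the ACADEMIC_STATES list from the model sketch above, and helper functions preprocess_frame() and postprocess() like the ones sketched in the preprocessing and post-processing subsections below; SEGMENT_THRESHOLD and all other names are illustrative, and the 4-frames-per-second sampling step is omitted for brevity.

```python
import cv2

SEGMENT_THRESHOLD = 120  # assumed: e.g. 4 sampled frames/s x 30 s per video segment

def run_online_module(video_source, fer_model, preprocess_frame, postprocess):
    """Video acquisition -> preprocessing -> emotion classification -> post-processing."""
    cap = cv2.VideoCapture(video_source)      # video acquisition stage
    frame_counter = 0
    segment_labels = []                       # emotion labels collected for the current segment
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break                             # end of the input video stream
        frame_counter += 1
        for face in preprocess_frame(frame):  # aligned 48x48 frontal face crops
            probs = fer_model.predict(face[None, :, :, None] / 255.0, verbose=0)
            segment_labels.append(ACADEMIC_STATES[int(probs.argmax())])
        if frame_counter >= SEGMENT_THRESHOLD:
            postprocess(segment_labels)       # compute and plot the segment's group engagement level
            segment_labels, frame_counter = [], 0
    cap.release()
```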

① Video capture

② Preprocessing:

  • Frame sampling: The frame sampling step samples a few video frames per second to estimate the engagement level of the student group. According to Whitehill et al. (2014), processing 4 video frames per second at a time interval of 0.25 s yields almost the same results as processing 30 frames per second. Therefore, in this step only 4 video frames are processed per second, at a time interval of 0.25 s, thereby reducing computational overhead;
  • Face detection and extraction: The maximum number of faces is extracted from each video frame using a pre-trained face detection model, here a multi-task cascaded convolutional neural network (MTCNN) (Zhang, K., et al., 2016). The MTCNN model achieves state-of-the-art results in detecting smaller face patches with negligible false positives (non-face patches detected as faces). The face detection step returns a list of face patch coordinates and a list of facial landmark coordinates. Each face patch coordinate contains four values: the x and y coordinates of the top-left corner, and the width and height of the detected face. These four values are used to extract face image patches from the video frame. The facial landmark coordinates include one landmark at the centre of each eye, one at the tip of the nose, and one at each corner of the mouth.
  • Head Pose Estimation: The head pose detection step removes all non-frontal faces from the detected faces, including tilted left, tilted right, up and down.

    Since the FER model cannot assign proper emotional state labels to non-frontal faces, keeping them would reduce the effectiveness of the method. Head pose estimation computes the 3D orientation of the head relative to the camera from a digital image. To this end, the method proposed in the work of Mallick (2016) is implemented. By associating six 2D facial landmarks (the five landmarks shown in Figure 8 plus a sixth landmark on the chin) with their respective 3D model positions (expressed in world coordinates), the three pose angles, namely yaw, pitch, and roll, are computed. The sixth facial landmark (chin) coordinate is computed explicitly from the nose tip and mouth corner landmark coordinates produced by the MTCNN model. Rotation of the head in the vertical direction is called pitch, rotation in the horizontal direction is called yaw, and rotation in a circular (clockwise or counterclockwise) motion is called roll. Faces tilted to the left or right are eliminated by thresholding the yaw angle; likewise, faces tilted up or down are eliminated by thresholding the pitch angle.

  • Face alignment and resizing

The frontal face patches are further refined in the face alignment step. In general, there is no guarantee that all frontal faces are perfectly aligned; some may be tilted slightly to the right or left. Face alignment is a preprocessing technique that standardizes face images using translation, scaling, and rotation transformations. To this end, the approach proposed in the work of Rosebrock (2017) is implemented. An implicit benefit of this step is image enhancement: it reduces image blur by repositioning pixels. Finally, the aligned frontal faces are resized to 48 × 48 (width × height) pixels, which are then used as input to the trained FER model. (A condensed code sketch of this preprocessing pipeline follows below.)
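
The sketch below condenses the preprocessing stage described above (MTCNN face detection, head-pose filtering via solvePnP, eye-based alignment, and resizing to 48 × 48 grayscale), using the open-source mtcnn package and OpenCV. The 3D reference points, the yaw/pitch thresholds, and the chin approximation from the nose and mouth landmarks are assumptions for illustration; the paper's exact values and implementation may differ.

```python
import cv2
import numpy as np
from mtcnn import MTCNN

detector = MTCNN()
YAW_LIMIT, PITCH_LIMIT = 30.0, 30.0          # assumed thresholds (degrees) for "frontal"

# Generic 3D reference points (nose tip, chin, eye centres, mouth corners) in
# image-like coordinates (x right, y down, z away from camera), so that a
# frontal face yields yaw and pitch close to zero.
MODEL_POINTS = np.array([
    (0.0, 0.0, 0.0),            # nose tip
    (0.0, 330.0, 65.0),         # chin
    (-225.0, -170.0, 135.0),    # left eye centre
    (225.0, -170.0, 135.0),     # right eye centre
    (-150.0, 150.0, 125.0),     # left mouth corner
    (150.0, 150.0, 125.0),      # right mouth corner
], dtype=np.float64)

def head_pose_angles(kp, frame_shape):
    """Estimate (yaw, pitch) in degrees for one face from its five MTCNN landmarks."""
    h, w = frame_shape[:2]
    nose = np.array(kp["nose"], dtype=np.float64)
    mouth_mid = (np.array(kp["mouth_left"]) + np.array(kp["mouth_right"])) / 2.0
    chin = nose + 2.0 * (mouth_mid - nose)    # rough sixth landmark derived from nose and mouth
    image_points = np.array([nose, chin, kp["left_eye"], kp["right_eye"],
                             kp["mouth_left"], kp["mouth_right"]], dtype=np.float64)
    camera_matrix = np.array([[w, 0, w / 2.0],
                              [0, w, h / 2.0],
                              [0, 0, 1.0]])   # focal length approximated by the image width
    _, rvec, tvec = cv2.solvePnP(MODEL_POINTS, image_points, camera_matrix, np.zeros((4, 1)))
    rmat, _ = cv2.Rodrigues(rvec)
    euler = cv2.decomposeProjectionMatrix(np.hstack((rmat, tvec)))[6]
    pitch, yaw, _roll = euler.flatten()
    return yaw, pitch

def align_and_resize(frame, box, kp, size=48):
    """Rotate so the eye centres are horizontal, crop the face box, resize to 48x48 grayscale."""
    x, y, w, h = box
    (lx, ly), (rx, ry) = kp["left_eye"], kp["right_eye"]
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))
    rot = cv2.getRotationMatrix2D((x + w / 2.0, y + h / 2.0), angle, 1.0)
    rotated = cv2.warpAffine(frame, rot, (frame.shape[1], frame.shape[0]))
    crop = rotated[max(y, 0):y + h, max(x, 0):x + w]
    return cv2.resize(cv2.cvtColor(crop, cv2.COLOR_BGR2GRAY), (size, size))

def preprocess_frame(frame):
    """Return aligned 48x48 frontal face crops found in one sampled BGR video frame."""
    faces = []
    for det in detector.detect_faces(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)):
        yaw, pitch = head_pose_angles(det["keypoints"], frame.shape)
        if abs(yaw) <= YAW_LIMIT and abs(pitch) <= PITCH_LIMIT:   # keep frontal faces only
            faces.append(align_and_resize(frame, det["box"], det["keypoints"]))
    return faces
```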

③ Classification of students' emotional states

Low engagement (EL1): boredom, sleepiness; moderate engagement (EL2): yawning, frustration, confusion; high engagement (EL3): concentration
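
This grouping can be written down directly as a lookup table; the label strings are assumed to match the FER model's class names, and the table is reused in the post-processing sketch below.

```python
# Mapping from the six academic affective states to engagement levels (EL1-EL3).
EMOTION_TO_LEVEL = {
    "boredom": "EL1", "sleepiness": "EL1",                        # low engagement
    "yawning": "EL2", "frustration": "EL2", "confusion": "EL2",   # moderate engagement
    "concentration": "EL3",                                       # high engagement
}
```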

④ Post-processing

  • Accumulate all predicted student facial emotion labels extracted from video frames into respective accumulators (the accumulator acts as a counter for each emotion state label)
  • These accumulators are merged into their respective EL accumulators, EL1 (low), EL2 (medium), and EL3 (high)
  • Repeat this process until the frame counter is equal to the predefined threshold
  • Once the frame counter equals the predefined threshold, the processed frames form a video segment, and the maximum EL accumulator label is returned as the group engagement level (GEL) for that segment
  • The GEL labels of the video segments are accumulated into respective GEL accumulators, e.g., GEL1 (low), GEL2 (medium), GEL3 (high), which are used for the full-lecture group engagement feedback (FGEF) (see the sketch after this list)
  • After this processing, the GEL of the processed video clip is plotted on the real-time graph
  • Finally, when the input video stream is complete, the largest GEL accumulator label is returned as the FGEF for the entire lecture.
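
Below is a minimal sketch of this accumulator logic, reusing the EMOTION_TO_LEVEL table above; the function names and the GEL label format are illustrative assumptions rather than the paper's implementation.

```python
from collections import Counter

gel_accumulator = Counter()                           # GEL1/GEL2/GEL3 counts across the whole lecture

def postprocess(segment_emotion_labels):
    """Return the group engagement level (GEL) of one processed video segment."""
    el_counts = Counter(EMOTION_TO_LEVEL[label] for label in segment_emotion_labels)
    if not el_counts:                                 # no faces detected in this segment
        return None
    dominant_el = max(el_counts, key=el_counts.get)   # e.g. "EL2"
    gel_accumulator["G" + dominant_el] += 1           # counted as "GEL2" towards the FGEF
    return dominant_el                                # value plotted on the real-time graph

def full_lecture_feedback():
    """Return the full-lecture group engagement feedback (FGEF) once the stream ends."""
    return max(gel_accumulator, key=gel_accumulator.get)
```

In this form, postprocess() matches the per-segment call in the online loop sketched earlier, and full_lecture_feedback() corresponds to the FGEF label returned at the end of the lecture.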

5. Experiments and Discussion

1) Experimental setup

For implementation and experiments, a computing system consisting of an 8th-generation Intel Core i5-8300H processor @ 2.30 GHz, 16 GB RAM, and a 4 GB NVIDIA GeForce GTX 1050 Ti graphics card was used.

A 2-megapixel network camera (PeopleLink Elite FHD-1080, 20x optical zoom) is installed in the department's smart classroom to record classroom videos.

2) CNN model training and evaluation

3) Calculation time

4) Results and discussion

5) Limitations

This work was limited to settings where the students did not have a large degree of heterogeneity in age, culture, and class background. In such settings, it can be assumed that there is no significant variation in expressions, so a single model is sufficient to recognize the students' facial expressions. The proposed model can therefore work well in the above context, but results may differ when it is applied to students of different ages, cultures, and backgrounds. When the system is deployed in situations with a large degree of heterogeneity in student age, culture, and background, the authors recommend training multiple FER models for different populations and integrating the outputs of these models to obtain the final result. The current research mainly assesses students' group engagement through facial expressions; combinations of different nonverbal cues such as body posture, head movement, and eye gaze were not considered. In addition, this study was validated against students' self-reported engagement measures, rather than external validation by teachers' own judgment, trained observers, or students' learning gains.


Original post: blog.csdn.net/qq_44930244/article/details/130955034