(Thesis plus source code) Two-class EEG emotion recognition based on DEAP and MAHNOB data sets (pytorch deep neural network (DNN) and convolutional neural network (CNN))

This paper was published in a top journal in 2021. (pytorch framework)

The code analysis part is on the personal homepage:

https://blog.csdn.net/qq_45874683/article/details/130007976?csdn_share_tail=%7B%22type%22%3A%22blog%22%2C%22rType%22%3A%22article%22%2C%22rId%22%3A%22130007976%22%2C%22source%22%3A%22qq_45874683%22%7D

(Thesis plus source code) Code analysis of two-class EEG emotion recognition (pytorch deep neural network (DNN) and convolutional neural network (CNN)) based on DEAP and MAHNOB data sets

Please see the personal homepage for the paper and source code:

https://download.csdn.net/download/qq_45874683/87667147

(Paper plus source code) Two-class EEG emotion recognition based on DEAP and MAHNOB data sets (pytorch deep neural network (DNN) and convolutional neural network (CNN))


Table of contents

This paper was published in a top journal in 2021. (pytorch framework)

Summary

1 Introduction

2 Related work

2.1 Reproducibility of relevant works

3 Datasets

3.1 DEAP

3.2 MAHNOB

3.3 Data set preprocessing

3.3.1 DEAP preprocessing

3.3.2 MAHNOB preprocessing

3.3.3 Summary of preprocessed data sets

4 Models

4.1 Deep Neural Network (DNN)

4.2 Convolutional Neural Network (CNN)

5 Result Analysis

5.1 Analysis of results between data sets

5.2 Statistical testing of comparative models

5.2.1 McNemar's test

5.2.2 5x2cv paired t test

5.3 Arousal classification results

6 Conclusion


Summary

        As devices that record electroencephalogram (EEG) signals become increasingly affordable, there is growing interest in applications that use EEG data to predict human emotional states. However, research papers in this area often suffer from poor reproducibility [1], and the reported results are rather flimsy, lack statistical significance, and are often based on testing on a single data set.

        Therefore, the purpose of this paper is twofold: to test the obtained models through statistical experiments, and to compare different models and data sets.

        Of the two models considered, the deep neural network (DNN) and the convolutional neural network (CNN), the first was able to achieve the highest accuracy on a specific train/test split, but the CNN proved better than the DNN on average. Using the same models, it was also found that DEAP yields higher accuracy than MAHNOB, but only by a small margin, indicating that these models are robust enough to perform almost equally well on both datasets.

        The method proposed in [2] for valence/arousal classification from EEG was closely followed, in an attempt to reproduce the results reported therein. To achieve the second goal, the McNemar and 5x2cv tests were used, and the models were compared with each other on two different datasets, DEAP [3] and MAHNOB [4], with the purpose of understanding whether a model can perform similarly on two distinct but related data sets.

1 Introduction

        For a long time, emotion recognition was mostly based on video or audio recordings due to the low cost of sensors such as cameras and microphones. However, as technology advances, it is possible to build relatively low-cost sensors to capture physiological signals, so there has been a noticeable increase in interest in using this data recently within the affective computing community. Electroencephalogram (EEG) signals are no exception.

        Parallel to this, the use of deep learning techniques has also increased significantly, so it is not surprising that much recent academic research has focused on training deep neural networks to recognize emotions from electroencephalograms. Furthermore, since EEG data is known to be a complex signal that is difficult to understand, the ability of deep neural networks to automatically learn features sounds promising.

        Recent research in the field confirms this assumption, showing that deep neural models outperform traditional techniques. However, many of these studies have proven difficult or even impossible to replicate, and relied on a single data set to test their models. Some studies, such as [1], report striking figures on this problem: on average, EEG deep learning studies do not publish the dataset used (50% of the time) or the code of the models (90%), and reproducing them is usually very difficult (90%).

        The first goal of this study is to reproduce the steps of [2] and obtain predictors with performance similar to those reported therein. In that study, two neural network models, namely a simple deep neural network (DNN) and a convolutional neural network (CNN), were trained to classify emotions from EEG data. The dataset used is DEAP [3], a well-known benchmark database for affective computing applications. That study focuses on predicting emotional states along the two continuous dimensions of valence and arousal proposed by Russell. In particular, it considered both binary and three-class classification of valence and arousal, whereas only binary classification was considered in the present study.

        Despite strictly following all the steps described in [2], the accuracy of our model was far from the reported accuracy, which led to the conclusion that some data preprocessing steps were omitted in their paper.

        Another purpose of this study is to statistically compare different models (especially DNN and CNN) to understand whether there are significant differences between the two. Furthermore, these models have been tested on two EEG datasets annotated with valence arousal labels, namely DEAP and MAHNOB, to find out whether the same architecture can work well in both domains.

        The results show that models trained and evaluated on DEAP tend to perform better than models trained and evaluated on MAHNOB, although this may be due to the size difference between the two datasets. In general, both models were found to perform similarly on DEAP and MAHNOB.

        The DNN and CNN models were also compared on both datasets using the McNemar test and the 5x2cv paired t-test. As pointed out in [5], these tests were chosen because they have a low Type I error and good statistical power, and they are the de facto standard today. While the McNemar test was unable to find any significant difference between the models, the 5x2cv test was more powerful and was able to show that the CNN model is statistically better than the DNN model.

        This report is structured as follows. Section 2 summarizes the preprocessing steps, methods and results of [2], and also discusses how much of that paper we were able to replicate. Section 3 describes the datasets and the preprocessing steps applied to each, while Section 4 details the neural architectures used, the relative hyperparameters and the training procedures. Section 5 then contains the summary of results and the statistical tests comparing models and datasets. Finally, Section 6 elaborates on the research findings with respect to the proposed objectives. The report ends with a small appendix that serves as a reference for easily browsing the provided source code of the models and experiments.


2 Related work

        The paper that inspired this research is [2], published by Tripathi et al. in 2017. The authors used simple neural network models to predict valence and arousal from EEG data. The problem of predicting valence/arousal was framed as a classification problem, and in particular they tested two- and three-class classification. For binary classification, valence/arousal values below 5 are considered low activation, while values above 5 are considered high activation.
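        As a concrete illustration, here is a minimal sketch of this binarization, assuming ratings on the usual 1-9 annotation scale; the sample values and the decision to map a rating of exactly 5 to "low" are assumptions, not taken from [2]:

```python
import numpy as np

# Minimal sketch: binarize valence/arousal ratings at the threshold of 5.
# The sample ratings and the handling of exactly 5 are assumptions.
ratings = np.array([2.5, 7.1, 5.0, 8.3])   # raw annotations on a 1-9 scale
labels = (ratings > 5).astype(int)         # 0 = low activation, 1 = high
print(labels)                              # [0 1 0 1]
```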

        The data used comes from a preprocessed version of the DEAP dataset [6]. Then, in order to train with reasonable computing resources, they processed the dataset to reduce the dimensionality of the EEG data: each EEG trial was divided into multiple batches, and each batch was summarized with statistical values such as the mean, standard deviation, minimum and maximum.

        The two models used are basic deep neural networks. The first is a simple 4-layer neural network consisting of fully connected layers, and the other is a convolutional neural network with 2 convolutional layers, a max pooling layer, and 2 fully connected layers.

        Then, this paper reports the results obtained by 32-fold cross-validation using different hyperparameter configurations for DNN and CNN. The DNN model achieved 75.8% and 73.1% accuracy in valence and arousal, respectively, while the CNN model achieved an impressive 81.4% and 73.4% accuracy.


2.1 Reproducibility of relevant works

        This section discusses what we were able to replicate from [2]. Since the reproduced results were not satisfactory, some changes were made to the preprocessor, model architecture, and hyperparameters. For this reason, this section is placed before Sections 3 and 4, which describe the final dataset preprocessing steps and models used in this study.

        Although the code and data from that study have not been made public, replicating the same preprocessing steps and models was not a challenge, because the data processing was simple and the neural models were basic. However, the trained predictors failed to learn from the EEG data: the models either underfit or overfit, finding no general pattern.

        As mentioned in Section 3, by normalizing the dataset, the model is able to learn some patterns from the data, thus alleviating this problem. The standardization step is not explicitly cited in [2], but it is a common step and one could claim that it is an implicit one.

        However, even after normalization, the accuracy of the resulting models was nowhere near that stated in [2], reaching a maximum of 80% on some train/test splits but averaging around 60%, whereas the average accuracy in the replicated study was 75%. The results reported in Section 5.1, although based on slightly different hyperparameter choices and model architectures, are almost identical to those obtained by an exact replication of the paper, so they can be used as a reference to compare the expected accuracy with the accuracy actually obtained.

        Since the training process and model architectures are no different from standard neural networks used in other fields, the problem is likely data-dependent. Different types of normalization were tried (per channel, per trial, per participant, global; after or before dimensionality reduction), but no improvement in accuracy was obtained. Therefore, it is likely that the authors of [2] used a custom preprocessed version of DEAP, although they never explicitly mention performing additional preprocessing procedures.

        After being unable to reproduce the results of [2], some choices were made that differ from that study: for example, only the 32 EEG channels were used instead of all 40 channels, which also contain other physiological signals, in order to work with the exact same set of features on both datasets. The model architectures and hyperparameters were also slightly modified. A detailed description of the datasets and models can be found in Sections 3 and 4.


3 Datasets

        The DEAP and MAHNOB datasets were selected for this study because they both contain EEG data with valence/arousal annotations. The annotations are based on Russell's scale, which is widely used in affective computing. On Russell's valence-arousal scale, each emotional state is a point on a 2D plane, with valence and arousal as the horizontal and vertical axes respectively (see Figure 1). Thus, a combination of valence and arousal identifies a specific emotion. In particular, valence varies between unpleasant and pleasant, and arousal between inactive and active.


3.1 DEAP

        DEAP [3] is a data set for emotion analysis released in 2012. It is one of the largest public datasets in affective computing and also contains a variety of different physiological and video signals.

The DEAP dataset consists of two parts:

        1) A database of 120 one-minute music videos, each rated by 14-16 volunteers based on valence, arousal and dominance.

        2) A subset of 40 of those music videos, with the corresponding EEG and physiological signals of each of the 32 participants who watched them. As in part one, each video was rated on the dimensions of valence, arousal, and dominance.

        For the purpose of this report, only the second part of the DEAP dataset, which contains the EEG signals, was used.

        EEG signals were collected using a Biosemi ActiveTwo device, which records 32 EEG channels with a configurable sampling rate. DEAP was collected at 512Hz, but the creators of the dataset also provide a preprocessed version of the EEG signal, downsampled to 128Hz and with frequency filters and other useful preprocessing steps applied.

        In particular, for each of the 32 participants, the following preprocessed information exists:

        • Data: A 40 x 40 x 8064 array containing the recordings of 40 channels for each of the 40 music videos. There are 8064 recordings per video per channel: since the trial duration is 63 seconds (3 seconds pre-trial baseline + 60 seconds trial) and the sampling rate is 128Hz, 63 x 128 = 8064.

        • Labels: A 40 x 4 array containing the annotations of valence, arousal, dominance, and liking for each of the 40 music videos.
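        For reference, the preprocessed Python version of DEAP stores each participant as a pickled dictionary with exactly these two arrays; a minimal loading sketch (the file path is a placeholder for wherever the data is stored) could look as follows:

```python
import pickle

# Minimal sketch: load one participant's file from the preprocessed (Python)
# version of DEAP. The path below is a placeholder.
with open('data_preprocessed_python/s01.dat', 'rb') as f:
    subject = pickle.load(f, encoding='latin1')  # latin1 needed on Python 3

data = subject['data']      # shape (40, 40, 8064): video x channel x reading
labels = subject['labels']  # shape (40, 4): valence, arousal, dominance, liking

eeg = data[:, :32, :]       # keep only the 32 EEG channels
print(eeg.shape, labels.shape)
```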

        This preprocessed information is processed again, as described in Section 3.3.


3.2 MAHNOB

        MAHNOB [4] is an emotion recognition data set released in 2012. It is a multimodal dataset providing audio, video and physiological signals, as well as eye gaze data. All data are synchronized and annotated with respect to the affective dimensions of valence and arousal.

        Four different types of experiments were conducted: 1) in the first type, participants were shown a video and had to annotate their valence and arousal levels in response to the video stimulus; 2) in the other three types, a tag was placed at the bottom of the screen, which may or may not be related to the movie being shown, and participants were asked to rate the relevance of the tag to the video. For the purpose of this report, only data from the first type of experiment were used.

        EEG signals were recorded using the same Biosemi ActiveTwo device used to collect the DEAP dataset. The EEG signal therefore also has 32 channels, but MAHNOB was collected at 256Hz instead of 512Hz. Contrary to DEAP, MAHNOB does not provide a preprocessed version of the dataset, but rather the raw collected files, with the EEG signals in .bdf format. Processing this data therefore requires more preprocessing steps than DEAP, as described in Section 3.3.


3.3 Data set preprocessing

        The data of DEAP and MAHNOB have been preprocessed. The following two subsections explain in detail the preprocessing steps applied to these two datasets.

3.3.1 DEAP preprocessing

        Data dimensionality has been reduced. The 40 channels have been whittled down to 32, leaving only the EEG signal, and the 8064 readings per channel have been reduced to 99 values.

        Following the processing done in [2], the 8064 recordings were divided into 10 batches of approximately 807 readings each. Then, the following statistical values were extracted for each batch: mean, median, maximum, minimum, standard deviation, variance, range, skewness and kurtosis, resulting in 9 values per batch (10 batches thus produce 90 values). The same statistics are then calculated over the entire 8064 readings, yielding 9 additional values for a total of 99.
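        A sketch of this summarization for a single channel might look as follows; the use of scipy for skewness and kurtosis is an assumption, since [2] does not specify an implementation:

```python
import numpy as np
from scipy import stats

def summarize(x):
    """The 9 statistics extracted per batch: mean, median, maximum, minimum,
    standard deviation, variance, range, skewness and kurtosis."""
    return np.array([x.mean(), np.median(x), x.max(), x.min(),
                     x.std(), x.var(), x.max() - x.min(),
                     stats.skew(x), stats.kurtosis(x)])

def channel_features(readings, n_batches=10):
    """Reduce one channel's 8064 readings to 99 summary values: 9 statistics
    for each of 10 batches plus 9 statistics for the whole signal."""
    batches = np.array_split(readings, n_batches)   # ~807 readings per batch
    feats = [summarize(b) for b in batches]         # 10 x 9 = 90 values
    feats.append(summarize(readings))               # + 9 global values
    return np.concatenate(feats)                    # 99 values per channel

features = channel_features(np.random.randn(8064))  # placeholder signal
print(features.shape)  # (99,)
```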

        These summary values are then standardized per sample, so that each sample has a mean of 0 and a standard deviation of 1:

        X' = \frac{X - \mu}{\sigma}

        where X is the entire 32 x 99 sample, \mu is its mean and \sigma is its standard deviation.
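        In code, this per-sample standardization is essentially one line; a minimal sketch, assuming the statistics are computed over the whole 32 x 99 sample rather than per channel:

```python
import numpy as np

def standardize_sample(x):
    """Per-sample standardization: x is one 32 x 99 sample; mean and standard
    deviation are computed over the entire sample, not per channel."""
    return (x - x.mean()) / x.std()

sample = np.random.randn(32, 99) * 5 + 3        # placeholder sample
z = standardize_sample(sample)
print(round(z.mean(), 6), round(z.std(), 6))    # ~0.0 and ~1.0
```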


3.3.2 MAHNOB preprocessing

        This dataset provides raw EEG data in .bdf format, collected with a Biosemi ActiveTwo device. Since this data is not preprocessed, some additional work must be done. To process raw EEG signals, the MNE Python library specifically designed for processing and visualizing human neurophysiological data has been used [8].

        The same preprocessing steps used for the official preprocessed version of the DEAP dataset, as described in [6], were applied. In particular, the EEG signal was referenced to channel "Cz", which is a common reference channel and is even suggested in the Biosemi FAQ [9]. A 4-45Hz bandpass filter was initially applied as well, but it proved ineffective, so it was removed. In addition, since MAHNOB does not provide a fixed number of recordings per session and also contains 30 seconds of recording before and after the experiment, the required recordings were extracted from the middle of the session.
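        A hedged sketch of these steps with MNE is shown below; the file path, the choice of the first 32 channel names as the EEG channels, and the exact crop boundaries are assumptions, and the removed band-pass filter is shown commented out:

```python
import mne

# Sketch of the MAHNOB raw-EEG preprocessing described above.
raw = mne.io.read_raw_bdf('mahnob/session_1/recording.bdf', preload=True)
raw.pick_channels(raw.ch_names[:32])          # keep only the EEG channels
raw.set_eeg_reference(ref_channels=['Cz'])    # reference to channel "Cz"

# raw.filter(4., 45.)  # 4-45Hz band-pass: tried but removed, as noted above

# Sessions contain ~30 s of padding before and after the stimulus, so a 63 s
# window (16128 readings at 256Hz) is cropped from the middle of the session.
middle = raw.times[-1] / 2
raw.crop(tmin=middle - 31.5, tmax=middle + 31.5)
data = raw.get_data()                         # shape: (32, ~16128)
```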

        Then, the same preprocessing steps as for DEAP (explained in Section 3.3.1) were applied, with only one minor adjustment: 16128 (8064 x 2) readings were considered instead of 8064, and the batch size was doubled accordingly, because the MAHNOB dataset provides raw data collected at 256Hz, while DEAP provides a 128Hz downsampled version. This way, the time window covered by each batch is the same for both datasets.

3.3.3 Summary of preprocessed data sets

        After the preprocessing steps explained in the previous sections, both datasets contain data with the same shape, as shown in Table 1.

 Table 1: Dataset size and data shape after the preprocessing steps. The data contains 32 channels with 99 records each, while the labels contain 2 values (valence and arousal)

        Scripts performing these processing steps are provided in the project's repository under the names prepare_deap.py and prepare_mahnob.py respectively.

        Both datasets are divided into training and test sets; the (train, test) split sizes of DEAP and MAHNOB are (1180, 100) and (460, 86) respectively. Unfortunately, while the original MAHNOB dataset contains 1183 sessions, only 546 of them are annotated with valence and arousal, resulting in a rather small dataset for the current use case.


4 Models

        This study adopted two different neural network architectures: Deep Neural Network (DNN) with fully connected layers and Convolutional Neural Network (CNN), which were taken from [2] with only minor modifications. Both models were developed using Python and PyTorch [10], and the source code can be found in scripts/nn/models.py.

        The following subsections explain each of these models and training techniques in detail.


4.1 Deep Neural Network (DNN)

        The DNN model is a deep neural network with 3 hidden layers. An approximate graphical scheme of the architecture is shown in Figure 2, while the exact details of each layer are shown in Table 2.

Figure 2: DNN architecture. The number of neurons depicted is for representation only, the true number of neurons is reported below each layer.

Table 2: Deep neural network (DNN) architecture

        The ReLU activation function is used after each dense layer (except the last) to introduce nonlinearity into the model, while the sigmoid function is applied after the last layer to compress the output to the interval [0, 1]. Since valence/arousal classification is treated in this paper as a binary classification problem (low or high), the single output neuron, with a value in [0, 1], represents the probability, as inferred by the network, that the input signal refers to a high-valence/arousal emotional state.

        To avoid overfitting, dropout techniques are heavily used since the amount of data available for training is small.

        All weights of the network are initialized using the Xavier normal method [11], while all biases are initialized with the value 0.
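        A PyTorch sketch of such a model is given below; the hidden-layer widths and dropout probability are illustrative assumptions, since the exact values are those of Table 2:

```python
import torch
import torch.nn as nn

class DNN(nn.Module):
    """Sketch of the 3-hidden-layer DNN. Layer widths and the dropout
    probability are assumptions (the real values are in Table 2)."""
    def __init__(self, n_inputs=32 * 99, hidden=(512, 256, 128), p_drop=0.5):
        super().__init__()
        layers, n_in = [], n_inputs
        for n_out in hidden:
            layers += [nn.Linear(n_in, n_out), nn.ReLU(), nn.Dropout(p_drop)]
            n_in = n_out
        layers += [nn.Linear(n_in, 1), nn.Sigmoid()]  # P(high valence/arousal)
        self.net = nn.Sequential(*layers)
        for m in self.net:
            if isinstance(m, nn.Linear):
                nn.init.xavier_normal_(m.weight)      # Xavier normal weights
                nn.init.zeros_(m.bias)                # biases initialized to 0

    def forward(self, x):
        return self.net(x.flatten(start_dim=1))      # flatten the 32 x 99 input

model = DNN()
probs = model(torch.randn(8, 32, 99))  # batch of 8 -> (8, 1) probabilities
```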

        The hyperparameters, optimizers and loss functions used for training are reported in Table 3. There are slight differences between the two datasets.

Table 3: Hyperparameters, loss functions, and optimizers for the DNN training procedure. BCE = Binary Cross Entropy; RMSProp = Root Mean Square Propagation.


4.2 Convolutional Neural Network (CNN)

        The CNN model utilizes convolutional layers and treats the data as a 2D input of shape 32 x 99. Figure 3 depicts the architecture, and Table 4 details each layer.

        In short, the model consists of two convolutional layers, then a max-pooling layer, and finally two fully connected layers. The convolutional layers treat the input as a 2D image and apply 3x3 filters through the convolution operation; this type of layer is mainly used in tasks involving images. The max-pooling layer is used to reduce the spatial dimension of the data, sliding a 2x2 window over the image and reducing each window to a single value: that of the neuron with the highest activation. Max pooling thus reduces the number of parameters required in the final fully connected layers and helps the network avoid overfitting.

        Like the DNN model, the CNN weights are initialized using Xavier's normal technique with the bias set to 0.
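        A corresponding PyTorch sketch is shown below; the filter counts, padding and hidden width are illustrative assumptions, since the exact values are those of Table 4:

```python
import torch
import torch.nn as nn

class CNN(nn.Module):
    """Sketch of the CNN: two 3x3 convolutional layers, one 2x2 max-pooling
    layer and two fully connected layers. Filter counts, padding and the
    hidden width are assumptions (the real values are in Table 4)."""
    def __init__(self, p_drop=0.5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2),     # (32, 99) -> (16, 49) spatially
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(32 * 16 * 49, 128), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(128, 1), nn.Sigmoid(),
        )
        for m in self.modules():
            if isinstance(m, (nn.Conv2d, nn.Linear)):
                nn.init.xavier_normal_(m.weight)   # Xavier normal weights
                nn.init.zeros_(m.bias)             # biases initialized to 0

    def forward(self, x):                          # x: (batch, 1, 32, 99)
        return self.classifier(self.features(x))

model = CNN()
probs = model(torch.randn(8, 1, 32, 99))  # -> (8, 1) probabilities
```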

        The hyperparameters, optimizers and loss functions used for training are reported in Table 5. There are slight differences between the two datasets.

Table 5: Hyperparameters, loss functions, and optimizers in the CNN training process. BCE = Binary Cross Entropy; SGD = Stochastic Gradient Descent.


5 Result Analysis

        This section is divided into subsections.

        Section 5.1 focuses on the comparison of the obtained results with the expected results from the study being reproduced [2], and on the differences in model performance between DEAP and MAHNOB.

        On the other hand, Section 5.2 describes the statistical tests performed to compare DNN and CNN models with each other, with the aim of discovering whether there are significant differences between the two models.

        Finally, Section 5.3 is dedicated to the performance of the arousal classification model.


5.1 Analysis of results between data sets

        The first way to evaluate a model is the simplest one. As mentioned in Section 3.3, each dataset is divided into two subsets: training part and testing part. For this experiment, the model has been trained on the training part of the dataset and tested on the test set of the corresponding dataset.

        The results of binary valence classification can be found in Table 6. These specific results refer to the best model obtained during training.

Table 6: Results of valence classification by the DNN and CNN models on the DEAP and MAHNOB datasets. The confidence interval refers to the 95% confidence level and is calculated by approximating the binomial distribution of the test set evaluation with a Gaussian distribution. The script confidence-intervals.py contains the code for the calculations.
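        A minimal sketch of that interval computation; the accuracy and test-set size below are placeholder numbers, not results from Table 6:

```python
import math

def accuracy_ci(accuracy, n_test, z=1.96):
    """95% confidence interval for test accuracy, approximating the binomial
    distribution of correct test-set predictions with a Gaussian."""
    half_width = z * math.sqrt(accuracy * (1 - accuracy) / n_test)
    return accuracy - half_width, accuracy + half_width

# e.g. 60% accuracy on a 100-example test set (illustrative numbers only)
print(accuracy_ci(0.60, 100))  # roughly (0.504, 0.696)
```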

        From these results, the models generally perform better on DEAP than on MAHNOB. One likely factor is that MAHNOB contains less than half the number of examples of DEAP, making the models harder to train and more prone to overfitting. As can be seen from Table 6, the DNN model seems to outperform the CNN model on both datasets, especially on DEAP; however, this informal observation is questioned in Section 5.2, where the two models are statistically compared.

        The model was also evaluated using K-fold cross-validation. For this technique, the dataset is divided into K folds of equal size (if possible), and then, each fold in turn is used as a test set, while the rest of the dataset is used as a training set. Therefore, K models are trained and their accuracy is evaluated, so the final reported accuracy of K-fold cross-validation is the average of these accuracies.
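        In code, the procedure can be sketched as follows; train_and_evaluate is a hypothetical helper standing in for the actual PyTorch training loop:

```python
import numpy as np
from sklearn.model_selection import KFold

def kfold_accuracy(X, y, train_and_evaluate, k=32):
    """K-fold cross-validation sketch: train k models, each tested on a
    different fold, and report the mean of the k test accuracies."""
    accuracies = []
    for train_idx, test_idx in KFold(n_splits=k, shuffle=True).split(X):
        acc = train_and_evaluate(X[train_idx], y[train_idx],
                                 X[test_idx], y[test_idx])
        accuracies.append(acc)
    return np.mean(accuracies)   # the reported K-fold accuracy
```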

        The 32-fold cross-validation results of DEAP and the 6-fold cross-validation of MAHNOB are shown in Table 7.

Table 7: K-fold cross-validation results of DNN and CNN on DEAP and MAHNOB. The DEAP runs used 32 folds, while the MAHNOB runs used 6 folds. The script to replicate this experiment can be found under the name kfold_cross-validation.py.

        The accuracy found using K-fold cross-validation is much lower than that found using the fixed train/test split. It can therefore be said that the models suffer from high-variance error, i.e. their performance depends strongly on the specific training and test sets provided to them. For the results in Table 6, it is likely that the train/test split applied to the dataset was a "lucky" one that produced high accuracy by chance.

        The high variance conjecture is also confirmed by the fold-specific accuracy obtained during K-fold cross-validation. For example, in a K-fold run of a DNN model on DEAP, fold accuracy ranged from 43% to 78%, demonstrating how different dataset splits can fundamentally change accuracy results. The same behavior was observed with MAHNOB, albeit to a less extreme extent.

        The K-fold results on DEAP can be compared with those reported in [2], since that study also used 32-fold cross-validation as its evaluation technique. Their DNN and CNN accuracies are 75% and 81% respectively, while ours are 58% and 59%. The gap in accuracy is huge, and although the dataset and models used in this study differ slightly from those in [2], an exact replication of the preprocessing steps and model architecture of [2] produced results very similar to those in Table 7, as reported in Section 2.1.

        The K-fold results also confirm the earlier finding that both models perform better on DEAP than on MAHNOB. Another interesting observation is that the CNN model slightly outperforms the DNN model on both datasets, even though the DNN model achieved a higher maximum accuracy when evaluated on a single train/test split.


5.2 Statistical testing of comparative models

        For simplicity, all statistical tests are performed on the valence prediction models. However, based on the results in Section 5.3, we believe that the statistical test results for the arousal models would be similar.


5.2.1 McNemar's test

        The McNemar test is used to check whether there is a statistically significant difference between the performance of the DNN and CNN models. To conduct this test, the predictors whose results are reported in Table 6 were used, i.e., predictors trained on the default train/test split of DEAP and MAHNOB. The script to reproduce the McNemar tests presented in this section is McNemar-test.py.

        The McNemar test works as follows [5]: the predictors to be compared, in this case f_DNN and f_CNN, are evaluated against the test set, and the following contingency table is built:

                          f_CNN wrong    f_CNN correct
        f_DNN wrong           n00             n01
        f_DNN correct         n10             n11

        where n00 is the number of samples in the test set misclassified by both predictors, n01 is the number of samples misclassified by f_DNN but not by f_CNN, n10 is the number of samples misclassified by f_CNN but not by f_DNN, and n11 is the number of samples correctly classified by both predictors. Therefore, n00 + n01 + n10 + n11 equals the number of examples in the test set.

        The null hypothesis of the McNemar test is that the two predictors have the same error rate, that is, n01 = n10. The test compares the expected counts of n01 and n10 with the actually observed counts using a goodness-of-fit chi-square test.

        In practice, under the null hypothesis, the following McNemar test statistic is greater than \chi_{1,0.95}^{2} = 3.841 with probability less than 5%:

        \chi^{2} = \frac{(|n_{01} - n_{10}| - 1)^{2}}{n_{01} + n_{10}}

        Therefore, if the statistic exceeds this value, the null hypothesis can be confidently rejected, and the two predictors can be said to have significantly different performance on the selected training and test sets.
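        The statistic itself is a one-liner; here is a sketch with illustrative counts, not the actual counts from the experiments:

```python
def mcnemar_statistic(n01, n10):
    """McNemar test statistic (with continuity correction) computed from the
    contingency-table counts defined above."""
    return (abs(n01 - n10) - 1) ** 2 / (n01 + n10)

# Reject the null hypothesis at the 95% level when the statistic > 3.841.
stat = mcnemar_statistic(n01=14, n10=10)   # illustrative counts only
print(stat, stat > 3.841)                  # 0.375 False
```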

        For the DNN and CNN models trained on DEAP, the resulting statistic computed from the contingency table is 0.487, which is not enough to confidently reject the null hypothesis. Therefore, although the DNN and CNN predictors achieved different accuracies in Section 5.1, the McNemar test indicates that we should accept the null hypothesis that these two predictors do not have significantly different performance.

        For the predictors trained on MAHNOB, the obtained contingency table shows that n01 and n10 are almost identical, so even without doing any calculations it can be said that, according to the McNemar test, these two predictors have essentially the same performance on this dataset as well.


5.2.2 5x2cv paired t test

        While the McNemar test concerns the comparison of two predictors (where a predictor is the result of running a learning algorithm, i.e. the resulting model), the 5x2cv test compares two learning algorithms. Therefore, to perform this test, it is not necessary to use the pretrained models of Section 5.1, as was done for the McNemar test.

        The 5x2cv paired t-test is a statistical test based on five repetitions of 2-fold cross-validation, designed to discover whether there is a significant performance difference between two learning algorithms [5]. This test shows a low Type I error, although not as low as the McNemar test. On the other hand, the 5x2cv test has higher power than McNemar's, i.e. it is better at detecting differences when they really exist.

        One disadvantage of the 5x2cv test is that it is computationally expensive: roughly ten times more expensive than the McNemar test. Dietterich recommends in [5] using 5x2cv instead of McNemar's test where computationally feasible, and fortunately the data and models of this study made it feasible.

        The test works as follows. Five iterations of 2-fold cross-validation are performed. In each iteration, the data is divided into two sets, S1 and S2; both learning algorithms A and B are first trained on S1 and tested on S2, and then vice versa. As a result, four error estimates are obtained: p_{A}^{(1)}, p_{B}^{(1)}, p_{A}^{(2)} and p_{B}^{(2)}. The two estimated differences are p^{(1)} = p_{A}^{(1)} - p_{B}^{(1)} and p^{(2)} = p_{A}^{(2)} - p_{B}^{(2)}. Then, with \bar{p} = (p^{(1)} + p^{(2)}) / 2, the estimated variance is s^{2} = (p^{(1)} - \bar{p})^{2} + (p^{(2)} - \bar{p})^{2}. Since this calculation is repeated for each iteration i = 1, ..., 5, we obtain the variances s_{i}^{2}. The test statistic is then:

        \tilde{t} = \frac{p_{1}^{(1)}}{\sqrt{\frac{1}{5} \sum_{i=1}^{5} s_{i}^{2}}}

        where p_{1}^{(1)} is the p^{(1)} of the first iteration.

        Under the null hypothesis, \tilde{t} follows a t-distribution with 5 degrees of freedom. Therefore, setting alpha to 0.05, the null hypothesis can be rejected if \tilde{t} > 2.571 or \tilde{t} < -2.571.
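        A sketch of the whole procedure is given below; error_a and error_b are hypothetical helpers that train one learning algorithm on the first index set and return its error rate on the second, and the use of stratified splits is an assumption:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def five_by_two_cv_t(X, y, error_a, error_b):
    """Sketch of the 5x2cv paired t statistic defined above."""
    p_first, variances = None, []
    for i in range(5):                                  # five repetitions
        cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=i)
        s1, s2 = (test for _, test in cv.split(X, y))   # the two halves S1, S2
        # train on S1 / test on S2, then the other way around
        p1 = error_a(X, y, s1, s2) - error_b(X, y, s1, s2)
        p2 = error_a(X, y, s2, s1) - error_b(X, y, s2, s1)
        p_bar = (p1 + p2) / 2
        variances.append((p1 - p_bar) ** 2 + (p2 - p_bar) ** 2)
        if p_first is None:
            p_first = p1                                # p^(1) of iteration 1
    return p_first / np.sqrt(np.mean(variances))

# Reject the null hypothesis at alpha = 0.05 when |t| > 2.571.
```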

        The 5x2cv test has been used to compare DNN and CNN models on DEAP and MAHNOB. These tests used the same architecture and hyperparameters reported in Sections 4.1 and 4.2, except that the number of epochs was reduced to 150 to meet hardware constraints. A script to reproduce these results can be found under the name 5x2cv-test.py.

        On DEAP, the resulting statistic is -2.502, which is very close to -2.571, the threshold for rejecting the null hypothesis with 95% confidence. For a slightly higher alpha value, such as 0.06, the null hypothesis could be rejected, which suggests that there may well be a statistically significant difference between the two compared learning algorithms.

        Surprisingly, although in the results of Table 6 the DNN network achieved higher accuracy than the CNN, in this case the average accuracies of the DNN and CNN models are 54.3% and 57.2% respectively, so the CNN model is better than the DNN. Note that the 2-fold cross-validation accuracies are worse than those reported in Section 5.1 because the training set is much smaller in this case, which may lead to overfitting.

        On MAHNOB, on the other hand, the t-statistic calculated by the test is 0.306, indicating that both models perform similarly on this dataset.


5.3 Arousal classification results

        The current research focuses mainly on valence classification, but some experiments on arousal classification have also been performed. Specifically, K-fold cross-validation was also performed on it, and the results in Table 8 were obtained.

Table 8: Results of K-fold cross-validation of DNN and CNN on DEAP and MAHNOB for arousal binary classification. The DEAP run used 32 folds, while the MAHNOB run used 6 folds.

        These results are consistent with those for valence classification, again suggesting that the CNN model is slightly better than the DNN model. They are also consistent with the results reported in [2], which likewise show a slight decrease in accuracy compared to valence classification.


6 Conclusion

        In this work, we first tried to replicate the results of another paper [2], but we were unable to do so because the accuracy of our models was much lower than that reported in the replicated paper.

        However, the study found that both tested models performed similarly on DEAP and MAHNOB, meaning that they proved to be fairly robust classifiers of valence/arousal from EEG and could likely be used on other EEG-based datasets with little to no modification. Since these results were obtained with the basic and general neural network models described in this study, it is reasonable to believe that more specialized and complex neural architectures may perform better in EEG emotion classification.

        Furthermore, from a statistical perspective, the CNN architecture performed better than the DNN model, at least on DEAP. This result is important because future research may benefit more from experimenting with different CNN architectures than with DNN architectures.

