【Paper Notes】DepressionNet: A Novel Summarization Boosted Deep Framework for Depression Detection

DepressionNet: A Novel Summarization Boosted Deep Framework for Depression Detection on Social Media



Conference : SIGIR 2021

Task : Depressed User Detection

Original : link

Abstract

This paper proposes a novel summarization-boosted deep framework for depression detection on social media. The framework first selects relevant content from each user's tweets through a hybrid extractive-abstractive summarization strategy, yielding more fine-grained and relevant content. The framework combines a CNN with an attention-enhanced GRU and outperforms existing baselines.

The main contributions of this paper include:

  • A deep learning framework, DepressionNet, that connects user behavior and post history for depression detection;
  • A cascaded BERT-BART abstractive-extractive automatic text summarization model that addresses two major requirements:
    • broad coverage of depression-related tweets, by condensing large numbers of tweets into short, conclusive descriptions;
    • preservation of content that may be relevant to depression.
  • A cascaded deep network that exploits user behavior information by concatenating behavioral features from different layers.

Motivation

Previous studies on detecting depressed users on social media mainly relied on user behavior and language patterns. Their disadvantage is that the model is also trained on content irrelevant to depression detection. Such content introduces inefficiencies and performance degradation, e.g. the curse of dimensionality, and depression-irrelevant content is likely to have a more dominant impact on classification than depression-sensitive content.

We therefore need methods that weaken the content that negatively impacts depression classification; moreover, we need an efficient feature-selection strategy that can detect implicit/latent user patterns, i.e. patterns that are difficult to model well with simple word-frequency statistics or surface features.

Model

In our deep framework, based on CNN and BiGRU+Attention, we use an abstractive-summarization mechanism to compress user content into a summary, which helps retain the most salient content for each user. The CNN offers faithful feature modeling, while the BiGRU alleviates the CNN's inability to capture long-range sequence relationships. To further capture user patterns, we introduce various user behavior features, such as social network connections, emotion, depression-domain-specific features, and user-specific latent topic information, and apply stacked BiGRUs. Finally, the model concatenates user behavior and post summaries. We call this model DepressionNet; it consists of two shared, hierarchically fused user behavior networks and a post-history-aware network.

(Figure: overall architecture of DepressionNet)

Extractive-Abstractive Summarization

  • Motivation: Summarization helps the model focus on condensed, salient content. This paper is the first to apply text summarization to social media depression detection. As shown in the word clouds in the figure below, after summarizing depressed users' posts, redundant and unrelated words such as "still" and "eleven" are removed, while depression-related words such as "sick" and "mental" become prominent. We observe the same pattern in the utterances of non-depressed users: after summarization, the focus stays only on the most salient, non-redundant patterns.

(Figure: word clouds of user posts before and after summarization)

  • This paper proposes a framework that fuses the interplay between extractive and abstractive summarization. Since a user's posts are huge in number, noisy, and highly redundant, extractive summarization helps automatically select user-generated content by removing redundant information, while generative (abstractive) summarization further compresses the content while retaining its semantics. This article uses the BERT-BART model for summarization.

  • BERT sentence encoding and K-means clustering : for a user's post history $T_n$, extractive summarization selects the $m$ most important posts $T_m$. BERT natively outputs word-level vectors, so these are pooled into a sentence-level embedding for each tweet. K-means clustering is then applied to these high-dimensional vectors representing the semantic centers of the text, and the tweets closest to the cluster centers are selected as the most important ones.
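A minimal numpy sketch of the extractive step, assuming tweet embeddings are already computed (the toy random embeddings, the tiny k-means, and the function names are illustrative, not the paper's code):

```python
import numpy as np

def kmeans(X, k, iters=50, seed=0):
    """Plain k-means on the row vectors of X; returns the cluster centers."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center, then recompute centers
        labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def select_representative_tweets(embeddings, m):
    """Pick the m tweets whose embeddings lie closest to the m cluster centers."""
    centers = kmeans(embeddings, m)
    chosen = []
    for c in centers:
        dists = ((embeddings - c) ** 2).sum(axis=1)
        for idx in np.argsort(dists):
            if idx not in chosen:  # avoid picking the same tweet twice
                chosen.append(int(idx))
                break
    return sorted(chosen)

# toy stand-in for BERT tweet embeddings: 40 tweets, 8-dim vectors
emb = np.random.default_rng(1).normal(size=(40, 8))
picked = select_representative_tweets(emb, m=5)
```

`picked` then holds the indices of the $m$ tweets handed to the abstractive stage.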


  • BART generative summarization : Subsequently, we further compress away redundant information that may not have been caught during the extractive stage. We use BART-large fine-tuned on the CNN/DM dataset for abstractive summarization. The output summary is $S = \{w_1, w_2, .., w_N\}$, $S \in \mathbb{R}^{V \times N}$, where $V$ is the vocabulary size and $N$ is the summary length.
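A sketch of this step with the Hugging Face `transformers` library, assuming the `facebook/bart-large-cnn` checkpoint (BART-large fine-tuned on CNN/DailyMail) matches what the paper describes; note this downloads a large model, and the sample text here is made up:

```python
from transformers import pipeline

# BART-large fine-tuned on CNN/DailyMail for abstractive summarization
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# concatenation of the m tweets kept by the extractive stage (toy example)
extracted = (
    "i can't sleep again and everything feels pointless. "
    "skipped class today, no energy to get out of bed. "
    "tired of pretending i'm fine."
)
summary = summarizer(extracted, max_length=30, min_length=5, do_sample=False)
print(summary[0]["summary_text"])
```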

  • CNN+BiGRU+Attention captures sequence information : The word embedding generated by BART represents the word-level summary of user posts, and this word embedding is used as the input of the stacked CNN and BiGRU-Attention modules to capture sequence information, such as the context of the sentence. The attention mechanism has an advantage in this scenario because it helps the model focus on depression-related words .

    • Word Embedding : Using a Skip-gram model, each word $w_i$ in $S$ is converted to a vector $x_i$. We use pre-trained 300-dimensional embeddings, so the embedded summary sentence can be expressed as $X = \{x_1, x_2, .., x_N\}$.

    • CNN : The matrix of word vectors from the embedding layer is fed into CNN + max-pooling + ReLU. The goal of the CNN is to extract the most relevant features of the embedded summary sentences. Tweet word-vector representations are usually complex, so the CNN learns the spatial structure of the summary text, and pooling extracts the important features. Finally, a fully connected layer provides the BiGRU with a unified global synthesis feature.

    • BiGRU-Attention : The feature representation output by the CNN is fed into a BiGRU, whose forward and backward hidden states at each step are combined by concatenation (⊕). Attention weights are computed via $u_t = \tanh(h_t)$; the attention distribution is then obtained through softmax, and the final weighted sum of hidden states is the summary feature learned by the model.

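The attention-pooling step can be sketched in numpy. This uses the standard additive-attention parameterization; the projection `W`, bias `b`, and context vector `u_w` are assumptions on my part (the note's $u_t = \tanh(h_t)$ is a simplification of this form), not the paper's exact weights:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_pool(H, W, b, u_w):
    """H: (T, d) BiGRU hidden states -> attention-weighted summary vector.

    u_t = tanh(W h_t + b), alpha_t = softmax(u_t . u_w), s = sum_t alpha_t h_t
    """
    U = np.tanh(H @ W.T + b)   # (T, a) projected hidden states
    alpha = softmax(U @ u_w)   # (T,)  attention weights over time steps
    return alpha @ H, alpha    # (d,) summary feature, plus the weights

rng = np.random.default_rng(0)
T, d, a = 6, 8, 4              # sequence length, hidden size, attention size
H = rng.normal(size=(T, d))
s, alpha = attention_pool(H, rng.normal(size=(a, d)), rng.normal(size=a), rng.normal(size=a))
```

The weights `alpha` sum to 1, so `s` is a convex combination of the hidden states, letting the model up-weight depression-related time steps.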

User Behaviour Modelling

  • User Behavior Feature Extraction

    Divide user behavior characteristics into four types, as shown in the table below.

    • User social network .
    • Emotions . VAD lexicon: a list of English words with scores along the valence (V), arousal (A), and dominance (D) dimensions. We compute a VAD score for each of a user's tweets and aggregate these into a per-user VAD score. In addition, emoticons in tweets are counted as positive, neutral, or negative.
    • Domain-specific features of depression . Consider two features: depressive symptoms and antidepressant associations.
      • Depressive Symptoms: Counts the number of occurrences of any of the nine depressive symptoms in the DSM-IV in user tweets. Symptoms are specified in nine lists, each containing various synonyms for a specific symptom.
      • Antidepressant relevance: We created a separate list of antidepressant drug names from Wikipedia and counted their occurrences in user tweets.
    • Topic features . Topic modeling discovers salient per-user patterns (represented as distributions over the words in tweets) under a mixed-membership assumption, i.e. each tweet may exhibit multiple patterns. Frequently occurring topics play an important role in depression detection. This paper first takes the corpus of all tweets from all depressed users, splits each tweet into a list of words, and assembles them in order of decreasing frequency, with stop words removed. The unsupervised LDA topic model is applied to extract latent topic distributions. For each user, we count how many times each word appears in the user's tweets.

    (Table: the four categories of user behavior features)
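A minimal sketch of two of these extractors, VAD averaging and symptom counting. The tiny lexicons below are illustrative stand-ins for the full VAD lexicon and the nine DSM-IV symptom synonym lists:

```python
# word -> (valence, arousal, dominance), scores in [0, 1]; toy lexicon
VAD = {
    "sad": (0.1, 0.4, 0.2), "tired": (0.2, 0.2, 0.3),
    "happy": (0.9, 0.6, 0.7), "hopeless": (0.05, 0.3, 0.1),
}
# symptom -> synonym set (the paper uses nine lists, one per DSM-IV symptom)
SYMPTOM_LISTS = {
    "insomnia": {"insomnia", "sleepless", "awake"},
    "fatigue": {"tired", "exhausted", "fatigue"},
}

def user_vad(tweets):
    """Average V/A/D over all lexicon words found in a user's tweets."""
    hits = [VAD[w] for t in tweets for w in t.lower().split() if w in VAD]
    if not hits:
        return (0.5, 0.5, 0.5)  # neutral fallback when no lexicon word appears
    return tuple(sum(dim) / len(hits) for dim in zip(*hits))

def symptom_counts(tweets):
    """Count occurrences of each symptom's synonyms across a user's tweets."""
    words = [w for t in tweets for w in t.lower().split()]
    return {s: sum(w in syns for w in words) for s, syns in SYMPTOM_LISTS.items()}

tweets = ["so tired and sad today", "awake at 3am again sleepless"]
v, a, d = user_vad(tweets)
counts = symptom_counts(tweets)  # e.g. counts["insomnia"] == 2 here
```

Each user thus yields a small numeric feature vector (VAD triple plus symptom counts) that feeds the behavior network.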

  • Stacked BiGRU feature modeling :

    To obtain fine-grained information, we apply a stacked BiGRU to each of the behavioral feature types. Specifically, two BiGRUs are used, capturing each user's behavioral semantics in the forward and backward directions respectively, followed by a fully connected layer. The result of behavioral modeling is a high-level representation that captures behavioral semantic information, which plays a key role in depression diagnosis (verified by ablation experiments).

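The stacking is just one BiGRU's output sequence feeding the next. A minimal numpy sketch with random, untrained weights (dimensions and the `GRU`/`bigru` helpers are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRU:
    """Minimal single-direction GRU with random weights (illustration only)."""
    def __init__(self, d_in, d_h, seed=0):
        rng = np.random.default_rng(seed)
        s = 0.1
        self.Wz, self.Uz = s * rng.normal(size=(d_h, d_in)), s * rng.normal(size=(d_h, d_h))
        self.Wr, self.Ur = s * rng.normal(size=(d_h, d_in)), s * rng.normal(size=(d_h, d_h))
        self.Wh, self.Uh = s * rng.normal(size=(d_h, d_in)), s * rng.normal(size=(d_h, d_h))
        self.d_h = d_h

    def run(self, X):  # X: (T, d_in) -> hidden states (T, d_h)
        h, out = np.zeros(self.d_h), []
        for x in X:
            z = sigmoid(self.Wz @ x + self.Uz @ h)          # update gate
            r = sigmoid(self.Wr @ x + self.Ur @ h)          # reset gate
            h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
            h = (1 - z) * h + z * h_tilde
            out.append(h)
        return np.stack(out)

def bigru(X, d_h, seed=0):
    """Concatenate forward and backward GRU hidden states at each step."""
    fwd = GRU(X.shape[1], d_h, seed).run(X)
    bwd = GRU(X.shape[1], d_h, seed + 1).run(X[::-1])[::-1]
    return np.concatenate([fwd, bwd], axis=1)  # (T, 2*d_h)

# stack two BiGRUs over a toy behavior-feature sequence
X = np.random.default_rng(2).normal(size=(5, 6))  # (timesteps, feature dim)
h1 = bigru(X, d_h=4, seed=0)    # first BiGRU layer
h2 = bigru(h1, d_h=4, seed=10)  # second layer reads the first layer's outputs
```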

Fusion of User Behaviour and Post History

The overall structure of the model consists of two parallel networks: the user post-history network (a post-history-aware network plus fully connected layers) and the user behavior network (two shared hierarchical late-fusion networks; my understanding is that "shared" means the GRU weights are shared across the four feature types).

A hierarchical, time-aware network integrates multiple fully connected layers to fuse the user behavior representation with the user post representation.

For each user, we extract a compact feature representation of behavior and of post history, then perform late fusion. The proposed framework models a high-level representation that captures behavioral semantics. Similarly, the user post-history representation, extracted from the user's history, reflects the gradual increase of depressive symptoms. We concatenate these two representations to generate a feature map that considers both user behavior and historical tweets. The output of DepressionNet is a response map representing the similarity scores between depressed and non-depressed users.

Since the network incorporates multiple convolutional and fully connected layers, their feature maps may have different spatial resolutions. To overcome this, we use max pooling to downsample the shallow convolutional layers to the same resolution as the deep ones.
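The late-fusion step itself reduces to concatenation followed by fully connected layers. A numpy sketch with made-up dimensions and random, untrained weights (the real network uses more layers plus the max-pooling alignment described above):

```python
import numpy as np

rng = np.random.default_rng(3)

behavior_repr = rng.normal(size=32)  # from the stacked-BiGRU behavior network
summary_repr = rng.normal(size=64)   # from the CNN+BiGRU(Att) post-summary network

# late fusion: concatenate the two compact representations
fused = np.concatenate([behavior_repr, summary_repr])   # (96,)

# two fully connected layers down to 2 class scores (depressed vs. not)
W1, b1 = 0.1 * rng.normal(size=(16, 96)), np.zeros(16)
W2, b2 = 0.1 * rng.normal(size=(2, 16)), np.zeros(2)
hidden = np.maximum(0.0, W1 @ fused + b1)  # ReLU
logits = W2 @ hidden + b2
probs = np.exp(logits) / np.exp(logits).sum()  # softmax over the two classes
```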

Hierarchical integration of user behavior networks brings significant performance improvements (verified by ablation experiments).

Experiments and Results

Ablation analysis shows that using only user post summaries works better than using only user behavior features. Comparative experiments show that DepressionNet achieves the best performance by combining user post summaries with user behavior features.

In addition, comparative experiments show that images in user posts can play a very important role: DepressionNet slightly outperforms post+image models, and jointly learning the interaction between text and images provides the model with additional multimodal knowledge.

Not only does DepressionNet perform well overall, its components do too: the stacked BiGRU for capturing user behavior features and CNN+BiGRU(Att) for user summary modeling each outperform alternative models.

The authors also run many further experiments to verify the role of each behavioral feature, the influence of text length, and the effect of different tweet-selection strategies; summarization shows clear superiority and stability throughout. Evaluating the generated summaries via t-SNE visualization reveals a clear separation between the summary documents of depressed and non-depressed users, for both BART and DistilBART.

(Figure: t-SNE visualization of depressed vs. non-depressed summary documents)

Conclusion

This paper proposes a novel hierarchical deep learning network that fuses multiple fully connected layers to integrate user behavior representations and user posts. It introduces summarization enhancement to filter irrelevant content, sharpen the model's attention to depression-related information, reduce the data dimensionality, and improve efficiency. Automatic summarization also frees feature selection from arbitrary design choices, such as discarding sentences containing certain predefined words or sentences of a certain length. Ultimately, the joint effect of user behavior and summarization-enhanced post-history semantics enables the model to significantly outperform existing SOTA baselines.

The idea of this article is to jointly model user behavior and the text users post to improve depression detection. Given that user-generated text is voluminous, noisy, and redundant, and that depressive symptoms become more pronounced over time, text summarization is introduced to filter irrelevant content and strengthen the classification dominance of salient content. The authors' motivation for each part of the model is well explained and worth learning from. Only the fusion part is not very clear: isn't it just several fully connected layers strung together? Why make it so complicated? It feels exaggerated just for the sake of telling a story.


Origin blog.csdn.net/m0_47779101/article/details/129652618