Solving Aspect Category Sentiment Analysis as a Text Generation Task paper reading (EMNLP2021)

Table of contents

Title Translation: Solving Aspect Category Sentiment Analysis as a Text Generation Task

Paper link: https://aclanthology.org/2021.emnlp-main.361.pdf

Summary

1 Introduction

2 Related work

3 Methods

3.1 Pre-trained language model

3.2 Classification method

3.3 Masked Language Model Approach

3.4 Generation method

3.4.1 Template creation

3.4.2 Inference

3.4.3 Training

4 Experiments

4.1 Baseline method

4.2 Development experiment

4.3 ACSA experiments

4.4 ACD experiment

4.5 Joint Model

4.6 Few-shot and zero-shot learning

5 Analysis 

5.1 Effect of category frequency

5.2 Case studies

6 Summary


Title Translation: Solving Aspect Category Sentiment Analysis as a Text Generation Task

Paper link: https://aclanthology.org/2021.emnlp-main.361.pdf

Summary

Aspect category sentiment analysis (ACSA) has received increasing research attention. The dominant methods make use of pre-trained language models by learning effective aspect category-specific representations and adding task-specific output layers on top of them. We consider a more direct way of using pre-trained language models, casting ACSA as a natural language generation task and using natural language sentences to represent the output. Our approach allows the pre-trained knowledge in seq2seq language models to be used more directly, since it follows the task setting used during pre-training. Experiments on several benchmarks show that our method gives the best reported results, with large advantages in the few-shot and zero-shot settings.

1 Introduction

Aspect-based sentiment analysis (ABSA) is a fine-grained sentiment analysis task that includes many subtasks, two of which are aspect category sentiment analysis (ACSA) and aspect category detection (ACD). Figure 1 shows an example where the input is "The restaurant was expensive, but the menu was great". ACD detects aspect categories, such as price and food, and ACSA predicts the sentiment polarity of each aspect category. In this work, we focus on these two tasks as well as a joint task that combines the two.

    Previous studies investigated various approaches that treat ACSA and ACD as classification tasks, learning aspect-specific sentence representations (Wang et al., 2016; Ruder et al., 2016). Recently, pre-trained language models (PLMs) have shown their effectiveness in this regard (Jiang et al., 2019). The main idea is to leverage a pre-trained model such as BERT (Devlin et al., 2019a) to represent an aspect-specific form of the input (e.g., by concatenating the aspect category to the end of the input sentence, as in Figure 3(a)), so that the ACSA and ACD classifiers are provided with useful semantic features. These methods give very competitive results (Sun et al., 2019; Li et al., 2020b).

    The above classification models benefit from contextualized representations that incorporate knowledge learned through pre-training on large-scale data (Lin et al., 2019). However, their use of pre-trained knowledge can be considered indirect for at least two reasons. First, classification is performed by a separate network with its own parameters on top of the pre-trained representation. Second, integrating the aspect category makes the aspect-specific input not exactly a natural language sentence, which differs from the pre-training setting. Intuitively, by connecting pre-training and ACSA at the task level rather than only at the representation level, more pre-trained knowledge can be exploited.

    We investigate the above potential by casting the sentiment classification task as a language modeling task. In particular, as shown in Figure 2, both ACSA and ACD are converted into sequence-to-sequence (seq2seq) tasks, where the encoder takes the input sentence and the decoder generates a natural language sentence. For ACD, the output follows a template stating whether a particular aspect category is discussed (e.g., "the <category_type> category is discussed"); for ACSA, the output states the sentiment polarity of the given aspect category (e.g., "the <given_category> category has a sentiment polarity of positive"). This setup is highly consistent with BART's denoising autoencoder pre-training scheme (Lewis et al., 2020), which we use as the pre-trained model. Compared with classification-based methods, our method does not introduce additional network parameters and thus generalizes better to new domains (Brown et al., 2020; Gao et al., 2020). Given a new domain with completely unseen aspect categories and sentiment labels, our method can be applied without changing the output layer structure.

    In addition to classification-based methods, we take masked language models (MLMs) as a baseline; mask filling is a natural counterpart of our method. As shown in Fig. 3(b), unlike our method, the output template is concatenated to the input and the key words are masked for prediction. This MLM task is highly consistent with BERT pre-training (Devlin et al., 2019a). Compared with such MLM methods, the generative method can better learn the correlation between the input and the output template as two related sequences, as demonstrated by BART's strong performance on abstractive text summarization (Lewis et al., 2020).

    Experimental results on three standard benchmark datasets show that both the generative and MLM methods outperform classification methods using the same pre-trained language model. Moreover, the generative approach performs better than the MLM approach and substantially outperforms previous state-of-the-art methods. Furthermore, using the generative approach, we show that jointly performing ACSA and ACD leads to better results than a traditional pipeline. To the best of our knowledge, we are the first to use generative pre-trained language models to solve the ACSA/ACD problems. We release the code at https://github.com/lgw863/ACSA-generation.

2 Related work

Aspect Category Sentiment Analysis  Wang et al. (2016) propose an attention-based LSTM network that can focus on different parts of a sentence when different aspect categories are given as input. Ruder et al. (2016) model the interdependence of sentences in a text with a hierarchical bidirectional LSTM. Yin et al. (2017) model the task as a machine comprehension problem by constructing pseudo question-answer pairs. Xue and Li (2018) use a CNN to extract sentiment features and a gating mechanism to selectively output features related to the aspect category; related improvements are proposed by Xing et al. (2019) and Liang et al. (2019). Sun et al. (2019) construct auxiliary sentences from aspect categories and convert ACSA into a sentence-pair classification task. Li et al. (2020b) predict the sentiment of an aspect category mentioned in a sentence by aggregating the sentiments of the words indicating that category.

    To avoid error propagation, several joint models have been proposed that perform ACD and ACSA jointly. Schmitt et al. (2018) propose two joint models, an end-to-end LSTM and an end-to-end CNN, which simultaneously generate all aspect categories and their corresponding sentiment polarities. Hu et al. (2019) propose the Constrained Attention Network (CAN) to constrain the assignment of attention weights. Wang et al. (2019) propose an aspect-level sentiment capsule model (AS-Capsules), which exploits the correlation between aspect categories and sentiments through shared components. Li et al. (2020a) propose a new joint model that includes a shared sentiment prediction layer.

    All of the above models are classification methods, using a separate output network to give the output labels. Instead, we investigate natural language generation methods by directly following the pre-training process of language models. 

Masked Language Model Methods  A line of work uses masked language models (MLMs) to complete natural language understanding tasks. The basic idea is to leverage the information in pre-trained models by defining task-specific prompts for the language modeling task. Brown et al. (2020) use prompts for few-shot learning in text classification tasks. Schick and Schütze (2020) reformulate the input as cloze questions for text classification. Schick et al. (2020) and Gao et al. (2020) extend Schick and Schütze (2020) by automatically generating label words and templates, respectively. Petroni et al. (2019) extract relations between entities from BERT by constructing cloze-style templates. We are the first to apply such an approach to ACSA and use it as a baseline. Unlike these template-based models, our final model uses BART for text generation, which models the correlation between input and output sentences better than BERT.

    Generation Methods  There has been work treating NLP problems as sequence generation tasks (Vinyals et al., 2015; Ma et al., 2017; Stanovsky and Dagan, 2018; Raffel et al., 2020), where the output is a sequence of tokens rather than a natural language sentence. Daza and Frank (2018) view semantic role labeling as a sequence-to-sequence process. Li et al. (2019) tackle entity-relation extraction as a multi-turn question answering task. Our work is similar in converting an NLP task into a generation task. Different from the above methods, our goal is to make full use of the pre-trained knowledge in BART for ACSA.

3 Methods

Formally, for ACD, the input is a sentence X = {x1, ... , xn}, where xi represents the i-th word. For ACSA, a set of predetermined aspect categories is also given. We introduce the associated pretrained language model in Section 3.1, our classification approach in Section 3.2, our MLM approach in Section 3.3, and our generative approach in Section 3.4. 

3.1 Pre-trained language model

We use BERT (Devlin et al., 2019a) and BART (Lewis et al., 2020) as pre-trained language models. Both are built on the Transformer architecture (Vaswani et al., 2017). BERT is an encoder stack of Transformers trained for masked text infilling, where the model uses context words to predict masked words. BART is a seq2seq denoising autoencoder pre-trained for natural language generation. Its pre-training corrupts documents with arbitrary noising functions, such as randomly removing tokens from the input, and BART is trained to reconstruct the original text.

3.2 Classification method

We use a multilayer perceptron as the classifier, taking a representation vector as input. Both BERT and BART are used as the encoder.

BERT Classification  BERT takes “[CLS] input sentence [SEP] given_category [SEP]” as input. The final hidden state corresponding to "[CLS]" is used as the representation for classification.
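The sketch below illustrates this classification baseline with the Hugging Face Transformers library. It is a minimal illustration rather than the authors' released code: the model name, the polarity label set, and the use of BertForSequenceClassification (whose classifier head sits on the pooled [CLS] state) are assumptions for demonstration, and the head would of course need fine-tuning before its predictions are meaningful.

```python
# Minimal sketch of the BERT classification baseline (assumed setup, not the paper's code).
import torch
from transformers import BertTokenizer, BertForSequenceClassification

LABELS = ["positive", "negative", "neutral"]  # assumed ACSA polarity set

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

sentence = "The restaurant was expensive, but the menu was great."
category = "food"

# Passing a sentence pair yields "[CLS] sentence [SEP] category [SEP]".
inputs = tokenizer(sentence, category, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # classifier head on the pooled [CLS] state
print(LABELS[logits.argmax(dim=-1).item()])  # meaningful only after fine-tuning
```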

3.3 Masked Language Model Approach

Masked language models (MLMs) (Devlin et al., 2019a) complete a given prompt by filling in masked tokens. We refer to a template containing the given category and [MASK] tokens as a prompt. For the sentiment analysis task, BERT MLM takes the input sentence and the prompt as model input and predicts the sentiment polarity label word for the given category. For BART MLM, the same input is fed to both the encoder and the decoder, and the label word with the highest decoder prediction at the [MASK] position is taken as the predicted polarity label (see Figure 3(b)). We use the same templates in both the MLM method and the generative method, following the template creation procedure in Section 3.4.1.
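A minimal sketch of this mask-filling idea for BERT MLM is shown below, again using Hugging Face Transformers. The prompt wording, the label words, and the model name are illustrative assumptions; the paper's own templates follow Section 3.4.1.

```python
# Minimal sketch of the BERT MLM baseline (assumed prompt and label words).
import torch
from transformers import BertTokenizer, BertForMaskedLM

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")

sentence = "The restaurant was expensive, but the menu was great."
prompt = "The sentiment polarity of food is [MASK]."
label_words = {"positive": "good", "negative": "bad", "neutral": "ok"}  # assumed

inputs = tokenizer(sentence, prompt, return_tensors="pt")
# Position of the [MASK] token in the concatenated input.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]

with torch.no_grad():
    logits = model(**inputs).logits[0, mask_pos]  # vocabulary scores at [MASK]

# Pick the polarity whose label word scores highest at the masked position.
scores = {
    polarity: logits[tokenizer.convert_tokens_to_ids(word)].item()
    for polarity, word in label_words.items()
}
print(max(scores, key=scores.get))
```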

3.4 Generation method

We formulate ACSA and ACD as language model ranking problems under the seq2seq framework (see Figure 3(c)). The target sequence T_{ai,pk} = {t1, ..., tm} is a template filled with a given aspect category ai and a polarity type pk. We first introduce how to create templates in Section 3.4.1, and then present the details of inference and training in Sections 3.4.2 and 3.4.3, respectively.
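A minimal sketch of this template-ranking formulation is given below: each candidate polarity fills the template, the filled sentence is scored as a decoder target under BART, and the highest-scoring candidate is returned. The template wording, model name, and scoring function are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of scoring filled templates with BART (assumed setup).
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()

sentence = "The restaurant was expensive, but the menu was great."
category = "food"
polarities = ["positive", "negative", "neutral"]

def sequence_score(source: str, target: str) -> float:
    """Average token log-probability of `target` given `source`."""
    src = tokenizer(source, return_tensors="pt")
    tgt = tokenizer(target, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels makes position i of the logits predict tgt[i].
        logits = model(**src, labels=tgt).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_scores = log_probs.gather(-1, tgt.unsqueeze(-1)).squeeze(-1)
    return token_scores.mean().item()

templates = {
    p: f"The {category} category has a sentiment polarity of {p}." for p in polarities
}
best = max(templates, key=lambda p: sequence_score(sentence, templates[p]))
print(best)
```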

3.4.1 Template creation

3.4.2 Inference

3.4.3 Training

4 Experiments

We use the SemEval-2014 restaurant review dataset (Rest14) (Pontiki et al., 2014a), a variant of Rest14 (Rest14-hard) (Xue and Li, 2018) and the multi-aspect multi-sentiment dataset (MAMS) (Jiang et al., 2019) for sentence-level sentiment, and the TripAdvisor (Wang et al., 2010) and BeerAdvocate (McAuley et al., 2012; Lei et al., 2016) datasets for document-level sentiment. Following previous work (Tay et al., 2018), a standard train/dev/test split is adopted; details are shown in Appendix A.

    We fine-tune pre-trained BERT and BART models on each task. For the different models, we choose the fine-tuning learning rate from {4e-5, 2e-5, 1e-5} and the batch size from {8, 16, 24}. The dropout probability is 0.1. The best model configuration is chosen based on the highest performance on the dev set. Setup details are given in Appendix A.
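As a rough sketch, this hyper-parameter selection amounts to a small grid search over learning rates and batch sizes, keeping the configuration with the highest dev-set score. The fine_tune_and_evaluate helper below is hypothetical and stands in for the actual fine-tuning loop.

```python
# Minimal sketch of dev-set based hyper-parameter selection (hypothetical helper).
from itertools import product

LEARNING_RATES = [4e-5, 2e-5, 1e-5]
BATCH_SIZES = [8, 16, 24]
DROPOUT = 0.1

def select_best_config(fine_tune_and_evaluate):
    """Try every (lr, batch size) pair and return the best one by dev accuracy."""
    best_acc, best_cfg = float("-inf"), None
    for lr, bs in product(LEARNING_RATES, BATCH_SIZES):
        dev_acc = fine_tune_and_evaluate(lr=lr, batch_size=bs, dropout=DROPOUT)
        if dev_acc > best_acc:
            best_acc, best_cfg = dev_acc, (lr, bs)
    return best_cfg, best_acc
```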

4.1 Baseline method

We compare our generative approach with classification and MLM baselines (Fig. 3) using the same encoder. In particular, BART generation (Fig. 3(c)) is compared with BART classification (Fig. 3(a)) and BART MLM (Fig. 3(b)), as well as with BERT classification and BERT MLM. Furthermore, our method is compared with other models in the literature as follows.

    For sentence-level ACSA, we also compare our method with the following state-of-the-art methods in the literature: (1) non-BERT models: GCAE (Xue and Li, 2018), AS-Capsules (Wang et al., 2019) and CapsNet (Jiang et al., 2019); (2) BERT-based (Devlin et al., 2019b) models: BERT-pair-QA-B (Sun et al., 2019), CapsNet-BERT (Jiang et al., 2019) and AC-MIMLLN-BERT (Li et al., 2020b).

    For document-level ACSA, we compare our method with the following methods: (1) non-BERT models: LSTM (Tang et al., 2015), HAN (Yang et al., 2016) and the machine comprehension model (MR) (Yin et al., 2017); (2) BERT-based (Devlin et al., 2019b) model: BERT classification.

    For ACD, we compare our method with the following methods: (1) non-BERT models: XRCE (Brun et al., 2014) and NRC-Canada (Kiritchenko et al., 2014); (2) BERT-based (Devlin et al., 2019b) models: BERT classification, BERT-pair-NLI-B (Sun et al., 2019) and CNE-net (Dai et al., 2020).

4.2 Development experiment

4.3 ACSA experiments

The results of sentence-level ACSA are shown in Table 3. First, BERT MLM and BART MLM outperform BERT classification and BART classification, respectively. In particular, BERT MLM provides a strong baseline that outperforms all non-BERT baselines and BERT classification. This shows that exploiting pre-training at the task level can achieve better results than at the representation level. In addition, the BART MLM and classification models perform better than the corresponding BERT models. Second, BART generation outperforms all baselines on all three datasets, suggesting that our model can better detect multiple sentiment polarities in a sentence for different aspect categories. Third, BART generation performs significantly better than BART MLM, with a 3.89% increase in accuracy on MAMS, demonstrating the effectiveness of the generative method. This shows the strength of BART pre-training for generating semantically relevant content, which is also reflected in BART's strong performance on abstractive summarization (Lewis et al., 2020). In contrast, MLM methods concatenate the input and output into a single sequence, so their correlation cannot be modeled as it is in encoder-decoder pre-training.

    The performance of our model on document-level ACSA is shown in Table 4. BERT classification and BART classification outperform the non-pre-trained baselines LSTM, HAN and MR, which shows the effectiveness of pre-training. BERT MLM and BART MLM surpass BERT classification and BART classification, respectively. Our BART generation model outperforms BART MLM by 1.15% and 0.70% on TripAdvisor and BeerAdvocate, respectively, suggesting that the generative approach can use BART more effectively for ACSA.

4.4 ACD experiment

The results of the Rest14 ACD subtask are shown in Table 5. Following Pontiki et al. (2014b), we use Micro-F1 for evaluation. Again, BART generation achieves better results than BART classification and BART MLM. Our model outperforms all baselines in terms of accuracy and F1 score. In particular, it obtains an accuracy score of over 95%, which demonstrates that our model can effectively exclude aspect categories not mentioned in the input.

    We also investigate performance on the MAMS dataset, in which every input sentence contains at least two distinct aspect categories with different sentiment polarities. Table 7 shows that BART generation outperforms all baselines, indicating that our model is better able to detect multiple aspect categories in a sentence.

4.5 Joint Model

The generative approach allows us to build a simple joint model by extending the first template in Table 1, using "<given_category> has sentiment polarity of none" as the template for aspect categories that are not present. The results on Rest14 and MAMS are shown in Table 6. We find that joint BART generation achieves better results on this task than pipelined BART generation. Joint BART generation outperforms all baselines in terms of precision, recall and F1 score, demonstrating the benefits of joint learning.
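A minimal sketch of this joint formulation follows, reusing the sequence_score helper from the earlier generation sketch: every pre-defined category is scored against all polarities plus "none", and categories whose best polarity is "none" are treated as not mentioned. The category list and template wording are illustrative assumptions.

```python
# Minimal sketch of joint ACD + ACSA via template ranking (assumed categories/templates).
CATEGORIES = ["food", "service", "price", "ambience", "menu"]
POLARITIES = ["positive", "negative", "neutral", "none"]

def joint_predict(sentence, sequence_score):
    """Return {category: polarity} for categories predicted as mentioned."""
    predictions = {}
    for category in CATEGORIES:
        scores = {
            p: sequence_score(
                sentence, f"The {category} category has a sentiment polarity of {p}."
            )
            for p in POLARITIES
        }
        best = max(scores, key=scores.get)
        if best != "none":          # "none" means the category is not mentioned
            predictions[category] = best
    return predictions
```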

4.6 Few-shot and zero-shot learning

We evaluate model performance on ACSA when only a small amount of labeled data is available for training, simulating low-resource scenarios by randomly sampling training instances from the full training set. In particular, we train with different numbers of instances, randomly sampling a fixed number of instances per class type (10, 20, 50, 100, 200 and 500 for Rest14 and MAMS). The results are shown in Figure 4, where the BERT classification, BART classification and BART MLM methods are also compared.
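The low-resource simulation described above boils down to stratified subsampling of the training set. Below is a minimal sketch under the assumption that each training instance is a (sentence, category, label) tuple; the actual data format is not specified here.

```python
# Minimal sketch of per-class few-shot subsampling (assumed data format).
import random
from collections import defaultdict

def sample_few_shot(train_set, k_per_class, seed=42):
    """Keep at most k_per_class randomly chosen instances for each gold label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for example in train_set:
        by_label[example[-1]].append(example)   # group by the gold label
    subset = []
    for label, examples in by_label.items():
        rng.shuffle(examples)
        subset.extend(examples[:k_per_class])   # fixed number per class
    rng.shuffle(subset)
    return subset
```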

    It can be seen that our model outperforms BERT classification, BART classification and BART MLM on all datasets, especially when the number of training instances is small. For example, with only 10 training instances, our model achieves an accuracy of 82.01% on Rest14, compared to 38.57% for BERT classification and 50.16% for BART classification. When the number of instances grows to 500, our model achieves 2.24% and 2.65% higher accuracy than BART MLM on Rest14 and MAMS, respectively. One possible reason is that our method makes more direct use of the sentiment knowledge in the pre-trained language model by keeping the original BART structure mentioned earlier, whereas classification methods can only transfer this knowledge indirectly.

    The results of our zero-shot learning experiments are shown in Table 8. In all cases, our method outperforms all baselines. In particular, the model trained on MAMS and tested on Rest14 performs better than the reverse zero-shot setting, which indicates that the MAMS dataset is more challenging.

5 Analysis 

5.1 Effect of category frequency

Aspect categories can be implicit and do not necessarily appear as terms in the given sentence. To explore the correlation between ACSA accuracy and the frequency with which a given category appears as a term, we divide the eight categories in the MAMS test set into four subsets according to this frequency. The category that never appears in its sentences (i.e., miscellaneous) is put into the zero-frequency subset, categories appearing in fewer than 15% of their sentences (e.g., ambience, staff) into the low-frequency subset, categories appearing in more than 30% of their sentences (e.g., menu, service) into the high-frequency subset, and the remaining categories (e.g., price, food, place) into the medium-frequency subset.
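A rough sketch of how such frequency buckets could be computed is given below: count how often each gold category string literally appears in its sentences and bucket categories by that rate. The data format and the exact cut-offs are assumptions for illustration.

```python
# Minimal sketch of bucketing categories by how often they appear as literal terms.
from collections import Counter, defaultdict

def bucket_categories(test_set):
    """test_set: iterable of (sentence, category, label) tuples (assumed format)."""
    appears, total = Counter(), Counter()
    for sentence, category, _ in test_set:
        total[category] += 1
        if category.lower() in sentence.lower():
            appears[category] += 1
    buckets = defaultdict(list)
    for category in total:
        rate = appears[category] / total[category]
        if rate == 0.0:
            buckets["zero"].append(category)
        elif rate < 0.15:
            buckets["low"].append(category)
        elif rate > 0.30:
            buckets["high"].append(category)
        else:
            buckets["medium"].append(category)
    return buckets
```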

    Figure 5 shows the accuracy of BART classification and our model with respect to category frequency. As the frequency decreases, the relative accuracy gap between the two models increases. At zero frequency, our method outperforms BART classification by 8.03% in accuracy. This suggests that our method is more robust when summarizing sentiment polarity for abstract or rare categories. Even when no explicit category term appears in the sentence, the generative method can infer the implicit opinion on the category from the context of the whole sentence.

5.2 Case studies

Figure 6 shows typical examples from the test set on which the BART classification model fails. In sentence (a), the given category miscellaneous does not appear as a term in the sentence; our method can synthesize the sentiment of different aspects to obtain the correct polarity. In sentence (b), "the value on the kids menu is good", "good" modifies "value" rather than the given category menu; our approach gives the correct polarity without being influenced by the surrounding sentiment. The last instance (c) requires conditional inference, which is difficult for BART classification. In contrast, BART generation gives the correct label by correctly identifying the negative implication of the conditional clause ("if there was ... will be more attractive"). This may be because our method uses pre-trained knowledge to infer the correlation between the input and output sequences, which the BART classification model fails to achieve due to its indirect use of BART through an additional classification network.

6 Summary

We investigate a generative method for aspect category detection (ACD) and aspect category sentiment analysis (ACSA) that better leverages BART's ability to summarize the semantics of the input, without introducing additional model parameters. Experiments show that the proposed method achieves better performance than the baseline models on both sentence-level and document-level aspect sentiment analysis. Our method is also more robust in zero-shot and few-shot settings than traditional sentiment classification methods.

Origin blog.csdn.net/Starinfo/article/details/130692741