Discrete Opinion Tree Induction for Aspect-based Sentiment Analysis: paper reading notes (ACL 2022)

Table of contents

Title: Discrete Opinion Tree Induction for Aspect-based Sentiment Analysis

Paper link: https://aclanthology.org/2022.acl-long.145.pdf

Abstract

1 Introduction

2 Model

2.1 Classifier based on opinion tree

2.2 Training of Sentiment Classifier

2.3 Training Tree Inducer

3 Variational inference perspective

3.1 Variational latent tree model

3.2 Relation to our model

4 Experiments

4.1 Baselines

4.2 Development Results

4.3 Main results

4.4 Case studies

4.5 Analysis 

5 Related work

6 Conclusion

7 Acknowledgements


Title: Discrete Opinion Tree Induction for Aspect-based Sentiment Analysis

Paper link: https://aclanthology.org/2022.acl-long.145.pdf

Abstract

Dependency trees and graph neural networks are widely used in aspect-based sentiment classification. Although these methods are effective, they rely on external parsers, which may be unavailable for low-resource languages or perform poorly in low-resource domains. In addition, dependency trees are not optimized for aspect-based sentiment classification. In this paper, we propose an aspect-specific and language-agnostic discrete latent opinion tree model as an alternative structure to explicit dependency trees. To ease the learning of complex structured latent variables, we build a connection between aspect-to-context attention scores and syntactic distances, inducing trees from the attention scores. Results on six English benchmark datasets, one Chinese dataset and one Korean dataset show that our model achieves competitive performance and interpretability.

1 Introduction

Aspect-based sentiment classification (ABSA) is the task of identifying the sentiment polarity of a specific aspect category or aspect term in a given sentence (Jiang et al., 2011; Dong et al., 2014; Wang et al., 2016; Tang et al., 2016; Li et al., 2018; Du et al., 2019; Sun et al., 2019a; Seoh et al., 2021; Xiao et al., 2021). Unlike document-level sentiment analysis, different aspect terms in the same sentence can carry different sentiment polarities. For example, given a restaurant review "decoration is nice, but service may be spotty", the corresponding sentiment labels for "decoration" and "service" are positive and negative, respectively.

    How to find the corresponding opinion context for each aspect term is the key challenge in ABSA. To this end, recent efforts leverage dependency trees (Zhang et al., 2019; Sun et al., 2019a; Wang et al., 2020). Syntactic dependencies have been shown to better capture the interaction between aspects and their opinion contexts (Huang et al., 2020; Tang et al., 2020). For example, in Figure 1(a), using the syntactic relations, we can find that the opinion words corresponding to "decoration" and "service" are "nice" and "spotty", respectively.

    Although dependency syntax is effective, it has the following limitations. First, dependency parsers may not be available for low-resource languages, or may perform worse in low-resource domains (Duong et al., 2015; Rotman and Reichart, 2019; Vania et al., 2019; Kurniawan et al., 2021). Second, dependency trees are not optimized for aspect-based sentiment classification. Previous studies transformed dependency trees into aspect-specific forms with handcrafted rules (Dong et al., 2014; Nguyen and Shirai, 2015; Wang et al., 2020) to improve aspect sentiment classification performance. However, such transformations mainly adjust the node hierarchy of the tree, without optimizing the dependency structure itself for ABSA.

    In this paper, we explore a simple way to automatically induce a discrete opinion tree structure for each aspect. Two examples are shown in Figure 1. In particular, given an aspect and a sentence, our algorithm recursively induces a tree structure based on a set of attention scores, which are computed using neural layers on top of the sentence's BERT representation (Devlin et al., 2019). The algorithm starts from the root node, builds the tree by selecting one child node on each side of the current node, and recursively continues the partitioning process to obtain a binarized, lexicalized tree structure. The resulting tree serves as the input structure and is fed into a graph convolutional network (Kipf and Welling, 2017) for learning a sentiment classifier. We use policy-based reinforcement learning (Williams, 1992) to train the tree inducer. One challenge is that the generated policies are easily memorized by the BERT encoder, which leads to under-exploration (Shi et al., 2019). To alleviate this problem, we propose a set of regularizers to guide BERT-based policy generation.

    While our approach is conceptually simple and straightforward at the inference stage, we show that it has deep theoretical underpinnings. In particular, an attention-based tree inducer trained with a policy network can be viewed as a simplified version of standard latent-tree VAE models (Kingma and Welling, 2014; Yin et al., 2018), where the KL divergence between the prior and posterior tree probabilities is approximated by an attention-based syntactic distance metric (Shen et al., 2018a).

    Experiments on six English benchmarks, a Chinese hotel review dataset and a Korean car review dataset demonstrate the effectiveness of our proposed model. The discrete structure also facilitates the interpretation of classification results. Furthermore, our algorithm is faster, smaller, and more accurate than a fully variational latent tree model. To our knowledge, we are the first to use BERT to learn an aspect-specific discrete opinion tree structure. The code is publicly available at https://github.com/CCSoleil/dotGCN.

2 Model

Figure 2 shows the architecture of our proposed model. Given an input sentence x and a specific aspect term a, we derive an opinion tree t from a recognition network Q_{\phi}(t\mid x,a), where \phi is the set of network parameters. We apply a multi-layer graph convolutional network (GCN) on the BERT output vectors to model structural relationships in the opinion tree and extract aspect-specific features. Finally, we use an attention-based classifier to learn the sentiment distribution P_{\theta}(y\mid x,a,t), where \theta is the parameter set.

    To train the model, reinforcement learning is used for Q_{\phi}(t\mid x,a) (Section 2.3), and standard backpropagation is used for P_{\theta}(y\mid x,a,t) (Section 2.2).

2.1 Classifier based on opinion tree

Opinion tree: Denote the input sentence as x = w_1 w_2 \dots w_n and the aspect as a = w_b w_{b+1} \dots w_e, where [b, e] is a contiguous span within [1, n] and w_i is the i-th word. As shown in Figure 1, the opinion tree of a is a binary tree. Each node contains a word span and at most two child nodes. a is placed at the root node. Except for the root node, each node contains exactly one word. An in-order traversal of t restores the original sentence. Ideally, nodes near the root should contain the corresponding opinion words, such as "nice" for "decoration" and "spotty" for "service".

    Algorithm 1 shows the process of using a node score function v to construct an opinion tree t that satisfies the above conditions for a, where v_i denotes the score of the i-th word, i.e., how much it contributes to the sentiment polarity y of a, and v_{i}^{j} denotes the scores of the words in the span [i, j]. We first take the aspect span [b, e] as the root node, and then build its left and right children from the spans [1, b-1] and [e+1, n], respectively. To build the left and right subtrees, we first select the element with the highest score in the span as the root of the subtree, and then recursively call build_tree on the resulting span partitions.
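To make the recursion concrete, here is a minimal Python sketch of Algorithm 1 as described above. It is an illustration, not the authors' released code; the Node class and the 1-based indexing are choices made for readability.

```python
# Minimal sketch of Algorithm 1 as described above (illustrative, not the authors' code).
# Assumptions: `scores` is 1-indexed (scores[0] is a dummy entry), the sentence has n
# words, and the aspect occupies the word span [b, e].

class Node:
    def __init__(self, span, left=None, right=None):
        self.span = span      # (i, j): word indices covered by this node
        self.left = left
        self.right = right

def build_tree(scores, i, j):
    """Recursively build a subtree over the span [i, j]; empty spans yield None."""
    if i > j:
        return None
    # Pick the highest-scoring word in the span as the subtree root (argmax).
    k = max(range(i, j + 1), key=lambda t: scores[t])
    return Node((k, k),
                left=build_tree(scores, i, k - 1),
                right=build_tree(scores, k + 1, j))

def induce_opinion_tree(scores, n, b, e):
    """The aspect span [b, e] is the root; its children cover [1, b-1] and [e+1, n]."""
    return Node((b, e),
                left=build_tree(scores, 1, b - 1),
                right=build_tree(scores, e + 1, n))
```

As noted in the Backpropagation paragraph of Section 2.2, during training the argmax above is replaced by random sampling over the span scores so that more structures are explored.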

Calculating v: Following Song et al. (2019), we feed the input "[CLS] w1 w2 … wn [SEP] wb wb+1 … we" into BERT to obtain the aspect-specific sentence representation H, and then compute a set of aspect-oriented attention scores over the words (Equation 1).

where u_{p}, W_{p} and W_{a,p} are model parameters, \sigma is the ReLU activation function, and h_{a} is the aspect representation obtained by sum pooling over H_{b}, H_{b+1}, \dots, H_{e}. The parameter set \phi of Q_{\phi}(t\mid x,a) contains u_{p}, W_{p}, W_{a,p} and the parameters of BERT.
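Equation 1 itself is not reproduced in these notes. A plausible form, reconstructed from the parameter names above and to be read as an assumption rather than the paper's exact formula, is:

```latex
% Plausible reconstruction of Equation 1 (the exact form is an assumption)
s^{p}_{i} = u_{p}^{\top}\, \sigma\!\left( W_{p} H_{i} + W_{a,p}\, h_{a} \right),
\qquad
h_{a} = \sum_{k=b}^{e} H_{k}
```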

Graph representation: Given t and H, we use GCNs to learn a representation vector for each word. We transform t into an undirected graph G. Specifically, we regard each word as a node in G, consider four types of edges, and build the adjacency matrix A \in R^{n\times n} of G. First, we include a self loop for each word. Second, we fully connect the words within the aspect term. Third, for each child node w_j of the root node, we connect w_j to every word in a. Finally, we add edges between the single-word nodes of t other than the root node. Formally, A is given by Equation 2, which guarantees that A is symmetric.
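As a concrete illustration of the four edge types, the following sketch builds A for a given tree. The function name, 0-based indexing and the edge-list inputs are assumptions made for illustration; this is not the authors' implementation of Equation 2.

```python
import numpy as np

def build_adjacency(n, b, e, root_children, tree_edges):
    """Illustrative construction of A (cf. Equation 2); 0-based indices,
    aspect span [b, e] inclusive. `root_children` are words directly attached
    to the root node; `tree_edges` are (parent, child) pairs between the
    single-word nodes of t other than the root."""
    A = np.zeros((n, n))
    np.fill_diagonal(A, 1.0)                      # 1) self loops
    for i in range(b, e + 1):                     # 2) fully connect aspect words
        for j in range(b, e + 1):
            A[i, j] = 1.0
    for c in root_children:                       # 3) root children <-> aspect words
        for j in range(b, e + 1):
            A[c, j] = A[j, c] = 1.0
    for i, j in tree_edges:                       # 4) remaining tree edges
        A[i, j] = A[j, i] = 1.0
    return A                                      # symmetric by construction
```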

     Then we use GCNs to capture the structured relationships between word pairs. Given the adjacency matrix A between nodes and the (l-1)-th layer representation matrix H^{l-1} \in R^{n\times d}, a GCN computes the l-th layer representation H^{l} (Equation 3).

where f is the activation function (i.e., ReLU), and W^{l} \in R^{d\times d} and b^{l} \in R^{d} are the model parameters of the l-th layer. The input to the first GCN layer, H^{0}, is the representation H given by the sentence encoder.
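Equation 3 is not shown in these notes; with the symbols defined above, a standard GCN layer consistent with the description would be the following (the row normalization of A is an assumption, and the paper may use A directly):

```latex
% Plausible form of Equation 3; \tilde{A} is a row-normalized adjacency (an assumption)
H^{l} = f\!\left( \tilde{A}\, H^{l-1} W^{l} + b^{l} \right),
\qquad
\tilde{A}_{ij} = A_{ij} \Big/ \sum\nolimits_{k} A_{ik}
```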

Target aspect representation: We use the representation vector of the "[CLS]" token (H_{cls}^{0}) and the aspect vectors (H_{b}^{N}, H_{b+1}^{N}, \dots, H_{e}^{N}) given by the last GCN layer as the aspect-oriented representation for querying the input sentence representation H^{0}. The final aspect-specific feature c is given by an attention layer over the input sentence representation (Equation 4).

where \omega_{t} is the attention score between the aspect and the t-th word, \alpha is the normalized score, and c is the final feature.

The output layer uses c to compute the sentiment polarity scores, and the final sentiment distribution is given by a softmax classifier (Equation 5).

where W_{c} and b_{c} are model parameters, and p is the predicted distribution.
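Equations 4 and 5 are not reproduced here. Based on the description above, a plausible reading is that c is an attention-weighted sum of the encoder representations and p is a softmax over a linear transformation of c (a reconstruction; the exact computation of the raw scores \omega is an assumption):

```latex
% Plausible reconstruction of Equations 4 and 5
\alpha = \mathrm{softmax}(\omega),
\qquad
c = \sum_{t=1}^{n} \alpha_{t}\, H^{0}_{t},
\qquad
p = \mathrm{softmax}\!\left( W_{c}\, c + b_{c} \right)
```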

2.2 Training of Sentiment Classifier

The sentiment classifier is trained with a cross-entropy loss, maximizing the log-likelihood of the training samples. Formally, the goal is to minimize Equation 6.

where |D| is the size of the training data, y_a is the sentiment label of a in the i-th example x_{i}, and p_{i,y_a} is the classification probability for a, which is given by Equation 5. The parameter set \theta of P_{\theta}(y\mid x,a,t) includes the GCN blocks and the classifier parameters in Equation 5.
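Equation 6 is not shown in these notes; from the description above it is the usual averaged cross-entropy (a reconstruction):

```latex
% Plausible form of Equation 6
L_{\mathrm{sup}} = -\frac{1}{|D|} \sum_{i=1}^{|D|} \log p_{i,\, y_{a}}
```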

Tree distance regularization loss: Following Pouran Ben Veyseh et al. (2020), we introduce a syntactic constraint to regularize the attention weights. Ideally, words closer to the root node should receive higher attention weights. Given an opinion tree t, we compute the tree distance d_i for each word i as the length of the shortest path to the root. Given the distances and the attention scores \alpha, we use a KL divergence to encourage the aspect term to attend to context at shorter tree distances (Equation 7).

where td_i is the normalized tree distance, and KL is the Kullback-Leibler divergence.
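Equation 7 is not reproduced here. A plausible form, in which the softmax normalization of the negative tree distances and the direction of the KL divergence are both assumptions, is:

```latex
% Plausible reconstruction of Equation 7
td_{i} = \frac{\exp(-d_{i})}{\sum_{k} \exp(-d_{k})},
\qquad
L_{td} = \mathrm{KL}\!\left( td \,\middle\|\, \alpha \right)
```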

Backpropagation: During training, we replace the argmax operator in Algorithm 1 with random sampling to explore more discrete structures. Since the tree sampling process is a discrete decision process, it is not differentiable. Gradients can be propagated from Lsup in Equation 6 to t and \theta, but not further from t to \phi. Therefore, we use the policy gradient given by REINFORCE (Williams, 1992) to optimize the policy network parameters \phi (Section 2.3).

2.3 Training Tree Inducer

Assuming that the reward for a latent tree t is R_t, the goal of reinforcement learning is to minimize the negative expected reward (Equation 8).

For each t, we use the log sentiment likelihood \log P_{\theta}(y\mid x,a,t) as R_t. Using REINFORCE, the gradient of L_{rl} with respect to \phi is given by Equation 9.

Here \log Q_{\phi}(t|x, a) is the log-likelihood of the generated sample t, which can be decomposed into the sum of the log-likelihoods of the individual tree-building steps. According to Algorithm 1, each call to build_tree(v, i, j) involves choosing an action k from the span [i, j] given the scores v_{i}^{j}. The action space contains j-i+1 actions. The log-likelihood of this action is given by Equation 10.

In particular, we use s^{p} from Equation 1 as the scoring function v. Since enumerating all possible trees to compute the expectation in Equation 9 is intractable, we use a Monte Carlo method (Rubinstein and Kroese, 2016) to approximate the training objective with M samples (Equation 11).
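Equations 8 to 11 are not reproduced in these notes. Consistent with the description above, the objective, the REINFORCE gradient, the per-step action likelihood and the Monte Carlo estimate would take the standard forms below (a reconstruction):

```latex
% Plausible forms of Equations 8 and 9
L_{rl} = -\,\mathbb{E}_{t \sim Q_{\phi}(t \mid x, a)}\!\left[ R_{t} \right],
\qquad
\nabla_{\phi} L_{rl} = -\,\mathbb{E}_{t \sim Q_{\phi}}\!\left[ R_{t}\, \nabla_{\phi} \log Q_{\phi}(t \mid x, a) \right]

% Per-step action likelihood (Equation 10) and Monte Carlo estimate (Equation 11)
\log Q_{\phi}\!\left( k \mid [i, j] \right) = \log \frac{\exp(v_{k})}{\sum_{k'=i}^{j} \exp(v_{k'})},
\qquad
\nabla_{\phi} L_{rl} \approx -\frac{1}{M} \sum_{m=1}^{M} R_{t_{m}}\, \nabla_{\phi} \log Q_{\phi}(t_{m} \mid x, a)
```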

Attention consistency loss: Instead of relying solely on the reinforcement learning gradients to train the policy network, we also introduce an attention consistency loss to supervise the policy network directly. Note that there are two sets of attention scores in our model. The first is the attention score s^{p} defined in Equation 1, which is trained by the reinforcement learning algorithm. The second is the attention score \alpha defined in Equation 4, which extracts useful contextual features for the aspect-specific classifier; \alpha is trained by end-to-end backpropagation. Intuitively, the word that receives the largest attention score should be an effective opinion word for the target aspect, so the policy network should place it closer to the root node. To this end, we impose a consistency regularization between the two attention scores, so that the polarity-guided attention \alpha can directly supervise the scoring policy s^{p}. Formally, Latt is given by Equation 12.

where detach is a stop gradient operator.
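Equation 12 is not shown in these notes. Given the discussion here and in Section 3.2, a plausible form is a KL divergence between the detached classifier attention \alpha and the normalized policy scores s^{p} (the normalization and the direction of the KL are assumptions):

```latex
% Plausible reconstruction of Equation 12
L_{att} = \mathrm{KL}\!\left( \mathrm{detach}(\alpha) \,\middle\|\, \mathrm{softmax}(s^{p}) \right)
```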

Overall loss: The final overall loss is given by Equation 13.

where Lsup is the supervised classification loss, Lrl is the reinforcement learning loss, Latt is the new attention consistency loss, and Ltd is the loss that guides the attention score distribution through tree constraints. λrl, λatt and λtd are hyperparameters.
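Written out from the description above, Equation 13 is:

```latex
% Equation 13, as described in the text
L = L_{\mathrm{sup}} + \lambda_{rl}\, L_{rl} + \lambda_{att}\, L_{att} + \lambda_{td}\, L_{td}
```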

3 Variational inference perspective

Interestingly, Lsup, Lrl, and Latt can be unified into one theoretical framework through variational inference (Kingma and Welling, 2014). In this section we show that our method can be viewed as a simplified yet empirically stronger variant of latent tree VAE models.

3.1 Variational latent tree model

To model P_{\theta}(y|x, a), we introduce a latent discrete structured variable t. Formally, the training objective is to minimize the negative log-likelihood (Equation 14).
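Equation 14 is not reproduced here; marginalizing over the latent tree as described, it reads (a reconstruction):

```latex
% Plausible form of Equation 14
-\log P_{\theta}(y \mid x, a) = -\log \sum_{t} P_{\theta}(t \mid x, a)\, P_{\theta}(y \mid x, a, t)
```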

Equation 14 requires summing over all possible trees t inside the logarithm, and the number of trees is exponential. Equation 14 can be approximated by an evidence lower bound (ELBO) with variational parameters \phi (Equation 15; Kingma and Welling, 2014; Yin et al., 2018).

where P_{\theta}(t|x, a) is the prior distribution for generating latent trees, q_{\phi}(t|x, y, a) is the corresponding posterior distribution, and \log P_{\theta}(y|x, a, t) is the log-likelihood assuming the latent tree t is known. E_{q_{\phi}(t|x,y,a)}[\log P_{\theta}(y|x, a, t)] is the expected log-likelihood over all latent trees under q_{\phi}(t|x,y,a). The KL term acts as a regularizer, forcing the prior and posterior distributions to match. During training, trees are induced using q_{\phi}(t|x, y, a); at inference time, P_{\theta}(t|x, a) is used, since y is unknown.
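The ELBO referred to above as Equation 15 has the standard form below, reconstructed from the terms defined in this paragraph:

```latex
% Plausible form of Equation 15 (negative ELBO)
-\log P_{\theta}(y \mid x, a)
  \;\le\;
  -\,\mathbb{E}_{q_{\phi}(t \mid x, y, a)}\!\left[ \log P_{\theta}(y \mid x, a, t) \right]
  + \mathrm{KL}\!\left( q_{\phi}(t \mid x, y, a) \,\middle\|\, P_{\theta}(t \mid x, a) \right)
```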

In practice, the behavior of the KL term can be controlled with a scaling hyperparameter β (Bowman et al., 2016b), yielding Equation 16.

The first term is the expectation term, and the second is the KL term. Equation 16 is the standard VAE model for the ABSA task, which has not been discussed in the literature. It can be trained using tree entropy (Kim et al., 2019b) and neural mutual information estimation (Fang et al., 2019). However, both methods are slow because they need to consider a large number of tree samples. To model q_{\phi}(t|x, y, a), we instead compute the posterior score function s^{q} through an MLP layer similar to Equation 1.

where u_{q}, W_{q} and W_{a,q} are parameters, and H' and h_{a}^{'} are the posterior sentence and aspect representations for a given y, respectively. To ensure that y guides the encoder, we feed the input sequence together with y into BERT, using "[CLS] w1 w2 … wn [SEP] wb wb+1 … we y" to obtain H'.

3.2 Relation to our model

Our method can be seen as a novel simplification of the above model, which can be shown by relating the expectation and KL terms defined in Equation 16 to the attention scores in Equation 1 and Equation 4, respectively. In particular, we treat the aspect-to-context attention scores as a special kind of tree distance for t, and delegate the probability distribution over structured tree samples to a set of attention scores. Intuitively, if the attention scores are similar, the resulting trees should be highly similar.

Approximating the expectation term: The gradient of the first (expectation) term with respect to \phi is given by Equation 18.

Assuming that the posterior q_{\phi}(t|x,y,a) is approximated by Q_{\phi}(t|x, a) given by the recognition network, Equation 18 is equivalent to Lrl in Equation 11.

Approximating the KL term: When β = λatt, the KL term is similar to Latt in Equation 12.

First, we delegate the probability distribution over tree samples to a set of attention scores. In particular, we use s^{p} and s^{q} as proxies for p_{\theta}(t|x, a) and q_{\phi}(t|x, y, a), respectively. This is equivalent to feeding the posterior scores s^{q} and the prior scores s^{p} to Algorithm 1 to derive the corresponding trees during training. Second, since both s^{q} and the attention score \alpha in Equation 4 are directly supervised by the output label y, we can safely assume s^{q} ≈ \alpha. Then the KL term KL(s^{q}, s^{p}) in Equation 16 becomes KL(\alpha, s^{p}), the attention-based regularization loss defined in Equation 12.

4 Experiments

We conduct experiments on eight aspect-based sentiment analysis benchmarks, including six English datasets, one Chinese dataset and one Korean dataset. See Appendix A.3 for statistics. We use Stanza (Qi et al., 2020) as the external parser to generate dependency parses for comparison with dependency tree based models, and report accuracy (Acc.) and macro F1 (F1) scores for each model. See Appendix A.1 for more details.

MAMS: Jiang et al. (2019) provide a recent challenging dataset with 4,297 sentences and 11,186 aspects. We use it as the main dataset because it is a large-scale multi-aspect dataset with more aspects per sentence than other datasets. MAMS-small is a smaller version of MAMS.

Chinese hotel review dataset: Liu et al. (2020) provide a manually annotated multi-target sentiment analysis dataset with 6,339 targets and 2,071 reviews.

Korean automotive review dataset: Hyun et al. (2020) provide a dataset with 30,032 aspects from Korean automotive reviews.

SemEval datasets: We use five SemEval datasets, including the Twitter posts of Dong et al. (2014) (Twitter), laptop reviews (Laptop) provided by Pontiki et al. (2014), and restaurant reviews from SemEval 2014 Task 4 (Rest14; Pontiki et al., 2014), SemEval 2015 Task 12 (Rest15; Pontiki et al., 2015) and SemEval 2016 Task 5 (Rest16; Pontiki et al., 2016). These datasets are preprocessed following Tang et al. (2016) and Zhang et al. (2019).

4.1 Baselines

We denote our model as dotGCN (Discrete Opinion Tree GCN), and compare it with BERT-based models, including models that do not use trees and models based on dependency trees. In addition, the variational inference baseline (Section 3.1) is denoted as viGCN. The baselines are: (1) BERT-SPC, a simple baseline from Jiang et al. (2019) that fine-tunes the "[CLS]" vector of BERT; (2) AEN, the attentional encoder of Song et al. (2019) using BERT; (3) CapsNet, which combines capsule networks with BERT (Jiang et al., 2019); (4) Hard-Span, which uses RL to determine aspect-specific opinion spans (Hu et al., 2019); (5) depGCN, which applies aspect-specific GCNs on dependency trees (Zhang et al., 2019); (6) RGAT, which uses graph attention networks on aspect-centric dependency trees to incorporate dependency edge type information (Wang et al., 2020); (7) SAGAT, which uses a graph attention network and BERT to explore syntactic and semantic information (Huang et al., 2020); (8) DGEDT, which jointly considers the BERT output and dependency tree based representations via a bidirectional GCN (Tang et al., 2020); (9) kumaGCN, which combines dependency trees with latent graphs induced by self-attention networks (Chen et al., 2020).

4.2 Development Results

We conduct development experiments on MAMS because it is the largest dataset and contains more challenging examples than the other datasets. We implement three baselines: BERT-SPC, depGCN and kumaGCN. For a fair comparison, we also combine depGCN and kumaGCN with the syntax regularization loss in Equation 7, computing the syntactic distance with respect to the aspect term on the input dependency tree.

    Table 1 shows the results on the MAMS validation set. BERT-SPC achieves an accuracy of 84.08 and an F1 of 83.52. Surprisingly, the dependency tree based models cannot outperform BERT-SPC, which demonstrates the limitation of using cross-domain dependency parsers for this task. kumaGCN outperforms depGCN because it is able to incorporate implicit latent graphs. Adding the syntax regularization loss generally improves the performance of syntax-based models. In particular, kumaGCN+Ltd is on par with BERT-SPC.

    viGCN outperforms kumaGCN+Ltd and depGCN+Ltd, showing the potential of structured latent tree models. Our dotGCN model achieves an accuracy of 84.53 and an F1 of 83.97, outperforming all baselines, which empirically shows that induced discrete opinion trees are promising for this task. Compared with viGCN, our model gives better scores. Furthermore, our model is nearly 1.8 times faster to train (0.66 h/epoch vs. 1.25 h/epoch). dotGCN does not need to compute the true posterior distribution over structured tree samples, which greatly reduces computational overhead.

Ablation experiments: Table 1 also shows ablation studies on the MAMS validation set, removing each of the three proposed loss terms, namely Ltd, Lrl and Latt, during training. We observe that removing any of them degrades model performance. Removing the tree distance regularization loss slightly affects performance. Without the attention consistency loss Latt, the model falls behind BERT-SPC, which shows the importance of the proposed attention consistency regularization. Removing the RL loss leads to the largest performance drop among the three settings (Acc: 84.53 → 83.48). This shows that the reinforcement learning component plays a central role in the whole model.

4.3 Main results

MAMS: Table 2 shows the results of dotGCN and the baselines of Jiang et al. (2019) on the MAMS test set. We implement BERT-SPC, denoted as BERT-SPC*, which outperforms the BERT-SPC model of Jiang et al. (2019). Compared with the baselines that do not use dependency trees (BERT-SPC, CapsNet, CapsNet-dr and BERT-SPC*), dotGCN achieves significantly better results (p < 0.01). For comparison with dependency tree based models, we also implement depGCN+L_{td}^{*} and kumaGCN+L_{td}^{*}. depGCN+L_{td}^{*} achieves an accuracy of 84.36 and an F1 of 83.88 on the MAMS test set. kumaGCN+L_{td}^{*} gives similar results, with an accuracy of 84.37 and an F1 of 83.83. Our dotGCN outperforms all baselines, with an accuracy of 84.95 and an F1 of 84.44. In terms of average accuracy and F1 score on MAMS and MAMS-small, dotGCN significantly outperforms depGCN and kumaGCN (p < 0.05). The results show that induced aspect-specific discrete opinion trees hold promise for multi-aspect sentiment tasks.

Multilingual: The results on the Chinese hotel review dataset are shown in Table 2. dotGCN outperforms the baseline BERT-SPC* by 0.72 accuracy points and 0.61 F1 points. The results show that our model generalizes across languages without relying on language-specific parsers. On the Korean dataset, we obtain a 5.20 accuracy and 11.61 F1 improvement over LCF-BERT (Zeng et al., 2019), the best BERT-based model. These results demonstrate that our model generalizes well to multiple languages and can potentially benefit low-resource languages for this task.

SemEval: Table 3 shows the results of our model on the SemEval datasets. First, tree-based graph neural network models generally outperform BERT-SPC. On the five relatively small datasets, our model is still competitive in terms of average F1 and accuracy scores, as shown in Table 3. In particular, our model outperforms depGCN and depGCN+L_{td}^{*} on four of the five datasets, which validates that induced discrete opinion trees can be a promising structured representation compared with automatically parsed dependency trees.

    We also compare our model with the span-based reinforcement learning model (Hard-Span; Hu et al. (2019)) on the laptop and restaurant datasets preprocessed by Tay et al. (2018). As shown in Table 4, on the laptop dataset our model outperforms Hard-Span by 2.55 accuracy points. On the restaurant dataset, our model achieves results comparable to Hard-Span. This suggests that opinion trees are better representations than opinion spans.

4.4 Case studies

Figure 3a and Figure 3b show the induced tree and the dependency parse for the aspect term "scallop", respectively. In the dependency tree, the opinion words "unique" and "tasty" are far from the aspect term (a distance of more than 10). In the induced tree of dotGCN, the opinion words "tasty" and "unique" are at depths 2 and 3 from the aspect word "scallop", respectively, which shows that dotGCN can potentially handle complex interactions between aspects and their opinion contexts. Furthermore, the dotGCN-induced tree is binarized, and the root node can contain multiple words, as shown in Figure 4a.

Figures 4a and 4b show the induced trees for two aspect terms with different emotional polarities. For "creme brulee", the policy network gives high weight to both "delicious" and "savory". Interestingly, it gives higher weight to "delicious" than to "savory", even though "savory" is closer to its aspect term than "delicious". For "appetizer", the word "interesting" gets a higher attention score than the other two sentiment words. These results demonstrate that dotGCN is able to distinguish different sentiment contexts of different aspect terms in the same sentence.

4.5 Analysis 

Distance between aspect terms and opinion words: Figure 5 shows the distance between aspect terms and opinion words. We use the annotated opinion words for Rest16 provided by Fan et al. (2019) to compare our induced trees with dependency trees. Distances computed on the original word sequence are also included. We observe that the distribution of distances over the sequence is relatively flat compared with the tree structures. In both tree structures, nearly 90% of opinion words are within a depth of 3 from the aspect words. The distance distribution of our induced trees is similar to that of the dependency trees, which empirically demonstrates the ability of induced discrete trees to capture the interactions between aspect terms and opinions. Using dependency trees as the gold standard, our tree inducer achieves an unlabeled attachment score (UAS) of 35.4%, showing that the induced trees are substantially different from dependency trees, although both can connect opinion words and aspect terms.

Low-frequency aspects: Table 5 shows the classification accuracy on the MAMS test set with respect to the frequency of aspect terms in the training data. For aspect terms that occur in the training corpus, both methods give similar results. However, for unseen aspects, dotGCN achieves better results than depGCN. This may be because parsing errors are more harmful for low-frequency aspects. dotGCN does not depend on an external parser, so this problem is avoided. The experiments show that the induced tree structure is more robust in capturing aspect-opinion interactions than depGCN.

5 Related work

Tree induction for ABSA: There has been much work on unsupervised discrete structure induction (Bowman et al., 2016a; Shen et al., 2018b; Kim et al., 2019b,a; Jin et al., 2019; Cao et al., 2020; Yang et al., 2021; Dai et al., 2021), which aims to obtain general constituency trees without explicit syntactic annotations or task-dependent supervision signals. We focus on learning task-specific tree structures for ABSA, where the trees are fully binarized and lexicalized. Choi et al. (2018) proposed the Gumbel Tree-LSTM for learning task-specific trees for semantic composition. Similarly, Maillard et al. (2019) proposed an unsupervised chart parser for jointly learning sentence embeddings and syntax. However, these works mainly focus on sentence-level tasks without considering aspect information.

Aspect-level sentiment classification: Much recent work has explored neural attention mechanisms for this task (Tang et al., 2016; Ma et al., 2017; Li et al., 2018; Liang et al., 2019). Among tree-based methods, Zhang et al. (2019) and Sun et al. (2019b) used GCNs over dependency trees for aspect-level sentiment analysis; Zhao et al. (2019) used GCNs to model fully connected graphs between aspect terms; Wang et al. (2020) used a relational graph attention network to incorporate dependency edge type information and construct aspect-specific graph structures; Barnes et al. (2021) attempted to directly predict dependency-based sentiment graphs; Tang et al. (2020) used a dual-transformer structure to enhance the dependency graph for this task. Our work is similar in that we also consider structural dependencies, but differs in that we rely on automatically induced tree structures rather than external parsers. Chen et al. (2020) proposed to induce aspect-specific latent graphs by sampling from a self-attention based Hard Kumaraswamy distribution (Bastings et al., 2019). However, to achieve competitive performance, their method still requires combining external dependency parse trees with the induced latent graphs.

    Sun et al. (2019a) and Xu et al. (2019) construct aspect-related auxiliary sentences as input to BERT (Devlin et al., 2019) to enhance the context encoder. Xu et al. (2019) propose BERT-based post-training to enhance domain-specific contextual representations for aspect-oriented sentiment analysis. Our work shares a similar feature extraction approach, but differs in that we focus on inducing latent trees for ABSA.

6 Conclusion

We propose a method for aspect-based sentiment analysis that treats aspect-to-context attention scores as syntactic distances in order to induce opinion trees. The attention scores are trained using RL and a novel attention-based regularization. Compared with dependency tree based models, our model achieves competitive performance while being independent of external parsers. We also provide a theoretical perspective on our approach through variational inference.

7 Acknowledgements


Source: blog.csdn.net/Starinfo/article/details/129691224