These are my own study notes. If you find any mistakes, please point them out.
Attribute-level sentiment analysis
- Introduction
- Dataset introduction
- Data loading and preprocessing (data_utils.py)
- Pre-trained model (skep)
- Model definition module (model.py)
- Training configuration (config.py)
- Model training (train.py)
- Model testing (test.py)
- Model prediction (predict.py)
- Running environment and dependency installation
- Dataset
- References
Introduction
Attribute-level sentiment analysis builds on text sentiment analysis by analyzing the sentiment expressed toward the specific attributes, or aspects, mentioned in a text. Traditional text sentiment analysis usually considers only the polarity of the text as a whole (positive, negative, or neutral). Attribute-level sentiment analysis instead identifies the sentiment toward each attribute or aspect, giving a more fine-grained picture of users' attitudes and emotions toward the various aspects of a product, service, or event.
Attribute-level sentiment analysis usually involves the following main steps:
- Attribute extraction: First, extract the words or phrases in the text that refer to specific attributes. These can be characteristics of a product (such as appearance, performance, or price), aspects of a service (such as customer service or logistics and delivery), or particular aspects of an event.
- Sentiment classification: For each extracted attribute, classify the sentiment the text expresses toward it, usually as positive, negative, or neutral. This step typically uses a text classification or sentiment classification model.
- Result summary: Combine the per-attribute classification results into an attribute-level sentiment analysis of the whole text. This shows users' attitudes toward each attribute more clearly and provides more detailed input for product improvement, service optimization, or public opinion monitoring.
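The result-summary step can be illustrated with a small sketch. It assumes the earlier steps have already produced (attribute, polarity) pairs; the function name summarize and the sample pairs are made up for illustration and are not part of the original project.

```python
from collections import Counter, defaultdict


def summarize(predictions):
    """Group (attribute, polarity) pairs into per-attribute polarity counts."""
    summary = defaultdict(Counter)
    for attribute, polarity in predictions:
        summary[attribute][polarity] += 1
    return {attribute: dict(counts) for attribute, counts in summary.items()}


# Hypothetical output of the extraction and classification steps.
predictions = [
    ("price", "positive"),
    ("price", "negative"),
    ("price", "positive"),
    ("service", "negative"),
]
result = summarize(predictions)
```

Counting per attribute keeps the mixed opinions on a single attribute visible instead of collapsing them into one overall label.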
Attribute-level sentiment analysis has important applications in product evaluation, social media public opinion analysis, consumer opinion mining and other fields. It can help companies understand user needs and feedback more comprehensively, thereby improving products and services in a targeted manner.
Dataset introduction
Training set: len(train_ds) = 800
Validation set: len(dev_ds) = 100
Test set: len(test_ds) = 100
Each sample has three columns:
The first column is the label
The second column is the attribute (aspect) and opinion text
The third column is the original text
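As a rough sketch of reading one sample in this three-column layout: the tab separator, the function name parse_line, and the sample line are all assumptions for illustration, since the notes do not state the file format.

```python
def parse_line(line, sep="\t"):
    """Split one raw line into label, aspect and text (three columns)."""
    label, aspect, text = line.rstrip("\n").split(sep, maxsplit=2)
    return {"label": label, "aspect": aspect, "text": text}


# Made-up sample line; the real data may use a different separator.
sample = parse_line("1\tbattery life\tThe battery life is great.\n")
```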
Data loading and preprocessing (data_utils.py)
This module does three things:
- Load datasets: Load the training, validation, and test sets from files.
- Data mapping: Use the tokenizer of the pre-trained model skep_ernie_1.0_large_ch to tokenize and encode the text, converting each sample into model input features.
- Construct DataLoader: Create batch samplers for the training, validation, and test data, and build the data loaders.
With batch_size = 4 (the number of samples in each batch), the training loader has len(train_loader) = len(train_ds) / batch_size = 800 / 4 = 200 batches.
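The batch arithmetic can be checked with a framework-free sketch. The helper names num_batches and iter_batches are hypothetical, and the drop_last flag only mirrors the usual data-loader option of discarding an incomplete final batch.

```python
import math


def num_batches(num_samples, batch_size, drop_last=False):
    """Number of batches a loader yields for a dataset of num_samples items."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


train_batches = num_batches(800, 4)  # training set size and batch size from the notes
dev_batches = num_batches(100, 4)
```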
Pre-trained model (skep)
The SKEP model is used as the base model.
SKEP (Sentiment Knowledge Enhanced Pre-training) is a pre-training approach enhanced with sentiment knowledge, from Baidu's natural language processing lab.
The SKEP checkpoint used here builds on the ERNIE (Enhanced Representation through kNowledge IntEgration) pre-trained language model.
The input has two parts: the attribute (aspect) being evaluated and the corresponding comment text.
The two parts are concatenated and passed to the SKEP model, which encodes the combined text into a semantic vector.
Sentiment classification is then performed on that semantic encoding vector.
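To make the concatenation concrete, here is a minimal sketch of how a two-part input is commonly laid out for BERT-style encoders. The [CLS] and [SEP] markers, the 0/1 segment ids, and the function build_pair_input are assumptions for illustration. In the real pipeline the SKEP tokenizer builds this input.

```python
def build_pair_input(aspect_tokens, text_tokens, cls="[CLS]", sep="[SEP]"):
    """Concatenate two token lists and mark which segment each token belongs to."""
    first = [cls] + aspect_tokens + [sep]   # segment 0: the attribute
    second = text_tokens + [sep]            # segment 1: the comment text
    tokens = first + second
    token_type_ids = [0] * len(first) + [1] * len(second)
    return tokens, token_type_ids


tokens, token_type_ids = build_pair_input(["price"], ["very", "cheap"])
```

The segment ids let the encoder tell the attribute apart from the comment text even though both arrive as one sequence.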
Model definition module (model.py)
This module defines a sequence classifier built on the SKEP model. It uses the pre-trained SKEP model as the base and adds a linear classification layer on top.
- The initialization method (__init__) takes the SKEP model, the number of classes, and an optional dropout parameter.
- It first calls the parent class's __init__ method, then stores the number of classes and the SKEP model.
- If no dropout value is provided, it falls back to the default hidden-layer dropout probability.
- The forward method implements the forward pass. It takes the input token sequence (input_ids) and the optional token_type_ids, position_ids, and attention_mask parameters.
- During the forward pass, it calls the SKEP model to obtain the last layer's hidden states and the pooled output (pooled_output).
- Dropout is applied to the pooled output, and the linear classifier maps the result to an output space of size num_classes.
- Finally, the classifier output (logits) is returned.
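The dropout-plus-linear head described above can be sketched without any deep learning framework. This NumPy version is only an illustration of the computation, not the model.py code: the class name ClassificationHead, the weight initialization, and the stand-in pooled output are all assumptions.

```python
import numpy as np


class ClassificationHead:
    """Dropout followed by a linear layer, applied to a pooled encoder output."""

    def __init__(self, hidden_size, num_classes, dropout=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        self.dropout = dropout
        self.weight = self.rng.normal(0.0, 0.02, size=(hidden_size, num_classes))
        self.bias = np.zeros(num_classes)

    def forward(self, pooled_output, training=False):
        if training and self.dropout > 0:
            # Inverted dropout: zero some features and rescale the survivors.
            keep = self.rng.random(pooled_output.shape) >= self.dropout
            pooled_output = pooled_output * keep / (1.0 - self.dropout)
        return pooled_output @ self.weight + self.bias  # logits


head = ClassificationHead(hidden_size=8, num_classes=2)
pooled = np.ones((4, 8))  # stand-in for the encoder's pooled_output (batch of 4)
logits = head.forward(pooled)
```

Dropout is active only when training=True, which matches the usual split between training mode and evaluation mode.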
Training configuration (config.py)
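# Training configuration: set the training hyperparameters (learning rate, weight decay, maximum gradient norm, and so on) and create the optimizer, learning rate scheduler, and evaluation metric
import paddle
from paddlenlp.metrics import AccuracyAndF1
from paddlenlp.transformers import LinearDecayWithWarmup
# train_loader and model are assumed to be created elsewhere (data_utils.py and model.py)
num_epoch = 20 # number of training epochs, i.e. full passes over the training set
learning_rate = 4e-5 # initial learning rate, i.e. the step size of each parameter update
weight_decay = 0.01 # weight decay coefficient of the regularization term, used to reduce overfitting
warmup_proportion = 0.1 # proportion of training used for learning rate warmup, which gradually raises the learning rate early in training to improve stability
max_grad_norm = 1.0 # maximum norm for gradient clipping, which bounds gradient size and guards against exploding gradients
log_step = 20 # print a training log message every this many steps
eval_step = 100 # evaluate the model every this many steps
seed = 1000 # random seed, for reproducibility
checkpoint = "./checkpoint/" # path where the trained model parameters are saved
num_training_steps = len(train_loader) * num_epoch
# Learning rate scheduler: linear decay with warmup
lr_scheduler = LinearDecayWithWarmup(learning_rate=learning_rate, total_steps=num_training_steps, warmup=warmup_proportion)
# Collect the names of all parameters that are not bias or normalization (norm) parameters
decay_params = [p.name for n, p in model.named_parameters() if not any(nd in n for nd in ["bias", "norm"])]
# Gradient clipper: clips gradients by their global norm to bound the maximum norm and guard against exploding gradients
grad_clip = paddle.nn.ClipGradByGlobalNorm(max_grad_norm)
# Optimizer: AdamW
optimizer = paddle.optimizer.AdamW(learning_rate=lr_scheduler, parameters=model.parameters(), weight_decay=weight_decay, apply_decay_param_fun=lambda x: x in decay_params, grad_clip=grad_clip)
# Metric that computes accuracy and F1 together
metric = AccuracyAndF1()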
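To see what a linear-decay-with-warmup schedule does with these numbers, here is a plain-Python sketch of the usual formula: a linear ramp from 0 to the base rate during warmup, then a linear decay back to 0. The function linear_decay_with_warmup is written for illustration and is an assumption about the schedule's shape, not the library's source code.

```python
def linear_decay_with_warmup(step, base_lr, total_steps, warmup_proportion):
    """Learning rate at a given step: linear warmup, then linear decay to zero."""
    warmup_steps = int(warmup_proportion * total_steps)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


num_training_steps = 200 * 20  # len(train_loader) * num_epoch, using the values above
checkpoints = (0, 200, 400, 2200, 4000)
lrs = [linear_decay_with_warmup(s, 4e-5, num_training_steps, 0.1) for s in checkpoints]
```

With warmup_proportion = 0.1 the first 400 of the 4000 steps are warmup, so the rate peaks at step 400 and reaches zero at the final step.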
Model training (train.py)
This module trains the text classification model and evaluates its performance.
- Define the function train, which trains the model on the training set, evaluates it at regular intervals, and saves the best model.
- In train, switch the model to training mode, then iterate over every epoch and every batch.
- For each batch, compute the model's predictions and the loss.
- Backpropagate the loss to compute the gradients, then use the optimizer to update the model parameters.
- Define the function evaluate, which computes accuracy, precision, recall, and F1 on the validation set.
- Every log_step steps, print the training status: the current epoch, batch number, global_step, and loss.
- Every eval_step steps, or when the total number of training steps (num_training_steps) is reached, evaluate the model on the validation set:
  - Call evaluate to compute accuracy, precision, recall, and F1.
  - If the F1 value exceeds the best F1 recorded so far, update the best F1 and save the model parameters.
  - Print the evaluation results (accuracy, precision, recall, and F1).
- Finally, save the final model parameters.
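The control flow of the steps above (periodic logging, periodic evaluation, keeping the best F1) can be sketched independently of the model. In this sketch the callbacks train_step, evaluate, and save are hypothetical stand-ins for the real forward/backward pass, validation run, and checkpoint writing.

```python
def run_training(num_epoch, num_batches, log_step, eval_step,
                 train_step, evaluate, save):
    """Generic loop; train_step, evaluate and save are caller-supplied callbacks."""
    num_training_steps = num_epoch * num_batches
    global_step = 0
    best_f1 = 0.0
    for epoch in range(1, num_epoch + 1):
        for batch_id in range(1, num_batches + 1):
            loss = train_step(epoch, batch_id)  # forward, backward, optimizer update
            global_step += 1
            if global_step % log_step == 0:
                print(f"epoch {epoch} batch {batch_id} step {global_step} loss {loss:.4f}")
            if global_step % eval_step == 0 or global_step == num_training_steps:
                f1 = evaluate()
                if f1 > best_f1:  # keep only the best-scoring checkpoint
                    best_f1 = f1
                    save("best")
    save("final")
    return best_f1


# Tiny dry run with stand-in callbacks and made-up F1 scores.
fake_f1 = iter([0.5, 0.7, 0.6])
saved = []
best = run_training(
    num_epoch=2, num_batches=5, log_step=2, eval_step=4,
    train_step=lambda epoch, batch_id: 1.0 / (epoch * batch_id),
    evaluate=lambda: next(fake_f1),
    save=saved.append,
)
```

In the dry run, evaluation happens at steps 4, 8, and 10 (the last step), and only the first two evaluations improve on the best F1.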
Model testing (test.py)
This module loads the trained text classification model and evaluates it on the test set.
The specific steps are as follows:
- Import the required libraries and modules.
- Use the data_cls.load_dict function to load the label dictionary used during training, which maps labels to category ids and category ids back to labels.
- Use the paddle.load function to read the previously saved best model weights from disk.
- Use the SkepModel.from_pretrained function to load the pre-trained SKEP model.
- Create a new SkepForSequenceClassification model, which combines the SKEP encoder with a fully connected classification layer whose output dimension equals the number of categories.
- Use the best_model.load_dict method to load the saved best weights into this model.
- Call the evaluate function to evaluate the loaded model on the test set.
- Print the evaluation results, including accuracy, precision, recall, and F1.
Accuracy, precision, recall and F1 value
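For reference, the four metrics can be computed from labels and predictions as in the sketch below. It assumes a binary task with label 1 as the positive class. The function binary_metrics and the sample data are made up for illustration and are not the project's evaluate function.

```python
def binary_metrics(labels, preds, positive=1):
    """Return (accuracy, precision, recall, f1) for a binary classification task."""
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fp = sum(1 for y, p in pairs if y != positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, precision, recall, f1


# Made-up labels and predictions.
labels = [1, 1, 1, 0, 0, 1, 0, 1]
preds = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall, f1 = binary_metrics(labels, preds)
```

F1 is the harmonic mean of precision and recall, which is why the training step uses it rather than accuracy alone to pick the best checkpoint when classes are unbalanced.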
Model prediction (predict.py)
This module uses the loaded model to predict the sentiment class of a given text and prints the prediction.
The specific steps are as follows:
- Import the required libraries and modules, including paddle, SkepModel, SkepTokenizer, etc.
- Use the data_cls.load_dict function to load the previously saved label dictionary, which maps labels to category ids and category ids back to labels.
- Use the paddle.load function to read the previously saved best model weights from disk.
- Use the SkepModel.from_pretrained function to load the pre-trained SKEP model.
- Use the SkepTokenizer.from_pretrained function to load the tokenizer that matches the SKEP model.
- Create a new SkepForSequenceClassification model, which combines the SKEP encoder with a fully connected classification layer whose output dimension equals the number of categories.
- Use the best_model.load_dict method to load the saved best weights into this model.
- Define a predict function for input text. It first switches the model to evaluation mode, then encodes the input with the tokenizer, converts the encoded input to tensors, runs inference through the model, and prints the prediction.
- Define three example texts together with their corresponding text pairs.
- Call the predict function on each of the three examples and print the predictions.
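The last part of prediction, turning the model's logits into a label, can be sketched in plain Python. The id2label mapping and the logits below are hypothetical. In the real script the mapping comes from the loaded label dictionary and the logits from the model.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def decode_prediction(logits, id2label):
    """Pick the most probable class id and map it back to its label."""
    probs = softmax(logits)
    pred_id = max(range(len(probs)), key=probs.__getitem__)
    return id2label[pred_id], probs[pred_id]


id2label = {0: "negative", 1: "positive"}  # hypothetical label dictionary
label, confidence = decode_prediction([-1.2, 2.3], id2label)
```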
Running environment and dependency installation
(1) Environment dependencies
python >= 3.6
paddlenlp >= 2.2.2
paddlepaddle-gpu >= 2.2.1
(2) Preparing the running environment. Before running, create two new directories, data and checkpoints, in this directory to hold the dataset and the saved models respectively. Make sure the model directory name matches the checkpoint path set in config.py.
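Creating the two directories can also be scripted. This is a small sketch in which the helper prepare_dirs is hypothetical. It is demonstrated inside a temporary directory so nothing is written to the project, and in real use it would be called from the project root.

```python
import tempfile
from pathlib import Path


def prepare_dirs(base=".", names=("data", "checkpoints")):
    """Create the working directories if they do not exist yet."""
    created = []
    for name in names:
        path = Path(base) / name
        path.mkdir(parents=True, exist_ok=True)  # no error if it already exists
        created.append(path)
    return created


# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    dirs = prepare_dirs(tmp)
    all_exist = all(p.is_dir() for p in dirs)
```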
Dataset
https://bj.bcebos.com/v1/paddlenlp/data/cls_data.tar.gz
References
[1] Attribute-level sentiment analysis: https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/sentiment_analysis/ASO_analysis
[2] H. Tian, C. Gao, X. Xiao, H. Liu, B. He, H. Wu, H. Wang, and F. Wu, "SKEP: Sentiment knowledge enhanced pre-training for sentiment analysis," 2020, arXiv:2005.05635.