Talk about Sentiment Analysis

I had some free time recently, so I signed up for the US Spring Competition with my friends. I used a sentiment analysis model in it, and I introduce it below.

What is a sentiment analysis model?

Introduction

In a narrow sense, sentiment analysis refers to using computer technology to mine and analyze the emotions expressed in text, images, audio, video, and even cross-modal data. In a broad sense, it also covers the analysis of opinions, attitudes, tendencies, and so on. Sentiment analysis mainly involves two objects: the target being evaluated (goods, services, organizations, individuals, topics, problems, events, etc.) and the attitude or emotion expressed toward that target. It has a wide range of applications in public opinion management, business decision-making, precision marketing, and other fields, and it plays a pivotal role in scenarios such as stock market and election prediction. The rise of sentiment analysis was driven largely by social media and online platforms such as forums, blogs, and Weibo, and it has been one of the active fields in natural language processing since 2000. In practice, however, sentiment analysis of social network data is still very difficult. One of the main reasons is that such data contains a lot of useless "junk" information, which is also why much work in natural language processing (machine translation, for example) struggles to achieve good results in real-life scenarios.

Research Methods of Sentiment Analysis

The research methods of sentiment analysis fall mainly into supervised and unsupervised methods. Early supervised learning relied on shallow models such as SVM, maximum entropy, and naive Bayes, while unsupervised learning relied on methods such as sentiment dictionaries and semantic analysis. Deep learning has since achieved the best results in many classification and regression tasks, and in recent years applying deep learning to sentiment analysis has also become a research hotspot.
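To make the "shallow supervised model" idea concrete, here is a minimal sketch of a multinomial naive Bayes classifier written in plain Python. The training sentences, function names, and the choice of add-one smoothing are my own assumptions for illustration; a real project would more likely use a library implementation and far more data.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """Count classes and per-class word frequencies from (text, label) pairs."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labeled_docs:
        class_counts[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocabulary.add(word)
    return class_counts, word_counts, vocabulary

def predict_naive_bayes(model, text):
    """Return the label with the highest log-posterior for the text."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in class_counts:
        score = math.log(class_counts[label] / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing so unseen class/word pairs are not zero
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data, invented for this example
training_data = [
    ("great product love it", "positive"),
    ("excellent service very happy", "positive"),
    ("terrible quality very disappointed", "negative"),
    ("awful experience hate it", "negative"),
]
model = train_naive_bayes(training_data)
print(predict_naive_bayes(model, "love the excellent quality"))
```

The classifier only learns from labeled examples, which is what distinguishes it from the dictionary-based unsupervised methods mentioned above.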

Three Levels of Sentiment Analysis

Sentiment analysis is mainly divided into three levels: document level, sentence level, and aspect level. The document level takes the entire document as the unit of analysis and assumes that the document discusses a single entity and that its sentiment is clear-cut, that is, neutral, positive, or negative. The sentence level takes each sentence as a separate unit of analysis. Since sentences may be connected to one another, a single sentence cannot always be treated as a clear, self-contained opinion. At the aspect level, the classification granularity is finer: we need to extract the separate evaluations of different targets, then aggregate them to obtain the final sentiment. This involves aspect extraction, entity extraction, and aspect sentiment classification. For example, consider the sentence "Although the result is poorly interpretable for deep learning, it is very effective for image recognition tasks." Here, deep learning is the entity, and the extracted aspects are "result" and "image recognition". The sentiment toward "result" is negative, and the sentiment toward "image recognition" is positive.
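The aspect-level example above can be sketched in a few lines. This is only a toy illustration: the aspect list and the two-word opinion lexicon are hand-made assumptions for this one sentence, and the rule "an aspect gets the polarity of its clause" is a deliberate simplification, not a real aspect extraction method.

```python
import re

# Hypothetical hand-made resources that only cover this one example
ASPECT_TERMS = ["result", "image recognition"]
OPINION_LEXICON = {"poorly": -1, "effective": 1}

def aspect_sentiments(sentence):
    """Give each aspect the summed polarity of the opinion words in its clause."""
    results = {}
    for clause in re.split(r"[,;]", sentence.lower()):
        words = re.findall(r"[a-z]+", clause)
        polarity = sum(OPINION_LEXICON.get(word, 0) for word in words)
        if polarity > 0:
            label = "positive"
        elif polarity < 0:
            label = "negative"
        else:
            label = "neutral"
        for aspect in ASPECT_TERMS:
            if aspect in clause:
                results[aspect] = label
    return results

sentence = ("Although the result is poorly interpretable for deep learning, "
            "it is very effective for image recognition tasks.")
print(aspect_sentiments(sentence))
# {'result': 'negative', 'image recognition': 'positive'}
```

A document-level model would have to give this sentence one overall label, which is exactly the information the aspect level tries not to lose.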

Document-level sentiment analysis is mainly a binary classification problem (positive or negative). We can also cast it as a regression problem, that is, assign the document a sentiment score. Initially, the traditional solution was based on the bag-of-words model of the document, for example computing word frequencies or TF-IDF scores. The most direct problems with this approach are that the resulting matrix is sparse and that the order of words in the sentence is ignored. The n-gram model was therefore introduced later. An n-gram model estimates the probability of a word sequence through simple counts over the corpus. It has been a core component of NLP for the past few decades, with 2-grams and 3-grams being the most commonly used. Because it considers several consecutive words at once, it alleviates the word-order problem in short texts to some extent. However, we still need smoothing to handle out-of-vocabulary words, and the method takes no semantic information into account. Then, in 2003, Bengio proposed word vectors, which represent text as dense vectors and are widely used in neural networks. Word embeddings cannot solve the problem of polysemy, however, and an elegant solution did not appear until ELMo.
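The difference between bag-of-words, TF-IDF, and n-grams is easy to see on a tiny example. The sketch below uses only the standard library; the function names are my own, and the TF-IDF variant shown (term frequency times log(N / document frequency)) is just one of several common definitions.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word counts only; word order is thrown away."""
    return Counter(text.lower().split())

def ngrams(text, n):
    """All runs of n consecutive words, in order."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def tf_idf(documents):
    """One {word: score} dict per document, with idf = log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    doc_freq = Counter(word for words in tokenized for word in set(words))
    n_docs = len(documents)
    scores = []
    for words in tokenized:
        counts = Counter(words)
        scores.append({
            word: (count / len(words)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

a, b = "dog bites man", "man bites dog"
print(bag_of_words(a) == bag_of_words(b))  # True: same words, order ignored
print(ngrams(a, 2) == ngrams(b, 2))        # False: bigrams keep local order
print(tf_idf(["good movie", "bad movie"])[0])
```

In the last line, "movie" appears in every document, so its score is zero, while "good" gets a positive weight. Neither representation knows that "good" and "great" are related, which is the gap word vectors were meant to fill.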

Sentence-level sentiment analysis is similar to the document level: we also need to convert sentences into sentence vectors and then classify them. The difference is that, because sentences are short, we can also make use of syntactic dependency trees. In addition, for social media text such as tweets, we can consider information such as social relationships. In early research, models combined with syntactic tree analysis dominated; later, neural networks became the mainstream.

Aspect-level sentiment analysis differs from the document and sentence levels: we need to consider the target and the sentiment expressed toward each of its aspects at the same time, and the sentiment carried by different contexts affects the final result. The model must therefore capture both the target and its relationship to the surrounding emotional context, which makes the task harder. With neural networks, we generally decompose aspect-level sentiment classification into three sub-tasks: first, representing the context words; second, representing the target, which can be done with embeddings; third, identifying the context words that carry the sentiment toward a specific target.
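One common way to handle the third sub-task is an attention mechanism, which I sketch below. The 2-dimensional embeddings are hand-picked numbers for illustration only (a real model learns them), and dot-product attention is my assumption here; it is just one of several ways to link a target to its context words.

```python
import math

# Hypothetical 2-dimensional embeddings, hand-picked for illustration only
EMBEDDINGS = {
    "food": [1.0, 0.0],
    "service": [0.0, 1.0],
    "delicious": [0.9, 0.1],
    "slow": [0.1, 0.9],
}

def attention_weights(context_words, target_word):
    """Softmax over dot products between each context word and the target."""
    target = EMBEDDINGS[target_word]                  # sub-task 2: target representation
    context = [EMBEDDINGS[w] for w in context_words]  # sub-task 1: context representation
    scores = [sum(c * t for c, t in zip(vec, target)) for vec in context]
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    # sub-task 3: a higher weight means the word matters more for this target
    return [(word, e / total) for word, e in zip(context_words, exps)]

context = ["delicious", "slow"]
for target in ("food", "service"):
    weights = attention_weights(context, target)
    focus = max(weights, key=lambda pair: pair[1])[0]
    print(target, "->", focus)
# food -> delicious
# service -> slow
```

The same context gets different weights depending on the target, which is how one sentence can yield a positive label for one aspect and a negative label for another.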

 

Let me introduce my model to you.

Data Display

Here's some of the data.

[Figure: sample rows of the dataset]

My code:

import re

import nltk
import numpy as np
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from textblob import TextBlob

# Uncomment on the first run to download the stop-word list
# nltk.download('stopwords')

# Load the data
data = pd.read_excel('C:/Users/HP/Desktop/PRO/User evaluation.xlsx')

# Build the stop-word set and the stemmer once, not once per word
STOP_WORDS = set(stopwords.words('english'))
stemmer = PorterStemmer()

# Preprocess the text data
def preprocess_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)  # remove special characters and digits
    words = text.lower().split()  # lowercase, then split into words
    words = [word for word in words if word not in STOP_WORDS]  # remove stop words
    words = [stemmer.stem(word) for word in words]  # apply stemming
    return ' '.join(words)  # join the words back into a single string

data['Processed_Text'] = data['Text'].apply(preprocess_text)

# Run the sentiment analysis
# Note: stemmed tokens may not match TextBlob's lexicon, so scores can
# differ from those obtained on the raw text.
def get_sentiment(text):
    analysis = TextBlob(text)
    return analysis.sentiment.polarity  # polarity lies in [-1.0, 1.0]

data['Sentiment_Score'] = data['Processed_Text'].apply(get_sentiment)

# Compute the average sentiment score over the whole dataset
average_sentiment_score = np.mean(data['Sentiment_Score'])
print(f'Average Sentiment Score: {average_sentiment_score}')

# Check how the sentiment score varies across sources
sentiment_by_source = data.groupby('Source')['Sentiment_Score'].mean()
print(sentiment_by_source)
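The code above treats sentiment as a regression-style score. If you want the classification view instead, a polarity score can be bucketed into labels. The sketch below is an optional add-on, not part of my original model: the function name `label_polarity` and the 0.05 threshold are assumptions you should tune for your own data.

```python
from collections import Counter

def label_polarity(score, threshold=0.05):
    """Map a polarity score in [-1, 1] to a coarse label.

    The threshold is an arbitrary assumption, not a TextBlob default.
    """
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"

# Hypothetical scores standing in for data['Sentiment_Score']
example_scores = [0.8, 0.02, -0.4, 0.0, 0.3]
labels = [label_polarity(score) for score in example_scores]
print(Counter(labels))
```

With the DataFrame from the code above, the same function could be applied as `data['Sentiment_Score'].apply(label_polarity)`.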

Origin blog.csdn.net/Allen1862105/article/details/129948649