Sentiment Analysis of Weibo Corpus

Table of contents

Statement of Originality

Chapter One Introduction

1.1 Research Background

1.2 Research Topic

Chapter Two Sentiment Analysis Preparation

2.1 Knowledge base of sentiment analysis

2.2 SnowNLP library

2.3 Word2vec method

2.4 Working principle of word vectors

Chapter Three: not thought out yet...

Chapter Four: not thought out yet...

Chapter Five: not thought out yet...

Closing remarks:

Now that ChatGPT is out, the coder is dead. Here is the code, take it away!

The Word2vec approach:

Call SnowNLP directly



Statement of Originality

        I completed this experiment with great interest and added some of my own rough understanding and preliminary views on current NLP techniques. Some of the wording is not very rigorous. To save time, I drew on results and theory from academic papers in the discussion of the experiment, so parts of this report are not original to me, and I have not marked the sources of those results one by one; I wrote the report in a spirit of learning, so it should not be treated as a formal dissertation. Also, some of the terms I use may not be entirely accurate, but I think they sound impressive, so I kept them. Please do not scrutinize them too closely, and if there are shortcomings, I ask for your understanding.

Chapter One Introduction

1.1 Research Background

        With the rapid development of computer technology, we have entered the era of artificial intelligence. In particular, ChatGPT, the AI-driven natural language processing (NLP) tool recently launched by the American AI research laboratory OpenAI, has been extremely popular this year. The emergence of ChatGPT has made AI far more widely used in language interaction. On Twitter, feedback from users suggests that the intelligence of version 3.0, trained on a corpus up to 2021, is already at about the level of a puppy; GPT-3 (skipping versions 1 and 2 here) is simply the base model on which ChatGPT is built, later versions of ChatGPT improve on that model, and the newest version, GPT-4, is more accurate and more versatile. At the same time, GPT-4 has been connected to software on various platforms, which is like attaching hands and feet to an already powerful brain: it adds practicality and fun for users, lets the model play its full role in all kinds of human fields, and pushes scientific and technological civilization to a higher level. The only fly in the ointment is that domestic users have some difficulty accessing the service, and ChatGPT is relatively expensive. To use the latest version (version 5.0 is said to be coming as well), users pay $20 per month; at the latest exchange rate (6.8680, as of 2023-03-25 05:00), that is at least 137 yuan per month for domestic users, which is simply unacceptable for someone as poor as me! So let's stick with version 3.5, unless you can somehow get 4.0 for free... hhh

Shortly after ChatGPT launched, a group of domestic companies immediately rolled out thin-wrapper products around ChatGPT, quickly occupied the domestic market, and harvested the "leeks" (cashed in on users). Meanwhile, Baidu launched a similar product in March this year, Wenxin Yiyan (ERNIE Bot). What is more interesting is that on the day of the press conference, after Robin Li introduced the product, Baidu's stock began to fall, and most netizens felt that Wenxin Yiyan is no match for ChatGPT. Recently the media has also been hyping the Pangu model said to be launched by Tencent, and most netizens who understand the national conditions are not very optimistic about it. I have only applied for the new Bing API and OpenAI's ChatGPT-4 interface; only Microsoft's application went through (in fact it is GPT version 3.0), and OpenAI has not responded to me so far. As for Wenxin Yiyan, I only applied for it yesterday, and I have not paid attention to the others. In my view, domestic artificial intelligence technology still needs to improve and has a long way to go.

1.2 Research Topic

        There are all kinds of communication platforms on the Internet. Since the Internet took off in China, the major Internet companies have launched QQ, WeChat, Tieba, Weibo, Zhihu, and other social software. These apps have greatly changed how people communicate, letting them express their ideas more efficiently and accurately. People hold different views on every issue, and those views more or less carry personal emotional tendencies; as NLP practitioners, we naturally want to analyse these opinions and judge the emotions behind them (the analysis method I use is described later). Among the various social platforms, Weibo covers a very wide range of groups, with hundreds of millions of daily active users. As the user base has grown over the past few years, Weibo's interactive content has become more diverse, and its rich content ecosystem gives users a better experience. Massive amounts of emotional text are generated on Weibo every moment, covering politics, economics, entertainment, sports, and other fields, generally full of emotional tendencies and personal positions, and therefore of great research value. In this project, we therefore take Weibo as the example, use natural language processing (NLP) techniques to build a model from the collected Weibo corpus, and then use the trained model to perform word segmentation and sentiment analysis on the test data.


Chapter Two Sentiment Analysis Preparation

2.1 Knowledge base of sentiment analysis

        Sentiment analysis is also called comment mining or opinion mining. In terms of classifying emotional information, it can generally be divided, according to the target task, into coarse-grained and fine-grained sentiment classification. Coarse-grained sentiment classification judges the overall sentiment of a text, that is, a person's overall evaluation of some thing or object. It comes in two forms: one is sentiment-polarity classification, with three categories (positive, negative, and neutral); the other is emotion classification, which divides subjective emotions into categories such as joy, anger, sadness, fear, and surprise. So far, text sentiment analysis methods fall mainly into three families: sentiment-lexicon based, machine-learning based, and deep-learning based. Ah, forget these theories; now that ChatGPT exists we can just ask it, and besides, many libraries can be called directly, so let's get straight to the project (note: it is really just a simple language-processing job, hardly worthy of being called a project, but let it be)!

2.2 SnowNLP library

        In this data-mining experiment I used the SnowNLP library. It is a Python library developed by Chinese developers, designed specifically for mining Chinese text. The algorithms are already built in; you just call the functions yourself and build a corpus for your particular texts, which is really convenient. There are quite a few algorithms inside, and I have been trying to learn their principles these days. I ran into many difficulties, though: while learning the library I looked up a lot of material and found few detailed explanations; much of it simply reposts the official introduction, and some explanations are rather vague. Some of the SnowNLP examples I found are written in English (including the introduction on the SnowNLP project page), so English comprehension was also a major difficulty in the learning process, but I did gain something along the way. I will simply record how to call some of the modules; as for the underlying principles...

        First prepare the corpora, neg.csv and pos.csv, and use them to train your own model. The steps are: call the sentiment module of the SnowNLP library, then save your own model into your own folder; it will be used later. There is not a lot of training data here, each file only has a bit more than 10,000 Weibo entries, so training finishes in about five minutes, and you can use that time to memorize a few English words.
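        A minimal sketch of this training step (the file names and save path mirror the full code at the end of this post; only the comments are mine):

from snownlp import sentiment

# Train a sentiment model from the negative and positive corpora
sentiment.train("neg.csv", "pos.csv")
# Save it for later use (snownlp appends a version suffix to the file name itself)
sentiment.save('../saved_model/weibo_pro.marshal_2.0')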

        Then load your own model to run sentiment analysis on the test corpus. Before doing so, remember to load your own model rather than the default one that SnowNLP ships with; the default model path and your own model path are different.

        After loading the model, process the data and run the sentiment analysis; the code is easy enough to read at a glance, so I will not be wordy. In short, each line of the valid_data file holds one Weibo post, so every line ends with a newline character; we therefore use x.strip() to get the sentence to be analysed and write it into a list, then call SnowNLP(...).sentiments on each sentence to compute a sentiment value in the interval [0, 1], and write that into another list.
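        A sketch of this loading and scoring step (paths follow the code at the end of the post; note that, as far as I can tell, simply reassigning sentiment.data_path after import does not swap the already-loaded classifier, so the model is loaded explicitly here):

from snownlp import sentiment
from snownlp import SnowNLP

# Load the custom model saved above (pass the path without the ".3" suffix
# that snownlp adds by itself under Python 3)
sentiment.classifier.load('../saved_model/weibo_pro.marshal_2.0')

# Read the test corpus, one Weibo post per line, and score each sentence
with open("valid_data.txt", "r", encoding="utf-8") as f:
    sentences = [x.strip() for x in f if x.strip()]
scores = [SnowNLP(s).sentiments for s in sentences]  # each value lies in [0, 1]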

        I will now describe the rest of the experimental procedure very briefly:

        And then so...

        And then that's it...

        Then the experiment ends

        ......

        Document backup:

        OK, if you want to go a little deeper, feel free to copy away. Let me briefly summarize the principle of the Word2vec model (note: it is not the only word-vector model).

2.3 Word2vec method

        Using Word2vec word vectors involves three steps: Chinese word segmentation, stop-word removal, and training the word vectors. Since the experimental corpus comes from the Chinese Weibo platform, the text is mainly Chinese, with a few English words mixed into a very small number of sentences. English words are separated by spaces, which act as natural word boundaries, so they need no segmentation, whereas Chinese text needs a word-segmentation algorithm to split each sentence into its words and symbols in turn.

        In the experiment, the Python-based open-source segmentation tool "jieba" is used to segment the Weibo comment text. The tool has three segmentation modes: full mode, precise mode, and search-engine mode, which can be chosen according to the task. After the program reads the corpus file, it calls jieba.cut() on each sentence. jieba.cut() takes three parameters: the first is the string to be segmented; the second, cut_all, is True or False and decides between full mode and precise mode; the third, HMM, controls whether the HMM model is used. In the experiment, precise mode was chosen to ensure segmentation accuracy. The next stage after segmentation is stop-word filtering. Because online comments are highly spontaneous and informal, sentences contain many words with no practical meaning, such as "的", "是", and "啊", as well as many punctuation marks that are useless for sentiment classification, such as ",", "*", ":", "#", "?", "@", "[", "]" and so on. Too much redundant information makes the sentiment-analysis task harder and hurts accuracy, so it should be removed before further processing. After that, the keywords are extracted. A small sketch of this step follows.
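        A minimal sketch of the segmentation and stop-word filtering step (the tiny stop-word set here is just for illustration; in practice you would load a full list from a file):

import jieba

# A tiny illustrative stop-word set; replace with a full list in practice
stopwords = {"的", "是", "啊", ",", "!", "#", "@"}

def segment(sentence, stopwords):
    # Precise mode (cut_all=False) with the HMM enabled for unseen words
    words = jieba.cut(sentence, cut_all=False, HMM=True)
    # Drop stop words and bare whitespace
    return [w for w in words if w.strip() and w not in stopwords]

print(segment("今天天气真好,我们一起去看电影吧!", stopwords))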

        Let me go over the parameters of gensim's Word2Vec model here (a short training sketch follows this list):

        sentences: the corpus to analyse; it can be a list of tokenized sentences, or it can be streamed from a file.

        size (vector_size in gensim 4.x): the dimensionality of the word vectors, default 100. A suitable value generally depends on the corpus size: for a small corpus, say under 100 MB of text, the default is usually enough, while a very large corpus warrants a higher dimension.

        window: the maximum distance between a word and its context words (the context window, often denoted c). The larger the window, the more distant words contribute to a word's context. The default is 5, and it can be adjusted to the task: for a small corpus a smaller value works, and for general corpora a value in [5, 10] is recommended.

        sg: the choice between the two word2vec architectures: 0 for the CBOW model and 1 for the Skip-Gram model. The default is 0, i.e. CBOW.

        hs: the choice between the two training objectives: if 0 and the number of negative samples (negative) is greater than 0, Negative Sampling is used; if 1, Hierarchical Softmax is used. The default is 0, i.e. Negative Sampling.

        negative: the number of negative samples drawn when using Negative Sampling (often denoted neg), default 5; values in [3, 10] are recommended.

        cbow_mean: only used when CBOW builds its projection: 0 sums the context word vectors, 1 averages them. The default is 1 (the average).

        min_count: the minimum word frequency a word needs to get a vector; this prunes very rare low-frequency words. The default is 5; for a small corpus it can be lowered.

        iter (epochs in gensim 4.x): the number of passes of stochastic gradient descent over the corpus, default 5. For a large corpus this can be increased.

        alpha: the initial step size (learning rate) of the stochastic gradient descent, default 0.025.

        min_alpha: the minimum step size; since the algorithm gradually reduces the step size during training, the per-round step size is determined jointly by iter, alpha, and min_alpha.

        For a large corpus, alpha, min_alpha, and iter need to be tuned together to find suitable values.
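        A minimal training sketch with these parameters (the tiny toy corpus is mine; the real input would be the segmented Weibo sentences):

from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences (made up for illustration)
tokenized = [["今天", "天气", "真好"], ["我们", "一起", "看", "电影"]]

model = Word2Vec(
    sentences=tokenized,
    vector_size=100,    # "size" in gensim < 4.0
    window=5,
    sg=0,               # CBOW
    hs=0, negative=5,   # negative sampling with 5 noise words
    cbow_mean=1,        # average the context vectors
    min_count=1,        # keep every token of this tiny corpus
    epochs=5,           # "iter" in gensim < 4.0
    alpha=0.025, min_alpha=0.0001,
)
print(model.wv.most_similar("天气", topn=3))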

2.4 Working principle of word vectors

        The basic idea is this: through training, each word of a language is mapped to a fixed-length short vector ("short" relative to the "long" one-hot representation). All these vectors together form a word-vector space, and each vector can be regarded as a point in that space. By introducing a "distance" on this space, the (lexical, semantic) similarity between words can be judged. For example, suppose each word vector is one of N points scattered on a two-dimensional plane, and given some point we want to find the point closest to it. We proceed as follows: first set up a rectangular coordinate system, so that every point has a unique coordinate; then introduce the Euclidean distance; finally compute the distance between this word and the other N-1 words, and the word with the minimum distance is the one we are looking for.
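        A small illustration of this nearest-neighbour idea (the 2-D "word vectors" below are made up; real Word2vec vectors would be 100-dimensional):

import numpy as np

# Toy 2-D vectors standing in for trained word vectors
vectors = {
    "开心": np.array([0.9, 0.8]),
    "高兴": np.array([0.85, 0.75]),
    "难过": np.array([-0.7, -0.9]),
}

# Find the word whose vector has the smallest Euclidean distance to the query
def nearest(word):
    query = vectors[word]
    others = {w: np.linalg.norm(query - v) for w, v in vectors.items() if w != word}
    return min(others, key=others.get)

print(nearest("开心"))  # expected: "高兴"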

        Of course, this part can be handed straight to GPT (I am using version 3.5 here, since 4.0 is out of reach; forgive me, magicians). You can see that the answer it gives is more systematic, more accurate, and more professional than my own understanding. It is undeniable that its knowledge framework is really handy and convenient in many places; it would be even better with access to the 4.0 API.

        The code is shown below. One piece of stop-word handling code in the middle is not included; including it would have taken a second screenshot, so...


Chapter Three: not thought out yet...

Chapter Four: not thought out yet...

Chapter Five: not thought out yet...


Closing remarks:

        If you are not satisfied with the results, you can try more models or refine the data and the model. The approach is roughly: load the model, read in the Weibo corpus (.txt), feed the Word2vec vectors into LSTM, Bi-LSTM, Self-Attention, and CNN networks for training, compute the accuracy and F1 score of each model, and write them into an Excel table...
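        A rough sketch of the "Word2vec + LSTM" idea above, using tf.keras (everything here, including sequence length and label handling, is my own assumption rather than the author's actual setup):

import numpy as np
import tensorflow as tf
from gensim.models import Word2Vec

# Reuse the Word2vec model trained earlier
w2v = Word2Vec.load("../saved_model/sentiment_analysis.model")
vocab = w2v.wv.index_to_key
word2idx = {w: i + 1 for i, w in enumerate(vocab)}   # index 0 is reserved for padding

# Build an embedding matrix from the trained word vectors
emb_dim = w2v.vector_size
emb_matrix = np.zeros((len(vocab) + 1, emb_dim))
for w, i in word2idx.items():
    emb_matrix[i] = w2v.wv[w]

MAX_LEN = 50
def encode(tokens):
    # Map tokens to indices and pad/truncate to MAX_LEN
    ids = [word2idx.get(t, 0) for t in tokens][:MAX_LEN]
    return ids + [0] * (MAX_LEN - len(ids))

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(
        input_dim=len(vocab) + 1, output_dim=emb_dim,
        embeddings_initializer=tf.keras.initializers.Constant(emb_matrix),
        trainable=False),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),  # swap in plain LSTM / CNN here
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=5)
# F1 can then be computed with sklearn.metrics.f1_score on the predictions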

        Also, my cover was generated with GPT-3.5; it looks quite abstract, but it is impressive enough. One thing I will say is that ChatGPT is really easy to use; if you are willing, magicians, we can exchange ideas and learn together.


Now that ChatGPT is out, the coder is dead. Here is the code, take it away!

The Word2vec approach:

from snownlp import sentiment
from gensim.models import Word2Vec
import jieba

# Train a SnowNLP sentiment model from the raw Weibo corpora and save it
sentiment.train("train_neg_data.txt",
                "train_pos_data.txt")
sentiment.save('./saved_model/weibo.marshal')

# Read the positive corpus
pos_file = "train_pos_data.txt"
with open(pos_file, 'r', encoding='utf-8') as f:
    pos_text = f.read()

# Read the negative corpus
neg_file = "train_neg_data.txt"
with open(neg_file, 'r', encoding='utf-8') as f:
    neg_text = f.read()

# Segment the positive and negative corpora with jieba
pos_words = jieba.lcut(pos_text)
neg_words = jieba.lcut(neg_text)

# Stop-word filtering would go here (omitted in this snippet)

# Build the Word2Vec model
# (note: each corpus is passed as one long token list; splitting it into one
#  token list per Weibo post would be more appropriate)
sentences = [pos_words, neg_words]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

# Save the model
model_file = '../saved_model/sentiment_analysis.model'
model.save(model_file)

# Load the model back
model = Word2Vec.load(model_file)

# Inspect a word vector
# print(model.wv['好'])

Call SnowNLP directly

import pandas as pd
import snownlp
from snownlp import sentiment
from snownlp import SnowNLP

# Convert the corpora from xls to csv
neg = pd.read_excel("neg.xls")
pos = pd.read_excel("pos.xls")

neg.to_csv("neg.csv")
pos.to_csv("pos.csv")

# Train on the corpora and build the model
sentiment.train("neg.csv",
                "pos.csv")
sentiment.save('../saved_model/weibo_pro.marshal_2.0')  # this takes about five minutes


# Use the saved model to analyse the sentiment of the test data.
# Note: reassigning sentiment.data_path after import does not reload the
# classifier that snownlp already loaded, so load the trained model explicitly
# (snownlp itself appends ".3" to the file name under Python 3).
sentiment.classifier.load('../saved_model/weibo_pro.marshal_2.0')

# Run sentiment analysis on the test data
data = []
data_file = "valid_data.txt"
with open(data_file, "r", encoding="utf-8") as f:
    data_line = f.readlines()
for line in data_line:
    data_sentence = line.strip("\n")
    data.append(data_sentence)
test_data = pd.DataFrame(data, columns=["sentence"])

# Write the analysis results to an Excel file
test_data["sentiment"] = test_data["sentence"].apply(lambda x: SnowNLP(x).sentiments)
test_data["sentiment"].to_excel("../results/marshal_2.0_test_result.xlsx")  # .xlsx, since newer pandas no longer writes legacy .xls
import snownlp
from snownlp import SnowNLP
import pandas as pd

# Load the trained model explicitly via the classifier (just changing a path
# attribute after import does not swap the model); pass the path without the
# ".3" suffix that snownlp adds by itself
snownlp.sentiment.classifier.load("./saved_model/weibo_pro.marshal_2.0")

# df = pd.DataFrame(columns=['text', 'sentiment'])

text = []
sent = []
with open("valid_data.txt", 'r', encoding='utf-8') as f:
    data = f.readlines()
for line in data:
    sentence = line.strip("\n")
    score = SnowNLP(sentence).sentiments
    text.append(sentence)
    sent.append(score)
df_text = pd.DataFrame(text, columns=['text'])
df_sent = pd.DataFrame(sent, columns=['sentiment'])

pf = pd.concat([df_text, df_sent], axis=1)  # put text and sentiment side by side
pf.to_excel('./results/SnowNLP_result.xlsx', index=False)

