"The Missing She" Douban Short Review Data Analysis

1. Introduction

Have you watched the recently popular movie "The Missing She" (《消失的她》)? Opinions about it on the Internet are divided. I believe this blog will help you find an answer.

In this blog, we will analyze the Douban short review data for the movie "The Missing She". Our goal is to understand, through exploratory data analysis (EDA), sentiment analysis, and rating analysis of the review data, how audiences perceive the movie, how they rate it, and whether it is worth watching.

The data we will use includes:

  • 《消失的她》豆瓣短评数据.csv ("The Missing She" Douban short review data): this is our main dataset, containing Douban users' short reviews of the movie.
  • 停用词库.txt (stop-word list): this is the stop-word list we use for text preprocessing; it contains common words that should be ignored in the analysis.

Let's start!

2. Data loading and preprocessing

import pandas as pd
import numpy as np

# Load the data
df = pd.read_csv('《消失的她》豆瓣短评数据.csv')

# Inspect basic information about the data
df.info()
df.head()

[Image: output of df.info() and df.head()]

From the above output, we can see that the dataset contains 232 records, and each record contains 6 fields:

  • Commenter screen name: the commenter's username
  • Rating (评价): the reviewer's rating of the movie, such as 'recommended' (推荐) or 'okay' (还行)
  • Review (评论): the text of the reviewer's short review
  • Comment time: the time the review was posted
  • Review location (评论地点): the geographic location of the reviewer
  • Comment likes (评论点赞数): the number of likes the review received

We can also see that there are missing values in the fields: 'commenter screen name', 'rating', 'review', 'comment time', 'review location' and 'comment likes'. We need to deal with these missing values before proceeding with further analysis.
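Before dropping anything, it can help to count how many values are missing in each column. The following is a minimal sketch, assuming a small made-up frame: the rows are hypothetical and only reuse two of the column names that appear in this post's code.

```python
import pandas as pd

# Hypothetical toy rows; the real data come from the CSV file loaded above.
toy = pd.DataFrame({
    '评价': ['推荐', None, '还行'],
    '评论点赞数': [12, 3, None],
})

# Number of missing values in each column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# dropna() keeps only the rows with no missing value in any column
complete_rows = toy.dropna()
print(len(complete_rows))
```

Comparing the row count before and after `dropna()` shows how much data the cleaning step discards.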

# Handle missing values by dropping incomplete rows
df = df.dropna()

# Inspect the basic information again
df.info()
df.head()

[Image: output of df.info() and df.head() after dropping rows with missing values]

By removing rows containing missing values, we now have 217 complete records. In the next step we will perform exploratory data analysis on our dataset.

3. Exploratory Data Analysis

In this part, we will conduct a preliminary exploration of the data, including:

  • View rating distribution of reviews
  • View the distribution of comment likes
  • View geographic distribution of reviews

This will help us understand the overall rating of the movie by the audience, as well as some basic characteristics of the review.

1. View the rating distribution of reviews

df['评价'].value_counts()
还行    63
推荐    54
较差    47
很差    38
力荐    15
Name: 评价, dtype: int64
import matplotlib.pyplot as plt
import seaborn as sns
from pyecharts.charts import Pie
from pyecharts import options as opts

# Set the plotting style
sns.set_style('whitegrid')

# Category labels and their counts for the pie chart
counts = df['评价'].value_counts()
cate = [str(i) for i in counts.index]
data = [int(i) for i in counts.values]

pie = (Pie()
       .add('', [list(z) for z in zip(cate, data)],
            radius=["30%", "75%"],
            rosetype="radius"
            )
       .set_global_opts(title_opts=opts.TitleOpts(title="《消失的她》评价", subtitle="总体分布"))
       .set_series_opts(label_opts=opts.LabelOpts(formatter="{b}: {d}%"))
      )

pie.render_notebook()

[Figure: pie chart of the rating distribution]

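From the picture above, we can see that 'okay' (还行) is the most common rating with 63 of 217 reviews (about 29%), followed by 'recommended' (推荐) with 54 (about 25%). The two negative ratings, 'poor' (较差) and 'very poor' (很差), together account for 85 reviews (about 39%), more than the 69 (about 32%) that are 'recommended' or 'highly recommended' (力荐), so the ratings are mixed rather than clearly positive.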
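The shares shown in the pie chart can be re-derived by hand from the counts printed above. A minimal sketch follows; grouping the labels into "positive" and "negative" is an assumption made here for illustration, not something the original analysis defines.

```python
# Rating counts copied from the value_counts() output above
counts = {'还行': 63, '推荐': 54, '较差': 47, '很差': 38, '力荐': 15}

total = sum(counts.values())

# Share of each rating in percent, rounded to one decimal place
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

# Assumed grouping: 推荐 and 力荐 as positive, 较差 and 很差 as negative
positive = counts['推荐'] + counts['力荐']
negative = counts['较差'] + counts['很差']

print(total)
print(shares)
print(positive, negative)
```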

2. View the distribution of comment likes

Next, let's take a look at the distribution of comment likes.

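# Suppress warnings
import warnings
warnings.filterwarnings("ignore")               # ignore warning messages
plt.rcParams['font.sans-serif'] = ['SimHei']    # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False      # render the minus sign correctly
plt.rcParams['figure.dpi'] = 100                # figure resolution

# Summary statistics for the number of likes
df['评论点赞数'].describe()

# Plot the distribution of likes (histplot replaces the deprecated distplot)
sns.histplot(df['评论点赞数'], bins=20)
plt.show()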

[Figure: histogram of comment likes]

From the figure above, we can see that the distribution of comment likes is skewed to the right. Most reviews have fewer than 10,000 likes, and only a few have more than 10,000. In other words, a handful of reviews attract a large number of likes while most receive comparatively few.
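One quick numeric check for right skew is to compare the mean with the median. The sketch below uses hypothetical like counts (not values from the dataset) purely to illustrate the idea.

```python
import statistics

# Hypothetical like counts: many small values and a few very large ones
likes = [3, 5, 8, 12, 20, 35, 60, 150, 900, 12000]

mean_likes = statistics.mean(likes)
median_likes = statistics.median(likes)

# In a right-skewed sample the few large values pull the mean above the median
print(mean_likes, median_likes)
print(mean_likes > median_likes)
```

On the real data, the same comparison can be read off the `describe()` output, where the 50% row is the median.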

3. View the geographical distribution of comments

Next, we look at the geographic distribution of reviews.

# Plot the geographic distribution of reviews
plt.figure(figsize=(10, 8))
sns.countplot(y='评论地点', data=df, order=df['评论地点'].value_counts().index)
plt.title('评论的地理分布')
plt.xlabel('数量')
plt.ylabel('地点')
plt.show()

[Figure: bar chart of review counts by location]

From the picture above, we can see that the reviews mainly come from Beijing, Shanghai, Guangdong, Jiangsu and other places, where audience activity is relatively high.

Through the above exploratory data analysis, we now have a basic understanding of the data. Next we will conduct sentiment analysis to understand the audience's emotional inclination towards the movie.

4. Sentiment Analysis

In this part, we will conduct sentiment analysis on the review text to understand the audience's emotional tendency towards the movie. We will use the jieba library for Chinese word segmentation, and then use the SnowNLP library for sentiment analysis.

First, we need to load the stop-word list and define a function for text preprocessing.

import jieba
from snownlp import SnowNLP

# Load the stop-word list (a set makes membership tests fast)
with open('停用词库.txt', 'r', encoding='utf-8') as f:
    stop_words = {line.strip() for line in f}

# Text preprocessing function
def preprocess_text(text):
    # Segment the text with jieba
    words = jieba.cut(text)
    # Drop stop words
    words = [word for word in words if word not in stop_words]
    return ' '.join(words)

# Preprocess the review text
df['评论'] = df['评论'].apply(preprocess_text)
# Inspect the processed reviews
df['评论'].head()
0          一个   谋杀   老婆   男人   无意   谋杀   孩子   流泪   讽刺 
1                           倪妮   角色   T   铁   T   复仇记 
2    男主   b   超   照片   崩溃   孩子   杀   老婆   眼都   眨   ...
3    建议   情人节   档   安排   适合   情侣   宝宝   好   电影   ❤ ...
4    故事   20   分钟   猜   表演   倪妮   好似   没什么   信念   感...
Name: 评论, dtype: object
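The filtering step can be illustrated without jieba or the stop-word file. In this minimal sketch the tokens and the stop-word set are hypothetical stand-ins, and the helper name `remove_stop_words` is introduced here for illustration; it also drops whitespace-only tokens, which the function above does not do.

```python
# Hypothetical tokens and stop words, standing in for jieba output and 停用词库.txt
tokens = ['这', '部', '电影', '的', '结局', '很', '讽刺']
stop_words = {'这', '部', '的', '很'}

def remove_stop_words(tokens, stop_words):
    """Keep tokens that are neither stop words nor pure whitespace."""
    return [t for t in tokens if t not in stop_words and t.strip()]

filtered = remove_stop_words(tokens, stop_words)
print(' '.join(filtered))
```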

We have successfully preprocessed the reviews; next we will perform sentiment analysis with the SnowNLP library. SnowNLP's sentiment analysis classifies sentiment tendency and returns a floating-point number between 0 and 1. The closer the value is to 1, the more positive the sentiment; the closer it is to 0, the more negative.

from snownlp import SnowNLP

# Sentiment analysis function
def sentiment_analysis(text):
    return SnowNLP(text).sentiments

# Run sentiment analysis on each review
df['情感分析'] = df['评论'].apply(sentiment_analysis)
# Inspect the sentiment analysis results
df['情感分析'].head()
0    0.999920
1    0.998887
2    0.054732
3    0.905509
4    0.923089
Name: 情感分析, dtype: float64
# Plot a histogram of the sentiment scores
plt.hist(df['情感分析'], bins=20, alpha=0.5, color='steelblue', edgecolor='black')
plt.title('情感分析结果')
plt.xlabel('情感倾向')
plt.ylabel('评论数量')
plt.show()

[Figure: histogram of sentiment scores]

From the histogram we can see that most of the reviews lean toward positive sentiment, which suggests that the wording of the reviews is generally favorable to the movie.
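To put a number on "most", one option is to compute the share of scores above a cut-off. This is a minimal sketch: the 0.5 threshold and the helper name `positive_share` are assumptions, and the input is just the five scores printed by `head()` above, not the full column.

```python
# The five scores printed by df['情感分析'].head() above, used as sample input
scores = [0.999920, 0.998887, 0.054732, 0.905509, 0.923089]

THRESHOLD = 0.5  # assumed cut-off between negative and positive

def positive_share(scores, threshold=THRESHOLD):
    """Fraction of scores strictly above the threshold."""
    return sum(s > threshold for s in scores) / len(scores)

print(positive_share(scores))
```

Applied to the whole column, the same idea would be `(df['情感分析'] > 0.5).mean()`.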

Next, we will conduct a word cloud analysis of the reviews to better understand the themes of the audience's reviews of the movie.

from wordcloud import WordCloud

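# Join all reviews into one string
text = ' '.join(df['评论'])

# Generate the word cloud (a Chinese-capable font is required)
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white').generate(text)

# Display the word cloud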
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

[Figure: word cloud of the reviews]

From the word cloud, we can see the words with high frequency in the reviews, which can help us understand the audience's main review topics for the movie.
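A word cloud is a visual form of a word-frequency table, so the same information can be inspected directly. A minimal sketch, using only the first processed review from the output shown earlier as input:

```python
from collections import Counter

# First processed review from the output above, tokens separated by whitespace
processed_review = '一个 谋杀 老婆 男人 无意 谋杀 孩子 流泪 讽刺'

# split() with no argument also copes with runs of several spaces
word_counts = Counter(processed_review.split())

print(word_counts.most_common(3))
```

Running the same count over the joined text of all reviews would give the frequencies that drive the word sizes in the cloud.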

Next, we will analyze the movie's ratings: we will calculate the average rating and look at the distribution of the ratings.

# Convert the '评价' column to numeric scores
df['评价'] = df['评价'].map({'很差': 1, '较差': 2, '还行': 3, '推荐': 4, '力荐': 5})

# Compute the movie's average rating
average_rating = df['评价'].mean()
print(f'电影的平均评价是:{average_rating:.2f}')

# Plot a histogram of the ratings
plt.hist(df['评价'], bins=5, alpha=0.5, color='steelblue', edgecolor='black')
plt.title('评价分布')
plt.xlabel('评价')
plt.ylabel('评论数量')
plt.show()
电影的平均评价是:2.82

[Figure: histogram of ratings]

The average rating of the movie is 2.82, slightly below the midpoint of 3 ("okay") on the 1 to 5 scale. From the distribution chart, the two most common levels are "okay" and "recommended", but the "poor" and "very poor" levels together are large enough to pull the average below 3, so the explicit ratings are mixed rather than clearly favorable.
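The 2.82 figure can be checked by hand as a weighted mean of the rating counts listed earlier in this post. A minimal sketch using those counts and the same label-to-score mapping as the code above:

```python
# Rating counts from the value_counts() output earlier in the post
counts = {'很差': 38, '较差': 47, '还行': 63, '推荐': 54, '力荐': 15}
# The same label-to-score mapping used in the code above
score = {'很差': 1, '较差': 2, '还行': 3, '推荐': 4, '力荐': 5}

total_reviews = sum(counts.values())
weighted_sum = sum(score[label] * n for label, n in counts.items())
average = weighted_sum / total_reviews

print(f'{average:.2f}')
```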

To sum up, the SnowNLP sentiment scores lean positive, while the explicit ratings average 2.82, slightly below "okay". Audience opinion is therefore divided: the movie may still be worth watching, but the data analyzed here do not show a clear consensus in its favor.

Origin blog.csdn.net/qq_52417436/article/details/131565367