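Sisters Who Make Waves? No, more like grand-aunts stirring up trouble! Analyzing about 90,000 Mango TV danmaku (bullet comments) and reviews with Python: who really holds the center (C) position?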

Data import

In [33]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from pyecharts.charts import Pie, Funnel, Map, Page, Bar, Sankey
from pyecharts import options as opts
from pyecharts.globals import SymbolType

# On matplotlib >= 3.6 this style is named 'seaborn-v0_8'
plt.style.use('seaborn')

In [34]:

# Enable Chinese characters in matplotlib
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

In [35]:

df_1 = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\百度百科选手工作信息.xlsx')
df_1.head()

Out[35]:

	0
0	阿朵中国内地女歌手、演员代表作：水果姑娘、熊猫大侠、专列一号等
1	黄圣依中国内地女演员、歌手代表作：功夫、天仙配、碧血剑、白蛇传说等
2	金莎中国内地女歌手、演员代表作：十八岁的天空、星月神话、被风吹过的夏天等
3	孟佳中国内地女歌手、演员代表作：BAD BUT GOOD、A CLASS等
4	王丽坤中国内地女演员代表作：五号特工组、美人心计、武动乾坤等

In [36]:

df_2 = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\维基百科数据.xlsx')
df_2.head()

Out[36]:

	本名	出生地	出生日期	出道年数	初舞台分数	最终排名	个人特质	成团潜力	舞台表现	声乐表现
0	蓝盈莹	上海	（30岁）	9	91	1	23	24	22	22
1	黄龄	上海	（33岁）	13	89	2	23	22	21	23
2	李斯丹妮	四川	（30岁）	9	87	3	23	23	23	18
3	孟佳	湖南	（30岁）	10	87	4	23	22	22	20
4	沈梦辰	湖南	（31岁）	10	86	5	23	21	23	19

In [37]:

df_3 = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\选手豆瓣热度排名.xlsx')
df_3.head()

Out[37]:

	本名	normalized_num
0	袁咏琳	0.0
1	朱婧汐	1.5
2	钟丽缇	3.0
3	王智	3.9
4	沈梦辰	6.4

In [38]:

df_4 = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\选手弹幕热度排名.xlsx')
df_4.head()

Out[38]:

	本名	normalized_num
0	袁咏琳	0.0
1	朱婧汐	0.4
2	王智	0.7
3	陈松伶	4.4
4	刘芸	6.6

Data preprocessing

Table df_1: extract the occupations

In [39]:

# Rename the column
df_1.columns = ['col1']
# Extract the name
df_1['姓名'] = df_1.col1.str.extract(r'(.*?)\s')
# Extract the occupation
df_1['职业'] = df_1.col1.str.extract(r'\s(.*?)代表作')
# regex=True is required for a compiled pattern on recent pandas versions
df_1['职业'] = df_1['职业'].str.replace(re.compile('中国内地女|中国香港女|中国台湾女|美籍华语女|中国内地'), '', regex=True)
# First occupation
df_1['第一职业'] = df_1['职业'].str.split('、').str[0]
# Second occupation
df_1['第二职业'] = df_1['职业'].str.split('、').str[1]
# Fill missing values
df_1['第二职业'] = df_1['第二职业'].fillna('无')
# Select columns
df_1 = df_1[['姓名', '第一职业', '第二职业']]
df_1.head()

Out[39]:

	姓名	第一职业	第二职业
0	阿朵	歌手	演员
1	黄圣依	演员	歌手
2	金莎	歌手	演员
3	孟佳	歌手	演员
4	王丽坤	演员	无
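
The extraction steps above can be sketched on a single string with the standard `re` module. This is only an illustration: the sample rows and the helper name `parse_profile` are hypothetical, and the space after the name is an assumption implied by the `\s` in the patterns (the printed preview does not show one).

```python
import re

# Hypothetical sample row; the whitespace after the name is assumed.
row = "阿朵 中国内地女歌手、演员代表作：水果姑娘、熊猫大侠、专列一号等"

def parse_profile(text):
    """Return (name, first_occupation, second_occupation) for one profile string."""
    # Lazy match up to the first whitespace character gives the name.
    name = re.search(r"(.*?)\s", text).group(1)
    # Everything between that whitespace and '代表作' is the occupation text.
    jobs = re.search(r"\s(.*?)代表作", text).group(1)
    # Remove the nationality/region prefixes, mirroring the str.replace step.
    jobs = re.sub("中国内地女|中国香港女|中国台湾女|美籍华语女|中国内地", "", jobs)
    parts = jobs.split("、")
    # '无' marks a missing second occupation, as in the fillna step.
    second = parts[1] if len(parts) > 1 else "无"
    return name, parts[0], second

print(parse_profile(row))
```

A profile that lists only one occupation falls back to '无' for the second one, which matches the fill value used in the notebook.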

Table df_2:

Extract the age from the birth-date column

In [40]:

# Birth date: keep only the digits, i.e. the age in years
df_2['出生日期'] = df_2['出生日期'].str.extract(r'(\d+)')
df_2.head()

Out[40]:

	本名	出生地	出生日期	出道年数	初舞台分数	最终排名	个人特质	成团潜力	舞台表现	声乐表现
0	蓝盈莹	上海	30	9	91	1	23	24	22	22
1	黄龄	上海	33	13	89	2	23	22	21	23
2	李斯丹妮	四川	30	9	87	3	23	23	23	18
3	孟佳	湖南	30	10	87	4	23	22	22	20
4	沈梦辰	湖南	31	10	86	5	23	21	23	19

In [41]:

# Merge the two tables
df_all = pd.merge(df_1, df_2, how='inner', left_on='姓名', right_on='本名')
# Drop the duplicate key column
df_all = df_all.drop('本名', axis=1)
# Rename the column
df_all = df_all.rename({'出生日期': '年龄'}, axis=1)
# Convert the type
df_all['年龄'] = df_all['年龄'].astype('int')
# Strip whitespace
df_all['第一职业'] = df_all['第一职业'].str.strip()
df_all['第二职业'] = df_all['第二职业'].str.strip()
# Clean up an anomalous value
df_all['第二职业'] = df_all['第二职业'].str.replace('戏剧', '')
df_all.head()

Out[41]:

	姓名	第一职业	第二职业	出生地	年龄	出道年数	初舞台分数	最终排名	个人特质	成团潜力	舞台表现	声乐表现
0	阿朵	歌手	演员	湖南	42	18	79	13	20	18	18	23
1	黄圣依	演员	歌手	上海	37	20	80	11	23	24	15	18
2	金莎	歌手	演员	上海	37	18	68	29	18	18	16	16
3	孟佳	歌手	演员	湖南	30	10	87	4	23	22	22	20
4	王丽坤	演员	无	内蒙古	35	16	72	26	19	18	15	20

In [42]:

# Merge the Douban popularity scores
df_all = pd.merge(df_all, df_3, left_on='姓名', right_on='本名')
df_all.drop('本名', axis=1, inplace=True)
df_all.rename({'normalized_num': '豆瓣热度'}, axis=1, inplace=True)
df_all.head()

Out[42]:

	姓名	第一职业	第二职业	出生地	年龄	出道年数	初舞台分数	最终排名	个人特质	成团潜力	舞台表现	声乐表现	豆瓣热度
0	阿朵	歌手	演员	湖南	42	18	79	13	20	18	18	23	9.1
1	黄圣依	演员	歌手	上海	37	20	80	11	23	24	15	18	52.9
2	金莎	歌手	演员	上海	37	18	68	29	18	18	16	16	13.8
3	孟佳	歌手	演员	湖南	30	10	87	4	23	22	22	20	11.4
4	王丽坤	演员	无	内蒙古	35	16	72	26	19	18	15	20	28.0

In [43]:

# Merge the danmaku popularity scores
df_all = pd.merge(df_all, df_4, left_on='姓名', right_on='本名', how='inner')
df_all.drop('本名', axis=1, inplace=True)
df_all.rename({'normalized_num': '弹幕热度'}, axis=1, inplace=True)
df_all.head()

Out[43]:

	姓名	第一职业	第二职业	出生地	年龄	出道年数	初舞台分数	最终排名	个人特质	成团潜力	舞台表现	声乐表现	豆瓣热度	弹幕热度
0	阿朵	歌手	演员	湖南	42	18	79	13	20	18	18	23	9.1	12.4
1	黄圣依	演员	歌手	上海	37	20	80	11	23	24	15	18	52.9	16.3
2	金莎	歌手	演员	上海	37	18	68	29	18	18	16	16	13.8	19.9
3	孟佳	歌手	演员	湖南	30	10	87	4	23	22	22	20	11.4	10.7
4	王丽坤	演员	无	内蒙古	35	16	72	26	19	18	15	20	28.0	22.9

Data analysis

Age distribution of the contestants

In [45]:

bins = [29, 33, 37, 41, 45, 49, 53]
bins_label = ['29-33', '33-37', '37-41', '41-45', '45-49', '49-53']
df_all['年龄段'] = pd.cut(df_all['年龄'], bins=bins, labels=bins_label, include_lowest=True)
age_num = df_all['年龄段'].value_counts().sort_index()
age_num

Out[45]:

29-33    11
33-37    10
37-41     4
41-45     1
45-49     3
49-53     1
Name: 年龄段, dtype: int64
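
Because the labels share their edges ('29-33', '33-37'), it helps to spell out which bin a boundary age lands in. By default `pd.cut` builds right-closed intervals, and `include_lowest=True` additionally admits the lowest edge. The pure-Python sketch below mimics that rule; the helper name `age_band` is hypothetical and is not part of the notebook.

```python
from bisect import bisect_left

bins = [29, 33, 37, 41, 45, 49, 53]
labels = ['29-33', '33-37', '37-41', '41-45', '45-49', '49-53']

def age_band(age):
    """Mimic right-closed bins (29, 33], (33, 37], ... with 29 admitted to the first bin."""
    if age < bins[0] or age > bins[-1]:
        return None  # outside every bin; pd.cut would give NaN here
    # bisect_left returns the index of the first edge >= age, so an age
    # equal to an edge stays in the bin that ends at that edge.
    idx = max(bisect_left(bins, age) - 1, 0)
    return labels[idx]

for age in (29, 33, 34, 37, 42):
    print(age, age_band(age))
```

Under this rule a 37-year-old is counted in '33-37' and a 42-year-old in '41-45', which agrees with the 年龄段 values shown for 黄圣依 and 阿朵 later in the notebook.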

In [47]:

data_pair = [list(z) for z in zip(age_num.index.tolist(), age_num.values.tolist())]
# Draw the pie chart
pie1 = Pie()
pie1.add('', data_pair, radius=['35%', '60%'], rosetype='radius')
pie1.set_global_opts(title_opts=opts.TitleOpts(title='年龄分布'),
                     legend_opts=opts.LegendOpts(orient='vertical', pos_top='15%', pos_left='2%'))
pie1.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
pie1.set_colors(['#EF9050', '#3B7BA9', '#6FB27C', '#FFAF34', '#D8BFD8', '#00BFFF'])
pie1.render_notebook()

Out[47]:

Occupation distribution of the contestants

In [48]:

df_all['value1'] = [1 if i != '无' else 0 for i in df_all['第一职业']]
df_all['value2'] = [1 if i != '无' else 0 for i in df_all['第二职业']]
df_all.head()

Out[48]:

	姓名	第一职业	第二职业	出生地	年龄	出道年数	初舞台分数	最终排名	个人特质	成团潜力	舞台表现	声乐表现	豆瓣热度	弹幕热度	年龄段	value1	value2
0	阿朵	歌手	演员	湖南	42	18	79	13	20	18	18	23	9.1	12.4	41-45	1	1
1	黄圣依	演员	歌手	上海	37	20	80	11	23	24	15	18	52.9	16.3	33-37	1	1
2	金莎	歌手	演员	上海	37	18	68	29	18	18	16	16	13.8	19.9	33-37	1	1
3	孟佳	歌手	演员	湖南	30	10	87	4	23	22	22	20	11.4	10.7	29-33	1	1
4	王丽坤	演员	无	内蒙古	35	16	72	26	19	18	15	20	28.0	22.9	33-37	1	0

In [52]:

set1 = set(pd.concat([df_all.第一职业, df_all.第二职业]))
set1.remove('无')
set1

Out[52]:

{'主持人', '模特', '歌手', '演员', '音乐人'}

In [54]:

# Build the nodes: one per contestant, then one per occupation
nodes = []
for i in df_all['姓名']:
    dict1 = {}
    dict1['name'] = i
    nodes.append(dict1)
for j in set1:
    dict2 = {}
    dict2['name'] = j
    nodes.append(dict2)

nodes

Out[54]:

[{'name': '阿朵'},
 {'name': '黄圣依'},
 {'name': '金莎'},
 {'name': '孟佳'},
 {'name': '王丽坤'},
 {'name': '许飞'},
 {'name': '张雨绮'},
 {'name': '郑希怡'},
 {'name': '白冰'},
 {'name': '黄龄'},
 {'name': '蓝盈莹'},
 {'name': '宁静'},
 {'name': '王霏霏'},
 {'name': '郁可唯'},
 {'name': '张含韵'},
 {'name': '朱婧汐'},
 {'name': '陈松伶'},
 {'name': '海陆'},
 {'name': '李斯丹妮'},
 {'name': '吴昕'},
 {'name': '王智'},
 {'name': '伊能静'},
 {'name': '张萌'},
 {'name': '丁当'},
 {'name': '金晨'},
 {'name': '刘芸'},
 {'name': '沈梦辰'},
 {'name': '万茜'},
 {'name': '袁咏琳'},
 {'name': '钟丽缇'},
 {'name': '歌手'},
 {'name': '音乐人'},
 {'name': '演员'},
 {'name': '模特'},
 {'name': '主持人'}]

In [57]:

# Keep only contestants who have a second occupation
df_sel = df_all[df_all['第二职业'] != '无']
# Build the links: contestant -> first occupation, then contestant -> second occupation
links = []
for i, j, z in zip(df_all['姓名'], df_all['第一职业'], df_all['value1']):
    dict1 = {}
    dict1['source'] = i
    dict1['target'] = j
    dict1['value'] = z
    links.append(dict1)
for i, j, z in zip(df_sel['姓名'], df_sel['第二职业'], df_sel['value2']):
    dict1 = {}
    dict1['source'] = i
    dict1['target'] = j
    dict1['value'] = z
    links.append(dict1)

links

Out[57]:

[{'source': '阿朵', 'target': '歌手', 'value': 1},
 {'source': '黄圣依', 'target': '演员', 'value': 1},
 {'source': '金莎', 'target': '歌手', 'value': 1},
 {'source': '孟佳', 'target': '歌手', 'value': 1},
 {'source': '王丽坤', 'target': '演员', 'value': 1},
 {'source': '许飞', 'target': '歌手', 'value': 1},
 {'source': '张雨绮', 'target': '演员', 'value': 1},
 {'source': '郑希怡', 'target': '歌手', 'value': 1},
 {'source': '白冰', 'target': '演员', 'value': 1},
 {'source': '黄龄', 'target': '歌手', 'value': 1},
 {'source': '蓝盈莹', 'target': '演员', 'value': 1},
 {'source': '宁静', 'target': '演员', 'value': 1},
 {'source': '王霏霏', 'target': '歌手', 'value': 1},
 {'source': '郁可唯', 'target': '歌手', 'value': 1},
 {'source': '张含韵', 'target': '歌手', 'value': 1},
 {'source': '朱婧汐', 'target': '歌手', 'value': 1},
 {'source': '陈松伶', 'target': '演员', 'value': 1},
 {'source': '海陆', 'target': '演员', 'value': 1},
 {'source': '李斯丹妮', 'target': '歌手', 'value': 1},
 {'source': '吴昕', 'target': '主持人', 'value': 1},
 {'source': '王智', 'target': '演员', 'value': 1},
 {'source': '伊能静', 'target': '歌手', 'value': 1},
 {'source': '张萌', 'target': '演员', 'value': 1},
 {'source': '丁当', 'target': '歌手', 'value': 1},
 {'source': '金晨', 'target': '演员', 'value': 1},
 {'source': '刘芸', 'target': '演员', 'value': 1},
 {'source': '沈梦辰', 'target': '主持人', 'value': 1},
 {'source': '万茜', 'target': '演员', 'value': 1},
 {'source': '袁咏琳', 'target': '歌手', 'value': 1},
 {'source': '钟丽缇', 'target': '演员', 'value': 1},
 {'source': '阿朵', 'target': '演员', 'value': 1},
 {'source': '黄圣依', 'target': '歌手', 'value': 1},
 {'source': '金莎', 'target': '演员', 'value': 1},
 {'source': '孟佳', 'target': '演员', 'value': 1},
 {'source': '郑希怡', 'target': '演员', 'value': 1},
 {'source': '黄龄', 'target': '演员', 'value': 1},
 {'source': '王霏霏', 'target': '演员', 'value': 1},
 {'source': '张含韵', 'target': '演员', 'value': 1},
 {'source': '朱婧汐', 'target': '音乐人', 'value': 1},
 {'source': '陈松伶', 'target': '歌手', 'value': 1},
 {'source': '李斯丹妮', 'target': '演员', 'value': 1},
 {'source': '吴昕', 'target': '演员', 'value': 1},
 {'source': '伊能静', 'target': '演员', 'value': 1},
 {'source': '金晨', 'target': '模特', 'value': 1},
 {'source': '沈梦辰', 'target': '演员', 'value': 1},
 {'source': '万茜', 'target': '歌手', 'value': 1},
 {'source': '袁咏琳', 'target': '演员', 'value': 1}]
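
The node and link lists can also be built with comprehensions. The sketch below is one possible compact form, not the notebook's own code: it runs on a hypothetical three-row sample in place of `df_all`, and the helper name `build_sankey_data` is made up for illustration.

```python
# Hypothetical three-contestant sample standing in for df_all.
rows = [
    {"姓名": "阿朵", "第一职业": "歌手", "第二职业": "演员"},
    {"姓名": "王丽坤", "第一职业": "演员", "第二职业": "无"},
    {"姓名": "金晨", "第一职业": "演员", "第二职业": "模特"},
]

def build_sankey_data(rows):
    """Return (nodes, links) in the dict format that a Sankey chart expects."""
    # All distinct occupations, excluding the '无' placeholder.
    jobs = sorted({r[k] for r in rows for k in ("第一职业", "第二职业")} - {"无"})
    nodes = [{"name": r["姓名"]} for r in rows] + [{"name": j} for j in jobs]
    # One link per first occupation, plus one per real second occupation.
    links = [{"source": r["姓名"], "target": r["第一职业"], "value": 1} for r in rows]
    links += [
        {"source": r["姓名"], "target": r["第二职业"], "value": 1}
        for r in rows
        if r["第二职业"] != "无"
    ]
    return nodes, links

nodes, links = build_sankey_data(rows)
print(len(nodes), len(links))
```

Sorting the occupations makes the node order deterministic, whereas iterating over a plain `set` does not guarantee one.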

In [58]:

colors = ['#54B4F9', '#F29150', '#FF7BAE', '#D69AC0', '#485CE0', '#28BE7A']
s = Sankey(init_opts=opts.InitOpts(width='1350px', height='1350px'))
s.set_colors(colors)
s.add('',
      nodes,
      links,
      pos_left='10%',
      pos_right='55%',
      linestyle_opt=opts.LineStyleOpts(opacity=0.2, curve=0.5, color='source'),
      itemstyle_opts=opts.ItemStyleOpts(border_width=1, border_color="#aaa"),
      tooltip_opts=opts.TooltipOpts(trigger_on="mousemove"),
      is_draggable=True,
      label_opts=opts.LabelOpts(position="left",
                                # horizontal_align='left',
                                font_family='Arial',
                                margin=10,
                                font_size=13,
                                font_style='italic')
      )
s.set_global_opts(title_opts=opts.TitleOpts(title='乘风破浪的姐姐演员第一/二职业分布'))
s.render_notebook()

Out[58]:

In [61]:

df_all[df_all.第一职业.str.contains('演员|歌手') & df_all.第二职业.str.contains('演员|歌手')].shape

Out[61]:

(13, 17)

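Interpretation:

Most of these artists started out as actresses or singers, and holding several occupations is common: at least 17 of the 30 have more than one occupation, and 13 of them are both an actress and a singer.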

Birthplace distribution of the contestants

In [62]:

province_num = df_all['出生地'].value_counts()
province_num[:5]

Out[62]:

上海    5
湖南    5
四川    3
山东    2
辽宁    2
Name: 出生地, dtype: int64

In [63]:

# Map
map1 = Map()
map1.add("", [list(z) for z in zip(province_num.index.tolist(), province_num.values.tolist())],
         maptype='china')
map1.set_global_opts(title_opts=opts.TitleOpts(title='籍贯分布'),
                     visualmap_opts=opts.VisualMapOpts(max_=4))
map1.render_notebook()

Out[63]:

Key factors behind the sisters' scores

In [64]:

# Correlation coefficients
data_corr = df_all[['年龄', '出道年数', '个人特质', '成团潜力', '舞台表现', '声乐表现', '初舞台分数']].corr()
data_corr

Out[64]:

	年龄	出道年数	个人特质	成团潜力	舞台表现	声乐表现	初舞台分数
年龄	1.000000	0.898560	-0.424288	-0.431672	-0.372400	0.275562	-0.310940
出道年数	0.898560	1.000000	-0.330552	-0.339932	-0.380076	0.265712	-0.253856
个人特质	-0.424288	-0.330552	1.000000	0.861874	0.485876	0.079038	0.854642
成团潜力	-0.431672	-0.339932	0.861874	1.000000	0.340230	-0.127461	0.718482
舞台表现	-0.372400	-0.380076	0.485876	0.340230	1.000000	0.101970	0.693657
声乐表现	0.275562	0.265712	0.079038	-0.127461	0.101970	1.000000	0.459155
初舞台分数	-0.310940	-0.253856	0.854642	0.718482	0.693657	0.459155	1.000000

In [66]:

plt.figure(figsize=(15, 10))
sns.heatmap(data_corr, linewidths=0, cmap='tab20c_r', annot=True)
plt.title('影响姐姐们得分的相关因素分析', fontdict={'fontsize': 'xx-large', 'fontweight': 'heavy'})
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()

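Interpretation:

We use Python to compute the Pearson correlation coefficient between the numeric variables. As a rule of thumb, the strength of a correlation can be graded by the value of r: |r| >= 0.8 counts as high, 0.5 <= |r| < 0.8 as moderate, 0.3 <= |r| < 0.5 as low, and |r| < 0.3 as uncorrelated. Based on the coefficients, at the 95% confidence level:

1. The first-stage score is not significantly correlated with age (r = -0.31) or with years since debut (r = -0.25).
2. Age has a low negative correlation with the personal-traits and group-potential scores (r = -0.42 and r = -0.43): the older the contestant, the lower these two scores tend to be.
3. The personal-traits and group-potential scores are highly positively correlated (r = 0.86): they tend to be high together or low together.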
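
The notebook does not show how significance at the 95% level was checked. One common approach, sketched here as an assumption rather than as the author's method, converts r into a t statistic with n - 2 degrees of freedom and compares it with the two-sided critical value. The sample size n = 30 comes from the 30 contestants; the constant 2.048 is the two-sided 5% critical value of Student's t for 28 degrees of freedom, and the helper names are hypothetical.

```python
import math

N = 30          # number of contestants in df_all
T_CRIT = 2.048  # two-sided 5% critical value of Student's t, 28 degrees of freedom (valid only for n = 30)

def pearson_t(r, n):
    """t statistic for testing whether a Pearson r differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r * r))

def is_significant(r, n=N, t_crit=T_CRIT):
    """True when |t| exceeds the critical value, i.e. p < 0.05 two-sided."""
    return abs(pearson_t(r, n)) > t_crit

# Coefficients copied from the correlation table above.
checks = {
    "初舞台分数 vs 年龄": -0.310940,
    "初舞台分数 vs 出道年数": -0.253856,
    "年龄 vs 个人特质": -0.424288,
    "个人特质 vs 成团潜力": 0.861874,
}
for label, r in checks.items():
    print(label, round(pearson_t(r, N), 2), is_significant(r))
```

With these inputs the first two pairs stay below the critical value and the last two exceed it, which is consistent with the three conclusions listed above.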

Douban data analysis for Sisters Who Make Waves

In [67]:

df = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\乘风破浪豆瓣短评.xlsx')
df.shape

Out[67]:

(500, 6)

In [74]:

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   user_name     499 non-null    object
 1   rating_num    500 non-null    object
 2   comment_time  500 non-null    object
 3   content       500 non-null    object
 4   votes         500 non-null    int64 
dtypes: int64(1), object(4)
memory usage: 19.7+ KB

Out[74]:

	user_name	rating_num	comment_time	content	votes
0	阿丘	5星	2020-06-12 15:13:29	芒果能出个无杜华版吗？她影响了我看喜剧的好心情。	5752
1	蓝色的椰子	4星	2020-06-12 12:46:37	宁静:还要我介绍我是谁？我这几十年都白干了。	4023
2	丁丁丁丁丁丶	4星	2020-06-12 12:15:10	开播前半小时才通知，0宣传，微博热搜榜也在冻结，太野太任性了吧	3343
3	承宸	3星	2020-06-12 12:16:22	三十而立，三十而励，三十而骊。很开心中国也开始宣传“老女孩”的美了。看美剧欲望都市的时候，好...	1906
4	你知不知道	5星	2020-06-12 12:45:31	姐姐比妹妹的搞头可多太多了！	3483

In [69]:

df.duplicated().sum()

Out[69]:

In [70]:

df.isnull().sum()

Out[70]:

user_name       1
user_url        0
rating_num      0
comment_time    0
content         0
votes           0
dtype: int64

In [71]:

# Drop the column
df = df.drop('user_url', axis=1)

In [73]:

# Define a function that maps Douban's rating labels to star ratings
def tranform_rating(x):
    if x == '很差':
        return '1星'
    elif x == '较差':
        return '2星'
    elif x == '还行':
        return '3星'
    elif x == '推荐':
        return '4星'
    elif x == '力荐':
        return '5星'
    else:
        return '4星'  # fall back to the mode (the most common rating)


df['rating_num'] = df['rating_num'].apply(tranform_rating)
df.rating_num.value_counts()

Out[73]:

4星    191
5星    125
3星     77
2星     61
1星     46
Name: rating_num, dtype: int64
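
The same label-to-stars mapping can be written as a dictionary lookup with a default. This is a sketch of an equivalent form, not the notebook's code; the names `RATING_MAP` and `to_stars` are hypothetical.

```python
# Douban rating labels mapped to star ratings, as in the if/elif chain.
RATING_MAP = {
    '很差': '1星',
    '较差': '2星',
    '还行': '3星',
    '推荐': '4星',
    '力荐': '5星',
}

def to_stars(label, default='4星'):
    """Look up the star rating; unknown or missing labels get the mode ('4星')."""
    return RATING_MAP.get(label, default)

print([to_stars(x) for x in ['力荐', '很差', None]])
```

On a DataFrame, a comparable result should be reachable with `df['rating_num'].map(RATING_MAP).fillna('4星')`, assuming every non-matching value is meant to fall back to the mode.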

Overall rating distribution

In [77]:

rating_num = df.rating_num.value_counts().sort_index()
rating_num

Out[77]:

1星     46
2星     61
3星     77
4星    191
5星    125
Name: rating_num, dtype: int64

In [79]:

# Data pairs
data_pair = [list(z) for z in zip(rating_num.index.tolist(), rating_num.values.tolist())]
# Draw the pie chart
pie1 = Pie()
pie1.add('', data_pair=data_pair, radius=['35%', '60%'])
pie1.set_global_opts(title_opts=opts.TitleOpts(title='总体评分分布'),
                     # toolbox_opts=opts.ToolboxOpts(),
                     legend_opts=opts.LegendOpts(orient='vertical', pos_top='15%', pos_left='2%'))
pie1.set_series_opts(label_opts=opts.LabelOpts(formatter="{b}:{d}%"))
pie1.set_colors(['#AC613C', '#6BA3D6', '#F4737A', '#32A251', '#B5DFFD'])
pie1.render_notebook()

Out[79]:

Short-review volume over time

In [89]:

comment_date = pd.to_datetime(df['comment_time']).dt.date.value_counts().sort_index()
comment_date

Out[89]:

2020-06-12    167
2020-06-13    157
2020-06-14     65
2020-06-15     31
2020-06-16     22
2020-06-17     11
2020-06-18     12
2020-06-19     16
2020-06-20     19
Name: comment_time, dtype: int64

In [91]:

# Line chart
from pyecharts.charts import Line

line1 = Line()
line1.add_xaxis(comment_date.index.tolist())
line1.add_yaxis('', comment_date.values.tolist(),
                label_opts=opts.LabelOpts(is_show=False))
line1.set_global_opts(title_opts=opts.TitleOpts(title='评论数量走势图'),
                      visualmap_opts=opts.VisualMapOpts(max_=140))
line1.set_series_opts(linestyle_opts=opts.LineStyleOpts(width=4))
line1.render_notebook()

Out[91]:

In [94]:

import jieba


def get_cut_words(content_series):
    # Load the stop-word list
    stop_words = []

    with open(r"E:\py练习\数据分析\stop_words.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())

    # Register custom keywords so jieba keeps them as single tokens
    my_words = ['杜华', '辣鸡', '导演组', '节目组', '不公平', '黄圣依', '无杜华版']
    for i in my_words:
        jieba.add_word(i)

    # Custom stop words
    my_stop_words = ['第一期', '一堆', '三个', '真的']
    stop_words.extend(my_stop_words)

    # Tokenize the concatenated text
    word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)

    # Keep tokens that are not stop words and have at least two characters
    word_num_selected = [i for i in word_num if i not in stop_words and len(i) >= 2]

    return word_num_selected

In [95]:

# Select subsets: negative (1-2 stars) and positive (4-5 stars) reviews
df1 = df[(df['rating_num'] == '1星') | (df['rating_num'] == '2星')]
df2 = df[(df['rating_num'] == '4星') | (df['rating_num'] == '5星')]
# Call the function
text1 = get_cut_words(content_series=df1.content)
text1[:5]

Building prefix dict from the default dictionary ...
Dumping model to file cache C:\Users\ASUS\AppData\Local\Temp\jieba.cache
Loading model cost 0.880 seconds.
Prefix dict has been built successfully.

Out[95]:

['不公平', '选秀', '节目', '杜华', '辣鸡']

In [96]:

import stylecloud
from IPython.display import Image

In [98]:

# Draw the word cloud
stylecloud.gen_stylecloud(text=' '.join(text1),
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-thumbs-down',
                          size=653,
                          output_name='豆瓣负向评分词云.png')
Image(filename='豆瓣负向评分词云.png')

Out[98]:

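Complaints:

Judge 杜华: "trash"; her scoring of 丁当 was infuriating; women over 30 carry the charm of their years, yet the judges applied the standards of a girl group in its twenties
黄晓明: "绿大暗"-style hosting that piles awkwardness on awkwardness and cannot liven up the atmosphere; a "greasy middle-aged man"
Production team: shabby staging; poor camera work, poor lighting, poor set design
黄圣依: "I'll change my rating to five stars once 黄圣依 is eliminated, thanks."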

In [ ]:

Word cloud for users who gave 4 or 5 stars on Douban

In [99]:

# Call the function
text2 = get_cut_words(content_series=df2.content)
text2[:5]

Out[99]:

['芒果', '出个', '无杜华版', '影响', '喜剧']

In [100]:

# Draw the word cloud from the positive reviews (text2, not text1)
stylecloud.gen_stylecloud(text=' '.join(text2),
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-thumbs-up',
                          size=653,
                          output_name='豆瓣正向评分词云.png')
Image(filename='豆瓣正向评分词云.png')

Out[100]:

Mango TV danmaku data analysis for Sisters Who Make Waves

In [101]:

# Load the data
df = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\芒果tv弹幕6.19.xlsx')
df.head()

Out[101]:

	episodes	danmu_id	uname	content	danmu_time	up_count	danmu_minites
0	第一集上	6837353528928491520	最强舞担	昕昕子你是最美的！！！	60923	16.0	1
1	第一集上	6837381424741214208	NaN	姐姐们加油啊	60164	4.0	1
2	第一集上	6839011343358140416	NaN	爱了爱了，姐姐们都好强呀！	60000	3.0	1
3	第一集上	6837810908587182080	NaN	旁白小哥哥是如果国宝会说话的小哥哥吗	60981	2.0	1
4	第一集上	6837348194579084288	气场强大	孟佳冲鸭！！！	60341	4.0	1

In [102]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94575 entries, 0 to 94574
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   episodes       94575 non-null  object 
 1   danmu_id       94575 non-null  int64  
 2   uname          4613 non-null   object 
 3   content        94575 non-null  object 
 4   danmu_time     94575 non-null  int64  
 5   up_count       86667 non-null  float64
 6   danmu_minites  94575 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 5.1+ MB

In [103]:

# Convert the type
df['content'] = df.content.astype('str')

In [104]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 94575 entries, 0 to 94574
Data columns (total 7 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   episodes       94575 non-null  object 
 1   danmu_id       94575 non-null  int64  
 2   uname          4613 non-null   object 
 3   content        94575 non-null  object 
 4   danmu_time     94575 non-null  int64  
 5   up_count       86667 non-null  float64
 6   danmu_minites  94575 non-null  int64  
dtypes: float64(1), int64(3), object(3)
memory usage: 5.1+ MB

In [105]:

df_player = pd.read_excel(r'E:\py练习\数据分析\乘风破浪\数据\维基百科数据.xlsx', usecols=['本名'])
df_player.head()

Out[105]:

	本名
0	蓝盈莹
1	黄龄
2	李斯丹妮
3	孟佳
4	沈梦辰

In [106]:

df_player['昵称'] = ['蓝盈莹|盈莹', '黄龄', '丹妮', '孟佳', '梦辰', 
                     '可唯', '宁静|静静子|静姐', '霏霏', '希怡', '袁咏琳',
                     '圣依|依依子', '金晨', '阿朵', '含韵', '白冰',
                     '钟丽缇', '茜', '张萌|萌萌子', '婧汐', '丁当',
                     '许飞', '刘芸|芸芸子', '吴昕|昕昕子|昕姐|昕昕', '伊能静', '松伶',
                     '丽坤', '张雨绮|雨绮|绮绮子', '海陆', '金莎', '王智']
df_player.head()

Out[106]:

	本名	昵称
0	蓝盈莹	蓝盈莹\|盈莹
1	黄龄	黄龄
2	李斯丹妮	丹妮
3	孟佳	孟佳
4	沈梦辰	梦辰

In [107]:

nick_names = df_player.昵称.values.tolist()
nick_names[:5]

Out[107]:

['蓝盈莹|盈莹', '黄龄', '丹妮', '孟佳', '梦辰']

In [108]:

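names_countall = []
for name in nick_names:
    # Filter danmaku that mention the contestant ('|' acts as regex alternation)
    df_sel = df[df.content.str.contains(name)]
    # Number of matching danmaku
    names_count = df_sel.shape[0]
    # Append to the result list
    names_countall.append(names_count)

names_countall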

Out[108]:

In [109]:

df_player['names_count'] = names_countall
df_player.head()

Out[109]:

	本名	昵称	names_count
0	蓝盈莹	蓝盈莹\|盈莹	673
1	黄龄	黄龄	653
2	李斯丹妮	丹妮	470
3	孟佳	孟佳	523
4	沈梦辰	梦辰	1107
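
The counting loop relies on `str.contains` treating each nickname string as a regular expression, so `'宁静|静静子|静姐'` matches a danmaku that contains any of the three spellings, and each danmaku is counted at most once per contestant. A minimal sketch with the standard `re` module follows; the first three comments are taken from the data preview above, the last two are hypothetical, and `count_mentions` is a made-up helper name.

```python
import re

comments = [
    "昕昕子你是最美的！！！",   # from the data preview
    "姐姐们加油啊",             # from the data preview
    "孟佳冲鸭！！！",           # from the data preview
    "宁静姐姐太飒了",           # hypothetical
    "静姐和昕姐",               # hypothetical
]
nick_names = ["吴昕|昕昕子|昕姐|昕昕", "孟佳", "宁静|静静子|静姐"]

def count_mentions(comments, pattern):
    """Count comments in which the alternation pattern matches at least once."""
    regex = re.compile(pattern)
    return sum(1 for c in comments if regex.search(c))

# Key each count by the first alternative, which is the most formal name here.
counts = {p.split("|")[0]: count_mentions(comments, p) for p in nick_names}
print(counts)
```

One consequence of substring matching is that very short nicknames, such as the single character '茜', can also match unrelated words, so those counts may be inflated.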

In [110]:

# Min-max normalization to a 0-100 scale
min_max_range = df_player.names_count.max() - df_player.names_count.min()
normalized_num = (df_player.names_count - df_player.names_count.min()) / min_max_range
df_player['normalized_num'] = round(normalized_num*100, 1)
df_player.head()

Out[110]:

	本名	昵称	names_count	normalized_num
0	蓝盈莹	蓝盈莹\|盈莹	673	15.3
1	黄龄	黄龄	653	14.7
2	李斯丹妮	丹妮	470	9.1
3	孟佳	孟佳	523	10.7
4	沈梦辰	梦辰	1107	28.4
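
The normalization above is min-max scaling: the smallest count maps to 0, the largest to 100, and everything else falls proportionally in between. The sketch below shows the arithmetic on a small hypothetical list of counts (not the notebook's data); the function name `min_max_scale` is an assumption for illustration.

```python
def min_max_scale(values, scale=100, ndigits=1):
    """Rescale values linearly so that min -> 0 and max -> scale."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are equal; avoid dividing by zero.
        return [0.0 for _ in values]
    return [round((v - lo) / span * scale, ndigits) for v in values]

# Hypothetical mention counts for three contestants.
scaled = min_max_scale([100, 150, 300])
print(scaled)
```

Because the scale is anchored to the extremes, a single contestant with a very high count compresses everyone else toward the low end of the range.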

In [111]:

mangguo_rank = df_player.sort_values('normalized_num', ascending=True)[['本名', 'normalized_num']]
mangguo_rank.head()

Out[111]:

	本名	normalized_num
9	袁咏琳	0.0
18	朱婧汐	0.4
29	王智	0.7
24	陈松伶	4.4
21	刘芸	6.6

In [112]:

mangguo_rank.to_excel(r'E:\py练习\数据分析\乘风破浪\数据\选手弹幕热度排名.xlsx', index=False)

In [115]:

# Bar chart
bar1 = Bar(init_opts=opts.InitOpts(width='700px', height='1000px'))
bar1.add_xaxis(mangguo_rank['本名'].values.tolist())
bar1.add_yaxis('', mangguo_rank['normalized_num'].values.tolist())
bar1.set_global_opts(title_opts=opts.TitleOpts(title='选手芒果tv第一期弹幕热度排名(标准化)'),
                     visualmap_opts=opts.VisualMapOpts(max_=80))
bar1.set_series_opts(label_opts=opts.LabelOpts(position='right'))
bar1.reversal_axis()
bar1.render_notebook()

Out[115]:

Danmaku word cloud

In [117]:

my_words_list = df_player.昵称.str.cat(sep='。').replace('|', '。').split('。')
print(my_words_list)

['蓝盈莹', '盈莹', '黄龄', '丹妮', '孟佳', '梦辰', '可唯', '宁静', '静静子', '静姐', '霏霏', '希怡', '袁咏琳', '圣依', '依依子', '金晨', '阿朵', '含韵', '白冰', '钟丽缇', '茜', '张萌', '萌萌子', '婧汐', '丁当', '许飞', '刘芸', '芸芸子', '吴昕', '昕昕子', '昕姐', '昕昕', '伊能静', '松伶', '丽坤', '张雨绮', '雨绮', '绮绮子', '海陆', '金莎', '王智']

In [118]:

def get_cut_words(content_series):
    import jieba

    # Load the stop-word list
    stop_words = []

    with open(r"E:\py练习\数据分析\stop_words.txt", 'r', encoding='utf-8') as f:
        lines = f.readlines()
        for line in lines:
            stop_words.append(line.strip())

    # Register custom keywords so jieba keeps them as single tokens
    my_words = ['杜华', '辣鸡', '导演组', '节目组', '不公平', '黄圣依', '无杜华版']
    for i in my_words:
        jieba.add_word(i)

    # Register the contestants' names and nicknames as well
    my_words2 = my_words_list
    for j in my_words2:
        jieba.add_word(j)

    # Custom stop words
    my_stop_words = ['第一期', '一堆', '三个', '真的', '哈哈哈', '哈哈哈哈', '啊啊啊']
    stop_words.extend(my_stop_words)

    # Tokenize the concatenated text
    word_num = jieba.lcut(content_series.str.cat(sep='。'), cut_all=False)

    # Keep tokens that are not stop words and have at least two characters
    word_num_selected = [i for i in word_num if i not in stop_words and len(i) >= 2]

    return word_num_selected

In [119]:

text1 = get_cut_words(content_series=df.content)
text1[:5]

Out[119]:

['昕昕子', '最美', '姐姐', '加油', '姐姐']

In [121]:

# Draw the word cloud
stylecloud.gen_stylecloud(text=' '.join(text1),
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-heart',
                          size=653,
                          output_name='芒果TV第一期整体弹幕词云.png')
Image(filename='芒果TV第一期整体弹幕词云.png')

Out[121]:

In [122]:

text2 = get_cut_words(df.content[df.content.str.contains('宁静|静静子|静姐')])
text2[:5]

Out[122]:

['宁静', '宁静', '我来', '静静子', '昕姐']

In [123]:

stylecloud.gen_stylecloud(text=' '.join(text2),
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-guitar',
                          size=653,
                          output_name='芒果TV弹幕词云-宁静.png')
Image(filename='芒果TV弹幕词云-宁静.png')

Out[123]:

In [124]:

text4 = get_cut_words(df.content[df.content.str.contains('蓝盈莹|盈莹')])
text4[:5]

Out[124]:

['蓝盈莹', '姐姐', '蓝盈莹', '蓝盈莹', '冲冲']

In [125]:

stylecloud.gen_stylecloud(text=' '.join(text4),
                          collocations=False,
                          font_path=r'C:\Windows\Fonts\msyh.ttc',
                          icon_name='fas fa-star',
                          size=653,
                          output_name='芒果TV弹幕词云-蓝盈莹.png')
Image(filename='芒果TV弹幕词云-蓝盈莹.png')

Out[125]:

In [ ]: