EDG won the championship: analyzing 30,000 bullet comments with a crawler, data analysis, natural language processing (sentiment analysis), and data visualization. The fans went wild (original article)

Original work takes effort. Plagiarism and unauthorized reprinting of this article are prohibited, and infringements will be pursued!

1. EDG Championship Information

On November 6th, in the League of Legends Finals, the EDG team defeated the South Korean team 3:2 and won the 2021 League of Legends Global Finals championship. The match also drew a lot of attention on major platforms across the Internet:

1. It was the number one trending topic on Weibo. As of 2021-11-10, it had 100 million views, with 6.384 million Weibo followers.
2. On Bilibili its popularity index reached the hundreds of millions, with a total of 223,000 bullet comments (danmaku). It ranked No. 2 on the site-wide ranking list, with 2.199 million Bilibili followers.
3. It drew 8 million viewers on video platforms such as Tencent, iQiyi, and Youku.
4. It was also very popular on Huya and other live streaming platforms.
5. CCTV News also posted on Weibo to celebrate EDG's victory.

Since the match is this popular, we take Bilibili as the reference platform this time: we collect 30,000 bullet comments from Bilibili's EDG championship video, then use Python to analyze them and get a feel for the fans' enthusiasm.


2. Practical goals

2.1 Web crawler

Use a crawler to capture 30,000 bullet comments from Bilibili's videos of the EDG championship match.

2.2 Data visualization (word cloud map)

Analyze and visualize the captured bullet comments with Python libraries such as jieba and numpy.

2.3 Natural Language Processing (Sentiment Analysis)

Use pandas and natural language processing (NLP) to run sentiment analysis on the 30,000 bullet comments from the EDG championship videos, and draw some conclusions from the results.



3. Analysis of bilibili interface

First open the video page of the EDG championship match:
https://www.bilibili.com/video/BV1EP4y1j7kV?p=1

Bilibili has compiled the EDG match videos for everyone. From the opening ceremony to the championship moment, there are 7 videos in total.
Bilibili's bullet-comment data interface:

http://api.bilibili.com/x/v1/dm/list.so?oid=XXX

This is Bilibili's dedicated interface for bullet-comment data, and we can use it directly. The oid in this interface is the unique numeric identifier of a video: every video has its own oid. Once we find the oid, we can request the bullet-comment API for the corresponding match video and capture its bullet comments.
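As a minimal sketch of how the oid plugs into this interface, the helper below builds the request URL with the standard library. The function name `build_danmaku_url` is hypothetical and not part of the original project; the base URL is the one quoted above, and the example oid is the cid of the first game that appears later in this article.

```python
from urllib.parse import urlencode

# Base endpoint quoted in the article.
DANMAKU_API = "http://api.bilibili.com/x/v1/dm/list.so"


def build_danmaku_url(oid: int) -> str:
    """Return the bullet-comment URL for one video part (hypothetical helper)."""
    query = urlencode({"oid": oid})
    return DANMAKU_API + "?" + query


if __name__ == "__main__":
    # 437586584 is the cid of the first game listed later in this article.
    print(build_danmaku_url(437586584))
```

Using `urlencode` rather than plain string concatenation is only a habit that keeps the query string valid if more parameters are added later; for a single numeric oid both approaches give the same URL.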

Getting the oid
Open the browser's developer tools, switch to the Network tab, and find the request whose name starts with pagelist.
[Figure: the pagelist request in the Network tab]
Then copy its Request URL and open it directly in a new window.
[Figure: the Request URL opened in a new window]
Requesting this API directly returns data in JSON format, and the cid values in it are the oids we need:

{
    "code": 0,
    "message": "0",
    "ttl": 1,
    "data": [
        {"cid": 437586584, "page": 1, "from": "vupload", "part": "第一局 4K", "duration": 2952, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}},
        {"cid": 437626309, "page": 2, "from": "vupload", "part": "第二局 4K", "duration": 3031, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}},
        {"cid": 437659159, "page": 3, "from": "vupload", "part": "第三局 4K", "duration": 3406, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}},
        {"cid": 437727348, "page": 4, "from": "vupload", "part": "第四局 4K", "duration": 3212, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}},
        {"cid": 437729555, "page": 5, "from": "vupload", "part": "第五局 4K", "duration": 3478, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}},
        {"cid": 437550300, "page": 6, "from": "vupload", "part": "开幕式", "duration": 984, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}},
        {"cid": 437717574, "page": 7, "from": "vupload", "part": "夺冠时刻", "duration": 2017, "vid": "", "weblink": "", "dimension": {"width": 1920, "height": 1080, "rotate": 0}}
    ]
}

Alternatively, click the **Preview** tab and expand data: the JSON there is shown **folded**, and it contains the same cid values.

[Figure: the Preview tab with the data array expanded]
Each cid corresponds to one video part (第一局 through 第五局 are Games 1 to 5, 开幕式 is the opening ceremony, and 夺冠时刻 is the championship moment). The Response tab shows the raw, unfolded data, which is identical to the JSON returned by requesting the Request URL directly.
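To make the mapping between cid and video part explicit, here is a small sketch that parses a trimmed copy of the response. The sample keeps only two of the seven entries and drops the fields that are not needed; `list_parts` is a hypothetical helper, not a function from the original project.

```python
import json

# Trimmed copy of the pagelist response (two of the seven entries,
# unused fields omitted).
SAMPLE = """
{"code": 0, "message": "0", "ttl": 1, "data": [
  {"cid": 437586584, "page": 1, "part": "第一局 4K", "duration": 2952},
  {"cid": 437717574, "page": 7, "part": "夺冠时刻", "duration": 2017}
]}
"""


def list_parts(raw: str) -> list[tuple[int, int, str]]:
    """Return (page, cid, part) for every entry in a pagelist response."""
    payload = json.loads(raw)
    if payload.get("code") != 0:
        raise ValueError(f"API returned an error: {payload.get('message')}")
    return [(item["page"], item["cid"], item["part"]) for item in payload["data"]]


if __name__ == "__main__":
    for page, cid, part in list_parts(SAMPLE):
        print(page, cid, part)
```

Checking the `code` field first is an assumption about how one might guard against an error response; the article itself only shows the successful case with `"code": 0`.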


4. Coding

4.1 Crawling data

Define a function to get the cid values:

import json

import requests


def get_cid():
  """Fetch the page list of the video; return the raw JSON text, or None on failure."""
  url = 'https://api.bilibili.com/x/player/pagelist?bvid=BV1EP4y1j7kV&jsonp=jsonp'
  try:
    # A finite timeout keeps the request from hanging forever (10 s is an arbitrary choice)
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
  except requests.RequestException as e:
    print(e.args)
    return None


if __name__ == '__main__':
  data = get_cid()
  if data is not None:
    json_data = json.loads(data)
    for cid_datas in json_data['data']:
      cid = cid_datas.get('cid')
      print(cid)

[Figure: console output]
Next, build the bullet-comment API URL for each cid:

if __name__ == '__main__':
  data = get_cid()
  json_data = json.loads(data)
  base_api = 'http://api.bilibili.com/x/v1/dm/list.so?oid='
  for cid_datas in json_data['data']:
    cid = cid_datas.get('cid')
    detail_api = base_api + str(cid)
    print(detail_api)

[Figure: console output]

There are 7 URLs in total, one for each of the 7 EDG match videos. Opening the first URL shows the bullet-comment data:
[Figure: the bullet-comment data returned by the first URL]

Each bullet comment sits inside a <d> tag. Given this format, which parsing tool is the most appropriate? The answer here is regular expressions. Next, we fetch the bullet comments of all 7 match videos (Bilibili reports 223,000 in total; this article collects about 30,000 of them). The code is as follows:

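import re

if __name__ == '__main__':
  data = get_cid()
  json_data = json.loads(data)
  base_api = 'http://api.bilibili.com/x/v1/dm/list.so?oid='
  all_api = []
  for cid_datas in json_data['data']:
    cid = cid_datas.get('cid')
    detail_api = base_api + str(cid)
    all_api.append(detail_api)
  for api in all_api:
    # get_api_data() is not shown in this article; it is assumed to return the response text.
    # Pass the loop variable so that every URL is fetched, not only the last one.
    edg_datas = get_api_data(api)
    # Capture the text between <d ...> and </d>
    edg_datas = re.findall('<d.*?>(.*?)</d>',edg_datas,re.S)
    with open('EDG.txt','a',encoding='utf-8') as f:
      for edg_data in edg_datas:
        print(edg_data)
        f.write(edg_data + '\n')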
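
The regular expression used above can be checked on a small hand-written sample before running it against the live interface. In this sketch the XML string, including the `<i>` wrapper and the `p` attribute values, is made up for illustration and is not real captured data; `extract_comments` is a hypothetical helper.

```python
import re

# Hand-written sample: one <d> tag per comment. The wrapper element and
# the p attribute values are placeholders, not real data.
SAMPLE_XML = (
    '<?xml version="1.0" encoding="UTF-8"?><i>'
    '<d p="12.5,1,25,16777215">EDG冠军</d>'
    '<d p="40.0,1,25,16777215">恭喜EDG</d>'
    "</i>"
)

# Same pattern as in the article: skip the tag attributes, capture the text.
DANMAKU_PATTERN = re.compile(r"<d.*?>(.*?)</d>", re.S)


def extract_comments(xml_text: str) -> list[str]:
    """Return the text of every <d> element found in xml_text."""
    return DANMAKU_PATTERN.findall(xml_text)


if __name__ == "__main__":
    print(extract_comments(SAMPLE_XML))
```

Both quantifiers are non-greedy (`.*?`), so each match stops at the first `</d>` instead of swallowing several comments at once.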

To avoid garbled characters, set the response encoding to the one detected by chardet before reading response.text (this requires import chardet):

 response.encoding = chardet.detect(response.content)['encoding']

[Figure: console output]
Since there are about 30,000 bullet comments, only part of EDG.txt is shown here:
[Figure: part of EDG.txt]

4.2 Data visualization (word cloud map)

With the bullet comments captured, we now use an EDG image as the background (mask) to make a word cloud.
[Figure: EDG background image]
The code is as follows:

import jieba
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from PIL import Image
import numpy as np

def do_wordcloud():
  # Read all bullet comments, then strip newlines and full-width spaces
  with open('EDG.txt','r',encoding='utf-8') as f:
    text = f.read()
  text = text.replace('\n','').replace('\u3000','')
  # Segment the Chinese text and join the words with spaces for WordCloud
  text_cut = jieba.lcut(text)
  text_cut = ' '.join(text_cut)

  # Filter out words that carry no meaning here
  stop_words = ['“',',',' ','我','的','是','了',':','?','!','啊','你','吗','。','我们']

  # The background image is used as the mask of the word cloud
  background = Image.open("EDG.jpg")
  graph = np.array(background)

  word_cloud = WordCloud(font_path='simsun.ttc', # a font that contains Chinese glyphs
                         background_color='white',
                         mask=graph, # shape of the word cloud
                         stopwords=stop_words)

  word_cloud.generate(text_cut)
  plt.subplots(figsize=(12,8))
  plt.imshow(word_cloud)
  plt.axis('off')
  plt.show()
  word_cloud.to_file('edg.png')

[Figure: word cloud in the shape of the EDG image]
Now let's try a picture of Ultraman Tiga as the background, hahaha!
[Figure: Ultraman Tiga background image]
The word cloud then takes the shape of Ultraman Tiga:
[Figure: Ultraman Tiga word cloud]
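
A word cloud draws frequent words larger, so the step underneath it is a word count with the stop words removed. The following is a rough standard-library sketch of that counting step only. The token list is made up and stands in for the output of `jieba.lcut`, the stop-word set is a subset of the one in `do_wordcloud`, and `word_frequencies` is a hypothetical helper; the wordcloud library does its own counting internally and may differ in detail.

```python
from collections import Counter

# Subset of the stop words used in do_wordcloud().
STOP_WORDS = {"我", "的", "是", "了", "啊", "你", "吗", "我们"}


def word_frequencies(tokens: list[str]) -> Counter:
    """Count tokens, ignoring stop words and whitespace-only tokens."""
    return Counter(t for t in tokens if t.strip() and t not in STOP_WORDS)


if __name__ == "__main__":
    # Made-up tokens standing in for the output of jieba.lcut().
    tokens = ["EDG", "冠军", "我们", "是", "冠军", "恭喜", "EDG", "冠军"]
    print(word_frequencies(tokens).most_common(2))
```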

5. Natural Language Processing (NLP)

5.1 Data import

import pandas as pd

data = pd.read_csv('EDG.csv')
data = data.head()  # keep only the first 5 rows for display
print(data)

[Figure: console output]
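
Section 4 saves the bullet comments to EDG.txt, while the code in this section reads EDG.csv; the article does not show how one was turned into the other. One possible way, assuming one comment per line, sequential ids, and the column names id and content that section 5.2 selects, is sketched below with the standard csv module. `txt_to_csv` is a hypothetical helper and may not match how the author actually produced the file.

```python
import csv


def txt_to_csv(txt_path: str, csv_path: str) -> int:
    """Write one CSV row per non-empty line; return the number of rows written."""
    with open(txt_path, encoding="utf-8") as src, \
         open(csv_path, "w", encoding="utf-8", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow(["id", "content"])  # assumed column names
        rows = 0
        for line in src:
            content = line.strip()
            if content:  # skip blank lines
                rows += 1
                writer.writerow([rows, content])
    return rows


if __name__ == "__main__":
    print(txt_to_csv("EDG.txt", "EDG.csv"))
```

The csv writer quotes fields that contain commas or quotation marks, so comments with punctuation still load as a single content column.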

5.2 Data preprocessing

data = pd.read_csv('EDG.csv')
data = data[['id','content']]
data = data.head(10)
print(data)

[Figure: console output]


5.3 Sentiment Analysis

First install the Python library for sentiment analysis:

pip install snownlp -i https://pypi.doubanio.com/simple

[Figure: pip installation output]
Sentiment analysis
Because the data set is large, only the first few rows are shown here:

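import pandas as pd
from snownlp import SnowNLP

# Assumption: data1 holds the full id/content columns loaded from EDG.csv
data1 = pd.read_csv('EDG.csv')[['id', 'content']]
# sentiments is a score between 0 and 1; higher means more positive
data1['emotion'] = data1['content'].apply(lambda x:SnowNLP(x).sentiments)
print(data1.head())  # show only the first rows and keep data1 complete for later steps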

[Figure: console output]
Summary statistics of the sentiment scores:

print(data1.describe())  # print the summary without overwriting data1

[Figure: console output]
Data description: the mean of emotion is 0.63, the median is 0.67, and the 25% quantile is 0.49. The mean lies below the median, so the distribution is skewed toward low scores: a minority of low-scoring comments (fewer than 25% of them score below 0.49) pulls the overall mean down noticeably. The bottom of the figure above also shows that the sentiment analysis took 48.8 s to run, so the data set is fairly large.
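
Comparing the mean with the median is a quick way to spot this kind of skew. The sketch below uses a small made-up list of scores, not the article's data, and the standard statistics module; the "inclusive" method interpolates linearly between data points, which is also the default behaviour of pandas' describe().

```python
from statistics import mean, median, quantiles

# Made-up sentiment scores: mostly high, with a few very low values.
scores = [0.05, 0.10, 0.55, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85, 0.90]

# Quartile cut points (25%, 50%, 75%).
q1, q2, q3 = quantiles(scores, n=4, method="inclusive")

print(f"mean={mean(scores):.3f} median={median(scores):.3f} q1={q1:.3f}")
# A mean below the median suggests a tail of low scores (left skew).
print("left-skewed:", mean(scores) < median(scores))
```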

5.4 Sentiment Analysis Histogram

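import numpy as np
import matplotlib.pyplot as plt

# Font settings, needed only when the labels contain Chinese characters
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

bins = np.arange(0,1.1,0.1)	# bin edges 0.0, 0.1, ..., 1.0
plt.hist(data1['emotion'],bins,color='#4F94CD',alpha=0.9)
plt.xlim(0,1)
plt.xlabel('Sentiment score')
plt.ylabel('Count')
plt.title('Sentiment score histogram')
plt.show()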

[Figure: sentiment score histogram]
Data description:

  • The histogram shows the counts rising toward the higher sentiment scores, which suggests that fans were increasingly excited about EDG winning the championship;
  • Among the nearly 30,000 bullet comments, about 4,500 have sentiment scores in the interval [0.5, 0.6] and about 4,800 in the interval [0.8, 0.9]. Fans are most excited in the latter interval, presumably at the moment of winning the championship, hahaha!
  • From [0.5, 0.6] to [0.6, 0.7] and from [0.8, 0.9] to [0.9, 1.0] the counts drop, which may be due to some problems during the games or to the match coming to an end
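
Each bar of the histogram is a count of the comments whose score falls into one 0.1-wide interval. A minimal standard-library sketch of that counting step follows, on made-up scores rather than the article's data. `bin_counts` is a hypothetical helper, and scores that sit exactly on a bin edge may land in a different bin than in matplotlib because of floating-point rounding.

```python
def bin_counts(scores: list[float], width: float = 0.1) -> list[int]:
    """Count scores in [0, 1] into equal-width bins; the last bin includes 1.0."""
    n_bins = round(1 / width)
    counts = [0] * n_bins
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range: {s}")
        index = min(int(s * n_bins), n_bins - 1)  # clamp 1.0 into the last bin
        counts[index] += 1
    return counts


if __name__ == "__main__":
    # Made-up scores; index 5 of the result is the [0.5, 0.6) bin.
    print(bin_counts([0.05, 0.55, 0.58, 0.85, 1.0]))
```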

5.5 Keyword Extraction

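from jieba import analyse

# text_cut is the space-joined jieba output built inside do_wordcloud();
# it has to be available here, for example by returning it from that function
key_words = analyse.extract_tags(sentence=text_cut,topK=10,withWeight=True,allowPOS=())
print(key_words)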

[Figure: console output]
Data description:

  • The keywords above show that "champion" ranks first in the bullet comments posted by fans, followed by "translation", "us", "fuck", "little sister", "EDG", "tearful eyes", "holy gun brother", "congratulations", and "edg". From this point of view, EDG is really popular, and so is the translator. This can also be seen in the word cloud above

Parameter description:

  • sentence is the text to extract keywords from; it must be a str, not a list
  • topK is the number of highest-weighted keywords to return
  • withWeight indicates whether to return the weight of each keyword
  • allowPOS restricts extraction to the given parts of speech. For extract_tags the default is an empty tuple, which applies no filtering, as in the call above; the defaults of place names (ns), nouns (n), gerunds (vn), and verbs (v) belong to jieba's TextRank extractor, analyse.textrank
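
extract_tags ranks words by TF-IDF: how often a word appears in the text, multiplied by how rare the word is in general. The following toy sketch illustrates that weighting only. It assumes pre-segmented tokens and invented IDF values; jieba ships its own IDF table and tokenizer, so `top_keywords`, `default_idf`, and every number here are hypothetical.

```python
from collections import Counter


def top_keywords(tokens, idf, top_k=3, default_idf=1.0):
    """Rank tokens by term frequency times inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    weights = {w: (c / total) * idf.get(w, default_idf) for w, c in counts.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    # Made-up tokens and IDF values: "我们" is frequent but very common,
    # so its low IDF pushes it below the rarer words.
    tokens = ["冠军"] * 3 + ["翻译"] * 2 + ["我们"] * 4
    idf = {"冠军": 6.0, "翻译": 7.0, "我们": 1.0}
    for word, weight in top_keywords(tokens, idf):
        print(word, round(weight, 3))
```

This is why a very common word can appear many times in the bullet comments and still rank below a rarer word in the keyword list.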

5.6 Positive and negative bullet comments

Count the positive and negative bullet comments (a score of 0.5 or higher counts as positive):

pos,neg = 0,0
for i in data1['emotion']:
  if i >= 0.5:
    pos += 1
  else:
    neg += 1
print(f'Positive bullet comments: {pos}' + '\n' + f'Negative bullet comments: {neg}')

[Figure: console output]
Positive bullet comments: 17941
Negative bullet comments: 6054

5.7 Pie Chart Analysis

import matplotlib.pyplot as plt

plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['axes.unicode_minus'] = False

pie_labels = 'positive','negative'
plt.pie([pos,neg],labels=pie_labels,autopct='%1.2f%%',shadow=True)

plt.show()

[Figure: pie chart of positive and negative bullet comments]
As the pie chart shows, 74.77% of the bullet comments are positive and 25.23% are negative. Overall, positive bullet comments clearly dominate
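
The two percentages follow directly from the counts printed in section 5.6, which can be verified with a few lines of arithmetic:

```python
pos, neg = 17941, 6054  # counts printed in section 5.6

total = pos + neg
pos_pct = pos / total * 100
neg_pct = neg / total * 100

print(f"total={total} positive={pos_pct:.2f}% negative={neg_pct:.2f}%")
# prints: total=23995 positive=74.77% negative=25.23%
```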

5.8 Analysis of negative bullet comments

Take out some of the negative bullet comments:

data2 = data1[data1['emotion'] < 0.5]
data2 = data2.head()
print(data2)

[Figure: console output]
Data description:

  • Negative bullet comments such as "blood recovery" and "desire to survive" in the output above may have been prompted by poor play from the EDG team or the Korean team

6. Summary

PIL library
jieba library
numpy library
pandas library
requests library
wordcloud library
matplotlib library
json, re, chardet libraries
snownlp sentiment analysis library

7. Complete project download

My blog garden original text link: Read the original text
Complete project download link: Download
My original official account original text link: Read the original text

Original work takes effort. If you found this interesting and fun, I hope you can give it a thumbs up. Thank you!

Recently I found that many people have plagiarized my blog posts on CSDN, and their copies are more popular than mine. Sigh! After all, I only started blogging in October. Although my account was created in 2018, it does not have as much popularity or as many followers as theirs!

8. Author Info

Author: Xiaohong's Fishing Daily, Goal: Make programming more interesting!

Original WeChat public account: " Xiaohong Xingkong Technology ", focusing on algorithms, crawlers, websites, game development, data analysis, natural language processing, AI, etc., looking forward to your attention, let us grow and code together!

Reprint instructions: Be sure to indicate the source (note: from the official account: Xiaohong Xingkong Technology , author: Xiaohong's Fishing Daily)


Origin blog.csdn.net/qq_44000141/article/details/121265521