Crawling 10,000 NetEase Cloud Music comments with Python: feeling the human warmth in the epidemic

I. Introduction

Yesterday, when I opened NetEase Cloud Music, this video was all over my feed ⬇️

A tribute to everyone working on the front line of the epidemic! music.163.com

16 million+ plays, 13,000+ comments

So today, let's use Python to look at the touching stories left in these comments.

II. Data fetching

First, open the NetEase Cloud Music video link on a computer and scroll to the latest comments; the goal is to fetch all of them. Clicking "Next" changes the comments but not the URL, which indicates that the whole comment area is loaded through asynchronous Ajax requests. You can look the concept up for more detail, but in short, Ajax exchanges data with the server in the background and updates part of the page without reloading it. Press F12 to open the browser's Developer Tools, select the Network tab, and filter by XHR (XMLHttpRequest) to single out the Ajax requests:

Click through the requests one by one and look at each response; you will find the one that carries the comments ⬇️

Click into it to see the header information

It turns out to be a POST request that takes two parameters, params and encSecKey

Let's give it a try

import requests
import json

url = 'https://music.163.com/weapi/v1/resource/comments/R_VI_62_3F79C7B87510106B8118EE3F811C1BC5?csrf_token='

headers = {
   'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36',
   'Referer':'https://music.163.com/video?id=3F79C7B87510106B8118EE3F811C1BC5&userid=265996751',
   'Origin':'http://music.163.com',
   'Host':'music.163.com'
}
user_data = {
   'params': 'fphfDEFeIs3I+ybqkBQhWxvB8GFOB0RMrmOS1VfB9ljX0CWccYd5WPdfRk6iaPuhllQcpKweUTwKc7GyZZENbB99O3C/vdhEeChuxLK8Rl40hb/ipmhXIxbJ1KRMemNFF+jTQqdFUnw3HNdrUqSzjmfh/HP630vmp4HVL6i+oSDygse0C1JUgS5d5Six93R7r8b3tKUCnPw/JJbH3AXTlA==',
   'encSecKey': 'a658168c2225f0dfe46e9b260abb348691c42946ec46e6f4a5c434e86d6d546da0fcb7de0dba750422c40064b026169a453f5e42c59f63c38c7749c0e81023dd27978f1e5d97b6c97fa70df347737b51a69fc15b49b2e3e209c53eefcf7d795b6344404811e84761c700422ef57a427e84bc77adece15146ca62033b3f2aacfd'
}

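# Send the POST request with the two encrypted form fields captured from the browser
response = requests.post(url, headers=headers, data=user_data, timeout=10)
response.raise_for_status()
print(response.json())  # the comments are in the JSON body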

As you can see, this request returns the comments on the page.
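To work with the comment text rather than the raw response, the JSON body can be unpacked into plain strings. The sketch below is only an illustration: it assumes the body holds a list under a `comments` key whose items carry the text in `content`. These field names are assumptions, not taken from this article, so check them against what the Response tab in Developer Tools actually shows. `extract_comments` is a hypothetical helper.

```python
def extract_comments(payload, list_key="comments", text_key="content"):
    """Return the comment texts found in a decoded JSON response.

    list_key and text_key are ASSUMED field names; adjust them to match
    the response you see in the browser's Developer Tools.
    """
    texts = []
    for item in payload.get(list_key, []):
        text = item.get(text_key)
        if text:  # skip entries with missing or empty text
            texts.append(text.strip())
    return texts


# Hand-written stand-in for response.json(), shaped like the assumed layout
sample = {"comments": [{"content": " 武汉加油 "}, {"content": ""}, {"other": 1}]}
print(extract_comments(sample))
```

With the request above, this would be called as `extract_comments(response.json())`.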

But this only gets the comments on the current page. So how do we get all of them? As noted earlier, clicking "Next" refreshes only the comments without reloading the page. Testing shows that only params and encSecKey change when we click "Next", so the next question is how exactly these two parameters are generated. Fortunately, an expert on Zhihu [1] has already analyzed the encryption process and reconstructed it in code, so we can use that directly. The whole process is fairly complex: the parameters are encrypted twice, and four different parameters are involved. For lack of space, the complete code is not shown here; it is available from the "get up early python" public account. With that in place, all the comments can be crawled.
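Once the encryption is available, the paging loop itself is simple. The sketch below assumes a hypothetical `encrypt_page(page)` function (for example, built from the code in [1]) that returns the `params` and `encSecKey` form fields for a given page, and a hypothetical `post_form(form)` function that sends them and returns the decoded JSON. Neither name comes from this article. The `comments` and `content` field names and the rule of stopping at the first empty page are likewise assumptions.

```python
import time


def crawl_all_comments(encrypt_page, post_form, max_pages, delay=0.0):
    """Collect comment texts page by page.

    encrypt_page(page) -> dict with 'params' and 'encSecKey' (hypothetical helper)
    post_form(form)    -> decoded JSON dict for that page (hypothetical helper)
    """
    collected = []
    for page in range(1, max_pages + 1):
        form = encrypt_page(page)
        payload = post_form(form)
        # 'comments' / 'content' are ASSUMED field names
        texts = [c["content"] for c in payload.get("comments", []) if c.get("content")]
        if not texts:  # an empty page is treated as the end of the list
            break
        collected.extend(texts)
        time.sleep(delay)  # pause between requests to go easy on the server
    return collected


def save_comments(texts, path="music.txt"):
    """Write one comment per line, UTF-8 encoded."""
    with open(path, "w", encoding="UTF-8") as fh:
        fh.write("\n".join(texts))


# Offline demonstration with fake helpers, so no network access is needed
fake_pages = {1: ["a", "b"], 2: ["c"]}
demo = crawl_all_comments(
    encrypt_page=lambda page: {"params": str(page), "encSecKey": "x"},
    post_form=lambda form: {
        "comments": [{"content": t} for t in fake_pages.get(int(form["params"]), [])]
    },
    max_pages=5,
)
print(demo)
```

In a real run, `post_form` could be something like `lambda form: requests.post(url, headers=headers, data=form, timeout=10).json()`, and `save_comments` would write the result to the `music.txt` file used in the analysis step.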

III. Data analysis

First, let's look at what the hot comments say

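There are no angels in white, just a group of kids who put on white coats and, following the example of those who came before them, treat the sick and save lives!

When I heard the line "Mom is off fighting monsters", my tears just fell

Wuhan is only being hidden away by the virus for now! Come on, Wuhan!

Next, let's look at the keywords that appear most often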
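A frequency count like this can be produced with jieba and `collections.Counter`. The sketch below is one possible way to do it, not necessarily how the counts in this article were computed. `top_keywords` and `top_keywords_from_file` are hypothetical helpers, and dropping single-character tokens is an assumption made to filter out particles and punctuation.

```python
from collections import Counter


def top_keywords(tokens, n=10, min_len=2):
    """Return the n most common tokens that have at least min_len characters."""
    cleaned = (tok.strip() for tok in tokens)
    counts = Counter(t for t in cleaned if len(t) >= min_len)
    return counts.most_common(n)


def top_keywords_from_file(path="music.txt", n=10):
    """Segment a UTF-8 text file with jieba and count its keywords."""
    import jieba  # imported here so top_keywords works without jieba installed

    with open(path, "r", encoding="UTF-8") as fh:
        return top_keywords(jieba.cut(fh.read()), n=n)


# Small pre-segmented example
print(top_keywords(["武汉", "加油", "中国", "加油", "!", "的"], n=2))
```

On the crawled data this would be called as `top_keywords_from_file("music.txt")`.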

Unsurprisingly, "Come on", "Come on, Wuhan" and "Come on, China" appear most often. Finally, let's make a word cloud

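from wordcloud import WordCloud
import matplotlib.pyplot as plt  # plotting module
import jieba  # Chinese word segmentation

path_txt = 'music.txt'
with open(path_txt, 'r', encoding='UTF-8') as fh:
    f = fh.read()

# Segment the text with jieba and join the tokens with spaces;
# wordcloud cannot build a correct Chinese word cloud from unsegmented text
cut_text = " ".join(jieba.cut(f))

wordcloud = WordCloud(
    # Set a font that contains Chinese glyphs, otherwise the characters render as empty boxes.
    # This is the usual font file on a Windows machine; swap in another font if needed
    font_path="msyh.ttc",
    # Background color, width and height
    background_color="black", width=2000, height=880).generate(cut_text)

plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.show()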

In the final word cloud ⬇️ you can see: Come on, Wuhan! Come on, China!

 

- Don't just bookmark this, give it a like too!

My WeChat public account: get up early python

References

[1] Zhihu: https://www.zhihu.com/question/36081767

 


Origin blog.csdn.net/weixin_41846769/article/details/104693221