Crawling all comments on a NetEase Cloud Music song and generating a word cloud image

Disclaimer: This article is for study and research only; it must not be used for any illegal purpose, and you bear the consequences otherwise. If there is any infringement, please notify me and I will delete the content. Thank you!

Project background:

My previous article covered decrypting NetEase Cloud Music's request parameters. This time we will crawl all the comments on the song "Painted Baby".

Solution:


1. First, let's look at the comment interface. The requests go to https://music.163.com/weapi/comment/resource/comments/get?csrf_token=ff57cff46ebe79b9a51dd10f8c9181bb, and they carry the same two encrypted parameters, params and encSecKey, as the song interface. The previous article covered how those two parameters are generated, and the method is identical for the comment interface, so I won't repeat it here.


2. Since params and encSecKey are generated the same way, here I just post the plaintext request parameters of the comment interface. Two of them need attention: cursor and pageSize. After many tests, I found that each page's cursor is the millisecond timestamp of the last comment on the previous page, and each page's pageSize is 20:

{
    "rid": "R_SO_4_1474342935", "threadId": "R_SO_4_1474342935", "pageNo": "3",
    "pageSize": "20", "cursor": "1600190813154", "offset": "0", "orderType": "1",
    "csrf_token": "ff57cff46ebe79b9a51dd10f8c9181bb"
}
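To see what that cursor means, here is a small sketch that interprets it as a millisecond Unix timestamp. The UTC+8 timezone is an assumption on my part (NetEase is a Chinese service, and the article's own example timestamps are consistent with Beijing time):

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumed: timestamps are Beijing time (UTC+8)

def cursor_to_datetime(cursor: str) -> str:
    """Interpret a 13-digit cursor as a millisecond Unix timestamp."""
    seconds = int(cursor) / 1000
    return datetime.fromtimestamp(seconds, tz=CST).strftime("%Y-%m-%d %H:%M:%S")

print(cursor_to_datetime("1600190813154"))  # → 2020-09-16 01:26:53
```

So the cursor in the request above corresponds to a comment posted at 2020-09-16 01:26:53, the last comment of the previous page.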


3. Here comes the problem: I don't know the time of the last comment on each page, so how can I generate the corresponding cursor? My solution: start from the date of the first comment (for example 2020-08-27) and define each day's cursor as the last second of that day, i.e. the 13-digit millisecond timestamp of '2020-08-27 23:59:59' is 1598543999000 (the 10-digit seconds timestamp with three zeros appended), and walk forward day by day up to today (for example 2020-09-16, whose last-second timestamp is 1600271999000). Set pageSize=1000 (the maximum) and orderType=1 (sorted by time), and guard against fetching the same comment twice on adjacent days. The comment set obtained this way is actually incomplete; you could further subdivide each day's comments by pageNo and offset to get everything (I was lazy and did not do this). If you have better ideas, feel free to comment!

param = {
    "rid": "R_SO_4_" + song_id, "threadId": "R_SO_4_" + song_id, "pageNo": "1",
    "pageSize": "1000", "cursor": cursor, "offset": "0", "orderType": "1",
    "csrf_token": "ff57cff46ebe79b9a51dd10f8c9181bb"}
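The per-day cursor described above can be built like this. This is a minimal sketch; the timezone is pinned to UTC+8 so the numbers reproduce the examples in the text regardless of the machine's local timezone:

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # Beijing time, assumed for NetEase timestamps

def day_end_cursor(day: str) -> str:
    """Return the 13-digit millisecond timestamp of `day` 23:59:59 (UTC+8)."""
    dt = datetime.strptime(day, "%Y-%m-%d").replace(
        hour=23, minute=59, second=59, tzinfo=CST)
    return str(int(dt.timestamp())) + '000'  # seconds timestamp + three zeros

print(day_end_cursor("2020-08-27"))  # → 1598543999000, as in the example above
print(day_end_cursor("2020-09-16"))  # → 1600271999000
```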

4. With this analysis we can fetch most of the song's comments. The code is as follows:
import datetime
import json
import time

import requests

# song_id, headers and js_tool (the JS encryption context from the
# previous article) are assumed to be set up already

now_day = datetime.date.today()  # today's date
flag_info = None  # first comment of the previous request, used to detect repeats
num = 0
for i in range(20, -1, -1):  # iterate over the dates 2020-08-27 ... 2020-09-16
    pre_day = str(now_day - datetime.timedelta(days=i)) + ' 23:59:59'  # last second of that day
    # parse into a time struct
    timeArray = time.strptime(pre_day, "%Y-%m-%d %H:%M:%S")
    # convert to a 13-digit millisecond timestamp
    cursor = str(int(time.mktime(timeArray))) + '000'
    print(pre_day, cursor)
    # comment interface parameters
    param = {
        "rid": "R_SO_4_" + song_id, "threadId": "R_SO_4_" + song_id, "pageNo": "1",
        "pageSize": "1000", "cursor": cursor, "offset": "0", "orderType": "1",
        "csrf_token": "ff57cff46ebe79b9a51dd10f8c9181bb"}
    pdata = js_tool.call('d', str(param))  # generate params/encSecKey
    response = requests.post('https://music.163.com/weapi/comment/resource/comments/get',
                             headers=headers, data=pdata)
    # extract the comment list
    data = json.loads(response.text)['data']
    comments = data.get('comments')
    if not comments:  # nothing returned for this day
        continue
    # save the comments
    with open('comments.txt', 'a', encoding='utf8') as f:
        for comment in comments:
            info = comment.get('content')
            if flag_info == info:  # duplicate comment: stop to avoid fetching it twice
                break
            print(info)
            f.write(info + '\n')
            # follow_comments = comment.get('beReplied')  # replies, not fetched for now
            # if follow_comments:
            #     for follow_comment in follow_comments:
            #         print(follow_comment.get('content'))
            num += 1  # one more comment fetched
    flag_info = comments[0]['content']  # remember the first comment of each request
    print('first comment of this request:', flag_info, '\n')
print('comments fetched:', num)
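The flag_info check above only compares against the first comment of the previous request, so it can miss or drop comments. A more robust variant, sketched here, keeps a set of everything already seen; it keys on the comment's 'commentId' field, which I assume to be a unique identifier in the API response:

```python
def dedupe_comments(pages):
    """Yield each comment dict only once across overlapping pages.

    `pages` is an iterable of comment lists as returned by the API,
    deduplicated on the (assumed unique) 'commentId' field.
    """
    seen = set()
    for comments in pages:
        for comment in comments:
            cid = comment.get('commentId')
            if cid in seen:
                continue
            seen.add(cid)
            yield comment

# two overlapping pages, as adjacent cursor windows might return
page1 = [{'commentId': 1, 'content': 'a'}, {'commentId': 2, 'content': 'b'}]
page2 = [{'commentId': 2, 'content': 'b'}, {'commentId': 3, 'content': 'c'}]
unique = list(dedupe_comments([page1, page2]))
print(len(unique))  # → 3
```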

5. Now that we have the comment data, use jieba to segment it, count word frequencies, and output the word cloud image:
import collections

import jieba
import matplotlib.pyplot as plt
import numpy as np
import wordcloud
from PIL import Image as image

# Word segmentation
def fc_CN(text):
    # receive the string to segment
    word_list = jieba.cut(text)
    # join the tokens with spaces
    result = " ".join(word_list)
    return result

# Output the word cloud
def word_cloud():
    with open("./comments.txt", encoding='utf8') as fp:
        text = fp.read()
        # segment the Chinese text
        text = fc_CN(text).replace('\n', '').split(' ')
        # filter out common stop words and punctuation
        filter_str = ['的', ',', '了', '我', '[', '你', '是', '就', ']', '!', '。', '?', '这', '不', '也', '都', '吧', '啊', '在',
                      '吗', '和', '听', '有', '说', '去', '好', '人', '给', '他', '…', '小', '来', '还', '没', '一', '']
        new_text = []
        for data in text:
            if data not in filter_str:
                new_text.append(data)
        print(new_text)
        # word frequency statistics
        word_counts = collections.Counter(new_text)  # count token frequencies
        word_counts_top10 = word_counts.most_common(10)  # ten most frequent words
        print(word_counts_top10)  # sanity check

        # render the word cloud
        mask = np.array(image.open('./love.jpg'))  # background shape image -- supply your own
        wc = wordcloud.WordCloud(
            # background_color='white',  # background colour
            font_path=r'C:\Windows\Fonts\simhei.TTF',  # a font that supports Chinese
            mask=mask,  # background shape
            max_words=200,  # maximum number of words shown
            max_font_size=300,  # maximum font size
            # scale=32  # raise for a sharper image
        )

        wc.generate_from_frequencies(word_counts)  # build the cloud from the frequency dict
        image_colors = wordcloud.ImageColorGenerator(mask)  # colour scheme from the background image
        wc.recolor(color_func=image_colors)  # recolour the cloud to match the background
        wc.to_file("./tmp.jpg")  # save as a file
        plt.imshow(wc)  # display the cloud
        plt.axis('off')  # hide the axes
        plt.show()  # show the image
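For readers who don't have jieba installed, the filtering and counting steps alone can be exercised on pre-split tokens. This is a self-contained sketch of what word_cloud does before rendering, with a made-up token list and a tiny sample of the stop-word filter:

```python
import collections

STOP_WORDS = {'的', '了', '我', '你', '是', ''}  # a tiny sample of the filter list above

def top_words(tokens, n=3):
    """Count tokens, skipping stop words, and return the n most common."""
    counts = collections.Counter(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)

tokens = ['giao', '的', 'giao', '好听', '我', '好听', 'giao']
print(top_words(tokens))  # → [('giao', 3), ('好听', 2)]
```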


6. Finally, run it! The song has 8,544 comments in total, of which we crawled 8,230, so a few hundred are still missing.




7. Now take a look at the word cloud we output. Hahaha, a giant "giao" greets you, courtesy of my giao brother! The complete code is available on my GitHub: https://github.com/934050259/wyy_comments


Note: I later found that there is another comment API, http://music.163.com/api/v1/resource/comments/R_SO_4_1474342935?limit=20&offset=0, whose path parameter is just R_SO_4_ plus the song_id and which needs no encrypted parameters at all. Sigh, uncomfortable~
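Building that legacy endpoint's URL is straightforward. A sketch, assuming limit/offset behave like ordinary offset-based paging (the article only shows the default limit=20, offset=0 call):

```python
def comment_api_url(song_id, limit=20, offset=0):
    """Build the unencrypted comment API URL for a song."""
    return ('http://music.163.com/api/v1/resource/comments/R_SO_4_{}'
            '?limit={}&offset={}').format(song_id, limit, offset)

print(comment_api_url(1474342935))
# → http://music.163.com/api/v1/resource/comments/R_SO_4_1474342935?limit=20&offset=0
```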


Origin blog.csdn.net/qq_26079939/article/details/108621764