Crawling Weibo comment data with Python, and actually getting blocked by the anti-crawler!

It's the era of specialization, so become an expert sooner rather than later


Hello everyone! I've been busy with other things these past few days and realized I haven't updated the crawler column in a while. Today I'll walk you through crawling Weibo comments with Python. Ha, in the end I got clowned a little (my Weibo account was banned for a while), so let's take a look at the whole operation!

One: Core process steps

  1. Find the API interface and get the JSON text data

  2. Analyze the URL parameters of the main-comment and sub-comment links for splicing

  3. Crawl the main-comment and sub-comment data separately and import them into an Excel sheet

Two: Finding the comment interface

First, search for a blogger on Weibo and click into one of that blogger's posts!
Then switch from the PC view to the mobile view. The PC-side parameters are too complicated (Xiaoyedou struggled with the PC side for three days and the spliced URLs led nowhere, while the mobile side was done in one day)!
Next, find the API interface that holds the comment data among the many network requests (Xiaoyedou recommends filtering by XHR, since this is usually the data loaded via Ajax).
Copy this URL into the address bar, open it, then paste the returned content into json.cn to view it as formatted JSON.
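For reference, the returned JSON looks roughly like this. This is an abbreviated reconstruction based only on the fields the code below reads, with illustrative values; the real response contains many more fields:

# Rough shape of the hotflow response (abbreviated reconstruction, illustrative values only)
sample = {
    "ok": 1,
    "data": {
        "data": [  # list of main comments
            {
                "id": 4596227000000000,  # cid, used later to build sub-comment urls
                "text": 'some comment text<span class="url-icon">...</span>',
                "created_at": "Fri Jan 22 17:56:48 +0800 2021",
                "like_count": 12,
                "total_number": 3,  # number of replies
                "user": {
                    "id": 123456,
                    "screen_name": "some_user",
                    "gender": "f",
                    "followers_count": 100,
                    "follow_count": 50,
                },
            },
        ],
        "max_id": 139059963538910,  # cursor for requesting the next page
    },
}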
Let Xiaoyedou first try crawling the data in this JSON response with a single URL; the fields to be crawled are extracted in the code below.

Here is the core data-capture code: it accesses the interface and gets the data in JSON format!

import requests
import json
from tqdm import tqdm
import datetime
import time
import random
import csv  # used for saving the results later

# Part of the comments under one main Weibo post; the max_id parameter
# must be constructed to fetch all of the Ajax-loaded pages
up_main_url = 'https://m.weibo.cn/comments/hotflow?id=4596226979532970&mid=4596226979532970&max_id_type=0'
headers = {
    # UA
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
    # login info
    'cookie': 'SINAGLOBAL=5702871423690.898.1595515471453; SCF=Ah2tNvhR8eWX01S-DmF8uwYWORUbgfA0U3GnciJplYvqE1sn2zJtPdkJ9ork9dAVV8G7m-9kbF-PwIHsf3jHsUw.; SUB=_2A25NDifYDeRhGeBK7lYS9ifFwjSIHXVu8UmQrDV8PUJbkNANLRmlkW1NR7rne18NXZNqVxsfD3DngazoVlT-Fvpf; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WhhI1TcfcjnxZJInnV-kd405NHD95QcSh-Xe0q41K.RWs4DqcjQi--ciK.RiKLsi--Ni-24i-iWi--Xi-z4iKyFi--fi-2XiKLhSKeEeBtt; wvr=6; _s_tentry=www.sogou.com; UOR=,,www.sogou.com; Apache=9073188868783.379.1611369496580; ULV=1611369496594:3:3:3:9073188868783.379.1611369496580:1611281802597; webim_unReadCount=%7B%22time%22%3A1611369649613%2C%22dm_pub_total%22%3A0%2C%22chat_group_client%22%3A0%2C%22chat_group_notice%22%3A0%2C%22allcountNum%22%3A63%2C%22msgbox%22%3A0%7D'
}
response = requests.get(url=up_main_url, headers=headers)
if response.status_code == 200:
    # the double "ignore" works around strict gbk encode/decode errors
    text = response.text.encode("gbk", "ignore").decode("gbk", "ignore")
    content = json.loads(text)  # parse the text as JSON
    try:
        data = content['data']['data']  # get the comment list
        for comment in tqdm(data, desc='Huahua comment crawl progress --->!'):
            time.sleep(random.random())
            text = str(comment['text'])  # get the comment text
            # raw text can trail off into emoji markup, e.g.
            # "Damn, my house collapsed again<span class="url-icon">..."
            # clean it up: find() returns the index where <span starts
            if text.find('<span') != -1:
                text = text[:text.find('<span')]
            create_time = comment['created_at']  # publish time
            # convert the GMT-style string "Fri Jan 22 17:56:48 +0800 2021"
            # into the easier standard format 2021-01-22 17:56:48
            std_transfer = '%a %b %d %H:%M:%S %z %Y'  # conversion format
            std_create_time = datetime.datetime.strptime(create_time, std_transfer)
            user_name = comment['user']["screen_name"]  # user name
            user_id = comment['user']['id']  # user id
            user_followers_count = comment['user']['followers_count']  # this user's follower count
            user_follow_count = comment['user']['follow_count']  # how many accounts this user follows
            user_gender = comment['user']['gender']  # gender
            total_number = comment["total_number"]  # total number of replies
            like_count = comment["like_count"]  # number of likes
            flag_id = comment["id"]  # id needed to construct sub-comment urls
            print('')  # blank line between comments
            # print(f'Text: {text}')
            # print(f'User: {user_name}')
            # print(f'Time: {std_create_time}')
            # print(f'id: {user_id}')
            # print(f'Following: {user_follow_count}')
            # print(f'Followers: {user_followers_count}')
            # print(f'Gender: {user_gender}')
            # print(f'Replies: {total_number}')
            # print(f'Likes: {like_count}')
            # print(f'cid: {flag_id}')
            # print('Saved successfully!')
    except Exception:
        print("Welp, got hit by the anti-crawler!")

The code above simply accesses the API interface, obtains the necessary fields such as comment text, commenter user name, like count, reply count, and comment time, saves them to a csv file, and imports that into an Excel sheet.
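The saving step itself isn't shown in the snippet above. As a minimal sketch (my own illustration, not the original source), the fields collected in the loop could be appended to a csv file with the already-imported csv module; placed inside the for loop, it would look like this:

# Minimal saving sketch (illustrative): append one comment's fields to a csv file
with open('weibo_comments.csv', 'a', newline='', encoding='utf-8-sig') as f:
    writer = csv.writer(f)
    writer.writerow([text, user_name, std_create_time, user_id,
                     user_follow_count, user_followers_count,
                     user_gender, total_number, like_count, flag_id])

Opening the file with encoding='utf-8-sig' lets Excel display the Chinese text correctly without garbled characters.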
The code above crawls only part of the main comments under one Weibo post; the remaining main comments are loaded via Ajax, which requires constructing URL parameters! Crawling the sub-comment replies under each main comment is similar to the code above; for space reasons I won't show more here, just grab it at the end of the article!

Three: URL rules for parameter construction

The API interface for crawling main comments needs: max_id (obtained directly from the JSON data of the first main-comment API response)

The API interface for crawling sub-comments needs: cid (obtained directly from the JSON data of the first main-comment API response) and max_id (obtained from the JSON data of the sub-comment interface)

ps: the two max_id parameters come from different places: one is obtained from the main-comment API interface, the other from the sub-comment API interface
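To make the rules concrete, the two kinds of URLs can be spliced from templates like these. The main-comment template matches the code below; the child-comment endpoint is my assumption based on m.weibo.cn's mobile API, so verify it against the XHR requests in DevTools:

# URL templates implied by the rules above
# (the child_comment endpoint is an assumption; check it in DevTools)
main_tpl = 'https://m.weibo.cn/comments/hotflow?id={wid}&mid={wid}&max_id={max_id}&max_id_type=0'
sub_tpl = 'https://m.weibo.cn/comments/hotflow/child_comment?cid={cid}&max_id={max_id}&max_id_type=0'

# illustrative values: wid is the post id from this article, the ids are made up
main_url = main_tpl.format(wid=4596226979532970, max_id=139059963538910)
sub_url = sub_tpl.format(cid=4596227000000000, max_id=0)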

# Build the main-comment URL links
# Get all the max_id parameters needed to construct the Ajax main-comment URLs
# (max_id_url_list, max_id_list, cid_list and requests_json() are defined in the full source)
def main_max_id():
    # use the "no more content" error as the termination condition to break out
    while len(max_id_url_list) < 200:
        print("Sleeping...")
        time.sleep(random.randint(1, 3))  # sleep to avoid hammering the server
        print("Done sleeping!")
        if len(max_id_list) == 0:
            main_url = 'https://m.weibo.cn/comments/hotflow?id=4596226979532970&mid=4596226979532970&max_id_type=0'
            max_id_url_list.append(main_url)
        else:
            main_url = f'https://m.weibo.cn/comments/hotflow?id=4596226979532970&mid=4596226979532970&max_id={max_id_list[-1]}&max_id_type=0'
            max_id_url_list.append(main_url)
        try:
            content = requests_json(main_url)
            max_id = content['data']["max_id"]  # max_id needed to construct the next main-comment url
            max_id_list.append(max_id)
            # TODO: add a termination condition for when to stop fetching max_id
            data = content['data']['data']  # get the comment list
            for comment in data:  # iterate over the comments
                cid = comment["id"]  # id needed to build sub-comment urls
                cid_list.append(cid)  # collect it
        except Exception:
            print("That was the last max_id! Time to break out!")
            break

Running the code, a two-thread approach is used to crawl, and the Weibo data comes back successfully.
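For reference, here is a minimal sketch of the two-thread approach; crawl_main_comments and crawl_sub_comments are hypothetical names standing in for the functions in the full source:

# Minimal two-thread sketch: crawl main comments and sub-comments in parallel
# (both target functions are placeholders for those in the full source)
import threading

t1 = threading.Thread(target=crawl_main_comments)
t2 = threading.Thread(target=crawl_sub_comments)
t1.start()
t2.start()
t1.join()  # wait for both crawls to finish
t2.join()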

That's this issue's share on crawling Weibo comments. The post is already long, so if you want the full source code, keep following Xiaoyedou's WeChat public account: Yedou小神社

Reply " 005 Weibo Data " in the background to get all the source code!

Keep following the Yedou Shrine, and don't get lost on the crawling road!
