Sina Weibo Crawler: Scraping a User's Posts

1. Find the API endpoint
2. Obtain a valid cookie
3. Parse the post content: I use the client's detail-page endpoint
  • Fetch the list page and parse out the detail-page identifiers, e.g. 4460578661751867

    import re
    import json
    import requests


    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3980.0 Safari/537.36 Edg/80.0.355.1',
        'Cookie': 'SUB=_2AkMpQl9Zf8NxqwJRmP4Uz2vmaox_yAvEieKfHq6CJRMxHRl-yj9jqhwttRB6AsJxtmeKpiXNyz7GDDQw5YkpmIZ6O0s2'
    }

    def get_history():
        weibo_url = "https://weibo.com/yangmiblog?profile_ftype=1&is_all=1"
        response = requests.get(url=weibo_url, headers=headers)
        try:
            html_doc = response.content.decode('utf-8')
        except Exception:
            print('Failed to fetch the history page: the cookie has expired')
            return None
        # Parse the detail-page IDs (mid) out of the history page
        article_id_list = re.findall(r'mid=\\"(\d+)\\"', html_doc, re.S)
        return article_id_list
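The mid-extraction step can be sanity-checked offline. The history page embeds its HTML inside a JavaScript string, so attribute quotes arrive escaped as `\"` in the raw response; the snippet below mimics that layout with the sample ID from above (the surrounding markup is made up):

```python
import re

# Minimal offline check of the mid-extraction regex. The Weibo history
# page embeds HTML inside a JavaScript string, so attribute quotes show
# up escaped as \" in the raw response; this sample mimics that layout.
sample = r'<div mid=\"4460578661751867\" class=\"WB_feed\"></div>'
article_ids = re.findall(r'mid=\\"(\d+)\\"', sample, re.S)
print(article_ids)  # -> ['4460578661751867']
```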
    
  • Build the detail-page URL and request it

    response = requests.get(url, timeout=20, headers={
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3980.0 Safari/537.36 Edg/80.0.355.1',
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    })
    html = response.text
    data = json.loads(re.findall(r'render_data = \[(.*?)\]\[0\]', html, re.S)[0])
    # Post text
    content = data['status']['text']
    # Repost count
    reposts_count = data['status']['reposts_count']
    # Comment count
    comments_count = data['status']['comments_count']
    # Like count
    attitudes_count = data['status']['attitudes_count']
    # Title
    title = data['status']['status_title']
    pub_time_str = data['status']['created_at'].split(' ')
    month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'].index(
        pub_time_str[1]) + 1
    # Publication time, formatted as %Y-%m-%d %H:%M:%S
    pub_time = pub_time_str[-1] + '-' + str(month) + '-' + pub_time_str[2] + ' ' + pub_time_str[3]
    # With that, one complete Weibo post has been collected
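The manual month lookup can also be done with `datetime.strptime`, assuming `created_at` arrives in the form implied by the `split()` indexing above (weekday, month, day, time, timezone, year); the exact value below is illustrative:

```python
from datetime import datetime

# created_at as implied by the split()-based parsing above, e.g.
# "Sat Dec 14 04:11:26 +0800 2019" (the value here is illustrative).
created_at = 'Sat Dec 14 04:11:26 +0800 2019'
pub_time = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
print(pub_time.strftime('%Y-%m-%d %H:%M:%S'))  # -> 2019-12-14 04:11:26
```

Note that `%a` and `%b` are locale-dependent; this works under the default C locale, which matches the English month names Weibo returns.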
    

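The `render_data` extraction can likewise be exercised against a mock of the detail page; the JSON below is a cut-down, made-up stand-in for the real payload:

```python
import json
import re

# A cut-down mock of the detail page: the payload sits in a JavaScript
# variable as `render_data = [{...}][0]` (field values are made up).
html = ('var $render_data = [{"status": {"text": "hello", '
        '"reposts_count": 2, "comments_count": 1, "attitudes_count": 5}}][0] || {};')
data = json.loads(re.findall(r'render_data = \[(.*?)\]\[0\]', html, re.S)[0])
print(data['status']['reposts_count'])  # -> 2
```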
That is my scraping approach; the code is only meant to illustrate the idea.


Reprinted from blog.csdn.net/qq_40125653/article/details/104015559