Sina Weibo Crawler: Scraping a User's Posts

1. Find the API endpoint
2. Obtain a valid cookie
3. Parse the post content: I use the client's detail-page endpoint
  • Fetch the list page and parse out the detail-page identifiers, e.g. 4460578661751867

    import re
    import json
    import requests


    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3980.0 Safari/537.36 Edg/80.0.355.1',
        'Cookie': 'SUB=_2AkMpQl9Zf8NxqwJRmP4Uz2vmaox_yAvEieKfHq6CJRMxHRl-yj9jqhwttRB6AsJxtmeKpiXNyz7GDDQw5YkpmIZ6O0s2'
    }

    def get_history():
        weibo_url = "https://weibo.com/yangmiblog?profile_ftype=1&is_all=1"
        response = requests.get(url=weibo_url, headers=headers)
        try:
            html_doc = response.content.decode('utf-8')
        except Exception:
            print('Failed to fetch the history page: the cookie has expired')
            return None
        # Parse the detail-page IDs (mid) out of the history page
        article_id_list = re.findall(r'mid=\\"(\d+)\\"', html_doc, re.S)
        return article_id_list
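The mid-extraction step can be sanity-checked offline. The history page embeds its HTML inside a JavaScript string, so attribute quotes arrive escaped as `\"` in the raw response; the snippet below mimics that layout with the sample ID from above (the surrounding markup is made up):

```python
import re

# Minimal offline check of the mid-extraction regex. The Weibo history
# page embeds HTML inside a JavaScript string, so attribute quotes show
# up escaped as \" in the raw response; this sample mimics that layout.
sample = r'<div mid=\"4460578661751867\" class=\"WB_feed\"></div>'
article_ids = re.findall(r'mid=\\"(\d+)\\"', sample, re.S)
print(article_ids)  # -> ['4460578661751867']
```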
    
  • Build the detail-page URL and request it

    response = requests.get(url, timeout=20, headers={
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3980.0 Safari/537.36 Edg/80.0.355.1',
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1"
    })
    html = response.text
    data = json.loads(re.findall(r'render_data = \[(.*?)\]\[0\]', html, re.S)[0])
    # Post text
    content = data['status']['text']
    # Repost count
    reposts_count = data['status']['reposts_count']
    # Comment count
    comments_count = data['status']['comments_count']
    # Like count
    attitudes_count = data['status']['attitudes_count']
    # Title
    title = data['status']['status_title']
    pub_time_str = data['status']['created_at'].split(' ')
    month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'].index(
        pub_time_str[1]) + 1
    # Publication time, formatted as %Y-%m-%d %H:%M:%S
    pub_time = pub_time_str[-1] + '-' + str(month) + '-' + pub_time_str[2] + ' ' + pub_time_str[3]
    # With that, one complete Weibo post has been collected
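The manual month lookup can also be done with `datetime.strptime`, assuming `created_at` arrives in the form implied by the `split()` indexing above (weekday, month, day, time, timezone, year); the exact value below is illustrative:

```python
from datetime import datetime

# created_at as implied by the split()-based parsing above, e.g.
# "Sat Dec 14 04:11:26 +0800 2019" (the value here is illustrative).
created_at = 'Sat Dec 14 04:11:26 +0800 2019'
pub_time = datetime.strptime(created_at, '%a %b %d %H:%M:%S %z %Y')
print(pub_time.strftime('%Y-%m-%d %H:%M:%S'))  # -> 2019-12-14 04:11:26
```

Note that `%a` and `%b` are locale-dependent; this works under the default C locale, which matches the English month names Weibo returns.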
    

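The `render_data` extraction can likewise be exercised against a mock of the detail page; the JSON below is a cut-down, made-up stand-in for the real payload:

```python
import json
import re

# A cut-down mock of the detail page: the payload sits in a JavaScript
# variable as `render_data = [{...}][0]` (field values are made up).
html = ('var $render_data = [{"status": {"text": "hello", '
        '"reposts_count": 2, "comments_count": 1, "attitudes_count": 5}}][0] || {};')
data = json.loads(re.findall(r'render_data = \[(.*?)\]\[0\]', html, re.S)[0])
print(data['status']['reposts_count'])  # -> 2
```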
That is my scraping approach; the code is only meant to illustrate the idea.


Reprinted from blog.csdn.net/qq_40125653/article/details/104015559