Scraping Sina Weibo comments and commenter info with Python

I needed data for an analysis project, so I wrote a crawler that collects Sina Weibo comments, commenter profiles, and repost activity along with the like counts on reposts.

My rule for scraping Sina Weibo: always scrape the mobile site, never the PC site. The mobile data URLs are simple to analyze and easy to understand, without the pile of parameters the PC site makes you deal with. The mobile comments URL has the format:

'https://m.weibo.cn/api/comments/show?id={}&page={}'.format(weibo_id, i)  # weibo_id: id of the post being scraped
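As a quick sanity check, here is a minimal sketch that fetches page 1 for one post and dumps the raw JSON (the header values are placeholders, as in the full script below; you need a real User-Agent and cookie for the request to succeed):

import requests

headers = {'User-agent': 'Your-agent', 'Cookie': 'Your-cookie'}  # placeholders, as in the full script
url = 'https://m.weibo.cn/api/comments/show?id={}&page={}'.format('H1rMeFWa2', 1)
print(requests.get(url, headers=headers).json())  # raw JSON for page 1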

The weibo id itself is easy to find in the PC-site URL.

As an example, we'll scrape the lottery post Wang Sicong published after IG won the championship: https://weibo.com/1826792401/H1rMeFWa2?from=page_1003061826792401_profile&wvr=6&mod=weibotime&type=comment#_rnd1543159215181

Here weibo_id = 'H1rMeFWa2'. Substitute this id into the URL above and open it in a browser, and the response structure is plain to see.
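If you want to automate that step, a hypothetical one-liner can pull the id out of the PC-site URL; it is simply the last path segment before the query string:

pc_url = 'https://weibo.com/1826792401/H1rMeFWa2?from=page_1003061826792401_profile'
weibo_id = pc_url.split('?')[0].rstrip('/').split('/')[-1]
print(weibo_id)  # H1rMeFWa2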

I won't include screenshots or spell out the whole analysis; it all comes down to working with basic data structures such as dicts and lists.
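For orientation, this sketch walks one page of results. The nesting (data -> data) and the field names are the ones the full script below relies on, so treat them as a snapshot of the API at the time of writing:

import requests

headers = {'User-agent': 'Your-agent', 'Cookie': 'Your-cookie'}
url = 'https://m.weibo.cn/api/comments/show?id=H1rMeFWa2&page=1'
resjson = requests.get(url, headers=headers).json()
for item in resjson['data']['data']:      # comments sit under data -> data
    user = item['user']                   # nested dict with the commenter's profile
    print(user['id'], user['screen_name'], item['like_counts'], item['text'])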

The final results can be saved to an Excel file. One pitfall to watch for is character encoding: know the difference between utf-8 and other charsets, and set the encoding explicitly when writing files.
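Concretely, set the encoding at both write sites, as in this minimal sketch ('comments.txt' and 'demo.xls' are placeholder names):

import xlwt

txt = open('comments.txt', 'w', encoding='utf-8')  # utf-8 for the plain-text dump
txt.write('测试 test\n')
txt.close()

excel = xlwt.Workbook(encoding='utf-8')            # workbook-level encoding for xlwt
sheet = excel.add_sheet('sheet1')
sheet.write(0, 0, '测试 test')
excel.save('demo.xls')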

The full code is attached below for reference.

# -*- coding:utf-8 -*-
__author__ = 'TengYu'
import requests
import xlwt
import re
import json
import time

headers = {
    'User-agent' : 'Your-agent',
    'Cookie':'Your-cookie'
}

# Utility class: strips unwanted links, tags, etc. from the scraped comment text
class Tool:
    deleteImg = re.compile('<img.*?>')                            # inline images
    newLine = re.compile('<tr>|<div>|</tr>|</div>')               # structural tags
    deleteAite = re.compile('//.*?:')                             # "//user:" reply chains
    deleteAddr = re.compile('<a.*?>.*?</a>|<a href='+'\'https:')  # anchor tags / links
    deleteTag = re.compile('<.*?>')                               # any remaining tags
    deleteWord = re.compile('回复@|回覆@|回覆|回复')                # "reply to @..." prefixes

    @classmethod
    def replace(cls,x):
        x = re.sub(cls.deleteWord,'',x)
        x = re.sub(cls.deleteImg,'',x)
        x = re.sub(cls.deleteAite,'',x)
        x = re.sub(cls.deleteAddr, '', x)
        x = re.sub(cls.newLine,'',x)
        x = re.sub(cls.deleteTag,'',x)
        return x.strip()

# comment class: fetches the comments and each commenter's info
class comment(object):
    def get_comment(self):
        count = 0
        i = 0
        File = open('filename', 'w', encoding='utf-8')  # plain-text dump of the comment bodies
        excel = xlwt.Workbook(encoding='utf-8')
        sheet = excel.add_sheet('sheet1')
        sheet.write(0,0,'id')
        sheet.write(0,1,'sex')
        sheet.write(0,2,'name')
        sheet.write(0,3,'time')
        sheet.write(0,4,'loc')
        sheet.write(0,5,'text')
        sheet.write(0,6,'likes')
        while count < 400 and i < 101:
            i += 1
            url = 'https://m.weibo.cn/api/comments/show?id=H1rMeFWa2&page='+str(i) # url for the next page of comments
            print(url)
            try:
                response = requests.get(url,headers=headers)
                resjson = json.loads(response.text)
                data = resjson.get('data')
                datanext = data.get('data')
                for j in range(0,len(datanext)):
                    count += 1
                    temp = datanext[j]
                    text = temp.get('text')
                    text = Tool.replace(text)
                    File.write(str(text) + "\n")
                    like_counts = temp.get('like_counts')
                    created_at = temp.get('created_at')
                    user = temp.get('user')
                    screen_name = user.get('screen_name')
                    userid = user.get('id')
                    info_url = "https://m.weibo.cn/api/container/getIndex?containerid=230283"+str(userid)+"_-_INFO" # url of this commenter's profile info
                    r = requests.get(info_url,headers=headers) # reuse the same headers/cookie
                    infojson = json.loads(r.text)
                    infodata = infojson.get('data')
                    cards = infodata.get('cards')
                    sex = ''  # defaults; overwritten below if the profile cards provide them
                    loc = ''
                    for l in range(0,len(cards)):
                        temp = cards[l]
                        card_group = temp.get('card_group')
                        for m in range(0,len(card_group)):
                            s = card_group[m]
                            if s.get('item_name') == '性别':
                                sex = s.get('item_content')
                            if s.get('item_name') == '所在地':
                                loc = s.get('item_content')
                    if not sex:  # '' is falsy, so this also covers a missing field
                        sex = '未知'
                    if not loc:
                        loc = '未知'
                    sheet.write(count,0,userid)
                    sheet.write(count,1,str(sex))
                    sheet.write(count,2,str(screen_name))
                    sheet.write(count,3,created_at)
                    sheet.write(count,4,text)
                    sheet.write(count,5,str(loc))
                    sheet.write(count,6,like_counts)
                print ("已经抓取"+str(count)+"条数据")
                time.sleep(20)
            except Exception as e:
                print(e)
        excel.save('filename.xls')
        File.close()



if __name__ == '__main__':
    Comment = comment()
    Comment.get_comment()

Please respect the original work; if you repost, credit the source: https://blog.csdn.net/kr2563
