Background

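Colleagues in my department recently found that their research needed public-opinion monitoring. The research group pointed out that Sina Weibo has accumulated a large volume of opinion content, all of it posted by users. Look at how we ordinary netizens behave: we post updates on Weibo or WeChat Moments whenever we feel like it, complaining about one thing and praising another. One person's opinion may not amount to a tide, but the opinions of a large crowd aggregate strongly. To study public opinion you first need the content, and then you can apply natural language processing to mine it further. As for how to get that content from Sina Weibo, there were two options: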

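Use the Weibo API. The advantages are that it is easy to call and the returned data structure is clear at a glance; the disadvantage is that its restrictions are too severe.
Log in manually once, then send the Sina Weibo cookie from the browser along with the HTTP request, and you get the page you want. The disadvantage is that the cookie expires, so it may need replacing several times a day.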
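As a rough illustration of the second option, the cookie string copied from the browser can be split into name/value pairs, which makes it easier to swap in a fresh one when it expires. The helper name `parse_cookie_header` and the sample values are made up for this sketch; they are not part of the original script.

```python
def parse_cookie_header(raw):
    """Split a raw 'Cookie' header string into a dict of name -> value."""
    cookies = {}
    for part in raw.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty or malformed fragments
        name, value = part.split("=", 1)  # a value may itself contain '='
        cookies[name.strip()] = value.strip()
    return cookies


if __name__ == "__main__":
    # Placeholder values for illustration, not a real session
    raw = "SUB=abc123; SUHB=xyz; UOR=,,login.sina.com.cn"
    print(parse_cookie_header(raw))
```

The resulting dict can be passed to `requests.get` through its `cookies` parameter instead of pasting the whole string into a `Cookie` header.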

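In practice the first route was a dead end: I applied for developer verification several times and never received the verification email, and the developer documentation was written in 2012 and has not been updated since. So I switched to the second option: scraping.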

Approach

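This study mainly aims to collect public-opinion data on the top 10 centralized-apartment brands. The idea is to enter a brand keyword and then scrape three fields for every post under that keyword: the author, the post content, and the post time. For example, entering the keyword “泊寓” in the search bar at

https://s.weibo.com/

brings up a number of Weibo posts about 泊寓 apartments. With the content right there on the page, the remaining work is to scrape these posts and store them in a database.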
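Here is a minimal sketch of how a keyword becomes a results-page URL. It assumes the same query parameters that appear in the full code (`q`, `Refer`, `page`); the function name `build_search_url` is my own for this illustration.

```python
from urllib.parse import urlencode

SEARCH_URL = "https://s.weibo.com/weibo"


def build_search_url(keyword, page):
    """Return the search-results URL for a keyword and a 1-based page number."""
    # urlencode percent-encodes the UTF-8 bytes of non-ASCII characters
    query = urlencode({"q": keyword, "Refer": "SWeibo_box", "page": page})
    return f"{SEARCH_URL}?{query}"


if __name__ == "__main__":
    for page in range(1, 4):
        print(build_search_url("泊寓", page))
```

The keyword “泊寓” is encoded as `%E6%B3%8A%E5%AF%93`, the same value that shows up in the `Referer` header copied from the browser.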

Full code

# -*- coding: utf-8 -*-
"""
project_name:sina
@author: 帅帅de三叔
Created on Thu Aug  8 15:38:23 2019
"""
import requests # HTTP request module
from bs4 import BeautifulSoup # HTML parsing module
import urllib.parse  # URL-encode the Chinese keyword
import time # used to pause between requests
import pymysql # MySQL client module
header={"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        "Accept-Encoding": "gzip, deflate, br",
        "Accept-Language": "zh-CN,zh;q=0.9",
        "Connection": "keep-alive",
        "Cookie": "SINAGLOBAL=189893257055.06488.1565250256566; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WhIslrza.kd.kNEJcd42Uiz5JpX5KMhUgL.Fo2NSK2Xeh-RS022dJLoIpXLxKqLBo-L1h2LxKqLBo-LB-8.9CH8SC-RSEHWeBtt; UOR=,,login.sina.com.cn; webim_unReadCount=%7B%22time%22%3A1565594400683%2C%22dm_pub_total%22%3A0%2C%22chat_group_client%22%3A166%2C%22allcountNum%22%3A182%2C%22msgbox%22%3A0%7D; ALF=1597195209; SSOLoginState=1565659210; SCF=AvgIv74Ms33G2ftEbwjdl6H0E-HCYkHVFewxNE8Kd7QU4g3AjjXO0yp9JQGZkYk9Zi-NqteyTja21CeazpUGQcc.; SUB=_2A25wVmAaDeRhGedJ7lMV8CvEzD2IHXVTItbSrDV8PUNbmtBeLUL6kW9NUeEu3p4s_PkH8hhC31wYMQ5PSb-t_qZq; SUHB=0BLpscTreRdnfr; _s_tentry=login.sina.com.cn; Apache=4077173518697.288.1565659211864; ULV=1565659211929:4:4:2:4077173518697.288.1565659211864:1565572651396; WBStorage=edfd723f2928ec64|undefined",
        "Host": "s.weibo.com",
        "Referer": "https://s.weibo.com/weibo?q=%E6%B3%8A%E5%AF%93&wvr=6&Refer=SWeibo_box",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "same-origin",
        "Sec-Fetch-User": "?1",
        "Upgrade-Insecure-Requests": "1",
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"}
db=pymysql.connect(host="localhost",user="root",password="123456",database="weibo",charset="utf8mb4") # connect to the database; newer PyMySQL versions require keyword arguments
cursor=db.cursor() # get a cursor
cursor.execute("DROP TABLE IF EXISTS weibo_anxingongyu") # start from scratch
c_sql="""create table weibo_anxingongyu(
         author varchar(12),
         content varchar(300),
         time varchar(30)
            )Engine=InnoDB AUTO_INCREMENT=1 Default charset=utf8mb4""" # utf8mb4 lets emoji be inserted into the database
cursor.execute(c_sql) # create the table

def get_page(keywords,num): # build the request URLs, one per results page
    original_url="https://s.weibo.com/weibo?q={}&Refer=SWeibo_box&page={}"
    for page in range(1,num+1):
        url=original_url.format(keywords,page)
        yield url
def get_content(url): # fetch one results page and store its posts
    time.sleep(1) # pause for one second between requests
    response=requests.get(url,headers=header,timeout=10) # request with the cookie in the headers
    soup=BeautifulSoup(response.text,'lxml') # parse the page
    #print(soup)
    feed=soup.find("div",class_="m-con-l",id="pl_feedlist_index") # container holding the posts
    if feed is None: # no feed container on the page, possibly because the cookie has expired
        print("no feed found:",url)
        return
    contents=feed.find_all("div",class_="card-feed") # one card per post
    #print(contents[-1])
    for content in contents:
        author=content.find("a",class_="name").get_text() # post author
        comment=content.find("p",class_="txt").get_text().strip().replace("展开全文c","") # post text snippet, with the "expand full text" link text removed
        #full_comment=content.find_element_by_xpath("//a[@node-type='feed_list_content_full']") # full text
        date_time=content.find("p",class_="from").find("a").get_text().strip() # post date
        print(author,comment,date_time)
        insert_data="insert into weibo_anxingongyu(author,content,time) values(%s,%s,%s)" # parameterized insert statement
        weibo_data=[author,comment,date_time] # values to insert
        cursor.execute(insert_data,weibo_data) # run the insert
        db.commit() # commit explicitly
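if __name__=="__main__":
    keyword=urllib.parse.quote(input("please input keywords:")) # read the keyword and URL-encode it
    num=int(input("please input total pages:")) # read the total number of pages as an integer
    for link in get_page(keyword,num):
        get_content(link)
    db.close() # close the database connection when done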

Code walkthrough

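The key to scraping Weibo is still to log in once and capture the cookie, then use it to build the request headers; remember to refresh the headers promptly when the cookie expires. First, get_page() takes two arguments, the keyword and the page count num, and builds all the request URLs. The second step uses get_content() to extract the required fields. Finally the data is stored in a MySQL database. When storing it, mind how many bytes emoji take up: using utf8mb4 allows emoji to be inserted into the database.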
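To make the emoji point concrete, here is a small sketch that counts how many bytes a character takes in UTF-8. MySQL's legacy utf8 charset stores at most 3 bytes per character, so any 4-byte character needs utf8mb4. The helper names `utf8_width` and `needs_utf8mb4` are made up for this illustration.

```python
def utf8_width(ch):
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text):
    """True if any character in text takes 4 bytes in UTF-8."""
    return any(utf8_width(ch) == 4 for ch in text)


if __name__ == "__main__":
    for sample in ["a", "泊", "😀"]:
        print(repr(sample), utf8_width(sample))
    print(needs_utf8mb4("泊寓😀"))
```

Ordinary Chinese characters such as 泊 fit in 3 bytes, while emoji such as 😀 take 4, which is why the table and the connection both use utf8mb4.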