Scraping Baidu Index + logging in with pyppeteer (solving the rotating captcha)

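The points on the Baidu Index line charts are encrypted using two strings.

The data interface returns a data value, which serves as e, and a uniqid, which is used to request the t value.

Once both are obtained, they are passed to a handler function called decrypt.

Feeding t and e into decrypt yields exactly the data we want. The Python version is as follows: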

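def decrypt_py(t, e):
    """
    :param t: key string (first half: cipher characters, second half: their plain values)
    :param e: encrypted data string
    :return: the decoded data as a list of strings
    """
    a = dict()
    length = int(len(t) / 2)
    # Map the o-th character of the first half to the o-th character of the second half
    for o in range(length):
        a[t[o]] = t[length + o]
    r = "".join([a[each] for each in e]).split(",")

    return r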
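
To make the mapping concrete, here is a small worked example. The key and ciphertext below are made up for illustration (they are not real Baidu responses), and `decrypt_demo` is only a compact restatement of the function above, not a different algorithm.

```python
def decrypt_demo(t, e):
    """Compact restatement of decrypt_py, for illustration only."""
    half = len(t) // 2
    # Pair the i-th character of the first half with the i-th of the second half.
    table = dict(zip(t[:half], t[half:]))
    return "".join(table[ch] for ch in e).split(",")


# Made-up key: "a".."j" stand for "0".."9" and "k" stands for ",".
sample_t = "abcdefghijk0123456789,"
# Made-up ciphertext for the two values 123 and 450.
sample_e = "bcdkefa"

print(decrypt_demo(sample_t, sample_e))  # ['123', '450']
```

Any character of e that is missing from the first half of t raises a KeyError, which is a handy sign that the key and the data do not belong together.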

Province and city names are stored in dictionaries and looked up by id.

#baidu_id.py
city={1:"济南",2:"贵阳",3:"黔南",4:"六盘水",5:"南昌",6:"九江",7:"鹰潭",8:"抚州",9:"上饶",10:"赣州",11:"重庆",13:"包头",14:"鄂尔多斯",15:"巴彦淖尔",16:"乌海",17:"阿拉善盟",19:"锡林郭勒盟",20:"呼和浩特",21:"赤峰",22:"通辽",25:"呼伦贝尔",28:"武汉",29:"大连",30:"黄石",31:"荆州",32:"襄阳",33:"黄冈",34:"荆门",35:"宜昌",36:"十堰",37:"随州",38:"恩施",39:"鄂州",40:"咸宁",41:"孝感",42:"仙桃",43:"长沙",44:"岳阳",45:"衡阳",46:"株洲",47:"湘潭",48:"益阳",49:"郴州",50:"福州",51:"莆田",52:"三明",53:"龙岩",54:"厦门",55:"泉州",56:"漳州",57:"上海",59:"遵义",61:"黔东南",65:"湘西",66:"娄底",67:"怀化",68:"常德",73:"天门",74:"潜江",76:"滨州",77:"青岛",78:"烟台",79:"临沂",80:"潍坊",81:"淄博",82:"东营",83:"聊城",84:"菏泽",85:"枣庄",86:"德州",87:"宁德",88:"威海",89:"柳州",90:"南宁",91:"桂林",92:"贺州",93:"贵港",94:"深圳",95:"广州",96:"宜宾",97:"成都",98:"绵阳",99:"广元",100:"遂宁",101:"巴中",102:"内江",103:"泸州",104:"南充",106:"德阳",107:"乐山",108:"广安",109:"资阳",111:"自贡",112:"攀枝花",113:"达州",114:"雅安",115:"吉安",117:"昆明",118:"玉林",119:"河池",123:"玉溪",124:"楚雄",125:"南京",126:"苏州",127:"无锡",128:"北海",129:"钦州",130:"防城港",131:"百色",132:"梧州",133:"东莞",134:"丽水",135:"金华",136:"萍乡",137:"景德镇",138:"杭州",139:"西宁",140:"银川",141:"石家庄",143:"衡水",144:"张家口",145:"承德",146:"秦皇岛",147:"廊坊",148:"沧州",149:"温州",150:"沈阳",151:"盘锦",152:"哈尔滨",153:"大庆",154:"长春",155:"四平",156:"连云港",157:"淮安",158:"扬州",159:"泰州",160:"盐城",161:"徐州",162:"常州",163:"南通",164:"天津",165:"西安",166:"兰州",168:"郑州",169:"镇江",172:"宿迁",173:"铜陵",174:"黄山",175:"池州",176:"宣城",177:"巢湖",178:"淮南",179:"宿州",181:"六安",182:"滁州",183:"淮北",184:"阜阳",185:"马鞍山",186:"安庆",187:"蚌埠",188:"芜湖",189:"合肥",191:"辽源",194:"松原",195:"云浮",196:"佛山",197:"湛江",198:"江门",199:"惠州",200:"珠海",201:"韶关",202:"阳江",203:"茂名",204:"潮州",205:"揭阳",207:"中山",208:"清远",209:"肇庆",210:"河源",211:"梅州",212:"汕头",213:"汕尾",215:"鞍山",216:"朝阳",217:"锦州",218:"铁岭",219:"丹东",220:"本溪",221:"营口",222:"抚顺",223:"阜新",224:"辽阳",225:"葫芦岛",226:"张家界",227:"大同",228:"长治",229:"忻州",230:"晋中",231:"太原",232:"临汾",233:"运城",234:"晋城",235:"朔州",236:"阳泉",237:"吕梁",239:"海口",241:"万宁",242:"琼海",243:"三亚",244:"儋州",246:"新余",253:"南平",256:"宜春",259:"保定",261:"唐山",262:"南阳",263:"新乡",264:"开封",265:"焦作",266:"平顶山",268:"许昌",269:"永州",270:"吉林",271:"铜川",272:"安康",273:"宝鸡",274:"商洛",275:"渭南",276:"汉中",277:"咸阳",278:"榆林",280:"石河子",281:"庆阳",282:"定西",283:"武威",284:"酒泉",285:"张掖",286:"嘉峪关",287:"台州",288:"衢州",289:"宁波",291:"眉山",292:"邯郸",293:"邢台",295:"伊春",297:"大兴安岭",300:"黑河",301:"鹤岗",302:"七台河",303:"绍兴",304:"嘉兴",305:"湖州",306:"舟山",307:"平凉",308:"天水",309:"白银",310:"吐鲁番",311:"昌吉",312:"哈密",315:"阿克苏",317:"克拉玛依",318:"博尔塔拉",319:"齐齐哈尔",320:"佳木斯",322:"牡丹江",323:"鸡西",324:"绥化",331:"乌兰察布",333:"兴安盟",334:"大理",335:"昭通",337:"红河",339:"曲靖",342:"丽江",343:"金昌",344:"陇南",346:"临夏",350:"临沧",352:"济宁",353:"泰安",356:"莱芜",359:"双鸭山",366:"日照",370:"安阳",371:"驻马店",373:"信阳",374:"鹤壁",375:"周口",376:"商丘",378:"洛阳",379:"漯河",380:"濮阳",381:"三门峡",383:"阿勒泰",384:"喀什",386:"和田",391:"亳州",395:"吴忠",396:"固原",401:"延安",405:"邵阳",407:"通化",408:"白山",410:"白城",417:"甘孜",422:"铜仁",424:"安顺",426:"毕节",437:"文山",438:"保山",456:"东方",457:"阿坝",466:"拉萨",467:"乌鲁木齐",472:"石嘴山",479:"凉山",480:"中卫",499:"巴音郭楞",506:"来宾",514:"北京",516:"日喀则",520:"伊犁",525:"延边",563:"塔城",582:"五指山",588:"黔西南",608:"海西",652:"海东",653:"克孜勒苏柯尔克孜",654:"天门仙桃",655:"那曲",656:"林芝",657:"None",658:"防城",659:"玉树",660:"伊犁哈萨克",661:"五家渠",662:"思茅",663:"香港",664:"澳门",665:"崇左",666:"普洱",667:"济源",668:"西双版纳",669:"德宏",670:"文昌",671:"怒江",672:"迪庆",673:"甘南",674:"陵水黎族自治县",675:"澄迈县",676:"海南",677:"山南",678:"昌都",679:"乐东黎族自治县",680:"临高县",681:"定安县",682:"海北",683:"昌江黎族自治县",684:"屯昌县",685:"黄南",686:"保亭黎族苗族自治县",687:"神农架",688:"果洛",689:"白沙黎族自治县",690:"琼中黎族苗族自治县",691:"阿里",692:"阿拉尔",693:"图木舒克"}
province={901:"山东",902:"贵州",903:"江西",904:"重庆",905:"内蒙古",906:"湖北",907:"辽宁",908:"湖南",909:"福建",910:"上海",911:"北京",912:"广西",913:"广东",914:"四川",915:"云南",916:"江苏",917:"浙江",918:"青海",919:"宁夏",920:"河北",921:"黑龙江",922:"吉林",923:"天津",924:"陕西",925:"甘肃",926:"新疆",927:"河南",928:"安徽",929:"山西",930:"海南",931:"台湾",932:"西藏",933:"香港",934:"澳门"}

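These province and city tables can be obtained from a js file. When you click 人群画像 (audience profile), search the Network panel for a place name and you will find a js file; open it and search again, and you will see a great many cities.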
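
Once the js file is found, its id-to-name pairs can be pulled into a Python dict instead of being copied by hand. The sketch below assumes the pairs appear as `id:"name"`, the same shape as the dictionaries above; `parse_id_map` and the sample fragment are hypothetical, and the real file may be laid out differently.

```python
import re


def parse_id_map(js_text):
    """Collect every `id:"name"` pair in a JS snippet into a {int: str} dict."""
    pairs = re.findall(r'(\d+)\s*:\s*"([^"]*)"', js_text)
    return {int(code): name for code, name in pairs}


# Hypothetical fragment in the same shape as the dictionaries above.
sample_js = 'var cityMap={1:"济南",2:"贵阳",514:"北京"};'
print(parse_id_map(sample_js))
```

If the real file quotes its keys or nests the tables, the regular expression would need adjusting.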

Then you can get started. You need to log in once and copy the cookie from the browser.
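
The cookie copied from the browser is one long `name=value; name=value` string. A minimal sketch of splitting it into a dict, which requests accepts through its `cookies=` argument, is shown below. The cookie names and values are placeholders, not the names Baidu actually uses, and `cookie_header_to_dict` is a hypothetical helper.

```python
def cookie_header_to_dict(raw_cookie):
    """Split a 'k1=v1; k2=v2' Cookie header string into a dict."""
    cookies = {}
    for part in raw_cookie.split(";"):
        if "=" in part:
            # Split on the first "=" only, since values may contain "=" themselves.
            name, value = part.strip().split("=", 1)
            cookies[name] = value
    return cookies


# Placeholder values; paste the real Cookie header copied from the browser.
raw = "SESSION_A=placeholder-value; SESSION_B=abc=def"
cookies = cookie_header_to_dict(raw)
print(cookies)
# With requests, the dict can then be passed as: requests.get(url, cookies=cookies)
```

Pasting the raw string straight into a "Cookie" header, as the script below does, works just as well; the dict form is only easier to inspect.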

Updated on 5.20 to match the interface changes.

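Some readers got stuck as soon as they saw the interface had changed. I recommend a URL encoding/decoding website: decode the URL first and it becomes much easier to read.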

http://tool.chinaz.com/tools/urlencode.aspx
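
The same decoding can be done locally with Python's standard library. This sketch decodes a percent-encoded query fragment and, going the other way, builds and encodes the JSON-style `word` parameter; the keyword "test" is just a placeholder.

```python
import json
from urllib.parse import quote, unquote

# Decode a percent-encoded query fragment, as the website above does.
encoded = "wordlist%5B%5D=test"
print(unquote(encoded))  # wordlist[]=test

# Going the other way: build the JSON `word` parameter and percent-encode it.
# ensure_ascii=False keeps non-ASCII keywords literal before quoting.
word_param = json.dumps(
    [[{"name": "test", "wordType": 1}]],
    separators=(",", ":"),
    ensure_ascii=False,
)
encoded_word = quote(word_param)
print(word_param)
print(encoded_word)
```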

import requests
import datetime
from utils.baidu_id import province, city


def getIndex(word="我和我的祖国"):
    """
    Search index
    :param word: keyword
    :return: None; prints the peak news for the all, pc and wise series
    """
    insert_word = """[[{"name":"%s","wordType":1}]]""" % word
    url = f"http://index.baidu.com/api/SearchApi/index?word={insert_word}&area=0&days=30"
    rep_json = get_rep_json(url)
    generalRatio = rep_json['data']['generalRatio']
    uniqid = rep_json['data']['uniqid']
    all_index_e = rep_json['data']['userIndexes'][0]['all']['data']
    pc_index_e = rep_json['data']['userIndexes'][0]['pc']['data']
    wise_index_e = rep_json['data']['userIndexes'][0]['wise']['data']
    t = getPtbk(uniqid)
    startDate = rep_json['data']['userIndexes'][0]['wise']['startDate']
    all_news = getTopNews(decrypt_py(t, all_index_e), startDate, word)
    pc_news = getTopNews(decrypt_py(t, pc_index_e), startDate, word)
    wise_news = getTopNews(decrypt_py(t, wise_index_e), startDate, word)
    for each in (all_news, pc_news, wise_news):
        print(each)
    return None


def getFeedIndex(word="我和我的祖国"):
    """
    :param word: keyword
    :return: feed index
    """
    insert_word="""[[{"name":"%s","wordType":1}]]"""%word
    url = "http://index.baidu.com/api/FeedSearchApi/getFeedIndex?word=%s&area=0&days=30" % insert_word
    feed_index_data = get_rep_json(url)
    uniqid = feed_index_data['data']['uniqid']
    data = feed_index_data["data"]['index'][0]
    generalRatio = data['generalRatio']  # feed index overview
    e = data['data']
    t = getPtbk(uniqid)

    return decrypt_py(t, e)


def getNewsDate(word="我和我的祖国"):
    """
    :param word: keyword
    :return: news at the peaks of the media index
    """
    insert_word = """[[{"name":"%s","wordType":1}]]""" % word
    url = f"http://index.baidu.com/api/NewsApi/getNewsIndex?area=0&word={insert_word}&days=30"
    res_json = get_rep_json(url)['data']

    generalRatio = res_json["index"][0]['generalRatio']
    e = res_json['index'][0]['data']
    start_date = res_json['index'][0]['startDate']
    t = getPtbk(res_json['uniqid'])

    news = getTopNews(decrypt_py(t, e), start_date, word)

    return news


def getTopNews(numList: list, start_date, word):
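    """
    Find the peaks in the current index list,
    convert them to date strings,
    pass the joined date string to the news data interface,
    and return the news data.
    :param numList: list of index values
    :param start_date: start date
    :param word: keyword
    :return: news at the peaks
    """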
    start_date = string_toDatetime(start_date)
    hill_tops = getHilltop(numList)
    hill_tops_date = [datetime_toString(start_date + datetime.timedelta(days=index)) for index in hill_tops]
    news = getNews(",".join(hill_tops_date), word)["data"][word]

    return news


def getNews(dts, word):
    """
    Fetch data from the media index interface
    :param dts: date strings joined with commas, e.g. dts=2019-10-06,2019-10-10,2019-10-12,2019-10-16,2019-10-21,2019-10-24
    :param word: keyword
    :return: data returned by the interface
    """
    url = f"http://index.baidu.com/api/NewsApi/checkNewsIndex?dates[]={dts}&type=day&words={word}"
    return get_rep_json(url)


def getHilltop(numList: list):
    """
    :param numList: a list of numeric values
    :return: list of indexes of the peaks
    """
    numList = list(map(lambda x: float(x) if x else 0, numList))
    hillTops = [index for index, each in enumerate(numList) if
                index and index < len(numList) - 1 and each > numList[index - 1] and each > numList[index + 1]]

    return hillTops


def getMulti(word="我和我的祖国"):
    """Demand graph
    pv: search popularity; ratio: rate of change in searches; sim: relevance
    """
    url = f"http://index.baidu.com/api/WordGraph/multi?wordlist%5B%5D={word}"
    word_data = get_rep_json(url)['data']['wordlist'][0]
    if word_data['keyword']:
        print(word_data['wordGraph'])


def getRegion(word="我和我的祖国", startDate='2019-09-17', endDate='2019-10-17'):
    """Regional distribution"""
    url = f"http://index.baidu.com/api/SearchApi/region?region=0&word={word}&startDate={startDate}&endDate={endDate}"
    region = get_rep_json(url)['data']['region'][0]
    region_city = [{'city': city[int(city_n)], 'number': region['city'][city_n]} for city_n in region['city']]
    region_prov = [{'prov': province[int(prov_n)], 'number': region['prov'][prov_n]} for prov_n in region['prov']]
    print(region_city, region_prov)


def getBaseAttributes(word="我和我的祖国"):
    """Audience attributes"""
    url = f"http://index.baidu.com/api/SocialApi/baseAttributes?wordlist[]={word}"
    rep_data = get_rep_json(url)['data']['result']
    return rep_data


def getInterest(word="我和我的祖国"):
    """Interest distribution"""
    url = f"http://index.baidu.com/api/SocialApi/interest?wordlist[]={word}"
    rep_data = get_rep_json(url)['data']['result']
    return rep_data


def string_toDatetime(string):
    # Convert a string to a datetime
    return datetime.datetime.strptime(string, "%Y-%m-%d")


def datetime_toString(dt):
    # Convert a datetime to a string
    return dt.strftime("%Y-%m-%d")


def getPtbk(uniqid):
    url = f"http://index.baidu.com/Interface/ptbk?uniqid={uniqid}"
    return get_rep_json(url)['data']


def decrypt_py(t, e):
    """
    :param t: key string returned by getPtbk
    :param e: encrypted data string
    :return: the decoded data as a list of strings
    """
    a = dict()
    length = int(len(t) / 2)
    for o in range(length):
        a[t[o]] = t[length + o]
    r = "".join([a[each] for each in e]).split(",")
    print(r)

    return r


def get_rep_json(url):
    """
    Fetch JSON
    :param url: interface URL to request
    :return: parsed JSON response
    """
    headers = {
        "Cookie": '',  # fill in the cookie from your browser
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36"
    }
    response = requests.get(url, headers=headers)
    response_data = response.json()
    # print(response_data)
    return response_data


def main():
    getFeedIndex()
    getNewsDate()
    getIndex()
    getRegion()
    getBaseAttributes()
    getInterest()


if __name__ == "__main__":
    main()
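
The peak-finding step used by getHilltop and getTopNews is easy to check offline. The sketch below restates that logic with made-up index values: a point counts as a peak when it is strictly greater than both neighbours, and each peak index is turned into a date by counting days from the start date. `find_peaks` and `peak_dates` are illustrative names, not part of the script above.

```python
import datetime


def find_peaks(values):
    """Return indexes whose value is strictly greater than both neighbours."""
    nums = [float(v) if v else 0.0 for v in values]
    return [i for i in range(1, len(nums) - 1)
            if nums[i] > nums[i - 1] and nums[i] > nums[i + 1]]


def peak_dates(values, start_date):
    """Map each peak index to a YYYY-MM-DD string, counting from start_date."""
    start = datetime.datetime.strptime(start_date, "%Y-%m-%d")
    return [(start + datetime.timedelta(days=i)).strftime("%Y-%m-%d")
            for i in find_peaks(values)]


# Made-up index values; empty strings are treated as 0, as in getHilltop.
sample = ["10", "30", "20", "", "50", "40"]
print(find_peaks(sample))                # [1, 4]
print(peak_dates(sample, "2019-10-01"))  # ['2019-10-02', '2019-10-05']
```

Note that the first and last points can never be peaks under this rule, and a flat plateau (two equal neighbouring maxima) is not reported at all.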

There are also topic interfaces. I looked them up and they work the same way; if you are interested, try them yourself.

Topics:

Topic search index: http://insight.baidu.com/base/search/trend/general?id=23734&dateType=30&filterType=1&source=0

&filterType=1&source=1 # PC

&source=2 # mobile
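
The three variants differ only in the `source` parameter, so the URL can be assembled from a small parameter dict. This is a sketch: `topic_trend_url` is a hypothetical helper, and reading `source=0` as "all devices" is an assumption based on the PC and mobile notes above.

```python
from urllib.parse import urlencode

BASE = "http://insight.baidu.com/base/search/trend/general"


def topic_trend_url(topic_id, source=0, date_type=30, filter_type=1):
    """Build the topic search index URL for one source value."""
    params = {
        "id": topic_id,
        "dateType": date_type,
        "filterType": filter_type,
        "source": source,  # 1 = PC, 2 = mobile; 0 assumed to mean all
    }
    return f"{BASE}?{urlencode(params)}"


for label, source in (("all", 0), ("pc", 1), ("mobile", 2)):
    print(label, topic_trend_url(23734, source))
```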

Topic news and topic video

Topic news: http://index.baidu.com/Interface/Newwordgraph/getTopicFeed?nodeid=23935

Key: http://index.baidu.com/Interface/api/ptbkTopic?uniqid=5dad242a566a46.43359139

Topic video: /api/videoIndex/getVideoIndex?nodeid=23935

Key: http://index.baidu.com/Interface/api/ptbkTopic?uniqid=5dad242a71d612.53363283

Brand attention

http://insight.baidu.com/base/search/topic/attentionBrand?id=23734

Search region distribution: http://insight.baidu.com/base/search/region/general?id=23734&dateType=30&filterType=1&pageSize=40

Audience attributes:

http://insight.baidu.com/base/search/Topic/baseAttributes?nodeid=23734

Interest distribution:

http://insight.baidu.com/base/search/Topic/interest?nodeid=23734&typeid=

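Simulated login: solving the rotating captcha

I am no longer who I used to be; now I can get past it.

There is no mountain in the world that cannot be climbed; and if there is one, stand on the shoulders of giants and climb again.

Here I come, and I have brought a model with me. Are you still struggling with the rotating captcha? From now on you can move on to struggling with something else!

Come and see the results

Happy, right? This post is already too long to leave room for more self-praise, so I wrote that part up in a separate blog post:

Rotating drag captcha solution

If anything here is wrong, I hope you will point it out. Right, staying happy is all that matters!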


Reposted from blog.csdn.net/Laozizuiku/article/details/102602401