Scraping Real-Time Data from Maoyan Movies

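I'm back, guys! Seeing The Wandering Earth, Pegasus, and Crazy Alien all doing big numbers at the box office lately, I decided to use Python to collect box-office statistics from Maoyan. Here goes: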

  • Environment: Python 3.6
  • IDE: PyCharm Professional
  • Packages used: requests, lxml, fontTools (third-party); base64, re, csv (standard library)

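First, look at the page and its source. In the source, the very numbers we want to scrape are rendered through an obfuscated ("encrypted") font. To find the font file, search the HTML for the keyword font-face. You will find a long string: everything after base64, up to the closing parenthesis just before format should be the contents of the font file, in base64-encoded form. Copy that string out, base64-decode it, and save it as a local ttf file (ttf is one type of font file).
We decode first, then save the result to the local file zt02.ttf.
Create a Python file 01.py (any name will do):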

import base64
font_face='d09GRgABAAAAAAggAAsAAAAAC7gAAQAAAAAAAAAAAAAAAAAAAAAAAAAAAABHU1VCAAABCAAAADMAAABCsP6z7U9TLzIAAAE8AAAARAAAAFZW7lbiY21hcAAAAYAAAAC8AAACTDw2sk9nbHlmAAACPAAAA5IAAAQ0l9+jTWhlYWQAAAXQAAAALwAAADYUQEAcaGhlYQAABgAAAAAcAAAAJAeKAzlobXR4AAAGHAAAABIAAAAwGhwAAGxvY2EAAAYwAAAAGgAAABoHMgY2bWF4cAAABkwAAAAfAAAAIAEZADxuYW1lAAAGbAAAAVcAAAKFkAhoC3Bvc3QAAAfEAAAAXAAAAI/OSrqdeJxjYGRgYOBikGPQYWB0cfMJYeBgYGGAAJAMY05meiJQDMoDyrGAaQ4gZoOIAgCKIwNPAHicY2Bk0mWcwMDKwMHUyXSGgYGhH0IzvmYwYuRgYGBiYGVmwAoC0lxTGBwYKr5nM+v812GIYdZhuAIUZgTJAQDfRQt0eJzFkT0OgkAQhd/Kn4qFobKlorGg5jqUXIATeAIbTkFi5XlYQiioCIES3zI0JtDqbL5N5u1mZjIPgAPAIndiA+oNBRMvqmrRLZwX3caD+Q1XKkfkVa0DnTRRW3RlHw7pGE/ZPPPH/stWKFbcOubF9HHY98QJXBzgw6Ps7lT6Qaj/tf6Oy3I/18wn+QpHrGrB+KoDgZuETgTjeRMJxvO2ELhndKVgPO9DgbvHkAp0AWMsmL5TJsD7AHWQP0B4nEWTz3MSZxzG33dxdiNBQmSXFaLAsmR3gSRs9hcBNgtCQPOTkgAhRDFkFDGtmmaMjZppFdvOaKd/gL10podenB6825lOPVUdm0P/gM702lud8ZKBvktourfv7uF5ns/zLIAAdP8GEiABBkBMpkgvKQD0oKv7HgPY74AFUfTFC2XJgDEDTkOFxwmcDXCqosmSF1KkHbIBnuOhwrEBnCJpl6R9M6iL4SRvxwnojo7F1u5/sTmzqyfvFsqKZoWt5alkJRS+V/hJV0cN1aONDJzAwx7Pw62bX81/2376Q3kiWobJhbX6Uj4UWT32AzvIjx+MAQA5HokROGGHlIGsuY5cIE8xZIp20ZA0HWuqwgVw+J2NCiphf5i2nfKvy6v7iauZW08Xsp+WNdXWecbnOK1YuFvCXAo9Svvi51a0yYl2M3tn+vuXB/VlcaLUeTNWjtQWZ1crwHLMhQQ+MA6A0+Rg6hMmChyBQHeMdMlSTDPFLaSLRod2dL34ZPvlzlYm1/7jfDovZhSRZbLN82cDo4GQX6ZCpc+L8Eth66MbtxdagutK5vK+oTfy9R+VlN9Xz6Y7T/gc6aRI/uFyETGBJhj4AXkJAjBKMUrMsKCGTCVUBMEb0PRlMiLsFgJ+6PCD1hEhziUKVGheTy3A2sm9d3tMhMyKgkSfHiiVfF53NKr6xblzU9dm5/LW5o2d8viiRKcEZvwMfepIs/sP7CLNSH8VmtKrIkb3aPeEtV4JXoi0e1BQMXx76IJmlPmQ7gla7fG1lCbPWKuOeKKUkCZVaTJ14Unryv7J3+YzlX1esC7C5LSYMjJDteik50x1Y941dCl/+fF27f8tHPY7AE6UG/ZK76c114gakDQHz6GNhN2e1tJO8qzDYbOPXCtc1/O14v2VsPAgOA4b7bml0no4rd9MNfmllbnqmxd3duFGMiFnejoo8CH2FlgBys6ojArlYZliKX7YArOd1zB/sdGo/vm8CA86YvH5IXr383EvXewVcCJ3KkMh+jjBms2Y9UThAZudkZ3ugXU47PAlvWkGu1XOBRv3HqRrH4eb+t7t+CWuz/o9dgL71UzZZ30E2MlQDNHfvfkvor/va+uslq5WspEsuZKDVzt/8f4Ztv4onvtsc9oYeJXLbD6rcD4r3C794qIfXd+4uKpN1f7zuotY2lBCdhhtRe25lOFu1d8SZqdGhME4Jnp1RzkguUUagH8BuIfgwgAAeJxjY
GRgYADiXJP8rfH8Nl8ZuFkYQODGhDojBP3/DQsD03kgl4OBCSQKACGOCnAAeJxjYGRgYNb5r8MQw8IAAkCSkQEV8AAAM2IBzXicY2EAghQGBiYd4jAAN4wCNQAAAAAAAAAMAFQAmADeARgBWgGOAaoBzgIAAhoAAHicY2BkYGDgYTBgYGYAASYg5gJCBob/YD4DAA6DAVYAeJxlkbtuwkAURMc88gApQomUJoq0TdIQzEOpUDokKCNR0BuzBiO/tF6QSJcPyHflE9Klyyekz2CuG8cr7547M3d9JQO4xjccnJ57vid2cMHqxDWc40G4Tv1JuEF+Fm6ijRfhM+oz4Ra6eBVu4wZvvMFpXLIa40PYQQefwjVc4Uu4Tv1HuEH+FW7i1mkKn6Hj3Am3sHC6wm08Ou8tpSZGe1av1PKggjSxPd8zJtSGTuinyVGa6/Uu8kxZludCmzxMEzV0B6U004k25W35fj2yNlCBSWM1paujKFWZSbfat+7G2mzc7weiu34aczzFNYGBhgfLfcV6iQP3ACkSaj349AxXSN9IT0j16JepOb01doiKbNWt1ovippz6sVYYwsXgX2rGVFIkq7Pl2PNrI6qW6eOshj0xaSq9mpNEZIWs8LZUfOouNkVXxp/d5woqebeYIf4D2J1ywQB4nG3JOw6AIBCE4R18oKh3kYWAlGrwLjZ2Jh7fuLRO8yXzk6IyQ/8boVChRoMWGh16GAwYMREefV/n4YP7zJy9GDiKzq3Sma0Yw1Z+y+KSkrivh/TgZqIXEYgXYQ=='
b = base64.b64decode(font_face)     # decode the base64 text into raw font bytes
with open('zt02.ttf', 'wb') as f:   # binary mode, since these are raw bytes
    f.write(b)
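
Copying the string out by hand works for a one-off, but the same step can be automated. The following is a minimal sketch, assuming the font is embedded as a data URI inside the @font-face rule; the helper name extract_font_bytes and the sample HTML are made up for illustration.

```python
import base64
import re

# Assumption: the font appears in the page as
#   url(data:...;base64,<payload>) format(...)
# so the payload is the run of base64 characters right before ")".
FONT_RE = re.compile(r"base64,([A-Za-z0-9+/=]+)\)")


def extract_font_bytes(html):
    """Return the decoded font bytes embedded in html, or None if absent."""
    match = FONT_RE.search(html)
    if match is None:
        return None
    return base64.b64decode(match.group(1))


# Hypothetical, heavily shortened page fragment (not a complete font)
sample_html = (
    "@font-face{font-family:cs;src:url(data:application/font-woff;"
    "charset=utf-8;base64,d09GRgABAAAA) format('woff');}"
)
font_bytes = extract_font_bytes(sample_html)
print(font_bytes[:4])  # b'wOFF'
```

Incidentally, the string in 01.py also begins with d09G, which decodes to the signature wOFF, so the payload appears to be a WOFF font rather than a raw TTF. fontTools identifies the format from the file contents, so saving it under a .ttf name should not matter to TTFont.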

Next we need a tool called FontCreator:
Download link: click here to download FontCreator
Extraction code: v3j1
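After installing it, click File in the top-left corner and open the zt02.ttf file saved above. You can see the character that each code corresponds to.
Comparing the FontCreator view with the numbers shown on the page tells us which digit each code stands for.
The approach is as follows. First download one font file and save it locally (say, 01.ttf), then work out by hand which code corresponds to each digit. When we request the page again, download the new font file and save it locally as well (say, 02.ttf). Suppose a digit in the page has the code AAAA; how do we determine which digit AAAA stands for? We look up the glyph object for AAAA in 02.ttf and compare it one by one with the glyph objects in 01.ttf until we find an identical one. We then take that object's code in 01.ttf and use the hand-built table to read off the digit.
Then create another Python file, 02.py: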

from fontTools.ttLib import TTFont
font = TTFont('01.ttf')    # open the local font file 01.ttf
font.saveXML('01.xml')     # dump it to XML so the tables can be inspected

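Open the XML and expand the tags: <GlyphOrder…> contains all the glyph codes. Note that the first two entries are not codes for the digits 0-9 and need to be dropped: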
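
Dropping a fixed number of leading entries works as long as the glyph order keeps that layout. As a hedged alternative, the digit glyphs can be selected by name instead. This sketch assumes they are always named uni followed by four hex digits, as in the codes shown in this post; the sample list, including the two non-digit names, is illustrative.

```python
import re

# Assumption: digit glyphs are named "uni" + 4 hex digits (e.g. uniE877)
UNI_NAME_RE = re.compile(r"^uni[0-9A-Fa-f]{4}$")


def digit_glyph_names(glyph_order):
    """Keep only glyph names that look like uniXXXX, preserving order."""
    return [name for name in glyph_order if UNI_NAME_RE.match(name)]


# Illustrative glyph order: two non-digit entries followed by digit glyphs
order = ["glyph00000", "x", "uniE877", "uniE53A", "uniF65B"]
print(digit_glyph_names(order))  # ['uniE877', 'uniE53A', 'uniF65B']
```

With a real font, the input would be the list returned by getGlyphOrder().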

font1 = TTFont('01.ttf')    # open the local reference font 01.ttf
uni_list1 = font1.getGlyphOrder()[2:]    # all glyph codes, minus the first two
# code-to-digit table for 01.ttf, confirmed by hand
digit_map = {'uniE877': '8', 'uniE53A': '2', 'uniF65B': '1', 'uniF691': '9', 'uniE17E': '3', 'uniE7C7': '4', 'uniE10C': '0', 'uniF717': '7', 'uniEB68': '5', 'uniF197': '6'}

font2 = TTFont('02.ttf')    # the new font file obtained by requesting the page again
uni_list2 = font2.getGlyphOrder()[2:]
for uni2 in uni_list2:
    obj2 = font2['glyf'][uni2]    # glyph object for code uni2 in 02.ttf
    for uni1 in uni_list1:
        obj1 = font1['glyf'][uni1]
        if obj1 == obj2:    # identical glyph, so it is the same digit
            print(uni2, digit_map[uni1])    # the code uni2 and its digit
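
The comparison logic itself does not depend on fontTools, so it can be tried out on toy data. The sketch below uses made-up point lists in place of real glyph objects; match_glyphs and the new codes uniF111 and uniF222 are hypothetical. Like the loop above, it relies on exact equality, which only works if the site reuses identical outlines under new codes.

```python
def match_glyphs(ref_outlines, ref_digits, new_outlines):
    """Map each new glyph name to a digit by finding an identical outline.

    ref_outlines: glyph name -> outline in the reference font
    ref_digits:   glyph name -> digit, confirmed by hand
    new_outlines: glyph name -> outline in the freshly downloaded font
    """
    result = {}
    for new_name, new_outline in new_outlines.items():
        for ref_name, ref_outline in ref_outlines.items():
            if new_outline == ref_outline:
                result[new_name] = ref_digits[ref_name]
                break  # stop at the first identical outline
    return result


# Toy outlines (lists of points) standing in for real glyph objects
ref_outlines = {"uniE877": [(0, 0), (10, 0), (10, 20)], "uniE53A": [(0, 0), (5, 5)]}
ref_digits = {"uniE877": "8", "uniE53A": "2"}
new_outlines = {"uniF111": [(0, 0), (5, 5)], "uniF222": [(0, 0), (10, 0), (10, 20)]}

print(match_glyphs(ref_outlines, ref_digits, new_outlines))
# {'uniF111': '2', 'uniF222': '8'}
```

A glyph with no identical counterpart is simply left out of the result, which makes a changed outline easy to spot.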

Note: the glyph codes are different on every request.
With the font obfuscation solved, let's start scraping the data.
Create a file maoyan.py:

import requests
import base64
from lxml import etree
from fontTools.ttLib import TTFont
import re
import csv

def get_mapping_dict(base64str=""):
    """Map each escaped glyph code in the freshly served font to its digit."""
    mapping_dict = {}
    font1 = TTFont('zt02.ttf')  # open the local reference font
    uni_list1 = font1.getGlyphOrder()[2:]  # skip the two non-digit glyphs

    # code-to-digit table for zt02.ttf, confirmed by hand
    dictx = {'uniE117': '4',
             'uniF3D2': '0',
             'uniF02A': '5',
             'uniEA82': '7',
             'uniF709': '2',
             'uniEDFD': '1',
             'uniF26C': '8',
             'uniE173': '3',
             'uniEE27': '9',
             'uniEAAA': '6'}
    b = base64.b64decode(base64str)

    with open('zt03.ttf', 'wb') as f:  # save the font served with this response
        f.write(b)

    font2 = TTFont('zt03.ttf')
    unit_list2 = font2.getGlyphOrder()[2:]

    for uni2 in unit_list2:
        obj2 = font2['glyf'][uni2]
        for uni1 in uni_list1:
            obj1 = font1['glyf'][uni1]
            if obj1 == obj2:
                # the key is the literal text \uXXXX (lowercase hex), which is
                # how the character shows up in str() of a list
                mapping_dict[r'\u' + uni2[-4:].lower()] = dictx[uni1]
    return mapping_dict

if __name__ == '__main__':
    url = 'https://piaofang.maoyan.com/?ver=normal'
    headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
    }
    response = requests.get(url=url, headers=headers)
    html = etree.HTML(response.text)
    # pull the base64 font payload out of the @font-face rule
    base64str = re.findall(r'base64,(.*?)\) format', response.text)[0]
    mapping_dict = get_mapping_dict(base64str=base64str)
    lis = html.xpath("//div[@id='ticket_tbody']/ul[@class='canTouch']")
    write_lines = []
    for li in lis:
        name = li.xpath("./li[@class='c1']/b/text()")[0]
        shangying = li.xpath("./li[@class='c1']/em[1]/text()")[0]  # days since release
        # str() of the result list turns the private-use characters into
        # \uXXXX escape text, which is what the mapping_dict keys match
        zong_piaofang = str(li.xpath("./li[@class='c1']/em/i[@class='cs']/text()"))
        piaofang = str(li.xpath("./li[@class='c2 ']/b/i[@class='cs']/text()"))
        piaofang_zhanbi = str(li.xpath("./li[@class='c3 ']/i[@class='cs']/text()"))
        paipian_zhanbi = str(li.xpath("./li[@class='c4 ']/i[@class='cs']/text()"))
        shangzuo_lv = str(li.xpath("./li[@class='c5 ']/span/i[@class='cs']/text()"))
        for key, val in mapping_dict.items():
            zong_piaofang = zong_piaofang.replace(key, val)
            piaofang = piaofang.replace(key, val)
            piaofang_zhanbi = piaofang_zhanbi.replace(key, val)
            paipian_zhanbi = paipian_zhanbi.replace(key, val)
            shangzuo_lv = shangzuo_lv.replace(key, val)
        # [2:-2] strips the leading [' and trailing '] left over from str(list)
        write_line = [name, shangying, zong_piaofang[2:-2], piaofang[2:-2], piaofang_zhanbi[2:-2], paipian_zhanbi[2:-2], shangzuo_lv[2:-2]]
        print(*write_line)
        write_lines.append(write_line)

    with open('maoyan.csv', 'a+', newline='') as f:
        writer = csv.writer(f)
        # columns: film title, days in release, total box office, real-time
        # box office, box-office share, screening share, seat occupancy rate
        title = ['片名','上映天数','总票房','实时票房','票房占比','排片占比','上座率']
        f.seek(0)
        if len(f.readlines()) == 0:  # write the header only if the file is empty
            writer.writerow(title)
        f.seek(0, 2)  # move back to the end of the file before appending
        for write_line in write_lines:
            writer.writerow(write_line)
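
The script above decodes the numbers by converting each XPath result list to a string and replacing escape text such as \ue117. A hedged alternative is to work on the characters directly: turn each glyph name into the character it encodes and let str.translate do the substitution. The helper names and the sample string below are made up for illustration.

```python
def glyph_name_to_char(name):
    """Convert a glyph name such as 'uniE877' to the character U+E877."""
    return chr(int(name[3:], 16))


def decode_digits(text, name_to_digit):
    """Replace every obfuscated character in text with its digit."""
    table = {ord(glyph_name_to_char(n)): d for n, d in name_to_digit.items()}
    return text.translate(table)


# Hypothetical mapping for one response (glyph name -> digit)
mapping = {"uniE877": "8", "uniE53A": "2", "uniF65B": "1"}
obfuscated = "\ue53a\uf65b.\ue877万"  # made-up sample value
print(decode_digits(obfuscated, mapping))  # 21.8万
```

With this variant the values can be taken straight from the XPath results, so the str() conversion and the [2:-2] slicing are no longer needed.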

Now run the script and you can see the data we scraped.
It is also saved in CSV format.
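
Since the script appends to maoyan.csv on every run, the header must be written only once. One way to sketch this, assuming it is acceptable to check the file size before opening the file, is shown below; append_rows, the column names, and the rows are made up for the demo.

```python
import csv
import os
import tempfile


def append_rows(path, header, rows):
    """Append rows to a CSV file; write the header only if the file is new or empty."""
    need_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        if need_header:
            writer.writerow(header)
        writer.writerows(rows)


# Demo in a temporary directory; the film names and figures are invented
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
append_rows(demo_path, ["name", "days"], [["Film A", "12"]])
append_rows(demo_path, ["name", "days"], [["Film B", "3"]])
with open(demo_path, encoding="utf-8") as f:
    print(f.read())
```

Opening in plain append mode avoids the seek calls entirely, because the header decision is made before the file is opened.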
You can have whatever you want!

Reposted from blog.csdn.net/WJL0104/article/details/87639075