Scraping Maoyan Movie Box Office Data

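A while ago I came across this article:

"Fighting Back Against Crawlers: How Creative Can Front-End Engineers Get?" (《反击爬虫，前端工程师的脑洞可以有多大？》)

It describes several front-end anti-scraping techniques, and I found them quite interesting.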

This post records the problems I ran into while scraping the Maoyan box office figures myself.

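Others have already written up the approach to scraping the box office numbers:

Countering the Anti-Crawler Strategy of the "Maoyan Movies" Website (反击“猫眼电影”网站的反爬虫策略)

and the principle behind it:

Using a Custom Web Font to Prevent Data Scraping (利用自定义web-font实现数据防采集)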

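While testing the code from those articles, I found that no fontforge package was available for Python 3. After some searching, I found a similar package for Python 3: fontTools.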

First, fetch the font file referenced by the page and save it locally as '1.ttf'.

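import re
import requests

# sourcehtml holds the HTML of the box office page, fetched beforehand.
p = re.compile(r"url\('(.*?)'\) format\('woff'\);")
uni_font_url = re.findall(p, sourcehtml)

# The matched font URL has no scheme, so 'http:' is prepended.
url = 'http:%s' % uni_font_url[0]
resp = requests.get(url)
with open('1.ttf', 'wb') as fontfile:
    fontfile.write(resp.content)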
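
The URL extraction step can be tried offline before touching the network. The following is a minimal sketch, not the original author's code: the `SAMPLE_HTML` markup, the `example.com` URLs and the helper name `extract_woff_url` are all hypothetical, and the real page's `@font-face` rule may be laid out differently. It uses `[^']+` instead of `.*?` so that a match cannot run across several `url(...)` entries.

```python
import re

# Hypothetical sample of the page's inline CSS; the real markup may differ.
SAMPLE_HTML = """
<style>
@font-face {
  font-family: stonefont;
  src: url('//example.com/fonts/abc.eot');
  src: url('//example.com/fonts/abc.eot?#iefix') format('embedded-opentype'),
       url('//example.com/fonts/abc.woff') format('woff');
}
</style>
"""

# Capture only the url(...) that is directly followed by format('woff').
FONT_URL_RE = re.compile(r"url\('([^']+)'\)\s*format\('woff'\)")


def extract_woff_url(html, scheme="http"):
    """Return the absolute WOFF URL found in html, or None if there is none."""
    match = FONT_URL_RE.search(html)
    if match is None:
        return None
    url = match.group(1)
    # A URL starting with '//' has no scheme, so prepend one.
    if url.startswith("//"):
        url = f"{scheme}:{url}"
    return url


print(extract_woff_url(SAMPLE_HTML))
```

Returning `None` when nothing matches avoids the `IndexError` that indexing an empty `findall` result would raise if the page layout changes.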

Use fontTools to read the glyphs in '1.ttf'.

from fontTools import ttLib

# Use the same file name (and case) that the font was saved under.
tt = ttLib.TTFont("1.ttf")

print(tt.getGlyphNames())
print(tt.getGlyphNames2())
print(tt.getGlyphOrder())

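# All three methods above return the glyph names. Comparing them with the characters on the page shows that getGlyphOrder() returns the glyphs in digit order.

# Use getGlyphOrder() to get the glyph for each digit (the first two entries are skipped) and build the dictionary tmp_dic.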

glyphs = tt.getGlyphOrder()[2:]
tmp_dic = {}
for num, un_size in enumerate(glyphs):
    print(un_size, num)
    font_uni = un_size.replace('uni', '0x').lower()
    tmp_dic[font_uni] = num
print(tmp_dic)
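
To see what the dictionary looks like without loading a font, the same logic can be run on a plain list. This is only a sketch: the glyph names in `sample_order` and the helper name `build_digit_map` are made up for illustration, and it relies on the observation above that the position in the glyph order equals the digit, which the site could change at any time.

```python
def build_digit_map(glyph_order, skip=2):
    """Map '0x....' code strings to digits by position in glyph_order.

    Assumes, as observed above, that after the first `skip` entries the
    glyphs appear in digit order (0, 1, 2, ...).
    """
    mapping = {}
    for digit, name in enumerate(glyph_order[skip:]):
        # 'uniE893' -> '0xe893', the same form used for the dictionary keys.
        mapping[name.replace("uni", "0x").lower()] = digit
    return mapping


# Hypothetical glyph order; a real font would list ten digit glyphs.
sample_order = ["glyph00000", "x", "uniE893", "uniF0A1", "uniE137"]
print(build_digit_map(sample_order))
# {'0xe893': 0, '0xf0a1': 1, '0xe137': 2}
```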

Using the dictionary, replace the obfuscated digit characters in the page.

# An entity of the form &#xXXXX; becomes 0xXXXX; so that it matches the dictionary keys.
sourcehtml = sourcehtml.replace('&#', '0')
for key in tmp_dic.keys():
    initstr = key + ';'
    sourcehtml = sourcehtml.replace(initstr, str(tmp_dic[key]))

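Note: if you use BeautifulSoup, replace the characters using the dictionary first and parse afterwards. If you parse the page directly, BeautifulSoup turns the characters it cannot recognize into empty strings.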
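
One side effect of replacing every '&#' in the page is that unrelated numeric entities get rewritten too. A possible alternative, sketched here as an assumption rather than as the original method, is to substitute only the entities whose code appears in the dictionary. The names `ENTITY_RE` and `decode_digits`, the sample markup and the sample dictionary are hypothetical; the sketch assumes the digits are written as hexadecimal entities of the form `&#x....;`.

```python
import re

# Matches hexadecimal character references such as &#xe893;
ENTITY_RE = re.compile(r"&#x([0-9a-fA-F]+);")


def decode_digits(html, digit_map):
    """Replace entities listed in digit_map with their digit; leave the rest."""

    def repl(match):
        key = "0x" + match.group(1).lower()
        if key in digit_map:
            return str(digit_map[key])
        return match.group(0)  # unknown entity: keep it unchanged

    return ENTITY_RE.sub(repl, html)


# Hypothetical mapping and markup, for illustration only.
digit_map = {"0xe893": 0, "0xf0a1": 1, "0xe137": 2}
sample = '<span class="stonefont">&#xf0a1;&#xe137;.&#xe893;</span> &amp; &#169;'
print(decode_digits(sample, digit_map))
# <span class="stonefont">12.0</span> &amp; &#169;
```

The decoded string can then be handed to the HTML parser, which keeps the order of operations described in the note above.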
