Python crawler: font encryption and font anti-crawling

Foreword: Font anti-crawling is a common anti-scraping technique used by websites such as 58.com, Maoyan (movie box office), Autohome, and Tianyancha. These sites use custom font files and reference them through styles loaded over the network, a capability added in CSS3 that lets web designers use any font they like. The data displays normally in the browser, but what a crawler scrapes is either garbled or turned into other characters, because a crawler does not actively load the online font.

Font encryption generally means that the web page remaps the default character encoding and loads its own font file as the page's font style. With the custom font applied, the numbers display correctly; but the same code points in the page source are rendered with the computer's default fonts because the custom font is not loaded, so they appear as garbled characters.
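As a quick illustration of why the scraped text looks garbled, take one of the character references that appears in the 58.com example analysed later in this article (the font file may differ between requests, so the exact values on your page can vary). Decoded as plain Unicode it is just an ordinary CJK character, not a digit:

import html

# Without the custom font, a character reference from the page source decodes
# to an ordinary CJK codepoint, not to the digit it visually renders as.
# (&#x9fa4; is taken from the example later in this article.)
print(html.unescape('&#x9fa4;'))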

Goal

Goal: today we will crawl the rental listings on 58.com and extract the housing information.

Data scraping

Let's first go at it with the basic crawler knowledge we learned earlier, pick up the keyboard and dive straight in (not yet knowing what font anti-crawling is). We have been parsing with XPath all along and have almost forgotten BeautifulSoup, so here we will use it for the extraction as a review.

import requests
from bs4 import BeautifulSoup


url = 'https://cs.58.com/chuzu/?PGTID=0d100000-0019-e310-48ff-c90994a335ae&ClickID=4'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

response = requests.get(url,headers=headers)

html_text = response.text
bs = BeautifulSoup(html_text, 'lxml')

# Get the housing list items via a CSS selector
lis = bs.select('li.house-cell')

# 获取每个li下的信息
for li in lis:
    title = li.select('h2 a')[0].stripped_strings  # stripped_strings yields the non-tag strings under a tag's descendants, with whitespace stripped; it returns a generator
    room = li.select('div.des p')[0].stripped_strings
    money = li.select('.money b')[0].string  # .string returns the single non-tag string directly under a tag, as a plain string
    print(list(title)[0], list(room)[0], money)

The output result:

The output shows garbled characters, and the same characters appear garbled in the page source as well.

We right-click and select "View Page Source" (查看网页源代码).

This looks like Unicode escapes, so the font is indeed encrypted. The usual solution is to find the font file and work out the mapping defined inside it. The font file is generally attached to the encrypted text as a style, so we search for the related style in the HTML head and find the @font-face declaration in the header. @font-face is the CSS rule that allows web developers to specify online fonts for their web pages.

We press Ctrl+F and search for @font-face.

About fonts

Working with fontTools

Here we use the fontTools module, a library for manipulating fonts that can convert font files such as .woff or .ttf into XML files.

1. We can directly use pip to install:

pip install fontTools

2. Load the font file:

font = TTFont('58.woff')

3. Convert to xml file:

font.saveXML('58.xml')

4. Name of each node:

font.keys()

5. Get the name values of the GlyphOrder node, in order:

font.getGlyphOrder() or font['cmap'].tables[0].ttFont.getGlyphOrder()

6. Get the mapping between the cmap node's codes and name values:

font.getBestCmap()

7. Get font coordinate information:

font['glyf'][i].coordinates

8. Get each coordinate point's flag (0 or 1):

font['glyf'][i].flags  Note: a flag of 0 marks an off-curve (control) point used for curves, 1 marks an on-curve point
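Putting these calls together, here is a minimal sketch, assuming a font file named 58.ttf has already been saved locally (as done in the extraction code further below):

from fontTools.ttLib import TTFont

font = TTFont('58.ttf')                 # load the font file
font.saveXML('58.xml')                  # dump it to XML for inspection

print(font.keys())                      # names of the tables in the font
print(font.getGlyphOrder())             # glyph names, in order
print(font['cmap'].getBestCmap())       # {code: glyph name} mapping

name = font.getGlyphOrder()[1]          # pick one glyph (index 0 is typically an empty placeholder glyph)
print(font['glyf'][name].coordinates)   # outline coordinates of that glyph
print(font['glyf'][name].flags)         # per-point flags: 1 = on-curve point, 0 = off-curve control point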

Font Basics and XML

A font is composed of a number of tables, and the font's information is stored in these tables. A basic font file must contain the following tables:

  • cmap: Character to glyph mapping (the Unicode-to-name mapping relationship)
  • head: Font header, global information about the font
  • hhea: Horizontal header
  • hmtx: Horizontal metrics
  • maxp: Maximum profile, used to allocate memory for the font
  • name: Naming table, defines the font name, style name, copyright notice, etc.
  • glyf: Glyph data, the outline definitions and hinting instructions
  • OS/2: OS/2 and Windows specific metrics
  • post: PostScript information

Let's extract the font file, save it locally and take a look:

import requests

from fontTools.ttLib import TTFont

import re
import base64


url = 'https://cs.58.com/chuzu/?PGTID=0d100000-0019-e310-48ff-c90994a335ae&ClickID=4'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}

response = requests.get(url,headers=headers)

html_text = response.text
# print(html_text)

pattern = r"base64,(.*?)'"   # extract the base64-encoded font data
result = re.findall(pattern, html_text)  # returns a list

if result:  # some pages may not use the encrypted font
    print(type(result), len(result))
    base64str = result[0]
    fontfile_content = base64.b64decode(base64str)  # decode the base64 data into raw binary
    with open('58.ttf', 'wb') as f:   # write out the font file
        f.write(fontfile_content)

    font = TTFont('58.ttf')   # load the font file
    font.saveXML('58.xml')  # convert it to an XML file

else:
    print('no font data found')
    base64str = ""

The generated font file 58.ttf:

The generated XML file:

Analyze the XML file

Let's analyze the mapping relationships in the XML file.
Click the GlyphOrder tag and you can see the id and name values. The id here is only a sequence number; it does not correspond to a specific digit:

Click the glyf tag and you see each name along with some coordinate points. These coordinates describe the shape of the glyph, and we do not need to pay attention to them.

Click the cmap tag, which holds the correspondence between code and name:

Now import the font file into http://fontstore.baidu.com/static/editor/index.html and open it, as shown below:

Does the display in the web page's source code look a bit like what is shown here? That is exactly the relationship: take a character reference from the page source, remove the leading &#x and the trailing ;, and prefix the remaining four hex digits with uni, and you get the code used in the font file. So the reference in the screenshot corresponds to glyph00007, which is the digit "6". Comparing the two pictures, glyph00001 corresponds to the digit 0, glyph00002 to 1, and so on, up to glyph00010 which corresponds to 9.
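A small sketch of that lookup in code, assuming the 58.ttf file saved above (the hex value 9476 appears in the output further below; it will differ on other pages):

from fontTools.ttLib import TTFont

entity = '&#x9476;'                    # a character reference as it appears in the page source
hex_code = entity[3:-1]                # '9476'
print('uni' + hex_code.upper())        # 'uni9476', the code displayed in the font editor

font = TTFont('58.ttf')                # the font file saved earlier
cmap = font['cmap'].getBestCmap()      # {code: glyph name}
glyph_name = cmap[int(hex_code, 16)]   # e.g. 'glyph00007' for the font in the screenshots
digit = int(glyph_name.replace('glyph', '')) - 1   # glyph0000N renders the digit N-1
print(glyph_name, digit)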

Use code to get the correspondence between code and name:

from fontTools.ttLib import TTFont

font = TTFont('58.ttf')  # open the local ttf file
font.saveXML('58.xml')  # convert it to an XML file
bestcmap = font['cmap'].getBestCmap()  # get the cmap mapping from code to glyph name
print(bestcmap)

Output:

{38006: 'glyph00010', 38287: 'glyph00006', 39228: 'glyph00007', 39499: 'glyph00005', 40506: 'glyph00009', 40611: 'glyph00002', 40804: 'glyph00008', 40850: 'glyph00003', 40868: 'glyph00001', 40869: 'glyph00004'}

The output is a dictionary whose keys are integer code points. We need to convert them to the hexadecimal form we saw in the XML, and work out the mapping to the actual digits:

import re

for key, value in bestcmap.items():
    key = hex(key)  # convert decimal to hexadecimal
    value = int(re.search(r'(\d+)', value).group()) - 1  # from the analysis above, glyph00001 maps to the digit 0, and so on
    print(key, value)

Output result:

0x9476 6
0x958f 5
0x993c 4
0x9a4b 3
0x9e3a 7
0x9ea3 2
0x9f64 9
0x9f92 1
0x9fa4 0
0x9fa5 8

Now we can replace the custom-font character references on the page with normal digits and then parse it. The complete code is as follows:

import requests
from bs4 import BeautifulSoup
from fontTools.ttLib import TTFont

import re
import base64
import io


def base64_str(html_text):
    pattern = r"base64,(.*?)'"  # extract the base64-encoded font data
    result = re.findall(pattern, html_text)  # returns a list

    if result:  # some pages may not use the encrypted font
        # print(type(result), len(result))
        base64str = result[0]
        bin_data = base64.b64decode(base64str)  # decode the base64 data into raw binary
        # # print(fontfile_content)
        # with open('58.ttf', 'wb') as f:
        #     f.write(bin_data)
        # font = TTFont('58.ttf')  # open the local ttf file
        # font.saveXML('58.xml')
        # bestcmap = font['cmap'].getBestCmap()
        # print(bestcmap)
        fonts = TTFont(io.BytesIO(bin_data))  # BytesIO keeps the bytes in memory, avoiding a temporary file
        bestcmap = fonts['cmap'].getBestCmap()
        # print(bestcmap)  # a dict

        # for key,value in bestcmap.items():
        #     key = hex(key)  # convert decimal to hexadecimal

        #     value = int(re.search(r'(\d+)', value).group()) -1
        #     print(key,value)

        # build the replacement map with a dict comprehension
        cmap = {hex(key).replace('0x', '&#x') + ';' : int(re.search(r'(\d+)', value).group(1)) - 1 for key, value in bestcmap.items()}
        # print(cmap)

        for k,v in cmap.items():
            html_text = html_text.replace(k, str(v))

        return html_text

    else:
        print('no font data found')
        base64str = ""
        return html_text


def parse_html(html_text):
    bs = BeautifulSoup(html_text, 'lxml')
    # get the housing list items via a CSS selector
    lis = bs.select('li.house-cell')

    # 获取每个li下的信息
    for li in lis:
        href = li.select('h2 a')[0]['href']
        title = li.select('h2 a')[0].stripped_strings  # stripped_strings yields the non-tag strings under a tag's descendants, with whitespace stripped; it returns a generator
        room = li.select('div.des p')[0].stripped_strings
        money = li.select('.money')[0].get_text().replace('\n','')  # get_text() returns all the text under the tag as one string
        print(href, list(title)[0], list(room)[0], money)



if __name__ == '__main__':

    url = 'https://cs.58.com/chuzu/?PGTID=0d100000-0019-e310-48ff-c90994a335ae&ClickID=4'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
    }

    response = requests.get(url, headers=headers)

    html_text = response.text
    html_text = base64_str(html_text)
    parse_html(html_text)

Output result:

At this point, we have basically covered font anti-crawling on 58.com.

Going further

The above is only a simple case of font anti-crawling. Sites like Autohome and Maoyan use trickier variants, which you can go on to challenge.
