Python crawler font decryption | Collecting Dianping store information, with skiing as the example

  • 1. Brief description

  • 2. Font anti-scraping handling

    • 2.1. Get the font file links

    • 2.2. Build the mapping between the three fonts and real characters

  • 3. Parsing single-page store information

  • 4. Collecting data from all pages

    • 4.1. Get the number of result pages

    • 4.2. Collect all the data

  • 5. Summary


1. Brief description

Winter is the season for skiing, but skiing calls for caution: beginners, for example, should stay off the advanced trails, and you should honestly weigh whether you can ski at all.

So today we will take "skiing" as the keyword and demonstrate how to collect Dianping store information with a Python crawler.

On the search results page, we first fetch the page data with requests.get(), then parse the returned HTML to extract the store information we need.

During crawling, however, we find that fields such as the review count, per-capita price, and store address render as □ on the page and appear as codes like &#xf8a1; in the fetched data. This is a form of font-based anti-scraping, and below we will break through all of it.

The following are the data fields we need to collect:

Field         | Description        | How obtained       | Font
shop_id       | Store ID           | directly           | -
shop_name     | Store name         | directly           | -
shop_star     | Store star rating  | directly           | -
shop_address  | Store address      | directly           | -
shop_review   | Number of reviews  | font anti-scraping | shopNum
shop_price    | Per-capita price   | font anti-scraping | shopNum
shop_tag_site | Store district     | font anti-scraping | tagName
shop_tag_type | Store category     | font anti-scraping | tagName

2. Font anti-scraping handling

Open Dianping, search for skiing, and press F12 on the results page to enter developer mode. Selecting the review count, we can see that its class is shopNum and its content renders as □. In the styles panel on the right, the font-family is PingFangSC-Regular-shopNum. You could find the font file link by clicking the .css link there, but since the font files for the other anti-scraped fields may differ, we will use another approach to fetch them all at once (see the next section).

Anti-scraping font (review count)

2.1. Get the font file links

In the head section of the page, we can find a 图文混排 ("mixed text and image") css link whose address contains all the font file links we will need later. Requesting that address with requests.get() returns every font name together with its font file download link.

Font anti-scraping (font links)

- Define a function get_html() for fetching page data

# Fetch the page data
def get_html(url, headers):
    try:
        rep = requests.get(url, headers=headers)
    except Exception as e:
        print(e)
        return ''
    text = rep.text
    html = re.sub(r'\s', '', text)  # strip all whitespace characters
    
    return html
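One detail worth flagging: get_html() strips every whitespace character from the response, which is why all the later regexes match tag runs with no spaces in them (e.g. <divclass="tit">); note that it also removes spaces inside visible text. A tiny illustration on made-up markup:

```python
import re

raw = '<div class="tit">\n  <h4>Some Shop</h4>\n</div>'
# Same stripping step as in get_html(): every \s character is removed
flat = re.sub(r'\s', '', raw)
print(flat)  # <divclass="tit"><h4>SomeShop</h4></div>
```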

- Fetch the page data

import re
import requests
# Copy the Cookie straight from your browser
headers = {
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
        "Cookie": "your browser Cookie",
        }
# Search keyword: 滑雪 (skiing)
key = '滑雪'
# Base url
url = f'https://www.dianping.com/search/keyword/2/0_{key}'
# Fetch the page data
html = get_html(url, headers)

- Get the font file links

# Regex: extract the 图文混排 (mixed text/image) css link from <head>
text_css = re.findall('<!--图文混排css--><linkrel="stylesheet"type="text\/css"href="(.*?)">', html)[0]  
# 'http://s3plus.meituan.net/v1/mss_0a06a471f9514fc79c981b5466f56b91/svgtextcss/29de4c2bc5d95d1e147c3c25a5f4aad8.css'
# Build the full css url
css_url = 'http:' + text_css
# Fetch the css holding the font file links
font_html = get_html(css_url, headers)
# Regex: list of @font-face blocks
font_list = re.findall(r'@font-face{(.*?)}', font_html)

# Collect each font name and its file link
font_dics = {}
for font in font_list:
    # Regex: font name
    font_name = re.findall(r'font-family:"PingFangSC-Regular-(.*?)"', font)[0]
    # Regex: the font file link for that name
    font_dics[font_name] = 'http:' + re.findall(r',url\("(.*?)"\);', font)[0]
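To see the two regexes at work without hitting the site, here is the same extraction run against a hand-written @font-face block (the URL is invented; only the structure mirrors the whitespace-stripped Dianping css):

```python
import re

# Invented css mimicking the whitespace-stripped font css
font_html = ('@font-face{font-family:"PingFangSC-Regular-shopNum";'
             'src:url("//example.net/font/abc.eot");'
             'src:url("//example.net/font/abc.eot?#iefix")format("embedded-opentype"),'
             'url("//example.net/font/abc.woff");}')

font_dics = {}
for font in re.findall(r'@font-face{(.*?)}', font_html):
    name = re.findall(r'font-family:"PingFangSC-Regular-(.*?)"', font)[0]
    # the ',url("...");' alternative is the .woff link we want
    font_dics[name] = 'http:' + re.findall(r',url\("(.*?)"\);', font)[0]

print(font_dics)  # {'shopNum': 'http://example.net/font/abc.woff'}
```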

- Download the font files locally

# We only use shopNum, tagName and address, so download just those three fonts
font_use_list = ['shopNum', 'tagName', 'address']
for key in font_use_list:
    woff = requests.get(font_dics[key], headers=headers).content
    with open(f'{key}.woff', 'wb') as f:
        f.write(woff)
  • Font files (saved locally; open them with FontCreator to inspect the glyphs)

Font file

2.2. Build the mapping between the three fonts and real characters

Let's first look at the html for the review count in the fetched page data:

<b>
    <svgmtsi class="shopNum">&#xf8a1;</svgmtsi>
    <svgmtsi class="shopNum">&#xee4c;</svgmtsi>
    <svgmtsi class="shopNum">&#xe103;</svgmtsi>
    <svgmtsi class="shopNum">&#xe62a;</svgmtsi>
</b>条评价

The page displays this as 4576 reviews, which gives us the correspondences 4=&#xf8a1, 5=&#xee4c, 7=&#xe103, 6=&#xe62a.

Opening the shopNum font file with FontCreator shows the following:

shopNum

Comparing the two, we find that 4 corresponds to uniF8A1 in shopNum, 5 to uniEE4C, and so on. So the rule is: a code such as &#xf8a1 in the requested data is really the glyph uniF8A1 in the font file, and what we need is the real character that glyph draws (here, 4).

To do this we use fontTools, a third-party Python font library, to read the three fonts and adjust the mapping:

from fontTools.ttLib import TTFont

# Rebuild the mapping for the three fonts
real_list = {}
for key in font_use_list:
    # Open the local font file
    font_data = TTFont(f'{key}.woff')
    # font_data.saveXML('shopNum.xml')
    # Get all glyph names and drop the first 2 non-character glyphs
    uni_list = font_data.getGlyphOrder()[2:]
    # The requested data contains '&#xf8a1' where the font names the glyph
    # 'uniF8A1'; rewrite the names (lower-cased) to match the requested data
    real_list[key] = ['&#x' + uni[3:].lower() for uni in uni_list]

real_list
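fontTools may not be installed when you first try this, so here is a dry run of the same rewrite on a hand-made glyph list (the glyph names are invented, mirroring the uniXXXX pattern described above):

```python
# Fake result of TTFont('shopNum.woff').getGlyphOrder(): the first two
# entries are non-character glyphs, the rest are invented uniXXXX names
uni_list = ['.notdef', 'x', 'uniF8A1', 'uniEE4C', 'uniE103']
# Same rewrite as above: 'uniF8A1' -> '&#xf8a1' (the form seen in the data)
real_shopnum = ['&#x' + uni[3:].lower() for uni in uni_list[2:]]
print(real_shopnum)  # ['&#xf8a1', '&#xee4c', '&#xe103']
```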

By opening the three font files, we find that they share the same character sequence (same order and same characters), copied below:

# The shared character string (its first 10 characters are the digits)
words = '1234567890店中美家馆小车大市公酒行国品发电金心业商司超生装园场食有新限天面工服海华水房饰城乐汽香部利子老艺花专东肉菜学福饭人百餐茶务通味所山区门药银农龙停尚安广鑫一容动南具源兴鲜记时机烤文康信果阳理锅宝达地儿衣特产西批坊州牛佳化五米修爱北养卖建材三会鸡室红站德王光名丽油院堂烧江社合星货型村自科快便日民营和活童明器烟育宾精屋经居庄石顺林尔县手厅销用好客火雅盛体旅之鞋辣作粉包楼校鱼平彩上吧保永万物教吃设医正造丰健点汤网庆技斯洗料配汇木缘加麻联卫川泰色世方寓风幼羊烫来高厂兰阿贝皮全女拉成云维贸道术运都口博河瑞宏京际路祥青镇厨培力惠连马鸿钢训影甲助窗布富牌头四多妆吉苑沙恒隆春干饼氏里二管诚制售嘉长轩杂副清计黄讯太鸭号街交与叉附近层旁对巷栋环省桥湖段乡厦府铺内侧元购前幢滨处向座下澩凤港开关景泉塘放昌线湾政步宁解白田町溪十八古双胜本单同九迎第台玉锦底后七斜期武岭松角纪朝峰六振珠局岗洲横边济井办汉代临弄团外塔杨铁浦字年岛陵原梅进荣友虹央桂沿事津凯莲丁秀柳集紫旗张谷的是不了很还个也这我就在以可到错没去过感次要比觉看得说常真们但最喜哈么别位能较境非为欢然他挺着价那意种想出员两推做排实分间甜度起满给热完格荐喝等其再几只现朋候样直而买于般豆量选奶打每评少算又因情找些份置适什蛋师气你姐棒试总定啊足级整带虾如态且尝主话强当更板知己无酸让入啦式笑赞片酱差像提队走嫩才刚午接重串回晚微周值费性桌拍跟块调糕'

For digits (there are only 10, occupying the first 10 positions of the font map and of the words string): when we meet an anti-scraping code that is really uniF8A1, we find its position in real_list['shopNum'] and substitute the character at the same position in words:

for i in range(10):
    s = s.replace(real_list['shopNum'][i], words[i])

For Chinese characters (at most len(real_list['tagName']) of them), the replacement logic is the same as for digits: swap in the character at the matching position.

for i in range(len(real_list['tagName'])):
    s = s.replace(real_list['tagName'][i], words[i])
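Putting it together, here is a toy decode of the review-count snippet from section 2.2. The mapping below is invented except that, as observed earlier, &#xf8a1 must sit where '4' sits in words (index 3), &#xee4c at index 4, &#xe62a at 5 and &#xe103 at 6:

```python
import re

words = '1234567890'  # first 10 characters of the real words string
# Invented shopNum mapping; only the four observed codes are placed at
# the positions that make them decode to 4, 5, 6 and 7
real_shopnum = ['&#xe000', '&#xe001', '&#xe002', '&#xf8a1', '&#xee4c',
                '&#xe62a', '&#xe103', '&#xe007', '&#xe008', '&#xe009']
s = '&#xf8a1;&#xee4c;&#xe103;&#xe62a;条评价'  # raw review-count text
for i in range(10):
    s = s.replace(real_shopnum[i], words[i])
# keep only the digits, as the parsing code does later
digits = ''.join(re.findall(r'\d', s))
print(digits)  # 4576
```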

3. Parsing single-page store information

With the font anti-scraping handled in Part 2, plus the directly obtainable fields, we can parse and collect all the store information. We use re regular expressions here; interested readers can use xpath, bs4 or other libraries instead.

We create a function get_items(html, real_list, words) that returns all store data on a single page:

# Get all store info on one page
def get_items(html, real_list, words):    
    # The html block holding every store on the page
    shop_list = re.findall(r'<divclass="shop-listJ_shop-listshop-all-list"id="shop-all-list">(.*)<\/div>',html)[0]
    # List of per-store html fragments
    shops = re.findall(r'<liclass="">(.*?)<\/li>', shop_list)
    
    items = []
    for shop in shops:
        # Parse a single store
        # shop = shops[0]
        item = {}
        # Store id (unique; used for deduplication during data cleaning)
        item['shop_id'] = re.findall(r'<divclass="txt"><divclass="tit">.*data-shopid="(.*?)"', shop)[0]
        # Store name
        item['shop_name'] = re.findall(r'<divclass="txt"><divclass="tit">.*<h4>(.*)<\/h4>', shop)[0]
        # The star rating comes as a two-digit number; divide by 10.0 for a float
        item['shop_star'] = re.findall(r'<divclass="nebula_star"><divclass="star_icon"><spanclass="starstar_(\d+)star_sml"><\/span>', shop)[0]
        item['shop_star'] = int(item['shop_star'])/10.0
        
        # The address is actually present in data-address inside
        # class="operate J_operate Hide", so no font decoding is needed there
        # Store address
        item['shop_address'] = re.findall('<divclass="operateJ_operateHide">.*?data-address="(.*?)"', shop)[0]
        
        shop_name = item['shop_name']
        # Review count and per-capita price both use the shopNum font
        try:
            shop_review = re.findall(r'<b>(.*?)<\/b>条评价', shop)[0]
        except IndexError:
            print(f'{shop_name}: no review data')
            shop_review = ''
            
        try:
            shop_price = re.findall(r'人均<b>¥(.*?)<\/b>', shop)[0]
        except IndexError:
            print(f'{shop_name}: no per-capita price data')
            shop_price = ''
            
        for i in range(10):
            shop_review = shop_review.replace(real_list['shopNum'][i], words[i])
            shop_price = shop_price.replace(real_list['shopNum'][i], words[i])
        # Keep only the digits of the review count and price
        item['shop_review'] = ''.join(re.findall(r'\d',shop_review))
        item['shop_price'] = ''.join(re.findall(r'\d',shop_price))
        
        # District and category use the tagName font
        shop_tag_site = re.findall(r'<spanclass="tag">.*data-click-name="shop_tag_region_click"(.*?)<\/span>', shop)[0]
        # Store category
        shop_tag_type = re.findall('<divclass="tag-addr">.*?<spanclass="tag">(.*?)</span></a>', shop)[0]
        for i in range(len(real_list['tagName'])):
            shop_tag_site = shop_tag_site.replace(real_list['tagName'][i], words[i])
            shop_tag_type = shop_tag_type.replace(real_list['tagName'][i], words[i])
        # [\u4e00-\u9fa5] matches Chinese characters
        item['shop_tag_site'] = ''.join(re.findall(r'[\u4e00-\u9fa5]',shop_tag_site))
        item['shop_tag_type'] = ''.join(re.findall(r'[\u4e00-\u9fa5]',shop_tag_type))
        items.append(item)
    
    return items

Below is the data for every store on the first results page, with skiing as the keyword:

All store information on a page

4. Collecting data from all pages

In most cases the search results span several pages, so beyond a single page we need to fetch them all: first get the page count, then loop page by page to crawl everything.

4.1. Get the number of result pages

For single-page results there is no page-count control. For multi-page results, drag to the bottom of the page, inspect the last-page control to locate the html node holding its value, and extract the value with a regex.

Number of pages

# Get the number of result pages
def get_pages(html):
    try:
        page_html = re.findall(r'<divclass="page">(.*?)</div>', html)[0]
        pages = int(re.findall(r'<ahref=.*>(\d+)<\/a>', page_html)[0])
    except Exception:
        pages = 1
    
    return pages
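A self-contained version of get_pages() (note the int() cast, which the looping code needs for range()), exercised on invented, whitespace-stripped pagination html:

```python
import re

def get_pages(html):
    try:
        page_html = re.findall(r'<divclass="page">(.*?)</div>', html)[0]
        # cast to int so the caller can feed it straight into range()
        pages = int(re.findall(r'<ahref=.*>(\d+)<\/a>', page_html)[0])
    except Exception:
        pages = 1
    return pages

# Invented pagination markup (whitespace already stripped, as get_html does)
sample = '<divclass="page"><ahref="/p1">1</a><ahref="/p2">2</a><ahref="/p9">9</a><ahref="/pn">下一页</a></div>'
print(get_pages(sample))                  # 9  (the greedy .* keeps the last numeric link)
print(get_pages('<p>no pagination</p>'))  # 1  (fallback for single-page results)
```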

4.2. Collect all the data

Parsing the first page already gives us the number of pages, the downloaded anti-scraping fonts, the real_list mapping, and the words string of real characters, as well as the list of first-page stores. We then iterate from page 2 to the last page and append each page's data to that list.

import pandas as pd

# List of first-page store data
shop_data = get_items(html, real_list, words)
# Loop from page 2 through the last page
for page in range(2, pages+1):
    aim_url = f'{url}/p{page}'
    html = get_html(aim_url, headers)
    items = get_items(html, real_list, words)
    shop_data.extend(items)
    print(f'Page {page} crawled')
# Convert to a dataframe
df = pd.DataFrame(shop_data)
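Since shop_id was kept for deduplication, here is a minimal sketch (with made-up rows) of cleaning the dataframe before saving; the csv filename is arbitrary:

```python
import pandas as pd

# Made-up rows standing in for crawled data; note the duplicated shop_id
shop_data = [
    {'shop_id': 'a1', 'shop_name': '滑雪场A', 'shop_price': '150'},
    {'shop_id': 'a1', 'shop_name': '滑雪场A', 'shop_price': '150'},
    {'shop_id': 'b2', 'shop_name': '滑雪场B', 'shop_price': '98'},
]
# Drop duplicates on the unique store id
df = pd.DataFrame(shop_data).drop_duplicates(subset='shop_id')
print(len(df))  # 2
# df.to_csv('dianping_ski.csv', index=False, encoding='utf-8-sig')
```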

The dataframe built from all the collected data looks like this:

All results

5. Summary

To beat Dianping's font-based anti-scraping, we first fetched the font files and worked out which real character each character code maps to, then replaced the codes with those characters.

In practice, however, crawling Dianping store data can hit harder obstacles, such as being sent to the verification center or having the account or IP rate-limited. Setting Cookies, adding IP proxies, and similar measures can handle these.

Origin blog.csdn.net/Python_sn/article/details/112391844