Python crawler | Collecting Dianping store information, with "ski" as the example keyword

  • 1. Brief description

  • 2. Font anti-crawling

  • 2.1. Get the font file links

  • 2.2. Build the mapping between the three fonts and real characters

  • 3. Parsing single-page store information

  • 4. Fetching all pages of data

  • 4.1. Get the number of pages

  • 4.2. Collect all data

  • 5. Summary

Winter is the season for skiing, but ski with caution: beginners, for instance, should stay off advanced trails, and you should honestly size up whether you are up to a given run.

So today, let's use "ski" as the search keyword to demonstrate how to collect Dianping store information with a Python crawler.

In the search results, each page of data can be fetched with requests.get(), page by page, and the store information we need can then be parsed out of the page data.

However, during crawling we find that fields such as the review count, per-capita price, and store address are displayed as □ on the page, and appear in the fetched data as entities like &#xf8a1; whose meaning is not obvious. This is font anti-crawling, and we are going to break it.

The following are the data fields we need to collect:

| Field | Description | How it's obtained | Font |
| --- | --- | --- | --- |
| shop_id | store id (unique) | parsed directly | none |
| shop_name | store name | parsed directly | none |
| shop_star | star rating | parsed directly (two-digit value ÷ 10) | none |
| shop_address | store address | parsed directly (data-address attribute) | none |
| shop_review | number of reviews | font decoding | shopNum |
| shop_price | per-capita price | font decoding | shopNum |
| shop_tag_site | location area | font decoding | tagName |
| shop_tag_type | store category | font decoding | tagName |

2. Font anti-crawling

Open Dianping and search for "ski". On the results page, press F12 to enter developer mode. Selecting the review count, we can see that its class is shopNum and its content renders as □; in the styles panel on the right, the font-family is PingFangSC-Regular-shopNum. You could find the font file link by opening the .css link on the right, but since the fields protected by font anti-crawling may use different font files, we take another route and fetch them all in one go (see the next subsection).

 

2.1. Get the font file links

In the head of the page we can find a graphic-text css link; the css file at that address contains all the font file links that will be used later. Requesting that address with requests.get() returns every font name together with its font file download link.

 

- Define the function get_html() for fetching web page data

 
 

```python
# Fetch web page data
def get_html(url, headers):
    try:
        rep = requests.get(url, headers=headers)
    except Exception as e:
        print(e)
    text = rep.text
    html = re.sub(r'\s', '', text)  # strip all whitespace so the regexes below stay simple
    return html
```

 

- Fetch the web page data

 
 

```python
import re
import requests

# For the Cookie, just copy the value from your browser
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36",
    "Cookie": "Your browser Cookie",
}
# search keyword
key = 'ski'
# base url
url = f'https://www.dianping.com/search/keyword/2/0_{key}'
# fetch the page data
html = get_html(url, headers)
```

 

- Extract the font file links

 
 

```python
# Regex to extract the graphic-text css link from the page head
text_css = re.findall(r'<!--.*?css--><linkrel="stylesheet"type="text\/css"href="(.*?)">', html)[0]
# e.g. '//s3plus.meituan.net/v1/mss_0a06a471f9514fc79c981b5466f56b91/svgtextcss/29de4c2bc5d95d1e147c3c25a5f4aad8.css'
# build the full css link
css_url = 'http:' + text_css
# fetch the css file that references the font files
font_html = get_html(css_url, headers)
# regex to pull out each @font-face block
font_list = re.findall(r'@font-face{(.*?)}', font_html)
# collect the fonts in use and their download links
font_dics = {}
for font in font_list:
    # font name
    font_name = re.findall(r'font-family:"PingFangSC-Regular-(.*?)"', font)[0]
    # corresponding font file download link
    font_dics[font_name] = 'http:' + re.findall(r',url\("(.*?)"\);', font)[0]
```

 

- Download the font files locally

 
 

```python
# We only use shopNum, tagName and address, so only these three fonts are downloaded
font_use_list = ['shopNum', 'tagName', 'address']
for key in font_use_list:
    woff = requests.get(font_dics[key], headers=headers).content
    with open(f'{key}.woff', 'wb') as f:
        f.write(woff)
```

  • Font files (saved locally; open them with FontCreator to inspect the glyphs)
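If you would rather not install FontCreator, the fontTools library (used again in the next subsection) can dump a font to readable XML for inspection; a minimal sketch:

```python
from fontTools.ttLib import TTFont

# Dump the downloaded shopNum font to XML; glyph names (uniF8A1, ...) can then
# be browsed in any text editor.
TTFont('shopNum.woff').saveXML('shopNum.xml')
```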

 

2.2. Build the mapping between the three fonts and real characters

Let's first look at the html of the review count in the fetched page data:

 
 

```html
<b>
    <svgmtsi class="shopNum">&#xf8a1;</svgmtsi>
    <svgmtsi class="shopNum">&#xee4c;</svgmtsi>
    <svgmtsi class="shopNum">…</svgmtsi>
    <svgmtsi class="shopNum">…</svgmtsi>
</b>条评价
```

The corresponding page renders as 4576条评价 (4576 reviews), so the correspondence is 4 = &#xf8a1;, 5 = &#xee4c;, and likewise for 7 and 6.

Opening the shopNum font file in FontCreator, we see the following:

By comparison we find that in shopNum, 4 corresponds to uniF8A1, 5 corresponds to uniEE4C, and so on. Once the rule is found, we know that an entity such as &#xf8a1; in the fetched data is really the glyph uniF8A1, and the actual digit or character it stands for is whatever is drawn at that glyph in the font file (here, 4).
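In other words, the page entity and the glyph name are two spellings of the same code point; a short illustrative conversion (this mirrors the list comprehension used in the next code block):

```python
# 'uniF8A1' in the font file and '&#xf8a1;' in the page source both denote
# code point U+F8A1: strip the 'uni' prefix and re-prefix '&#x'
glyph = 'uniF8A1'
entity = '&#x' + glyph[3:].lower() + ';'  # -> '&#xf8a1;'
```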

Here we bring in fontTools, Python's third-party font-processing library, to build the mapping for the three fonts:

 
 

```python
from fontTools.ttLib import TTFont

# Build the code list for each of the three fonts
real_list = {}
for key in font_use_list:
    # open the local font file
    font_data = TTFont(f'{key}.woff')
    # font_data.saveXML('shopNum.xml')
    # get all glyph codes, dropping the first 2 placeholder glyphs
    uni_list = font_data.getGlyphOrder()[2:]
    # an entity like '&#xf8a1;' in the fetched data corresponds to 'uniF8A1' here,
    # so rewrite the codes into the form that appears in the fetched data
    real_list[key] = ['&#x' + uni[3:] for uni in uni_list]
```

Opening the three font files shows that their character sequences are identical (same order and same characters); the sequence is copied below:

 
 

```python
# words: the glyph character sequence shared by the three fonts -- the ten digits
# first, then the Chinese characters, in glyph order. Only the beginning is
# reproduced here; copy the full sequence (several hundred characters) from the
# font file yourself.
words = '1234567890店中美家馆小车大市公酒行国品发电金心业商司超生装园场食有新限天面工服海华水房饰城乐汽香部利子老艺花专东肉菜学福饭人百餐茶…'
```
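Since everything below relies on the three fonts sharing one glyph order, that claim is worth checking programmatically; a quick sketch with fontTools:

```python
from fontTools.ttLib import TTFont

# Confirm the three fonts list their glyphs in the same order
orders = [TTFont(f'{k}.woff').getGlyphOrder()[2:] for k in font_use_list]
assert orders[0] == orders[1] == orders[2]
```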

For digits (there are only ten, sitting in the first 10 positions of both the glyph order and the words string), when we meet an anti-crawl character such as &#xf8a1; (i.e. uniF8A1), we look up its position in shopNum's code list and substitute the character at the same position in words.

 
 

```python
for i in range(10):
    # str.replace returns a new string, so reassign the result
    s = s.replace(real_list['shopNum'][i], words[i])
```

For the Chinese-character fields (at most len(real_list['tagName']) characters), the replacement logic is the same: substitute by matching position.

 
 

```python
for i in range(len(real_list['tagName'])):
    s = s.replace(real_list['tagName'][i], words[i])
```

 

3. Parsing single-page store information

With the font anti-crawling handled in Part 2, combined with the store fields that can be read off directly, we can parse and collect all the store information. We use re regular expressions for parsing here; interested readers can also use xpath, bs4 or other parsing libraries, as in the sketch below.
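A minimal hedged sketch of the bs4 alternative: it assumes raw_html holds the page text before get_html() strips whitespace, since deleting spaces (e.g. '<div class=' becoming '<divclass=') would break a real html parser:

```python
from bs4 import BeautifulSoup

# Parse the raw (non-whitespace-stripped) page and list the store names
soup = BeautifulSoup(raw_html, 'html.parser')
for li in soup.select('#shop-all-list li'):
    h4 = li.select_one('h4')
    if h4:
        print(h4.get_text(strip=True))  # store name
```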

We create a function get_items(html, real_list, words) that collects all the store data on a single page:

 
 

```python
# Get all the store information on a single page
def get_items(html, real_list, words):
    # the overall html block holding all stores on the page
    shop_list = re.findall(r'<divclass="shop-listJ_shop-listshop-all-list"id="shop-all-list">(.*)<\/div>', html)[0]
    # the html of each individual store
    shops = re.findall(r'<liclass="">(.*?)<\/li>', shop_list)
    items = []
    for shop in shops:
        # parse a single store
        item = {}
        # store id (unique; useful again in the data-cleaning phase)
        item['shop_id'] = re.findall(r'<divclass="txt"><divclass="tit">.*data-shopid="(.*?)"', shop)[0]
        # store name
        item['shop_name'] = re.findall(r'<divclass="txt"><divclass="tit">.*<h4>(.*)<\/h4>', shop)[0]
        # star rating is stored as a two-digit number, so divide by 10.0
        item['shop_star'] = re.findall(r'<divclass="nebula_star"><divclass="star_icon"><spanclass="starstar_(\d+)star_sml"><\/span>', shop)[0]
        item['shop_star'] = int(item['shop_star']) / 10.0
        # the address sits in plain text in the data-address attribute of
        # class="operate J_operate Hide", so no font decoding is needed
        item['shop_address'] = re.findall(r'<divclass="operateJ_operateHide">.*?data-address="(.*?)"', shop)[0]
        shop_name = item['shop_name']
        # review count and per-capita price both use the shopNum font
        try:
            shop_review = re.findall(r'<b>(.*?)<\/b>条评价', shop)[0]
        except:
            print(f'{shop_name} has no review data')
            shop_review = ''
        try:
            shop_price = re.findall(r'人均<b>¥(.*?)<\/b>', shop)[0]
        except:
            print(f'{shop_name} has no per-capita price data')
            shop_price = ''
        for i in range(10):
            shop_review = shop_review.replace(real_list['shopNum'][i], words[i])
            shop_price = shop_price.replace(real_list['shopNum'][i], words[i])
        # keep only the digits and join them back together
        item['shop_review'] = ''.join(re.findall(r'\d', shop_review))
        item['shop_price'] = ''.join(re.findall(r'\d', shop_price))
        # the location area and the store category use the tagName font
        shop_tag_site = re.findall(r'<spanclass="tag">.*data-click-name="shop_tag_region_click"(.*?)<\/span>', shop)[0]
        # store category
        shop_tag_type = re.findall(r'<divclass="tag-addr">.*?<spanclass="tag">(.*?)</span></a>', shop)[0]
        for i in range(len(real_list['tagName'])):
            shop_tag_site = shop_tag_site.replace(real_list['tagName'][i], words[i])
            shop_tag_type = shop_tag_type.replace(real_list['tagName'][i], words[i])
        # [\u4e00-\u9fa5] matches Chinese characters
        item['shop_tag_site'] = ''.join(re.findall(r'[\u4e00-\u9fa5]', shop_tag_site))
        item['shop_tag_type'] = ''.join(re.findall(r'[\u4e00-\u9fa5]', shop_tag_type))
        items.append(item)
    return items
```

Calling it on the first page of the "ski" search results gives us all the store information on that page.
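A sketch of the call and of the shape of one resulting record (the values below are illustrative placeholders, not real data):

```python
items = get_items(html, real_list, words)
# items[0] would look roughly like:
# {'shop_id': 'xxxxxxxx', 'shop_name': 'xx滑雪场', 'shop_star': 4.5,
#  'shop_address': 'xx区xx路xx号', 'shop_review': '4576', 'shop_price': '128',
#  'shop_tag_site': 'xx区', 'shop_tag_type': '滑雪'}
```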

 

4. Fetching all pages of data

In most cases the search results span several pages, so beyond the single-page collection we need to fetch every page. The usual approach is to get the page count first and then loop over the pages, crawling each one in turn.

 

4.1. Get the number of pages

A single page of results has no page bar at all. For multi-page results, drag to the bottom of the page, select the last-page control to find the html node holding that value, and extract it with a regular expression.

 
 

```python
# Get the number of result pages
def get_pages(html):
    try:
        page_html = re.findall(r'<divclass="page">(.*?)</div>', html)[0]
        # the last <a> in the page bar holds the total page count
        pages = int(re.findall(r'<ahref=.*>(\d+)<\/a>', page_html)[0])
    except:
        pages = 1
    return pages
```

 

4.2. Collect all data

Parsing the first page of data already gives us everything we need: the page count, the downloaded anti-crawl fonts, the real_list code mapping, the words string of real characters, and the list of all stores on page one. We then loop from the second page to the last, appending each page's records to that list.

 
 

```python
import pandas as pd

# number of result pages (from the first page's html)
pages = get_pages(html)
# store data from the first page
shop_data = get_items(html, real_list, words)
# pages 2 through the last page
for page in range(2, pages + 1):
    aim_url = f'{url}/p{page}'
    html = get_html(aim_url, headers)
    items = get_items(html, real_list, words)
    shop_data.extend(items)
    print(f'page {page} crawled')
# convert to a dataframe
df = pd.DataFrame(shop_data)
```

The dataframe built from all the collected data looks like this:
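To keep the result, the dataframe can be written straight to disk; a small sketch (the file name is an arbitrary choice):

```python
# utf-8-sig keeps the Chinese columns readable when the csv is opened in Excel
df.to_csv('dianping_ski.csv', index=False, encoding='utf-8-sig')
```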

5. Summary

Faced with this kind of font anti-crawling on Dianping, we first fetch the font files, work out the mapping between their character codes and the real characters, and then replace the codes in the page data with those real characters.

In practice, though, crawling Dianping store information can run into more complicated situations, such as a verification-center challenge or an account/IP restriction; these can be handled by setting Cookies, adding IP proxies, and similar measures.
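For the proxy part, requests accepts a per-request proxies mapping; a minimal sketch, with a placeholder address rather than a working endpoint:

```python
# Route the request through an http/https proxy of your own
proxies = {
    'http': 'http://127.0.0.1:8888',
    'https': 'http://127.0.0.1:8888',
}
rep = requests.get(url, headers=headers, proxies=proxies, timeout=10)
```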



Origin blog.csdn.net/weixin_43881394/article/details/112366054