1. The crawl target
http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist
2. Requirements analysis
The requirement is simple: crawl all the data.
Optionally, it should also be able to crawl all results for a given search keyword.
The difficulty is that no matter where you click, the URL never changes, which makes crawling harder.
What the crawler needs to do:
1. The target is an AJAX page, so Selenium plus a headless browser is required.
2. The search box on the page (an optional feature).
3. The option buttons.
On the default page, the information type and the time range are not set to "All". To crawl all the data, both options must be changed.
4. Page selection.
3. Key technologies
1. Dynamic page: Selenium + headless browser
phantomjs.exe is no longer maintained, so chromedriver.exe is used here.
Install Selenium, then download chromedriver.exe and put it in the root of the C drive; the page can then be opened with the following code:
base_url = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe', options=chrome_options, service_args=['--load-images=no'])
driver.get(base_url)
2. Option buttons: find_element_by_xpath
View the page source and search for the information type.
That gives us the code to click the "All" button:
select_type_box = driver.find_element_by_xpath("//ul[@class='select_type_box clearfix']/li[1]")
select_type_box.click()
Similarly for the time range; the code to select it:
select_time_box = driver.find_element_by_xpath("//ul[@class='fl select_time_box']/li[1]")
select_time_box.click()
3. Next page: onclick, JS analysis, browser capture
At first I thought it was simple:
next_page = driver.find_element_by_xpath("//div[@class='pagination']/ul[1]/li[11]")
next_page.click()
However, execution failed with: Element is not clickable at point.
In other words, a headless browser can find the button easily, but clicking an onclick element is not so simple.
So I started digging into the JS. Opening debug mode with F12 showed that this site loads only 3 JS files.
I searched through them and did not find the paging function, but it did turn up in the page's HTML source.
Found a link here!
I guessed that searchword was the search-box parameter, so I tried:
http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?searchword=5G
Sure enough, it works!
The same idea failed for paging, though: no similar link could be found.
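The URL-parameter trick above can be sketched in plain Python. This is a minimal illustration, not the author's code: only the base URL and the searchword parameter come from the page analysis; the helper name is mine.

```python
from urllib.parse import urlencode

BASE_URL = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"

def search_url(keyword):
    # Build the keyword-search URL discovered above (helper name is hypothetical).
    return BASE_URL + "?" + urlencode({"searchword": keyword})

print(search_url("5G"))
# -> http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?searchword=5G
```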
So I searched online and found an article on capturing requests in the browser: https://blog.csdn.net/weixin_39610722/article/details/110960576
First turn on recording, then click page 4:
Find the request, right-click and choose Copy as cURL; the content is as follows:
curl 'http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist' \
-H 'Connection: keep-alive' \
-H 'Cache-Control: max-age=0' \
-H 'Upgrade-Insecure-Requests: 1' \
-H 'Origin: http://zb.yfb.qianlima.com' \
-H 'Content-Type: application/x-www-form-urlencoded' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
-H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
-H 'Referer: http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist' \
-H 'Accept-Language: zh-CN,zh;q=0.9' \
-H 'Cookie: JSESSIONID=C20E9758652DE1F141E51B349D4AE839; __jsluid_h=2b424c9697bd9692d2ad1451300a9c75; Hm_lvt_a31e80f5423f0ff316a81cf4521eaf0d=1610549640; pageSize=15; keywords=%E5%8D%83%E9%87%8C%E9%A9%AC%E6%8B%9B%E6%A0%87; keywordvaluess=""; laiyuan=3; Hm_lvt_0a38bdb0467f2ce847386f381ff6c0e8=1610550267; Hm_lpvt_0a38bdb0467f2ce847386f381ff6c0e8=1610550267; Hm_lvt_5dc1b78c0ab996bd6536c3a37f9ceda7=1610550268; Hm_lpvt_5dc1b78c0ab996bd6536c3a37f9ceda7=1610550268; UM_distinctid=176fc46d0a65b3-0c99d08cd43d87-31346d-e1000-176fc46d0a7d20; gr_user_id=97220f81-4919-4a9d-a198-24b3caf49796; pageNo=4; Hm_lpvt_a31e80f5423f0ff316a81cf4521eaf0d=1610560460' \
--data-raw 'pageNo=7&kwname=&pageSize=15&ipAddress=122.96.44.71&searchword=&searchword2=&hotword=&provinceId=&provinceName=&areaId=&areaName=&infoType=0&infoTypeName=&noticeTypes=&noticeTypesName=&secondInfoType=&secondInfoTypeName=&timeType=5&timeTypeName=%E8%BF%91%E4%B8%80%E5%B9%B4&searchType=2&clearAll=false&e_keywordid=&e_creative=&flag=0&source=baidu&firstTime=1' \
--compressed \
--insecure
Even before a closer look, one parameter jumps out: pageNo.
I guessed that this was what we were looking for, tried http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?pageNo=4 , and it really works!
In the same way, we also get infoType and timeType.
So after a thorough analysis of the JS, no browser-driven crawler is needed at all: just request the HTML pages in order.
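A minimal sketch of this "no Selenium needed" approach: build each page's URL directly from the parameters found in the capture. The parameter names (pageNo, infoType, timeType, searchword) come from the cURL dump above; the helper name and defaults are mine.

```python
from urllib.parse import urlencode

BASE_URL = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"

def page_url(page_no, info_type=0, time_type=0, keyword=""):
    # URL for one result page, using the query parameters found in the capture.
    return BASE_URL + "?" + urlencode({
        "infoType": info_type,
        "timeType": time_type,
        "searchword": keyword,
        "pageNo": page_no,
    })

# Fetching would then be plain HTTP, no browser needed, e.g.:
# import requests
# for n in range(1, 31):          # the site shows at most 30 pages
#     html = requests.get(page_url(n)).text
print(page_url(4))
```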
Then I found that no matter what you search for or how you filter, at most 30 pages are displayed.
A new round of the contest was brewing...
Then I noticed the pageSize parameter in the captured request.
And I thought: with 1000 entries per page, who needs pageNo!
As long as pageSize is large enough, all results of a keyword search can be shown at once, and the 30-page limit no longer matters.
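The pageSize idea can be sketched the same way. pageSize appeared in both the cookie and the form data of the capture; whether the server actually honours values this large is an assumption to verify, and the helper name is mine.

```python
from urllib.parse import urlencode

BASE_URL = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"

def one_shot_url(keyword, page_size=1000):
    # Ask for one huge page instead of paging; whether the server honours
    # a pageSize this large is an untested assumption.
    return BASE_URL + "?" + urlencode({
        "searchword": keyword,
        "pageSize": page_size,
        "pageNo": 1,
    })

print(one_shot_url("5G"))
```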
4. The code
Crawler draft:
# coding=utf-8
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.common.exceptions import NoSuchElementException, WebDriverException
chrome_options = Options()
chrome_options.add_argument("--headless")
base_url = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?infoType=0&timeType=0&searchword=5G&pageNo=1"
# service_args=['--load-images=no'] dropped: it is a PhantomJS flag, not a chromedriver one
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe', options=chrome_options)
driver.get(base_url)
# click "All" for the information type
select_type_box = driver.find_element_by_xpath("//ul[@class='select_type_box clearfix']/li[1]")
select_type_box.click()
# click "All" for the time range
select_time_box = driver.find_element_by_xpath("//ul[@class='fl select_time_box']/li[1]")
select_time_box.click()
print(driver.page_source)
# page through the results; stop when the "next page" button is missing or unclickable
while True:
    try:
        next_page = driver.find_element_by_xpath("//div[@class='pagination']/ul[1]/li[11]")
        next_page.click()
    except (NoSuchElementException, WebDriverException):
        break
    print(driver.page_source)
driver.close()
Since there is no need to crawl this way for the time being, the code is left as a draft.