Python crawler, part 4: simulating mouse clicks and analyzing JS

1. Crawl target

http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist

 

2. Requirement analysis

The requirement is simple: crawl all of the data.

Optionally, the crawler can also support fetching all results for a given search keyword.

The difficulty is that no matter where you click, the URL in the address bar never changes, which makes crawling harder.

What the crawler needs to do:

1. The target is an ajax page, which calls for selenium + a headless browser

2. Use the search box on the page (optional feature)

3. Click the option buttons

On the default page, the information type and time range are not set to "All". To crawl all data, both options must be changed.

4. Page through the results

 

3. Key techniques

1. Dynamic webpage: selenium + headless browser

PhantomJS is no longer supported, so chromedriver.exe (headless Chrome) is used here.

Install selenium, download the chromedriver.exe that matches your Chrome version, and put it in the root of the C: drive. The page can then be opened with the following code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")  # run Chrome without a window
base_url = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"
# note: service_args=['--load-images=no'] was a PhantomJS option and has no effect with chromedriver
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe', options=chrome_options)
driver.get(base_url)

2. Option button: find_element_by_xpath

View the page source and search for the information type options.

This gives the code to click the "All" button:

select_type_box = driver.find_element_by_xpath("//ul[@class='select_type_box clearfix']/li[1]")
select_type_box.click()

Similarly, the code to select the time range:

select_time_box = driver.find_element_by_xpath("//ul[@class='fl select_time_box']/li[1]")
select_time_box.click()

3. Next page: onclick, JS analysis, capturing requests in the browser

At first I thought it was simple:

next_page = driver.find_element_by_xpath("//div[@class='pagination']/ul[1]/li[11]")
next_page.click()

However, execution fails with the error: Element is not clickable at point

In other words, a headless browser clicks ordinary buttons easily, but elements driven by onclick handlers are not so simple.
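One common workaround for this error, which I did not pursue here (so treat it as a hedged sketch), is to dispatch the click from JavaScript, since that bypasses the clickability check:

```python
def js_click(driver, element):
    """Click an element via JavaScript; sidesteps 'Element is not clickable
    at point' errors caused by overlays or onclick-driven controls."""
    driver.execute_script("arguments[0].click();", element)

# usage (with the next-page element from above):
# next_page = driver.find_element_by_xpath("//div[@class='pagination']/ul[1]/li[11]")
# js_click(driver, next_page)
```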

So I turned to the JS. Opening the developer tools with F12 shows that this site loads only three JS files.

Searching through them did not turn up the paging function, but the paging function does appear in the page's HTML source.

And there is a link here!

I guessed that searchword corresponds to the search box, so I tried:

http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?searchword=5G

Sure enough!

But the same approach went no further: no similar links could be found for paging.

So I searched online and found an article about capturing requests in the browser: https://blog.csdn.net/weixin_39610722/article/details/110960576

First start recording in the Network panel, then click page 4:

Find the request, right-click it, and choose Copy as cURL; the content is as follows:

curl 'http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist' \
  -H 'Connection: keep-alive' \
  -H 'Cache-Control: max-age=0' \
  -H 'Upgrade-Insecure-Requests: 1' \
  -H 'Origin: http://zb.yfb.qianlima.com' \
  -H 'Content-Type: application/x-www-form-urlencoded' \
  -H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
  -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
  -H 'Referer: http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist' \
  -H 'Accept-Language: zh-CN,zh;q=0.9' \
  -H 'Cookie: JSESSIONID=C20E9758652DE1F141E51B349D4AE839; __jsluid_h=2b424c9697bd9692d2ad1451300a9c75; Hm_lvt_a31e80f5423f0ff316a81cf4521eaf0d=1610549640; pageSize=15; keywords=%E5%8D%83%E9%87%8C%E9%A9%AC%E6%8B%9B%E6%A0%87; keywordvaluess=""; laiyuan=3; Hm_lvt_0a38bdb0467f2ce847386f381ff6c0e8=1610550267; Hm_lpvt_0a38bdb0467f2ce847386f381ff6c0e8=1610550267; Hm_lvt_5dc1b78c0ab996bd6536c3a37f9ceda7=1610550268; Hm_lpvt_5dc1b78c0ab996bd6536c3a37f9ceda7=1610550268; UM_distinctid=176fc46d0a65b3-0c99d08cd43d87-31346d-e1000-176fc46d0a7d20; gr_user_id=97220f81-4919-4a9d-a198-24b3caf49796; pageNo=4; Hm_lpvt_a31e80f5423f0ff316a81cf4521eaf0d=1610560460' \
  --data-raw 'pageNo=7&kwname=&pageSize=15&ipAddress=122.96.44.71&searchword=&searchword2=&hotword=&provinceId=&provinceName=&areaId=&areaName=&infoType=0&infoTypeName=&noticeTypes=&noticeTypesName=&secondInfoType=&secondInfoTypeName=&timeType=5&timeTypeName=%E8%BF%91%E4%B8%80%E5%B9%B4&searchType=2&clearAll=false&e_keywordid=&e_creative=&flag=0&source=baidu&firstTime=1' \
  --compressed \
  --insecure
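The captured cURL can be translated into Python. The sketch below is hedged: the header set is trimmed, the session-specific Cookie and ipAddress fields are omitted (the server may or may not require them), and the form body is reduced to the fields that proved to matter above (pageNo, pageSize, searchword, infoType, timeType):

```python
from urllib import parse, request

BASE_URL = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"

def build_payload(page_no, page_size=15, searchword="", info_type=0, time_type=5):
    """Mirror the captured POST body, reduced to the fields that matter."""
    return {
        "pageNo": str(page_no),
        "pageSize": str(page_size),
        "searchword": searchword,
        "infoType": str(info_type),
        "timeType": str(time_type),
    }

def fetch_page(page_no, **kwargs):
    """POST the form the way the browser did and return the HTML text."""
    data = parse.urlencode(build_payload(page_no, **kwargs)).encode()
    req = request.Request(BASE_URL, data=data,
                          headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")

# example (network required):
# html = fetch_page(4, searchword="5G")
```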

Even before taking a closer look, one field jumps out at a glance: pageNo=4

I guessed that this was what we were looking for, tried http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?pageNo=4 , and it works!

In the same way, we also get infoType and timeType.

So after a thorough analysis of the JS, no browser automation is needed at all: just request the HTML pages in order.

Then I found that no matter what you search for or how you filter, at most 30 pages are ever shown.

A new round of the contest was brewing...

First, I built this link:

http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?infoType=0&timeType=0&searchword=5G&pageNo=1&pageSize=1000

Then I thought: with 1000 entries per page, why would I need pageNo at all?

http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?infoType=0&timeType=0&searchword=5G&pageSize=1000

If pageSize is large enough, all results of the keyword search come back in one response, and the 30-page limit no longer applies.
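A small URL builder captures this trick. The parameter names are exactly the ones discovered above; whether the server honors arbitrarily large pageSize values is an assumption:

```python
from urllib.parse import urlencode

BASE_URL = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist"

def all_results_url(searchword, page_size=1000, info_type=0, time_type=0):
    """Build the one-shot URL that returns every result for a keyword."""
    query = urlencode({
        "infoType": info_type,
        "timeType": time_type,
        "searchword": searchword,
        "pageSize": page_size,
    })
    return f"{BASE_URL}?{query}"

# all_results_url("5G")
# -> http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?infoType=0&timeType=0&searchword=5G&pageSize=1000
```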

 

4. The code

Crawler draft:

# coding=utf-8
import time
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, WebDriverException
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
base_url = "http://zb.yfb.qianlima.com/yfbsemsite/mesinfo/zbpglist?infoType=0&timeType=0&searchword=5G&pageNo=1"
driver = webdriver.Chrome(executable_path=r'C:\chromedriver.exe', options=chrome_options)
driver.get(base_url)
# click "All" for both the information type and the time range
select_type_box = driver.find_element_by_xpath("//ul[@class='select_type_box clearfix']/li[1]")
select_type_box.click()
select_time_box = driver.find_element_by_xpath("//ul[@class='fl select_time_box']/li[1]")
select_time_box.click()
print(driver.page_source)
while True:
    try:
        next_page = driver.find_element_by_xpath("//div[@class='pagination']/ul[1]/li[11]")
        next_page.click()
    except (NoSuchElementException, WebDriverException):
        break  # no next-page button: we are on the last page
    time.sleep(1)  # give the ajax content time to load
    print(driver.page_source)
driver.close()

Since this kind of crawl is not needed for now, the draft will be left as-is.


Origin blog.csdn.net/nameofcsdn/article/details/112598545