[Crawler Series] Using the Selenium Module to Collect Job Information from Recruitment Websites (1)

In the previous demonstration, the Requests module was used to collect job listings from the PC web version of the Zhaopin recruitment site, but it ran into a fairly well-hidden anti-crawling restriction (at the time, the cause was unclear), so that approach seemed to be a dead end for the moment. This time, I plan to try the Selenium module instead, and along the way track down the specific cause of the site's anti-crawling restriction.

1. Environment preparation

  • Google Chrome and the matching chromedriver.exe driver;
  • An editor that supports Python programming (such as PyCharm);
  • The selenium and beautifulsoup4 modules installed via pip (csv is part of the standard library);

Note: The installation and verification of the Chrome driver was covered in Getting Started with Crawlers, so it won't be repeated here.
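As a quick smoke test of the environment, a minimal sketch like the following should open a Chrome window, print the browser version, and close it again (it assumes chromedriver.exe matches your installed Chrome version and is on the PATH):

from selenium import webdriver

# Smoke test: open Chrome, print its version, then close it.
# Assumes chromedriver.exe matches the installed Chrome and is on the PATH.
browser = webdriver.Chrome()
print(browser.capabilities.get("browserVersion"))  # key may be "version" on older Selenium
browser.quit()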

2. Using the Selenium module

After choosing the target page, the most important step is to analyze the page structure so that the target elements can be located. The previous article already did that analysis in detail, so it won't be repeated here; instead, the selenium module is used directly to collect the data, while trying to find the cause of the earlier anti-crawling problem.

The idea behind the main method:

  • Use the webdriver library of the selenium module to simulate opening a browser;
  • Send a GET request to the first page of Zhaopin's Python job listings in Hangzhou (a single window for now, no multiple tabs);
  • After an implicit wait of up to 10 s, grab the page source of the first page and write it to a txt file for safekeeping;
  • Then parse and locate the page elements (the steps and methods are the same as before), extract the target data, write it to a csv file, and finally quit and close the browser.

Testing a crawl of the first page, the demonstration is as follows:

from selenium import webdriver

def selenium_zl(url, savePath, results, fileType):
    browser = webdriver.Chrome()
    browser.get(url)  # e.g. https://sou.zhaopin.com/?jl=653&kw=Python
    browser.implicitly_wait(10)  # wait up to 10 s for elements to appear
    html = browser.page_source
    html2txt(html, "1", savePath)  # save the raw page source to a txt file
    parser_html_by_bs(html, "1", results, fileType, savePath)  # parse and save to csv
    print('Crawled the page and parsed the data. Nicely done.....................................')
    browser.quit()
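The two helpers, html2txt and parser_html_by_bs, come from the previous article and are not shown in this excerpt. A minimal sketch of what they might look like is below; note that the CSS selectors are placeholder assumptions for illustration, not Zhaopin's real class names:

import csv
from bs4 import BeautifulSoup

def html2txt(html, pageNo, savePath):
    # Save the raw page source so the collected content can be inspected offline.
    with open(f"{savePath}/zhaopin_page_{pageNo}.txt", "w", encoding="utf-8") as f:
        f.write(html)

def parser_html_by_bs(html, pageNo, results, fileType, savePath):
    # Parse the job cards and collect (title, company, salary) rows.
    # NOTE: all selectors below are hypothetical; use the ones found
    # during the structure analysis in the previous article.
    soup = BeautifulSoup(html, "html.parser")
    for item in soup.select(".joblist-box__item"):
        fields = [item.select_one(sel) for sel in
                  (".jobinfo__name", ".companyinfo__name", ".jobinfo__salary")]
        results.append([f.get_text(strip=True) if f else "" for f in fields])
    if fileType == "csv":
        with open(f"{savePath}/zhaopin_page_{pageNo}.csv", "w", newline="", encoding="utf-8") as f:
            csv.writer(f).writerows(results)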

Checking the contents of the txt file:

Once you have the page source, you can verify whether the saved html actually contains the first page of data. I casually searched for the keyword "Hema Holdings", but it is nowhere to be found in the txt file.
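The same check can be done in code (the file name follows the sketch above, and "Hema Holdings" stands in for any keyword you expect to see on page 1):

# Sanity check: does the saved page source contain a keyword expected on page 1?
with open("save/zhaopin_page_1.txt", encoding="utf-8") as f:
    page = f.read()
print("Hema Holdings" in page)  # prints False here, which is what started the investigation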

Enough to make you start doubting your life?! At least it shows the problem has little to do with the choice of module.

After endless mountains and rivers, the path suddenly opens up:

After reorganizing my thoughts, I temporarily commented out the code that parses the page and quits the browser, and added a click call for the browser. At this point, I seemed to see something different.
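A minimal sketch of that modified version (what exactly gets clicked is incidental; the point is to keep the browser window open for inspection):

from selenium import webdriver
from selenium.webdriver.common.by import By

def selenium_zl(url, savePath, results, fileType):
    browser = webdriver.Chrome()
    browser.get(url)  # https://sou.zhaopin.com/?jl=653&kw=Python
    browser.implicitly_wait(10)
    html = browser.page_source
    html2txt(html, "1", savePath)
    # parser_html_by_bs(html, "1", results, fileType, savePath)  # parsing disabled for now
    # Click on the page and leave the browser open so it can be inspected by hand.
    browser.find_element(By.TAG_NAME, "body").click()
    # browser.quit()  # quitting disabled so the window stays open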

With the simulated browser left open, the content of the first page is visible in the window.

And here the penny drops: the crawl is being restricted by Session access. Scroll the simulated browser back to the top and it all becomes clear: the fresh session is being asked to log in.

This is really frustrating. Reflecting on it: during the earlier analysis I had already logged in manually in my own browser, so it never occurred to me that the crawler's browser would have to log in as well! You really can't take these things for granted. Now that the cause is understood, it can be bypassed in a targeted way.

With the cause identified, it's time to take targeted aim:

If the restriction is Session-based, the fix is in principle simple: create a Session and log in before sending the request. But things are not that simple: Zhaopin's PC-side page only allows login with a mobile phone number plus an SMS verification code!!! I was speechless; this greatly increases the difficulty of the crawl. I'm exhausted!

After some research and testing, there are roughly two possible solutions:

  • One is to forward the SMS verification code from the phone to a third-party platform, and have the crawler code automatically fetch it from that platform's server.
  • The other is to log in to the website manually first, capture the logged-in cookies, and have the crawler code reuse them to simulate a logged-in session.

The first option is harder and demands a substantial knowledge base: for example, understanding how to intercept and relay the phone's SMS messages and how to use the Flask framework, or simply paying for a captcha-solving platform, or even implementing a recognition method with deep learning yourself. Considering security and implementation complexity, haha, I was immediately talked out of it. My current knowledge and reserves aren't up to it, so I'll study it in depth later.

The second option is somewhat less difficult, though it requires some familiarity with HTTP communication. This method is simple and feasible, and it is the one I recommend. The next article will therefore focus on bypassing the SMS verification code to log in this way; a rough sketch of the idea follows.
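As a preview, here is a minimal sketch of the cookie approach, assuming the cookies were exported to a json file after a manual login (the file name and format are assumptions):

import json
from selenium import webdriver

browser = webdriver.Chrome()
# Selenium only accepts cookies for the domain currently loaded,
# so open the site first, then inject the saved cookies.
browser.get("https://sou.zhaopin.com/")
with open("zhaopin_cookies.json", encoding="utf-8") as f:  # exported after a manual login
    for cookie in json.load(f):
        cookie.pop("expiry", None)  # drop fields the driver may reject
        browser.add_cookie(cookie)
# Reload the target page; the session should now be treated as logged in.
browser.get("https://sou.zhaopin.com/?jl=653&kw=Python")
browser.implicitly_wait(10)
print(len(browser.page_source))  # the source should now contain the job data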

3. Summary

So far, the cause of the problem that had been bothering me for so long has finally been found: it was the website's anti-crawling restriction, which demands a logged-in session! Noting it down here for the record.

Next, I will keep working on a solution to this problem and accomplish the original goal; the next part will summarize and tidy everything up. That's it for now.
