[Crawler Series] Python Crawler in Practice: Scraping Job Listings from a Recruitment Website

1. Analysis

1. Requirements analysis

When looking for a job online, people generally search for relevant information on the various recruitment websites. In this article, a crawler is used to collect job postings from a recruitment website, covering the fields we care about: job title, job requirements, salary, company name, company size, company location, benefits, and so on. Once the data has been fetched and parsed, it is saved to an Excel or CSV file.

2. Analysis of the target page structure

Take the "Zhaolian Recruitment" PC-side webpage as an example, search and open the website, and log in with account password (mainly to avoid session access restrictions). Next, select the target city and search for the job information related to Python. The website will return the paged results of the relevant job information, as shown in the figure:

[Figure: paginated Python job search results on the Zhaopin website]

A quick check suggests that the current page is not dynamically rendered and that there is no strict anti-crawler mechanism.
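
One rough way to do such a check (a sketch, not the author's exact verification steps; it assumes the search URL analyzed later is reachable without the logged-in session, otherwise cookies would also be needed):

import requests

url = "https://sou.zhaopin.com/?jl=653&kw=Python"   # the search URL analyzed later in this article
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
print(resp.status_code)
# If the page is not dynamically rendered, the job-item markup analyzed below
# should appear directly in the raw HTML returned here.
print("joblist-box__item" in resp.text)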

So, start the analysis from the first page. Open the Chrome developer tools with [F12] and use the [Elements] panel to quickly locate the job list (the div tag whose class is positionlist), then pick one of the "class=joblist-box__item clearfix" div tags under that list for analysis. The recruitment information we need is basically contained in the 3 div tags below it, as follows:

[Figure: DOM structure of a single joblist-box__item listing in the developer tools]

To locate the job listing items with BeautifulSoup, you can do this:

soup.find_all('div', class_='joblist-box__item clearfix')
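
For context, a minimal sketch of how this fits together (the URL and headers are the ones used elsewhere in this article):

import requests
from bs4 import BeautifulSoup

html = requests.get("https://sou.zhaopin.com/?jl=653&kw=Python",
                    headers={"User-Agent": "Mozilla/5.0"}, timeout=10).text
soup = BeautifulSoup(html, "html.parser")
job_items = soup.find_all('div', class_='joblist-box__item clearfix')
print(f"found {len(job_items)} job items on this page")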

For the target data listed in the requirements analysis, you need to expand the element tags further to find the exact locations, as follows:

[Figure: element tags containing the target fields]

Now comes the most critical part: parsing out the target data. Taking the current listing as an example, you can do this:

# Job title
soup.select("div[class='iteminfo__line1__jobname'] > span[class='iteminfo__line1__jobname__name']")[0].get_text().strip()
# Company name
soup.select("div[class='iteminfo__line1__compname'] > span")[0].get_text().strip()
# Salary
soup.select("div[class='iteminfo__line2__jobdesc'] > p")[0].get_text().strip()
# Company location
soup.select("div[class='iteminfo__line2__jobdesc'] > ul[class='iteminfo__line2__jobdesc__demand']")[0].get_text().strip().split(" ")[0]

To traverse all the recruitment information on the first page, replace select(xxx)[0] with select(xxx)[i] inside a loop, as sketched below.
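
As a small sketch of one way to write that loop, you can also zip the parallel select() results instead of indexing, which is equivalent on this page since the lists line up one-to-one with the listings (soup is the BeautifulSoup object built in the sketch above):

names = soup.select("div[class='iteminfo__line1__jobname'] > span[class='iteminfo__line1__jobname__name']")
companies = soup.select("div[class='iteminfo__line1__compname'] > span")
salaries = soup.select("div[class='iteminfo__line2__jobdesc'] > p")
for name, company, salary in zip(names, companies, salaries):
    print(name.get_text().strip(), company.get_text().strip(), salary.get_text().strip())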

3. Pagination Analysis

In practice we will not crawl only the first page; depending on the situation we may need to crawl a given number of pages or all of them. For that, we have to look for a pattern in the request URL. The URL of the first page is:

https://sou.zhaopin.com/?jl=653&kw=Python

Clicking to the second page, the third page, and so on until the last page, the URL takes the form:

https://sou.zhaopin.com/?jl=653&kw=Python&p=1
https://sou.zhaopin.com/?jl=653&kw=Python&p=2
......
https://sou.zhaopin.com/?jl=653&kw=Python&p=6

An obvious pattern emerges: appending &p=1 to the first page's URL returns exactly the same result as the first page, so the number of pages crawled can be controlled through the parameter p, as sketched below.
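
A minimal sketch of generating the per-page URLs from this pattern (the base URL is the one used in the test code later; 6 is the page count observed for this search):

base_url = "https://sou.zhaopin.com/?jl=653&kw=Python&p="
page_urls = [base_url + str(page) for page in range(1, 6 + 1)]
for url in page_urls:
    print(url)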

What about crawling all pages: how do we tell that the current page is the last one? After some analysis, the pagination area turns out to be in the div tag with "class=pagination clearfix", or more precisely in the div tag with "class=pagination__pages", as shown below:

[Figure: structure of the pagination area]

Clicking to the third page still shows nothing that marks the last page. Only when you click through to the last page (page 6) does the difference appear: the "next page" button gains an extra class attribute, as shown below:

[Figure: the next-page button with the extra disable class on the last page]

Using this obvious feature, you can write a selector like this to easily identify the last page:

soup.select("div[class='pagination__pages'] > button[class='btn soupager__btn soupager__btn--disable']")

Then simply check whether the result (call it last_page) is empty: if it is empty, the current page is not the last page. With the above analysis complete, the crawling and parsing of the recruitment pages is ready to be implemented.
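
In code, "empty" simply means that select() returned an empty list; a sketch of the check (soup is the parsed page):

last_page = soup.select("div[class='pagination__pages'] > button[class='btn soupager__btn soupager__btn--disable']")
is_last_page = len(last_page) != 0   # empty result -> the current page is not the last page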

Next, the technical solution of requests + BeautifulSoup + CSS selectors is chosen to achieve our crawler goal.

2. Code implementation

1. Implementation of the main method

The idea is as follows:

  • Splice the request URL and issue a GET request (pagination is supported); if the response code is 200, fetch the HTML source of each page in a loop;
  • Save the HTML source as a txt file, which makes it easier to analyze and troubleshoot problems (such as garbled characters);
  • Then parse the HTML page, extract the target data, and save it to a file in the chosen format, such as a CSV file.

The code is as follows:

# Imports used throughout this article
import csv
import requests
from bs4 import BeautifulSoup
from fake_useragent import UserAgent

def process_zhilianzhaopin(baseUrl, pages, fileType, savePath):
    results = [['岗位名称', '公司名称', '岗位薪资', '岗位要求', '公司位置', '福利待遇']]
    # fake_useragent with a local JSON cache; .google returns a browser user-agent string
    ua = UserAgent(path='D:\\XXX\\reptile\\fake_useragent.json').google
    headers = {'User-Agent': ua}
    # Splice the request URL from the pages argument to control how many pages are crawled
    for page in range(1, int(pages) + 1):
        url = baseUrl + str(page)
        response = requests.get(url, headers=headers)
        print(f"current url:{url},status_code={response.status_code}")
        if response.status_code == 200:
            html = response.text
            html2txt(html, page, savePath)
            parser_html_by_bs(html, page, results, fileType, savePath)
        else:
            print(f"error,response code is {response.status_code} !")
    print('爬取页面并解析数据完毕,棒棒哒.....................................')

2. Save the source code of the web page

Note: the directory used to save the page source must already exist (a sketch for creating it follows the function below).

def html2txt(html, page, savePath):
    # Save the raw HTML of each page so problems (e.g. garbled text, unexpected content) can be inspected later
    with open(f'{savePath}\\html2txt\\zhilianzhaopin_python_html_{page}.txt', 'w', encoding='utf-8') as wf:
        wf.write(html)
        print(f'write zhilianzhaopin_python_html_{page}.txt succeeded!')
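
If the output directories do not exist yet, a small helper can create them up front (a sketch; ensure_dirs is an illustrative name, and the sub-directory names match the paths used by html2txt and write2file in this article):

import os

def ensure_dirs(savePath):
    # Create the output directories used by html2txt and write2file if they are missing
    os.makedirs(f'{savePath}\\html2txt', exist_ok=True)
    os.makedirs(f'{savePath}\\to_csv', exist_ok=True)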

3. Parse the source code of the web page

Analysis process:

  • Overall idea: first check whether the current page is the last page; if it is not, locate the page elements and extract the target data, then write the extracted data to a file in the specified format.

def parser_html_by_bs(html, current_page, results, fileType, savePath):
    soup = BeautifulSoup(html, "html.parser")
    # Check whether the current page is the last page
    if not judge_last_page(current_page, soup):
        # Locate the page elements and extract the target data
        get_target_info(soup, results)
        # Write the parsed data to the specified file type
        write2file(current_page, results, fileType, savePath)

  • Determine whether the current page is the last page: if it is, stop parsing; otherwise, continue. The check is as follows:

def judge_last_page(current_page, soup):
    last_page = soup \
        .select("div[class='pagination__pages'] > button[class='btn soupager__btn soupager__btn--disable']")
    if len(last_page) != 0:
        print("current_page is last_page,page num is " + str(last_page))
        return True
    print(f"current_page is {current_page},last_page is {last_page}")
    return False

  • Parse the job title, company name, salary, job requirements, company location, benefits, and other fields from the page source, assemble each listing into a row, and append it to the results list, as follows:

def get_target_info(soup, results):
    jobList = soup.find_all('div', class_='joblist-box__item clearfix')
    # print(f"jobList: {jobList},size is {len(jobList)}")
    for i in range(0, len(jobList)):
        job_name = soup.select("div[class='iteminfo__line1__jobname'] > span[class='iteminfo__line1__jobname__name']")[i].get_text().strip()
        company_name = soup.select("div[class='iteminfo__line1__compname'] > span")[i].get_text().strip()
        salary = soup.select("div[class='iteminfo__line2__jobdesc'] > p")[i].get_text().strip()
        desc_list = soup.select("div[class='iteminfo__line2__jobdesc'] > ul[class='iteminfo__line2__jobdesc__demand']")[i].get_text().strip()
        # print(f"job_name={job_name} , company_name={company_name}, salary={salary}, tag_list=null, job_area={desc_list.split(' ')[0]}, info_desc=null")
        results.append([job_name,company_name,salary,desc_list.split(" ")[1] + "," + desc_list.split(" ")[2], desc_list.split(" ")[0],"暂时无法获取"])

4. Implementation of the file-writing method

Note: the directory for the output file must already exist. Writing a CSV file is used as the example here; to write other formats, you can add a conditional branch for them yourself (a sketch of an Excel branch follows the function below).

def write2file(current_page, results, fileType, savePath):
    if fileType.endswith(".csv"):
        # Note: results accumulates rows from all pages, so writing the whole list on every call
        # repeats earlier pages in the CSV; writing only the newly parsed rows would avoid that.
        with open(f'{savePath}\\to_csv\\zhilianzhaopin_python.csv', 'a+', encoding='utf-8-sig', newline='') as af:
            writer = csv.writer(af)
            writer.writerows(results)
            print(f'第{current_page}页爬取数据保存csv成功!')
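
If you also want the Excel output mentioned in the requirements analysis, a sketch of such a branch could look like this (write2excel is an illustrative name; it assumes pandas and openpyxl are installed and the to_excel sub-directory exists):

import pandas as pd   # .xlsx output also requires the openpyxl package

def write2excel(results, savePath):
    # results[0] is the header row, the remaining entries are data rows
    df = pd.DataFrame(results[1:], columns=results[0])
    df.to_excel(f'{savePath}\\to_excel\\zhilianzhaopin_python.xlsx', index=False)

It could be called from write2file behind an additional branch such as elif fileType.endswith(".xlsx").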

5. Start the test

if __name__ == '__main__':
    base_url = "https://sou.zhaopin.com/?jl=653&kw=Python&p="
    save_path = "D:\\XXX\\zhilianzhaopin_python"
    page_total = "2"
    process_zhilianzhaopin(base_url, page_total, ".csv", save_path)

Taking the first two pages as an example, simply assign page_total a value of 2; the console output is as follows:

[Figure: console log output for the two crawled pages]

The page source is saved to text files; the test results are as follows:

[Figure: saved page-source text files]

The parsed data is written to a CSV file; the test results are as follows:

[Figure: CSV output written by the test run]

Open the CSV file and you can see the 2 crawled pages with a total of 40 job listings, as follows:

[Figure: crawled job data in the CSV file]

To verify crawling all pages (there are 6 in total), simply assign page_total a value of 6 in the test code.

3. Record of problems encountered

Various problems are unavoidable during crawling; some of them cannot be solved for the time being and are quite tricky, so they are recorded here.

<1>. Garbled characters

When writing the CSV file, the crawled Chinese characters came out garbled and unreadable. Changing encoding='utf-8' to encoding='utf-8-sig' solved it.

<2>. Duplicate data

Taking the first two pages as an example again, duplicate rows appeared in the written CSV data. After analysis, the crawling flow itself appeared fine; the real problem is that no matter which page is requested, the server returns the same default data. The most direct evidence is that the contents of the zhilianzhaopin_python_html_1.txt and zhilianzhaopin_python_html_2.txt source files are exactly the same, yet the scraped jobs and company names cannot be found when the links printed in the log are opened in a browser.
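
A quick way to confirm that the two saved pages really are byte-for-byte identical is to hash the files (a sketch; the paths follow the save_path and file names used by html2txt above):

import hashlib

def file_md5(path):
    # Hash the raw bytes of a saved page-source file
    with open(path, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

p1 = 'D:\\XXX\\zhilianzhaopin_python\\html2txt\\zhilianzhaopin_python_html_1.txt'
p2 = 'D:\\XXX\\zhilianzhaopin_python\\html2txt\\zhilianzhaopin_python_html_2.txt'
print(file_md5(p1) == file_md5(p2))   # True -> the server returned the same content for both pages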

At this point it suddenly dawned on me: the Zhaopin website does have an anti-crawler mechanism, but it is relatively well hidden and not easy to discover or verify, because the crawler can still fetch page data. Without careful verification or analysis of the crawled data, it is hard to notice this, and you may mistakenly believe the crawl succeeded.

<3>. The last-page check does not take effect

The fact that the last-page check never triggered puzzled me at first. Judging from the content of the crawled data, the website's anti-crawler mechanism redirects every request to the same default page source, so the marker of the last page never appears. What a pit!

Finally

That is how crawling goes: sometimes you think you have bypassed a site's anti-crawler mechanism, only to discover the problem during data verification and cleaning. This is a useful wake-up call for anyone learning to write crawlers.

Still, the analysis approach and code in this article apply to pages without strict anti-crawler measures; the focus is on analyzing the page structure and locating page elements. There are ways to deal with this relatively hidden anti-crawler mechanism as well: the next article will demonstrate how to use the selenium module to solve this kind of problem.


Origin blog.csdn.net/qq_29119581/article/details/127824219