Python crawler: scraping 51job with Selenium

1. Background introduction

Selenium drives a real browser and simulates user actions in it, then scrapes the data from the rendered pages. You also need to install a driver that matches your browser; the installation steps are left to the reader.

Mind map: (image omitted)

2. Import library

import csv
import random
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By

3. Remove browser identification

option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('detach', True)

These options remove the "Chrome is being controlled by automated test software" banner at the top of the browser, and `detach` keeps the window open after the script finishes.

4. Instantiate a browser object (pass the options to the driver)

driver = webdriver.Chrome(options=option)

5. Initiate the request

driver.get("https://www.51job.com/")
time.sleep(2)  # sleep 2 seconds in case the page loads slowly

6. Evade webdriver feature detection

script = 'Object.defineProperty(navigator, "webdriver", {get: () => false,});'
driver.execute_script(script)

If no CAPTCHA box or slider verification appears, the site's navigator.webdriver check has been bypassed successfully.
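The script above only patches the page that is currently loaded. A sketch of installing the same override before any page script runs, via the Chrome DevTools Protocol (`execute_cdp_cmd` is available in Selenium 4 with Chromium-based drivers); the helper name `install_stealth` is our own:

```python
# A sketch, assuming Selenium 4 with a Chromium-based driver.
# Page.addScriptToEvaluateOnNewDocument registers the override before any
# page script runs, so the patch also survives navigation.
STEALTH_SCRIPT = (
    'Object.defineProperty(navigator, "webdriver", {get: () => undefined});'
)

def install_stealth(driver):
    # execute_cdp_cmd sends a raw Chrome DevTools Protocol command.
    driver.execute_cdp_cmd(
        'Page.addScriptToEvaluateOnNewDocument',
        {'source': STEALTH_SCRIPT},
    )
```

Call `install_stealth(driver)` once, right after creating the driver and before the first `driver.get(...)`.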

7. Locate the input box and find related jobs

kw_input = driver.find_element(By.XPATH, '//*[@id="kwdselectid"]')
kw_input.click()
kw_input.clear()
kw_input.send_keys('老师')  # '老师' = "teacher"
driver.find_element(By.XPATH, '/html/body/div[3]/div/div[1]/div/button').click()
# driver.implicitly_wait(10)
time.sleep(5)
print(driver.current_url)

The keyword entered here is "老师" (teacher); change it to suit your needs.

8. Extract data using xpath and css selectors

from selenium.common.exceptions import NoSuchElementException

jobData = driver.find_elements(By.XPATH, '//*[@id="app"]/div/div[2]/div/div/div[2]/div/div[2]/div/div[2]/div[1]/div')
for job in jobData:
    jobName = job.find_element(By.CLASS_NAME, 'jname.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    jobSalary = job.find_element(By.CLASS_NAME, 'sal').text
    time.sleep(random.randint(5, 15) * 0.1)
    jobCompany = job.find_element(By.CLASS_NAME, 'cname.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    company_type_size = job.find_element(By.CLASS_NAME, 'dc.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    company_status = job.find_element(By.CLASS_NAME, 'int.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    address_experience_education = job.find_element(By.CLASS_NAME, 'd.at').text
    time.sleep(random.randint(5, 15) * 0.1)

    try:
        job_welf = job.find_element(By.CLASS_NAME, 'tags').get_attribute('title')
    except NoSuchElementException:
        job_welf = '无数据'  # '无数据' = "no data"
    time.sleep(random.randint(5, 15) * 0.1)

    update_date = job.find_element(By.CLASS_NAME, 'time').text
    time.sleep(random.randint(5, 15) * 0.1)

    print(jobName, jobSalary, jobCompany, company_type_size, company_status,
          address_experience_education, job_welf, update_date)

To avoid triggering the site's anti-crawling measures, the program sleeps for a random interval between field extractions. (Tune the interval to your needs.)
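The expression `random.randint(5, 15) * 0.1` yields a pause between 0.5 and 1.5 seconds. Wrapping it in a helper keeps the scraping loop readable; a sketch, where the name `human_pause` is our own:

```python
import random
import time

def human_pause(lo_tenths=5, hi_tenths=15):
    """Sleep a random interval between lo_tenths/10 and hi_tenths/10
    seconds, and return the chosen interval (useful for logging)."""
    interval = random.randint(lo_tenths, hi_tenths) * 0.1
    time.sleep(interval)
    return interval
```

Each `time.sleep(random.randint(5, 15) * 0.1)` line above could then become a single `human_pause()` call.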

Some postings do not list any benefits, so the welfare field is wrapped in try...except with a default value.
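The same pattern generalizes to any optional field. A sketch of a small helper (the name `safe_find_text` and its parameters are our own) that falls back to a default when the element is missing from a posting:

```python
def safe_find_text(parent, by, locator, default='无数据'):
    """Return the text of a child element, or `default` ('无数据' = "no
    data") when the element is absent from this posting."""
    try:
        return parent.find_element(by, locator).text
    except Exception:  # Selenium raises NoSuchElementException here
        return default
```

For example, `safe_find_text(job, By.CLASS_NAME, 'd.at')` would replace one try/except block per optional field.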

9. Locate the page input box and realize the jump

Use XPath to locate the page-number input box, enter the page number, and complete the jump. To avoid anti-crawling detection, the program sleeps for a random interval after each operation.

driver.find_element(By.XPATH, '//*[@id="jump_page"]').click()
time.sleep(random.randint(10, 30) * 0.1)
driver.find_element(By.XPATH, '//*[@id="jump_page"]').clear()
time.sleep(random.randint(10, 40) * 0.1)
driver.find_element(By.XPATH, '//*[@id="jump_page"]').send_keys(page)
time.sleep(random.randint(10, 30) * 0.1)
driver.find_element(By.XPATH, '//*[@id="app"]/div/div[2]/div/div/div[2]/div/div[2]/div/div[3]/div/div/span[3]').click()
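Putting the page jump together with the extraction step gives the overall crawl loop. A sketch, where `scrape_page` and `jump_to_page` are hypothetical callables standing in for the snippets above:

```python
import random
import time

def crawl_pages(scrape_page, jump_to_page, total_pages,
                pause=lambda: time.sleep(random.randint(10, 30) * 0.1)):
    """Scrape page 1, then jump to each following page and scrape it.
    scrape_page() returns a list of rows; jump_to_page(n) navigates to
    page n; pause() inserts the random anti-crawling delay."""
    rows = []
    for page in range(1, total_pages + 1):
        rows.extend(scrape_page())
        if page < total_pages:
            jump_to_page(page + 1)
            pause()  # random delay between page jumps, as above
    return rows
```

The `pause` argument defaults to the same 1.0–3.0 second random delay used in the snippet above, but can be swapped out when testing.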

10. Data storage

Save the extracted data to a CSV file in append mode.

# utf-8-sig so spreadsheet software opens the Chinese text correctly
with open('wuyou_teacher.csv', 'a', newline='', encoding='utf-8-sig') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow([jobName, jobSalary, jobCompany, company_type_size,
                     company_status, address_experience_education,
                     job_welf, update_date])
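Appending on every run means the file never gets a header row. A sketch of a small wrapper that writes the header only when the file is first created; the helper name `append_row` and the header labels are our own:

```python
import csv
import os

HEADER = ['job', 'salary', 'company', 'type_size',
          'status', 'addr_exp_edu', 'welfare', 'updated']

def append_row(path, row, header=HEADER):
    """Append one row; write the header first only for a new file."""
    is_new = not os.path.exists(path)
    with open(path, 'a', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(header)
        writer.writerow(row)
```

Calling `append_row` once per job replaces the `with open(...)` block above and keeps re-runs from producing a headerless file.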


Origin blog.csdn.net/m0_62428181/article/details/129597479