1. Background introduction
Selenium drives a real browser to simulate user actions, which makes it possible to crawl data from the rendered pages. You also need to install a browser driver that matches your browser (for example, ChromeDriver for Chrome); the installation steps are not covered here.
2. Import library
import csv
import random
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
3. Remove browser identification
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])
option.add_experimental_option('detach', True)
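The excludeSwitches option removes the "Chrome is being controlled by automated test software" banner at the top of the browser window, and detach keeps the browser open after the script finishes.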
4. Instantiate a browser object (passing in the options)
driver = webdriver.Chrome(options=option)
5. Initiate the request
driver.get("https://www.51job.com/")
time.sleep(2)  # Sleep for 2 seconds in case the page loads slowly
6. Bypass feature detection
script = 'Object.defineProperty(navigator, "webdriver", {get: () => false,});'
driver.execute_script(script)
If no verification box or slider CAPTCHA appears, the site has not flagged the session as Selenium-driven. Note that the script runs after the page has loaded and only affects the current document.
7. Locate the input box and find related jobs
driver.find_element(By.XPATH, '//*[@id="kwdselectid"]').click()
driver.find_element(By.XPATH, '//*[@id="kwdselectid"]').clear()
driver.find_element(By.XPATH, '//*[@id="kwdselectid"]').send_keys('老师')
driver.find_element(By.XPATH, '/html/body/div[3]/div/div[1]/div/button').click()
# driver.implicitly_wait(10)
time.sleep(5)
print(driver.current_url)
The keyword entered here is "老师" (teacher); change it to suit your needs.
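The fixed time.sleep(5) above waits the full five seconds even when the results are ready sooner, and may be too short on a slow connection. Selenium ships its own explicit-wait support (WebDriverWait); the block below is only a standard-library sketch of the same polling idea, and the name wait_until and its parameters are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    Exceptions raised by condition() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the DOM yet
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} s")
        time.sleep(poll)
```

With the driver from step 4, a call such as wait_until(lambda: driver.find_elements(By.XPATH, '//*[@id="jump_page"]')) would return as soon as the page-number box exists, since find_elements returns an empty (falsy) list until then.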
8. Extract data using xpath and css selectors
jobData = driver.find_elements(By.XPATH, '//*[@id="app"]/div/div[2]/div/div/div[2]/div/div[2]/div/div[2]/div[1]/div')
for job in jobData:
    jobName = job.find_element(By.CLASS_NAME, 'jname.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    jobSalary = job.find_element(By.CLASS_NAME, 'sal').text
    time.sleep(random.randint(5, 15) * 0.1)
    jobCompany = job.find_element(By.CLASS_NAME, 'cname.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    company_type_size = job.find_element(By.CLASS_NAME, 'dc.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    company_status = job.find_element(By.CLASS_NAME, 'int.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    address_experience_education = job.find_element(By.CLASS_NAME, 'd.at').text
    time.sleep(random.randint(5, 15) * 0.1)
    try:
        job_welf = job.find_element(By.CLASS_NAME, 'tags').get_attribute('title')
    except Exception:
        # Some postings have no benefits tag; '无数据' means "no data"
        job_welf = '无数据'
    time.sleep(random.randint(5, 15) * 0.1)
    update_date = job.find_element(By.CLASS_NAME, 'time').text
    time.sleep(random.randint(5, 15) * 0.1)
    print(jobName, jobSalary, jobCompany, company_type_size, company_status,
          address_experience_education, job_welf, update_date)
To avoid triggering the site's anti-crawling measures, the program sleeps for a random interval (0.5 to 1.5 seconds here) between lookups; adjust the range to suit your needs.
Some postings do not list any benefits, so that lookup is wrapped in try...except and falls back to a placeholder value.
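The repeated sleep calls and the try...except fallback could be factored into two small helpers. This is a minimal sketch, not the original author's code: the names random_pause and text_or_default are hypothetical, and the sleeper parameter exists only so the pause can be replaced in tests.

```python
import random
import time


def random_pause(low=0.5, high=1.5, sleeper=time.sleep):
    """Sleep for a random duration between low and high seconds; return it."""
    delay = random.uniform(low, high)
    sleeper(delay)
    return delay


def text_or_default(lookup, default="无数据"):
    """Return lookup(), or `default` if lookup raises (e.g. element missing)."""
    try:
        return lookup()
    except Exception:
        return default
```

Inside the loop, the benefits lookup could then read job_welf = text_or_default(lambda: job.find_element(By.CLASS_NAME, 'tags').get_attribute('title')), followed by random_pause().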
9. Locate the page input box and realize the jump
Use XPath to locate the page-number input box, enter the page number, and click the jump button. Here page holds the target page number (for example '2'). To avoid triggering anti-crawling measures, the program sleeps for a random interval after each operation.
driver.find_element(By.XPATH, '//*[@id="jump_page"]').click()
time.sleep(random.randint(10, 30) * 0.1)
driver.find_element(By.XPATH, '//*[@id="jump_page"]').clear()
time.sleep(random.randint(10, 40) * 0.1)
driver.find_element(By.XPATH, '//*[@id="jump_page"]').send_keys(page)
time.sleep(random.randint(10, 30) * 0.1)
driver.find_element(By.XPATH,
                    '//*[@id="app"]/div/div[2]/div/div/div[2]/div/div[2]/div/div[3]/div/div/span[3]').click()
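The jump code above handles a single page, and the tutorial does not show how page is set. One way to tie extraction and page jumps together is a loop like the sketch below. It assumes the browser already shows the first results page; crawl_pages, scrape_page, and go_to_page are hypothetical names, where scrape_page would wrap the extraction loop from step 8 and go_to_page would wrap the jump code from this step.

```python
def crawl_pages(scrape_page, go_to_page, last_page, first_page=1):
    """Scrape pages first_page..last_page and return all rows in order.

    scrape_page() -> list of rows for the page currently displayed.
    go_to_page(page) -> navigates to the given page number (passed as text).
    """
    rows = []
    for number in range(first_page, last_page + 1):
        if number != first_page:
            # The first page is already on screen; jump only for later pages.
            go_to_page(str(number))
        rows.extend(scrape_page())
    return rows
```

Passing the two steps in as callables keeps the loop independent of the site's XPath expressions, which tend to change when the page layout changes.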
10. Data storage
Append each extracted record to a CSV file. Opening the file in append mode ('a') keeps the rows written earlier.
with open('wuyou_teacher.csv', 'a', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(
        [jobName, jobSalary, jobCompany, company_type_size, company_status,
         address_experience_education, job_welf, update_date])
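Because the file is opened in append mode, it never gets a header row, and its encoding depends on the platform default. The sketch below shows one possible way to write a header only when the file is new or empty and to fix the encoding to UTF-8. The function name append_rows and the HEADER list are assumptions; the column names simply reuse the variable names from step 8.

```python
import csv
import os

# Column names borrowed from the variables in step 8 (an assumption, not
# part of the original script).
HEADER = ["jobName", "jobSalary", "jobCompany", "company_type_size",
          "company_status", "address_experience_education",
          "job_welf", "update_date"]


def append_rows(path, rows, header=HEADER):
    """Append rows to a CSV file, writing the header only if the file is new or empty."""
    needs_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as csvfile:
        writer = csv.writer(csvfile)
        if needs_header:
            writer.writerow(header)
        writer.writerows(rows)
```

Collecting the rows of one page in a list and calling append_rows once per page also avoids reopening the file for every record.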