Preface
The target of this project is a website about China's college entrance examination (gaokao). It provides exam information, admission score line queries, application guidance, university profiles, and other services, and is widely used by Chinese high school students and their parents.
The specific steps are as follows:
Importing the libraries
First, we import the required libraries:
# time module, for pacing requests
import time
# browser automation
from selenium import webdriver
# saving data to a CSV file
import csv
time: used to slow the program down so that the site does not block our IP.
selenium: used to drive a real browser, which gets around some anti-scraping measures such as JavaScript rendering.
csv: used to write the scraped data to a CSV file.
Open the browser and visit the web page
Then, we open the Chrome browser and visit the computer-major page on the gaokao website:
# open the browser
driver = webdriver.Chrome()
# visit the site
driver.get('https://www.gaokao.cn/special?fromcoop=pddh&subjectCategory=%E5%B7%A5%E5%AD%A6&subjectName=%E8%AE%A1%E7%AE%97%E6%9C%BA%E7%B1%BB')
# implicit wait so page elements have time to load
driver.implicitly_wait(10)
Here the webdriver.Chrome class from the selenium module opens the Chrome browser, and the get method loads the computer-major page on the gaokao website. We also call implicitly_wait with a timeout of 10 seconds to ensure that the elements we need have fully loaded before we try to use them.
Get the elements of all universities and open each university's information page in turn.
lis = driver.find_elements_by_css_selector('.major-list_setSchool__3Nr1N')
for li in lis:
    # click the school entry; it opens in a new tab
    li.click()
    # switch to the newly opened tab
    handles = driver.window_handles
    driver.switch_to.window(handles[-1])
We use a for loop to iterate over the school elements obtained above. Calling click on an element simulates a mouse click on the school name, which opens a sub-page containing that school's data in a new tab, so we switch the driver to the newest window handle.
After entering a university's information page, we use an infinite loop to keep turning pages and extract the basic information listed on each page.
while True:
    driver.implicitly_wait(10)
    time.sleep(1)
    divs = driver.find_elements_by_css_selector('.school-tab_schoolInfo__1mNye')
    for div in divs:
        # extract each university's basic information and save it to the CSV file
        # ...
        pass
    # check whether the "next page" button is disabled
    frame = driver.find_element_by_css_selector('.ant-pagination-next')
    next_page = frame.get_attribute('aria-disabled')
    if next_page == 'true':
        break
    elif next_page == 'false':
        frame.click()
The while True statement enters an infinite loop. On each iteration, find_elements_by_css_selector collects the info element for each school, a try/except block guards against extraction errors, and the extracted data is written to the CSV file. At the end of each iteration the code inspects the next-page button: if its aria-disabled attribute is 'true' there is no next page and the loop exits; otherwise frame.click() advances to the next page.
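The body of the inner for loop is elided above. As one hypothetical sketch of that step, assuming each info element's text puts the school name on the first line followed by its tag lines (the real layout must be checked against the page), div.text could be split into the CSV fields like this:

```python
def parse_school_block(text):
    # Split one school card's text into the CSV fields used later.
    # The layout assumed here (name first, then up to three tags, with any
    # remaining lines joined into 'tags') is an illustration, not the
    # site's confirmed format.
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    return {
        '学校': lines[0] if lines else '',
        'tag1': lines[1] if len(lines) > 1 else '',
        'tag2': lines[2] if len(lines) > 2 else '',
        'tag3': lines[3] if len(lines) > 3 else '',
        'tags': ' '.join(lines[4:]),
    }

# example with made-up data
row = parse_school_block('清华大学\n公办\n本科\n综合类\n985 211')
```

Inside the loop one would then call csv_writer.writerow(parse_school_block(div.text)), wrapped in the try/except described above.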
Close the current information page, return to the initial page, and continue traversing the next university's information page.
# close the detail tab
driver.close()
# switch back to the original tab
driver.switch_to.window(handles[0])
Create the CSV file
Next, we create the CSV file and write the header (in the full script this step must come before the scraping loop, since the loop writes its rows into this file):
# create the file
f = open('data1.csv', mode='w', encoding='utf-8', newline='')
# set up the header
csv_writer = csv.DictWriter(f, fieldnames=[
    '学校',
    'tag1',
    'tag2',
    'tag3',
    'tags',
])
# write the header row
csv_writer.writeheader()
Python's built-in open function creates a file named data1.csv. We then use the csv module's DictWriter class to create a writer object for the CSV file; its fieldnames parameter sets the header, which becomes the first line of the file. Finally, calling writeheader writes that header row to the CSV file.
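As a quick self-contained illustration of how DictWriter behaves (using an in-memory buffer instead of data1.csv, and a made-up example row), the header and a row can be written like this:

```python
import csv
import io

# in-memory stand-in for the data1.csv file handle
buf = io.StringIO()
csv_writer = csv.DictWriter(buf, fieldnames=['学校', 'tag1', 'tag2', 'tag3', 'tags'])
# first line of the file: the header
csv_writer.writeheader()
# a made-up example row; real rows come from the scraping loop
csv_writer.writerow({'学校': '示例大学', 'tag1': '公办',
                     'tag2': '本科', 'tag3': '综合类', 'tags': '985 211'})
print(buf.getvalue())
```

The keys of each dict passed to writerow must match fieldnames, which is why the header list above mirrors the fields extracted in the loop.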
With this, we can scrape the computer-major university information from the gaokao website and save it locally as a CSV file for later analysis and processing.