Using Python and Selenium to crawl Ctrip flight ticket information

1. Description of the problem

1. The selenium library is a rather handy third-party library for crawling: because it drives a real browser, it can get past interactions such as JS and AJAX rendering, and it is easy to get started with.

2. The basic code is adapted from other bloggers, but the Ctrip website keeps changing: apart from a few stable details such as element IDs, everything else has changed. After careful comparison, the code below was improved and rewritten, and released on October 19, 2021.

3. If an error occurs, try adjusting the parameters of the time.sleep() calls in the code below; a more robust option is an explicit wait (see the first sketch after this list).

4. To crawl the data you want, you only need to change the departure city, the arrival city, and the departure date (see the second sketch after this list). Also remember to set up the browser driver: I use Microsoft Edge and downloaded its driver from the corresponding Microsoft page; after downloading, rename the driver executable and update the driver_path parameter.

5. This is only a basic version for now; an updated version may come later, adding more tailored crawling options such as direct flights, transfers, and stopovers.

6. The code is for learning and reference only; please do not use it for commercial purposes!
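
As an alternative to hand-tuning time.sleep(), Selenium's explicit waits poll for an element until it is ready. This is a minimal sketch, assuming the standard WebDriverWait / expected_conditions API and the FD_StartDate ID used in the script in section 2; wait_for_clickable_id is just an illustrative helper name:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def wait_for_clickable_id(driver, element_id, timeout=10):
    # Poll until the element is clickable, or raise TimeoutException after `timeout` seconds
    return WebDriverWait(driver, timeout).until(
        EC.element_to_be_clickable((By.ID, element_id))
    )

# For example, instead of time.sleep(1) before locating the date field:
# input_tag_time = wait_for_clickable_id(driver, 'FD_StartDate')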
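
For the route and date, one way to keep the changes in a single place is to pull them out as parameters. This is only a rough sketch; the element IDs are the same ones used in the full script in section 2, and fill_search_form is an illustrative helper name:

def fill_search_form(driver, depart_city, arrive_city, depart_date):
    # Fill in the three search fields on the flight page
    driver.find_element_by_id('FD_StartDate').send_keys(depart_date)   # e.g. '2021-10-20'
    driver.find_element_by_id('FD_StartCity').send_keys(depart_city)   # e.g. '厦门'
    driver.find_element_by_id('FD_DestCity').send_keys(arrive_city)    # e.g. '兰州'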

2. The code

# -*- coding:utf-8 -*-
# Crawl Ctrip flight tickets with Selenium
# Author: KingStar
import time
from selenium import webdriver
from bs4 import BeautifulSoup

def page_select_function(driver_path):
    driver = webdriver.Edge(executable_path=driver_path)
    driver.get('https://www.ctrip.com/')
    time.sleep(1)
    # Maximize the browser window
    driver.maximize_window()
    # From the home page, switch to the flight ticket tab
    input_tag_slect = driver.find_element_by_class_name('s_tab_nocurrent')
    input_tag_slect.click()
    time.sleep(1)
    # Enter the departure date
    input_tag_time = driver.find_element_by_id('FD_StartDate')
    input_tag_time.send_keys('2021-10-20')
    time.sleep(1)
    # Enter the departure city
    input_tag_depart_city = driver.find_element_by_id('FD_StartCity')
    input_tag_depart_city.send_keys('厦门')
    time.sleep(1)
    # Enter the arrival city
    input_tag_arrive_city = driver.find_element_by_id('FD_DestCity')
    input_tag_arrive_city.send_keys('兰州')
    time.sleep(1)
    # Start the search (the click is done via JavaScript; a plain .click() is left commented out below)
    input_tag_search = driver.find_element_by_id('FD_StartSearch')
    # driver.find_element(By.CSS_SELECTOR, "#submit").send_keys(Keys.ENTER)
    driver.execute_script("arguments[0].click();", input_tag_search)
    # input_tag_search.click()
    time.sleep(3)
    # Close the emergency notice pop-up
    tag_close = driver.find_element_by_class_name('close-icon')
    tag_close.click()
    time.sleep(1)
    # Filter: select stopover flights only
    tag_select = driver.find_element_by_class_name('auto_cursor')
    tag_select.click()
    time.sleep(2)
    driver.find_element_by_xpath('//*[@id="domestic_filter_group_trans_and_train__trans_count"]/li[2]/span').click()
    driver.find_element_by_class_name('flight-part').click()
    time.sleep(1)
    '''
    # Filter: select direct flights only
    direct_flight_tag_click = driver.find_element_by_class_name('form-label')
    direct_flight_tag_click.click()
    time.sleep(2)
    '''
    # Scroll to the bottom so lazily loaded results are rendered
    # (increase the range if more scrolling is needed)
    for i in range(1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(1)
    return driver

def data_acquistion(driver):
    # Parse the fully rendered page with BeautifulSoup
    source = driver.page_source
    bs = BeautifulSoup(source, 'html.parser')
    divs = bs.find_all('div', class_='flight-item domestic')
    return divs

def data_treating(divs):
    for div in divs:
        try:
            airlineName = div.find('div', class_='airline-name').get_text()
            flightNumber = div.find_all('span', class_='plane-No')[0].get_text()
            # craftTypeName = div.find('span', class_='direction_black_border low_text').string
            departureTime = div.find('div', class_='depart-box').find('div', class_='time').string
            arrivalTime = div.find('div', class_='arrive-box').find('div', class_='time').get_text()
            lowestPrice = div.find('span', class_='price').get_text()
            print(airlineName, '\t', flightNumber, '\t', departureTime, '\t', arrivalTime, '\t', lowestPrice)
            time.sleep(1)
        except AttributeError:
            # Some flight cards are missing one of these fields
            print('Missing information for this flight')

def main():
    # Path to the Edge WebDriver executable (renamed after download)
    driver_path = r'C:\MicrosoftWebDriver.exe'
    driver = page_select_function(driver_path)
    divs = data_acquistion(driver)
    data_treating(divs)
    # Quit the browser and end the driver session
    driver.quit()

if __name__ == '__main__':
    main()
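
Note that the script above targets Selenium 3. In newer Selenium 4 releases, the find_element_by_* helpers and the executable_path argument have been removed, so the same steps would look roughly like the sketch below (the driver path is the same placeholder as in main() above):

from selenium import webdriver
from selenium.webdriver.edge.service import Service
from selenium.webdriver.common.by import By

# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Edge(service=Service(r'C:\MicrosoftWebDriver.exe'))
driver.get('https://www.ctrip.com/')
# Locators go through find_element(By..., ...) instead of find_element_by_*
input_tag_time = driver.find_element(By.ID, 'FD_StartDate')
driver.quit()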




Origin blog.csdn.net/JaysonWong/article/details/120830168