Beginner crawler (3): Using Selenium to simulate a browser and scrape dynamic web pages (2) - Selenium in practice: Shenzhen short-term rental data

The purpose of this project is to obtain the first five pages of short-term rental housing data on Airbnb in Shenzhen.
Target URL: Airbnb Shenzhen short-term rentals, https://zh.airbnb.com/s/Shenzhen--China/homes

1. Analyze the website's HTML tags

1. Right-click anywhere on the webpage > Inspect


2. Find the HTML code corresponding to the listing container (all data for one house)

The CSS selector for the listing container (all data for one house) is: div._gig1e7
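
As a quick sanity check of this selector, here is a minimal sketch that just counts how many listing containers the page yields (it assumes geckodriver is set up as in the full code later in this article, and note that Airbnb's hashed class names like _gig1e7 change over time, so the selector may need updating):

from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("https://zh.airbnb.com/s/Shenzhen--China/homes")
time.sleep(5)  # wait for the dynamically loaded listings to render

# Each match should be one listing card; a search page shows about 20
rent_list = driver.find_elements_by_css_selector('div._gig1e7')
print(len(rent_list))
driver.quit()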

3. Find the HTML code corresponding to "House Price"

The CSS selector for the house price is: span.krjbj

4. Find the HTML code corresponding to "House score, house reviews"

The CSS selector for the house rating and number of reviews is: span._1clmxfj

5. Find the HTML code corresponding to "House Name"

The CSS selector for the house name is: div._qrfr9x5

6. Find the HTML code corresponding to "house type, number of rooms"

The CSS selector for the house type and number of rooms is: span._faldii7
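
Putting these field selectors together, here is a minimal sketch that reads one listing card (under the same caveat that the hashed class names may have changed since this was written):

# Assumes `driver` is already on the search results page, as in the sketch above
first_house = driver.find_elements_by_css_selector('div._gig1e7')[0]
name = first_house.find_element_by_css_selector('div._qrfr9x5').text
price = first_house.find_element_by_css_selector('span.krjbj').text
details = first_house.find_element_by_css_selector('span._faldii7').text
print(name, price, details)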

2. Write the crawler code

1. Use Selenium to get the data from the first page of Airbnb

from selenium import webdriver
import time

driver = webdriver.Firefox()
# geckodriver is already on my PATH; if it is not on yours, pass the path to
# geckodriver.exe in the parentheses above.
# Open the Airbnb search page in the automated browser
driver.get("https://zh.airbnb.com/s/Shenzhen--China/homes")
time.sleep(5)  # wait for the dynamically loaded listings to render

# Find all rental listings on the page
rent_list = driver.find_elements_by_css_selector('div._gig1e7')

# Loop over every listing and extract its information
for eachhouse in rent_list:
    # Find the number of reviews; some listings have none yet, so guard with try/except
    try:
        comment = eachhouse.find_element_by_css_selector('span._1clmxfj')
        comment = comment.text
    except:
        comment = 0

    # Find the price and strip the "每晚" (per night) and "价格" (price) labels
    price = eachhouse.find_element_by_css_selector('span.krjbj')
    price = price.text.replace("每晚", "").replace("价格", "").replace("\n", "")

    # Find the listing name
    name = eachhouse.find_element_by_css_selector('div._qrfr9x5')
    name = name.text

    # Find the house type and number of rooms
    details = eachhouse.find_element_by_css_selector('span._faldii7')
    details = details.text
    house_type = details.split(" · ")[0]
    bed_number = details.split(" · ")[1]
    print(comment, price, name, house_type, bed_number)

The output prints one line per listing: review count, price, name, house type, and number of rooms.

2. Get the data of the first 5 pages

Clicking to the second page, the URL becomes:

https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset=20

Clicking to the third page, the URL becomes:

https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset=40

Clicking to the fourth page, the URL becomes:

https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset=60

Only items_offset changes, from 20 to 40 and then to 60, so we can turn the pages simply by stepping the value of items_offset in a for loop.
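
Note that the first page corresponds to items_offset=0, which is why the loop below starts at i = 0. A quick sketch of the pattern:

base = "https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset="
page_links = [base + str(i * 20) for i in range(5)]  # offsets 0, 20, 40, 60, 80
print(page_links)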

from selenium import webdriver
import time

driver = webdriver.Firefox()
# geckodriver is already on my PATH; if it is not on yours, pass the path to
# geckodriver.exe in the parentheses above.
for i in range(0, 5):
    link = "https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset=" + str(i * 20)
    driver.get(link)
    time.sleep(5)  # wait for the dynamically loaded listings to render
    rent_list = driver.find_elements_by_css_selector('div._gig1e7')

    # Loop over every listing on this page and extract its information
    for eachhouse in rent_list:
        # Find the number of reviews; some listings have none yet, so guard with try/except
        try:
            comment = eachhouse.find_element_by_css_selector('span._1clmxfj')
            comment = comment.text
        except:
            comment = 0

        # Find the price and strip the "每晚" (per night) and "价格" (price) labels
        price = eachhouse.find_element_by_css_selector('span.krjbj')
        price = price.text.replace("每晚", "").replace("价格", "").replace("\n", "")

        # Find the listing name
        name = eachhouse.find_element_by_css_selector('div._qrfr9x5')
        name = name.text

        # Find the house type and number of rooms
        details = eachhouse.find_element_by_css_selector('span._faldii7')
        details = details.text
        house_type = details.split(" · ")[0]
        bed_number = details.split(" · ")[1]
        print(comment, price, name, house_type, bed_number)

The output has the same format as before, but now covers the listings from all five pages.
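
A fixed time.sleep is the simplest way to wait for the dynamically loaded listings, but as a sketch of a more robust alternative, Selenium's explicit wait blocks until the listing containers actually appear (again assuming the div._gig1e7 class is still current):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset=20")

# Block for up to 10 seconds until at least one listing container is present
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'div._gig1e7'))
)
print(len(driver.find_elements_by_css_selector('div._gig1e7')))
driver.quit()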

Source: blog.csdn.net/qq_45154565/article/details/109763657