The goal of this project is to scrape the first five pages of short-term rental listings for Shenzhen on Airbnb.
Destination URL: Airbnb Shenzhen short-term rental data
1. Analyze the website's HTML tags
1. Right-click anywhere on the webpage > Inspect
2. Find the HTML element corresponding to "all listing data"
The selector that matches the container of each listing is: div._gig1e7
3. Find the HTML element corresponding to "listing price"
The selector for the price is: span.krjbj
4. Find the HTML element corresponding to "listing rating and review count"
The selector for the rating and review count is: span._1clmxfj
5. Find the HTML element corresponding to "listing name"
The selector for the listing name is: div._qrfr9x5
6. Find the HTML element corresponding to "house type and number of beds"
The selector for the house type and number of beds is: span._faldii7
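The five selectors above can be gathered into one lookup table before writing the crawler. This is only a sketch: the obfuscated class names (such as "_gig1e7") change whenever Airbnb redeploys its front end, so verify each one in the inspector before relying on it.

```python
# CSS selectors identified in the inspection steps above.
# The class names are obfuscated build artifacts and will go stale.
SELECTORS = {
    "listing": "div._gig1e7",    # container for one rental listing
    "price":   "span.krjbj",     # nightly price
    "reviews": "span._1clmxfj",  # rating and review count
    "name":    "div._qrfr9x5",   # listing title
    "details": "span._faldii7",  # house type and number of beds
}
```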
2. Write the crawling code
1. Use Selenium to get the data on the first page of Airbnb listings
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
# geckodriver is on my PATH; if it is not on yours, pass the location of
# geckodriver.exe in the parentheses above (the executable_path argument)
# Open the Airbnb search page in the virtual browser
driver.get("https://zh.airbnb.com/s/Shenzhen--China/homes")
time.sleep(5)  # wait for the dynamically loaded listings to render

# Find every rental listing on the page
rent_list = driver.find_elements_by_css_selector('div._gig1e7')

# Use a for loop to extract the fields of each listing
for eachhouse in rent_list:
    # Review count (some listings have no reviews yet)
    try:
        comment = eachhouse.find_element_by_css_selector('span._1clmxfj').text
    except NoSuchElementException:
        comment = 0
    # Price, with the label text stripped out
    price = eachhouse.find_element_by_css_selector('span.krjbj')
    price = price.text.replace("每晚", "").replace("价格", "").replace("\n", "")
    # Listing name
    name = eachhouse.find_element_by_css_selector('div._qrfr9x5').text
    # House type and number of beds, separated by " · "
    details = eachhouse.find_element_by_css_selector('span._faldii7').text
    house_type = details.split(" · ")[0]
    bed_number = details.split(" · ")[1]
    print(comment, price, name, house_type, bed_number)
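The string cleanup in the loop above can be checked without a browser. The raw strings below are made-up stand-ins shaped like the text the price and details selectors return (the actual values on the page will differ):

```python
# Hypothetical raw text, in the shape Selenium's .text returns
# for the price and details elements (values are invented).
raw_price = "价格\n每晚 ¥398"
raw_details = "整套公寓 · 2张床"

# Same cleanup as the loop above, plus a strip of stray spaces
price = raw_price.replace("每晚", "").replace("价格", "").replace("\n", "").strip()
# " · " separates the house type from the bed count
house_type, bed_number = raw_details.split(" · ")
print(price, house_type, bed_number)
```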
The output is one line per listing, showing its review count, price, name, house type, and number of beds.
2. Get the data of the first 5 pages
Clicking to the second page, the URL becomes:
Clicking to the third page, the URL becomes:
Clicking to the fourth page, the URL becomes:
Comparing these URLs shows that only items_offset changes: 20, then 40, then 60. We can therefore turn pages by stepping the value of items_offset in a for loop.
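The offset pattern can be sketched as a plain loop that builds the five search URLs before any browser is involved (assuming, per the observation above, that the offset grows by 20 per page and that an offset of 0 returns the first page):

```python
# Build the five search URLs by varying items_offset in steps of 20.
base = "https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset="
links = [base + str(i * 20) for i in range(5)]
for link in links:
    print(link)
```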
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
import time

driver = webdriver.Firefox()
# geckodriver is on my PATH; if it is not on yours, pass the location of
# geckodriver.exe in the parentheses above (the executable_path argument)
for i in range(0, 5):
    link = "https://zh.airbnb.com/s/Shenzhen--China/homes?items_offset=" + str(i * 20)
    driver.get(link)
    time.sleep(5)  # wait for this page's listings to render
    rent_list = driver.find_elements_by_css_selector('div._gig1e7')
    # Use a for loop to extract the fields of each listing
    for eachhouse in rent_list:
        # Review count (some listings have no reviews yet)
        try:
            comment = eachhouse.find_element_by_css_selector('span._1clmxfj').text
        except NoSuchElementException:
            comment = 0
        # Price, with the label text stripped out
        price = eachhouse.find_element_by_css_selector('span.krjbj')
        price = price.text.replace("每晚", "").replace("价格", "").replace("\n", "")
        # Listing name
        name = eachhouse.find_element_by_css_selector('div._qrfr9x5').text
        # House type and number of beds, separated by " · "
        details = eachhouse.find_element_by_css_selector('span._faldii7').text
        house_type = details.split(" · ")[0]
        bed_number = details.split(" · ")[1]
        print(comment, price, name, house_type, bed_number)
The output has the same shape as before, now covering all five pages of listings.