Python Case Implementation | Crawling Rental Website Information

01. Case Implementation


The following is a comprehensive hands-on project that crawls rental information from the Beijing site of Lianjia.com, including the district name (district), street name (street), community name (community), floor information (floor), whether there is an elevator (lift), floor area (area), orientation (toward), house type (model), and rent (rent).

The program flow for crawling the data is shown in Figure 1.


■ Figure 1 Flowchart for crawling Beijing Lianjia.com rental data

(1) Import libraries. The code is shown below. First, import the libraries required during the crawling process. Only a brief description of each library is given here; they will be explained in detail later.

import csv
import random
import time
import requests
import pandas as pd
from lxml import etree

Among them, the requests library is used to request the specified page and obtain the response; the etree module of lxml performs XPath parsing on the returned page to extract the target data; the random and time libraries are used for various settings during the crawling process; and the pandas and csv libraries are used for processing and saving files.

(2) Enter the city abbreviation for Lianjia.com and the page-number range of the listing pages to be crawled. When the rental section is opened, the first thing displayed is a listing page. By observing the URLs of Lianjia.com rental pages in different cities, it can be seen that a listing-page URL consists mainly of three parts: the pinyin abbreviation of the city, the page number, and a fixed remainder. For example, the URLs of three listing pages are given below.

# Beijing
https://bj.lianjia.com/zufang/pg4/#contentList 
# Chongqing
https://cq.lianjia.com/zufang/pg5/#contentList 
# Shanghai
https://sh.lianjia.com/zufang/pg6/#contentList

Among them, "bj" in the first URL is the pinyin abbreviation of the city of Beijing, and "pg4" indicates the fourth page. By changing the city abbreviation in the listing-page URL, rental data can be crawled not only for Beijing but also for other cities.

In this case, the pinyin abbreviation of the city (such as bj) and the page number (such as pg4) are used to construct the URL of the listing page; the trailing "#contentList" is only an in-page anchor and can be ignored.
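A minimal sketch of the construction, assuming only the two variable parts described above (the variable names are chosen for this illustration):

city = "bj"   # pinyin abbreviation of the city
page = 4      # page number
# Splice the fixed parts of the URL around the two variables
url = "https://" + city + ".lianjia.com/zufang/pg" + str(page)
print(url)    # https://bj.lianjia.com/zufang/pg4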

(3) Crawl and parse the listing page. Since the information to be crawled in this case is distributed across the listing page and the detail page of each listing, the URL of each detail page must be obtained first. In addition, the listing page also contains the geographic location (district, street, and community) of each listing. The following describes how to parse the listing page to obtain the detail-page URL and the geographic location of each listing.

Use the Firefox browser to open the rental page of Beijing Lianjia.com. Click the small-arrow (element picker) button at the far left of the developer tools, then point at one of the listings; in the "Inspector" sub-window, right-click the HTML code highlighted in blue and select "Copy" → "Outer HTML" to obtain the following HTML content (to keep the analysis simple, only part of the code is kept).

<div class="content list">
<div class="content list--item">
<div class="content list--item--main"><p class="content list--item--title"><a class="twoline" target=" blank"href="/zufang/BJ2840486736310837248.html">整租·长阳国际城二区 3 室 1厅南/北< /a>
</p><p class="content list--item--des"><a target="blank"href="/zufang/fangshan/">房山</a>-<ahref="/zufang/changyang1/"target=" blank">长阳</a>-<atitle="长阳国际城二区"href="/zufang/c1111053458322/"target=" blank">长阳国际城二区< /a>
<i>/</i>
89.00m2
<i> /< /i>南北<i> /< /i>
3室1厅1卫<span class="hide"><i>/</i>
中楼层(20 层)</span></p>
< /div>
</div>
</div>

 

Through observation, it can be found that the URL of the listing detail page is contained in the href attribute of an "a" tag, and the geographic location of the listing (district name, street name, and community name) is contained in the three "a" tags inside the "p" tag whose class attribute is "content__list--item--des".
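As a quick, self-contained sketch of the kind of XPath calls used in the steps below, run against a simplified fragment of the snippet above:

from lxml import etree

# Simplified fragment of the listing page shown above
html = '''
<p class="content__list--item--des">
<a href="/zufang/fangshan/">房山</a>-<a href="/zufang/changyang1/">长阳</a>-<a href="/zufang/c1111053458322/">长阳国际城二区</a>
</p>
'''
tree = etree.HTML(html)
# The texts of the three "a" tags are the district, street, and community
print(tree.xpath('//p[@class="content__list--item--des"]/a/text()'))
# ['房山', '长阳', '长阳国际城二区']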

The specific process of obtaining the URL link and geographic location of the listing is as follows.

① First, obtain the URL of each listing through XPath; the path is //a[@class="content__list--item--aside"]/@href. This path selects the href attribute values of all "a" tags on the current page whose class attribute is "content__list--item--aside"; the result is the list detailsUrl, containing the detail-page URLs of all listings on the current page.

② Then, obtain the geographic location of each listing (district name, street name, and community name) through XPath; the path is //p[@class="content__list--item--des"]/a/text(). This path selects the text content of the "a" tags under the "p" tags whose class attribute is "content__list--item--des"; the result is the list location, containing the locations of all listings on the current page. Note that there are three "a" tags under each such "p" tag, corresponding to the district name, street name, and community name of one listing.

③ Finally, traverse the lists detailsUrl and location, and store each detail-page URL together with the corresponding district name, street name, and community name in a dictionary.

The code for this part is as follows.

def getPageLines(city, page):
    """
    Crawl the listing page of the given city and page number, and store each
    listing's detail-page URL and location (district, street, community) in a
    house dictionary.
    :param city: pinyin abbreviation of the city
    :param page: page number to crawl
    :return: list of dictionaries
    """
    # Construct the URL of the listing page
    URL = "https://" + city + ".lianjia.com/zufang/pg" + str(page)
    # Construct the common part of the detail-page URLs
    baseUrl = URL.split("/")[0] + "//" + URL.split("/")[2]
    # Crawl the listing page and handle the response
    response = requests.get(url=URL)
    # Get the page HTML and parse it
    html = response.text
    myelement = etree.HTML(html)
    # Extract the detail-page URLs and locations of all listings on this page
    detailsUrl = myelement.xpath('//a[@class="content__list--item--aside"]/@href')
    location = myelement.xpath('//p[@class="content__list--item--des"]/a/text()')
    # Store the data in a list of dictionaries
    houses = list()
    for i in range(len(detailsUrl)):
        # Build the full URL of the detail page
        detailsLink = baseUrl + detailsUrl[i]
        # Every three consecutive "a" texts belong to one listing
        lineIndex = i * 3
        district = location[lineIndex]
        street = location[lineIndex + 1]
        community = location[lineIndex + 2]
        # Store the detail-page URL and the location in a dictionary
        house = {}
        house["detailsLink"] = detailsLink
        house["district"] = district
        house["street"] = street
        house["community"] = community
        houses.append(house)
    return houses
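A usage sketch, assuming the site is reachable and returns the expected page structure (the real site may throttle or block automated clients):

houses = getPageLines(city="bj", page=1)
for house in houses[:3]:
    # Each dictionary holds the location fields and the detail-page URL
    print(house["district"], house["street"], house["community"], house["detailsLink"])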

(4) Crawl and parse the listing detail page. From the detail page, information such as the floor, elevator, area, orientation, house type, and rent of the listing can be obtained. Open a detail page in the browser and use the Firefox "Inspect" function to view its HTML source code; part of it is shown below.

<div class="content  aside--title">
<span>5500</span>元/月
(季付价)
<div class="operate-box">······</div></div>
<ul class="content aside list">
<li><span class="label">租赁方式:</span>整租</li><li><span class="label">房屋类型:</span>3室1厅1卫 89.00m’精装修</li>
<li class="floor"><span class="label">朝向楼层:</span><span class="">南/北中楼层/20 层< /span></li>
<li>
< span class="label">风险提示:</span></li>
</ul>
<div class="contentarticle  info" id="info"><h3 id="info">房屋信息< /h3>
<ul>
<li class="fl oneline">基本信息</li>
<li class="fl oneline">面积: 89.00m2< /li>
<li class="fl oneline">朝向: 南北< /li>
<li class="fl oneline">&nbsp;</li>
<li class="fl oneline">维护:7天前< /li>
<li class="fl oneline">入住:随时人住< /li>
<li class="fl oneline">&nbsp;</li>
<l class="fl oneline">楼层:中楼层/20 层</li>
<li class="fl oneline">电梯: 有</li>
...
</ul>

By analyzing this HTML source code, we can see that the rent is located in the "span" tag under the "div" tag whose class attribute is "content__aside--title", and the house type is located in the second "li" tag under the "ul" tag whose class attribute is "content__aside__list". The floor, elevator, area, and orientation information are all located in the "li" tags under the "ul" tag inside the "div" whose class attribute is "content__article__info". The content of these "li" tags is obtained through XPath, the text content is cleaned into a dictionary, and the target data is finally stored in the corresponding house dictionary.
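As a minimal illustration of extracting the rent (a sketch against a simplified fragment of the detail page above):

from lxml import etree

# Simplified fragment of the detail page shown above
html = '<div class="content__aside--title"><span>5500</span>元/月</div>'
tree = etree.HTML(html)
# The rent is the text of the "span" tag
print(tree.xpath('//div[@class="content__aside--title"]/span/text()')[0])  # 5500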

The specific process of obtaining the floor, elevator, area, orientation, house type, and rent of a listing is as follows.

① Read the detail-page URL from the house dictionary obtained in the previous step, send a request to obtain the HTML source code of the detail page, and parse it.

② Obtain the rent and the house type (model) through XPath, and store the results in the house dictionary.

③ Obtain the floor, elevator, area, and orientation through XPath. Because the parsed data contains '\xa0' whitespace placeholders, the dataCleaning() function is used to clean it; this function also converts the data from a list into a dictionary. Finally, the floor, elevator, area, and orientation values are stored in the house dictionary.

At this point, all the required data is stored in the house dictionary. The code for this part is as follows.

def getDetail(house):
    """
    Crawl and parse the detail page of one listing and store the data in the
    dictionary.
    :param house: dictionary holding the detail-page URL and the location
    :return: dictionary holding the target data, and the response status code
    """
    # Read the URL of the detail page
    url = house["detailsLink"]
    try:
        response = requests.get(url=url, timeout=10)
    except:
        return None, None
    # Get the detail-page HTML and parse it
    myelement = etree.HTML(response.text)
    # Get the rent and the house type
    house["rent"] = myelement.xpath('//div[@class="content__aside--title"]/span/text()')[0]
    house["model"] = myelement.xpath('//ul[@class="content__aside__list"]/li[2]/text()')[0].split(" ")[0]
    # Get the other information: floor, lift, area, orientation
    # (the dictionary keys are the Chinese labels as they appear on the page)
    details = myelement.xpath('//div[@class="content__article__info"]/ul[1]/li/text()')
    details = dataCleaning(details)
    house["floor"] = details["楼层"]
    house["lift"] = details["电梯"]
    house["area"] = details["面积"]
    house["toward"] = details["朝向"]
    return house, response.status_code
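A usage sketch, reusing the detail-page path from the sample listing-page HTML earlier (that particular listing may no longer exist online):

house = {"detailsLink": "https://bj.lianjia.com/zufang/BJ2840486736310837248.html",
         "district": "房山", "street": "长阳", "community": "长阳国际城二区"}
data, statusCode = getDetail(house)
if data is not None:
    print(statusCode, data["rent"], data["model"], data["floor"], data["lift"])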

In the getDetail() code above, the dataCleaning() function is used to clean one piece of house information, yielding fields such as the floor, elevator, area, and orientation. The cleaning consists of two parts: deleting the '\xa0' placeholders in the data, and converting the data from a list into a dictionary for convenient storage. For example, one piece of house information originally arrives as a list: ['基本信息', '面积: 89.00㎡', '朝向: 南', '\xa0', '维护: 5天前', '入住: 随时入住', '\xa0', '楼层: 中楼层/28层', '电梯: 有', '\xa0', '车位: 暂无数据', '用水: 暂无数据', '\xa0', '用电: 暂无数据', '燃气: 有', '\xa0', '采暖: 集中供暖']. After dataCleaning(), it becomes a dictionary keyed by the Chinese labels on the page (面积 area, 朝向 orientation, 维护 maintenance, 入住 check-in, 楼层 floor, 电梯 elevator, 车位 parking, 用水 water, 用电 electricity, 燃气 gas, 采暖 heating): {'面积': '89.00㎡', '朝向': '南', '维护': '5天前', '入住': '随时入住', '楼层': '中楼层/28层', '电梯': '有', '车位': '暂无数据', '用水': '暂无数据', '用电': '暂无数据', '燃气': '有', '采暖': '集中供暖'}. The specific implementation of the dataCleaning() function is as follows.

def dataCleaning(details):
    """
    Clean one piece of house information.
    :param details: list, one piece of house information
    :return: dict, the cleaned house information
    """
    # Drop the leading "基本信息" header item
    details = details[1:]
    new_details = list()
    for detail in details:
        # Skip the '\xa0' (non-breaking space) placeholders
        if detail == '\xa0':
            continue
        # Split the "key: value" text into a [key, value] pair and strip spaces
        detail = [part.strip() for part in str(detail).split(':')]
        new_details.append(detail)
    return dict(new_details)
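For example, with a shortened version of the list above (assuming the colon-separated "key: value" format shown in the HTML):

details = ['基本信息', '面积: 89.00㎡', '朝向: 南', '\xa0', '楼层: 中楼层/28层', '电梯: 有']
print(dataCleaning(details))
# {'面积': '89.00㎡', '朝向': '南', '楼层': '中楼层/28层', '电梯': '有'}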

When crawling web pages, you will inevitably encounter the '\xa0' string. '\xa0' is the non-breaking space (nbsp), one of the extended characters of the latin1 (ISO/IEC 8859-1) character set, which is backward compatible with ASCII; it renders as a blank just like an ordinary space.
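A small demonstration of how Python handles this character (the sample string is made up for illustration):

# '\xa0' is the non-breaking space; Python treats it as whitespace
print("\xa0".isspace())                        # True
print("面积:\xa089.00㎡".replace("\xa0", " "))  # NBSP replaced by a normal space
print("\xa0面积\xa0".strip())                   # strip() removes NBSP as well: '面积'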

(5) Save the data. The dictionary data is saved row by row through Python's built-in csv library. The code for this part is as follows.

def save(row, fileName):
    """
    Save the data row by row.
    :param row: dict, the data of one row
    :param fileName: name of the file the data is saved to
    """
    # fieldnames is defined at module level in the main program
    with open(fileName, "a+", newline='', encoding='gbk') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writerow(row)
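A usage sketch with the sample listing from earlier; note that save() relies on fieldnames being defined at module level, as it is in the main program below:

fieldnames = ['floor', 'lift', 'district', 'street', 'community', 'area', 'toward', 'model', 'rent']
row = {'floor': '中楼层/20层', 'lift': '有', 'district': '房山', 'street': '长阳',
       'community': '长阳国际城二区', 'area': '89.00㎡', 'toward': '南/北',
       'model': '3室1厅1卫', 'rent': '5500'}
save(row, "bj_lianJia.csv")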

(6) Main program. In the main program, the city abbreviation for Lianjia.com and the range of pages to crawl are entered from the keyboard, and a new CSV file is created to save the crawled data. Then the functions from the steps above are called to crawl the data.

After all the data has been crawled, duplicate rows are removed with the third-party pandas library, and an "id" column is added to the data as an index column. At this point, the crawling task is complete.

The code for this part is as follows.

if __name__ == '__main__':
    # Read the city abbreviation and the page range from the keyboard
    city = input("Enter the pinyin abbreviation of the city to crawl (e.g., Beijing: bj, Shanghai: sh): ").strip().lower()
    pageRange = input("Enter the range of pages to crawl (e.g., pages 1 to 100: 1-100): ").strip()
    startPage = int(pageRange.split("-")[0])
    endPage = int(pageRange.split("-")[1]) + 1
    # Save the crawled data to a CSV file
    fileName = city + "_lianJia.csv"
    fieldnames = ['floor', 'lift', 'district', 'street', 'community', 'area', 'toward', 'model', 'rent']
    # Write the header row to the CSV file
    with open(fileName, "w", newline='', encoding='gbk') as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
    startTime = time.time()
    for page in range(startPage, endPage):
        print("\n--->> Crawling page " + str(page))
        houses = getPageLines(city=city, page=page)
        for house in houses:
            data, statusCode = getDetail(house=house)
            if data is None:
                continue
            print('Response status:', statusCode)
            # The CSV file does not need the URL, so delete the detailsLink pair
            del data["detailsLink"]
            save(data, fileName)
            time.sleep(3)
    endTime = time.time()
    print('Elapsed:', round(endTime - startTime, 2), 's')
    # Post-process the table with pandas
    df = pd.read_csv(fileName, encoding='gbk')
    # Remove duplicate rows
    df = df.drop_duplicates()
    # Build an index column: rent0001, rent0002, ...
    index = list()
    for i in range(1, len(df) + 1):
        index.append("rent" + str(i).zfill(4))
    # Insert the id column
    df.insert(0, "id", index)
    df.to_csv(fileName, index=False, encoding='gbk')

 

 

 
