foreword
Recently I have been using Python to crawl second-hand housing listing data. Here is the code for anyone who needs it, along with a few tips.
First of all, before crawling you should disguise yourself as a browser as much as possible so the site does not recognize you as a crawler. The bare minimum is adding a request header, but plenty of people scrape this kind of plain-text data, so we should also consider rotating proxy IPs and randomly switching request headers when crawling the house-price data.
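The header and proxy rotation mentioned above can be sketched as follows. The user-agent strings are just illustrative values, and the proxy addresses are placeholders; real proxies would come from a paid or self-hosted pool.

```python
import random

# A small pool of desktop user-agents (illustrative values, not exhaustive).
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36',
]

# Placeholder proxy pool; substitute real proxy endpoints here.
PROXIES = [
    {'http': 'http://127.0.0.1:8001'},
    {'http': 'http://127.0.0.2:8002'},
]

def random_headers():
    """Pick a fresh user-agent for each request."""
    return {'user-agent': random.choice(USER_AGENTS)}

def random_proxy():
    """Pick a proxy at random; requests accepts this dict via its proxies= argument."""
    return random.choice(PROXIES)
```

Each call to `requests.get` can then pass `headers=random_headers(), proxies=random_proxy()` so consecutive requests no longer look identical.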
Before writing any crawler code, the first and most important step is always to analyze the target web page.
In this example, we need to obtain the link to each specific listing on each page, enter that second-level page to grab the details, then return to the listing page and repeat the process.
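The two-level process just described can be sketched as below. The `/sale/p{page}/` URL pattern is an assumption for illustration; the real listing and detail URLs must be read off the site in the developer tools.

```python
# Hypothetical URL pattern for listing pages; verify against the actual site.
LIST_URL = 'https://cs.anjuke.com/sale/p{page}/'

def list_page_url(page):
    """Build the URL of one listing page."""
    return LIST_URL.format(page=page)

def crawl(pages=3):
    """Outer loop over listing pages; inner loop (sketched in comments)
    over each listing's detail link on that page."""
    urls = []
    for page in range(1, pages + 1):
        url = list_page_url(page)
        urls.append(url)
        # response = requests.get(url, headers=headers)
        # for link in parsel.Selector(response.text).css('a::attr(href)').getall():
        #     detail = requests.get(link, headers=headers)  # second-level page
        #     ... parse detail.text, then continue with the next listing ...
    return urls
```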
During testing we found that crawling is fairly slow, so we can also speed the crawler up by disabling images, JavaScript, and so on in the Chrome browser.
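Disabling images and JavaScript is done through Chromium preference switches, assuming Selenium is used to drive the browser. The snippet below only builds the preference dictionary (value `2` means "block" in Chromium's content-settings scheme); the commented lines show how it would be attached to a Selenium-driven Chrome.

```python
# Chromium preference switches that block images and JavaScript.
chrome_prefs = {
    'profile.managed_default_content_settings.images': 2,
    'profile.managed_default_content_settings.javascript': 2,
}

# With Selenium installed, the prefs would be attached like this:
# from selenium import webdriver
# options = webdriver.ChromeOptions()
# options.add_experimental_option('prefs', chrome_prefs)
# driver = webdriver.Chrome(options=options)
```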
development tools
Python version: 3.8
Related modules:
requests module
parsel module
Environment build
Install Python, add it to the PATH environment variable, and use pip to install the required modules: `pip install requests parsel`.
Idea analysis
The crawled page is shown in the following figure:
Extract page data
Open the page we want to crawl in the browser and press F12 to open the developer tools, then locate where the data we want lives. We only need the listing-page data.
Code
import requests
import parsel
import csv

# Disguise the request as a normal browser visit
headers = {
'cookie': 'aQQ_ajkguid=B7A0A0B5-30EC-7A66-7500-D8055BFFE0FA; ctid=27; id58=CpQCJ2Lbhlm+lyRwdY5QAg==; _ga=GA1.2.2086942850.1658553946; wmda_new_uuid=1; wmda_uuid=009620ee2a2138d3bd861c92362a5d28; wmda_visited_projects=%3B6289197098934; 58tj_uuid=8fd994c2-35cc-405f-b671-2c1e51aa100c; als=0; ajk-appVersion=; sessid=8D76CC93-E1C8-4792-9703-F864FF755D63; xxzl_cid=2e5a66fa054e4134a15bc3f5b47ba3ab; xzuid=e60596c8-8985-4ab3-a5df-90a202b196a3; fzq_h=4c8d83ace17a19ee94e55d91124e7439_1666957662955_85c23dcb9b084efdbc4ac519c0276b68_2936029006; fzq_js_anjuke_ershoufang_pc=75684287c0be96cac08d04f4d6cc6d09_1666957664522_25; twe=2; xxzl_cid=2e5a66fa054e4134a15bc3f5b47ba3ab; xxzl_deviceid=OOpJsA5XrQMdJFfv71dg+l+he0O1OKPQgRAQcFPbeRAyhjZ4/7gS3Gj4DfiLjxfc; isp=true; obtain_by=2; new_session=1; init_refer=https%253A%252F%252Fcs.anjuke.com%252F; new_uv=3',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
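The parsing code below wraps every `.get()` call in an `is_null` helper that this excerpt never defines. Judging from how it is used, it most likely just turns a missing match (`None`) into an empty string; a minimal sketch under that assumption, which also strips surrounding whitespace:

```python
def is_null(value):
    """Return '' when a CSS selector matched nothing, else the stripped text."""
    return value.strip() if value else ''
```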
1. Send the request
response = requests.get(url=url, headers=headers)
2. Get the data
html_data = response.text
3. Parse the data
select = parsel.Selector(html_data)
divs = select.css('.property-content')
for div in divs:
    # .property-content-title-name — listing title
    标题 = is_null(div.css('.property-content-title-name::text').get())
    # .property-content-info:nth-child(1) .property-content-info-text:nth-child(1) span — layout (rooms)
    户型s = div.css('.property-content-info:nth-child(1) .property-content-info-text:nth-child(1) span::text').getall()
    户型 = ' '.join(户型s)
    # .property-content-info:nth-child(1) .property-content-info-text:nth-child(2) — floor area
    面积 = is_null(div.css('.property-content-info:nth-child(1) .property-content-info-text:nth-child(2)::text').get())
    # .property-content-info:nth-child(1) .property-content-info-text:nth-child(3) — orientation
    朝向 = is_null(div.css('.property-content-info:nth-child(1) .property-content-info-text:nth-child(3)::text').get())
    # .property-content-info:nth-child(1) .property-content-info-text:nth-child(4) — floor
    楼层 = is_null(div.css('.property-content-info:nth-child(1) .property-content-info-text:nth-child(4)::text').get())
    # .property-content-info:nth-child(1) .property-content-info-text:nth-child(5) — year built
    年份 = is_null(div.css('.property-content-info:nth-child(1) .property-content-info-text:nth-child(5)::text').get())
    # .property-content-info:nth-child(2) .property-content-info-comm-name — community name
    小区名称 = is_null(div.css('.property-content-info:nth-child(2) .property-content-info-comm-name::text').get())
    # .property-content-info:nth-child(2) .property-content-info-comm-address — community address
    小区地址 = is_null(div.css('.property-content-info:nth-child(2) .property-content-info-comm-address::text').get())
    # .property-content-info:nth-child(3) span — community tags
    小区标签s = div.css('.property-content-info:nth-child(3) span::text').getall()
    小区标签 = ' '.join(小区标签s)
    # .property-price .property-price-total .property-price-total-num — total price
    总价 = is_null(div.css('.property-price .property-price-total .property-price-total-num::text').get())
    # .property-price .property-price-average — price per square meter
    单价 = is_null(div.css('.property-price .property-price-average::text').get())
    print(标题, 户型, 面积, 朝向, 楼层, 年份, 小区名称, 小区地址, 小区标签, 总价, 单价)
4. Save the data
# Run this inside the loop above so every listing appends one row;
# mode='a' keeps earlier rows when the script is re-run.
with open('安居客.csv', mode='a', encoding='utf-8', newline='') as f:
    csv_writer = csv.writer(f)
    csv_writer.writerow([标题, 户型, 面积, 朝向, 楼层, 年份, 小区名称, 小区地址, 小区标签, 总价, 单价])
Result display
PS: pictures are for reference only.
Finally
That's all for today's sharing; interested friends can give it a try themselves.
If you have any questions about this article, or other questions about Python, you can leave a message in the comments or send me a private message.
If you think the article is good, you can follow me or give it a thumbs up (/≧▽≦)/