Crawling rental website data and visual analysis based on Scrapy framework



Preface

  In today's era of big data, extracting useful information from the vast amount of data on the Internet has become a real problem. Based on the Scrapy framework, this article takes the rental listings of Lianjia.com in Hangzhou as its object and crawls the rental information. To break through the site's anti-crawler measures, techniques such as an IP proxy pool and Bloom-filter URL deduplication are used. The data is stored in a database and cleaned with Pandas, and finally rendered as charts on the front end with Flask and Echarts to analyse the distribution of rental area and price in Hangzhou. The analysis shows that the housing supply is concentrated in Yuhang District and Xiaoshan District, rents are highest in Binjiang District, and rental units are mostly on higher floors; these conclusions provide a reference for tenants.
Keywords: Scrapy; web crawler; anti-crawler; URL deduplication; data visualization

1. Function introduction

  This project focuses on information acquisition and data visualization analysis in the era of the big-data Internet, taking the rental listings on Lianjia.com in Hangzhou as its research object. A crawler for Lianjia.com rental information is implemented with the Scrapy framework, and a MySQL database serves as the hub between collection and the front end for storing the data. The raw data is then analysed and processed to remove dirty data and normalise it, which prepares it for the subsequent visualization. Finally, the data is presented as charts, which gives tenants some guidance and lets them view rental information more clearly.

2. Development environment

Development language: Python (Scrapy framework)
Software version: Python 3.7 / Scrapy
Database tool: Navicat 11
Development software: PyCharm / VS Code


3. Overall structure design of the system

The system is roughly divided into four modules: a data acquisition module, a data storage module, a data processing module, and a data visualization module, as shown in Figure 4-1. The data acquisition module is responsible for crawling Lianjia.com data; the data storage module stores the collected data; the data processing module standardises and quantifies the raw data, laying the foundation for the subsequent visualization; and the data visualization module provides a web platform that displays the data to users graphically.

Figure 4-1 Overall structure design of the system

4.2 Data acquisition module

First of all, it is necessary to clarify the structure of the web pages to be crawled. The main page for Hangzhou rentals on Lianjia.com is shown in Figure 4-2. The first red box marks the administrative districts of Hangzhou, and the second red box marks the price bands. Analysis of the site structure shows that when there are too many listings, the site only displays the first 100 result pages. Therefore, to capture the data completely, the pages have to be crawled by category. The link structure can be expressed as "https://hangzhou.lianjia.com/zufang/" + region + pg + price range. This article crawls by category, combining region with price band.

Figure 4-2 Lianjia.com Hangzhou rental main page
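
Because the site caps each listing view at 100 pages, the crawl has to be split by district and price band. The sketch below shows one way the category URLs could be generated; the district slugs and price-band codes are illustrative assumptions, not the author's actual values:

```python
# Sketch: build per-category start URLs so that no single category
# exceeds the 100-page display limit.
BASE = "https://hangzhou.lianjia.com/zufang/"

# Hypothetical district slugs and price-band codes; the real values
# come from the filter links on the Lianjia listing page.
DISTRICTS = ["xihu", "gongshu", "binjiang", "xiaoshan", "yuhang"]
PRICE_BANDS = ["rp1", "rp2", "rp3", "rp4", "rp5", "rp6"]


def build_start_urls(max_pages=100):
    """Yield one paginated URL per (district, price band, page)."""
    for district in DISTRICTS:
        for band in PRICE_BANDS:
            for page in range(1, max_pages + 1):
                yield f"{BASE}{district}/pg{page}{band}/"


if __name__ == "__main__":
    for url in list(build_start_urls(max_pages=1)):
        print(url)
```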

4.2.1 Set entity Item

Item is a container for saving crawled data. It makes it more convenient to manipulate and save the scraped data. Its usage is similar to a Python dictionary, and it provides an additional protection mechanism that prevents fields from being defined incorrectly through spelling mistakes. The fields to be crawled can be declared with scrapy.Field() in items.py. The definition of the ZufangItem class is shown in Figure 4-3:
Figure 4-3 Definition of the ZufangItem class
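
The figure is a screenshot in the original article; as a stand-in, here is a minimal sketch of what such an Item definition could look like. The field names are assumptions inferred from the charts produced later, not the author's exact code:

```python
import scrapy


class ZufangItem(scrapy.Item):
    """Container for one rental listing; field names are illustrative."""
    title = scrapy.Field()     # listing title, later used for the word cloud
    district = scrapy.Field()  # administrative district, e.g. Binjiang
    price = scrapy.Field()     # monthly rent in yuan
    area = scrapy.Field()      # floor area in square metres
    layout = scrapy.Field()    # unit layout, e.g. "3室2厅2卫"
    floor = scrapy.Field()     # floor information, e.g. "高楼层/24层"
```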

4.2.2 Web scraping

The crawling itself is implemented in lianjia.py under the spiders folder, which is the core of the crawler. To crawl the whole site, this article uses CrawlSpider to handle page turning within each sub-category. The CrawlSpider class is a subclass of Spider. The Spider class can only crawl the pages in its start_urls list; to crawl further, requests have to be sent manually. The CrawlSpider class, however, defines rules that provide a convenient mechanism for following links, which makes it better suited to whole-site crawling. Compared with Spider, CrawlSpider adds new attributes: LinkExtractor objects are used to extract links, and rules may contain one or more Rule objects, each of which defines a specific action to perform while crawling the site.
  Here we define the LianjiaSpider class, which inherits from CrawlSpider. Its main code is shown in Figure 4-4:
Figure 4-4 Main code of the LianjiaSpider class
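
The main code is again only available as a screenshot, so the following is a sketch of the pattern described above: a CrawlSpider whose Rule/LinkExtractor pair follows the category and pagination links. The link pattern and CSS selectors are assumptions and would need to be checked against the live page:

```python
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class LianjiaSpider(CrawlSpider):
    name = "lianjia"
    allowed_domains = ["hangzhou.lianjia.com"]
    start_urls = ["https://hangzhou.lianjia.com/zufang/"]

    # Follow category/pagination links such as /zufang/xihu/pg2rp3/
    # and hand every matched page to parse_list.
    rules = (
        Rule(LinkExtractor(allow=r"/zufang/\w+/pg\d+"),
             callback="parse_list", follow=True),
    )

    def parse_list(self, response):
        # Selectors are illustrative; the real page structure must be inspected.
        for card in response.css("div.content__list--item"):
            yield {
                "title": card.css("p.content__list--item--title a::text")
                             .get(default="").strip(),
                "price": card.css("span.content__list--item-price em::text").get(),
            }
```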

4.3 Data storage module

  In the data storage module, a MySQL database is used for data storage. First make sure MySQL is installed, then install a visual database management tool to make managing the data easier; this article uses HeidiSQL. The data storage module hooks into the pipeline of the Scrapy framework: before the pipeline can store data, the corresponding database and table must be created in MySQL to receive the data passed along by the pipeline.
  Create a new database lianjia, and in it create a table hangzhou to store the crawled raw data. The structure of the table is shown in Figure 4-6:

Figure 4-6 Structure of table hangzhou
In this way, data can be stored in MySQL through the pipeline in Scrapy.
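
A minimal sketch of such a pipeline using pymysql is shown below. The connection settings are placeholders, and the column names follow the Item sketch from section 4.2.1 rather than the author's actual table definition:

```python
import pymysql


class MysqlPipeline:
    """Store each scraped item in the hangzhou table of the lianjia database."""

    def open_spider(self, spider):
        self.conn = pymysql.connect(host="localhost", user="root",
                                    password="******", database="lianjia",
                                    charset="utf8mb4")
        self.cursor = self.conn.cursor()

    def process_item(self, item, spider):
        sql = ("INSERT INTO hangzhou (title, district, price, area, layout, floor) "
               "VALUES (%s, %s, %s, %s, %s, %s)")
        self.cursor.execute(sql, (item.get("title"), item.get("district"),
                                  item.get("price"), item.get("area"),
                                  item.get("layout"), item.get("floor")))
        self.conn.commit()
        return item

    def close_spider(self, spider):
        self.cursor.close()
        self.conn.close()
```

The pipeline then has to be enabled through the ITEM_PIPELINES setting in settings.py so that Scrapy routes every scraped item through it.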

5. Visual page display

(1) Scatter chart of rental price and area in Hangzhou
  The scatter relationship between rental price and area in Hangzhou is shown in Figure 5-3. Most of the data is concentrated in the 0-300㎡ area range and the 0-10,000 yuan price range, and basically follows the principle that the larger the area, the higher the price. Listings larger than 400㎡ may be properties rented for commercial use. A sketch of how such a chart can be generated follows the figure.

Figure 5-3 Scatter chart of rental price and area in Hangzhou
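
In the original work the charts are rendered by Echarts in the browser via Flask; purely as an illustration, a similar scatter chart can be produced directly in Python with pyecharts (the Python wrapper for Echarts). The CSV export and column names are assumptions:

```python
import pandas as pd
from pyecharts import options as opts
from pyecharts.charts import Scatter

# Assumed export of the cleaned listing data with "area" and "price" columns.
df = pd.read_csv("hangzhou_clean.csv")

scatter = (
    Scatter()
    .add_xaxis(df["area"].tolist())
    .add_yaxis("rent", df["price"].tolist(), symbol_size=4)
    .set_series_opts(label_opts=opts.LabelOpts(is_show=False))
    .set_global_opts(
        title_opts=opts.TitleOpts(title="Rent vs. area in Hangzhou"),
        xaxis_opts=opts.AxisOpts(type_="value", name="area (㎡)"),
        yaxis_opts=opts.AxisOpts(type_="value", name="price (yuan/month)"),
    )
)
scatter.render("scatter.html")  # writes a standalone HTML chart
```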
(2) Chart of the number and average price of rental houses in various districts of Hangzhou
  The number and average price of rental houses in each district of Hangzhou are shown in Figure 5-4. Yuhang District and Xiaoshan District have the largest number of rental listings, and their prices are lower than elsewhere because they are far from the city centre. Binjiang District has the highest average rent: it is Hangzhou's high-tech development zone, home to companies such as NetEase, Alibaba, Hikvision and Dahua, so rental demand there is huge, which is probably one reason its average rent is so high. Those who work in Binjiang can instead choose to live in the nearby Xiaoshan District. A sketch of the underlying aggregation follows the figure.


Figure 5-4 The number and average price of rental houses in various districts of Hangzhou
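
The per-district counts and average prices behind Figure 5-4 are a simple aggregation. A sketch with Pandas, again assuming the cleaned data has district and price columns:

```python
import pandas as pd

df = pd.read_csv("hangzhou_clean.csv")  # assumed export of the cleaned table

district_stats = (
    df.groupby("district")
      .agg(listings=("price", "size"),    # number of listings per district
           avg_price=("price", "mean"))   # average monthly rent per district
      .sort_values("listings", ascending=False)
      .round({"avg_price": 0})
)
print(district_stats)
```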
(3) Histogram of rental area in Hangzhou
  The distribution of rental area in Hangzhou is shown in Figure 5-5. Houses of 60-90㎡ are the most common, and small houses of 30-60㎡ are also very popular.

Figure 5-5 Histogram of rental area in Hangzhou
(4) Floor height distribution map in Hangzhou
The distribution of the total number of floors of rental houses in Hangzhou is shown in Figure 5-6. The listings are mainly in high-rise buildings (above 20 floors), and relatively few low-floor houses are for rent.


Figure 5-6 Floor height distribution map in Hangzhou
(5) Distribution map of rental units in Hangzhou
The distribution of rental unit layouts in Hangzhou is shown in Figure 5-7. The most common layouts are three bedrooms, two living rooms and two bathrooms, or two bedrooms, one living room and one bathroom; such units are usually shared with other tenants.

Figure 5-7 Hangzhou rental house type distribution map
(6) Hangzhou rental hot word map
  The listing titles on Lianjia.com are extracted and segmented with jieba, the word frequencies are counted, and the resulting word cloud is shown in Figure 5-8. It turns out that, to attract tenants' attention, landlords commonly use labels such as "well decorated", "viewing at any time" and "close to the subway" [31]. A sketch of this processing follows the figure.

Figure 5-8 Hangzhou rental hot word chart
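
Below is a sketch of the title processing described above, using jieba for segmentation and collections.Counter for the word frequencies. The stop-word list, the font path and the use of the wordcloud package for rendering are assumptions; the original article draws the cloud on the front end with Echarts:

```python
from collections import Counter

import jieba
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("hangzhou_clean.csv")        # assumed export of the cleaned table
stopwords = {"出租", "整租", "合租"}            # illustrative stop words

words = []
for title in df["title"].dropna():
    # keep words of at least two characters that are not stop words
    words.extend(w for w in jieba.cut(title) if len(w) > 1 and w not in stopwords)

freq = Counter(words)
cloud = WordCloud(font_path="simhei.ttf",     # a Chinese font is required
                  width=800, height=600,
                  background_color="white").generate_from_frequencies(freq)
cloud.to_file("wordcloud.png")
```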
(7) Overall display
Finally, these charts are gathered on a single HTML page, as shown in Figure 5-9 (a minimal Flask sketch follows the figure):


Figure 5-9 Overall display
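
A minimal sketch of how such a page could be served with Flask; the template name is an assumption, and index.html would embed the Echarts charts (or the chart HTML files rendered earlier):

```python
from flask import Flask, render_template

app = Flask(__name__)


@app.route("/")
def index():
    # index.html pulls in the Echarts charts generated from the cleaned data.
    return render_template("index.html")


if __name__ == "__main__":
    app.run(debug=True)
```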

Conclusion

Today's era is the era of data, with huge amounts of it generated every day. People's worry is no longer where to get information, but how to extract useful information from too much of it. At the same time, because housing prices have risen, young people now tend to rent, and faced with a vast amount of rental information, accurately obtaining what they want has become a real problem. This article therefore studies crawling rental website data and visual analysis based on the Scrapy framework, taking the rental information of Lianjia.com in Hangzhou as its object. The main work is as follows:
(1) Introduce in detail the Scrapy framework, database storage, and the theory behind URL deduplication.
(2) Crawl the data with the Scrapy framework, and describe in detail five countermeasures against the site's anti-crawler mechanisms together with their implementation code.
(3) Preprocess the data and successfully display it on the front end with Flask and Echarts.

Table of contents

Abstract
1 Introduction
1.1 Research background and significance
1.2 Research status
1.3 Research content
1.4 Paper structure
2 Introduction to web crawler related technologies
2.1 Overview of web crawlers
2.2 Introduction to the Scrapy framework
2.2.1 Scrapy framework structure
2.2.2 Data flow in the Scrapy framework
2.2.3 Scrapy execution process
2.3 Database storage
2.3.1 MySQL
2.3.2 Redis
2.4 URL deduplication
2.4.1 Scrapy native deduplication module
2.4.2 Scrapy-Redis native deduplication module
2.4.3 Bloom filter algorithm
2.5 Summary of this chapter
3 Breaking the crawler blockade based on the Scrapy framework
3.1 Reduce request frequency
3.2 Disguise as a random browser
3.3.1 Overview of the robots protocol
3.3.2 Robots protocol example
3.3.3 Bypassing the robots protocol based on Scrapy
3.4 Disabling cookies
3.5 IP proxy pool
3.5.1 Overview of proxy servers
3.5.2 Creating an IP proxy pool based on Scrapy
3.6 Summary of this chapter
4 System design and implementation
4.1 Overall system structure design
4.2 Data collection module
4.2.1 Set entity Item
4.2.2 Web crawling
4.2.3 Data persistence
4.2.4 URL deduplication optimization
4.2.5 Breaking through anti-crawler restrictions
4.3 Data storage module
4.4 Data processing module
4.4.1 Missing value processing
4.4.2 Outlier processing
4.4.3 Data sorting
4.5 Data visualization module
4.5.1 Basic use of the Flask framework
4.5.2 Visual charts
4.6 Summary of this chapter
5.1 Test environment
5.2 Test results
5.2.1 Data display
5.2.2 Visual page display
6 Summary and reflection
6.1 Summary
6.2 Reflection
References
Acknowledgments

Origin blog.csdn.net/QQ2743785109/article/details/134101930