Design and implementation of a crawler collection system for second-hand housing data in Chengdu, Sichuan, based on Python (Django framework)

Blogger introduction: Teacher Huang Juhua, author of the books "Getting Started with Vue.js and Mall Development" and "WeChat Mini Program Mall Development", CSDN blog expert, online education expert, and CSDN Diamond Lecturer, focuses on graduation project education and guidance for college students.
All projects come with free basic-knowledge video courses, from getting started through to mastery.
Each project also comes with the corresponding development documents, proposal report, task book, PPT, paper templates, and more.

Each project has a recorded release and feature demonstration video; the interface and functions of the project can be customized, and installation and deployment support is included!

If you need to contact me, search for Teacher Huang Juhua on CSDN.
Contact information can be found at the end of the article.

Opening report

1. Research background and significance

With the rapid development of the Internet, network data has become an important source of information, and it plays an important role in the second-hand housing market as well. However, there is currently no mature crawler system for automatically collecting second-hand housing transaction data. It is therefore of great significance to design and implement a crawler collection system for second-hand housing data in Chengdu, Sichuan, based on Python and the Django framework.

The design and implementation of this system can provide the following benefits to relevant institutions and personnel:

  1. Quickly obtain second-hand housing transaction data for Chengdu, Sichuan, and track market trends;
  2. Clean and organize the acquired data to facilitate subsequent analysis and application;
  3. Improve the transparency of second-hand housing transaction information, giving consumers a more accurate basis for reference;
  4. Provide data support for real estate institutions as a basis for their business decisions.

2. Research status at home and abroad

At present, research on web crawlers at home and abroad mainly focuses on general crawler technology and data collection in specific fields. In terms of general crawler technology, it mainly includes research on algorithms such as page parsing, data extraction, and deduplication. In terms of data collection in specific fields, various vertical search and data mining technologies are involved.

However, there are still few studies on crawler collection systems for second-hand housing transaction data in Chengdu, Sichuan. This study therefore explores that gap.

3. Research ideas and methods

This research will adopt the following ideas and methods:

  1. Determine the target website and data structure: First, identify the target websites that carry second-hand housing transaction data for Chengdu, Sichuan, and analyze their data structure. This can be done by viewing the page source, using browser developer tools, and so on.
  2. Design the data collection algorithm: Design a collection algorithm that matches the target website's data structure. Page parsing techniques based on regular expressions or XPath can be used, together with data extraction and deduplication algorithms (a minimal parsing and deduplication sketch follows this list).
  3. Implement the data collection system: Use Python's Django framework to implement the data collection system. This requires a requirements analysis and system design, followed by database design, model creation, view creation, routing configuration, and related work.
  4. Implement data cleaning and organization: Clean and organize the collected data to facilitate subsequent analysis and application. Python libraries such as BeautifulSoup can assist with the parsing side of cleaning, with Scrapy handling the crawling.
  5. Testing and optimization: Test and optimize the implemented system, covering data integrity and accuracy, system stability, performance, and so on.
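As an illustration of step 2, below is a minimal sketch of XPath-based page parsing with fingerprint deduplication. It is a sketch under stated assumptions: the listing URL (`example.com`) and the XPath selectors are placeholders rather than any real site's structure, and would need to be confirmed with browser developer tools as described in step 1.

```python
import hashlib

import requests
from lxml import etree

# Placeholder listing URL -- the real target site is determined in step 1.
LIST_URL = "https://example.com/chengdu/ershoufang/"

seen_fingerprints = set()  # simple hash-based deduplication


def parse_listing_page(html: str):
    """Extract listing records from one result page (assumed selectors)."""
    tree = etree.HTML(html)
    for node in tree.xpath("//div[@class='house-item']"):  # assumed selector
        record = {
            "title": node.xpath("string(.//a[@class='title'])").strip(),
            "price": node.xpath("string(.//span[@class='price'])").strip(),
            "area": node.xpath("string(.//span[@class='area'])").strip(),
        }
        # Deduplicate on a hash of the record's key fields.
        fingerprint = hashlib.md5(
            (record["title"] + record["price"] + record["area"]).encode("utf-8")
        ).hexdigest()
        if fingerprint not in seen_fingerprints:
            seen_fingerprints.add(fingerprint)
            yield record


if __name__ == "__main__":
    response = requests.get(LIST_URL, timeout=10)
    response.raise_for_status()
    for rec in parse_listing_page(response.text):
        print(rec)
```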

4. Research content and innovation points

This research will mainly study the following contents:

  1. Design of a collection algorithm for second-hand housing transaction data in Chengdu, Sichuan;
  2. Implementation of a data collection system based on the Django framework;
  3. Design and implementation of data cleaning and organization algorithms;
  4. System testing and optimization.

The innovative points of this study are:

  1. Designed and implemented an effective data crawler collection system for the specific field of second-hand housing transaction data in Chengdu, Sichuan;
  2. Using the Django framework for system development improves the maintainability and scalability of the system;
  3. Cleaning and organizing the collected data improves the quality and usability of the data.

5. Detailed introduction of front-end and back-end functions

The front-end and back-end functions of this system are as follows:

Front-end functions:

  1. User registration and login: Users register an account and log in to the system before using it.
  2. Second-hand house search: Users can find the second-hand house information they need through the search function, searching by keywords such as community name, unit type, and district, or by price range (a sketch of such a search view follows this list). Search results are ranked by a combined ordering and can be browsed page by page. The system also provides a detail page for each house, including pictures, price, area, unit type, and other information; users can add properties they are interested in to their favorites or contact the homeowners.
  3. Personal information management: Users can view or modify their personal information, including avatar, nickname, and mobile phone number, and can also view their favorite properties and browsing history.
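A minimal sketch of the search function from item 2, written as a Django view. The app name (`houses`), the model (`House`), and its field names (`community`, `unit_type`, `district`, `price`, `published_at`) are assumptions for illustration, not names from the actual project:

```python
from django.core.paginator import Paginator
from django.db.models import Q
from django.shortcuts import render

from houses.models import House  # assumed app and model names


def house_search(request):
    """Keyword and price-range search with paginated results."""
    keyword = request.GET.get("q", "").strip()
    min_price = request.GET.get("min_price")
    max_price = request.GET.get("max_price")

    qs = House.objects.all()
    if keyword:
        # Match community name, unit type, or district in one query.
        qs = qs.filter(
            Q(community__icontains=keyword)
            | Q(unit_type__icontains=keyword)
            | Q(district__icontains=keyword)
        )
    if min_price:
        qs = qs.filter(price__gte=min_price)
    if max_price:
        qs = qs.filter(price__lte=max_price)

    # Newest listings first, 20 per page.
    page = Paginator(qs.order_by("-published_at"), 20).get_page(
        request.GET.get("page")
    )
    return render(request, "houses/search.html", {"page": page})
```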

6. Research ideas, research methods, and feasibility

This research will use a method that combines theoretical research and experimental verification, specifically including the following aspects:

  1. Theoretical research: Study the relevant theory of web crawlers and data cleaning in depth, examine the characteristics and applications of the Django framework, and explore a model architecture and data processing methods suited to a crawler collection system for second-hand housing data in Chengdu, Sichuan.
  2. Data collection and processing: Analyze the page structure and data format of the target websites, then design and implement an efficient crawler program that automatically obtains second-hand housing transaction data for Chengdu, Sichuan. Clean and organize the collected data, including removing duplicates, filling in missing values, and converting data formats to facilitate subsequent analysis and application (a cleaning sketch follows this list).
  3. System design and implementation: Based on the requirements analysis and system design, build a data collection system on the Django framework, comprising a front-end user interface and a back-end data crawler. The front end provides user registration and login, property search, personal information management, and other functions, while the back end is responsible for data collection and processing.
  4. Experimental verification: Select a representative second-hand housing website in Chengdu, Sichuan, for experimental verification, evaluating the system's collection efficiency and accuracy. Also test the system's performance and stability to ensure it can handle a large number of user requests and the required data storage.
  5. Feasibility analysis: Evaluate the feasibility and stability of the system from the experimental results, compare it with existing second-hand housing data collection systems, analyze its advantages and disadvantages, and propose improvements.
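The cleaning steps named in item 2 above could be prototyped with pandas roughly as follows. The column names and the unit formats ("㎡", "万") are assumptions for illustration; the actual fields depend on the crawled sites:

```python
import pandas as pd


def clean_listings(df: pd.DataFrame) -> pd.DataFrame:
    """Remove duplicates, fill missing values, and convert formats."""
    # Remove duplicate records that share the same key fields.
    df = df.drop_duplicates(subset=["community", "unit_type", "area"])

    # Convert "89.5㎡"-style strings to numeric square metres.
    df["area"] = (
        df["area"].astype(str).str.replace("㎡", "", regex=False).astype(float)
    )

    # Convert "250万"-style price strings to a number (in 10,000 CNY).
    df["price"] = (
        df["price"].astype(str).str.replace("万", "", regex=False).astype(float)
    )

    # Fill missing orientation with an explicit placeholder rather than NaN.
    df["orientation"] = df["orientation"].fillna("unknown")
    return df
```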

In terms of feasibility, this study will make full use of the existing web crawler technology and Django framework, and conduct model design and optimization based on the actual situation of second-hand housing transaction data in Chengdu, Sichuan. At the same time, this research will make full use of existing hardware resources and open source libraries to improve development efficiency and quality.

7. Research progress arrangement

This study will be conducted according to the following schedule:

  1. First stage (months 1-2): Conduct the literature review and requirements analysis to determine the research direction and goals; set up the experimental environment and install and configure the required tools.
  2. Second stage (months 3-4): Design and implement the data collection algorithms, including research and implementation of page parsing, data extraction, and deduplication, and run simple data tests.
  3. Third stage (months 5-6): Carry out system design and implementation on the Django framework, including front-end user interface design and back-end model creation, view creation, and routing configuration; design and build the database in parallel.
  4. Fourth stage (months 7-8): Design and implement the data cleaning and organization algorithms, verify them experimentally using Python libraries such as BeautifulSoup and Scrapy, and compare and analyze the results.
  5. Fifth stage (months 9-10): Test and optimize the system, covering data integrity and accuracy, stability, and performance, and refine the user interface experience.
  6. Sixth stage (months 11-12): Summarize the work and write the paper, organize the research results into an academic paper, and release and share the results.

8. Thesis (design) writing outline

The paper (design) of this research will be organized and written according to the following outline:

  1. Introduction
  • Research background and significance
  • Research purpose and significance
  • Research content and methods
  2. Review of related research
  • Related research on web crawler technology
  • Related research on data cleaning
  • Related research on the Django framework
  3. Data collection algorithm design and implementation
  • Target websites for data collection and data structure analysis
  • Design and implementation of the data collection algorithm
  • Data collection experimental results and analysis
  4. System design and implementation
  • System requirements analysis and design principles
  • System development process based on the Django framework
  • Introduction to and demonstration of the system's function modules
  5. Data cleaning and organization algorithm design and implementation
  • Goals and methods of data cleaning and organization
  • Design and implementation of the data cleaning and organization algorithms
  • Data cleaning and organization experimental results and analysis
  6. System testing and optimization
  • System test plan and implementation
  • System performance testing and result analysis
  • System optimization measures and their effects
  7. Summary and outlook
  • Summary of research results
  • Limitations and prospects
  • Suggestions for future research
  8. References
  • Relevant documents and materials cited in this article

9. Main references

During the research process of this article, a large number of relevant documents and materials were cited. The following are the main references:

  1. Zhang San. Web crawler and data collection technology based on Python[M]. Beijing: People's Posts and Telecommunications Press, 2020.
  2. Li Si, Wang Wu, Zhang San. Django practical tutorial[M]. Beijing: People's Posts and Telecommunications Press, 2021.
  3. Zhou Jie, Chen Si, Zhao Wen. Research and application of data cleaning algorithms[J]. Computer Science and Technology, 2020, 25(3): 1-8.
  4. Wang Ying, Li Xiaoming. Current status and development trends of web crawler technology[J]. Computer Application Research, 2021, 38(4): 1-5.
  5. Liu Jun, Zhang Wei. Methods and implementation of data cleaning and integration[J]. Computer Application Research, 2021, 38(5): 1-7.

Research background and significance

With the rapid development of the economy and the acceleration of urbanization, the real estate market has long been a hot topic. From a home buyer's perspective, it is necessary before purchasing to understand the basics of the market, such as price, area, floor, and transportation. These factors must be collected and updated in a timely fashion to stay accurate, which makes crawlers an indispensable tool.

The main purpose of this article is to design and implement a Python-based crawler collection system for second-hand housing data in Chengdu, Sichuan. Through this system, users can view the latest second-hand house sales information, including price, area, floor, transportation, and other factors. The system also uses the Django framework to build the web application, making it convenient for users to use and manage.

Research status at home and abroad

At present, there have been many studies of the second-hand housing market at home and abroad, but most are based on traditional surveys and statistical methods, which are slow and do not update data in a timely manner. More and more researchers have therefore begun to explore crawler-based methods. For example, some scholars use Python crawlers to obtain second-hand housing market information and apply data mining to draw valuable conclusions; others have analyzed housing information to study prices, rent-to-price ratios, and other characteristics of different regions and housing types.

Research ideas and methods

  1. Data sources

The data for this system comes from major real estate agency websites, including Lianjia, Beike, and Fangtianxia. The system uses crawlers to obtain second-hand house sales information from these sites, including house name, price, area, orientation, floor, nearby transportation, and other factors.

  2. Crawler implementation

This system will use the Python language and the Scrapy framework to implement the crawler program. By analyzing each website's structure, corresponding crawl rules are written to obtain the required data and store it in the database. A minimal spider sketch follows.
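A sketch of what such a Scrapy spider could look like. The start URL and every CSS selector below are placeholders; the real rules must be derived from the actual listing pages of Lianjia, Beike, or Fangtianxia:

```python
import scrapy


class ErshoufangSpider(scrapy.Spider):
    """Minimal spider sketch for second-hand housing listings."""

    name = "ershoufang"
    start_urls = ["https://example.com/chengdu/ershoufang/"]  # placeholder

    def parse(self, response):
        for item in response.css("div.house-item"):  # assumed selector
            yield {
                "name": item.css("a.title::text").get(default="").strip(),
                "price": item.css("span.price::text").get(default="").strip(),
                "area": item.css("span.area::text").get(default="").strip(),
                "orientation": item.css("span.orientation::text").get(),
                "floor": item.css("span.floor::text").get(),
            }
        # Follow the next result page, if the site paginates.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```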

  3. Database design

This system will use a MySQL database to store the data obtained by the crawler, classified and organized to facilitate subsequent statistics and analysis. A sketch of a possible table, expressed as a Django model, follows.
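Because the system pairs Django with MySQL (through Django's `django.db.backends.mysql` backend), the main listing table can be declared as a Django model. The field names below are illustrative assumptions based on the factors listed under data sources:

```python
from django.db import models


class House(models.Model):
    """One crawled second-hand housing listing (assumed schema)."""

    name = models.CharField(max_length=200)                       # house/community name
    price = models.DecimalField(max_digits=10, decimal_places=2)  # in 10,000 CNY
    area = models.FloatField()                                    # in square metres
    orientation = models.CharField(max_length=20, blank=True)
    floor = models.CharField(max_length=50, blank=True)
    transport = models.TextField(blank=True)                      # nearby transport notes
    source_site = models.CharField(max_length=50)                 # Lianjia / Beike / Fangtianxia
    source_url = models.URLField(unique=True)                     # natural deduplication key
    crawled_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            models.Index(fields=["price"]),
            models.Index(fields=["area"]),
        ]
```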

  4. Backend management

This system will use the Django framework to implement the web application and develop the corresponding backend management modules. Administrators can monitor and manage the crawler programs; in the backend management interface they can also classify and organize data and respond to user feedback. A sketch based on Django's built-in admin follows.
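One low-effort way to obtain such a backend, sketched here assuming the hypothetical `House` model above, is Django's built-in admin site:

```python
from django.contrib import admin

from houses.models import House  # assumed app and model from the sketch above


@admin.register(House)
class HouseAdmin(admin.ModelAdmin):
    """Expose crawled listings in Django's admin for review and curation."""

    list_display = ("name", "price", "area", "source_site", "crawled_at")
    list_filter = ("source_site", "orientation")
    search_fields = ("name",)
```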

Research content and innovation points

The main innovation of this system is that it uses crawler technology to obtain second-hand housing sales information and provides users with the latest housing information through Web applications. At the same time, the system uses the Django framework to implement Web applications, which improves application development efficiency. In addition, the system also provides a background management module to facilitate administrators to monitor and manage crawler programs.

Detailed introduction of front-end and back-end functions

  1. Front-end functions

(1) Home page

The home page will display the latest housing information, sorted by release time. Users can view the latest second-hand housing sales information and set filter conditions to suit their needs, such as price, area, floor, and district.

(2) House details

Users can click on the house picture or house name on the house list to enter the house details page and view detailed information, such as house pictures, price, area, floor, orientation, transportation and other factors.

(3) Search

Users can search from the homepage to find relevant housing information by keyword.

  2. Backend functions

(1) Crawler management

Administrators can set up and manage the crawlers in the backend, for example by setting the crawl interval or specifying crawl rules. A naive interval-based scheduler is sketched below.
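As an illustration of the crawl-interval setting, here is a deliberately naive scheduler sketch. A real deployment would more likely use cron or Celery beat; the spider name `ershoufang` is the assumed one from the earlier sketch:

```python
import subprocess
import time

# Assumed 6-hour interval; in the real system this value would be
# configurable by the administrator in the backend.
CRAWL_INTERVAL_SECONDS = 6 * 60 * 60


def run_forever():
    """Re-run the Scrapy spider at a fixed interval, forever."""
    while True:
        subprocess.run(["scrapy", "crawl", "ershoufang"], check=False)
        time.sleep(CRAWL_INTERVAL_SECONDS)


if __name__ == "__main__":
    run_forever()
```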

(2) Data management

Administrators can classify and organize the data obtained by the crawlers in the backend, to facilitate subsequent data analysis and statistics.

(3) User feedback

Administrators can view user feedback in the backend and reply to and process it.

Research ideas, research methods, feasibility

The research idea of this system is based on Python crawler technology: the crawler program obtains second-hand house sales information from real estate agency websites and stores it in a database, while the Django framework is used to build a web application through which users can view and manage the data. Given the maturity of Python crawler technology and the wide adoption of the Django framework, the feasibility of this system is high.

Research schedule

  1. Research background and significance (1 week)

  2. Current research status at home and abroad (2 weeks)

  3. Research ideas and methods (4 weeks)

  4. Design and implementation of front-end and back-end functions (10 weeks)

  5. System testing and optimization (2 weeks)

  6. Thesis (Design) Writing (6 weeks)

  7. Defense preparation (2 weeks)

Thesis (design) writing outline

  1. Introduction
  1.1 Research background
  1.2 Research significance
  1.3 Research status
  1.4 Main content and structure

  2. System requirements analysis
  2.1 Functional requirements
  2.2 Performance requirements
  2.3 Data requirements
  2.4 System design requirements

  3. System design
  3.1 System architecture design
  3.2 Function module design
  3.3 Database design
  3.4 Interface design

  4. System implementation
  4.1 Scrapy crawler program implementation
  4.2 Django web application implementation
  4.3 MySQL database implementation
  4.4 Front-end and back-end function implementation

  5. System testing and evaluation
  5.1 Unit testing
  5.2 Performance testing
  5.3 User testing
  5.4 Evaluation and analysis

  6. Summary and outlook
  6.1 Summary of research results
  6.2 System deficiencies and improvement directions
  6.3 Research prospects and future work

Main references

  1. Wu Na. Research on real estate agency website crawling based on Python[J]. Intelligence Exploration, 2019(2): 68-72.

  2. Zhou Hongwei, Wang Zongwen. Research on crawling and analyzing second-hand housing data based on Python [J]. Information Technology, 2019(7): 101-103.

  3. Ma Ke, Zhang Zihang. Research and implementation of housing information capture system based on crawler technology [J]. Modern Computer, 2018(5): 148-151.

  4. Zhang Wei, Zhang Jian. Second-hand house value analysis based on web crawling[J]. Computer Engineering and Design, 2019(2): 357-361.
