Computer Major Project Proposal Case 77: Design and Implementation of Web-based Text Crawler System


Table of contents

1. The significance of topic selection

2. Research status at home and abroad

2.1 Current status of foreign research

2.2 Domestic research status

3. Feasibility analysis

4. Main issues of research

5. Work focus

6. Work Difficulties

7. Main references


1. The significance of topic selection

With the rapid development of the Internet, big data has penetrated every industry and business function, and its value has become increasingly significant. Extracting meaningful, valuable data is therefore especially important, and web crawlers used for Internet information collection face both huge opportunities and challenges. At present, large search engines at home and abroad provide users only with non-customizable search services, while single-machine web crawlers cannot handle large-scale collection tasks. Existing distributed web crawlers, although powerful and efficient, are difficult for ordinary users to understand and use.

In recent years, the continuous development of Internet technology and the explosion of online data have brought us into the era of big data. All walks of life are deeply affected by it: big data continues to penetrate our daily work, life, and study, driving the progress of society. In this environment, obtaining targeted data through general-purpose search engines is difficult: it is inefficient and imprecise. Web crawler technology is an effective means of gathering and integrating data scattered across the Internet, and it can provide users with the data they need efficiently and accurately.

The Python-based movie data collection crawler proposed in this design can efficiently and accurately obtain the data resources we need. According to the target site and data-type keywords specified by the user, it crawls the required data, then cleans and classifies what it obtains. Efficient acquisition, timeliness, and accuracy of data all matter to users and are of great practical significance.
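The clean-and-classify step described above can be sketched in a few lines of Python. This is a minimal illustration, not the system's actual pipeline: the tag-stripping regex and the keyword grouping rule are assumptions, and real crawled pages would come from a framework such as Scrapy rather than a hard-coded list.

```python
import re

def clean_record(raw: str) -> str:
    """Strip HTML tags and collapse whitespace in a crawled text fragment."""
    no_tags = re.sub(r"<[^>]+>", "", raw)
    return re.sub(r"\s+", " ", no_tags).strip()

def classify(records, keywords):
    """Group cleaned records under the first keyword they mention."""
    groups = {k: [] for k in keywords}
    for raw in records:
        text = clean_record(raw)
        for k in keywords:
            if k.lower() in text.lower():
                groups[k].append(text)
                break
    return groups

# Illustrative raw fragments standing in for crawled pages.
raw_pages = [
    "<p>The Shawshank  Redemption  (1994)</p>",
    "<div>Forrest Gump (1994)</div>",
]
groups = classify(raw_pages, ["Redemption", "Gump"])
print(groups)
```

Grouping at collection time is what later makes "data queries and data calls" cheap: each record already carries its category.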

2. Research status at home and abroad

2.1 Current status of foreign research

At present, many web crawler systems have been designed abroad, notably UbiCrawler, Mercator, Nutch, and the Google crawler. Each of these systems has its own strengths, and all are highly efficient at data collection. A brief introduction to them follows.

Google's search engine uses a distributed web crawler system in which multiple servers access web pages and capture data in parallel. The system consists of a central host and multiple parallel crawler hosts. The central host reads the requested URLs and distributes them to the crawler hosts, which locate and capture the page data. After each crawler host finishes crawling, it converts the captured data into a defined format and sends it to the indexing process. The indexing process manages the page URLs and captured page data stored in the database, while a URL resolver process parses the URLs found in pages, saves the newly discovered URLs locally, and sends them back to the central host for reading. Through this cycle, the central host and the crawler hosts together continuously crawl the required data from the Internet.
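The central-host/crawler-host cycle described above can be illustrated with a toy in-process sketch. In the real system the hosts are separate machines communicating over the network; here a shared queue stands in for the central host's URL frontier and threads stand in for crawler hosts (all names and the fake "discovered link" are illustrative).

```python
import queue
import threading

frontier = queue.Queue()   # URLs the central host hands out
discovered = []            # links crawler hosts report back
lock = threading.Lock()

def crawler_host(worker_id: int):
    """Take a URL from the frontier; report parsed links back to the center."""
    while True:
        try:
            url = frontier.get(timeout=0.1)
        except queue.Empty:
            return  # frontier drained, host idles out
        # A real host would download and parse the page here; we fake one link.
        new_url = f"{url}/page{worker_id}"
        with lock:
            discovered.append(new_url)
        frontier.task_done()

for seed in ["http://example.com/a", "http://example.com/b"]:
    frontier.put(seed)

workers = [threading.Thread(target=crawler_host, args=(i,)) for i in range(2)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(sorted(discovered))
```

The essential property mirrored here is that crawling capacity scales by adding hosts, while the frontier keeps each URL assigned to exactly one host.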

2.2 Domestic research status

In China, many research institutions and university scholars have likewise done extensive research on distributed web crawler systems, and many excellent crawler systems have been produced.

In 2019, Li Wenlong studied in detail the orchestration and management tooling for Docker clusters and a distributed crawler system built on a Docker cluster. Having mastered the orchestration tool's working principles, scheduling mechanism, and management, he applied them to a distributed web crawler system and then designed and implemented distributed crawler modules suited to Docker clusters. Developers combine these modules as needed, ultimately forming an efficient and convenient distributed crawler system. This Docker-based system uses the Kubernetes orchestration tool to uniformly deploy and manage each functional module, so that the whole system runs on the target Docker cluster.

From domestic and foreign research on web crawler systems and the Scrapy framework, it is not difficult to see that existing work mainly targets a specific data type or a specific website; there is little research spanning different data types and different page layouts. The Scrapy-based data collection system designed here is a systematic project in which every link is closely connected, and it can, to a certain extent, support the collection and management of different page types and different data types.

3. Feasibility analysis

Technical feasibility: the system adopts a B/S architecture developed with Django, the mainstream Python web framework, together with lightweight HTML pages and the lightweight sqlite3 database. This is a mature, well-supported development environment and reduces development difficulty to a certain extent.
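The Django-plus-sqlite3 choice above amounts to a one-dictionary change in project settings. The fragment below is a hedged sketch of the relevant part of a hypothetical `settings.py` (the file path is illustrative); sqlite3 is file-based, so no separate database server needs to be installed or purchased.

```python
# Fragment of a hypothetical Django settings.py for this proposal's stack.
DATABASES = {
    "default": {
        # Django's built-in sqlite3 backend: file-based, zero configuration.
        "ENGINE": "django.db.backends.sqlite3",
        # Illustrative path; Django creates the file on first migrate.
        "NAME": "db.sqlite3",
    }
}
print(DATABASES["default"]["ENGINE"])
```

This is also what makes the economic argument in the next paragraph concrete: the entire persistence layer is a single local file.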

Economic feasibility: the Python development environment is free and open, obtainable at no cost from the Internet, and development can be completed on a personal laptop. There is no need to purchase servers or other hardware, so development cost is low and the project is economically feasible.

4. Main issues of research

The content of this design is a web text crawler built with Python. Python's Django framework is used to build the system, which crawls information from the Douban Top 250 movie pages. The crawled data files are cleaned and preprocessed; after a requirements analysis of the visualization effects to be achieved, the corresponding database is created and the cleaned data is imported into the sqlite3 database. The backend then uses Django together with Echarts to realize data visualization.
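The "clean, then import into sqlite3" step described above can be sketched with the standard library alone. This is a minimal illustration using an in-memory database; the field names mirror a Douban Top 250 movie record but the schema and cleaning rules are assumptions, not the project's actual ones.

```python
import sqlite3

# Illustrative raw rows as a crawler might emit them (strings, stray spaces).
raw_rows = [
    ("  The Shawshank Redemption ", "9.7", "1994"),
    ("Farewell My Concubine", "9.6", "1993"),
]

def clean(row):
    """Trim text fields and convert numeric fields to proper types."""
    title, rating, year = row
    return (title.strip(), float(rating), int(year))

conn = sqlite3.connect(":memory:")  # a real run would use the db.sqlite3 file
conn.execute("CREATE TABLE movie (title TEXT, rating REAL, year INTEGER)")
conn.executemany("INSERT INTO movie VALUES (?, ?, ?)", map(clean, raw_rows))

# A query like this would feed an Echarts bar chart of ratings.
top = conn.execute(
    "SELECT title, rating FROM movie ORDER BY rating DESC"
).fetchall()
print(top)
```

Storing typed, cleaned values (REAL ratings, INTEGER years) is what lets the visualization layer sort and aggregate without re-parsing strings.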

The paper designs and implements a data collection system based on the Django framework. Users specify the networks to be crawled in the form of a task tree: configure once, use many times. Similar data types can be classified, making subsequent data queries and data calls very convenient. The code-level crawler project is presented as web pages, which greatly lowers the barrier for users: they need not understand how the system operates or how web pages are parsed, and need only follow the necessary steps to build the data collection tasks they require.
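The "configure once, use many times" task tree described above might look like the following sketch: a nested mapping of site to data type to start URLs that the system flattens into concrete crawl tasks. The structure and all names here are illustrative assumptions, not the system's actual configuration format.

```python
# Hypothetical user configuration: site -> data type -> start URLs.
task_tree = {
    "douban.com": {
        "movie": ["https://movie.douban.com/top250"],
        "review": ["https://movie.douban.com/review/best"],
    },
}

def flatten(tree):
    """Yield (site, data_type, url) crawl tasks from the configured tree."""
    for site, types in tree.items():
        for data_type, urls in types.items():
            for url in urls:
                yield (site, data_type, url)

tasks = list(flatten(task_tree))
print(tasks)
```

Because each task carries its data type from the tree, crawled records arrive pre-classified, which is what makes later queries by type straightforward.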

5. Work focus

(1) Research and analyze the characteristics of data in the big data era and, based on the data needs of enterprises and individuals, explain the necessity of developing a data collection system based on the Django framework; study the historical development of data collection systems to lay a practical foundation for the design and implementation of this system;

(2) Explain the main technologies used in the design and implementation of data collection systems based on the Django framework;

(3) Analyze the business needs and functional requirements of the data collection system;

(4) Determine the system design principles, plan and elaborate on the overall system framework construction, functional module division and database design;

(5) Design and implement the functional modules of the system.

6. Work Difficulties

The difficulty of this design is that, after the data is crawled, it must be effectively cleaned and stored in the database in a consistent format before the Echarts visualization component can be used to display it.
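To make the cleaning difficulty concrete, the sketch below applies two typical rules to crawled rows before storage: discard rows whose numeric fields fail to parse, and discard empty or duplicate titles. The rules and sample data are illustrative, not the project's exact pipeline.

```python
def clean_rows(rows):
    """Keep only well-formed, de-duplicated (title, rating) pairs."""
    seen, out = set(), []
    for title, rating in rows:
        title = title.strip()
        try:
            rating = float(rating)
        except (TypeError, ValueError):
            continue  # discard rows with malformed ratings
        if not title or title in seen:
            continue  # discard empty titles and duplicates
        seen.add(title)
        out.append((title, rating))
    return out

dirty = [(" Leon ", "9.4"), ("Leon", "9.4"), ("", "8.0"), ("Coco", "n/a")]
cleaned = clean_rows(dirty)
print(cleaned)  # [('Leon', 9.4)]
```

Rows that survive this filter can be inserted into the database in a uniform typed format, which is exactly what the Echarts layer needs downstream.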

7. Main references

[1] Zhao Qiang. Travel website data analysis and visualization based on Python crawler[J]. Electronic Design Engineering, 2022(016):030.

[2] Yang Mengjiao, Du Qidong. Design and implementation of Python crawler website data analysis system [J]. Computer Age, 2022(11):4.

[3] Meng Baocan. Discussion on the application of Python web crawler[J]. Radio and Television Information, 2022, 29(3):108-110.

[4] Hong Lihua, Huang Qionghui. Research on Python crawler technology [J]. Value Engineering, 2022, 41(34):3.

[5] Liu Jie. Analysis of crawler technology based on Python language [J]. Mobile Information, 2022(005):000.

[6] Wang Guohua. Design and analysis of Douban movie web crawler based on Python.

[7] Feng Yanru. Design and implementation of Python-based web crawler system [J]. Computer and Information Technology, 2021.

[8] Gao Zuyan. Design and implementation of Python-based web crawler[J]. 2020.

[9] Wu Yuchao, Bao Zhengde, Tang Yawen. Python-based web crawler[J]. Computer Systems Network and Telecommunications, 2019.

[10] Du Chao. A brief analysis of web crawler technology based on Python[J]. 2019.

[11] Sun Jianyan, Ma Yuxin, Wu Wenjie. Web crawler system based on Python [J]. Computer Knowledge and Technology: Academic Edition, 2019, 15(9Z):3.

[12] Wang Jianglong, Wang Xiaohong. Implementation based on Python crawler technology [J]. Computer Programming Skills and Maintenance, 2019(9):4.

[13] Meng Xiaoqing. Analysis and prediction of factors affecting Chinese movie box office[D]. Tianjin University of Finance and Economics, 2018.

[14] Xu Qinya, Cai Jipeng, Wang Xing. Film data analysis based on Python [J]. Information Technology and Informatization, 2019(8):3.


Origin blog.csdn.net/hepingyundanfengqing/article/details/135099813