Crawler basics (1) What is a web crawler

1. Understanding web crawlers

When it comes to web crawlers, people often use this analogy: if the Internet is a spider web, then a web crawler is a small bug crawling around on it. The crawler uses the link addresses found on web pages to discover other pages, and a search algorithm determines its route. It usually starts from a certain page of a website, reads the content of that page, finds the other link addresses it contains, and then uses those addresses to reach the next pages. The loop continues until all pages of the site have been crawled.

Web crawlers, also known as web spiders, web ants, or web robots, can automatically collect information from the Internet. Of course, when gathering information, the crawler needs to follow rules we set; these rules are called the web crawler's algorithm. With Python it is easy to write crawler programs that retrieve Internet information automatically.
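To make the loop described above concrete, here is a minimal sketch in Python that fetches one page and pulls out the link addresses it contains. It uses only the standard library, and the URL is just a placeholder, not a specific site from this article.

```python
# Fetch a page, then collect the link addresses (<a href=...>) inside it.
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_links(url):
    # Read the page content, then hand it to the parser to find new addresses.
    html = urlopen(url, timeout=10).read().decode("utf-8", errors="ignore")
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


if __name__ == "__main__":
    for link in fetch_links("https://example.com/"):   # placeholder URL
        print(link)
```

A real crawler would repeat this step on each discovered address, which is exactly the loop described in the analogy above.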

2. The composition of web crawlers

A web crawler is made up of control nodes, crawler nodes, and a resource library.

A web crawler can contain multiple control nodes, and each control node can manage multiple crawler nodes. The control nodes can communicate with one another; each control node can also communicate with the crawler nodes under it; and the crawler nodes belonging to the same control node can likewise communicate with one another.

  • Control node: also called the crawler's central controller. It is mainly responsible for allocating threads according to URL addresses and for calling crawler nodes to perform the actual crawling.
  • Crawler node: crawls web pages according to the relevant algorithms, which mainly involves downloading pages and processing their text. After crawling, it stores the results in the corresponding resource library. A rough sketch of this division of work follows this list.
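The sketch below illustrates, under simple assumptions, how a control node might hand URLs to several crawler-node threads that store their results in a shared resource library. The names (such as crawler_node and resource_library) are illustrative only.

```python
# Control node allocates URLs to worker threads; each worker downloads a
# page and stores the result in a shared "resource library" dictionary.
import queue
import threading
from urllib.request import urlopen

url_queue = queue.Queue()          # URLs allocated by the control node
resource_library = {}              # crawl results, keyed by URL
library_lock = threading.Lock()


def crawler_node():
    while True:
        url = url_queue.get()
        if url is None:            # sentinel from the control node: stop
            break
        try:
            page = urlopen(url, timeout=10).read()
            with library_lock:
                resource_library[url] = page
        except OSError:
            pass                   # skip pages that cannot be downloaded
        finally:
            url_queue.task_done()


def control_node(seed_urls, worker_count=3):
    workers = [threading.Thread(target=crawler_node) for _ in range(worker_count)]
    for w in workers:
        w.start()
    for url in seed_urls:
        url_queue.put(url)
    url_queue.join()               # wait until all allocated URLs are done
    for _ in workers:
        url_queue.put(None)        # tell every worker to stop
    for w in workers:
        w.join()


if __name__ == "__main__":
    control_node(["https://example.com/"])   # placeholder URL
    print(len(resource_library), "pages stored")
```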

3. Types of web crawlers

According to the technology and structure they implement, web crawlers can be divided into general-purpose web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers. In practice, a web crawler is usually a combination of these types.

1. General Web Crawler

General-purpose web crawlers, also called whole-web crawlers, target the entire Internet. The data they crawl is massive and the crawling scope is very large. Precisely because the amount of data involved is so huge, the performance requirements for this type of crawler are very high. This kind of crawler is mainly used in large search engines and has very high application value.

A general-purpose web crawler mainly consists of an initial URL collection, a URL queue, a page crawling module, a page analysis module, a page database, and a link filtering module. When crawling, general-purpose crawlers adopt certain crawling strategies, mainly the depth-first strategy and the breadth-first strategy.
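As a rough illustration of these components, the sketch below implements a breadth-first crawl: a URL queue drives the crawl, a visited set plays the role of the link filtering module, and a dictionary stands in for the page database. All names and limits are illustrative assumptions; switching popleft() for pop() would give depth-first behaviour instead.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(v for k, v in attrs if k == "href" and v)


def crawl_bfs(seed_url, max_pages=20):
    url_queue = deque([seed_url])   # initial URL collection + URL queue
    visited = set()                 # link filtering: skip URLs already seen
    page_database = {}              # page database: stores downloaded pages

    while url_queue and len(page_database) < max_pages:
        url = url_queue.popleft()   # breadth-first: take the oldest URL
        if url in visited:
            continue
        visited.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
        except OSError:
            continue
        page_database[url] = html   # page crawling module
        parser = LinkCollector()    # page analysis module
        parser.feed(html)
        for link in parser.links:
            url_queue.append(urljoin(url, link))
    return page_database


if __name__ == "__main__":
    pages = crawl_bfs("https://example.com/")   # placeholder URL
    print("crawled", len(pages), "pages")
```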

2. Focused Web Crawler

Focused web crawlers are also called topic crawlers. A focused web crawler selectively crawls pages according to pre-defined topics. Unlike a general-purpose crawler, it does not look for target resources across the entire Internet, but locates its target pages within pages related to the topic. This greatly saves the bandwidth and server resources required for crawling. Focused web crawlers are mainly used for crawling specific information, typically to provide a service for a particular group of users.

A focused web crawler mainly consists of an initial URL collection, a URL queue, a page crawling module, a page analysis module, a page database, a link filtering module, a content evaluation module, and a link evaluation module. There are four main crawling strategies for focused crawlers: strategies based on content evaluation, strategies based on link evaluation, strategies based on reinforcement learning, and strategies based on context graphs.
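As a simplified illustration of a content-evaluation strategy, the sketch below scores each downloaded page against a set of topic keywords and only keeps crawling from pages that look relevant enough. The scoring function and threshold are illustrative assumptions, not a standard algorithm from the text.

```python
def relevance_score(page_text, topic_keywords):
    """Fraction of topic keywords that appear in the page text."""
    text = page_text.lower()
    hits = sum(1 for word in topic_keywords if word.lower() in text)
    return hits / len(topic_keywords) if topic_keywords else 0.0


def should_follow(page_text, topic_keywords, threshold=0.3):
    # The content evaluation module: keep crawling from this page only if
    # it looks related enough to the predefined theme.
    return relevance_score(page_text, topic_keywords) >= threshold


if __name__ == "__main__":
    keywords = ["python", "crawler", "scraping"]
    sample = "An introduction to writing a web crawler in Python."
    print(relevance_score(sample, keywords))   # 2 of 3 keywords found
    print(should_follow(sample, keywords))     # True
```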

3. Incremental web crawler

The word incremental here corresponds to incremental updating: when updating, only the parts that have changed are updated, while unchanged parts are left alone. Accordingly, an incremental web crawler only crawls pages whose content has changed or pages that are newly generated; pages whose content has not changed are not crawled again. To a certain extent, incremental crawlers ensure that the crawled pages are as fresh as possible.
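One simple way to realize this idea, sketched below, is to remember a hash of each page's content and re-store the page only when the hash changes. The in-memory dictionary stands in for a real page database and is an illustrative assumption.

```python
import hashlib
from urllib.request import urlopen

seen_hashes = {}   # url -> hash of the last version we crawled


def crawl_if_changed(url):
    content = urlopen(url, timeout=10).read()
    digest = hashlib.sha256(content).hexdigest()
    if seen_hashes.get(url) == digest:
        return None            # content unchanged: skip this page
    seen_hashes[url] = digest  # new or updated page: record and return it
    return content


if __name__ == "__main__":
    first = crawl_if_changed("https://example.com/")    # placeholder URL
    second = crawl_if_changed("https://example.com/")
    print(first is not None, second is None)  # second is skipped if unchanged
```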

4. Deep web crawler

To introduce deep web crawlers, we first need some background knowledge about web pages:

(1) Static webpage

A static web page is one that contains no program code to be executed by the server. Such pages are usually stored on the server with the extension .htm or .html, which means their content is written in HTML.

HTML consists of many elements called tags. The language describes how text, graphics, and other elements are laid out and styled in the browser, where those elements are actually stored on the Internet (their addresses), and which URL should be opened when a certain piece of text or a graphic is clicked. When we browse a page with the .htm extension, the web server simply passes the file to the client's browser to be interpreted directly, without executing any program. Therefore, unless the site designer updates the page file, the page content will never change as a result of program execution.

(2) Deep pages and surface pages

Classified by how they exist, web pages can be divided into surface pages and deep pages. A surface page is a static page that can be reached through static links without submitting a form, while a deep page is a page that can only be obtained after submitting certain keywords. On the Internet, deep pages far outnumber surface pages.

(3) Form filling by deep web crawlers

There are two main ways a deep web crawler fills in forms:

  • Form filling based on domain knowledge: a keyword database is built for filling in forms; when a form needs to be filled, the appropriate keywords are selected based on semantic analysis.
  • Form filling based on web page structure analysis: simply put, this approach is generally used when domain knowledge is limited; it analyzes the structure of the page and fills in the form automatically. A toy sketch of the first approach follows this list.
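The sketch below illustrates the domain-knowledge approach under simple assumptions: keywords for a domain are kept in a small keyword store, and the crawler submits them through a site's search form (an HTTP POST) to reach deep pages. The form URL, field name, and keyword store are placeholders, not a real service.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Illustrative keyword database built from domain knowledge.
keyword_store = {
    "finance": ["stock", "bond", "exchange rate"],
    "travel": ["flight", "hotel", "itinerary"],
}


def fetch_deep_pages(form_url, domain, field_name="q"):
    pages = {}
    for keyword in keyword_store.get(domain, []):
        # Submitting the form with a keyword returns a page that has no
        # static link pointing to it, i.e. a deep page.
        data = urlencode({field_name: keyword}).encode("utf-8")
        pages[keyword] = urlopen(form_url, data=data, timeout=10).read()
    return pages


if __name__ == "__main__":
    results = fetch_deep_pages("https://example.com/search", "finance")
    print(list(results))
```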

Finally, let's summarize that deep web crawlers are crawlers that crawl deep pages on the Internet.

4. The purpose of web crawlers


Web crawlers can do many things in place of manual work. For example, they can power search engines or crawl images from websites: some people crawl all the pictures on certain sites so they can browse them in one place. Web crawlers can also be used in the field of financial investment, for instance to automatically crawl financial information for investment analysis.

Sometimes there are several news websites we particularly like, and opening each of them for browsing every time is tedious. In that case, a web crawler can be used to gather the news from all of these sites so we can read it in one place.

Sometimes, when we browse information on the web, we run into many advertisements. A crawler can be used to fetch the information from the relevant pages and automatically filter out the ads, making the information easier to read and use.

Sometimes we need to do marketing, and finding target customers and their contact information becomes a key problem. We could search the Internet manually, but that is very inefficient. Instead, we can set up appropriate rules for a crawler and have it automatically collect the target users' contact information and other data from the Internet for our marketing use.

Sometimes we want to analyze the users of a certain website, for example their activity level, how often they post, or which articles are popular. If we are not the site's administrator, compiling these statistics by hand would be an enormous job. With a crawler, we can easily collect this data for further analysis, and all of the crawling is done automatically; we only need to write the crawler and design the corresponding rules.


In addition, crawlers can implement many other powerful functions. In short, crawlers can, to a certain extent, replace manual access to web pages. Operations that used to require manually visiting Internet resources can now be automated with crawlers, so the useful information on the Internet can be exploited more efficiently.
