Python web crawler frameworks: a first look at web crawlers


1. Foreword

  • Personal homepage: ζ Xiaocaiji
  • Hello everyone, I am Xiaocaiji. Let's learn Python's web crawler frameworks together.
  • If this article helps you, please follow, like, and bookmark (one-click three links).

2. Introduction

   With the advent of the big data era, the amount of information on the network keeps growing, and web crawlers play an increasingly important role on the Internet. This article introduces common techniques for implementing web crawlers in Python, as well as common web crawler frameworks.


3. Overview of web crawlers

   A web crawler (also known as a web spider or web robot, and in some communities called a web page chaser) automatically browses or grabs information on the network according to specified rules (web crawler algorithms). Python makes it easy to write crawler programs and scripts.

   Web crawlers appear everywhere in daily life, and search engines cannot work without them. For example, the crawler of the Baidu search engine is called Baidu Spider. Baidu Spider is an automated program of the Baidu search engine: every day it crawls massive amounts of Internet information, collecting and organizing web pages, images, videos, and other content. When a user then enters keywords into Baidu, the search engine finds relevant content from the collected information and presents it to the user in a certain order. While Baidu Spider works, the search engine runs scheduling programs that coordinate its crawling. These scheduling programs rely on specific algorithms; different algorithms lead to different crawling efficiency and different results. Therefore, when learning about crawlers, you need to understand not only how a crawler is implemented but also some common crawler algorithms. In certain cases, developers need to design the algorithms themselves.


4. Classification of web crawlers

   Based on the technology and structure used, web crawlers can be divided into the following types: general web crawlers, focused web crawlers, incremental web crawlers, deep web crawlers, and others. A real-world web crawler is usually a combination of several of these types:

1. General web crawler

   General-purpose web crawlers are also called whole-web crawlers. Their crawling range and volume are huge; precisely because the data they crawl is massive, they place high demands on crawling speed and storage space. General-purpose web crawlers have relatively low requirements on the order in which pages are crawled, and because there are so many pages to refresh they usually work in parallel, yet it still takes a long time to refresh any given page, so they have certain shortcomings. This kind of crawler is mainly used by large search engines and has great application value. A general-purpose web crawler is usually composed of an initial URL collection, a URL queue, a page crawling module, a page analysis module, a page data module, and a page filtering module.
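
   As a rough illustration of that structure, the sketch below maps these components onto simple Python objects; all of the names are illustrative choices for this article, not a standard API.

```python
from collections import deque

# Illustrative mapping of the components listed above onto simple Python
# structures; every name here is an assumption made for this sketch.

seed_urls = {"https://example.com/"}   # initial URL collection
url_queue = deque(seed_urls)           # URL queue: pages waiting to be crawled
seen_urls = set(seed_urls)             # used by the page filtering module

def crawl_page(url):
    """Page crawling module: download the raw HTML of one page."""
    ...

def analyze_page(html):
    """Page analysis module: extract data and outgoing links from the HTML."""
    ...

def store_page_data(data):
    """Page data module: persist the extracted data."""
    ...
```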

2. Focused web crawler

  A focused web crawler, also called a topic web crawler, selectively crawls pages related to a predefined topic. Compared with a general-purpose web crawler, it does not target resources across the entire Internet, but locates its target pages among those relevant to the topic. This greatly saves hardware and network resources, and because the number of pages to save is small, they are saved more quickly. Focused web crawlers are mainly used to crawl specific information and provide services to a specific group of people.
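
  To make the idea concrete, here is a minimal sketch of the topic check a focused crawler might add to its crawl loop; the keyword list and threshold are simplified assumptions, and real systems use far more sophisticated relevance models.

```python
# Minimal sketch of the topic filter in a focused crawler.
# The keyword set and threshold are illustrative assumptions.

TOPIC_KEYWORDS = {"python", "crawler", "spider", "scrapy"}

def is_relevant(page_text, threshold=2):
    """Keep a page only if it mentions at least `threshold` topic keywords."""
    text = page_text.lower()
    hits = sum(1 for keyword in TOPIC_KEYWORDS if keyword in text)
    return hits >= threshold

# Inside the crawl loop, pages that fail the check are simply discarded:
#     if not is_relevant(html):
#         continue
```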

3. Incremental web crawler

  As the name suggests, incremental web crawlers correspond to incremental updates: when updating, only the parts that have changed are updated, while unchanged parts are left alone. Accordingly, when crawling, an incremental web crawler only fetches newly generated or changed pages and skips pages that have not changed. This effectively reduces the amount of data downloaded and saves time and space, but it makes the crawling algorithm somewhat more complex.
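
  A common way to realize this "only crawl what changed" behaviour is to remember a fingerprint of each page and skip pages whose fingerprint has not changed. The hash-based check below is a simplified sketch; real crawlers also rely on HTTP headers such as Last-Modified and ETag.

```python
import hashlib

# Sketch of incremental crawling: remember a fingerprint per URL and
# re-process a page only when its content has changed. In a real crawler
# `page_fingerprints` would live in a database, not in memory.

page_fingerprints = {}   # url -> SHA-256 digest of the last fetched content

def has_changed(url, html):
    """Return True (and update the fingerprint) if the page content changed."""
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    if page_fingerprints.get(url) == digest:
        return False           # unchanged: skip re-parsing and re-storing
    page_fingerprints[url] = digest
    return True
```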

4. Deep web crawler

  On the Internet, web pages can be divided into surface web pages and deep web pages according to how they exist. Surface web pages are static pages that can be reached directly through static hyperlinks without submitting a form. Deep web pages are those whose content mostly cannot be reached through static page links; they are hidden behind search forms and can only be obtained after the user submits certain keywords. The amount of information in the deep web is hundreds of times that of the surface web, so deep pages are the main target of crawling.
  A deep web crawler mainly consists of six basic functional modules (crawl controller, parser, form analyzer, form processor, response analyzer, and LVS controller) and two internal crawler data structures (URL list and LVS table). Here LVS stands for label/value set, which represents the data source used to fill in forms.
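
  The defining step of a deep web crawler is filling in and submitting a search form rather than only following static hyperlinks. The sketch below uses the third-party requests library with a made-up form URL and field name to show how values drawn from an LVS could drive the form processor.

```python
import requests

# Minimal sketch of deep-web crawling via form submission.
# The URL, the form field name, and the keyword values are illustrative
# assumptions; the LVS (label/value set) supplies the values for the form.

SEARCH_URL = "https://example.com/search"        # hypothetical form action
LVS = {"keyword": ["python", "web crawler"]}     # label -> candidate values

def crawl_deep_pages():
    for value in LVS["keyword"]:
        # The form processor fills the form with one value from the LVS and
        # submits it; the result pages are not reachable via static links.
        response = requests.post(SEARCH_URL, data={"keyword": value}, timeout=10)
        response.raise_for_status()
        yield value, response.text

if __name__ == "__main__":
    for keyword, html in crawl_deep_pages():
        print(keyword, len(html))
```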


5. Basic principles of web crawlers


  • ① Specify a seed URL and put it in the queue
  • ② Get a URL from the queue
  • ③ Use the HTTP protocol to initiate a network request
  • ④ In the process of initiating a network request, it is necessary to convert the domain name into an IP address, that is, domain name resolution
  • ⑤ Get the response from the server, which is a binary input stream at this time
  • ⑥ Convert the binary input stream into an HTML document and parse the content (the content we want to grab, such as the title)
  • ⑦ Save the parsed content to the database
  • ⑧ Record the current URL and mark it as crawled to avoid repeated crawling next time
  • ⑨ From the current HTML document, parse out other URLs contained in the page for the next crawl
  • ⑩ Determine whether the parsed URL has been crawled, and discard it if it has been crawled
  • ⑪ Store URLs that have not been crawled in the queue of URLs waiting to be crawled
  • ⑫ Repeat the above steps until the queue of URLs waiting to be crawled is empty
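
  The twelve steps above can be condensed into a short loop. The sketch below uses urllib and regular expressions, with a stand-in seed URL and an in-memory list in place of a real database; it illustrates the flow only and is not production code.

```python
import re
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen

# Simplified sketch of the crawl loop described above. The seed URL, the
# regular expressions, and the in-memory "database" are stand-ins.

seed_url = "https://example.com/"     # ① specify a seed URL
url_queue = deque([seed_url])         #    and put it in the queue
crawled = set()                       # ⑧ record of URLs already crawled
database = []                         # ⑦ stand-in for a real database
MAX_PAGES = 10                        # cap the sketch so it terminates quickly

LINK_RE = re.compile(r'href="(http[^"]+)"')
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.S | re.I)

while url_queue and len(crawled) < MAX_PAGES:         # ⑫ repeat until the queue is empty
    url = url_queue.popleft()                         # ② get a URL from the queue
    if url in crawled:                                # ⑩ discard URLs already crawled
        continue
    # ③④⑤ issue the HTTP request (urllib performs the domain name resolution)
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", "ignore")   # ⑥ binary stream -> HTML text
    match = TITLE_RE.search(html)                     # ⑥ parse the content we want (the title)
    database.append({"url": url, "title": match.group(1) if match else None})   # ⑦ save it
    crawled.add(url)                                  # ⑧ mark the URL as crawled
    for link in LINK_RE.findall(html):                # ⑨ parse out other URLs on the page
        link = urljoin(url, link)
        if link not in crawled:                       # ⑪ queue URLs not yet crawled
            url_queue.append(link)
```

  A real crawler would also add error handling, politeness delays, and robots.txt checks, which are omitted here to keep the flow visible.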

  This concludes "Python web crawler frameworks: a first look at web crawlers". Thank you for reading. If this article helps you, please follow, like, and bookmark (one-click three links).

