[python] Introduction to crawlers

A crawler is a common tool for collecting Internet data, and its use has grown rapidly alongside the Internet itself. Before using a web crawler to collect web data, you should first understand the basic concepts and main classifications of crawlers, as well as the architecture, operating principles, common strategies, and main application scenarios of each type. At the same time, for copyright and data-security reasons, you also need to understand the current legal status of crawler applications and the protocols that crawlers are expected to follow when crawling websites.

The concept of a crawler

A web crawler, also known as a web spider or web robot, is a computer program or automated script that automatically downloads web pages.

A web crawler crawls along the threads of URLs on the Internet like a spider, downloads the web pages pointed to by each URL, and analyzes the content of the pages.
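As a minimal illustration of that download-and-analyze loop, the sketch below fetches a single page and pulls out its title and links. The URL is a placeholder, and it assumes the third-party requests and beautifulsoup4 packages are installed.

```python
import requests
from bs4 import BeautifulSoup

# Download one page (placeholder URL) and analyze its content.
resp = requests.get("https://example.com", timeout=5)
soup = BeautifulSoup(resp.text, "html.parser")

print(soup.title.string if soup.title else "no title")
print([a["href"] for a in soup.find_all("a", href=True)])  # links a crawler would follow next
```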

The principles of crawlers

1. Universal web crawler

A universal web crawler, also called a whole-Web crawler, expands its crawling scope from a batch of seed URLs to the entire Web. This type of crawler suits broad, topic-agnostic collection and is mainly used by search engines and large Web service providers. Two traversal strategies are common (a breadth-first sketch follows).

Depth-first strategy: follow links from a page ever deeper into its sub-links until no deeper links remain, then backtrack.

Breadth-first strategy: crawl according to the depth of a page in the site's directory hierarchy, fetching shallower pages first; only after all pages at one level have been crawled does the crawler move on to the next level.
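As a rough illustration of the breadth-first strategy, the sketch below keeps a FIFO queue of URLs so that pages closer to the seeds are fetched first. The seed URL is a placeholder, and the requests/beautifulsoup4 packages are assumed to be installed; swapping the queue for a stack would give the depth-first order instead.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def bfs_crawl(seed_urls, max_pages=50):
    """Breadth-first crawl: finish each level before going deeper."""
    queue = deque(seed_urls)            # FIFO queue gives breadth-first order
    visited = set(seed_urls)
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue                    # skip unreachable pages
        pages[url] = resp.text
        # Queue the next level of links found on this page.
        for a in BeautifulSoup(resp.text, "html.parser").find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in visited:
                visited.add(link)
                queue.append(link)
    return pages

# pages = bfs_crawl(["https://example.com"])   # placeholder seed
```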

2. Focused web crawler

A focused web crawler, also called a topic web crawler, selectively crawls only pages related to a preset topic. Four crawling strategies are common (a content-evaluation sketch follows this list).

Content-evaluation strategy: the query words entered by the user define the topic, and pages containing those query words are treated as topic-relevant.

Link-structure-evaluation strategy: the semi-structured nature of Web pages, which carry a lot of structural information, is used to evaluate the importance of links; the PageRank algorithm is one widely used example.

Reinforcement-learning strategy: reinforcement learning is introduced into the focused crawler; a Bayesian classifier classifies hyperlinks and computes the importance of each link, and links are visited in order of importance.

Context-graph strategy: a context graph is built to learn how pages relate to one another and to estimate the distance from the current page to relevant pages; links on pages with smaller distances are visited first.
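As one possible reading of the content-evaluation strategy, the toy function below scores a page by how densely the user's query words appear in its text. The threshold in the commented usage is arbitrary, and `extract_links` is a hypothetical helper.

```python
import re

def topic_score(html_text, query_words):
    """Content-evaluation relevance: density of query words in the page text."""
    text = re.sub(r"<[^>]+>", " ", html_text).lower()     # crude tag stripping
    hits = sum(text.count(word.lower()) for word in query_words)
    return hits / (len(text.split()) + 1)                 # normalize by page length

# Only links on pages judged on-topic are added to the crawl frontier, e.g.:
# if topic_score(resp.text, ["python", "crawler"]) > 0.01:   # arbitrary threshold
#     frontier.extend(extract_links(resp.text))              # extract_links: hypothetical helper
```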

3. Incremental web crawler

An incremental web crawler only incrementally updates pages it has already downloaded, or only crawls newly generated and changed pages. Local copies are refreshed by revisiting the pages so that the locally stored collection stays up to date. Three update policies are common (a change-detection sketch follows this list).

Uniform update: visit all pages at the same frequency, regardless of how often each page actually changes.

Individual update: decide how often to revisit each page based on how frequently that particular page changes.

Classification-based update: divide pages into a fast-changing class and a slow-changing class according to their change frequency, and visit the two classes at different frequencies.
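All three update policies need some way to tell whether a stored page has actually changed when it is revisited. A minimal sketch of one common approach, comparing content fingerprints, is shown below; the dictionary is just an in-memory stand-in for persistent storage, and `reparse_and_store` is a hypothetical downstream step.

```python
import hashlib

def page_fingerprint(html_text):
    """Hash of the page body; a different hash means the local copy is stale."""
    return hashlib.sha256(html_text.encode("utf-8")).hexdigest()

def needs_update(url, html_text, fingerprints):
    """Return True if the page is new or has changed since the last visit."""
    new_fp = page_fingerprint(html_text)
    if fingerprints.get(url) == new_fp:
        return False                 # unchanged: keep the local copy
    fingerprints[url] = new_fp       # record the latest version
    return True

# fingerprints = {}                  # in-memory stand-in for persistent storage
# if needs_update(url, resp.text, fingerprints):
#     reparse_and_store(url, resp.text)   # hypothetical downstream step
```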

4. Deep web crawler

By the way they exist, Web pages fall into two categories: surface pages and deep pages. Surface pages are pages that traditional search engines can index. Deep pages are pages whose content largely cannot be reached through static links; they are hidden behind search forms and are only returned after a user submits keywords. The core of a deep web crawler is form filling, which comes in the following two flavors (a form-submission sketch follows).

Form filling based on domain knowledge: the crawler maintains an ontology library and selects suitable keywords to fill in the form through semantic analysis.

Form filling based on page-structure analysis: used when there is little or no domain knowledge; the HTML page is represented as a DOM tree, the form is classified as single-attribute or multi-attribute, the two kinds are processed separately, and the value of each form field is extracted.
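To make the form-filling idea concrete, the sketch below submits a keyword to a search form with an HTTP POST. The form URL and the field name `q` are placeholders that would have to be read off the real form's HTML.

```python
import requests

SEARCH_URL = "https://example.com/search"     # hypothetical form action URL

def query_deep_page(keyword):
    """Submit a keyword to the search form and return the result page behind it."""
    resp = requests.post(SEARCH_URL, data={"q": keyword}, timeout=5)  # "q": placeholder field name
    resp.raise_for_status()
    return resp.text

# html = query_deep_page("web crawler")
```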

The legality of crawlers and the robots.txt protocol

The legality of crawlers

Currently, most websites allow data collected by crawlers to be used for personal study or scientific research. However, if the crawled data is used for other purposes, especially republication or commercial use, it may break the law or lead to civil disputes. The following two kinds of data must not be crawled, let alone used commercially.

Personal privacy data: names, mobile phone numbers, ages, blood types, marital status, and the like; crawling such data violates the Personal Information Protection Law.

Data that others have explicitly restricted: for example, content protected by account passwords or other permission controls, and encrypted content.

Copyright also needs attention: content that is copyright-protected and signed by its author must not be crawled and then republished or used commercially.

When using a crawler to collect data from a website, you need to comply with the agreement the website owner publishes for all crawlers: the robots.txt protocol.

This file is usually stored in the root directory of the website, and it specifies which parts of the site crawlers may fetch and which pages crawlers are not allowed to fetch.
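Python's standard library can read this file directly. The sketch below uses urllib.robotparser to check whether given paths may be fetched; the site, user-agent name, and paths are placeholders.

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")    # robots.txt sits in the site root
rp.read()

# can_fetch() reports whether this user agent is allowed to crawl a given URL.
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/page.html"))
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/public/index.html"))
```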

The purpose and means of website anti-crawling

1. Anti-crawling through User-Agent verification

When a browser sends a request, it attaches information about the browser and the current system environment in the request headers, and the server uses the User-Agent value to distinguish different browsers; requests whose User-Agent does not look like a real browser's can therefore be flagged as crawlers.
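For illustration, the snippet below sends a request whose User-Agent header imitates a desktop browser; the UA string and URL are placeholders, not values any particular site requires.

```python
import requests

# A browser-like User-Agent string (placeholder); servers compare this header
# against known browser signatures to tell browsers apart from plain scripts.
headers = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36")
}

resp = requests.get("https://example.com", headers=headers, timeout=5)
print(resp.request.headers["User-Agent"])    # the value the server actually saw
```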

2. Anti-crawling based on access frequency

Ordinary users browsing through a browser access a website much more slowly than a crawler does, so many websites use this difference and set a threshold on access frequency. If the number of requests from one IP per unit time exceeds the preset threshold, access from that IP is restricted: usually a CAPTCHA must be passed before normal access can continue, and in serious cases the IP may even be blocked from the site for a period of time.
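One way a crawler stays under such a threshold is to pause between requests. The sketch below waits a randomized interval before each fetch; the delay bounds are guesses rather than any site's published limit.

```python
import random
import time

import requests

def polite_get(url, min_delay=2.0, max_delay=5.0):
    """Sleep a randomized interval before fetching, to stay below a frequency threshold."""
    time.sleep(random.uniform(min_delay, max_delay))    # delay bounds are guesses
    return requests.get(url, timeout=5)

# for url in queued_urls:              # queued_urls: whatever the crawler has scheduled
#     resp = polite_get(url)
```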

3. Anti-crawling through CAPTCHA verification

Some websites require visitors to enter a CAPTCHA before they can continue, regardless of how frequently they visit. For example, on the 12306 website, a CAPTCHA must be passed both when logging in and when purchasing tickets, no matter how often the site is accessed.

4. Anti-crawling by changing the page structure

Some social networking sites change their page structure frequently. Since most crawlers parse the data they need according to the page structure, this also works as an anti-crawling measure: after the structure changes, the crawler can no longer find the content at its original location in the page.

5. Anti-crawling through account permissions

Some websites require users to log in before they can continue. Even though the login requirement is usually not designed as an anti-crawling measure, it does have that effect. For example, viewing comments on Weibo requires logging in.

Crawling strategy development

For the common anti-crawling measures introduced above, the corresponding crawling strategies can be formulated as follows (a combined session sketch follows this list).

Send a simulated User-Agent: disguise the User-Agent value of the requests sent to the website server so that it matches the value an ordinary user's browser would send.

Adjust the access frequency: probe the website's frequency threshold with a backup IP, then set the crawler's access frequency slightly below that threshold. This keeps crawling stable without making it unnecessarily slow.

Pass CAPTCHA checks: change the crawler's IP through a proxy, recognize the CAPTCHA algorithmically, or reuse cookies to bypass the CAPTCHA.

Respond to site-structure changes: for a one-off crawl, collect all the required data before the site adjusts its structure; otherwise, use scripts to monitor the structure and, when it changes, raise an alert and stop the crawler promptly.

Deal with account-permission restrictions: bypass them with a simulated login, which often also requires passing a CAPTCHA.

Evade detection with proxy IPs: switching IPs through a proxy effectively avoids detection by the website, but note that public proxy pools are closely monitored by websites.
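The sketch below combines several of these ideas with a requests.Session: a browser-like User-Agent, a proxy, and a simulated login whose cookies are reused on later requests. All URLs, credentials, field names, and the proxy address are placeholders; a real site's login flow (and any CAPTCHA) would differ.

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0"  # simulated browser UA
})
session.proxies.update({
    "http": "http://127.0.0.1:8080",     # placeholder proxy address
    "https": "http://127.0.0.1:8080",
})

# Simulated login: the URL and field names depend entirely on the target site.
session.post("https://example.com/login",
             data={"username": "user", "password": "pass"},   # placeholder credentials
             timeout=5)

# Later requests reuse the login cookies and go out through the proxy.
resp = session.get("https://example.com/protected/page", timeout=5)
```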
