Crawler principles and anti-crawler technology

For the big data industry, the value of data is self-evident. In this era of information explosion, the Internet holds an enormous amount of data, and for small and medium-sized companies, making sensible use of crawlers to collect valuable data is one of the best ways to make up for their inherent lack of data. This article summarizes crawler technology from four angles: crawler principles, architecture, classification, and anti-crawler techniques.

1. Overview of crawler technology

A web crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules. Crawlers are widely used by Internet search engines and similar sites: they automatically collect the content of every page they can reach in order to obtain or update the site's content and indexes. Functionally, a crawler consists of three parts: data collection, processing, and storage.

A traditional crawler starts from the URLs of one or more initial web pages and extracts the URLs found on those pages. While crawling, it continuously pulls new URLs out of the current page and puts them into a queue, until certain stopping conditions are met. The workflow of a focused crawler is more complex: it filters out links unrelated to its topic using a web page analysis algorithm, keeps the useful links, and places them in the queue of URLs waiting to be crawled. It then selects the next URL to crawl from the queue according to a search strategy, and repeats the process until a stopping condition is reached. In addition, all pages fetched by the crawler are stored by the system, then analyzed, filtered, and indexed for later query and retrieval; for a focused crawler, the results of this analysis can also feed back into and guide subsequent crawling.

Compared with general web crawlers, focused crawlers also need to solve three main problems:

(1) Description or definition of the crawling target;

(2) Analysis and filtering of web pages or data;

(3) Search strategy for URLs.

2. Crawler principles

2.1 Principle of web crawler

The job of a web crawler system is to download web page data and supply a data source to a search engine system. Many large-scale search engines, such as Google and Baidu, are built on top of Web data collection, which shows how important the crawler system is to a search engine. Besides the text that users read, web pages also contain hyperlink information, and the crawler system keeps discovering other pages on the network through these hyperlinks. Because this collection process resembles a crawler or spider roaming across the network, the system is called a web crawler or web spider system (in English, Spider or Crawler).

2.2 Working principle of web crawler system

In the system framework of a web crawler, the main flow consists of three parts: the controller, the parser, and the resource library. The controller's main job is to assign work to each crawler thread in the multi-threaded setup. The parser's main job is to download web pages and process them, mainly handling JS script tags, CSS, whitespace, HTML tags, and similar content; the core work of the crawler is done by the parser. The resource library stores the downloaded page resources, generally in a large database such as Oracle, with indexes built on top.

Controller

The controller is the central control unit of the web crawler. It is mainly responsible for allocating a thread for each URL passed in by the system and starting that thread to crawl the corresponding page.

Parser

The parser is the main working component of the web crawler. Its tasks include downloading web pages and processing their text: filtering content, extracting specific HTML tags, and analyzing the data.

Resource library

It is a container for the data records downloaded from web pages and serves as the source for index generation. Typical medium and large database products include Oracle, SQL Server, and so on.

A web crawler system generally chooses the URLs of relatively important websites with high out-degree (many hyperlinks on the page) as the seed URLs. The crawler uses this seed set as the initial URLs to start crawling. Because pages contain link information, new URLs are continually obtained from the pages already fetched. The link structure between pages can be viewed as a forest, where the page behind each seed URL is the root of one tree in the forest.

In this way, the crawler system can traverse all pages using either a breadth-first or a depth-first algorithm. Since depth-first search may trap the crawler deep inside one site and miss pages closer to the site's home page, breadth-first search is generally used. The crawler first puts the seed URLs into the download queue, takes a URL from the head of the queue, and downloads the corresponding page. After the page content is fetched and stored, the links in the page are parsed to obtain new URLs, which are appended to the download queue. The crawler then takes out the next URL, downloads the corresponding page, and parses it again. This repeats until the whole network is traversed or some stopping condition is met.

The basic workflow of a web crawler is as follows:

1. Select a set of carefully chosen seed URLs;

2. Put these URLs into the URL queue to be crawled;

3. Take a URL from the to-be-crawled queue, resolve DNS to get the host's IP, download the page corresponding to the URL, and store it in the downloaded page library; also move the URL into the crawled URL set;

4. Parse the pages behind the crawled URLs, extract the other URLs they contain, and put those URLs into the to-be-crawled queue, entering the next cycle.
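As a minimal illustration of this workflow (not any particular framework's implementation), here is a single-threaded, breadth-first sketch in Python. It assumes the third-party `requests` and `beautifulsoup4` packages are available and uses a hypothetical seed URL:

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def crawl(seed_urls, max_pages=100):
    to_crawl = deque(seed_urls)      # queue of URLs waiting to be crawled
    crawled = set()                  # URLs that have already been fetched
    pages = {}                       # downloaded page library (URL -> HTML)

    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()     # take a URL from the head of the queue
        if url in crawled:
            continue
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue
        crawled.add(url)
        pages[url] = resp.text       # store the downloaded page

        # parse the page and append newly discovered links to the tail of the queue
        soup = BeautifulSoup(resp.text, "html.parser")
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if link.startswith("http") and link not in crawled:
                to_crawl.append(link)
    return pages

# pages = crawl(["https://example.com/"])   # hypothetical seed URL
```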

2.3 Crawl strategy

In a crawler system, the queue of URLs to be crawled is an important component. The order in which the URLs in this queue are arranged is also an important issue, because it decides which page is crawled first and which later. The method used to determine this order is called the crawl strategy. Several common crawl strategies are described below:

2.3.1 Depth-first traversal strategy

The depth-first traversal strategy means that the crawler starts from a start page and follows one chain of links to its end; after finishing that chain, it moves on to the next start page and continues following links. Taking the example graph from the original post (the figure is not reproduced here):

Traversal path: A-F-G, E-H-I, B-C-D

2.3.2 Breadth-first traversal strategy

The basic idea of the breadth-first traversal strategy is to append the links found in a newly downloaded page to the tail of the queue of URLs to be crawled. In other words, the crawler first crawls all pages linked from the starting page, then picks one of those linked pages and crawls all pages linked from it. Using the same example graph:

Traversal path: A-B-C-D-E-F, G-H-I
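Since the original figure is not reproduced here, the following sketch uses its own tiny, assumed link graph to show how the two strategies visit the same pages in different orders:

```python
from collections import deque

# A hypothetical link graph: each page maps to the pages it links to.
graph = {
    "A": ["B", "C"],
    "B": ["D", "E"],
    "C": ["F"],
    "D": [], "E": [], "F": [],
}

def dfs_order(start):
    """Depth-first: follow one chain of links to the end before backtracking."""
    order, stack, seen = [], [start], {start}
    while stack:
        page = stack.pop()
        order.append(page)
        # push children in reverse so the left-most link is followed first
        for link in reversed(graph[page]):
            if link not in seen:
                seen.add(link)
                stack.append(link)
    return order

def bfs_order(start):
    """Breadth-first: visit every link on a page before going one level deeper."""
    order, queue, seen = [], deque([start]), {start}
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in graph[page]:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

print(dfs_order("A"))  # ['A', 'B', 'D', 'E', 'C', 'F']
print(bfs_order("A"))  # ['A', 'B', 'C', 'D', 'E', 'F']
```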

2.3.3 Backlink Count Strategy

The number of backlinks refers to the number of links from other web pages that point to a given page. It indicates the extent to which the page's content is recommended by others. Many search engine crawling systems therefore use this indicator to evaluate page importance and decide the order in which different pages are crawled.

In a real network environment, because of advertising links and spam links, the backlink count cannot be fully equated with importance. Search engines therefore tend to count only backlinks from reliable sources.

2.3.4 Partial PageRank Strategy

The Partial PageRank algorithm borrows the idea of the PageRank algorithm: the downloaded pages, together with the URLs in the to-be-crawled queue, form a page set, and the PageRank value of each page in that set is calculated. The URLs in the to-be-crawled queue are then sorted by their PageRank values, and pages are crawled in that order.

Recalculating the PageRank values every time a single page is crawled is too expensive, so a compromise is to recalculate them after every K pages crawled. This raises another problem: the links discovered in downloaded pages, i.e. the unknown pages mentioned earlier, have no PageRank value yet. To solve this, these pages are given a temporary PageRank value: the PageRank contributions passed in along all of a page's in-links are summed to form the unknown page's value, which then takes part in the sorting.

2.3.5 OPIC strategy

This algorithm also scores page importance. Before the algorithm starts, every page is given the same amount of initial cash. After a page P is downloaded, P's cash is distributed among all the links extracted from P, and P's own cash is cleared. The pages in the to-be-crawled queue are then sorted by their amount of cash.
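A rough sketch of the OPIC (On-line Page Importance Computation) idea, with a hypothetical `fetch_outlinks` function standing in for the download-and-parse step:

```python
from collections import defaultdict

def opic_crawl(seeds, fetch_outlinks, max_pages=50):
    """fetch_outlinks(url) is a hypothetical function that downloads a page
    and returns the list of URLs it links to."""
    cash = defaultdict(float)
    for url in seeds:
        cash[url] = 1.0 / len(seeds)          # every page starts with equal cash

    downloaded = set()
    while len(downloaded) < max_pages:
        # pick the not-yet-downloaded URL with the most cash
        candidates = [u for u in cash if u not in downloaded]
        if not candidates:
            break
        page = max(candidates, key=lambda u: cash[u])

        links = fetch_outlinks(page)
        downloaded.add(page)
        if links:
            share = cash[page] / len(links)   # distribute P's cash to its outlinks
            for link in links:
                cash[link] += share
        cash[page] = 0.0                      # and clear P's own cash
    return downloaded
```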

2.3.6 Big site priority strategy

All URLs in the to-be-crawled queue are grouped by the website they belong to, and websites with the most pages waiting to be downloaded are downloaded first. This is known as the big site priority strategy.
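A small sketch of how such ordering might be implemented, grouping the pending URLs by host with the standard library's `urlparse` (the queue contents are made up):

```python
from collections import defaultdict
from urllib.parse import urlparse

def order_by_big_site(pending_urls):
    """Group pending URLs by site, then emit URLs from the largest sites first."""
    by_site = defaultdict(list)
    for url in pending_urls:
        by_site[urlparse(url).netloc].append(url)
    # sites with more pages waiting to be downloaded come first
    ordered_sites = sorted(by_site, key=lambda site: len(by_site[site]), reverse=True)
    return [url for site in ordered_sites for url in by_site[site]]

# Hypothetical queue: site-a has three pending pages, site-b has one,
# so site-a's pages are scheduled first.
queue = ["http://site-b.com/x", "http://site-a.com/1",
         "http://site-a.com/2", "http://site-a.com/3"]
print(order_by_big_site(queue))
```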

3. Crawler classification

Should you choose Nutch, Crawler4j, WebMagic, Scrapy, WebCollector, or something else to develop a web crawler? The crawlers mentioned above can roughly be divided into three categories:

(1) Distributed crawler: Nutch

(2) Java crawlers: Crawler4j, WebMagic, WebCollector

(3) Non-Java crawlers: Scrapy (developed in Python)

3.1 Distributed crawler

Distributed crawlers mainly use distributed technology to solve two problems:

1) Massive URL management

2) Network speed

The most popular distributed crawler at the moment is Apache's Nutch. But for most users, Nutch is the worst choice among these crawlers, for the following reasons:

1) Nutch is a crawler designed for search engines, while most users need a crawler for precise data extraction (fine-grained extraction). Two thirds of the work Nutch does is designed for search engines and is of little use for fine-grained extraction. In other words, using Nutch for data extraction wastes a lot of time on unnecessary computation. And if you try to modify Nutch to fit a fine-grained extraction business, you will basically destroy its framework and change Nutch beyond recognition; if you have the ability to modify Nutch that deeply, you might as well write your own distributed crawler framework.

2) Nutch relies on Hadoop to run, and Hadoop itself introduces a lot of overhead. If the cluster has only a few machines, the crawl speed may be lower than that of a single-machine crawler.

3) Nutch has a plug-in mechanism and promotes it as a highlight, and you can find open-source Nutch plug-ins that provide fine-grained extraction. But anyone who has actually written a Nutch plug-in knows how clunky the plug-in system is. Loading and calling plug-ins through reflection makes programs extremely hard to write and debug, let alone building a complex extraction system on top of it. Moreover, Nutch does not provide suitable plug-in mount points for fine-grained extraction: it has only five or six mount points, and they all serve the search engine pipeline. Most Nutch extraction plug-ins are mounted on the "page parser" mount point, which is really meant to parse links (to supply URLs for subsequent crawling) and to extract information that search engines can use easily, such as page meta information and body text.

4) Building a crawler through secondary development of Nutch often takes more than ten times as long to write and debug as a single-machine crawler. The learning cost of understanding the Nutch source code is very high, and it is even harder to ask everyone on a team to understand it. During debugging, all sorts of problems outside the program itself will appear (Hadoop problems, HBase problems).

5) Many people say that Nutch2 has Gora, which can persist data to Avro files, HBase, MySQL, and so on, but many of them misunderstand what this means. The persisted data here is the URL information (the data needed for URL management) stored in Avro, HBase, or MySQL, not the structured data you want to extract. For most people, it does not really matter where the URL information is stored.

6) Nutch2 is currently not really ready for development. The official stable version is Nutch 2.2.1, but it is bound to gora-0.3. If you want to use HBase with Nutch (most people use Nutch2 only because of HBase), you can only use HBase around version 0.90 and accordingly have to drop Hadoop down to around 0.20. The official Nutch2 tutorials are also quite misleading: there are two tutorial tracks, Nutch1.x and Nutch2.x, and the Nutch2.x page claims support for HBase 0.94. In practice, however, Nutch2.x refers to the code after Nutch 2.2.1 and before Nutch 2.3, a version that is continuously updated in the official SVN and is very unstable (it keeps changing).

Therefore, unless you are building a search engine, try not to choose Nutch as your crawler. Some teams like to follow the trend and insist on using Nutch to build fine-grained extraction crawlers, mostly because of Nutch's reputation; the usual result is that the project is delayed.

If you do want to build a search engine, Nutch1.x is a very good choice; combined with Solr or Elasticsearch it can form a very powerful search engine. If you must use Nutch2, it is better to wait until Nutch 2.3 is released, because the current Nutch2 is a very unstable version.

Distributed crawler platform architecture diagram

3.2 JAVA crawler

Java crawlers are put in a separate category here because the Java ecosystem for web crawlers is very complete and has the most documentation. There may be some controversy here; this is just my personal take.

In fact, developing an open-source web crawler (framework) is not difficult. The hard and complex problems have already been solved by others (such as DOM tree parsing and element location, character set detection, and large-scale URL deduplication), so there is arguably not much technical depth left. Even for Nutch, the real technical difficulty lies in Hadoop; the crawler code itself is quite simple. In a sense, a web crawler is similar to traversing local files and searching for information in them; there is no fundamental difficulty. The reason for choosing an open-source crawler framework is to save effort: modules such as URL management and the thread pool can be written by anyone, but making them stable takes a period of debugging and tuning.

Regarding crawler functionality, the questions users care about most are usually:

1) Does the crawler support multi-threading? Can the crawler use a proxy? Will the crawler crawl duplicate data? Can the crawler crawl information generated by JS?

If a crawler does not support multi-threading, proxies, or duplicate URL filtering, it should not be called an open-source crawler; it is just a loop that executes HTTP requests.

Whether JS-generated information can be crawled has little to do with the crawler itself. A crawler is mainly responsible for traversing websites and downloading pages; extracting JS-generated information belongs to the page extraction module and usually has to be done by simulating a browser (htmlunit, selenium). These simulated browsers often take a long time to process a single page, so one strategy is to use the crawler to traverse the site and, when it encounters a page that needs to be rendered, hand the page's information over to the simulated browser to extract the JS-generated content.

2) Can crawlers crawl ajax information?

Some data on web pages is loaded asynchronously. There are two ways to crawl such data: use a simulated browser (described in question 1), or analyze the ajax HTTP request, construct the ajax request URL yourself, and fetch the returned data. If you construct the ajax requests yourself, what is the point of using an open-source crawler? The answer is that you still want the open-source crawler's thread pool and URL management features (such as resuming an interrupted crawl).

If I can already generate the ajax requests (lists) I need, how can I use these crawlers to crawl these requests?

Crawlers are usually designed for breadth- or depth-first traversal of static or dynamic pages. Crawling ajax data belongs to the deep web, and although most crawlers do not support it directly, it can still be done in a few ways. For example, WebCollector traverses websites breadth-first, and the first round of crawling fetches every URL in the seed set. So, simply put, you can feed the generated ajax requests in as seeds and run a breadth traversal of depth 1 over them (the default traversal is breadth-first).

3) How does the crawler crawl websites that require login?

These open-source crawlers all support specifying cookies when crawling, and simulated login mainly relies on cookies. How to obtain the cookies is not the crawler's concern: you can copy them manually, simulate the login with an HTTP request, or log in automatically with a simulated browser to obtain them.
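A hedged sketch of both approaches with Python's `requests` library; the login URL, form fields, and cookie value below are hypothetical:

```python
import requests

# Option 1: paste a cookie captured manually from the browser (hypothetical value)
headers = {"Cookie": "sessionid=abc123; token=xyz"}
resp = requests.get("https://example.com/protected", headers=headers, timeout=10)

# Option 2: simulate the login request once; the Session object keeps the cookies
session = requests.Session()
session.post(
    "https://example.com/login",                       # hypothetical login endpoint
    data={"username": "user", "password": "secret"},   # hypothetical form fields
    timeout=10,
)
resp = session.get("https://example.com/protected", timeout=10)
print(resp.status_code)
```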

4) How does the crawler extract information from the web page?

Open-source crawlers generally integrate web page extraction tools and mainly support two specifications: CSS selectors and XPath. As for which one is better, I will not comment here.
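For illustration, a short sketch that extracts the same data with both specifications using `lxml` (CSS selectors additionally require the `cssselect` package); the HTML snippet and selectors are made up:

```python
from lxml import html   # requires the lxml (and cssselect) packages

page = html.fromstring("""
<html><body>
  <div class="title"><a href="/post/1">First post</a></div>
  <div class="title"><a href="/post/2">Second post</a></div>
</body></html>
""")

# CSS selector
titles_css = [a.text for a in page.cssselect("div.title a")]

# XPath
titles_xpath = page.xpath("//div[@class='title']/a/text()")

print(titles_css)    # ['First post', 'Second post']
print(titles_xpath)  # ['First post', 'Second post']
```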

5) How does the crawler save the information of the web page?

Some crawlers come with a module responsible for persistence. For example, webmagic has a module called pipeline: with simple configuration, the information extracted by the crawler can be persisted to files, databases, and so on. Other crawlers, such as crawler4j and WebCollector, do not provide a data persistence module directly and instead let users add the database write themselves in the page-processing module. Whether a pipeline-style module is a good idea is much like whether using an ORM is a good idea: it depends on your business.
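A sketch of the pipeline idea in plain Python, persisting extracted items to SQLite. This is not webmagic's or Scrapy's actual API, just the general shape of such a hook; the table, fields, and example item are hypothetical:

```python
import sqlite3

conn = sqlite3.connect("crawl.db")
conn.execute("CREATE TABLE IF NOT EXISTS items (url TEXT, title TEXT)")

def save_item(item):
    """Pipeline-style hook: called by the page-processing module for every item."""
    conn.execute("INSERT INTO items (url, title) VALUES (?, ?)",
                 (item["url"], item["title"]))
    conn.commit()

# Hypothetical extracted item handed over by the extraction step
save_item({"url": "https://example.com/post/1", "title": "First post"})
```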

6) What should I do if the crawler is blocked by the website?

If a crawler is blocked by a website, this can usually be solved with multiple proxies (random proxies). However, these open-source crawlers generally do not support random proxy switching directly, so users often need to put the collected proxies into a global array and write code to pick a proxy at random from that array.

7) Can web pages call crawlers?

The crawler is called from the server side of the web application, and you can use it there the same way you would use it anywhere else. All of these crawlers can be used that way.

8) How fast are these crawlers?

The speed of a single-machine open-source crawler can basically reach the limit of the local network bandwidth. When a crawl is slow, it is usually because the user opened too few threads, the network is slow, or the interaction with the database during persistence is slow; these things are mostly determined by the user's machine and secondary-development code. The speed of these open-source crawlers themselves is perfectly fine.
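A sketch of raising throughput by opening more download threads with the standard library's thread pool (the URL list is hypothetical). Since downloading is I/O bound, more workers usually help until bandwidth or the target site's tolerance becomes the limit:

```python
from concurrent.futures import ThreadPoolExecutor

import requests

def download(url):
    try:
        return url, requests.get(url, timeout=10).text
    except requests.RequestException:
        return url, None

urls = [f"https://example.com/page/{i}" for i in range(100)]  # hypothetical

# 20 worker threads download concurrently; tune max_workers to your bandwidth
# and to how much load the target site will tolerate.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = dict(pool.map(download, urls))
```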

9) The code is written correctly but the data still cannot be crawled; is there a problem with the crawler, and would switching crawlers solve it?

If the code is correct and the data still cannot be crawled, other crawlers will not be able to crawl it either. In that case either the website has blocked you, or the data you want is generated by JavaScript; switching crawlers will not solve it.

10) Which crawler can determine whether a website has been fully crawled, and which crawler can crawl by topic?

A crawler cannot determine whether a website has been fully crawled; it can only cover as much as possible.

As for crawling by topic, a crawler only knows what the topic is after it has downloaded the content, so the usual approach is to crawl everything and then filter the content. If the crawl is too broad, you can narrow the scope with URL regular expressions and similar restrictions.
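A tiny sketch of narrowing the scope with a URL regular expression before a link is put back into the queue (the pattern is hypothetical):

```python
import re

# Only follow links under /news/ on the target site (hypothetical pattern)
URL_PATTERN = re.compile(r"^https?://example\.com/news/.+")

def should_enqueue(url, crawled):
    """Keep a discovered link only if it is new and matches the scope pattern."""
    return url not in crawled and URL_PATTERN.match(url) is not None

print(should_enqueue("https://example.com/news/123", set()))   # True
print(should_enqueue("https://example.com/about", set()))      # False
```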

11) Which crawler design pattern and architecture is better?

Design patterns are nonsense here. People say software design patterns are good because the patterns were summarized after the software had already been built; they do not really guide development. Designing a crawler around design patterns only makes the design more bloated.

As for architecture, open-source crawlers currently focus on the design of detailed data structures, such as the crawl thread pool and the task queue, which anyone can handle. The crawling business itself is too simple to be worth talking about in architectural terms.

So for Java open-source crawlers, I think you can simply pick one that is easy to use. If the business is complex, any crawler you choose will need heavy secondary development to meet the requirements.

3.3 Non-JAVA crawler

Among crawlers written in languages other than Java there are many excellent ones. They are grouped into a separate category here not to discuss their quality, but to discuss the impact of crawlers such as Larbin and Scrapy on development cost.

Let's talk about Python crawlers first. Python can do in 30 lines what takes 50 lines of Java, and writing the code is indeed fast, but in the debugging stage, debugging Python code often consumes far more time than was saved while coding. When developing in Python, you need to write more test modules to ensure correctness and stability. Of course, if the crawl scale is small and the business is not complicated, a crawler like Scrapy is a good choice and can finish the job easily.

Scrapy's architecture diagram (not reproduced here) shows the data flow in green: starting from the initial URLs, the Scheduler hands them to the Downloader to download; after downloading they are handed to the Spider for parsing, and the data to be saved is sent to the Item Pipeline for post-processing. Various middleware can be installed along the data flow to do any necessary processing. Therefore, when developing a crawler it is best to plan the modules first; my own approach is to plan the download, crawling, scheduling, and data storage modules separately.

For C++ crawlers, the learning cost is relatively high, and it cannot be counted for just one person: if the software needs team development or handover, it becomes the learning cost of many people. Debugging the software is not easy either.

There are also some Ruby and PHP crawlers, but I will not comment much on them. There are indeed some very small data collection tasks for which Ruby or PHP is very convenient. But when choosing open-source crawlers in these languages, you need to research the ecosystem, and these crawlers may have bugs that are hard to track down (few people use them and there is little documentation).

4. Anti-crawler technology

Because of the popularity of search engines, web crawlers have become a very common technology. Besides Google, Yahoo, Microsoft, and Baidu, which specialize in search, almost every large portal has its own search engine; dozens of them are household names, and there are tens of thousands of unknown ones. For a content-driven website, being visited by web crawlers is unavoidable.

Well-behaved search engine crawlers crawl at a reasonable frequency and consume few site resources, but many badly written crawlers crawl poorly, often firing dozens or hundreds of concurrent requests and fetching the same pages repeatedly. Such crawlers can be devastating for small and medium-sized websites; in particular, crawlers written by inexperienced programmers are extremely destructive, putting enormous pressure on the site and making it slow or even unreachable.

Generally, websites defend against crawlers from three angles: the headers sent in the request, user behavior, and the site's directory structure and data loading method. The first two are relatively common, and most sites fight crawlers from those angles. The third is used by sites that rely on ajax, which increases the difficulty of crawling.

4.1 Anti-crawler through Headers

Checking the request headers is the most common anti-crawler strategy. Many websites check the User-Agent header, and some also check the Referer (some resource sites implement hotlink protection by checking the Referer). If you run into this kind of mechanism, you can add headers directly in the crawler, copying the browser's User-Agent into the crawler's headers or setting the Referer to the target site's domain. Anti-crawler checks based on headers can be bypassed easily by modifying or adding headers in the crawler.

[Comment: This is often overlooked. Capture and analyze the request, work out the Referer, and add it to the simulated request headers in your program]
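A minimal sketch with `requests`; the User-Agent string is one copied from a browser and the Referer is set to the target domain, both values being examples rather than requirements:

```python
import requests

headers = {
    # Copied from a real browser; any current browser UA string will do
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0 Safari/537.36"),
    # Many hotlink-protection checks only verify that the Referer is the site's own domain
    "Referer": "https://example.com/",
}

resp = requests.get("https://example.com/data", headers=headers, timeout=10)
print(resp.status_code)
```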

4.2 Anti-crawler based on user behavior

Some websites detect user behavior instead, such as the same IP visiting the same page many times within a short period, or the same account performing the same operation many times within a short period.

[Comment: This kind of anti-crawling measure requires enough IP addresses to deal with]

Most websites fall into the first case, which can be handled with IP proxies. You can write a dedicated crawler to collect the proxy IPs published on the Internet and keep the ones that pass verification; such a proxy-IP crawler is used often, so it is worth preparing one yourself. Once you have a large number of proxy IPs, you can switch to a different IP every few requests. This is easy to do with requests or urllib2, and it lets you bypass the first kind of anti-crawling easily.

[Comment: Dynamic dialing is also a solution]
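A sketch of switching to a random proxy for each request with `requests`; the proxy addresses are placeholders standing in for your own collected and verified pool:

```python
import random

import requests

# Hypothetical pool of verified proxies collected earlier
PROXY_POOL = [
    "http://1.2.3.4:8080",
    "http://5.6.7.8:3128",
]

def fetch_with_random_proxy(url):
    proxy = random.choice(PROXY_POOL)
    proxies = {"http": proxy, "https": proxy}   # route both schemes through the proxy
    return requests.get(url, proxies=proxies, timeout=10)

# Changing the exit IP every few requests helps stay under per-IP rate limits
resp = fetch_with_random_proxy("https://example.com/list?page=1")
```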

For the second case, you can wait a random few seconds after each request before sending the next one. On some websites with logical loopholes, the restriction that the same account cannot make the same request many times in a short period can be bypassed by requesting a few times, logging out, logging in again, and continuing to request.

[Comment: Account-based anti-crawling limits are generally hard to deal with. Even with random waits of a few seconds you may still be blocked; if you have multiple accounts, switching between them works better]
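A sketch of combining random waits with rotation across several logged-in sessions; the number of accounts and the wait range are arbitrary, and the sessions are assumed to have been logged in beforehand:

```python
import random
import time

import requests

# Hypothetical pool of sessions, each assumed to be already logged in with a different account
sessions = [requests.Session() for _ in range(3)]

def polite_get(url):
    time.sleep(random.uniform(2, 6))          # random wait between requests
    session = random.choice(sessions)         # spread requests across accounts
    return session.get(url, timeout=10)
```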

4.3 Anti-crawling of dynamic pages

Most of the situations above occur on static pages, but on some websites the data we need is fetched through ajax requests or generated by JavaScript. First use Firebug or HttpFox to analyze the network requests. If we can find the ajax request and work out its parameters and the meaning of its response, we can use the method above to simulate the ajax request directly with requests or urllib2 and parse the JSON response to get the data we need.

[Comment: The network request analysis in Chrome's and IE's developer tools is also very useful]
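Once the ajax request has been identified in the network panel, replaying it is usually just an ordinary HTTP call. A sketch with `requests`, where the endpoint, parameters, and response fields are hypothetical reconstructions of what such an analysis might reveal:

```python
import requests

# Endpoint and parameters as observed in the browser's network panel (hypothetical)
api_url = "https://example.com/api/list"
params = {"page": 1, "size": 20}
headers = {"X-Requested-With": "XMLHttpRequest"}   # some endpoints check this header

resp = requests.get(api_url, params=params, headers=headers, timeout=10)
data = resp.json()                                  # the response body is JSON
for item in data.get("items", []):                  # field names depend on the site
    print(item)
```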

Being able to simulate the ajax request directly to get the data is great, but some websites encrypt all the parameters of their ajax requests, so we simply cannot construct the requests for the data we need. The site I crawled recently was like this: besides encrypting the ajax parameters, it also wrapped its basic functionality so that everything goes through its own interfaces, and the interface parameters are all encrypted. For such a website the above methods do not work, so I use the selenium + PhantomJS framework to drive a browser kernel and let PhantomJS execute the JS, simulating human operations and triggering the scripts in the page. From filling in forms to clicking buttons to scrolling the page, everything can be simulated without worrying about the specific request and response process; it simply reproduces the whole process of a person browsing the page to get the data.

[Comment: Seconding PhantomJS]

Using this framework can bypass most anti-crawler measures, because it is not pretending to be a browser to get data (adding headers as described above is, to some extent, pretending to be a browser); it really is a browser. PhantomJS is simply a browser without a user interface, one that is not driven by a human. With selenium + PhantomJS you can do many other things as well, such as recognizing click-based (12306) or slider CAPTCHAs, brute-forcing page forms, and so on. It will also prove useful in automated penetration testing, which will be covered later.
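A sketch of this approach with selenium. The original post uses PhantomJS; since PhantomJS is no longer maintained and recent selenium releases have dropped support for it, headless Chrome is shown here as an equivalent stand-in, and the page URL and form fields are hypothetical:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("--headless")            # no visible browser window
driver = webdriver.Chrome(options=options)

driver.get("https://example.com/login")       # hypothetical page
driver.find_element(By.NAME, "username").send_keys("user")
driver.find_element(By.NAME, "password").send_keys("secret")
driver.find_element(By.CSS_SELECTOR, "button[type=submit]").click()

# Scroll so that lazily loaded content is rendered by the page's own JS
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

html = driver.page_source                     # fully rendered HTML, JS included
driver.quit()
```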


Origin: blog.csdn.net/Z987421/article/details/133354105