Crawler from entry to jail (5) - multi-threaded crawler and common search algorithms

The content of this article is based on "Python crawler development".

5.1 Multi-threaded crawler

5.1.1 Advantages of Multithreading

After mastering requests and regular expressions, you can start crawling some simple URLs in practice.
However, such a crawler has only one process and one thread, so it is called a single-threaded crawler. A single-threaded crawler visits only one page at a time and cannot make full use of the computer's network bandwidth. A page is usually only a few hundred KB, so while the crawler waits for one page, the spare bandwidth and the time between sending the request and receiving the source code are wasted. If the crawler could access 10 pages at the same time, the crawling speed would be roughly 10 times higher. Multi-threading is the technique used to achieve this.

The Python language has a Global Interpreter Lock (GIL), which makes Python's multi-threading pseudo multi-threading: at any moment only one thread actually executes. The interpreter runs one thread for a few milliseconds, saves its state, switches to another thread, runs it for a few milliseconds, and so on; after a full round it returns to the first thread, restores its state, and continues. Microscopically the execution is serial, but macroscopically it looks as if several things are happening at the same time. This mechanism has little effect on I/O (Input/Output) intensive operations, but for CPU-intensive operations it hurts performance badly, because only one CPU core can be used. Therefore, computation-intensive programs should use multiple processes; Python's multi-processing is not affected by the GIL. Crawlers, however, are I/O-intensive programs, so multi-threading can greatly improve crawling efficiency.
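To see the GIL effect concretely, here is a minimal sketch that times the same CPU-bound task with a thread pool and with a process pool. The helper name cpu_task, the pool sizes, and the workload size are illustrative assumptions, not from the original article.

import time
from multiprocessing import Pool as ProcessPool
from multiprocessing.dummy import Pool as ThreadPool

def cpu_task(n):
    # Pure computation, no I/O: threads gain nothing here because of the GIL.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    jobs = [5_000_000] * 4

    start = time.time()
    with ThreadPool(4) as pool:
        pool.map(cpu_task, jobs)
    print('threads:   {:.2f}s'.format(time.time() - start))  # limited by the GIL

    start = time.time()
    with ProcessPool(4) as pool:
        pool.map(cpu_task, jobs)
    print('processes: {:.2f}s'.format(time.time() - start))  # can use multiple cores

On a multi-core machine the process pool should finish noticeably faster, while the thread pool takes roughly as long as running the tasks one after another.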

5.1.2 Multiprocessing library: multiprocessing

multiprocessing is Python's library for handling operations related to multiple processes. However, processes cannot directly share memory or stack resources with each other, and starting a new process costs far more than starting a thread, so for crawling, multi-threading has more advantages than multi-processing.

multiprocessing has a dummy submodule that exposes the multiprocessing API but runs on threads instead of processes. dummy provides a Pool class that implements a thread pool, and this thread pool has a map() method that lets all threads in the pool execute a function "simultaneously".

For example, with an ordinary for loop you would compute the squares of a range of numbers like this:

for i in range(10):
	print(i*i)

This certainly produces the result, but the squares are computed one after another, which is not efficient. If you want many numbers to be squared "at the same time" using multi-threading, you can use multiprocessing.dummy:

Example of using multithreading:

from multiprocessing.dummy import Pool

def cal_pow(num):
    # Function executed by each thread in the pool.
    return num * num

pool = Pool(3)                     # thread pool with 3 threads
nums = list(range(10))             # the 10 numbers to square
result = pool.map(cal_pow, nums)   # distribute the work across the threads
print('{}'.format(result))         # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In the code above, a function that computes a square is defined first, and then a thread pool with 3 threads is created. The three threads are responsible for squaring the 10 numbers: whichever thread finishes its current number first takes the next one, until all the numbers have been processed.

In this example, the thread pool's map() method takes two arguments: the first is the function and the second is a list. Note that the first argument is only the name of the function and must not be followed by parentheses. The second argument is an iterable object, and each element of the iterable is passed as an argument to cal_pow(). Besides lists, tuples, sets, or dictionaries can also be used as the second argument to map().
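As a quick, self-contained sketch of this point, the snippet below repeats the cal_pow function above and feeds map() a tuple, a set, and a dictionary (the values are illustrative):

from multiprocessing.dummy import Pool

def cal_pow(num):
    return num * num

pool = Pool(3)
print(pool.map(cal_pow, (2, 3, 4)))         # tuple -> [4, 9, 16]
print(pool.map(cal_pow, {5, 6}))            # set -> squares of 5 and 6 (order not guaranteed)
print(pool.map(cal_pow, {7: 'a', 8: 'b'}))  # dict -> iterates over the keys: [49, 64]
# Passing cal_pow() with parentheses would be wrong: it calls the function
# immediately instead of handing the function itself to map().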

5.1.3 Multi-threaded crawler development

A crawler is an I/O-intensive operation, especially when it requests the source code of a web page. With a single thread, a lot of time is wasted waiting for the page to come back, so applying multi-threading to a crawler can greatly improve its efficiency. As an analogy: the washing machine takes 50 minutes to wash the clothes, the kettle takes 15 minutes to boil the water, and memorizing vocabulary takes 1 hour. If you first wait for the washing machine to finish, then boil the water after the clothes are washed, and only start memorizing after the water has boiled, it takes 125 minutes in total.

But looked at another way, the three things can run at the same time. Suppose you could split off two clones of yourself: one puts the clothes in the washing machine and waits for it to finish, the other boils the water and waits for it to boil, while you memorize the vocabulary yourself. When the water boils, the clone responsible for boiling water disappears first; when the washing machine finishes, the clone responsible for the laundry disappears; finally, you finish memorizing. Doing the three things at the same time takes only 60 minutes.

Of course, this example is not what actually happens in real life; nobody splits into clones. In reality, while memorizing vocabulary you simply concentrate on it; the kettle whistles when the water boils and the washing machine beeps when the clothes are done, so you only act when you are reminded and never need to check every minute. These two differences are exactly the difference between multi-threading and the event-driven asynchronous model. This section covers multi-threaded operation; crawler frameworks that use asynchronous operation will be discussed later. For now, just remember that when the number of concurrent actions is small, the two approaches perform about the same, but once the number of actions grows very large, the gains from multi-threading shrink and can even become worse than single-threading. At that point, asynchronous operation is the only solution.

The following two pieces of code compare the performance of a single-threaded crawler and a multi-threaded crawler when crawling the bd homepage.
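The original comparison was given as a screenshot; below is a minimal sketch of that kind of comparison, in which the URL, the request count of 100, and the timing code are illustrative assumptions:

import time
import requests
from multiprocessing.dummy import Pool

URL = 'https://www.baidu.com'

def query(url):
    requests.get(url, timeout=10)

# Single-threaded version: request the page 100 times in a row
start = time.time()
for _ in range(100):
    query(URL)
print('single thread: {:.2f}s'.format(time.time() - start))

# Multi-threaded version: the same 100 requests spread over a pool of 5 threads
start = time.time()
pool = Pool(5)
pool.map(query, [URL] * 100)
print('5 threads:     {:.2f}s'.format(time.time() - start))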
From the running results, one thread takes about 16.2 s and five threads take about 3.5 s, roughly one-fifth of the single-threaded time. The effect of 5 threads "running at the same time" is clear from the timing. But this does not mean that the bigger the thread pool, the better. The results also show that five threads take slightly more than one-fifth of the single-threaded time; the extra bit is the cost of thread switching. This reflects, from another angle, that Python's multi-threading is still serial at the micro level. If the thread pool is set too large, the overhead of thread switching may cancel out the performance gains from multi-threading. There is no exact figure for the right pool size; readers should test different sizes in their specific application scenario and compare the results to find the most suitable value.

5.2 Common search algorithms for crawlers

5.2.1 Depth-first search

Suppose you need to crawl the course information of an online education website. Starting from the homepage, the courses are divided into several major categories by language, such as Python, Node.js, and Golang. Under each major category there are many courses, for example crawler, Django, and machine learning under Python, and each course is divided into many lessons.

With depth-first search, the crawling route is shown in the figure below (following the serial numbers from small to large).

[Figure: depth-first crawling order over the category → course → lesson tree]
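A minimal sketch of depth-first crawling over such a category tree; the nested dictionary stands in for pages the crawler would discover and is an illustrative assumption, not the actual website:

site = {
    'Python': {'crawler': ['lesson 1', 'lesson 2'],
               'Django': ['lesson 1'],
               'machine learning': ['lesson 1']},
    'Node.js': {'basics': ['lesson 1']},
    'Golang': {'basics': ['lesson 1']},
}

def dfs(node, depth=0):
    # Visit a node, then immediately go deeper before moving to its siblings.
    if isinstance(node, dict):
        for name, child in node.items():
            print('  ' * depth + name)
            dfs(child, depth + 1)
    else:  # a list of lessons: the deepest level
        for lesson in node:
            print('  ' * depth + lesson)

dfs(site)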

5.2.2 Breadth-first search

With breadth-first search, the crawling order is shown in the figure below.

[Figure: breadth-first crawling order over the same tree]
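A corresponding sketch of breadth-first crawling over the same illustrative tree, using a queue so that each level is fully visited before descending to the next:

from collections import deque

site = {
    'Python': {'crawler': ['lesson 1', 'lesson 2'],
               'Django': ['lesson 1']},
    'Node.js': {'basics': ['lesson 1']},
}

def bfs(root):
    queue = deque([('root', root)])
    while queue:
        name, node = queue.popleft()
        print(name)
        if isinstance(node, dict):       # enqueue children to visit after this level
            queue.extend(node.items())
        elif isinstance(node, list):     # lessons: the deepest level
            queue.extend((lesson, None) for lesson in node)

bfs(site)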

5.2.3 Algorithm selection

For example, suppose you need to crawl all the restaurants on a website plus the order information of each restaurant. With the depth-first algorithm, you would reach restaurant A from some link and then immediately crawl restaurant A's orders. Since there are hundreds of thousands of restaurants across the country, crawling them all might take 12 hours. The problem this causes is that restaurant A's order volume might be crawled at 8 a.m. while restaurant B's is crawled at 8 p.m., so their order figures are 12 hours apart. For popular restaurants, 12 hours can mean a difference of millions, and this time gap makes it hard to compare the sales performance of restaurants A and B in later data analysis. The list of restaurants changes far more slowly than the order volume, so with breadth-first search you can first crawl all the restaurants from midnight until noon the next day, and then concentrate on crawling each restaurant's order volume from 14:00 to 20:00 the next day. That way the order-crawling task takes only 6 hours, which narrows the order-volume differences caused by the time gap. At the same time, because the restaurant list only needs to be crawled once every few days, the impact is small and the total number of requests goes down, making the crawler harder for the website to detect.

As another example, to analyze real-time public opinion you need to crawl Baidu Tieba. A popular Tieba may have tens of thousands of pages of posts, with the earliest posts dating back to 2010. With breadth-first search you would first collect the titles and URLs of all posts in the Tieba and then enter each post via those URLs to get the content of every floor (reply). But since the goal is real-time public opinion, posts from 7 years ago are of little value for the current analysis; the new posts matter most and should be grabbed first. Therefore, for crawling Tieba content, depth-first search is the better choice: when you see a post, go in immediately, crawl the information of each floor, and move on to the next post once this one is done. Of course, the two search algorithms are not mutually exclusive; they should be chosen flexibly according to the actual situation, and in many cases they can be used together.


Origin: blog.csdn.net/weixin_55159605/article/details/124147908