4. Write a web crawler in Python and download pages concurrently

Table of contents

Foreword

4.1 1 million web pages

4.1.1 Parse the Alexa list

4.2 Serial crawler

4.3 Multi-threaded crawler

4.3.1 How threads and processes work

4.3.2 Implementation

4.3.3 Multi-process crawler

4.4 Performance

4.5 Chapter Summary


Foreword

    In the previous chapters, our crawlers downloaded web pages serially, starting a new download only after the previous one had completed. Serial downloading is fine when crawling a small sample site, but it becomes a bottleneck on larger sites. Crawling a large website of 1 million web pages at a rate of one page per second, downloading day and night, would take more than 11 days. If we could download multiple web pages at the same time, the total download time would improve significantly.
    This chapter describes two approaches to downloading web pages concurrently, using multiple threads and multiple processes, and compares their performance with serial downloading.

4.1 1 million web pages

    To test the performance of concurrent downloading, it is best to have a large target website. For this purpose, this chapter uses the list of the top 1 million websites provided by Alexa, which ranks sites according to the traffic of users who have the Alexa Toolbar installed. Since only a small proportion of users run this browser plug-in, the data is not authoritative, but it is sufficient for our test.
    We can get this data by browsing the Alexa website at http://www.alexa.com/topsites. Alternatively, we can download the list directly as a compressed file from http://s3.amazonaws.com/alexa-static/top-1m.csv.zip, so that we do not need to scrape the Alexa website ourselves.

4.1.1 Parse the Alexa list

The Alexa website list is provided as a spreadsheet containing two columns, namely ranking and domain name, as shown in the figure below.

 

Extracting this data involves the following four steps.
1. Download the .zip file.
2. Extract the CSV file from the .zip file.
3. Parse the CSV file.
4. Iterate over each row of the CSV file and extract the domain name from it.
Below is the code that implements the above functionality.
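    A minimal Python 3 sketch of these four steps could look like the following. It fetches the archive with the standard urllib module rather than the cached, throttled downloader developed in earlier chapters, and io.BytesIO plays the role of the file-like wrapper (StringIO in Python 2) discussed below.

import csv
import io
import urllib.request
from zipfile import ZipFile

ALEXA_ZIP_URL = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

def get_alexa_urls():
    """Return the Alexa top sites as a list of full URLs."""
    # 1. Download the .zip file.
    zipped_data = urllib.request.urlopen(ALEXA_ZIP_URL).read()
    urls = []
    # Wrap the raw bytes in a file-like object so ZipFile can read them.
    with ZipFile(io.BytesIO(zipped_data)) as zf:
        # 2. The archive contains a single CSV file; take the first (only) filename.
        csv_filename = zf.namelist()[0]
        # 3. Parse the CSV file.
        with zf.open(csv_filename) as f:
            reader = csv.reader(io.TextIOWrapper(f))
            # 4. Each row holds (rank, domain); keep the domain and turn it into a valid URL.
            for _, website in reader:
                urls.append('http://' + website)
    return urls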

 

    You may have noticed that the downloaded compressed data is wrapped with StringIO before being passed to ZipFile. This is because ZipFile expects a file-like interface rather than a raw string. Next, we extract the name of the CSV file from the archive's list of filenames. Since the .zip file contains only a single file, we can simply take the first filename. The file is then traversed, and the domain name in the second column is added to the list of URLs. To make the URLs valid, we also prepend the http:// protocol to each domain name.
    To reuse this function in the crawlers developed earlier, the scrape_callback interface needs to be modified slightly.
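    A sketch of what such a callback might look like, assuming the crawler calls it as scrape_callback(url, html) with the raw response body (bytes) and expects a list of new links in return; the class name AlexaCallback and the default of 1000 URLs follow the surrounding text.

import csv
import io
from zipfile import ZipFile

class AlexaCallback:
    """Scrape callback that turns the downloaded Alexa archive into a list of URLs."""

    def __init__(self, max_urls=1000):
        self.max_urls = max_urls
        self.seed_url = 'http://s3.amazonaws.com/alexa-static/top-1m.csv.zip'

    def __call__(self, url, html):
        if url != self.seed_url:
            return []
        urls = []
        # `html` is assumed to be the raw bytes of the downloaded .zip file.
        with ZipFile(io.BytesIO(html)) as zf:
            csv_filename = zf.namelist()[0]
            with zf.open(csv_filename) as f:
                for _, website in csv.reader(io.TextIOWrapper(f)):
                    urls.append('http://' + website)
                    if len(urls) == self.max_urls:
                        break
        return urls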

 

    A new input argument, max_urls, is added here, which sets the number of URLs extracted from the Alexa file. By default this value is set to 1000, because downloading 1 million web pages takes far too long (as noted at the beginning of this chapter, a serial download would take more than 11 days).

4.2 Serial crawler

    The following code shows how the previously developed link crawler uses the AlexaCallback when downloading serially.
The complete source code can be obtained from https://bitbucket.org/wswp/code/src/tip/chapter04/sequential_test.py. We can execute the following command on the command line to run the script.
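    A sketch of what such a test script might contain, assuming the link_crawler() function developed in Chapter 1 and the AlexaCallback sketched above; the import paths are placeholders. The run can then be timed with the Unix time utility.

# sequential_test.py (sketch): crawl the Alexa list serially, one URL at a time
from link_crawler import link_crawler      # hypothetical module from Chapter 1
from alexa_callback import AlexaCallback   # hypothetical module holding the callback above

scrape_callback = AlexaCallback()
link_crawler(seed_url=scrape_callback.seed_url, scrape_callback=scrape_callback)

$ time python sequential_test.py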

    Judging from the execution results, the serial download takes an average of about 1.6 seconds per URL.

4.3 Multi-threaded crawler

    Now we extend the crawler that downloads web pages serially so that it downloads them in parallel. Note that, if misused, a multi-threaded crawler can request content so quickly that it overloads the web server or gets our IP address banned. To avoid this problem, our crawler uses a delay flag that enforces a minimum interval between requests to the same domain.
    The Alexa website list used as the example in this chapter does not suffer from this problem, because it contains 1 million different domains. However, when you crawl different web pages under the same domain in the future, you should keep a delay of at least 1 second between downloads.

4.3.1 How threads and processes work

The figure below shows the execution of a process with multiple threads.

 

    When a Python script or other computer program is run, a process is created that contains the code and its state. These processes are executed by one or more CPUs of the computer. However, each CPU runs only one process at a time and switches rapidly between processes, which gives the impression that multiple programs are running simultaneously. Similarly, within a process, execution switches between threads, with each thread executing a different part of the program. This means that while one thread is waiting for a web page to download, the process can switch to another thread and avoid wasting CPU time. Therefore, to download data as quickly as possible using all the resources of the computer, we need to spread the downloads across multiple processes and threads.
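    As a minimal illustration of this idea (the URLs and the two-thread pool are only placeholders), the sketch below downloads a handful of pages from two worker threads; whenever one thread blocks waiting for a response, the interpreter can run the other.

import threading
import urllib.request

# placeholder URLs; any list of pages would do
crawl_queue = ['http://example.com', 'http://example.org', 'http://example.net']

def worker():
    while True:
        try:
            url = crawl_queue.pop()            # take the next URL (list.pop is atomic)
        except IndexError:
            break                              # nothing left to download
        html = urllib.request.urlopen(url).read()
        print(url, len(html))                  # while this thread waited on I/O, others could run

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()                                   # wait for both worker threads to finish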

4.3.2 Implementation

    Fortunately, multithreaded programming is relatively simple to implement in Python. We can keep a queue structure similar to the link crawler developed in Chapter 1, but start the crawl loop in multiple threads so that the links are downloaded in parallel. The following code is the start of the modified link crawler, with the crawl loop moved inside a function.
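    A sketch of how this can be structured; the download() helper here is a bare-bones stand-in for the cached, throttled downloader built in the earlier chapters, and the thread-pool loop at the bottom is the part discussed in the paragraphs that follow.

import time
import threading
import urllib.request

SLEEP_TIME = 1      # seconds to wait before re-checking the thread pool

def download(url):
    # bare-bones stand-in for the cached, throttled downloader from earlier chapters
    try:
        return urllib.request.urlopen(url, timeout=30).read()
    except Exception:
        return None

def threaded_crawler(seed_url, scrape_callback=None, max_threads=10):
    crawl_queue = [seed_url]            # URLs that still need to be crawled
    seen = set(crawl_queue)             # URLs that have already been queued

    def process_queue():
        # the crawl loop, now running inside each worker thread
        while True:
            try:
                url = crawl_queue.pop()
            except IndexError:
                break                   # queue is currently empty, so this thread exits
            html = download(url)
            if html and scrape_callback:
                for link in scrape_callback(url, html) or []:
                    if link not in seen:
                        seen.add(link)
                        crawl_queue.append(link)

    threads = []
    while threads or crawl_queue:
        # discard threads that have finished
        threads = [t for t in threads if t.is_alive()]
        # start new threads while there is spare capacity and work to do
        while len(threads) < max_threads and crawl_queue:
            thread = threading.Thread(target=process_queue)
            thread.daemon = True        # so Ctrl-C can still stop the main thread
            thread.start()
            threads.append(thread)
        time.sleep(SLEEP_TIME)          # give the worker threads time to run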

 

    The code then starts multiple crawler threads and waits for them to complete.
    While there are URLs to crawl, the loop above keeps creating threads until the maximum size of the thread pool is reached. During the crawl, a thread will stop early if the current queue happens to contain no more URLs. Suppose we have 2 threads and 2 URLs to download. When the first thread finishes its download, the crawl queue is empty, so that thread exits. The second thread finishes its download later but discovers another URL to download. The thread loop then notices that there are still URLs to crawl and that the number of threads has not reached the maximum, so it creates a new download thread.
    The test code for the threaded_crawler interface can be obtained from https://bitbucket.org/wswp/code/src/tip/chapter04/threaded_test.py. Now, let us test the performance of the multi-threaded version of the link crawler using the following command.
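    The exact invocation depends on how threaded_test.py is written; a typical timed run, with the thread count of 5 mentioned below configured inside the script, might look like this.

$ time python threaded_test.py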

 

    Since we used 5 threads, the download is almost 5 times faster than the serial version. Multi-threaded performance is analyzed further in Section 4.4.

4.3.3 Multi-process crawler

    To further improve performance, we extend the multi-threaded example to support multiple processes. Currently, the crawl queue is held in local memory, so other processes cannot contribute to the same crawl. To solve this problem, the crawl queue is moved into MongoDB. Storing the queue separately means that even crawlers running on different servers can collaborate on the same crawl.
    Note that for a more robust queue, you should consider a dedicated message-queuing tool such as Celery. However, to keep the variety of technologies covered in this book to a minimum, we reuse MongoDB here. The following is the queue code implemented on top of MongoDB.
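    A sketch of such a queue using pymongo; the method names push(), pop(), complete() and repair() match the description that follows, and the atomic find_one_and_update() call is what keeps two workers from claiming the same URL.

from datetime import datetime, timedelta
from pymongo import MongoClient, ReturnDocument, errors

class MongoQueue:
    # the three states a queued URL can be in
    OUTSTANDING, PROCESSING, COMPLETE = range(3)

    def __init__(self, client=None, timeout=300):
        self.client = MongoClient() if client is None else client
        self.db = self.client.cache
        self.timeout = timeout      # seconds before a PROCESSING URL is considered stalled

    def __bool__(self):
        # True while any URL has not yet been completed
        return self.db.crawl_queue.find_one({'status': {'$ne': self.COMPLETE}}) is not None

    def push(self, url):
        # add a new URL; duplicates are silently ignored
        try:
            self.db.crawl_queue.insert_one({'_id': url, 'status': self.OUTSTANDING})
        except errors.DuplicateKeyError:
            pass

    def pop(self):
        # atomically claim an OUTSTANDING URL and mark it as PROCESSING
        record = self.db.crawl_queue.find_one_and_update(
            {'status': self.OUTSTANDING},
            {'$set': {'status': self.PROCESSING, 'timestamp': datetime.now()}},
            return_document=ReturnDocument.AFTER)
        if record is None:
            self.repair()
            raise KeyError('queue is empty')
        return record['_id']

    def complete(self, url):
        # record that this URL has been downloaded and processed
        self.db.crawl_queue.update_one({'_id': url}, {'$set': {'status': self.COMPLETE}})

    def repair(self):
        # release URLs whose processing exceeded the timeout back to OUTSTANDING
        self.db.crawl_queue.update_many(
            {'timestamp': {'$lt': datetime.now() - timedelta(seconds=self.timeout)},
             'status': self.PROCESSING},
            {'$set': {'status': self.OUTSTANDING}})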

 

 

 

    The queue in the above code defines three states: OUTSTANDING, PROCESSING, and COMPLETE. When a new URL is added, its state is OUTSTANDING; when a URL is taken off the queue to be downloaded, its state is PROCESSING; and when the download is complete, its state is COMPLETE. Much of the implementation deals with URLs that are taken off the queue but whose processing never finishes normally, for example because the process handling them is terminated. To avoid losing these URLs, the class uses a timeout parameter, with a default value of 300 seconds. In the repair() method, if a URL has been processing for longer than this timeout, we assume something went wrong and reset its state to OUTSTANDING so that it can be processed again.

    To support this new queue type, a few small modifications to the multi-threaded crawler code are required, as shown in the following code.
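    In sketch form, and relative to the threaded_crawler sketched earlier (MongoQueue, download() and SLEEP_TIME are defined in the sketches above), the changes amount to swapping the in-memory list and the seen set for the MongoQueue and calling complete() once a URL has been handled.

import time
import threading

def threaded_crawler(seed_url, scrape_callback=None, max_threads=10):
    crawl_queue = MongoQueue()       # shared, persistent queue instead of an in-memory list
    crawl_queue.push(seed_url)

    def process_queue():
        while True:
            try:
                url = crawl_queue.pop()
            except KeyError:
                break                # currently no URLs to process
            html = download(url)
            if html and scrape_callback:
                for link in scrape_callback(url, html) or []:
                    crawl_queue.push(link)   # MongoQueue ignores duplicate URLs itself
            crawl_queue.complete(url)        # record that this URL has been handled

    # the thread-pool loop is unchanged from the earlier version
    threads = []
    while threads or crawl_queue:
        threads = [t for t in threads if t.is_alive()]
        while len(threads) < max_threads and crawl_queue:
            thread = threading.Thread(target=process_queue)
            thread.daemon = True
            thread.start()
            threads.append(thread)
        time.sleep(SLEEP_TIME)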

 

    The first change is to replace the Python built-in queue with a new one based on MongoDB, named MongoQueue here. Since the queue handles duplicate URLs internally, the seen variable is no longer needed. Finally, the complete() method is called after URL processing to record that the URL has been successfully parsed.
    The updated multithreaded crawler can also start multiple processes, as shown in the code below.
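    A sketch using the standard multiprocessing module, where threaded_crawler is the MongoQueue-backed version sketched above.

import multiprocessing

def process_crawler(*args, **kwargs):
    num_cpus = multiprocessing.cpu_count()
    print('Starting %d crawl processes' % num_cpus)
    processes = []
    for _ in range(num_cpus):
        # each process runs its own pool of crawler threads
        p = multiprocessing.Process(target=threaded_crawler, args=args, kwargs=kwargs)
        p.start()
        processes.append(p)
    # wait for all crawl processes to finish
    for p in processes:
        p.join()

    On platforms that spawn new processes rather than forking (such as Windows), the call to process_crawler() should sit under an if __name__ == '__main__': guard.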

 

    The structure of this code looks familiar, because the multiprocessing module has an interface similar to the threading module used earlier. The code first finds the number of available CPUs, starts a multi-threaded crawler in a new process for each of them, and then waits for all the processes to finish executing.
    Now, let us test the performance of the multi-process version of the link crawler using the following command. The interface for testing process_link_crawler is the same as for the multi-threaded crawler before, and the code can be obtained from https://bitbucket.org/wswp/code/src/tip/chapter04/process_test.py.
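    As before, the exact command depends on how the test script is parameterized; a typical timed run might look like this.

$ time python process_test.py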
    The test server used here has 2 CPUs, and the running time is about half that of the previous multi-threaded crawler run in a single process. In the next section, we further investigate the relative performance of these three approaches.

 

4.4 Performance

    To understand how increasing the number of threads and processes affects the download time, we compared the results of crawling 1000 web pages, as shown in the table below.

 

    The last column of the table gives the download time relative to the serial case. As we can see, the performance gain is not linearly proportional to the number of threads and processes, but tends to be logarithmic. For example, with 1 process and 5 threads, performance is about 4 times that of the serial crawler, while using 20 threads is only about 10 times faster than a serial download. Each newly added thread helps, but its effect is smaller than that of the threads added before it. This is to be expected, because the process has to switch between more threads and can dedicate less time to each one. In addition, the download bandwidth is limited, so eventually adding new threads will not lead to faster download speeds. To achieve better performance at that point, the crawler needs to be deployed across multiple servers, all pointing to the same MongoDB queue instance.

4.5 Chapter Summary

In this chapter, we explained why serial downloading becomes a bottleneck, and then presented ways to download a large number of web pages efficiently with multiple threads and multiple processes.
