A Web Crawler With asyncio Coroutines (1)

A. Jesse Jiryu Davis is an engineer at MongoDB in New York. He wrote Motor, the asynchronous MongoDB Python driver, and was the development lead for the MongoDB C driver and a member of the PyMongo team. He has also contributed to asyncio and Tornado, and writes at http://emptysqua.re.

Introduction

Guido van Rossum, creator of Python, one of the mainstream programming languages, is known in the Python community as the BDFL (Benevolent Dictator For Life), a title that comes straight from a Monty Python skit. His home page is http://www.python.org/~guido/.

Classical computer science emphasizes efficient algorithms that complete computations as quickly as possible. But many networked programs spend their time not computing, but holding open many connections that are slow or have infrequent events. These programs present a very different challenge: to wait for a huge number of network events efficiently. A modern approach to this problem is asynchronous I/O.

In this chapter we implement a simple web crawler. The crawler is an archetypal asynchronous application, because it waits for many responses but does little computation. The more pages it can fetch at once, the sooner it finishes. If it devoted a thread to each in-flight request, then as the number of concurrent requests grew it would run out of memory or other thread-related resources before it ran out of sockets. Using asynchronous I/O avoids this problem.

We present the example in three stages. First, we implement an event loop and use it, with callbacks, to sketch out a web crawler. It works, but extending it to more complex problems leads to unmanageable spaghetti code. Second, since we want code that is both efficient and easy to extend, we implement simple coroutines with Python generator functions. In the final stage, we use the full-featured coroutines from Python's standard "asyncio" library and coordinate them with an asynchronous queue to complete the crawler. (Guido introduced the standard asyncio library, then called "Tulip", at PyCon 2013.)

Task

A web crawler finds and downloads all the pages on a website, perhaps to archive or index them. Beginning with a root URL, it fetches each page, parses it for links to pages it has not encountered yet, and adds these to a queue. It stops when it fetches a page with no unseen links and the queue is empty.
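In its simplest, serial form the crawl is just a worklist loop. Here is a minimal sketch of that idea, assuming a hypothetical fetch(url) that returns a page body and a parse_links helper (the blocking fetch shown later stores its result rather than returning it):

def crawl(root_url):
    seen_urls = {root_url}       # every URL we have ever queued
    urls_todo = [root_url]       # URLs not fetched yet
    while urls_todo:
        url = urls_todo.pop()
        body = fetch(url)        # hypothetical: downloads and returns the page
        for link in parse_links(body):
            if link not in seen_urls:
                seen_urls.add(link)
                urls_todo.append(link)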

We can speed this up by downloading many pages concurrently. As the crawler finds new links, it launches fetch operations for them on new sockets in parallel, parses the responses, and adds the newly found links to the queue. Too much concurrency can degrade performance, so we cap the number of concurrent requests and leave the remaining links in the queue until some in-flight fetches complete.

The Traditional Approach

How do we make the crawler concurrent? The traditional answer is to create a thread pool, where each thread is in charge of downloading one page at a time over a socket. For example, to download a page from xkcd.com:

import socket

def fetch(url):
    sock = socket.socket()
    sock.connect(('xkcd.com', 80))
    request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)
    # Page is now downloaded.
    links = parse_links(response)
    q.add(links)

Socket operations are blocking by default: when the thread calls a method like connect or recv, it blocks until the operation completes. (Even send can block, for example if the recipient is slow to accept outgoing messages and the system's buffer of outgoing data is full.) Consequently, to download many pages at once, we need many threads. A sophisticated application amortizes the cost of thread creation by keeping idle threads in a thread pool, and does the same for sockets with a connection pool.
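For a sense of what the thread-per-download approach looks like in practice, here is a minimal sketch using the standard library's ThreadPoolExecutor; it reuses the blocking fetch above, and the list of URLs is purely illustrative:

from concurrent.futures import ThreadPoolExecutor

urls = ['/353/', '/1/', '/2/']  # illustrative URLs to crawl

with ThreadPoolExecutor(max_workers=10) as pool:
    # Each call to fetch runs on a pooled thread, blocking that
    # thread on connect/recv while its download is in progress.
    pool.map(fetch, urls)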

Threads are expensive, though, and operating systems enforce a variety of hard caps on the number of threads a process, a user, or a machine may have. On the author Jesse's system, a Python thread costs around 50K of memory, and starting tens of thousands of threads fails. The per-thread overhead and these system limits are the bottleneck of this approach.

Dan Kegel's influential article "The C10K problem" outlines the limitations of multithreading for I/O concurrency. He begins:

It's time for web servers to handle ten thousand clients simultaneously, don't you think? After all, the web is a big place now.

Kegel coined the term "C10K" in 1999. Ten thousand connections sounds quaint today, but the problem has changed only in size, not in kind. Back then, using a thread per connection for C10K was impractical; today the cap is orders of magnitude higher. Indeed, our toy web crawler would work just fine with threads. For very large-scale applications with hundreds of thousands of connections, however, the cap remains: there is a point beyond which most systems can still create sockets but have run out of threads. How can we overcome this?

Async

Asynchronous I/O frameworks perform concurrent operations in one thread. Let's see how this is done.

Asynchronous frameworks use non-blocking sockets. In the asynchronous crawler, we make the socket non-blocking before initiating the connection to the server:

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass

Irritatingly, calling connect on a non-blocking socket raises an exception immediately, even when it is working normally. This exception replicates the behavior of the underlying C function, which sets errno to EINPROGRESS to tell you the operation has begun.
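If the try/except looks odd, note that the standard library also offers socket.connect_ex, which returns the error code instead of raising. A minimal sketch of the equivalent check:

import errno
import socket

sock = socket.socket()
sock.setblocking(False)
err = sock.connect_ex(('xkcd.com', 80))
# On Unix an in-progress non-blocking connect reports EINPROGRESS;
# Windows reports EWOULDBLOCK instead.
print(err == errno.EINPROGRESS)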

Now our crawler needs a way to know when the connection is established so it can send HTTP requests. We can simply use a loop to retry:

request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
encoded = request.encode('ascii')
while True:
    try:
        sock.send(encoded)
        break  # Done.
    except OSError as e:
        pass
print('sent')

This approach not only wastes CPU, it also cannot efficiently wait for events on multiple sockets. In ancient times, BSD Unix's solution to this problem was select, a C function that waits for an event to occur on a non-blocking socket or a small array of them. Nowadays the demand for Internet applications with huge numbers of connections has led to replacements such as poll, then kqueue on BSD and epoll on Linux. These APIs are similar to select, but they perform well even with very large numbers of connections.
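Python's selectors module, introduced next, wraps whichever of these mechanisms your platform provides. A quick way to see which one you are getting, assuming Python 3.4 or later:

import selectors
print(type(selectors.DefaultSelector()).__name__)
# e.g. 'EpollSelector' on Linux, 'KqueueSelector' on BSD and macOS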

Python 3.4's DefaultSelector uses the best select-like function available on your system. To register for notification of a network I/O event, we create a non-blocking socket and register it with the default selector:

from selectors import DefaultSelector, EVENT_WRITE

selector = DefaultSelector()

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass

def connected():
    selector.unregister(sock.fileno())
    print('connected!')

selector.register(sock.fileno(), EVENT_WRITE, connected)

We disregard this spurious error and call selector.register, passing in the socket's file descriptor and a constant expressing which event we are waiting for. To be notified when the connection is established, we pass EVENT_WRITE: that is, we want to know when the socket becomes "writable". We also pass a Python function, connected, to run when that event occurs. Such functions are known as callbacks.

In a loop, we process I/O notifications when the selector receives them.

def loop():
    while True:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback()

The connected callback is stored as event_key.data; we retrieve and execute it once the non-blocking socket is connected.
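For clarity, event_key is a selectors.SelectorKey, the namedtuple that register returns, and its data field carries whatever object we handed in; that is how the callback rides along. Had we captured the return value of the register call above, we could have inspected it:

key = selector.register(sock.fileno(), EVENT_WRITE, connected)
print(key.fd, key.events, key.data)  # the descriptor, EVENT_WRITE, and our callback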

Unlike the fast-spinning loop we had earlier, the call to select here pauses, awaiting the next I/O event; then the loop runs the callbacks that were waiting for those events. Operations that have not completed remain pending until some future tick of the event loop.

What have we demonstrated so far? We showed how to begin an I/O operation and execute a callback when the operation is ready. An asynchronous framework builds on the two features we have shown, non-blocking sockets and the event loop, to run concurrent operations on a single thread.
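Put together, the fragments above already form a tiny but complete program. A minimal runnable sketch, identical to the listings above except for a connected_yet flag added so the demo loop can exit after its single callback fires:

import socket
from selectors import DefaultSelector, EVENT_WRITE

selector = DefaultSelector()
connected_yet = False  # added so this demo loop can stop

sock = socket.socket()
sock.setblocking(False)
try:
    sock.connect(('xkcd.com', 80))
except BlockingIOError:
    pass

def connected():
    global connected_yet
    selector.unregister(sock.fileno())
    print('connected!')
    connected_yet = True

selector.register(sock.fileno(), EVENT_WRITE, connected)

while not connected_yet:
    for event_key, event_mask in selector.select():
        event_key.data()  # run the registered callback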

We have achieved "concurrency" here, but not what is traditionally called "parallelism". That is, we built a tiny system that does overlapping I/O: it can start new operations while others are still in flight. It does not actually use multiple cores to perform computations in parallel. The system is designed for I/O-bound problems, not CPU-bound ones. (Python's global interpreter lock prohibits truly parallel execution of Python code within one process in any case. Parallelizing CPU-bound algorithms in Python requires multiple processes, or porting the parallel parts of the code to C. But that is a topic for another day.)

So our event loop is efficient at concurrent I/O because it does not devote thread resources to each connection. But before we go on, let us correct a common misconception: that async is faster than multithreading. Often it is not; in fact, in Python, an event loop like ours is slower than multithreading at serving a small number of very active connections. In a runtime with no global interpreter lock, threads would also perform better on such a workload. What asynchronous I/O is right for is applications with many slow or sleepy connections and infrequent events. (Jesse points out when async does and does not make sense in "What Is Async, How Does It Work, And When Should I Use It?", and Mike Bayer compares the throughput of asyncio and multithreading for different workloads in "Asynchronous Python and Databases".)

Callbacks

With the async framework we have just built, how can we write a web crawler? Even a simple URL fetcher turns out to be painful to write.

First, we keep a set of URLs we have yet to fetch and a set of URLs we have seen:

urls_todo = set(['/'])
seen_urls = set(['/'])

The seen_urls set contains urls_todo plus the completed URLs. The two sets are initialized with the root URL "/".

Fetching a page requires a chain of callbacks. The connected callback fires when a socket connection is established and sends a GET request to the server. Then it must wait for the response, so it registers another callback; and if, when that callback fires, it cannot yet read the complete response, it registers again, and so on.

Let's put these callbacks in a Fetcher object, which takes a URL, a socket, and a place to save the returned bytes:

class Fetcher:
    def __init__(self, url):
        self.response = b''  # Empty array of bytes.
        self.url = url
        self.sock = None

Our entry point is Fetcher.fetch:

    # Method on Fetcher class.
    def fetch(self):
        self.sock = socket.socket()
        self.sock.setblocking(False)
        try:
            self.sock.connect(('xkcd.com', 80))
        except BlockingIOError:
            pass
        # Register next callback.
        selector.register(self.sock.fileno(),
                          EVENT_WRITE,
                          self.connected)

The fetch method begins connecting a socket. But notice that the method returns before the connection is established. It must return control to the event loop to wait for the connection. To understand why, suppose our whole application is structured like this:

# Begin fetching http://xkcd.com/353/
fetcher = Fetcher('/353/')
fetcher.fetch()
while True:
    events = selector.select()
    for event_key, event_mask in events:
        callback = event_key.data
        callback(event_key, event_mask)

All event notifications are processed in the event loop when it calls select, so fetch must hand control back to the loop. Only then does the program learn that the connection has been established; the loop then invokes the connected callback, which was registered at the end of the fetch method above.

Here is the implementation of our connected method:

    # Method on Fetcher class.
    def connected(self, key, mask):
        print('connected!')
        selector.unregister(key.fd)
        request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(self.url)
        self.sock.send(request.encode('ascii'))
        # Register the next callback.
        selector.register(key.fd,
                          EVENT_READ,
                          self.read_response)

The connected method sends a GET request. A real application would check send's return value in case the whole message could not be sent at once, but our request is small and our crawler unsophisticated: it blithely calls send and then waits for a response. Of course, it must register yet another callback and hand control back to the event loop. The next and final callback, read_response, handles the server's reply:

    # Method on Fetcher class.
    def read_response(self, key, mask):
        global stopped
        chunk = self.sock.recv(4096)  # 4k chunk size.
        if chunk:
            self.response += chunk
        else:
            selector.unregister(key.fd)  # Done reading.
            links = self.parse_links()
            # Python set-logic:
            for link in links.difference(seen_urls):
                urls_todo.add(link)
                Fetcher(link).fetch()  # Start a new fetcher for this link.
            seen_urls.update(links)
            urls_todo.remove(self.url)
            if not urls_todo:
                stopped = True

This callback is called every time the selector finds that the socket is readable, in two cases: the socket has received data or it has been closed.

The callback asks the socket for up to 4K of data. If less is ready, chunk contains whatever is available; if there is more, chunk is 4K long and the socket remains readable, so the event loop will run this callback again on its next tick. When the response is complete, the server has closed the socket and chunk is empty.

The parse_links method, not shown here, returns a set of URLs, and we start a new fetcher for each URL we have not seen. Note a nice feature of programming with async callbacks: we need no locks around changes to shared data, such as when we add links to seen_urls. This is non-preemptive multitasking: our code cannot be interrupted at arbitrary points.

We added a global variable stopped to control the loop:

stopped = False
def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            callback(event_key, event_mask)

Once all the pages are downloaded, the fetcher stops the event loop and the program exits.

This example clearly exposes a problem with asynchronous programming: spaghetti code.

We need a way to express a series of computations and I/O operations, and to schedule many such series to run concurrently. But without threads, a series of operations cannot be collected into a single function: whenever a function begins an I/O operation, it must explicitly save whatever state it will need later, then return. It is up to you to think through and write that state-saving code.

Let's explain what that actually means. Consider first how simply we can fetch a page on a thread with a conventional blocking socket:

# Blocking version.
def fetch(url):
    sock = socket.socket()
    sock.connect(('xkcd.com', 80))
    request = 'GET {} HTTP/1.0\r\nHost: xkcd.com\r\n\r\n'.format(url)
    sock.send(request.encode('ascii'))
    response = b''
    chunk = sock.recv(4096)
    while chunk:
        response += chunk
        chunk = sock.recv(4096)
    # Page is now downloaded.
    links = parse_links(response)
    q.add(links)

What state does this function remember between one socket operation and the next? It has the socket, the URL, and the growing response. A function that runs on a thread uses basic features of the programming language to store this temporary state in local variables on its stack. The function also has a "continuation": the code it will execute once the I/O completes. The runtime remembers the continuation by storing the thread's instruction pointer. You need not think about how to restore these local variables and the continuation after the I/O; the language itself takes care of it.

But with a callback-based asynchronous framework, these language features are no help at all. While waiting for an I/O operation, a function must save its state explicitly, because it returns and loses its stack frame before the I/O completes. In our callback-based example, instead of local variables we store sock and response as attributes of the Fetcher instance, self. Instead of an instruction pointer, it saves its continuation by registering the connected and read_response callbacks. As the application's features grow, so does the complexity of the state we must save by hand across callbacks. Such onerous bookkeeping gives the coder a headache.

Even worse, what happens when our callback function throws an exception? Assuming we didn't write the parse_links method well, it throws an exception when parsing HTML:

Traceback (most recent call last):
  File "loop-with-callbacks.py", line 111, in 
    loop()
  File "loop-with-callbacks.py", line 106, in loop
    callback(event_key, event_mask)
  File "loop-with-callbacks.py", line 51, in read_response
    links = self.parse_links()
  File "loop-with-callbacks.py", line 67, in parse_links
    raise Exception('parse error')
Exception: parse error

The stack trace shows only that the event loop was running a callback. We do not remember what led up to the error. The chain is broken on both ends: we forget where we were going and whence we came. This loss of context is called "stack ripping", and it often makes the cause of a failure impossible to analyze. Stack ripping also prevents us from installing an exception handler for a chain of callbacks, the way a "try / except" block wraps a function call and its whole tree of descendants. (For a more sophisticated solution to this problem, see http://www.tornadoweb.org/en/stable/stack_context.html.)
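The best a callback-based program can do is catch exceptions at the event-loop level. A minimal sketch of that, wrapping the loop shown earlier: it lets the crawler limp on, but it still cannot tell us which chain of operations led to the failure.

import logging

def loop():
    while not stopped:
        events = selector.select()
        for event_key, event_mask in events:
            callback = event_key.data
            try:
                callback(event_key, event_mask)
            except Exception:
                # We know which callback raised, but not which fetch or
                # which chain of prior callbacks brought us here.
                logging.exception('callback raised')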

So, besides the long-running debate about which is more efficient, multithreading or async, there is the debate about which is more error-prone: if you slip up on synchronization, threads are susceptible to data races, while callbacks are stubbornly hard to debug because of stack ripping.

 
