Learn python from scratch with me (3) multithreading/multiprocessing/coroutine

Foreword

Looking back, the earlier articles covered the programming grammar you need to get started with Python from scratch: Python 3 basics, lists and tuples, strings, dictionaries, conditions, loops and other statements, functions, object-oriented programming, exception and file handling, and network programming.

1. Learn python from scratch with me (1) Compulsory programming grammar
2. Learn python from scratch with me (2) Network programming

This article covers concurrent programming in Python: multithreading, multiprocessing, and coroutines.

Because there is a lot of content, this series of articles follows the learning route below:

Learn Python from scratch to advanced: roadmap

Pay attention to the official account: python technology training camp , learn advanced step by step

Python resources suitable for zero-based learning and advanced people:

① Tencent certified python complete project practical tutorial notes PDF
② More than a dozen major manufacturers python interview topic PDF
③ Python full set of video tutorials (zero foundation-advanced advanced JS reverse)
④ Hundreds of project actual combat + source code + notes
⑤ Programming grammar - machine learning -Full-stack development-data analysis-crawler-APP reverse engineering and other full set of projects + documents
⑥ Exchange and study
⑦ Want to take part-time orders

Chapter 1: Multithreading

1. Threads and processes

Threads and processes are two important operating-system concepts, and they are the basis of concurrent programming. A thread is the smallest unit of execution that the operating system can schedule, and a process is the basic unit to which the operating system allocates resources.

The difference between thread and process:

A thread is a part of a process, a process can contain multiple threads, and a thread can only belong to one process.
A process has its own memory space, while threads share the memory space of a process.
Communication between processes needs to use the IPC (Inter-Process Communication) mechanism, and data can be directly shared between threads.
The creation and destruction of processes is slower than threads, because processes need to allocate and release independent memory space, while threads only need to allocate and release some registers and stack space.

In Python, you can use the threading module to create and manage threads. Here is a simple thread example:

import threading

def worker():
    print('Worker thread started')
    # do some work here
    print('Worker thread finished')

# create a new thread
t = threading.Thread(target=worker)
# start the thread
t.start()
# wait for the thread to finish
t.join()

In this example, we define a function called worker, which will run in a new thread. We create a new thread object using the threading.Thread class and pass it the worker function as the target. We then use the start() method to start the thread and the join() method to wait for the thread to complete.

2. Using threads

In Python, threads are used through the threading module. Here is a simple example showing how to use threads:

import threading

def worker():
    """Task executed by the thread."""
    print("Worker thread started")
    # perform some work here
    print("Worker thread finished")

# create the thread
t = threading.Thread(target=worker)
# start the thread
t.start()

# the main thread continues with other work
print("Main thread finished")

In the above example, we first define a worker function, which will be executed in a separate thread. Then we create a new thread using the threading.Thread class and pass the worker function to it as an argument. Finally, we call the start method to start the thread.

Note that threads execute asynchronously, so the main thread does not wait for the thread to complete. In the example above, the main thread continues executing immediately and outputs Main thread finished. If we want the main thread to wait for the thread to complete before continuing, we can use the join method:

# wait for the thread to complete
t.join()

# the main thread continues with other work
print("Main thread finished")

In the above code, we have called the t.join() method after starting the thread, which will block the main thread until the thread completes. Then, the main thread will continue to execute.

3. Multi-threaded global variables

In multithreaded programming, multiple threads can share global variables. However, it should be noted that when multiple threads read and write the same global variable at the same time, the problem of data race (Data Race) may occur, resulting in unpredictable results of the program.

To avoid data races, you can use a thread lock to ensure that only one thread can access the shared variable at a time. Python provides several synchronization primitives, such as Lock, RLock, and Semaphore, and you can choose the appropriate one according to your actual needs.

The following sample code uses Lock to keep a global variable shared by multiple threads safe:

import threading

# define the global variable
count = 0

# define the thread lock
lock = threading.Lock()

# define the thread function
def add():
    global count
    for i in range(100000):
        # acquire the lock
        lock.acquire()
        count += 1
        # release the lock
        lock.release()

# create two threads
t1 = threading.Thread(target=add)
t2 = threading.Thread(target=add)

# start the threads
t1.start()
t2.start()

# wait for the threads to finish
t1.join()
t2.join()

# print the result
print(count)

In the above code, we define a global variable count and use a Lock to make reads and writes of count from multiple threads safe. In each thread, we first acquire the lock, then increment count by 1, and finally release the lock. This guarantees that only one thread can access count at a time, which avoids the data race.

It should be noted that using locks will bring a certain performance loss, because it takes a certain amount of time to acquire and release locks each time. Therefore, in practical applications, it is necessary to select an appropriate locking mechanism according to the actual situation to avoid program performance degradation caused by excessive use of locks.
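
One way to reduce the locking cost described above is to do the bulk of the work on a local variable and take the lock only once per thread. The following is a minimal sketch under that assumption; the function name add_batched is made up for illustration, and the with statement is simply the context-manager form of acquire() and release().

```python
import threading

count = 0
lock = threading.Lock()

def add_batched(n):
    """Accumulate locally, then publish the subtotal under the lock once."""
    global count
    local_total = 0
    for _ in range(n):
        local_total += 1   # only thread-private state is touched, so no lock is needed
    with lock:             # acquired on entry, released on exit, even if an error occurs
        count += local_total

threads = [threading.Thread(target=add_batched, args=(100000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(count)  # 200000
```

Each thread now acquires the lock once instead of 100000 times, while the shared variable is still only modified inside the locked region.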

4. Problems caused by shared global variables

In multithreaded programming, multiple threads can share global variables. However, sharing global variables also poses some problems:

Race Conditions : When multiple threads access and modify the same global variable at the same time, race conditions can occur, causing unpredictable results in the program.
Data inconsistency : When multiple threads modify the same global variable at the same time, it may cause data inconsistency, that is, the value of the variable seen by some threads is different from that seen by other threads.
Deadlock : When multiple threads wait for each other to release a resource at the same time, a deadlock may occur, resulting in the inability of the program to continue executing.

Therefore, in multi-thread programming, it is necessary to pay attention to the access and modification of shared global variables to avoid the above problems. Mechanisms such as locks and condition variables can be used to ensure synchronization and mutual exclusion between threads.

5. Ways to prevent threads from modifying global variables at the same time

In multi-threaded programming, sharing global variables may cause some problems, such as:

Race condition : Multiple threads modify the same global variable at the same time, which may lead to data inconsistency or unexpected results.
Deadlock : Multiple threads wait for each other to release resources at the same time, causing the program to fail to continue executing.

In order to solve these problems, the following methods can be used:

Use locks : When accessing shared variables, use locks to ensure that only one thread can modify the variable at a time. Python provides the Lock class in the threading module to implement locks.
Use thread-safe data structures : Python provides some thread-safe data structures, such as queue.Queue and collections.deque (whose append and pop operations are thread-safe), which can be accessed and modified safely in a multi-threaded environment.
Use local variables : Pass global variables as parameters to thread functions, let thread functions operate on local variables, and avoid multiple threads modifying the same global variable at the same time.
Use thread local storage : Python provides the local class in the threading module, which can create an independent variable in each thread to avoid sharing variables between multiple threads.
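
Two of the methods above can be sketched together: a queue.Queue collects results without an explicit lock, and a threading.local object gives each thread its own private attribute. The names results, thread_state and chunks are illustrative assumptions, not part of any API.

```python
import queue
import threading

results = queue.Queue()            # thread-safe FIFO queue, no explicit lock required
thread_state = threading.local()   # attributes set on it are visible only to the setting thread

def worker(worker_id, numbers):
    thread_state.total = 0         # private to the current thread
    for n in numbers:
        thread_state.total += n
    results.put((worker_id, thread_state.total))

chunks = [range(0, 50), range(50, 100)]
threads = [threading.Thread(target=worker, args=(i, chunk))
           for i, chunk in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# All workers have finished, so the queue can be drained safely.
totals = {}
while not results.empty():
    worker_id, subtotal = results.get()
    totals[worker_id] = subtotal

print(totals)
```

Because no thread ever writes to a variable that another thread reads directly, there is nothing left to race on.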

6. Mutex lock

In multithreaded programming, mutex is a commonly used synchronization mechanism, which is used to protect shared resources and prevent data inconsistency caused by multiple threads modifying the same variable at the same time.

The basic idea of a mutex is to acquire a lock before accessing a shared resource. If the lock has already been acquired by another thread, the current thread will be blocked until the lock is released. After accessing the shared resource, release the lock so that other threads can acquire the lock and access the shared resource.

In Python, a mutex can be implemented using the Lock class in the threading module. The Lock class has two basic methods:

acquire(blocking=True, timeout=-1): Acquire the lock. If the lock has been acquired by another thread, the current thread is blocked until it is released. If blocking is False, the call returns False immediately when the lock cannot be acquired, instead of blocking and waiting.
release(): Release the lock so that other threads can acquire the lock and access the shared resource.

Here is an example using a mutex:

import threading

# shared variable
count = 0

# create the mutex
lock = threading.Lock()

# thread function
def worker():
    global count
    for i in range(100000):
        # acquire the lock
        lock.acquire()
        try:
            count += 1
        finally:
            # release the lock
            lock.release()

# create several threads
threads = []
for i in range(10):
    t = threading.Thread(target=worker)
    threads.append(t)

# start the threads
for t in threads:
    t.start()

# wait for all threads to finish
for t in threads:
    t.join()

# print the result
print(count)

In the above example, we create a shared variable count and protect it with a mutex. In each thread, we first acquire the lock, then modify the value of count, and finally release the lock in a finally block so that it is released even if an exception occurs. This guarantees that multiple threads never modify count at the same time, which avoids data inconsistency. The final output shows that count is 1000000, as expected.
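
The return values of acquire() can be observed in a single thread. This is a small sketch that relies on the fact that threading.Lock is not reentrant, so a second acquire() on a held lock fails even when made by the same thread; the variable names are illustrative.

```python
import threading

lock = threading.Lock()

first = lock.acquire()                  # the lock is free, so this succeeds
second = lock.acquire(blocking=False)   # already held: returns False immediately
third = lock.acquire(timeout=0.1)       # waits up to 0.1 s, then gives up with False
lock.release()
fourth = lock.acquire(blocking=False)   # free again, so this succeeds
lock.release()

print(first, second, third, fourth)  # True False False True
```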

7. Deadlock

Deadlock refers to a phenomenon in which two or more threads wait for each other due to competition for resources during the execution process. If there is no external interference, they will not be able to continue to execute. In multithreaded programming, deadlock is a common problem that requires special attention.

A deadlock can arise only when the following four conditions hold at the same time:

Mutual exclusion: A resource can only be used by one thread at a time.
Hold and wait: A thread that is blocked while requesting a resource keeps the resources it has already obtained.
No preemption: Resources obtained by a thread cannot be forcibly taken away by other threads before the thread has finished with them; they can only be released by the thread itself.
Circular wait: Several threads form a circular chain in which each one waits for a resource held by the next.

In order to avoid deadlocks, the following methods can be used :

Avoid using multiple locks and try to use one lock to control access to multiple resources.
Avoid holding the lock for too long, and try to shorten the lock holding time as much as possible.
Avoid circular waiting and try to acquire locks in a fixed order.
Use a timeout mechanism: when waiting for a lock exceeds a certain period of time, the thread gives up the attempt and releases the locks it already holds, instead of waiting forever.
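
Fixed lock ordering and timeouts can be combined as in the following sketch. The Account class and the transfer function are hypothetical examples; the point is only that both threads lock the accounts in the same order (sorted by name), no matter in which direction money moves.

```python
import threading

class Account:
    """Hypothetical resource guarded by its own lock."""
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()

def transfer(src, dst, amount):
    # Always lock the account with the smaller name first, whatever the direction.
    first, second = sorted((src, dst), key=lambda acc: acc.name)
    with first.lock:
        # Give up instead of waiting forever if the second lock stays unavailable.
        if not second.lock.acquire(timeout=1):
            return False
        try:
            src.balance -= amount
            dst.balance += amount
            return True
        finally:
            second.lock.release()

def many_transfers(src, dst, amount):
    for _ in range(1000):
        transfer(src, dst, amount)

a = Account("a", 100)
b = Account("b", 100)
t1 = threading.Thread(target=many_transfers, args=(a, b, 1))
t2 = threading.Thread(target=many_transfers, args=(b, a, 2))
t1.start()
t2.start()
t1.join()
t2.join()

print(a.balance, b.balance, a.balance + b.balance)
```

If each thread instead locked its own source account first, the two threads could each hold one lock and wait for the other, which is exactly the circular wait described above.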

8. Thread pool

A thread pool is a thread management technique. It creates a certain number of threads when the program starts and puts them in a pool. When a task needs to be executed, a thread is taken from the pool to run it; when the task finishes, the thread is not destroyed but put back into the pool to wait for the next task.

Using a thread pool avoids the overhead of frequently creating and destroying threads, and improves the performance and efficiency of the program. In Python, thread pools can be implemented using the concurrent.futures module from the standard library.

Here is a simple thread pool example:

import concurrent.futures
import time

def worker(num):
    print(f"Thread-{num} started")
    time.sleep(1)
    print(f"Thread-{num} finished")

if __name__ == '__main__':
    with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
        for i in range(5):
            executor.submit(worker, i)

In this example, we created a thread pool with 3 threads, and then submitted 5 tasks to the thread pool for execution. Since there are only 3 threads in the thread pool, only 3 tasks will be executed at the same time, and the remaining tasks will wait for the emergence of idle threads.

The output looks like the following (the exact interleaving may differ from run to run):

Thread-0 started
Thread-1 started
Thread-2 started
Thread-0 finished
Thread-3 started
Thread-1 finished
Thread-4 started
Thread-2 finished
Thread-3 finished
Thread-4 finished

It can be seen that the three threads in the pool work through the five tasks. Tasks are taken from the pool's queue in the order in which they were submitted, but the order in which they finish depends on thread scheduling.
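
The tasks above only print. When tasks return values, the results can be collected as in this sketch, which uses executor.map() and concurrent.futures.as_completed() from the standard library; the square function is just a placeholder task.

```python
import concurrent.futures

def square(n):
    return n * n

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in the order of the inputs.
    mapped = list(executor.map(square, range(5)))

    # submit() returns one Future per task; as_completed() yields them as they finish.
    futures = {executor.submit(square, n): n for n in range(5)}
    completed = {futures[f]: f.result()
                 for f in concurrent.futures.as_completed(futures)}

print(mapped)                     # [0, 1, 4, 9, 16]
print(sorted(completed.items()))
```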

Chapter 2: Multiprocessing

1. The state of the process

In an operating system, a process has the following states :

Ready state (Ready) : The process is ready to run, waiting for the allocation of CPU time slices.
Running state (Running) : The process is running, occupying CPU time slices.
Blocked state (Blocked) : The process cannot continue to execute for some reason, such as waiting for an I/O operation to complete or waiting for a resource to be released.
Suspended state (Suspended) : The process is suspended and no longer occupies CPU time slices, but its state information is still kept in memory.
Terminated : The process has completed execution or was forcibly terminated.

The state transition of a process is usually controlled by the operating system kernel. For example, when a running process has to wait for an I/O operation to complete, the operating system changes its state from running to blocked. When the I/O operation completes, the operating system moves the process from the blocked state to the ready state, where it waits to be allocated a CPU time slice.

2. Process creation - multiprocessing

In Python, you can use the multiprocessing module to create multiple processes. The multiprocessing module provides a Process class that can be used to create processes. Here is a simple example:

import multiprocessing

def worker():
    """Task to be executed by the child process."""
    print('Worker')

if __name__ == '__main__':
    # create the child process
    p = multiprocessing.Process(target=worker)
    # start the child process
    p.start()
    # wait for the child process to finish
    p.join()

In the above example, we first define a worker function, which is the task to be performed by the child process. Then we create a child process using the multiprocessing.Process class and pass the worker function as a parameter to the constructor of the Process class. Next, we call the start method to start the child process, and finally call the join method to wait for the child process to end.

It should be noted that on Windows the code that starts the child process must be placed under an if __name__ == '__main__': statement. Windows does not support the fork system call, so the multiprocessing module starts each child process by launching a new Python interpreter that imports the main module. Without the guard, that import would execute the process-creating code again; the if __name__ == '__main__': statement ensures that the child process only executes the specified code.

3. Process and thread comparison

Both processes and threads are ways to implement concurrent programming, but they have the following differences:

Resource occupation : Processes have their own memory space, while threads share the memory space of processes. Therefore, the overhead of creating a process is greater than that of creating a thread, and the communication between processes is also more complicated than the communication between threads.
Concurrency : Since threads share the memory space of a process, communication and data sharing between threads is easier than between processes. At the same time, thread switching is faster than process switching, so the concurrency of threads is higher than that of processes.
Security : Since threads share the memory space of the process, when multiple threads access the same memory at the same time, race conditions may occur, resulting in inconsistent data or program crashes. The memory space between processes is independent, so the data between processes will not affect each other.
Programming complexity : Due to the complexity of inter-process communication and data sharing, the complexity of writing multi-process programs is higher than that of writing multi-threaded programs.

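In general, processes are suitable for CPU-intensive tasks, because in standard CPython the global interpreter lock (GIL) allows only one thread at a time to execute Python bytecode, while threads are suitable for IO-intensive tasks, because a thread that is waiting for I/O releases the GIL. In practical applications, it is necessary to select an appropriate concurrent programming method according to specific scenarios.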
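
The claim that threads suit IO-intensive work can be illustrated with a rough timing sketch. Here time.sleep() stands in for a network or disk wait, and fake_io is a made-up placeholder; the measured time will vary between machines.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_io(task_id):
    time.sleep(0.2)   # stands in for waiting on I/O; the waiting thread does not hold the GIL
    return task_id

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fake_io, range(5)))
elapsed = time.perf_counter() - start

# Five 0.2 s waits overlap, so the total is close to 0.2 s rather than 1 s.
print(results, f"{elapsed:.2f}s")
```

Replacing the sleep with a pure-Python computation would remove most of this benefit, which is why CPU-intensive work is usually moved to separate processes.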

4. Communication between processes - Queue

In multi-process programming, data between different processes cannot be shared directly, because each process has its own independent memory space. Therefore, in order to achieve inter-process communication, we need to use some special mechanisms.

Among them, the most commonly used inter-process communication method is to use queues (Queue). A queue is a first-in-first-out (FIFO) data structure that can be used to pass data between multiple processes.

In Python, we can use the Queue class in the multiprocessing module to implement inter-process communication. The Queue class provides put() and get() methods for adding data to and removing data from the queue.

Here is a simple example that demonstrates how to use Queue for interprocess communication:

from multiprocessing import Process, Queue

def worker(q):
    while True:
        item = q.get()
        if item is None:
            break
        print(item)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=worker, args=(q,))
    p.start()

    for i in range(10):
        q.put(i)

    q.put(None)
    p.join()

In this example, we create a process p whose job is to take data from queue q and print it. The main process adds 10 pieces of data to the queue, and then adds a None, indicating that all the data has been added. Finally, the main process waits for process p to finish executing.

It should be noted that when we add data to the queue, if the queue was created with a maximum size and is full, the put() method will block until there is a free position in the queue. Similarly, when we take data out of the queue, if the queue is empty, the get() method will block until there is data available in the queue.

In addition to Queue, Python also provides other inter-process communication methods, such as Pipe, Value, and Array. These methods have their own characteristics, and you can choose the appropriate one according to your specific needs.
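
A minimal sketch of the other three mechanisms: Pipe returns two connected endpoints, while Value and Array hold a single number and a fixed-size sequence in shared memory. The worker and run_demo functions are illustrative names only.

```python
from multiprocessing import Process, Pipe, Value, Array

def worker(conn, counter, squares):
    """Update the shared objects, then report back through the pipe."""
    with counter.get_lock():          # a Value carries its own lock by default
        counter.value += 1
    for i in range(len(squares)):
        squares[i] = i * i
    conn.send("done")
    conn.close()

def run_demo():
    parent_conn, child_conn = Pipe()  # two connected endpoints
    counter = Value("i", 0)           # shared C int, initial value 0
    squares = Array("i", 4)           # shared array of 4 C ints, zero-initialised
    p = Process(target=worker, args=(child_conn, counter, squares))
    p.start()
    message = parent_conn.recv()      # blocks until the child sends something
    p.join()
    return message, counter.value, squares[:]

if __name__ == "__main__":
    print(run_demo())
```

Value and Array are convenient for a few numbers shared between processes, while Pipe and Queue are better suited to passing arbitrary picklable messages.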

5. Process pool creation - Pool

In Python, we can use the Pool class in the multiprocessing module to create process pools. A process pool is a set of reusable processes to which tasks can be assigned when needed. This avoids frequent creation and destruction of processes, thereby improving the efficiency of the program.

Here is an example of using a process pool:

import multiprocessing

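def worker(num):
    """Task run in the process pool."""
    print('Worker %d is running' % num)

if __name__ == '__main__':
    # create a process pool with 3 processes
    pool = multiprocessing.Pool(processes=3)
    # add tasks to the process pool
    for i in range(5):
        pool.apply_async(worker, args=(i,))
    # close the process pool so that it accepts no new tasks
    pool.close()
    # wait for all tasks to complete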
    pool.join()
    print('All workers done.')

In this example, we first create a process pool with 3 processes. Then we add 5 tasks to the process pool, each of which calls the worker function. Finally, we close the process pool and wait for all tasks to complete.

It should be noted that the tasks in the process pool must be serializable, because the process pool will send tasks to child processes for execution. If the task contains non-serializable objects, the process pool will not work properly.
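
When the tasks return values, Pool.map() is often more convenient than apply_async(). The sketch below assumes a simple placeholder task named square; it is defined at module level so that it can be serialized and sent to the worker processes.

```python
import multiprocessing

def square(n):
    # Defined at module level so the pool can pickle a reference to it.
    return n * n

def run_pool(numbers, processes=3):
    # The with statement shuts the pool down when the block exits;
    # map() has already collected every result by then.
    with multiprocessing.Pool(processes=processes) as pool:
        return pool.map(square, numbers)

if __name__ == '__main__':
    print(run_pool(range(5)))
```

A lambda or a function defined inside another function would typically fail here, because the standard pickle module cannot serialize it.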

Chapter 3: Coroutines

A coroutine is a lightweight thread, also known as a micro-thread or a user-level thread. The characteristic of coroutines is that in one thread, there can be multiple coroutines, and the coroutines can switch between each other to achieve concurrent execution.

In Python, coroutines were originally implemented through generators. With the yield keyword, a function becomes a generator, which can act as a coroutine. In such a coroutine, the yield keyword suspends the execution of the function and returns a value to the caller. When the coroutine is resumed, execution continues from where it was last suspended.

There are two implementations of coroutines in Python: coroutines implemented using generators and coroutines implemented using the async/await keywords.

Coroutines implemented using generators :

def coroutine():
    while True:
        value = yield
        print('Received value:', value)

c = coroutine()
next(c)  # start the coroutine (run it to the first yield)
c.send(10)  # send a value to the coroutine
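
A generator-based coroutine can also hand a value back on every send(). The following running-average sketch is a common illustration of that two-way communication; the function name is made up for the example.

```python
def running_average():
    total = 0.0
    count = 0
    average = None
    while True:
        # Hand the current average to the caller and wait for the next value.
        value = yield average
        total += value
        count += 1
        average = total / count

avg = running_average()
next(avg)            # prime the coroutine: run it to the first yield
a1 = avg.send(10)    # 10.0
a2 = avg.send(20)    # 15.0
a3 = avg.send(30)    # 20.0
avg.close()          # stop the coroutine

print(a1, a2, a3)
```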

Coroutines implemented using the async/await keyword :

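import asyncio

async def coroutine():
    for i in range(3):
        # asyncio.sleep() suspends the coroutine and returns `result` after the delay
        value = await asyncio.sleep(1, result=i)
        print('Received value:', value)

# asyncio.run() creates an event loop, runs the coroutine, and closes the loop
asyncio.run(coroutine())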

In a coroutine implemented using the async/await keyword, the event loop provided by the asyncio module needs to be used to run the coroutine. In a coroutine, you can use the await keyword to suspend the execution of a function and wait for an asynchronous operation to complete. When the asynchronous operation completes, the coroutine continues execution from the await statement.

The advantage of coroutines is that the overhead of thread switching can be avoided, thereby improving the performance of the program. At the same time, because coroutines in one thread only switch at explicit suspension points, they avoid many of the race conditions and deadlock problems that arise between threads. However, coroutines also have some disadvantages: a single event loop cannot take advantage of multi-core CPUs, and a blocking IO call inside a coroutine stalls every other coroutine running on the same thread.
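
One common way around the blocking-IO limitation is to hand the blocking call to a worker thread. The sketch below uses asyncio.to_thread(), which is available from Python 3.9; blocking_io and ticker are illustrative names, and time.sleep() stands in for any blocking call.

```python
import asyncio
import time

def blocking_io():
    time.sleep(0.3)   # a blocking call; it would stall the event loop if called directly
    return "blocking done"

async def ticker(log):
    for i in range(3):
        await asyncio.sleep(0.05)
        log.append(f"tick {i}")

async def main():
    log = []
    # The blocking function runs in a worker thread while ticker() keeps running.
    result, _ = await asyncio.gather(asyncio.to_thread(blocking_io), ticker(log))
    return result, log

result, log = asyncio.run(main())
print(result, log)
```

On older Python versions, loop.run_in_executor() serves the same purpose.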

1. The significance of coroutines

A coroutine is a lightweight thread that can achieve concurrency within a single thread. Compared with threads, coroutines have less switching overhead and can use CPU resources more efficiently. The significance of coroutines is:

Improve the concurrency performance of the program : Coroutines can achieve concurrency in a single thread, avoiding the overhead of thread switching and improving the concurrency performance of the program.
Simplified programming model : Coroutines can use a synchronous programming model, which avoids complex thread synchronization issues and makes programming easier.
Support high concurrency : Coroutines can support a large number of concurrent tasks, and can be used in high-concurrency network programming, crawlers and other scenarios.
Improve code readability : Coroutines can use a synchronous programming model, which makes the code more readable and easy to maintain.

In short, coroutine is an efficient, simple, and readable concurrent programming model, which can improve the concurrent performance of the program, support high concurrency, and simplify the programming model.

2. asyncio event loop

In Python, a coroutine is a lightweight concurrent programming method that can implement concurrent execution in a single thread. The significance of coroutine is that it can improve the concurrency performance of the program, reduce the overhead of thread switching, and also simplify the programming model, making the code easier to understand and maintain.

Python 3.4 and above provide the asyncio module in the standard library, which is one of the main ways to implement coroutines in Python. The asyncio module provides an event loop, which enables concurrent execution of multiple coroutines in a single thread. The event loop continuously takes coroutines from its queue and executes them. When a coroutine awaits an IO operation, it is suspended and the loop switches to other coroutines; the coroutine resumes after the IO operation completes.

The asyncio event loop is used in the following way:

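1. Create an event loop object :

import asyncio

# get_event_loop() is deprecated when no loop is running, so create one explicitly
loop = asyncio.new_event_loop()

2. Add the coroutine object to the event loop and run it until it completes :

async def coroutine():
    # coroutine code goes here
    await asyncio.sleep(0)

loop.run_until_complete(coroutine())

3. Alternatively, keep the event loop running until loop.stop() is called :

loop.run_forever()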

In the event loop, the async/await keywords are used to define coroutine functions, and the various functions provided by the asyncio module are used for communication and cooperation between coroutines. For example, you can use asyncio.sleep() to delay a coroutine, use asyncio.wait() to wait for the completion of multiple tasks, and so on.
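
The following sketch shows several coroutines sharing one event loop. It uses asyncio.gather(), a helper related to asyncio.wait() that is not named in the text above, and asyncio.run(), which creates and closes the loop automatically; fetch is a made-up coroutine that only sleeps.

```python
import asyncio
import time

async def fetch(name, delay):
    await asyncio.sleep(delay)   # suspends this coroutine; the loop runs the others
    return f"{name} done"

async def main():
    start = time.perf_counter()
    # gather() schedules all three coroutines and returns their results in argument order.
    results = await asyncio.gather(fetch("a", 0.2), fetch("b", 0.2), fetch("c", 0.2))
    elapsed = time.perf_counter() - start
    return results, elapsed

results, elapsed = asyncio.run(main())
print(results, f"{elapsed:.2f}s")
```

The three delays overlap, so the total time is close to 0.2 seconds rather than 0.6.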

3. await keyword

In Python, await is a keyword used to wait for a coroutine to complete. When a coroutine calls another coroutine, it can use the await keyword to wait for the other coroutine to complete and return the result. During the wait, the current coroutine is suspended until the awaited coroutine completes.

For example, suppose there are two coroutines A and B, and A needs to wait for B to complete before proceeding. In coroutine A, you can use the await keyword to wait for coroutine B to complete:

async def coroutine_b():
    # code of coroutine B
    return 'result from B'

async def coroutine_a():
    # code of coroutine A
    result = await coroutine_b()
    # coroutine A continues here once B has finished
    return result

In this example, when coroutine A calls await coroutine_b(), it waits for coroutine B to complete and return the result. During the waiting period, coroutine A will be suspended until coroutine B completes. Once coroutine B completes and returns a result, coroutine A continues execution.

Using the await keyword can make the calls between coroutines more concise and intuitive, and can also avoid complex asynchronous programming modes such as callback functions.

4. Task and Future objects

In Python, the asyncio module provides a way of asynchronous programming based on coroutines. We can use the async/await keywords to define asynchronous functions, and use the event loop provided by the asyncio module to schedule the execution of coroutines.

In addition to coroutines, asyncio provides several other concurrent programming tools, including Task and Future objects.

Task object

The Task object is an important concept in asyncio; it represents the scheduled execution of a coroutine. In asyncio, we can create a Task object using the asyncio.create_task() function, which takes a coroutine object as a parameter and returns a Task object.

For example, the following code creates a coroutine object and wraps it in a Task object using the create_task() function:

import asyncio

async def my_coroutine():
    print('Coroutine started')
    await asyncio.sleep(1)
    print('Coroutine ended')

async def main():
    task = asyncio.create_task(my_coroutine())
    await task

asyncio.run(main())

In the above code, we use the create_task() function to wrap the coroutine object returned by my_coroutine() in a Task object and assign it to the task variable. Then we use the await keyword to wait for the task to complete.
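
A task object can also be inspected and cancelled while it is running. The sketch below uses the done(), cancel() and cancelled() methods of asyncio.Task; the work coroutine is a placeholder.

```python
import asyncio

async def work(n):
    await asyncio.sleep(0.1)
    return n * 2

async def main():
    task = asyncio.create_task(work(21))
    before = task.done()                      # False: scheduled but not finished yet
    slow = asyncio.create_task(asyncio.sleep(10))
    result = await task                       # suspends main() until work() returns
    slow.cancel()                             # request cancellation of the long sleep
    try:
        await slow
    except asyncio.CancelledError:
        pass                                  # expected: the task was cancelled
    return before, result, task.done(), slow.cancelled()

state = asyncio.run(main())
print(state)  # (False, 42, True, True)
```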

Future object

The Future object is another important concept in asyncio; it represents the eventual result of an asynchronous operation. In asyncio, we can create a Future object by calling asyncio.Future(), which returns a Future that is not yet finished.

For example, the following code creates an unfinished future object:

import asyncio

async def my_coroutine():
    print('Coroutine started')
    await asyncio.sleep(1)
    print('Coroutine ended')
    return 'Result'

async def main():
    future = asyncio.Future()
    await asyncio.sleep(1)
    future.set_result(await my_coroutine())
    print(future.result())

asyncio.run(main())

In the above code, we create an unfinished Future object by calling asyncio.Future() and assign it to the future variable. We then use the await keyword to wait 1 second before calling my_coroutine() and setting its return value as the result of the Future. Finally, we print the result of the Future object.
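
A Future becomes more useful when one coroutine waits on it while another one sets the result. The sketch below assumes a made-up producer coroutine and creates the Future with loop.create_future(), the standard way to obtain one from the running loop.

```python
import asyncio

async def producer(future, value):
    await asyncio.sleep(0.1)
    future.set_result(value)       # marks the Future as done and wakes up the waiter

async def main():
    loop = asyncio.get_running_loop()
    future = loop.create_future()  # an unfinished Future bound to this loop
    task = asyncio.create_task(producer(future, "Result"))
    pending = future.done()        # False: nothing has set the result yet
    result = await future          # suspends until producer() calls set_result()
    await task
    return pending, result, future.done()

outcome = asyncio.run(main())
print(outcome)  # (False, 'Result', True)
```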

5. asyncio asynchronous iterators and context managers

In addition to Task and Future objects, asyncio also works with some other concurrent programming tools, including asynchronous iterators and asynchronous context managers.

An async iterator is a special iterator that can be used in an asynchronous environment. In asyncio, we can use an async for loop to iterate over asynchronous iterators.

For example, the following code uses an async for loop to iterate over an asynchronous iterator:

import asyncio

async def my_coroutine():
    for i in range(5):
        await asyncio.sleep(1)
        yield i

async def main():
    async for i in my_coroutine():
        print(i)

asyncio.run(main())

In the code above, we define an asynchronous generator function my_coroutine() that returns values using yield statements and pauses for 1 second before each one. We then use an async for loop to iterate over the asynchronous iterator returned by my_coroutine() and print each value.

Context management is a way of managing resources in an asynchronous environment. In asyncio, we can use async with statements to manage asynchronous contexts.

For example, the following code uses an async with statement to manage an asynchronous context:
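
An asynchronous iterator can also be written as a class that implements __aiter__() and __anext__(). The Countdown class below is a hypothetical example of that protocol.

```python
import asyncio

class Countdown:
    """Asynchronous iterator that yields start, start - 1, ..., 1."""
    def __init__(self, start):
        self.current = start

    def __aiter__(self):
        return self

    async def __anext__(self):
        if self.current <= 0:
            raise StopAsyncIteration   # tells async for that iteration is over
        await asyncio.sleep(0.01)      # stands in for an asynchronous fetch
        self.current -= 1
        return self.current + 1

async def main():
    return [n async for n in Countdown(3)]

values = asyncio.run(main())
print(values)  # [3, 2, 1]
```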

import asyncio

class MyContext:
    async def __aenter__(self):
        print('Entering context')
        await asyncio.sleep(1)
        return self

    async def __aexit__(self, exc_type, exc, tb):
        print('Exiting context')
        await asyncio.sleep(1)

async def main():
    async with MyContext() as context:
        print('Inside context')

asyncio.run(main())

In the above code, we define a MyContext class that implements the __aenter__() and __aexit__() methods. The __aenter__() method is called when entering the context and the __aexit__() method is called when leaving it. In the main() function, we manage a MyContext object with an async with statement and print a message inside the context. When we enter and exit the context, __aenter__() and __aexit__() are called, and each pauses for 1 second.
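
For simple cases, the same enter and exit behaviour can be written as a single function with contextlib.asynccontextmanager from the standard library. The managed_resource function and the events list are illustrative assumptions.

```python
import asyncio
from contextlib import asynccontextmanager

events = []

@asynccontextmanager
async def managed_resource(name):
    events.append(f"open {name}")       # runs on entry, like __aenter__()
    await asyncio.sleep(0.01)
    try:
        yield name                      # the value bound by "as"
    finally:
        await asyncio.sleep(0.01)
        events.append(f"close {name}")  # runs on exit, like __aexit__()

async def main():
    async with managed_resource("db") as res:
        events.append(f"use {res}")

asyncio.run(main())
print(events)  # ['open db', 'use db', 'close db']
```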

6. Operate MySQL asynchronously

In Python, we can use the aiomysql library together with asyncio to perform asynchronous operations on a MySQL database. Here is a simple example:

import asyncio
import aiomysql

async def test_mysql():
    # connect to the MySQL database
    conn = await aiomysql.connect(host='localhost', port=3306,
                                  user='root', password='password',
                                  db='test', charset='utf8mb4')
    # create a cursor
    cur = await conn.cursor()
    # execute the SQL statement
    await cur.execute("SELECT * FROM users")
    # fetch the query result
    result = await cur.fetchall()
    # print the query result
    print(result)
    # close the cursor and the connection
    await cur.close()
    conn.close()

# run the asynchronous function
asyncio.run(test_mysql())

In the above example, we use the aiomysql library to connect to the MySQL database and the async/await syntax to perform asynchronous operations. First, we use aiomysql.connect() to connect to the MySQL database, then use await conn.cursor() to create a cursor, await cur.execute() to execute the SQL statement, and await cur.fetchall() to obtain the query result. Finally, we use await cur.close() to close the cursor and conn.close() to close the connection.

It should be noted that when using the aiomysql library, we specify charset='utf8mb4' when connecting to the MySQL database so that the full Unicode range, including Chinese characters, is supported.

7. Asynchronous crawler

Asynchronous crawlers use coroutines to implement crawler programs and improve crawling efficiency through asynchronous, non-blocking requests. In Python, the asyncio library can be used to implement asynchronous crawlers.

Here is an example of a simple asynchronous crawler:

import asyncio
import aiohttp
from bs4 import BeautifulSoup

async def fetch(session, url):
    async with session.get(url) as response:
        return await response.text()

async def get_links(session, url):
    html = await fetch(session, url)
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        href = link.get('href')
        if href and href.startswith('http'):
            links.append(href)
    return links

async def main():
    async with aiohttp.ClientSession() as session:
        links = await get_links(session, 'https://www.baidu.com')
        for link in links:
            print(link)

if __name__ == '__main__':
    asyncio.run(main())

In this example, we use the aiohttp library to send asynchronous HTTP requests, the BeautifulSoup library to parse HTML pages, and the asyncio library to run the coroutines.

First, a fetch function is defined to send an HTTP request and return the response content. Then a get_links function is defined to get all the links in the page. Finally, the main function uses the aiohttp library to create an asynchronous HTTP client session, calls the get_links function to get the links, and prints them.

It should be noted that when using the aiohttp library, you need to use the async with statement to create the asynchronous HTTP client session, so that the session is properly closed after use.
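
A real crawler usually fetches many pages at once and limits how many requests are in flight. The sketch below shows that pattern with asyncio.Semaphore and asyncio.gather(). To stay self-contained it does not use aiohttp: fake_fetch is a hypothetical stand-in for the fetch function above, and the example.com URLs are placeholders that are never requested.

```python
import asyncio

async def fake_fetch(url, semaphore, stats):
    async with semaphore:                 # at most `limit` coroutines pass this point at once
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.05)         # stands in for the real HTTP request
        stats["active"] -= 1
        return f"<html>{url}</html>"

async def crawl(urls, limit=2):
    semaphore = asyncio.Semaphore(limit)
    stats = {"active": 0, "peak": 0}
    pages = await asyncio.gather(*(fake_fetch(u, semaphore, stats) for u in urls))
    return pages, stats["peak"]

urls = [f"https://example.com/page{i}" for i in range(5)]
pages, peak = asyncio.run(crawl(urls))
print(len(pages), peak)  # 5 2
```

In a real crawler, the body of fake_fetch would be replaced by the session.get() call, with the semaphore keeping the load on the target site bounded.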

The next chapter: python database programming