Python-16: Thread Pools, Process Pools, and Concurrent Programming in Python


1 Concurrent programming

1.1 Concurrent Programming Concepts

1. Why introduce concurrent programming?
Scenario 1: a web crawler that took 1 hour to run sequentially finished in 20 minutes once downloads were made concurrent.
Scenario 2: an app page that took 3 seconds to load before optimization came down to 200 milliseconds with asynchronous concurrency.
Concurrency is introduced to make programs run faster.

2. What are the ways to speed up a program?

3. Python's support for concurrent programming
(1) Multithreading: threading uses the fact that CPU work and IO can proceed at the same time, so the CPU does not sit idle waiting for IO to complete.
(2) Multiprocessing: multiprocessing uses multiple CPU cores to execute tasks truly in parallel.
(3) Asynchronous IO: asyncio overlaps CPU work and IO within a single thread to execute functions asynchronously.
(4) Lock protects shared resources against conflicting access.
(5) Queue provides data communication between threads/processes and enables the producer-consumer pattern.
(6) Thread pools and process pools simplify submitting tasks, waiting for completion, and collecting results.
(7) subprocess launches external programs as child processes and interacts with their input and output.

Python offers three main concurrency models:
multithreading (Thread), multiprocessing (Process), and coroutines (Coroutine).
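
As a quick orientation, here is a minimal sketch of how each of the three models is started; io_task and async_io_task are placeholder functions invented for illustration:

import threading
import multiprocessing
import asyncio
import time


def io_task():
    time.sleep(1)  # stands in for a blocking IO wait


async def async_io_task():
    await asyncio.sleep(1)  # non-blocking wait inside the event loop


if __name__ == "__main__":
    # Multithreading: suited to IO-bound work
    t = threading.Thread(target=io_task)
    t.start()
    t.join()

    # Multiprocessing: true parallelism for CPU-bound work
    p = multiprocessing.Process(target=io_task)
    p.start()
    p.join()

    # Coroutines: single-threaded asynchronous IO
    asyncio.run(async_io_task())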

1.2 Threads, processes, and coroutines

1. What are CPU-intensive (CPU-bound) and IO-intensive (IO-bound) computations?
CPU-bound tasks spend most of their time computing (compression, encryption, regex matching, and the like); IO-bound tasks spend most of their time waiting on disk, network, or database IO while the CPU sits largely idle (file processing, crawlers, database reads and writes).
2. Comparison of multithreading, multiprocessing, and coroutines
In brief: processes have the highest creation and switching overhead but achieve true multi-core parallelism; threads are lighter but are serialized by the GIL; coroutines are the lightest and support the largest number of concurrent tasks, but require libraries with async support.
3. How to choose the right technique for a task?
Rule of thumb: use multiprocessing for CPU-bound work; use multithreading for IO-bound work; use coroutines for IO-bound work that needs very high concurrency and has async library support.

1.3 Global Interpreter Lock GIL

1. Two reasons why Python is slow
Compared with C/C++/Java, Python really is slow; in some scenarios it runs 100-200x slower than C++. Because of this, much infrastructure code is still written in C/C++: the performance-critical recommendation engines, search engines, and storage engines at large companies such as Ali, Tencent, and Kuaishou.
The two usual reasons: Python is a dynamically typed, interpreted language, and the GIL prevents multi-threaded code from using multiple cores.
The Global Interpreter Lock (GIL) is a mechanism used by some language interpreters to synchronize threads so that only one thread executes at any moment. Even on a multi-core processor, an interpreter with a GIL allows only one thread to run at a time.
2. Why does the GIL exist?
Mainly to simplify memory management: CPython uses reference counting, and the GIL makes those counter updates thread-safe without fine-grained locks, while also protecting the interpreter from hard-to-debug races and deadlocks.
3. How to work around the GIL's restrictions?
For CPU-bound work, use multiprocessing (each process has its own interpreter and its own GIL) or move hot code into C extensions; for IO-bound work, plain threads still help, because the GIL is released while a thread waits on IO.

2 Crawler code (blog.py)

Later sections import this module as blog, so save it as blog.py.

import requests
from bs4 import BeautifulSoup

# Build the URL list with a list comprehension
urls = [
    f"https://www.cnblogs.com/sitehome/p/{page}"
    for page in range(1, 50 + 1)
]


def craw(url):
    # Fetch the page HTML
    r = requests.get(url)
    return r.text


def parse(html):
    # Parse the page for <a> links with class="post-item-title"
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", class_="post-item-title")
    # Return (href, link text) pairs
    return [(link["href"], link.get_text()) for link in links]


if __name__ == "__main__":
    for result in parse(craw(urls[2])):
        print(result)

3 Speeding up the crawler with multithreading

3.1 How to create threads

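The original slide is not shown; as a stand-in, this is the standard three-step pattern for the threading module (my_func is a placeholder function):

import threading


def my_func(a, b):
    print(a + b)  # placeholder workload


# 1. Create the thread, giving it the target function and arguments
t = threading.Thread(target=my_func, args=(1, 2))
# 2. Start the thread
t.start()
# 3. Wait for it to finish
t.join()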

3.2 Single-threaded and multi-threaded comparison

import blog
import threading
import time


def single_thread():
    print("single thread begin")
    # Loop over the URLs and fetch them one by one
    for url in blog.urls:
        blog.craw(url)
    print("single thread end")


def multi_thread():
    print("multi thread begin")
    threads = []
    # Create one thread per URL
    for url in blog.urls:
        threads.append(threading.Thread(target=blog.craw, args=(url,)))
        
    # Start each thread
    for thread in threads:
        thread.start()
        
    # Wait for all threads to finish (blocks the main thread)
    for thread in threads:
        thread.join()

    print("multi thread end")


if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single thread cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi thread cost:", end - start, "seconds")

In a typical run, the multi-threaded version finishes many times faster than the single-threaded one.

4 Producer-consumer multi-threaded crawler

4.1 Producer Consumer Architecture

1. Multi-component pipeline architecture
Complex work is usually not done in one shot; it is broken into many intermediate steps, each feeding the next, which forms a pipeline.
2. The architecture of a producer-consumer crawler
Here the crawling threads are producers and the parsing threads are consumers, connected by queues.
3. queue.Queue for inter-thread data communication
queue.Queue provides thread-safe data communication between multiple threads.
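
A minimal sketch of the Queue operations used below (put, get, qsize):

import queue

q = queue.Queue()
q.put("item")     # enqueue; thread-safe
print(q.qsize())  # number of items currently queued -> 1
print(q.get())    # dequeue; blocks if the queue is empty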

4.2 Producer-consumer code

import queue
import blog
import time
import random
import threading


def do_craw(url_queue: queue.Queue, html_queue: queue.Queue):
    while True:
        url = url_queue.get()  # take a URL from the queue
        html = blog.craw(url)
        html_queue.put(html)  # put the fetched HTML into the queue
        print(threading.current_thread().name,
              threading.current_thread().ident,
              f"craw {url}",
              "url_queue.size=", url_queue.qsize())
        # Sleep randomly for 1-2 seconds
        time.sleep(random.randint(1, 2))


def do_parse(html_queue: queue.Queue, fout):
    while True:
        html = html_queue.get()
        results = blog.parse(html)
        for result in results:
            fout.write(str(result) + "\n")
        print(threading.current_thread().name,
              threading.current_thread().ident,
              f"results.size", len(results),
              "html_queue.size=", html_queue.qsize())
        time.sleep(random.randint(1, 2))


if __name__ == "__main__":
    url_queue = queue.Queue()  # queue of URLs waiting to be crawled
    html_queue = queue.Queue()  # queue of fetched HTML
    for url in blog.urls:
        url_queue.put(url)
    # Start 3 producer threads to fetch pages
    for idx in range(3):
        t = threading.Thread(target=do_craw, args=(url_queue, html_queue),
                             name=f"craw{idx}")
        t.start()
    # Start 2 consumer threads to parse pages and write results to a file
    fout = open("02.data.txt", "w")  # note: the workers loop forever, so this demo never closes the file
    for idx in range(2):
        t = threading.Thread(target=do_parse, args=(html_queue, fout),
                             name=f"parse{idx}")
        t.start()
    print("jiesu")

5 Thread safety

5.1 Thread Safety Concept

Thread safety means that when a function or library is called from multiple threads, shared variables are handled correctly and the program still behaves as intended.
Because the interpreter can switch threads at any point, unlucky interleavings produce unpredictable results; such code is thread-unsafe. The bank-withdrawal example below demonstrates the problem.

import threading
import time

lock = threading.Lock()


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw(account, amount):
    # with lock:  # the fix; see 5.2
    if account.balance >= amount:
        time.sleep(0.1)  # forces a thread switch here, exposing the race
        print(threading.current_thread().name, "withdraw success")
        account.balance -= amount
        print(threading.current_thread().name, "balance", account.balance)
    else:
        print(threading.current_thread().name, "withdraw failed: insufficient balance")


if __name__ == "__main__":
    account = Account(1000)
    ta = threading.Thread(name="ta", target=draw, args=(account, 800))
    tb = threading.Thread(name="tb", target=draw, args=(account, 800))

    ta.start()
    tb.start()

With the sleep in place, both threads pass the balance check before either deducts, so both report a successful withdrawal and the final balance is -600.

5.2 Solving thread safety issues

import threading
import time

lock = threading.Lock()


class Account:
    def __init__(self, balance):
        self.balance = balance


def draw(account, amount):
    with lock:  # only one thread at a time may check and update the balance
        if account.balance >= amount:
            time.sleep(0.1)
            print(threading.current_thread().name, "withdraw success")
            account.balance -= amount
            print(threading.current_thread().name, "balance", account.balance)
        else:
            print(threading.current_thread().name, "withdraw failed: insufficient balance")


if __name__ == "__main__":
    account = Account(1000)
    ta = threading.Thread(name="ta", target=draw, args=(account, 800))
    tb = threading.Thread(name="tb", target=draw, args=(account, 800))

    ta.start()
    tb.start()

With the lock, exactly one thread withdraws successfully (balance 200) and the other fails for insufficient balance, regardless of scheduling.

6 Thread pools

(1) Lower resource consumption: reusing already-created threads avoids the cost of repeatedly creating and destroying threads.
(2) Faster response: when a task arrives it can run immediately, with no wait for thread creation.
(3) Better manageability: threads are a scarce resource; creating them without limit consumes system resources and destabilizes the system, while a pool allows unified allocation, tuning, and monitoring.

Python already has the threading module, so why do we need a thread pool, and what is one? Take the crawler as an example: we want to bound the number of threads crawling at the same time. If we create 20 threads but only let 3 run concurrently, we still pay to create and destroy all 20, and thread creation consumes system resources. Is there a better way?

In fact, three threads are enough: each thread is handed a task, the remaining tasks wait in a queue, and whenever a thread finishes it picks up the next queued task.

Writing a robust thread pool yourself is hard, though; you must handle thread synchronization in tricky situations, and deadlocks are easy to introduce. Since Python 3.2 the standard library has provided the concurrent.futures module, whose ThreadPoolExecutor and ProcessPoolExecutor classes are a higher-level abstraction over threading and multiprocessing (the focus here is the thread pool).
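
A minimal sketch of that idea, with a trivial placeholder task: 3 worker threads service 20 queued calls, and threads are reused rather than created per task.

from concurrent.futures import ThreadPoolExecutor
import time


def task(i):
    time.sleep(1)  # stands in for real work
    return i


# max_workers=3 bounds concurrency; the remaining calls wait in the pool's queue
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(task, range(20)))
print(results)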

6.1 Principle of thread pool

1. Better performance: the overhead of creating and terminating threads is reduced because thread resources are reused.
2. Applicable scenarios: handling bursts of many requests, or many tasks whose individual processing time is short.
3. Defensive value: it avoids problems such as system overload caused by creating too many threads.
4. Cleaner code: using a thread pool reads more concisely than creating a new thread per task.
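
Thread reuse can be observed directly. In this sketch the pool has two workers serving five tasks, so only two thread names ever appear in the output:

from concurrent.futures import ThreadPoolExecutor
import threading
import time


def work(i):
    time.sleep(0.1)
    # the same two worker-thread names repeat across all five tasks
    print(threading.current_thread().name, "ran task", i)


with ThreadPoolExecutor(max_workers=2) as pool:
    for i in range(5):
        pool.submit(work, i)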

6.2 Usage of ThreadPoolExecutor

import concurrent.futures
import blog

# Step 1: fetch all pages concurrently with pool.map
with concurrent.futures.ThreadPoolExecutor() as pool:
    htmls = pool.map(blog.craw, blog.urls)
    htmls = list(zip(blog.urls, htmls))
    for url, html in htmls:
        print(url, len(html))

print("craw over")

# Step 2: parse each page, submitting one task per page
with concurrent.futures.ThreadPoolExecutor() as pool:
    futures = {}
    for url, html in htmls:
        future = pool.submit(blog.parse, html)
        futures[future] = url
    # Option 1: iterate futures in submission order
    for future, url in futures.items():
        print(url, future.result())
    # Option 2: handle whichever future completes first
    # for future in concurrent.futures.as_completed(futures):
    #     url = futures[future]
    #     print(url, future.result())

6.3 Using a thread pool to accelerate a web service

1. The architecture and characteristics of web services
Benefits of using ThreadPoolExecutor in a web service:
(1) IO calls to disk files, databases, and remote APIs can conveniently run concurrently;
(2) the number of pool threads is bounded, so threads cannot be created without limit and hang the system: a defensive property.
2. The code below uses Flask to implement a web service. The three IO calls sleep 5, 4, and 3 seconds; run serially they would take 12 seconds, but submitted to the pool they overlap, so the response returns in about 5 seconds (the slowest call).

import flask
import json
import time
from concurrent.futures import ThreadPoolExecutor

app = flask.Flask(__name__)
pool = ThreadPoolExecutor()


def read_file():
    time.sleep(5)
    return "file result"


def read_db():
    time.sleep(4)
    return "db result"


def read_api():
    time.sleep(3)
    return "api result"


@app.route("/")
def index():
    result_file = pool.submit(read_file)
    result_db = pool.submit(read_db)
    result_api = pool.submit(read_api)

    return json.dumps({
    
    
        "result_file": result_file.result(),
        "result_db": result_db.result(),
        "result_api": result_api.result(),
    })


if __name__ == "__main__":
    app.run()

7 Process pools

7.1 Multi-process vs multi-thread

1. Given multithreading (threading), why use multiprocessing?
Because the GIL keeps threads from using more than one core: for CPU-bound work, threads give no speedup (see 7.2), while separate processes each have their own interpreter and GIL and run truly in parallel.
2. Multiprocessing vs. multithreading
Processes cost more to create and switch and must communicate through inter-process mechanisms, but they use all cores; threads are cheaper and share memory, but are serialized by the GIL.

7.2 Speed comparison on CPU-bound computation

A prime number is a natural number greater than 1 that is divisible only by 1 and itself; any other natural number greater than 1 is composite (1 is defined to be neither prime nor composite).
The CPU-bound workload: judge whether a large number is prime, 100 times.

import math
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor
import time

PRIMES = [112272535095293] * 100


def is_prime(n):
    if n < 2:  # 0 and 1 are not prime
        return False
    for i in range(2, int(math.sqrt(n))+1):
        if n % i == 0:
            return False
    return True


def single_thread():
    for number in PRIMES:
        is_prime(number)


def multi_thread():
    with ThreadPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)


def multi_process():
    with ProcessPoolExecutor() as pool:
        pool.map(is_prime, PRIMES)


if __name__ == "__main__":
    start = time.time()
    single_thread()
    end = time.time()
    print("single thread, cost:", end - start, "seconds")

    start = time.time()
    multi_thread()
    end = time.time()
    print("multi thread, cost:", end - start, "seconds")

    start = time.time()
    multi_process()
    end = time.time()
    print("multi process, cost:", end - start, "seconds")

Due to the GIL, the multi-threaded version is even slower than the single-threaded one, while multiprocessing speeds the computation up significantly.

7.3 Using a process pool to accelerate a web service

Example request: http://127.0.0.1:5000/is_prime/1001245678353,3257385365375634564,3432434345657677

import flask
from concurrent.futures import ProcessPoolExecutor
import math
import json


app = flask.Flask(__name__)


def is_prime(n):
    if n < 2:  # 0 and 1 are not prime
        return False
    for i in range(2, int(math.sqrt(n))+1):
        if n % i == 0:
            return False
    return True


@app.route("/is_prime/<numbers>")
def api_is_prime(numbers):
    number_list = [int(x) for x in numbers.split(",")]
    results = process_pool.map(is_prime, number_list)
    return json.dumps(dict(zip(number_list, results)))


if __name__ == "__main__":
    # create the pool under __main__ so child processes can safely
    # re-import this module (required by the spawn start method)
    process_pool = ProcessPoolExecutor()
    app.run()

8 Concurrent crawling with asynchronous IO

8.1 Coroutine principle

import asyncio
import aiohttp
import blog


# async declares a coroutine function
# await suspends the coroutine on an awaitable object
async def async_craw(url):
    print("craw url: ", url)
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as resp:
            result = await resp.text()
            print(f"craw url: {url}, {len(result)}")

# Get the event loop
loop = asyncio.get_event_loop()

# Create a task for each URL
tasks = [
    loop.create_task(async_craw(url))
    for url in blog.urls]

import time

start = time.time()
# Run until all crawl tasks complete
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print("use time seconds: ", end - start)

8.2 Using a semaphore to limit crawler concurrency

import asyncio
import aiohttp
import blog

semaphore = asyncio.Semaphore(10)  # allow at most 10 concurrent fetches


async def async_craw(url):
    async with semaphore:
        print("craw url: ", url)
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as resp:
                result = await resp.text()
                await asyncio.sleep(2)
                print(f"craw url: {url}, {len(result)}")


loop = asyncio.get_event_loop()

tasks = [
    loop.create_task(async_craw(url))
    for url in blog.urls]

import time

start = time.time()
loop.run_until_complete(asyncio.wait(tasks))
end = time.time()
print("use time seconds: ", end - start)

Appendix: The ThreadPoolExecutor thread pool

Common functions

When a function is submitted to the thread pool, a Future object is created and returned automatically. The Future records the function's execution state (pending, running, or finished), and when the function finishes, future.set_result is called to store its return value.
(1) When creating the pool you may pass max_workers to cap the number of threads. If omitted, a bounded default is used (min(32, os.cpu_count() + 4) since Python 3.8), so the pool will not spawn one thread per submitted call; it is still good practice to set the capacity explicitly for a known workload.

(2) Functions are handed to the pool with submit; once submitted they begin running on a pool thread, and the main thread continues. submit takes the function followed by its positional and keyword arguments.

(3) The future acts as a container holding the execution state of the function.

(4) When the function finishes, its return value is set into the future; once future.set_result has run, the function is complete and the caller can retrieve the value with result().

from concurrent.futures import ThreadPoolExecutor
import time


def task(name, n):
    time.sleep(n)
    return f"{name} slept for {n} seconds"


executor = ThreadPoolExecutor()
future = executor.submit(task, "reader", 3)

print(future)  # <Future at 0x7fbf701726d0 state=running>
print(future.running())  # is the function still running? True
print(future.done())  # has it finished? False

time.sleep(3)  # the main program sleeps 3 seconds too; by now the task has finished

print(future)  # <Future at 0x7fbf701726d0 state=finished returned str>
print(future.running())  # False
print(future.done())  # True

print(future.result())

Adding a callback

Note that future.result() blocks: it returns the function's return value, so it must wait until the function has finished and the value has been stored into the future via set_result.

A future has two protected attributes, _result and _state. _result holds the function's return value; future.result() essentially returns the value of _result. _state tracks the execution state: initially PENDING, RUNNING while executing, and FINISHED on completion.

When future.result() is called, it checks _state; while the function is still executing it waits, and once _state is FINISHED it returns the value of _result.

executor = ThreadPoolExecutor()
future = executor.submit(task, "reader", 3)
start = time.perf_counter()
future.result()
end = time.perf_counter()
print(end - start)  # 3.009

Because we do not know when the function will finish, the cleanest approach is to bind a callback that fires automatically on completion.
Note that once submit has been called, the function is already running in the pool; whether or not it has finished, we can bind a callback to its future.

If the callback is added before the function completes, it fires when the function completes.
If the callback is added after the function has completed, the future already holds a value (set_result has run), so the callback fires immediately.

from concurrent.futures import ThreadPoolExecutor
import time


def task(name, n):
    time.sleep(n)
    return f"{name} 睡了 {n} 秒"


def callback(f):
    print("callback fired:", f.result())


executor = ThreadPoolExecutor()
future = executor.submit(task, "self sleeper", 3)
# time.sleep(5)  # adding the callback after completion would fire it immediately
# bind the callback; it fires automatically once the task finishes (~3 seconds)
future.add_done_callback(callback)

If we need many threads to execute functions, a thread pool is the natural tool: each call takes a thread from the pool, and when the function finishes, the thread is returned to the pool for other functions to use. If the pool is empty and no new idle thread can be created, the next function simply waits.

Appendix: The ProcessPoolExecutor process pool

concurrent.futures implements not only thread pools but also process pools, and the two share the same API; in practice, though, process pools are created far less often.
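
Because the API mirrors ThreadPoolExecutor, the earlier example carries over almost verbatim; a minimal sketch:

from concurrent.futures import ProcessPoolExecutor
import time


def task(name, n):
    time.sleep(n)
    return f"{name} slept for {n} seconds"


if __name__ == "__main__":  # required: child processes re-import this module
    with ProcessPoolExecutor() as executor:
        future = executor.submit(task, "worker process", 1)
        print(future.result())  # blocks until the child process finishes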

Source: blog.csdn.net/qq_20466211/article/details/130687063