In-depth understanding: threads, processes, coroutines, parallelism, and concurrency

Crawler concurrency control:

Multi-process, multi-thread, and coroutines (yield)


From hardware:

Dual core, four threads (Hyper-Threading Technology):
There are two physical CPU cores, each with two logical processors, so the OS sees the equivalent of four CPUs


Four cores, four threads:
There are four physical CPU cores, each with one logical processor, which is likewise equivalent to four CPUs


From the operating system:

Processes and threads are the execution units of CPU tasks.


Process: early operating systems were process-oriented:
a process represents the execution activity of a program (open, execute, save, close)


Thread: modern operating systems are thread-oriented:
a thread is the smallest scheduling unit a process uses to handle tasks (execute function a, execute function b)


A program starts at least one process, and a process has at least one thread.

 

Each process has an independent memory space; no state is shared between processes.
Inter-process communication must go through the operating system, so communication efficiency is low and switching overhead is large.


Threads in the same process share one memory space, so switching overhead is low and communication efficiency is high.
Thread scheduling is preemptive, so threads race on shared data, which makes that data unsafe.
Hence the "mutex lock": a mechanism that lets multiple threads access shared memory safely and in an orderly way.
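A minimal sketch of the mutex idea, using only the standard library: two threads increment a shared counter, and threading.Lock keeps each read-modify-write step atomic.

```python
import threading

counter = 0
lock = threading.Lock()

def increment(n):
    """Add 1 to the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:                 # only one thread may run this block at a time
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 200000 with the lock; without it, updates may be lost
```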

 

Python's multithreading:
CPython has the GIL (Global Interpreter Lock), which guarantees that only one thread executes Python bytecode at any moment.
Advantage: it sidesteps the competition problem between threads inside the interpreter.
Disadvantage: Python's multithreading is not true (parallel) multithreading.

When the Python interpreter enters a blocking I/O call, it releases the GIL.
Without I/O, a thread still releases the GIL periodically so other threads can try to run: every sys.getcheckinterval() bytecode instructions in Python 2, or after sys.getswitchinterval() seconds in Python 3.2+.
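The Python 3 switch interval can be inspected and tuned at runtime; a small sketch:

```python
import sys

# In CPython 3.2+, a CPU-bound thread is asked to release the GIL after this
# many seconds (default 0.005); this replaced Python 2's sys.getcheckinterval().
print(sys.getswitchinterval())

sys.setswitchinterval(0.001)   # switch more often: fairer, but more overhead
print(sys.getswitchinterval())
```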

 


Parallelism:
Within the same time slice, the CPU (using multiple cores) can process multiple programs at literally the same time; multiple programs run simultaneously.

Program 1: ----------------
Program 2: ----------------
Program 3: ----------------
Program 4: ----------------

 

Concurrency:
Within the same time slice, only one program is being processed at any instant; multiple programs take turns executing.

Program 1: -----              ------
Program 2:      -----
Program 3:           ----
Program 4:               -----


Multi-process: makes full use of multi-core CPU resources; suitable for CPU-intensive tasks (large amounts of parallel computation)
Python's multi-process module: multiprocessing

Inter-process communication is expensive and switching overhead is high, so multi-process is a poor fit for tasks that need heavy data exchange and frequent switching (such as crawlers)

Design pattern: producer-consumer (a parallel pattern)
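The producer-consumer pattern can be sketched with the standard library; shown here with threads and queue.Queue for brevity, but the same shape works with multiprocessing.Process and multiprocessing.Queue:

```python
import queue
import threading

q = queue.Queue(maxsize=10)      # bounded buffer between producer and consumer
consumed = []

def producer():
    for i in range(5):
        q.put(i)                 # blocks if the queue is full
    q.put(None)                  # sentinel: no more items

def consumer():
    while True:
        item = q.get()           # blocks if the queue is empty
        if item is None:
            break
        consumed.append(item)

threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(consumed)  # [0, 1, 2, 3, 4]
```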

 

Multithreading: suitable for I/O-intensive tasks (disk I/O, memory I/O, network I/O), with low switching overhead and low communication cost.
Multithreading in Python: the threading module (Thread class) and multiprocessing.dummy (the multiprocessing API backed by threads)

Multithreading: only one thread executes within a given time slice, so multi-core CPU resources cannot be fully used (concurrency only, not parallelism)
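multiprocessing.dummy, mentioned above, exposes the multiprocessing Pool API but backed by threads, which suits I/O-bound work. A sketch with a simulated network delay (the sleep and URLs stand in for real requests):

```python
import time
from multiprocessing.dummy import Pool   # thread pool with the Pool API

def fetch(url):
    """Stand-in for a network request: sleeps instead of downloading."""
    time.sleep(0.1)
    return (url, "ok")

urls = [f"https://example.com/page/{i}" for i in range(8)]

start = time.perf_counter()
with Pool(8) as pool:                    # 8 threads overlap their waits
    results = pool.map(fetch, urls)
elapsed = time.perf_counter() - start

print(len(results))   # 8
# elapsed is close to 0.1s rather than 0.8s, because the waits overlap
```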

 

Coroutine: the operating system and CPU know nothing about coroutines; switching is controlled entirely by the programmer's code logic.
The key feature is that multiple tasks execute inside a single thread, with no need for the operating system to switch between them (no switching overhead and no locks to manage), so execution efficiency is high.
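The "coroutine yield" idea from the top of these notes can be shown with plain generators: a tiny round-robin scheduler that switches tasks at each yield, with no OS threads involved. The task names and step counts are illustrative.

```python
from collections import deque

log = []

def task(name, steps):
    """A cooperative task: it yields control back to the scheduler each step."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield                      # voluntary switch point: no OS scheduling

def run(tasks):
    """Round-robin scheduler: resume each task until all have finished."""
    ready = deque(tasks)
    while ready:
        t = ready.popleft()
        try:
            next(t)                # run until the task's next yield
            ready.append(t)        # still alive: back of the queue
        except StopIteration:
            pass                   # task finished

run([task("a", 2), task("b", 2)])
print(log)  # ['a:0', 'b:0', 'a:1', 'b:1']
```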

Python: gevent with monkey patching (patched Python code automatically switches coroutines when it hits blocking network I/O)

Coroutines: suitable for intensive network I/O tasks

 

Multi-process crawler: not suitable (crawling is network-I/O-bound, and inter-process communication and switching are expensive)

Multi-threaded crawler:
Disadvantage: threads are scheduled by the operating system, so there is thread-switching overhead (with massive numbers of URLs this drives up CPU load).
Advantage: a wide range of usage scenarios (concurrent network reads/writes, database reads/writes, and disk reads/writes).

Coroutine crawler:
Disadvantage: gevent with monkey.patch_all() only improves network concurrency; it cannot handle other concurrent scenarios.
Advantage: controlled by the programmer's code logic rather than the operating system scheduler, so there is no switching overhead and CPU load stays low (a clear advantage when processing massive numbers of URLs).

 


Execution method:
Synchronous: a task must wait for the previous task to finish before it runs (a crawler with no concurrency)
Asynchronous: a task runs without waiting for the previous task to finish (a concurrent crawler)

Program status:
Blocking: When the program executes, it must wait for the task to complete, otherwise it remains in the waiting state.
Non-blocking: When the program executes, it does not have to wait for the task to complete, and can continue to execute the next task.

Asynchronous + non-blocking (highest efficiency):
After sending a request, the program can go on to handle other requests without waiting for the response; when a handler would block, execution immediately switches to another handler.
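The gevent approach above patches blocking calls; as a stdlib illustration of the same asynchronous, non-blocking idea, asyncio suspends a coroutine at each await and immediately switches to another. The sleep here is a stand-in for a network round trip.

```python
import asyncio
import time

async def fetch(i):
    """Pretend network request: awaiting the sleep suspends this coroutine."""
    await asyncio.sleep(0.1)      # while one request waits, others run
    return i

async def main():
    # Fire all requests without waiting for the previous one to return.
    return await asyncio.gather(*(fetch(i) for i in range(10)))

start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start

print(results)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
# elapsed stays near 0.1s for all 10 requests, because they overlapped
```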



Asynchronous networking frameworks: Twisted, Tornado

Scrapy: request processing module + response parsing module + Twisted
scrapy-redis: Scrapy + Redis (request deduplication, request allocation, and data storage all handled in the same Redis database)

Stand-alone crawler: Scrapy
Distributed crawler: scrapy-redis

 

Speed hierarchy, fastest to slowest: CPU -> Registers -> CPU Cache L1/L2/L3 -> Memory -> HDD/SSD -> Network
