Lecture 05: Multithreading for Python crawlers

1. Introduction

We know that a computer can run many programs at once: we can browse the web, listen to music, and type at the same time, and it all seems perfectly normal. But think about it: how can a computer run so many programs at the same time? The answer involves two important concepts in computing: multiprocessing and multithreading.

Similarly, when writing a crawler, we may want to run several crawl tasks at the same time to improve crawling efficiency. This, too, requires some knowledge of multiprocessing and multithreading.

In this lecture, we first look at the basic principles of multithreading and at how to implement multiple threads in Python.

2. What multithreading means

To talk about multithreading, we have to start with what a thread is. And to understand what a thread is, we have to start with what a process is.

A process can be understood as a unit of a program that can run independently. For example:

Opening a browser starts a browser process; opening a text editor starts a text editor process. A single process can nevertheless handle many things at once. In a browser, for instance, we can open several pages in different tabs: some play music, some play video, some play animations, and they all run at the same time without disturbing one another. How can so many tasks run at once? This is where the concept of a thread comes in: each of these tasks actually corresponds to one thread of execution.

And the process? It is a collection of threads. A process is made up of one or more threads, and a thread is the smallest unit the operating system schedules, that is, the smallest unit of execution within a process. In the browser process mentioned above, playing music is one thread, playing video is another, and many other threads run alongside them. The concurrent or parallel execution of these threads is what lets the browser as a whole run so many tasks at once.

Once the concept of a thread is clear, multithreading is easy to understand: multithreading means that a process executes multiple threads at the same time. The browser scenario above is a typical example of multithreaded execution.

3. Concurrency and parallelism

When talking about multiprocessing and multithreading, we need to explain two more concepts: concurrency and parallelism. We know that when a program runs on a computer, it is ultimately carried out by the processor executing instructions one at a time.

① Concurrency

Concurrency means that only one instruction can execute at any given moment, but the instructions of several threads are executed in rapid rotation. For example, a processor executes instructions of thread A for a while, then instructions of thread B for a while, and then switches back to thread A.

Because the processor executes instructions and switches between threads extremely fast, people do not perceive that the computer is switching context between threads, so at the macro level the threads appear to run simultaneously. At the micro level, however, the processor is merely switching between threads continuously: each thread occupies the processor for one time slice, and at any given moment only one thread is executing.
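The rotation described above can be imitated with a toy scheduler. This is only a sketch: the names task and round_robin are made up for illustration, the "threads" are ordinary generators, and each yield stands for the end of one time slice. A real operating system scheduler is far more involved.

```python
from collections import deque

def task(name, steps):
    """A toy 'thread': each yield marks the end of one time slice."""
    for i in range(1, steps + 1):
        yield f'{name}{i}'

def round_robin(tasks):
    """Run one slice of each task in turn on a single simulated processor."""
    ready = deque(tasks)
    order = []
    while ready:
        current = ready.popleft()
        try:
            order.append(next(current))  # only this task runs right now
            ready.append(current)        # then it goes to the back of the queue
        except StopIteration:
            pass                         # task finished, drop it
    return order

schedule = round_robin([task('A', 3), task('B', 2)])
print(schedule)  # ['A1', 'B1', 'A2', 'B2', 'A3']
```

Only one task advances at any step, yet both make progress over time, which is exactly the macro-level illusion of simultaneity.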

② Parallelism

Parallelism means that at the same moment, multiple instructions execute simultaneously on multiple processors, so it necessarily relies on having more than one processor. At both the macro and the micro level, the threads really are executing at the same time.

Parallelism can exist only on a multiprocessor system; if a computer has just one processor core, parallelism is impossible. Concurrency, on the other hand, can exist on both single-processor and multiprocessor systems, because a single core is enough to achieve it.

For example, suppose the system needs to run several threads at once. If it has only one processor core, it can run them only concurrently. If it has several cores, then while one core executes one thread, another core can execute another thread, so those two threads run in parallel. Other threads may of course share a core with one of them, in which case the threads on that core run concurrently with each other. The exact arrangement depends on how the operating system schedules them.

4. Application scenarios for multithreading

During a program's execution, some operations are time-consuming or involve waiting, such as waiting for a database query to return or waiting for a web page to respond. With a single thread, the processor must wait for such an operation to finish before moving on, even though it could clearly be doing other work while the thread waits. With multiple threads, the processor can execute other threads while one thread is waiting, which improves overall efficiency.

In scenarios like the one above, a thread spends much of its execution time waiting. A web crawler is a very typical example: after the crawler sends a request to the server, it has to wait for some time for the response to come back. Tasks of this kind are IO-intensive tasks. For them, enabling multithreading lets the processor handle other tasks while one thread waits, which improves the overall crawling efficiency.

But not all tasks are IO-intensive. Another kind is the compute-intensive task, also called a CPU-intensive task. As the name suggests, such a task needs the processor the whole time it runs. If we enable multithreading here, the processor merely switches from one compute-intensive task to another; it never gets to rest and is always busy computing. This saves no time overall, because the total amount of computation stays the same. If there are too many threads, extra time is spent on switching between them, and overall efficiency actually drops.

So, as long as the tasks are not all compute-intensive, we can use multithreading to improve the overall efficiency of a program. For IO-intensive tasks such as web crawling in particular, multithreading greatly improves overall crawling efficiency.
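The effect on waiting-heavy work can be seen in a small timing sketch. It assumes that time.sleep is a fair stand-in for waiting on a server; fake_request and DELAYS are hypothetical names, and no real network request is made.

```python
import threading
import time

def fake_request(delay):
    """Stand-in for a network request: the thread just waits."""
    time.sleep(delay)

DELAYS = [0.2] * 5  # five simulated requests of 0.2 s each

# One after another: the waits add up.
start = time.perf_counter()
for d in DELAYS:
    fake_request(d)
sequential = time.perf_counter() - start

# One thread per request: the waits overlap.
start = time.perf_counter()
threads = [threading.Thread(target=fake_request, args=(d,)) for d in DELAYS]
for t in threads:
    t.start()
for t in threads:
    t.join()  # wait here until this thread has finished
threaded = time.perf_counter() - start

print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```

On a typical machine the sequential run takes about 1 second while the threaded run takes roughly as long as a single request, though exact timings vary.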

5. Multithreading in Python

In Python, the module that implements multithreading is called threading, and it ships with Python. Below we look at how to use threading to implement multithreading. (A general understanding is enough for now; we will go into detail in the hands-on lessons.)

① Creating a child thread directly with Thread

First, we can create a thread with the Thread class. When creating it, specify the target parameter as the name of the function to run; if that function needs extra arguments, pass them through the args parameter of Thread. For example:

import threading, time

def target(second):
    print(f'Threading {threading.current_thread().name} is runing')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} ended')

print(f'Threading {threading.current_thread().name} is runing')

for i in [1, 5]:
    t = threading.Thread(target=target, args=[i])
    # t = threading.Thread(target=target, args=(i,))
    t.start()
print(f'Threading {threading.current_thread().name} is ended')


# Output
Threading MainThread is runing
Threading Thread-1 is runing
Threading Thread-1 sleep 1s
Threading Thread-2 is runing
Threading Thread-2 sleep 5s
Threading MainThread is ended
Threading Thread-1 ended
Threading Thread-2 ended

Here we first declare a function called target, which takes one parameter, second. Its body simply performs a time.sleep operation, with second as the number of seconds to sleep, and prints some text before and after. We get the name of the current thread through threading.current_thread().name: for the main thread the value is MainThread, and for a child thread it is Thread-*.

Then we create two threads with the Thread class. The target parameter is the function we just defined, and args is passed as a list. The loop runs twice, with i equal to 1 and then 5, so the two threads sleep for 1 second and 5 seconds respectively. Once a thread is created, we call its start method to run it.

From the output we can see that three threads exist in total: the main thread MainThread and the two child threads Thread-1 and Thread-2. We can also see that the main thread finishes first; Thread-1 finishes about 1 second later and Thread-2 about 4 seconds after that. This shows that the main thread does not wait for the child threads to finish but runs straight through to the end of its own code, which is a little counterintuitive.
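One way to confirm that start returns immediately is to ask the thread whether it is still running. The following is a minimal sketch using Thread.is_alive; the variable names and the sleep durations are arbitrary choices for illustration.

```python
import threading
import time

def target(second):
    time.sleep(second)

t = threading.Thread(target=target, args=(0.3,))
t.start()                                # returns at once, does not wait for target
alive_right_after_start = t.is_alive()   # the child thread is still sleeping
time.sleep(0.6)                          # give the child thread time to finish
alive_later = t.is_alive()

print(alive_right_after_start, alive_later)
```

The main thread reaches the first is_alive call while the child thread is still asleep, which matches the ordering seen in the output above.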

In case this is still not clear, here is a version with a few more values in the loop:

import threading, time

def target(second):
    print(f'Threading {threading.current_thread().name} is runing')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} ended')

print(f'Threading {threading.current_thread().name} is runing') # --> 1. This is the main thread

for i in [1,3,5]:
    t = threading.Thread(target=target, args=[i])
    # t = threading.Thread(target=target, args=(i,))
    t.start()

print(f'Threading {threading.current_thread().name} is ended')# --> 2. After creating the child threads, the main thread finishes

[Screenshot of the final output]

If we want the main thread to wait for the child threads to finish before it exits, we can call the join method on each child thread object (as I understand it, join blocks the caller). This is done as follows:

for i in [1, 5]:
    t = threading.Thread(target=target, args=[i])
    t.start()
    t.join()
# Output
Threading MainThread is runing
Threading Thread-1 is runing
Threading Thread-1 sleep 1s
Threading Thread-1 ended
Threading Thread-2 is runing
Threading Thread-2 sleep 5s
Threading Thread-2 ended
Threading MainThread is ended

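This way, the main thread has to wait for the child threads to finish before it continues and ends. Note that because join is called right after each start inside the loop, Thread-2 is not even started until Thread-1 has finished, as the output shows.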
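If the goal is for the child threads to overlap while the main thread still waits for all of them, a common pattern is to start every thread first and join them afterwards. This is a sketch with shortened, arbitrary sleep times rather than a rewrite of the example above.

```python
import threading
import time

def target(second):
    time.sleep(second)

start = time.perf_counter()
threads = [threading.Thread(target=target, args=(s,)) for s in (0.2, 0.6)]
for t in threads:
    t.start()   # start all threads before waiting on any of them
for t in threads:
    t.join()    # now wait for each one in turn
elapsed = time.perf_counter() - start

print(f'elapsed: {elapsed:.2f}s')
```

The elapsed time should be close to the longest sleep (about 0.6 s) rather than the sum of both (0.8 s), because the two threads wait at the same time.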

② Creating a child thread by inheriting from Thread

We can also create a thread by inheriting from the Thread class; the code the thread should execute is written in the class's run method. The example above can be rewritten equivalently as:

import threading, time
class MyThread(threading.Thread):
    def __init__(self, second):
        threading.Thread.__init__(self)
        self.second = second
    def run(self):
        print(f'Threading {threading.current_thread().name} is runing')
        print(f'Threading {threading.current_thread().name} sleep {self.second}s')
        time.sleep(self.second)
        print(f'Threading {threading.current_thread().name} is ended')


print(f'Threading {threading.current_thread().name} is runing')

for i in [1, 5]:
    t = MyThread(i)
    t.start()
    t.join()
print(f'Threading {threading.current_thread().name} is ended')

# Output
Threading MainThread is runing
Threading Thread-1 is runing
Threading Thread-1 sleep 1s
Threading Thread-1 is ended
Threading Thread-2 is runing
Threading Thread-2 sleep 5s
Threading Thread-2 is ended
Threading MainThread is ended

As we can see, the two implementations behave the same when run.
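One practical reason to prefer the subclass form is that the thread object can carry state, for example a result computed in run. The class name SquareThread and its result attribute below are made up for this sketch; Thread itself does not provide a return value.

```python
import threading

class SquareThread(threading.Thread):
    """Hypothetical example: a thread that stores its result on itself."""

    def __init__(self, number):
        super().__init__()
        self.number = number
        self.result = None   # filled in by run()

    def run(self):
        self.result = self.number ** 2

threads = [SquareThread(n) for n in (2, 3, 4)]
for t in threads:
    t.start()
for t in threads:
    t.join()   # after join, run() has finished and result is set

results = [t.result for t in threads]
print(results)  # [4, 9, 16]
```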

③ Daemon threads

Threads have a concept called the daemon thread. Setting a thread as a daemon thread marks it as "unimportant": if the main thread (together with every other non-daemon thread) has finished while the daemon thread is still running, the daemon thread is forcibly terminated. In Python we can use the setDaemon method to set a thread as a daemon thread. (Since Python 3.10, setDaemon is deprecated in favor of assigning the daemon attribute or passing daemon=True to Thread.)

For example:

import threading, time

def target(second):
    print(f'Threading {threading.current_thread().name} is runing')
    print(f'Threading {threading.current_thread().name} sleep {second}s')
    time.sleep(second)
    print(f'Threading {threading.current_thread().name} is ended')

print(f'Threading {threading.current_thread().name} is runing')
t1 = threading.Thread(target=target, args=[2])
t1.start()
t2 = threading.Thread(target=target, args=[5])
t2.setDaemon(True) # --> this is the key line
t2.start()
print(f'Threading {threading.current_thread().name} is ended')

Here we use the setDaemon method to set t2 (which clearly has to sleep for 5 seconds) as a daemon thread, so that once the main thread has finished running, t2 ends along with the program instead of running to completion.

Output:

Threading MainThread is runing
Threading Thread-1 is runing
Threading Thread-1 sleep 2s
Threading Thread-2 is runing
Threading Thread-2 sleep 5s
Threading MainThread is ended
Threading Thread-1 is ended

As we can see, the exit message of Thread-2 is never printed: Thread-2 quit when the program exited.

Careful readers may have noticed that join is not called here. If we call join on both t1 and t2, the main thread still waits for each child thread to finish before exiting, whether or not it is a daemon thread.
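The last point can be checked with a short sketch. It uses the daemon=True constructor argument, which is equivalent to calling setDaemon(True) before start; the finished list is just a hypothetical way to record that the thread ran to completion.

```python
import threading
import time

finished = []

def target(second):
    time.sleep(second)
    finished.append(second)   # only reached if the thread is not cut short

t = threading.Thread(target=target, args=(0.2,), daemon=True)
t.start()
t.join()   # joining a daemon thread still waits for it to finish

print(t.daemon, finished)
```

Because of the join, the daemon thread completes its work before the main thread moves on, so the list is no longer empty by the time it is printed.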

6. Mutex locks (solving thread-unsafety)

The threads in a process share its resources. Suppose a process has a global variable count used for counting, and we create many threads that each add 1 to count when they run. Let's see what happens. The code is as follows:

import threading, time

count = 0

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global count
        temp = count + 1
        time.sleep(0.001)
        count = temp
threads = []
for _ in range(1000):
    thread = MyThread()
    thread.start()
    threads.append(thread)
for thread in threads:
    thread.join()
print(f'Final count: {count}')

Here we create 1000 threads. Each one reads the current value of the global variable count, sleeps for a short time, and then assigns count a new value.

By common sense, the final value of count should be 1000. But that is not what happens; let's run it and see.

Results are as follows:

Final count: 57

The final result is only 57, and the result differs from run to run and from one environment to another.

Why is this? Because count is shared, every thread reads its current value when it executes the line temp = count + 1. But the threads run concurrently or in parallel, so different threads may read the same value of count. Some of the increments are then lost, and the final result comes out too small.

So if multiple threads read or modify the same data at the same time, the results are unpredictable. To avoid this, the threads must be synchronized, and we can achieve that by protecting the data with a lock, using threading.Lock.

What does protecting with a lock mean? Before a thread operates on the data, it must acquire the lock. Other threads that find the lock already held cannot proceed and wait until it is released. Only after the holding thread releases the lock can another thread acquire it, modify the data, and release the lock in turn. This guarantees that only one thread operates on the data at any moment, so multiple threads never read and modify the same data simultaneously, and the final result is correct.

We can change the code as follows:

import threading, time

count = 0

class MyThread(threading.Thread):
    def __init__(self):
        threading.Thread.__init__(self)

    def run(self):
        global count
        lock.acquire() # acquire the lock before computing
        temp = count + 1
        time.sleep(0.001)
        count = temp
        lock.release() # release the lock after computing

lock = threading.Lock()
threads = []
for _ in range(1000):
    thread = MyThread()
    thread.start()
    threads.append(thread)
for thread in threads:
    thread.join()
print(f'Final count: {count}')

Here we declare a lock object, which is simply an instance of threading.Lock. Inside the run method we acquire the lock before reading count and release it after modifying count, so multiple threads never read and modify count at the same time.

Results are as follows:

Final count: 1000

The result is now as expected.
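A Lock can also be used as a context manager, which releases it automatically even if the protected code raises an exception. The sketch below is a variation on the example above, not the original code: the Counter class is hypothetical, and it uses 100 threads instead of 1000 only to keep the run short.

```python
import threading
import time

class Counter:
    """Hypothetical shared counter protected by a lock."""

    def __init__(self):
        self.value = 0
        self.lock = threading.Lock()

    def increment(self):
        with self.lock:          # acquire on entry, release on exit
            temp = self.value + 1
            time.sleep(0.001)    # same artificial delay as in the example above
            self.value = temp

counter = Counter()
threads = [threading.Thread(target=counter.increment) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(f'Final count: {counter.value}')  # Final count: 100
```

The with form is usually preferred over separate acquire and release calls because a forgotten release would leave every other thread waiting forever.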

That is all we will cover on Python multithreading for now. For more on threading, such as semaphores and queues, see the official documentation: https://docs.python.org/zh-cn/3.7/library/threading.html#module-threading.
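As a taste of the queue-based style mentioned here, the following sketch hands work to a few worker threads through queue.Queue from the standard library. The doubling step is a placeholder for real work such as fetching a page, and the use of None as a stop signal is one common convention, not the only one.

```python
import queue
import threading

tasks = queue.Queue()
results = []
results_lock = threading.Lock()

def worker():
    while True:
        item = tasks.get()            # blocks until an item is available
        if item is None:              # sentinel: no more work for this worker
            tasks.task_done()
            break
        with results_lock:
            results.append(item * 2)  # placeholder for real work
        tasks.task_done()

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

for n in range(10):
    tasks.put(n)
for _ in workers:
    tasks.put(None)                   # one sentinel per worker

tasks.join()                          # wait until every item is processed
for w in workers:
    w.join()

print(sorted(results))
```

A fixed pool of workers like this keeps the number of threads bounded no matter how many items are queued, which avoids the switching overhead of creating one thread per task.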

7. Problems with multithreading in Python

Because of the GIL in Python (more precisely, in the standard CPython interpreter), only one thread can run at any moment, on single-core and multi-core machines alike, so Python multithreading cannot take advantage of multi-core parallelism.

GIL stands for Global Interpreter Lock; it was originally designed with data safety in mind.

In Python multithreading, each thread executes as follows:

step1: acquire the GIL
step2: execute the thread's code
step3: release the GIL

As we can see, a thread that wants to run must first acquire the GIL. We can think of the GIL as a pass, and there is only one GIL per Python process. A thread that cannot get the pass is not allowed to execute. As a result, even on a multi-core machine, only one of the threads in a Python process can execute at any moment.

For IO-intensive tasks such as crawling, though, this is not much of a problem, because a thread releases the GIL while it waits on IO. For compute-intensive tasks, because of the GIL, multithreading may actually run slower overall than a single thread.
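A rough way to observe this is to time the same pure-Python computation with and without threads. This is only a sketch: count_down and N are arbitrary choices, and the timings depend heavily on the machine and the Python build, so no particular numbers should be expected.

```python
import threading
import time

def count_down(n):
    """Pure computation: the processor is busy the whole time."""
    while n > 0:
        n -= 1

N = 2_000_000

# Two runs, one after the other, in the main thread.
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# The same two runs, each in its own thread.
start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

print(f'sequential: {sequential:.2f}s, threaded: {threaded:.2f}s')
```

With the GIL in place, the threaded version typically takes about as long as the sequential one, or slightly longer, since the two threads take turns instead of running in parallel.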

Most of this content comes from teacher Cui Qingcai's crawler course materials, with a small part being my own understanding. This series is my class notes; if anything is wrong, please contact me. Thank you!


Origin blog.csdn.net/caiyongxin_001/article/details/104888210