How do high-performance servers achieve high performance and high concurrency?

Keywords: thread | process | coroutine | thread pool | synchronous | asynchronous

At present, with the rollout of China's "Eastern Data, Western Computing" project, the era of large-scale computing power has fully arrived. With the rapid development of machine learning and deep learning, the concept of the high-performance server is no longer unfamiliar. As data analysis and data mining workloads keep growing, traditional air cooling can no longer meet the heat-dissipation requirements, which is driving the adoption of liquid cooling to achieve energy savings, lower emissions, quieter operation, and higher efficiency.

As a domestic server manufacturer, Blue Ocean Brain's liquid-cooled GPU servers offer large-scale parallel processing capability and great flexibility. They are primarily used to provide sufficient processing power for compute-intensive applications. The advantage of this architecture is that the CPU runs the application code while the graphics processing unit (GPU) handles the compute-intensive work on its massively parallel architecture. GPU servers are well suited to remote-sensing mapping, pharmaceutical R&D, life sciences, and high-performance computing.

This article will give you a comprehensive introduction to the technologies involved in high-performance GPU servers and how to build them.

Threads and thread pools

The following starts from the CPU and works up to the commonly used thread pool: from bottom to top, from hardware to software.

1. CPU

You may wonder why a discussion of multithreading starts with the CPU. In fact, the CPU has no notion of threads or processes. All the CPU does is fetch an instruction from memory, execute it, then fetch the next instruction, over and over.

1. Where does the CPU fetch instructions from?

From the address held in the program counter (PC), the register we are all familiar with. Don't think of registers as anything mysterious; you can simply think of a register as memory with much faster access.

2. What is stored in the PC register?

The address in memory of the instruction (the next instruction the CPU will execute)

3. Who will change the instruction address in the PC register?


Since the CPU executes instructions one after another in sequence most of the time, the address in the PC register is, by default, simply advanced to the next instruction. But when an if/else or other branch is encountered, this sequential execution is broken: to jump to the correct instruction, the CPU dynamically updates the value in the PC register according to the result of the comparison.

4. How is the initial value in the PC set?

The instructions executed by the CPU come from memory; the instructions in memory are loaded from the executable program stored on disk; the executable on disk is produced by the compiler; and the compiler generates machine instructions from the functions we define. When the program is loaded, the address of its entry function is what gets written into the PC.

2. From the CPU to the operating system


From the above we understand how the CPU works. If you want the CPU to execute a certain function, you only need to load the address of the function's first machine instruction into the PC register; in principle the CPU can then execute the program even without an operating system. Although feasible, this is a very cumbersome process: (1) find a suitably sized region of memory and load the program into it; (2) find the function's entry point, set the PC register, and let the CPU start executing.

Since the machine instruction needs to be loaded into the memory for execution, it is necessary to record the starting address and length of the memory; at the same time, it is necessary to find the entry address of the function and write it into the PC register.

The data structure is roughly as follows:


struct *** {
    void* start_addr;
    int len;
    void* start_point;
    ...
};

3. From single-core to multi-core, how to make full use of multi-core

If a program needs to make full use of multiple cores, it will encounter the following problems:

1. Each process occupies its own memory space. If multiple processes run the same executable program, the contents of their memory regions are almost identical, which is an obvious waste of memory;

2. When the tasks a computer handles become more complicated, inter-process communication is involved. Because the processes live in different address spaces, inter-process communication has to go through the operating system, which makes programming harder and adds system overhead.

4. From process to thread

A process is simply an area of memory that stores the machine instructions the CPU executes and the stack information needed while its functions run. To make the process run, the address of the first machine instruction of its main function is written into the PC register.



The limitation of a process is that it has only one entry function (the main function), so the machine instructions in the process can only be driven by one CPU. Is there a way to let multiple CPUs execute machine instructions belonging to the same process? By the same reasoning as before, the address of any function's first instruction can be written into a PC register; the main function is no different from other functions, and its only special feature is that it is the first function the CPU executes.

When we point a PC register at a non-main entry function, a thread is born.



So far, there can be multiple entry functions in one process, which means that machine instructions belonging to the same process can be executed by multiple CPUs at the same time.



Multiple CPUs can simultaneously execute multiple entry functions belonging to the process under the same roof (the memory area occupied by the process). The operating system maintains a bunch of information for each process, which is used to record the memory space of the process, etc., and this bunch of information is recorded as data set A. Similarly, the operating system also maintains a bunch of information for the thread, which is used to record the entry function or stack information of the thread, etc., and this bunch of data is recorded as data set B.

Obviously, data set B is smaller than data set A. A thread runs inside the address space of its process; that address space was already created when the program started, whereas threads are created later, while the program is running (after the process has started). So by the time a thread starts running, the address space already exists, and the thread can simply use it.

It is worth mentioning that with the concept of threads, all CPUs can be busy by creating multiple threads after the process is started. This is the root of the so-called high performance and high concurrency.

Another point worth noting: because threads share the process's memory address space, communication between threads does not need the operating system's help, which brings convenience to programmers but also brings problems. Most of the problems encountered with multithreading stem from the fact that inter-thread communication is so convenient that it is very error-prone. The root cause is that the CPU has no concept of threads when it executes instructions; the mutual exclusion and synchronization problems of multithreaded programming have to be solved by the programmer.

The last thing to note is that although multiple CPUs were used in the figures above to explain threads, multiple cores are not required in order to use multiple threads. Threads can be created even on a single core, because threads are implemented at the operating-system level and have nothing to do with how many cores there are; when the CPU executes machine instructions, it is not aware of which thread they belong to. Even with only one CPU, the operating system can make every thread advance "simultaneously" through thread scheduling: the CPU's time slices are handed back and forth between threads, so multiple threads appear to run at the same time, even though only one thread is actually running at any instant.

5. Threads and memory


The relationship between the thread and the CPU was introduced earlier, that is, the PC register of the CPU is pointed to the entry function of the thread, so that the thread can run.

Regardless of the programming language used, creating a thread is largely the same:

// Set the thread entry function to DoSomething
thread = CreateThread(DoSomething);

// Start the thread
thread.Run();
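For reference, a minimal concrete version of the same idea in C with POSIX threads might look like the sketch below; the entry-function name do_something is just an illustrative choice.

#include <pthread.h>
#include <stdio.h>

// Thread entry function: the address of this function is what ends up
// driving the new thread's execution flow.
void* do_something(void* arg) {
    printf("running in a new thread\n");
    return NULL;
}

int main(void) {
    pthread_t tid;
    // Create the thread and point it at the entry function.
    pthread_create(&tid, NULL, do_something, NULL);
    // Wait for the thread to finish.
    pthread_join(tid, NULL);
    return 0;
}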


The data generated while a function executes includes the function's parameters, local variables, return address, and so on. This information is stored on the stack. Before the concept of a thread appeared, a process had only one execution flow and therefore only one stack, and the bottom of that stack was the process's entry function, i.e. the main function.

Suppose the main function calls funcA, and funcA calls funcB, as shown in the figure:

With threads, a process has multiple entry points and therefore multiple execution flows at the same time. A process with a single execution flow needs one stack to save its runtime information; obviously, with multiple execution flows, multiple stacks are needed to save the information of each flow. In other words, the operating system must allocate a stack for each thread inside the process's address space: each thread has its own stack, and being aware of this is extremely important. It also means that creating threads consumes the process's memory space.

6. The use of threads


From the perspective of life cycle, there are two types of tasks to be processed by threads: long tasks and short tasks.

1. Long-lived tasks

As the name suggests, these are tasks that stay alive for a long time. Take the commonly used Word as an example: the text edited in Word has to be saved to disk, and writing that data to disk is such a task. A good approach is to create a dedicated disk-writing thread whose life cycle matches the Word process: it is created when Word is opened and destroyed when the user closes Word. That is a long task, and long tasks are well suited to dedicated threads that handle one specific job.

2. Short-lived tasks

These are tasks whose processing time is short, such as one network request or one database query, which can be finished quickly. Short tasks are therefore common in all kinds of servers: web servers, database servers, file servers, mail servers, and so on. This scenario has two characteristics: each task takes little time to process, and the number of tasks is huge.


Creating a dedicated thread per task works fine for long tasks. For large numbers of short tasks the approach is simple to implement, but it has drawbacks:

1) Threads are an operating-system concept, so they have to be created with the operating system's help, and creating and destroying threads takes time;

2) Each thread needs its own independent stack, so creating a large number of threads consumes too much memory and other system resources.

It is like a factory owner with a pile of orders: every time a batch of orders arrives he hires a batch of workers, the products are simple and the workers finish them quickly, and once the batch is done he lets all the workers go, hiring again when the next order arrives — ten hours of recruiting for five minutes of work. Unless you are determined to drive the company into the ground, you probably would not run it this way. A better strategy is to hire a group of workers and keep them on hand: when orders arrive they process them, and when there are none, everyone just waits.

This is the origin of the thread pool.

7. From multithreading to thread pools

A thread pool is nothing more than creating a batch of threads up front and never releasing them; tasks are submitted to these threads for processing, so there is no need to create and destroy threads frequently. And because the number of threads in the pool is usually fixed, it will not consume too much memory.

8. How does the thread pool work?

Generally speaking, the task submitted to the thread pool includes two parts: the data to be processed and the function to process the data.

Pseudocode description:

struct task {
    void* data;     // the data the task carries
    handler handle; // the function that processes the data
};

The threads in the pool block on a task queue. When a producer writes a task into the queue, one of the pool's threads is woken up; it takes the task struct (or object) out of the queue, takes its data as the parameter, and calls the processing function.


The pseudocode is as follows:

while(true) {
    struct task* t = GetFromQueue(); // take a task from the queue
    t->handle(t->data);              // process its data
}
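Putting the two pieces of pseudocode together, a minimal thread pool in C with POSIX threads could be sketched roughly as follows. This is a simplified illustration, not a production implementation: the queue is a fixed-size ring buffer, and error handling and the queue-full case are omitted.

#include <pthread.h>

// A task: the data plus the function that processes it.
typedef void (*handler)(void*);
struct task {
    void*   data;
    handler handle;
};

#define QUEUE_CAP   64
#define NUM_WORKERS 4

static struct task queue[QUEUE_CAP];
static int head = 0, tail = 0, count = 0;
static pthread_mutex_t lock      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  not_empty = PTHREAD_COND_INITIALIZER;

// Producer side: put a task into the queue and wake one worker.
void submit(handler h, void* data) {
    pthread_mutex_lock(&lock);
    queue[tail] = (struct task){ .data = data, .handle = h };
    tail = (tail + 1) % QUEUE_CAP;
    count++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

// Worker side: block on the queue, take a task, run its handler.
static void* worker(void* arg) {
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&not_empty, &lock); // sleep until a task arrives
        struct task t = queue[head];
        head = (head + 1) % QUEUE_CAP;
        count--;
        pthread_mutex_unlock(&lock);
        t.handle(t.data);                         // process the data
    }
    return NULL;
}

// Create the workers once; they are reused instead of created per task.
void thread_pool_start(void) {
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_t tid;
        pthread_create(&tid, NULL, worker, NULL);
    }
}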

9. The number of threads in the thread pool


We all know that too few threads in the thread pool cannot make full use of the CPU, and too many threads created will cause system performance degradation, excessive memory usage, consumption caused by thread switching, and so on. Therefore, the number of threads can neither be too many nor too few. How much should it be?

From the perspective of resources required to process tasks, there are two types: CPU-intensive and I/O-intensive.

1. CPU-intensive

So-called CPU-intensive tasks do not depend on external I/O — scientific computing and matrix operations, for example. In this case, as long as the number of threads is roughly equal to the number of cores, CPU resources can be fully used.

2. I/O-intensive

Tasks of this type spend little time on computation; most of their time goes to disk I/O, network I/O, and the like.



Here you need a performance-profiling tool to measure the time spent waiting on I/O, denoted WT (wait time), and the time spent on CPU computation, denoted CT (compute time). For an N-core system, the appropriate number of threads is then roughly N * (1 + WT/CT). If the I/O wait time equals the compute time, about 2N threads are needed to fully utilize the CPU. Note that this is only a theoretical value; the actual number should be tuned against the real workload.
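For illustration only (the numbers are made up): on an 8-core machine, if profiling shows each task spends about 30 ms waiting on I/O (WT) and about 10 ms computing (CT), the formula suggests roughly 8 * (1 + 30/10) = 32 threads as a starting point, which you would then adjust through load testing.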

Of course, making full use of the CPU is not the only consideration. As the number of threads grows, memory usage, system scheduling, the number of open files, the number of open sockets, and the number of open database connections all need to be taken into account. There is no one-size-fits-all formula; each case needs its own analysis.

10. Factors to consider before using threads

1. Fully understand whether the task is a long task or a short task, whether it is CPU-intensive or I/O-intensive. If there are both, then a better way may be to put these two types of tasks into different thread pools.

2. If the task in the thread pool has I/O operations, be sure to set a timeout for this task, otherwise the thread processing the task may be blocked forever;

3. Tasks in the thread pool should not wait synchronously for the results of other tasks in the same pool, otherwise the pool can deadlock.

I/O and zero-copy technology

1. What is I/O?

I/O is simply copying data. If data is copied from an external device into memory, that is Input; if data is copied from memory to an external device, that is Output. Moving data back and forth between memory and external devices is Input/Output, or I/O for short.

2. I/O and CPU


To put it simply: the CPU executes machine instructions at the nanosecond level, while ordinary I/O, such as a disk seek, is on the order of milliseconds. If we compare the CPU's speed to a fighter jet, ordinary I/O plods along on foot by comparison.

That is, when a program runs (when the CPU executes machine instructions), it is far faster than I/O. The next question, then, is: given such a large speed gap, how do we design systems so that resources are used reasonably and efficiently?

Because of the speed gap, a process that issues an I/O operation cannot move forward until the operation completes; all it can do is wait.

3. What happens under the hood during I/O


In an operating system that supports threads, it is actually threads, not processes, that are scheduled. To make the I/O process easier to follow, assume for the moment that the operating system only has processes and ignore threads.

As shown in the figure below, there are two processes in memory, process A and process B, and process A is currently running.

Process A contains a piece of code that reads a file. Whatever the language, it usually defines a buffer to receive the data and then calls a function such as read:


read(buff);

Note: compared with the speed at which the CPU executes instructions, I/O is very slow, so the operating system cannot afford to waste precious CPU cycles on pointless waiting. Because the I/O performed by the external device is so slow, the process cannot move forward until the I/O completes; this is what blocking means.

After detecting that a process has issued a request to an I/O device, the operating system suspends it: it records the process's current running state and points the CPU's PC register at the instructions of some other process. Since a suspended process must be resumed later, the operating system has to save it for subsequent execution; the obvious way is to keep suspended processes in a queue.
 

As shown in the figure above, the operating system has sent an I/O request to the disk, so the disk driver starts to copy the data in the disk to the buff of process A. Although process A has been suspended at this time, this does not prevent the disk from copying data to the memory. The process is shown in the figure below:


In addition to the blocking queue, the operating system also has a ready queue. The so-called ready queue means that the processes in the queue are ready to be executed by the CPU. Thousands of processes can be created even on a machine with only 1 core. It is impossible for the CPU to execute so many processes at the same time, so there must be such processes that cannot be allocated to computing resources even if everything is ready. Such processes are placed in the ready queue.
 

Since there is still process B waiting to be fed in the ready queue, the CPU cannot be idle after process A is suspended. At this time, the operating system starts to find the next executable process in the ready queue, which is process B here. At this time, the operating system takes process B out of the ready queue, finds out the position of the machine instruction executed when process B was suspended, and then points the PC register of the CPU to this position, so that process B starts running.

As shown in the figure above, process B is being executed by the CPU, and the disk is copying data to the memory space of process A. The data copy and instruction execution are carried out at the same time. Under the scheduling of the operating system, the CPU and disk are fully utilized. Afterwards, the disk copies all the data to the memory of process A. After the operating system receives the disk interrupt, it finds that the data copy is complete. Process A regains the qualification to continue running. The operating system puts process A from the blocking queue into the ready queue.

After that, process B continues to execute and process A continues to wait. After process B has run for a while, the operating system decides it has run long enough, puts it back into the ready queue, takes process A out, and resumes it. Note that process B was put back into the ready queue because its time slice expired, not because it was blocked on an I/O request.
 

4. Zero-copy


It is worth noting that in the explanation above, disk data was copied directly into the process's address space, but in general I/O data is first copied into an operating-system buffer and from there into the process's address space. In scenarios with high performance requirements, this intermediate copy can be avoided and the data transferred directly; this technique of skipping the extra copy is called zero-copy.
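On Linux, one commonly used zero-copy interface is sendfile, which moves data from one descriptor to another inside the kernel so it never passes through a user-space buffer. A rough sketch (error handling omitted):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

// Send a whole file to an already-connected socket without copying it
// through a user-space buffer.
long send_file_zero_copy(int sock_fd, const char* path) {
    int file_fd = open(path, O_RDONLY);
    struct stat st;
    fstat(file_fd, &st);                     // find out how large the file is

    off_t offset = 0;
    long sent = sendfile(sock_fd, file_fd, &offset, st.st_size);

    close(file_fd);
    return sent;                             // bytes handed to the kernel, or -1 on error
}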

I/O multiplexing


This section explains in detail what I/O multiplexing is and how to use it. I/O multiplexing as exemplified by epoll (an event-driven technique) is used very widely; in fact, you will find event-driven programming in essentially every high-concurrency, high-performance scenario.

1. What is a file?


In the Linux world, a file is a very simple concept, you only need to understand it as a sequence of N bytes:

b1, b2, b3, b4, ....... bN

In fact, all I/O devices are abstracted as files — "everything is a file": disks, network data, terminals, and even inter-process communication pipes are all treated as files.

Commonly used I/O operation interfaces generally have the following categories:

1. Open the file, open;

2. Change the reading and writing position, seek;

3. File reading and writing, read, write;

4. Close the file, close.

2. What is a file descriptor?

We mentioned above that to perform an I/O read, like reading disk data, you need to specify a buffer to receive the data. In the Linux world, to use a file you refer to it through a number; without digging into the rationale for now, this number is called a file descriptor, and it is famous throughout the Linux world. Its logic is the same as a ticket number you take while queuing: the file descriptor is just a number, but through that number we can operate on an open file.

With a file descriptor, the process does not need to know anything else about the file — where it sits on disk, how it gets loaded into memory, how it is managed. All of that is the operating system's business; the process only needs the descriptor.
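As a small sketch of what this looks like in C (the path is only an example):

#include <fcntl.h>
#include <unistd.h>

int read_demo(void) {
    char buff[4096];
    // The descriptor is the "ticket number" the kernel hands back.
    int fd = open("/tmp/demo.txt", O_RDONLY);
    if (fd < 0) return -1;
    ssize_t n = read(fd, buff, sizeof(buff));   // the kernel fills our buffer
    close(fd);                                  // hand the ticket back
    return (int)n;
}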

3. What should I do if there are too many file descriptors?


We know from the above that all I/O operations can be performed through the concept of a file, which of course includes network communication.

Suppose you have an IM server. After the three-way handshake succeeds and a long-lived connection is established, we call accept to obtain a connection; this call also returns a file descriptor, through which we can receive the chat messages the client sends and forward them to the receiver.

In other words, through this descriptor, you can communicate with the client:

// Obtain the client's file descriptor through accept
int conn_fd = accept(...);


The processing logic on the server side is usually to receive the client message data, and then execute the forwarding (to the receiver) logic:

if(read(conn_fd, msg_buff) > 0) {
    do_transfer(msg_buff);
}


Since the topic is high concurrency, the server obviously does not talk to just one client; it may be talking to thousands of clients at the same time. At that point it is no longer one descriptor to handle but possibly thousands. To keep things simple at first, assume we only handle two client requests at a time.

Some readers may say that's easy; just write it like this:

if(read(socket_fd1, buff) > 0) { // process the first
    do_transfer();
}
if(read(socket_fd2, buff) > 0) { // process the second
    do_transfer();
}

The problem is that read is blocking: if no data has arrived yet, the process is suspended, and we cannot handle the second request even if its data is already there. In other words, while we serve one client, every other client has to wait. For a server that is supposed to handle tens of thousands of clients at once, this is clearly intolerable.

You might immediately think of multithreading: open one thread per client request, so a blocked client does not affect the threads serving the others. But note: since this is high concurrency, do we really open thousands of threads for thousands of requests? Creating and destroying threads in bulk seriously hurts system performance.

So how to solve this problem?

The key point is this: we do not know in advance whether the I/O device behind a file descriptor is readable or writable, and attempting I/O while the device is not ready only gets the process blocked and suspended.

4. I/O multiplexing


The word multiplexing comes mostly from the communications field. To make full use of a communication line, one wants to carry multiple signals over one channel, which requires combining the signals into one; the device that combines them is called a multiplexer. The receiver then has to recover the original signals from the combined one; that device is called a demultiplexer.

As shown in the figure below:


The so-called I/O multiplexing refers to such a process:

1. Get a bunch of file descriptors (whether it is network-related, disk file-related, etc., any file descriptor is fine);

2. Call a certain function to tell the kernel: "Don't return from this function yet; monitor these descriptors for me, and return only when some of them become ready for I/O reads or writes";

3. When the called function returns, you can know which file descriptors can perform I/O operations.

5. The Three Musketeers of I/O multiplexing: select, poll, epoll

When you call these I/O multiplexing functions, if none of the monitored file descriptors is readable or writable, the process is blocked and suspended until some descriptor becomes ready. In this sense select, poll, and epoll on Linux are all blocking I/O, that is, synchronous I/O.

1. select: fledgling


Under the select I/O multiplexing mechanism, you pass select the set of file descriptors you want to monitor as function parameters, and select copies those descriptor sets into the kernel. To reduce the performance cost of this copying, the Linux kernel limits the size of the set: the set of descriptors a user can monitor may not exceed 1024. Moreover, when select returns, it only tells you that some descriptors are ready for reading or writing; it does not tell you which ones.

Characteristics of select:

1. The number of file descriptors it can watch is limited and cannot exceed 1024;

2. The file descriptor set has to be copied from user space into the kernel;

3. It can only tell you that some descriptor is ready, not which one, so you must scan the whole set.

2. poll: a modest step forward

poll is very similar to select; its only improvement over select is removing the limit of 1024 file descriptors. But both select and poll degrade as the number of monitored descriptors grows, so neither suits high-concurrency scenarios.

3. epoll: unrivaled in the world


Of the three problems with select, poll removed the limit on the number of descriptors. What about the other two? For the copy problem, epoll's strategy is incremental updates plus shared memory. The set of monitored descriptors changes relatively rarely, yet select and poll copy the entire set on every call; with epoll_ctl, epoll thoughtfully operates only on the descriptors that have actually changed. At the same time, epoll has become good friends with the kernel: they share a piece of memory that stores the set of descriptors that are already readable or writable, which cuts the copying overhead between kernel and program.

As for the problem of having to traverse the descriptors to find out which one is ready: under select and poll, the process has to show up in person and wait on every file descriptor; when any descriptor becomes readable or writable the process is woken up, but it wakes up still confused, not knowing which descriptor is ready, and has to scan the whole set from start to finish. Under epoll, the process no longer shows up in person: it only waits on epoll, and epoll waits on every file descriptor on its behalf; when a descriptor becomes readable or writable, it tells epoll, and epoll records it.

In other words, epoll uses the strategy of "don't call me, I'll call you": instead of tediously polling every descriptor over and over, the process becomes the master — "whichever of you becomes readable or writable, report in."
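To make the "wait on epoll instead of on each descriptor" idea concrete, here is a rough sketch of an epoll-based loop in C. handle_request is a hypothetical application-level function, and listen_fd is assumed to be an already-created, bound, and listening socket.

#include <sys/epoll.h>
#include <sys/socket.h>

void handle_request(int fd);   // hypothetical: read the message and forward it

void epoll_event_loop(int listen_fd) {
    int ep = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = listen_fd };
    epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);           // register once

    struct epoll_event ready[128];
    for (;;) {
        // Wait on epoll alone; epoll waits on every registered descriptor for us.
        int n = epoll_wait(ep, ready, 128, -1);
        for (int i = 0; i < n; i++) {
            int fd = ready[i].data.fd;
            if (fd == listen_fd) {                           // a new connection arrived
                int conn = accept(listen_fd, NULL, NULL);
                struct epoll_event cev = { .events = EPOLLIN, .data.fd = conn };
                epoll_ctl(ep, EPOLL_CTL_ADD, conn, &cev);
            } else {                                         // this fd is readable now
                handle_request(fd);
            }
        }
    }
}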

Synchronous and asynchronous

1. Synchronous and asynchronous scenarios: phone calls and emails

1. Synchronization

Usually when making a phone call, one person talks while the other listens; while one person is talking, the other waits until they have finished. In this scene the keywords "dependence", "association", and "waiting" all appear, so the phone call is what we call synchronous communication.

2. Asynchronous


Another common way of communicating is email. Nobody sits there doing nothing while waiting for your email, so you can write it slowly; while you are writing, the recipient can slack off, go to the bathroom, or grumble about why the National Day holiday isn't two weeks long, and other equally meaningful things. And after you send the email, you don't have to wait idly for a reply either; you can go do something meaningful, like more slacking off.

Here, while you are writing the email, the other person is doing their own thing; the two activities go on at the same time. Sender and recipient do not need to wait for each other: the recipient reads the mail whenever it arrives. Neither side depends on or waits for the other, so email is asynchronous communication.

2. Synchronous call in programming


Normal function calls are synchronous, like this:

funcA() {
    // wait for funcB to finish
    funcB();

    // then continue with the rest of the flow
}

If funcA calls funcB, the subsequent code in funcA will not be executed until funcB finishes executing, that is to say, funcA must wait for funcB to finish executing, as shown in the figure below.
 

As the figure shows, funcA can do nothing while funcB runs; this is typical synchronization. Generally such synchronous calls happen with funcA and funcB in the same thread, but it is worth noting that even functions running in two different threads can call each other synchronously, for example when we perform I/O: under the hood, the request is sent to the operating system through a system call.

As the figure above shows, the program can only continue after the read function returns. Unlike the synchronous call above, here the caller and the callee run in different threads. From this we can conclude that whether a call is synchronous has nothing to do with whether caller and callee run in the same thread. What matters, and bears repeating, is that in synchronous mode the caller and the callee cannot make progress at the same time.

3. Asynchronous calls in programming


Where there are synchronous calls, there are asynchronous calls. Generally speaking, asynchronous calls always go hand in hand with time-consuming tasks such as I/O operations, such as reading and writing disk files, sending and receiving network data, and database operations.

Here we take disk file reading as an example. In the synchronous call mode of the read function, the caller cannot move forward until the file is read, but if the read function can be called asynchronously, the situation is different. If the read function can be called asynchronously, even if the file has not been read, the read function can return immediately.


As the figure above shows, with an asynchronous call the caller is not blocked and can continue running immediately after making the call. The key point of asynchrony is that the caller's subsequent work and the file read proceed at the same time. It is worth noting that asynchronous calls are a burden for programmers, both in understanding and in writing the code; in short, when God opens a door for you, he also closes a window.

Some readers may ask: in a synchronous call the caller pauses and waits, and when the callee finishes, the caller naturally continues; but with an asynchronous call, how does the caller know whether the callee has finished? This splits into two cases: the caller does not care about the result at all, or the caller needs to know the result.

The first case is simple and needs no discussion.

The second case is more interesting. There are usually two ways to implement it:

1. Notification mechanism: when the task completes, a signal is sent to notify the caller (the "signal" can be implemented in many ways: a Linux signal, a semaphore, and so on);

2. Callback mechanism: the well-known callback — the caller hands over a function to be invoked when the task is done.

4. Understanding synchronous and asynchronous in a concrete programming example

Let's use a typical web service to illustrate. Generally, after a web server receives a user request it runs some typical processing logic, most commonly a database query (you can substitute any other I/O here: a disk read, network communication, and so on). Assume that handling one request requires steps A, B, and C, then a database read, and after the database read completes, steps D, E, and F.

Steps A, B, C and D, E, F involve no I/O at all — they do not read files or talk to the network; the only I/O in the whole flow is the database query. Assume the web server has two typical threads: the main thread and a database thread.

First let's look at the simplest implementation, the synchronous one.

This approach is the most natural and easiest to understand:

// main thread
main_thread() {
    A;
    B;
    C;
    send the database query request;
    D;
    E;
    F;
}

// database thread
DataBase_thread() {
    while(1) {
        handle the database read request;
        return the result;
    }
}


After issuing the database query request, the main thread is blocked and paused; only when the query finishes can D, E, and F run. This is the most typical synchronous approach.

As the figure above shows, the main thread has a "gap" — its idle time — during which it can do nothing but wait for the database query to finish before continuing. Here the main thread is like a supervising boss and the database thread is like a programmer slogging away moving bricks: until the bricks are moved, the boss does nothing but stare at you, and only afterwards goes off to do other things.

1. Asynchronous case: the main thread does not care about the database result

As shown in the figure below, the main thread does not care at all whether the database query has finished; after the query completes, the database thread handles the remaining steps D, E, and F by itself.

A request takes seven steps in total: the first three run in the main thread, and the last four run in the database thread, which, via a callback function, processes D, E, and F after finishing the query.

The pseudocode is as follows:

void handle_DEF_after_DB_query() {
    D;
    E;
    F;
}

The main thread's request handling and the database thread's query handling can proceed at the same time. From the system's point of view, resources are used more fully and requests are processed faster; from the user's point of view, the system responds more quickly. This is where asynchrony is efficient. But as you can see, asynchronous programming is not as easy to understand as synchronous programming, and the system is also harder to maintain.

2. Asynchronous case: the main thread cares about the database result

As shown in the figure below, the database thread uses the notification mechanism to send the query result to the main thread, and the main thread, on receiving the message, continues processing the second half of the previous request.

From this we can see that all of the steps A through F are handled in the main thread, and the main thread again has no "idle time"; in this case it is the database thread that is relatively idle. This variant is not as efficient as the previous one, but it is still more efficient than the synchronous mode. Note, however, that asynchronous is not always more efficient than synchronous; it depends on the specific business logic and the complexity of the I/O, and each case must be analyzed on its own.

Coroutines in high concurrency

Coroutines are an indispensable technique in high-performance, high-concurrency programming, and they are widely used in Internet products including instant messaging (IM) systems; for example, the back-end framework said to support WeChat's massive user base is built on coroutines. More and more modern programming languages treat coroutines as a core language feature, including Go, Python, and Kotlin.

1. From ordinary functions to coroutines

void func() {
  print("a")
  pause and return
  print("b")
  pause and return
  print("c")
}

With an ordinary function, the function only returns after print("c") has executed; but as a coroutine, func returns to its caller right after print("a"), because of the "pause and return" step.

"I can return with a return statement too," you might say, like this:

void func() {
  print("a")
  return
  print("b")
  pause and return
  print("c")
}

Writing a return statement does indeed return, but then none of the code after the return can ever be executed.

What makes coroutines magical is that after we return from a coroutine we can call it again, and execution continues from the point right after its last return.

It is as if Sun Wukong shouts "freeze!" and the function is suspended on the spot:

void func() {
  print("a")

  print("b")

  print("c")
}

Now we can return to the calling function, and whenever the caller remembers this coroutine it can call it again, and the coroutine resumes from its last return point. Note that when an ordinary function returns, the process's address space no longer keeps any of that function's runtime information; when a coroutine returns, its runtime information must be preserved.

2. "Talk is cheap, show me the code"

In Python, this "freeze" is expressed with the keyword yield. Our func then becomes:

def func():
    print("a")
    yield
    print("b")
    yield
    print("c")

Now func is no longer a plain function; it has been upgraded to a coroutine. How do we use it?

Very simply:

def A():
    co = func()             # get the coroutine
    next(co)                # call the coroutine
    print("in function A")  # do something
    next(co)                # call the coroutine again

Although func has no return statement — that is, it returns no value — we can still write co = func(); co is the coroutine we obtained.

Next we call the coroutine with next(co). Run function A and see what executing line 3 prints:

a

Clearly, as expected, the coroutine func pauses after print("a") because of the yield and returns to function A.

Next comes line 4; function A is doing its own thing, so the output becomes:

a
in function A

Now comes the key line: when line 5 calls the coroutine again, what gets printed?

If func were an ordinary function, its first line would execute and print a.

But func is not an ordinary function; it is a coroutine, and as we said, a coroutine resumes from its last return point, so what executes here is the code after func's first yield, namely print("b"):

a
in function A
b

3. A graphical explanation

To understand coroutines more thoroughly, let's look at the same thing graphically.

First, an ordinary function call:

In the figure, the box represents the function's instruction sequence. If the function calls no other functions, execution simply runs top to bottom; but a function can call other functions, so execution is not strictly top to bottom, and the arrows indicate the direction of the flow of control.

As the figure shows: execution starts in funcA; after running for a while it calls funcB, control transfers to funcB, and when funcB finishes, control returns to the call site in the caller and continues. That is an ordinary function call.

Next, the coroutine:

Here execution again starts in funcA; after running for a while it calls the coroutine, which runs until its first suspension point and then, just like an ordinary function, returns to funcA; funcA executes some more code and later calls the coroutine again.

4. A function is just a special case of a coroutine

Unlike an ordinary function, a coroutine knows where its previous execution stopped. A coroutine saves the function's running state when it is paused and can restore that state and continue running from it.

5. A brief history of coroutines

The concept of the coroutine was proposed as early as 1958 — before the concept of the thread had even been proposed. It was not until 1972 that programming languages actually implemented it; those two languages were Simula 67 and Scheme. But coroutines never caught on, and as late as 1993 people were still writing papers digging up this ancient technique like archaeologists.

Because threads did not exist yet in that era, anyone who wanted to write concurrent programs had to use something like coroutines. Later, threads appeared and operating systems finally supported concurrent execution natively, and coroutines gradually faded from programmers' view. In recent years, with the growth of the Internet and especially the mobile Internet, the demand for server-side concurrency has kept rising, and coroutines have returned to the technical mainstream; the major languages either already support them or plan to.

6. How are coroutines actually implemented?

Let's think about this from the essence of the problem: what is the essence of a coroutine? A coroutine can be paused and resumed, so the state at the moment of pausing — the context — must be recorded and then restored when execution resumes. All of a function's runtime state lives in its runtime stack. As the figure below shows, the function's runtime stack is exactly the state, i.e. the context, that needs to be saved.

From the figure we can see that this process has a single thread and the stack area contains four stack frames: main calls A, A calls B, B calls C, and when C is running the whole process's state looks as shown.

Thinking about it further, why go to the trouble of copying stack data back and forth at all? What we can do instead is allocate the stack frames a coroutine needs directly on the heap, so no copying back and forth is needed, as the figure below shows.

From the figure we can see that this program has started two coroutines, and both of their stacks are allocated on the heap, so we can interrupt or resume their execution at any time. The stack region at the top of the process address space is now used to hold the stack frames of functions that run in the ordinary thread rather than in a coroutine.

In the figure there are actually three execution flows: one ordinary thread and two coroutines. Three execution flows — but how many threads did we create? The answer: one.

With coroutines we can, in theory, start an unlimited number of concurrent execution flows as long as there is enough heap space, and with none of the overhead of creating threads. All coroutine scheduling and switching happens in user space, which is why coroutines are also called user-level threads. Even if you create a huge number of coroutines, the operating system still sees only one thread; coroutines are invisible to the operating system.
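To make this concrete, here is one possible sketch of a stackful coroutine in C built on the POSIX ucontext API, with the coroutine's stack allocated on the heap as described above (error handling omitted). It mirrors the earlier print a/b/c example: each swapcontext back to the caller is a "pause", and the saved context lets execution resume right after it.

#include <stdio.h>
#include <stdlib.h>
#include <ucontext.h>

static ucontext_t main_ctx, co_ctx;

// The coroutine body: swapping back to main_ctx is the "pause and return".
static void co_body(void) {
    puts("a");
    swapcontext(&co_ctx, &main_ctx);   // pause: yield back to the caller
    puts("b");
    swapcontext(&co_ctx, &main_ctx);   // pause again
    puts("c");
}

int main(void) {
    // The coroutine's stack lives on the heap, not in the thread's stack area.
    char* stack = malloc(64 * 1024);

    getcontext(&co_ctx);
    co_ctx.uc_stack.ss_sp   = stack;
    co_ctx.uc_stack.ss_size = 64 * 1024;
    co_ctx.uc_link          = &main_ctx;   // where to go when the coroutine ends
    makecontext(&co_ctx, co_body, 0);

    swapcontext(&main_ctx, &co_ctx);   // run until the first pause -> prints "a"
    puts("in main");
    swapcontext(&main_ctx, &co_ctx);   // resume from the saved point -> prints "b"
    swapcontext(&main_ctx, &co_ctx);   // resume again -> prints "c"
    free(stack);
    return 0;
}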


Perhaps this is why the coroutine concept predates the thread: programmers writing ordinary applications probably ran into the need for multiple parallel flows before the people writing operating systems did — at the time there may not even have been an operating system, or the operating system had no notion of parallelism — so non-OS programmers had to implement execution flows themselves, and that is the coroutine.

7. A brief summary of coroutine concepts

1. A coroutine is a smaller execution unit than a thread

A coroutine is an execution unit smaller than a thread; you can think of it as a lightweight thread. One reason it is "light" is that the stack a coroutine holds is much smaller than a thread's: Java allocates roughly 1 MB of stack per thread, whereas a coroutine may need only tens or hundreds of KB. The stack mainly stores function parameters, local variables, return addresses, and so on.

Thread scheduling is done inside the operating system, while coroutine scheduling happens in user space; developers do it by calling low-level execution-context APIs. Some languages, such as Node.js and Go, support coroutines at the language level; others, such as C, need a third-party library to gain coroutine capability.

Since the thread is the operating system's smallest execution unit, it follows that coroutines are implemented on top of threads: a coroutine is created, switched, and destroyed inside some thread. We use coroutines because thread switching is relatively expensive, and coroutines have a big advantage there.

2. Why is coroutine switching so cheap?

To answer this, first recall what happens during a thread switch:

1) When threads switch, the CPU registers of the old thread must be saved and the other thread's register data loaded, which takes time;

2) The data in the CPU caches may become stale and have to be reloaded;

3) A thread switch involves crossing from user mode into kernel mode; each mode switch reportedly executes on the order of a thousand instructions, which is costly.

Coroutine switching is fast mainly because:

1) There is very little register data to save and restore during a switch;

2) The CPU caches remain effective;

3) There is no user-mode/kernel-mode switch;

4) Scheduling is more efficient: coroutines are non-preemptive — a coroutine yields the CPU only when it finishes or blocks — whereas threads generally use time-slicing, which causes many unnecessary switches.

How are high-performance servers actually implemented?

As you read this article, have you ever wondered how the server delivered it to you? It sounds simple — isn't it just a user request? The server pulls the article out of the database and sends it back over the network. In reality it is more complicated: how exactly does the server handle thousands of user requests in parallel, and which techniques are involved?

1. Multi-process

Historically, the earliest and simplest way to handle multiple requests in parallel was to use multiple processes. In the Linux world, for example, you can use system calls such as fork and exec to create processes: the parent process accepts the user's connection requests and creates a child process to handle each of them.

1. Advantages of multi-process parallelism

1) Programming is simple and very easy to understand;

2) Since each process's address space is isolated from the others, one process crashing does not affect the rest;

3) Multiple cores are fully utilized.

2. Disadvantages of multi-process parallelism

1) The isolation of address spaces, an advantage above, also becomes a drawback: processes that want to communicate must rely on inter-process communication mechanisms — think about which IPC mechanisms you know and what it would take to implement them in code. IPC programming is relatively complex, and its performance is a real concern;

2) Creating a process costs more than creating a thread, and frequently creating and destroying processes clearly burdens the system.

2. Multi-threading

Because threads share the process's address space, inter-thread communication needs no special mechanism at all: just read the shared memory. Thread creation and destruction are also cheaper; a thread is like a hermit crab whose house (the address space) belongs to the process — the thread is merely a tenant, so it is very lightweight and cheap to create and destroy.

We can create one thread per request, so that even if a thread blocks on an I/O operation — reading the database, say — it does not affect the other threads.

But sharing the process address space, while convenient for communication, also brings endless trouble. Because threads share the address space, one thread crashing brings down the whole process; and inter-thread communication is so simple — just read memory — that it is also extremely easy to get wrong: deadlocks, synchronization and mutual-exclusion bugs, and so on. A significant part of many programmers' precious time has gone into fixing the endless problems multithreading brings.

Threads have downsides, but compared with processes they still have the edge. Even so, it is unrealistic to expect multithreading alone to solve high concurrency: although creating a thread is cheaper than creating a process, it is not free, and for servers with tens or hundreds of thousands of connections, creating tens of thousands of threads causes performance problems — memory consumption and the cost of switching between threads, i.e. scheduling overhead.

3. Event-driven: the event loop

So far, "parallel" has meant processes or threads. But are those the only two techniques for concurrent programming? Not at all! Another concurrency technique is widely used in GUI programming and server programming and has become very popular in recent years: event-driven programming, i.e. event-based concurrency.

Don't assume this is a hard technique; in principle, event-driven programming is very simple.

It requires two ingredients:

1) events;

2) functions that handle events, usually called event handlers.

For a network server, most of the time spent handling a request actually goes to I/O: database reads and writes, file reads and writes, network reads and writes. When a request arrives, a little processing is done and then some I/O, such as a database query, is usually needed. We know I/O is very slow, but after initiating the I/O we do not have to sit and wait for it to finish; we can go on to handle the next request. That is why a single event loop can handle many requests at the same time; a bare-bones sketch of such a loop follows.
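The sketch below only shows the shape of the loop; wait_for_next_event is hypothetical, standing in for the I/O multiplexing layer described in the next section.

// The two ingredients: an event, and the handler registered for it.
typedef void (*event_handler)(void* data);

struct event {
    void*         data;    // e.g. the request bytes that arrived
    event_handler handle;  // the function registered for this kind of event
};

// Hypothetical: supplied by the I/O multiplexing layer (see the next section).
struct event wait_for_next_event(void);

// The loop itself: a single thread that never blocks on business logic,
// forever asking "what happened?" and dispatching the handler.
void run_event_loop(void) {
    for (;;) {
        struct event ev = wait_for_next_event();
        ev.handle(ev.data);
    }
}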

4. The source of events: I/O multiplexing

I/O multiplexing lets us monitor many file descriptors in one call: when some "file" (in IM network communication, really a socket) becomes readable or writable, we are told about it, so we can handle many descriptors at once.

In this way, I/O multiplexing becomes the event loop's raw-material supplier, feeding it a steady stream of events; this solves the question of where events come from.

5. The problem: blocking I/O

When we perform I/O — reading a file, say — and the operation has not finished, our program (thread) is blocked and paused. In a multithreaded program this is not a problem, because the operating system can schedule other threads. But in a single-threaded event loop it is a problem: if we issue a blocking I/O call inside the event loop, the whole thread (the event loop) is paused, and the operating system has nothing else to schedule, because the only thing handling user requests is that one event loop. While the event loop thread is blocked, no user request can be processed. Can you imagine your request stalling because the server is busy reading the database for someone else's request?

Therefore, event-driven programming has one golden rule: never issue blocking I/O. Some readers will ask: if we cannot issue blocking I/O, how do we do I/O at all?

6. The solution: non-blocking I/O

To overcome the problems of blocking I/O, modern operating systems provide a new way to issue I/O requests: asynchronous I/O. Correspondingly, blocking I/O is synchronous I/O; see the earlier discussion of synchrony and asynchrony.

With asynchronous I/O, suppose we call aio_read (check your platform for its specific asynchronous I/O API), i.e. an asynchronous read. The call returns immediately and we can carry on with other things even though the file may not have been read yet, so the calling thread is not blocked. The operating system also provides ways for the calling thread to check later whether the I/O has completed.
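On Linux, one concrete form of this is the POSIX AIO interface. A rough sketch of issuing an asynchronous read with aio_read and checking it later (error handling omitted):

#include <aio.h>
#include <string.h>

// Issue an asynchronous read: the call returns immediately, and the caller
// checks aio_error() later to see whether the I/O has finished.
int start_async_read(int fd, char* buf, size_t len, struct aiocb* cb) {
    memset(cb, 0, sizeof(*cb));
    cb->aio_fildes = fd;
    cb->aio_buf    = buf;
    cb->aio_nbytes = len;
    cb->aio_offset = 0;
    return aio_read(cb);        // does not block waiting for the data
}

// Later, e.g. from the event loop:
//   if (aio_error(cb) == 0) { ssize_t n = aio_return(cb); /* data is in buf */ }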

7. The difficulties of event-driven concurrent programming

Even with asynchronous I/O to keep the event loop from blocking, event-based programming is still hard.

First, an event loop runs in a single thread, and a single thread obviously cannot make full use of multiple cores. You might say: just create several event loop instances, so there are several event loop threads — but then the multithreading problems come back.

Second, on the programming side, asynchronous code has to be written with callbacks (the processing logic is split in two: one part handled by the caller itself, the other inside the callback). This change in programming style adds to the programmer's mental burden, and event-based projects become hard to extend and maintain later on.

8. A better way

Is there a way to keep the easy-to-understand nature of synchronous I/O without letting a synchronous call block the thread? Yes: the user-level thread, better known as the coroutine.

Despite the drawbacks of event-based programming, it remains very popular in today's high-performance, high-concurrency servers — but no longer as a pure single-threaded event-driven design; instead it is event loop + multiple threads + user-level threads.

Processes, threads, and coroutines

1. What is a process?

1. Basic concepts

The heart of a computer is the CPU, which does all the computing. The operating system is the computer's manager: it schedules tasks, allocates and manages resources, and presides over the hardware. Application programs are programs with specific functionality that run on top of the operating system.

A process is one dynamic execution of a program, with a certain amount of independent functionality, over a data set. It is an independent unit of resource allocation and scheduling in the operating system and the vehicle in which an application runs. The process is an abstract concept and has never had a single standard definition.

A process generally consists of three parts: the program, a data set, and the process control block:

  • the program describes what the process is to do and is the instruction set that controls its execution;

  • the data set is the data and working storage the program needs while executing;

  • the process control block (PCB) contains the process's descriptive and control information and is the sole sign of the process's existence.

Characteristics of a process:

  • Dynamic: a process is one execution of a program; it is temporary, has a lifetime, and is created and destroyed dynamically;

  • Concurrent: any process may execute concurrently with other processes;

  • Independent: a process is an independent unit of resource allocation and scheduling;

  • Structured: a process consists of the program, data, and the process control block.

2. Why have multiple processes?

The purpose of multiple processes is to raise CPU utilization. Suppose there is only one process (setting multithreading aside for now). From the operating system's point of view, printing a document goes like this:

1) The CPU runs the program and asks the disk to read the file to be printed; then the CPU waits a long time for the disk read to finish;

2) The CPU runs the program and asks the printer to print the contents; then the CPU waits a long time for printing to finish.

In this situation the CPU's utilization is actually very low.

Printing a file end to end may take a minute, while the total CPU time used may add up to only a few seconds. And if a game program later runs as a single process, the CPU will likewise have a lot of idle time.

With multiple processes:

While the CPU is waiting for the disk to read or write a file, or waiting for the printer, it can run the game program instead, pushing CPU utilization as high as possible.

More concretely, this also improves overall efficiency: while waiting for the printer, the graphics card is idle too, so with multiple processes running in parallel the game process can use the graphics card, and neither it nor the print job interferes with the other.

3. Summary

Intuitively, a process is what a program stored on disk becomes once it runs: an independent body of memory with its own address space and its own heap, answering to the operating system. The operating system allocates system resources (CPU time slices, memory, and so on) on a per-process basis; the process is the smallest unit of resource allocation.

2. What is a thread?

1. Basic concepts

Early operating systems had no concept of threads; the process was the smallest unit that could own resources and run independently, and also the smallest unit of program execution. Task scheduling used preemptive, time-slice-based round-robin, with the process as the smallest schedulable unit; each process had its own block of memory, keeping processes' address spaces isolated from each other. As computing evolved and the demands on the CPU grew, switching between processes became too expensive for ever more complex programs, and so the thread was invented.

A thread is a single sequential flow of control within a program's execution:

1) the smallest unit of the program's execution flow;

2) the basic unit of processor scheduling and dispatch.

A process can have one or more threads, and the threads share the program's memory space (that is, the memory space of the process they belong to). A standard thread consists of a thread ID, the current instruction pointer (PC), registers, and a stack; a process consists of a memory space (code, data, process space, open files) and one or more threads.

As shown in the figure above, in the Processes tab of the task manager, Youdao Dictionary and Youdao Cloud Notes are processes, and under each process there are multiple threads performing different tasks.

2. Task scheduling

What is a thread? To understand the concept, we first need some operating-system background. Most operating systems (such as Windows and Linux) schedule tasks using preemptive, time-slice-based round-robin. Within a process: after a thread has run for a few milliseconds, the operating-system kernel (which manages all tasks) steps in; a hardware timer interrupts the processor, the thread is forcibly paused and its registers are saved into memory, the kernel consults the thread list to decide which thread runs next, restores that thread's registers from memory, and resumes it, thereby moving on to the next task.

The small slice of time during which a task runs is called a time slice; a task that is running is in the running state, and a paused task is in the ready state, meaning it is waiting for its next time slice.

This way every thread gets its turn. Because the CPU is so fast and the time slices are so short, the switches between tasks are rapid and it feels as if many tasks are running "at the same time" — this is what we call concurrency (don't think concurrency is arcane: its implementation is complex, but the concept is one sentence: multiple tasks making progress at the same time).
 

3. The differences between processes and threads

The relationship between processes and threads:

1) The thread is the smallest unit of program execution, while the process is the smallest unit of operating-system resource allocation;

2) A process consists of one or more threads; threads are different execution paths through the code of one process;

3) Processes are independent of each other, but the threads of one process share its memory space (code segment, data set, heap, etc.) and some process-level resources (such as open files and signals); the threads of one process are not visible to other processes;

4) A thread context switch is much faster than a process context switch.

In short, both threads and processes are abstractions; a thread is a smaller abstraction than a process, and both can be used to achieve concurrency.

Early operating systems had no concept of threads: the process was the smallest unit that could own resources and run independently and the smallest unit of program execution. It was as if each process had exactly one thread — the process itself was the thread. That is why a thread is sometimes called a lightweight process.

Later, as computing developed and the demand for efficient context switching between tasks grew, a smaller abstraction was carved out: the thread. Usually a process has several threads (or possibly just one).
  
 

4. Multithreading and multiple cores

The time-slice scheduling described above says a task runs briefly and is then forcibly paused to run the next task, with the tasks taking turns. Many operating-system textbooks say "at any instant only one task is executing". That statement is inaccurate, or at least incomplete: what happens on a multi-core processor? To answer that, we need to understand kernel threads.

A multi-core processor integrates several computing cores on one chip to increase computing power, i.e. there are multiple cores that genuinely compute in parallel, and each core corresponds to one kernel thread. A kernel thread (Kernel Thread, KLT) is a thread supported directly by the operating-system kernel: the kernel performs the thread switches, schedules the threads via its scheduler, and is responsible for mapping their work onto the processors.

Generally, one processing core corresponds to one kernel thread: a single-core processor corresponds to one kernel thread, a dual-core processor to two, a quad-core processor to four.

Today's computers are typically "dual-core four-thread" or "quad-core eight-thread": hyper-threading simulates one physical core as two logical cores, corresponding to two kernel threads, so the number of CPUs the operating system sees is twice the number of physical cores. If your computer is dual-core four-thread, "Task Manager -> Performance" shows four CPU monitors; quad-core eight-thread shows eight.

Hyper-threading uses special hardware to make one physical core appear as two logical cores, giving a single processor thread-level parallelism. This suits multithreaded operating systems and software, reduces CPU idle time, and raises CPU utilization. Hyper-threading (e.g. dual-core four-thread) is determined by the processor hardware and also needs operating-system support to show up in the computer.

Programs generally do not use kernel threads directly; instead they use a higher-level interface to them, the lightweight process (LWP) — which is what we normally mean by a thread, also called a user thread.

Since every lightweight process is backed by a kernel thread, there can be lightweight processes only where kernel threads are supported first.

There are three models for mapping user threads to kernel threads:

1) the one-to-one model;

2) the many-to-one model;

3) the many-to-many model.

5. The one-to-one model

In the one-to-one model, each user thread corresponds to exactly one kernel thread (the reverse need not hold: a kernel thread may have no corresponding user thread). If the CPU does not use hyper-threading (say a quad-core four-thread machine), each user thread maps to the kernel thread of one physical core, and concurrency between threads is true parallelism.

Advantages of the one-to-one model:

User threads get the same benefits as kernel threads: when one thread blocks for some reason, the others are unaffected (the one-to-one model also lets multithreaded programs perform better on multiprocessor systems).

Disadvantages of the one-to-one model:

1) Many operating systems cap the number of kernel threads, so the one-to-one model limits the number of user threads;

2) In many operating systems, kernel-thread scheduling involves relatively expensive context switches, which lowers user-thread efficiency.

6. The many-to-one model

The many-to-one model maps several user threads onto one kernel thread. Switching between user threads is done by user-space code, and the kernel is unaware of how the threads are implemented: creating, synchronizing, and destroying user threads all happen in user space without kernel involvement.

Advantages of the many-to-one model:

1) Thread context switches are much faster;

2) The number of user threads is nearly unlimited.

Disadvantages of the many-to-one model:

1) If one user thread blocks, all the others cannot run either, because the underlying kernel thread blocks with it;

2) On multiprocessor systems, adding processors does not noticeably improve the threads' performance, because all the user threads map onto a single processor.

7. The many-to-many model

The many-to-many model combines the strengths of the other two: multiple user threads are mapped onto multiple kernel threads, and the thread library schedules user threads onto the available schedulable entities.

This makes thread context switches very fast, because system calls are avoided; but it adds complexity, the possibility of priority inversion, and suboptimal scheduling when the user-space scheduler and the kernel scheduler do not coordinate extensively (and such coordination is expensive).

Advantages of the many-to-many model:

1) One user thread blocking does not block all threads, because other kernel threads can still be scheduled;

2) There is no limit on the number of user threads;

3) On multiprocessor operating systems, the threads also gain some performance, though less than under the one-to-one model.

Most popular operating systems today adopt the many-to-many model.

8. Viewing processes and threads

An application may be multithreaded or multi-process; how do you check?

On Windows, you only need Task Manager to see an application's processes and thread counts: press Ctrl+Alt+Del or right-click the taskbar and open Task Manager.

In the Processes tab you can see how many threads an application has.

If an application has several processes, each one is listed; in the figure above, Google's Chrome browser has several processes.

Likewise, opening several instances of an application also creates several processes; in the figure above I opened two cmd windows, hence two cmd processes. If the thread-count column is not shown, use the "View -> Select Columns" menu to add it.

To check CPU and memory usage: in the Performance tab you can view CPU and memory utilization, and from the number of CPU usage monitors you can also tell how many logical cores you have — my dual-core four-thread machine shows four monitors.
 

9. The life cycle of a thread

When there are fewer threads than processors, the threads' concurrency is true parallelism: different threads run on different processors. When there are more threads than processors, the concurrency is constrained and is no longer fully parallel, because at least one processor then has to run more than one thread.

When a single processor runs multiple threads, the concurrency is simulated: the operating system runs each thread in turn using time slices. Today almost all modern operating systems — Unix, Linux, Windows, macOS — use preemptive, time-slice-based round-robin scheduling.

We know the thread is the smallest unit of program execution and of task execution. In the early operating systems that only had processes, a process had five states: created, ready, running, blocked (waiting), and exited. An early process was essentially today's single-threaded process, so today's threads also have five states, and the modern thread's life cycle closely resembles the early process's.

While running, a process is in one of three states — ready, running, or blocked; the created and exited states describe the process's creation and termination.

Life cycle of an early process:

Created: the process is being created and cannot run yet. The operating system's work here includes allocating and filling in a PCB entry, building resource tables and allocating resources, loading the program, and setting up the address space;

Ready: the time slice has been used up and the process is forcibly paused, waiting for its next time slice;

Running: the process is executing, consuming its time slice;

Blocked: also called the waiting state — waiting for some event (such as I/O or another thread) to finish;

Exited: the process has finished, so this is also called the terminated state, in which the operating system's resources are released.

A thread's life cycle is very similar:

Created: a new thread is created and waits to be scheduled;

Ready: the time slice has been used up and the thread is forcibly paused, waiting for its next time slice;

Running: the thread is executing, consuming its time slice;

Blocked: also called the waiting state — waiting for some event (such as I/O or another thread) to finish;

Exited: the thread has finished its task or some other termination condition has occurred; it enters the exited state and the resources allocated to it are released.

3. What is a coroutine?

1. Basic concepts

A coroutine is something built on top of threads but even more lightweight. These lightweight threads, managed by programs the programmers themselves write, are called "user-space threads" and are invisible to the kernel. Because they are self-managed asynchronous tasks, many people prefer to call them fibers or green threads. Just as a process can own many threads, a thread can own many coroutines.

2. The purpose of coroutines

For Java programmers, traditional J2EE systems handle each request by dedicating one thread to the entire business flow (including the transaction), so the system's throughput depends on how long each thread's work takes.

When a slow I/O operation is hit, the whole system's throughput drops immediately, because the thread stays blocked the whole time; with many threads, a lot of them sit idle (waiting for that work to finish), and resources are not used fully.

The most common example is JDBC, which is synchronous and blocking — one reason many people say the database is the bottleneck. The time "spent" there is really the CPU waiting for the I/O to return; the thread is not using the CPU for computation at all, it is just spinning idle. And too many threads in turn bring more context-switch overhead.

One popular industry answer to this problem today is a single thread plus asynchronous callbacks, represented by node.js and, in the Java world, the newcomer Vert.x.

The purpose of the coroutine is that when a long I/O operation occurs, the current coroutine yields the scheduler and the next task runs, eliminating the context-switch overhead.

3. The characteristics of coroutines

To summarize, coroutines have these characteristics:

1) Thread switching is scheduled by the operating system, while coroutines are scheduled by the user program itself, so there are fewer context switches and higher efficiency;

2) A thread's default stack is about 1 MB, while a coroutine is lighter, closer to 1 KB, so more coroutines fit in the same amount of memory;

3) Because they run on the same thread, locks for guarding shared data can often be avoided;

4) They suit workloads that block and need massive concurrency; they do not suit heavily computational multithreaded work, where threads remain the better tool.

4. How coroutines work

When I/O would block, the coroutine scheduler steps in: the current flow immediately yields (gives up control voluntarily) and its stack data is recorded; once the blocking operation finishes, the stack is restored on a thread and the result of the blocked operation is handed to it to continue.

The result looks no different from writing synchronous code. The whole mechanism can be called a coroutine, and the thread that runs under the coroutine scheduler is called a fiber. In Golang, for example, the go keyword essentially starts a fiber and runs the func's logic on it.

Because a coroutine's suspension is entirely controlled by the program and happens in user space, whereas a thread's blocking is switched by the operating-system kernel and happens in kernel space, the coroutine's overhead is far smaller than the thread's, and the context-switch overhead disappears.

4. Summary

1. Differences between processes and threads

1) Scheduling: the thread is the basic unit of scheduling and dispatch, while the process is the basic unit of resource ownership;

2) Concurrency: not only can processes execute concurrently with each other, the threads within one process can too;

3) Resource ownership: the process is an independent unit that owns resources; threads own no system resources but can access the resources of their process;

4) System overhead: creating or destroying a process means the system must allocate or reclaim its resources, so the overhead is noticeably larger than for creating or destroying a thread.

2. How processes and threads relate

1) A thread belongs to exactly one process, and a process can have many threads, but at least one;

2) Resources are allocated to the process, and all threads of the process share all of its resources;

3) The processor is given to threads; what actually runs on the processor is a thread;

4) Threads need to cooperate and synchronize during execution; threads of different processes must synchronize by message passing.

Developers often make each thread do only a very lightweight operation — reading a tiny file, downloading a tiny image, loading a tiny piece of text — but the number of such "lightweight operations" is enormous.

With a huge number of such lightweight operations, even if a thread pool avoids the creation and destruction costs, the cost of switching threads becomes very large, perhaps close to the cost of the operation itself. These are exactly the scenarios that need a cheaper mechanism, and that is where coroutines come in: they fit such scenarios perfectly.

 

Origin blog.csdn.net/LANHYGPU/article/details/128697816