How much overhead does process/thread switching actually require?

Every developer is familiar with the concept of a process, and most of us have also heard that process context switches carry some overhead. So today let's think about a concrete question: how much CPU time does one process context switch actually consume? And since threads are said to be lighter than processes, does switching between threads really save a lot of CPU time compared with switching between processes?

01

Process and process switching

The process is one of the operating system's great inventions. It shields the application from hardware details such as CPU scheduling and memory management, abstracting them behind the concept of a process so that the application can concentrate on its own business logic while many tasks appear to run "simultaneously" on a limited number of CPUs.

When switching from process A to process B, the kernel first saves A's context, so that when A later resumes it knows which instruction to execute next. It then restores the context of process B into the registers and lets B run. This saving and restoring is what we call a context switch.

Context switching overhead is not a big problem when there are few processes and switching is infrequent. But Linux is now widely used as the back-end operating system for highly concurrent network programs, and when a single machine has to support tens of thousands of user requests, this overhead is worth talking about.

That is because a context switch is triggered whenever a user process blocks on network I/O, for example while waiting for data from Redis or MySQL, or whenever its time slice runs out.

02

A simple process switching overhead test

Without further ado, let's run an experiment to measure how much CPU time one context switch needs!

The experiment works like this: create two processes and pass a token back and forth between them. One process blocks while reading the token; the other blocks while waiting for the token to come back after sending it. After transferring the token a fixed number of times, we compute the average cost of a single switch.
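
Below is a minimal sketch of such a test program (written for illustration, not the author's exact main.c; the iteration count is an arbitrary assumption). Two processes bounce a one-byte token across a pair of pipes, and every round trip forces two context switches.

/* minimal sketch of the two-process token-passing test */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/wait.h>

#define ROUNDS 100000

int main(void)
{
    int p2c[2], c2p[2];              /* parent->child and child->parent pipes */
    char token = 'x';
    struct timeval start, end;

    if (pipe(p2c) < 0 || pipe(c2p) < 0) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid == 0) {                  /* child: echo the token back */
        for (int i = 0; i < ROUNDS; i++) {
            read(p2c[0], &token, 1);
            write(c2p[1], &token, 1);
        }
        return 0;
    }

    gettimeofday(&start, NULL);
    printf("Before Context Switch Time%ld s, %ld us\n",
           (long)start.tv_sec, (long)start.tv_usec);

    for (int i = 0; i < ROUNDS; i++) {   /* parent: send token, wait for echo */
        write(p2c[1], &token, 1);
        read(c2p[0], &token, 1);
    }

    gettimeofday(&end, NULL);
    printf("After Context Switch Time%ld s, %ld us\n",
           (long)end.tv_sec, (long)end.tv_usec);

    waitpid(pid, NULL, 0);
    /* every round trip is 2 switches, so divide the elapsed time by 2*ROUNDS */
    return 0;
}

Compile and run it: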

# gcc main.c -o main
# ./main
Before Context Switch Time1565352257 s, 774767 us
After Context Switch Time1565352257 s, 842852 us

The exact time varies from run to run. Averaged over many runs, each context switch takes about 3.5 us. This number of course differs from machine to machine, so it is best to measure on your own hardware.

When we tested system calls earlier, the lowest figure was about 200 ns. Clearly a context switch costs more than a system call. A system call only switches the same process from user mode to kernel mode and back again, whereas a context switch moves the CPU from process A to a different process B, which obviously requires more work.
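
(If you want to reproduce a system call number like that yourself, one simple approach, sketched below as an illustration rather than the original test program, is to time a large batch of a cheap system call such as getppid() and divide.)

/* rough per-system-call timing sketch */
#include <stdio.h>
#include <unistd.h>
#include <sys/time.h>

#define N 1000000

int main(void)
{
    struct timeval start, end;

    gettimeofday(&start, NULL);
    for (int i = 0; i < N; i++)
        getppid();               /* a cheap real system call: user mode -> kernel mode -> back */
    gettimeofday(&end, NULL);

    double us = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec);
    printf("%.1f ns per system call\n", us * 1000.0 / N);
    return 0;
}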

03

Process switching overhead analysis

So where exactly does the CPU time go during a context switch? The overhead falls into two categories: direct overhead and indirect overhead.

Direct overhead is the work the CPU must do at the moment of the switch, including:

1. Switching the page table global directory

2. Switching the kernel-mode stack

3. Switching the hardware context (the data that must be loaded into the registers before the process can resume is collectively called the hardware context)

  • ip (instruction pointer): points to the next instruction to be executed after the current one

  • bp (base pointer): holds the base address of the stack frame of the function currently being executed

  • sp (stack pointer): holds the top address of the stack frame of the function currently being executed

  • cr3: the page directory base register, which holds the physical address of the page directory table

  • ......

4. Flushing the TLB

5. Running the kernel scheduler's code

Indirect overhead mainly means that after switching to a new process, execution is slower because the various caches are no longer hot.

Things are better if a process always gets scheduled on the same CPU. Once it migrates to another CPU, the TLB and the L1/L2/L3 caches that had been warmed up for it are left behind, and the caches on the new CPU hold some other process's code and data, so the locality built up earlier is wasted and the new process has to go out to memory far more often.

In fact, the experiment above does not really capture this situation, so the real-world cost of a context switch may well be higher than 3.5 us.
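
As an aside, if you want to limit this cross-CPU migration effect for a latency-sensitive process, one common technique is to pin it to a single core so that its caches stay warm. Below is a minimal sketch using sched_setaffinity; the choice of CPU 0 is only an example, and the taskset command does the same job from the shell.

/* pin the calling process to one CPU (illustrative sketch) */
#define _GNU_SOURCE              /* for CPU_* macros, sched_setaffinity, sched_getcpu */
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);           /* allow this process to run only on CPU 0 */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    printf("pinned to CPU 0, currently running on CPU %d\n", sched_getcpu());
    return 0;
}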

Readers who want a more detailed walkthrough of the mechanism can refer to Chapters 3 and 9 of Understanding the Linux Kernel.

04

A more professional tool: lmbench

lmbench is an open-source, multi-platform benchmark suite for evaluating overall system performance. It can measure file read/write, memory operations, process creation and destruction overhead, network performance, and more. It is easy to use, although a full run is a bit slow; interested readers can try it themselves.

The advantage of this tool is that it runs several groups of experiments, with 2, 8, and 16 processes, and with different working-set sizes per process, which does a good job of simulating the effect of cache misses. My results are as follows (the table below may need to be scrolled horizontally to view completely):

-------------------------------------------------------------------------
Host                 OS  2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
                         ctxsw  ctxsw  ctxsw ctxsw  ctxsw   ctxsw   ctxsw
--------- ------------- ------ ------ ------ ------ ------ ------- -------
bjzw_46_7 Linux 2.6.32- 2.7800 2.7800 2.7000 4.3800 4.0400 4.75000 5.48000

According to lmbench, a process context switch takes between 2.7 us and 5.48 us.

05

Thread context switch overhead

Having measured the overhead of a process context switch, let's now test threads on Linux and see whether they are faster than processes, and if so, by how much.

Strictly speaking, there are no threads as such under Linux; to cater to developers' tastes, its lightweight processes are called threads. A lightweight process, just like an ordinary process, has its own independent task_struct descriptor and its own independent pid. From the operating system's point of view, scheduling it is no different from scheduling a process: the scheduler simply picks a task_struct off the doubly linked list of waiting tasks and switches it to the running state. The only difference between a lightweight process and an ordinary process is that it can share the same memory address space, code segment, global variables, and set of open files with its sibling threads.

All threads in the same process see the same value from getpid(). The reason is that task_struct contains a tgid (thread group id) field, and for a multi-threaded program what the getpid() system call actually returns is this tgid, so all threads belonging to one process appear to share the same PID.
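
A quick way to see this for yourself is the small sketch below (not from the original article): getpid() returns the shared thread-group id, while the raw gettid system call returns the per-thread pid stored in each task_struct. Compile with gcc -lpthread.

/* show that getpid() is shared while gettid differs per thread */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/syscall.h>

static void *worker(void *arg)
{
    printf("thread: getpid()=%d, gettid()=%ld\n",
           (int)getpid(), (long)syscall(SYS_gettid));
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    printf("main:   getpid()=%d, gettid()=%ld\n",
           (int)getpid(), (long)syscall(SYS_gettid));
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}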

We use another experiment to test this, similar in principle to the process test: 20 threads are created, and a signal is passed between them through pipes. A thread wakes up when it receives the signal, passes the signal on to the next thread, and goes back to sleep. The extra cost of writing the signal into the pipe is measured separately as a first step, so that it can be accounted for.
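
The sketch below is a simplified stand-in for that test program (assumptions: a ring of 20 threads connected by pipes; the separate first-step measurement of the pipe overhead is omitted here).

/* simplified thread context switch test: a ring of threads passing a token */
#include <stdio.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>

#define THREADS 20
#define ROUNDS  10000

static int pipes[THREADS][2];        /* pipes[i] feeds thread i */

static void *worker(void *arg)
{
    long id = (long)arg;
    long next = (id + 1) % THREADS;
    char token;

    for (int i = 0; i < ROUNDS; i++) {
        read(pipes[id][0], &token, 1);    /* sleep until the previous thread wakes us */
        write(pipes[next][1], &token, 1); /* wake the next thread in the ring */
    }
    return NULL;
}

int main(void)
{
    pthread_t tid[THREADS];
    struct timeval start, end;
    char token = 'x';

    for (long i = 0; i < THREADS; i++)
        pipe(pipes[i]);
    for (long i = 0; i < THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);

    gettimeofday(&start, NULL);
    write(pipes[0][1], &token, 1);        /* inject the first token */
    for (long i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);
    gettimeofday(&end, NULL);

    double us = (end.tv_sec - start.tv_sec) * 1e6 +
                (end.tv_usec - start.tv_usec);
    printf("%f us per switch (pipe overhead included)\n",
           us / ((double)THREADS * ROUNDS));
    return 0;
}

Compile and run: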

# gcc -lpthread main.c -o main
0.508250
4.363495

The results vary a little from run to run. Averaging many runs, and subtracting the first number above (the separately measured pipe overhead) from the second, each thread switch costs roughly 3.8 us. In terms of context switch time, then, a Linux thread (lightweight process) is actually not much cheaper than a process.

06

Related Linux commands

Now that we know context switches eat CPU time, which tools can we use to see how much switching is going on in a Linux system? And if context switching is already hurting overall system performance, is there a way to pinpoint the offending processes and optimize them? (The output below may need to be scrolled horizontally to view completely.)

# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 595504   5724 190884    0    0   295   297    0    0 14  6 75  0  4
 5  0      0 593016   5732 193288    0    0     0    92 19889 29104 20  6 67  0  7
 3  0      0 591292   5732 195476    0    0     0     0 20151 28487 20  6 66  0  8
 4  0      0 589296   5732 196800    0    0   116   384 19326 27693 20  7 67  0  7
 4  0      0 586956   5740 199496    0    0   216    24 18321 24018 22  8 62  0  8

Alternatively, you can use sar:

# sar -w 1
proc/s
     Total number of tasks created per second.
cswch/s
     Total number of context switches per second.
11:19:20 AM    proc/s   cswch/s
11:19:21 AM    110.28  23468.22
11:19:22 AM    128.85  33910.58
11:19:23 AM     47.52  40733.66
11:19:24 AM     35.85  30972.64
11:19:25 AM     47.62  24951.43
11:19:26 AM     47.52  42950.50
......

The environment above is a production machine: an 8-core, 8 GB KVM virtual machine running nginx + php-fpm with 1000 fpm workers, handling roughly 100 user requests per second on average. The cs column shows the number of context switches in the whole system per second, which here reaches about 40,000. A rough estimate: that is about 5,000 switches per core per second, and at roughly 3.5-5 us each, every core spends close to 20 ms of each second just switching contexts. Bear in mind that this is a virtual machine, so virtualization adds some overhead of its own, and the CPU also has real work to do: user-space request logic, kernel-side system call handling, network connection handling, and soft interrupts. Against that backdrop, 20 ms of switching overhead per second is really not low.

Going one step further, which processes are causing the frequent context switches? (The output below may need to be scrolled horizontally to view completely.)

# pidstat -w 1
11:07:56 AM       PID   cswch/s nvcswch/s  Command
11:07:56 AM     32316      4.00      0.00  php-fpm
11:07:56 AM     32508    160.00     34.00  php-fpm
11:07:56 AM     32726    131.00      8.00  php-fpm
......

Because php-fpm works in a synchronous, blocking mode, every request to Redis, Memcached, or MySQL blocks the worker and causes a voluntary context switch (cswch/s); an involuntary switch (nvcswch/s) is triggered only when the time slice runs out. As the output shows, most of the fpm processes' switches are voluntary and only a few are involuntary.

If you want the cumulative context switch counts for a specific process, you can read them directly from /proc; note that these are totals since the process started.

# grep ctxt /proc/32583/status
voluntary_ctxt_switches:        573066
nonvoluntary_ctxt_switches:     89260

07

Conclusion

You don't need to memorize everything a context switch does. Just remember the bottom line: as measured here, a context switch on the author's development machine costs roughly 2.7 to 5.48 us. You can use the code or the tools shown above to run your own tests on your own machine.

The lmbench figure is relatively more accurate, because it also accounts for the extra overhead caused by cache misses after the switch.


Origin blog.csdn.net/coderising/article/details/109506714