KVM Performance Optimization: CPU Tuning

Foreword

 

Every platform has optimizations appropriate to its scenario. The same platform will not perform the same way under different hardware and network environments. It is like a Ferrari: drive it on a village back road and then on a highway, and the speed and the thrill are certainly not the same...

The same applies to our operations work. First you have to fully understand the software platform you are using, then test it thoroughly in your actual production environment, and finally make the best adjustments based on the results.

KVM is no different: the first thing to do is to understand it fully, see which parameters we can set and adjust, and then apply them so as to achieve maximum performance.

KVM performance tuning can be approached from four angles: CPU, memory, disk I/O, and network.

 

KVM CPU Performance Tuning

 

The CPU tuning discussed here revolves around NUMA. So what is NUMA? NUMA is short for Non-Uniform Memory Access, an architecture for letting multiple CPUs work together efficiently. Servers today are configured generously, many with multiple sockets and many cores, and CPUs constantly need to exchange data with memory. In the old days, when CPU clock rates were low and machines ran a single CPU, pulling data from memory into the CPU for computation kept pace just fine. But CPU speeds have since increased enormously and machines now run many CPUs, so an imbalance appeared: memory cannot feed data to the CPUs fast enough, and multiple CPUs end up fighting over the same memory. The CPUs are left "starving": there is not enough data for them to consume and not enough memory to go around.

To improve computer performance, computer scientists carefully studied how CPUs and memory should coordinate and interact. The overall goal is to find a multi-CPU mode of operation in which each CPU can "enjoy" as much data as possible from multiple banks of memory.

This led to the following designs:

1.1 SMP technology

Let us start with SMP. SMP (Symmetric Multi-Processing) is a symmetric multi-processing architecture whose defining feature is that all CPUs share all resources, such as the bus, memory, and the I/O system.

Since all resources are shared, the CPUs are peers of one another, and the operating system manages their access to resources (usually in the form of a queue). Each CPU processes items from the queue in turn; if two CPUs try to access the same resource at the same time, the contention is generally resolved through a software lock mechanism, the same idea as the locks used for thread safety: while one CPU is processing a piece of work it normally holds the lock, and releases it when finished.

So where does "symmetric" come in? It means there is no master/slave relationship among the CPUs: they are all equals when accessing resources. We can look at the picture below:

This was the earliest design, and precisely because it was the earliest, its drawback showed up quickly: it does not scale well. Looking at the diagram above, it is obvious that if you keep adding CPUs to raise server performance, memory (even at its maximum) becomes insufficient, because it is shared: the more CPUs there are, the more mouths there are to feed from the same memory. When the added CPUs cannot get enough data from memory, they stall, and CPU capacity is wasted.

Practical experience shows that SMP servers do best with 2 to 4 CPUs; anything beyond that is wasted.

 

So this approach is flawed. Scientists therefore came up with another design: NUMA.

 

1.2 NUMA technology

As we said above, NUMA means non-uniform memory access, and it nicely solves SMP's scaling problem. With NUMA technology, dozens or even hundreds of CPUs can be combined inside a single server.

NUMA architecture design:

 

As the diagram shows, the CPU modules exchange information with one another through an interconnect module, so all the CPUs are interconnected; meanwhile each CPU module is divided evenly into several Chips (no more than four), and each Chip has its own memory controller and memory slots.

Within NUMA there are three node concepts:

  1. Local node: for a given CPU, the node that CPU belongs to is its local node.

  2. Neighbor node: a node adjacent to the local node.

  3. Remote node: a node that is neither the local node nor a neighbor node.

Neighbor nodes and remote nodes are collectively called non-local nodes (off nodes).

Note that a CPU does not access the different kinds of node at the same speed: access to the local node is fastest and access to remote nodes is slowest, i.e. access speed depends on the distance to the node, and the farther away the node, the slower the access. This distance is called the Node Distance. Because of this property, our applications should avoid unnecessary interaction across CPU modules as much as possible: if your application can be pinned inside a single CPU module, its performance will improve considerably.

Access speed: local node > neighbor node > remote node

 

The same logic applies to KVM: the CPU optimization here is to bind a guest to specified CPUs, reducing cross-CPU-module interaction and thereby improving KVM performance. Today's servers and the Linux operating system run in NUMA mode by default, so next we will discuss how to do CPU binding.

So how exactly is it done?

1.3 The numactl command explained

We will demonstrate here on a real physical machine, an IBM 3650M4.

First we inspect NUMA with the numactl command; if your system does not have this command, install it with yum install numactl.

 

Running numactl --help shows the main options, summarized below:

  --interleave=nodes, -i nodes    Set an interleaved memory allocation policy: when the system allocates memory across several nodes, the allocations are distributed over those nodes in round-robin fashion. If memory cannot be allocated on the current interleave target, the allocation falls back to another node. The node set can be written the same way as for --membind and --cpunodebind.

  --membind=nodes, -m nodes    Allocate memory only from the specified nodes. The allocation fails if there is not enough memory available on those nodes. The nodes can be given as N,N,N, as a range N-N, or as a combination such as N,N-N.

  --cpunodebind=nodes, -N nodes    Run the command only on the CPUs belonging to the specified nodes. Note that a node may contain several CPUs; the host's CPU numbering is recorded in the processor field of /proc/cpuinfo.

  --localalloc, -l    Always allocate memory on the current node.

  --preferred=node    Preferably allocate memory on the given node, but fall back to other nodes if memory cannot be allocated there. This option accepts only a single node number. Relative notation may also be used.

  --show, -s    Show the NUMA policy settings of the current process.

  --hardware, -H    Show an inventory of the NUMA nodes available on the system.

  --huge    When creating a system-level (SYSV) shared memory segment, use huge pages. Note this option is only valid before --shmid or --shm.

  --offset=offset    Specify the offset into the shared memory segment. The default is 0. Valid units are m (for MB), g (for GB), and k (for KB); anything else is taken as bytes.

  --strict    Report an error when a page in the NUMA-policied area of the shared memory segment already has a different policy applied. The default is not to use this option, silently ignoring such conflicts.

  --shmmode shmmode    Only valid before --shmid or --shm. When creating a shared memory segment, set its mode to the given integer value.

  --shmid id    Create or use a shared memory segment with the given numeric ID. (If a segment with that ID already exists, it is used; if no segment with that ID exists, it is created.)

  --shm shmkeyfile    Create or use a shared memory segment whose ID is generated from the file shmkeyfile via ftok(3).

  --file tmpfsfile    Apply the NUMA policy to the given file, which must live on a tmpfs or hugetlbfs filesystem.

  --touch    Touch pages so the NUMA policy takes effect early. The default is not to touch them; the policy is applied only when an application maps and accesses a page.

  --dump    Dump the NUMA policy of the specified range.

  --dump-nodes    Dump all the nodes of the specified range. Node ranges can be written as:

    all                 all nodes
    number              the single node with that number
    number1{,number2}   nodes number1 and number2
    number1-number2     all nodes from number1 through number2
    !nodes              all nodes except the specified ones
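To make the options concrete, here is a sketch of the two most common invocations; ./myapp is a stand-in name for whatever workload you want to run:

    # numactl --cpunodebind=0 --membind=0 ./myapp    (run on node 0's CPUs, allocate only node 0's memory)

    # numactl --interleave=all ./myapp               (spread the memory evenly across all nodes)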

 

OK, those are the details of the numactl command. Next let us look at the NUMA layout of the current servers:

We run the lscpu command to view the CPU information:

 

Next we run numactl --hardware. I have prepared two IBM servers here: the 3650M4, and another machine, a 3850M2.

From the command output we can see that this server has two NUMA nodes (node0 and node1):

Looking at the other server, the IBM 3850M2, it has only one node:

        

 

From the numactl --hardware output above, we can see that on the first machine each node has 81894 MB of RAM available (about 80 GB), while the single node of the IBM 3850M2 server has 131070 MB (roughly 128 GB) available, essentially the server's entire memory.
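Since the original screenshots are not reproduced here, note that on a two-node machine like the 3650M4 the output of numactl --hardware generally has the following shape (the CPU lists and free figures are illustrative, not the actual values from these servers):

    # numactl --hardware
    available: 2 nodes (0-1)
    node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11
    node 0 size: 81894 MB
    node 0 free: 1280 MB
    node 1 cpus: 12 13 14 15 16 17 18 19 20 21 22 23
    node 1 size: 81894 MB
    node 1 free: 2560 MB
    node distances:
    node   0   1
      0:  10  21
      1:  21  10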

Next let us look at how NUMA memory allocation is actually being scheduled:

Running the numastat command, we see:

   

3650M4

   

  

3850M2

 

Explanation of the fields:

  • numa_hit: the number of times memory was successfully allocated on this node, as intended.

  • numa_miss: the number of times memory intended for another node was allocated on this node instead.

  • numa_foreign: the number of times memory intended for this node ended up allocated on another node.

  • interleave_hit: the number of interleaved allocations that landed on this node as intended.

  • local_node: the number of times a process running on this node allocated memory on this node.

  • other_node: the number of times a process running on another node allocated memory on this node.

   

Then let us look at this command: numastat -c. Following the command with a process name shows that process's NUMA memory usage. For example, numastat -c qemu-kvm shows, for the qemu-kvm process, how much of its memory sits on node0 and node1, in MB:
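The screenshot is omitted; output of roughly the following shape is what to expect from numastat -c qemu-kvm (the PID and figures are illustrative):

    # numastat -c qemu-kvm

    Per-node process memory usage (in MBs) for PID 8356 (qemu-kvm)
             Node 0 Node 1 Total
             ------ ------ -----
    Huge          0      0     0
    Heap          1      0     1
    Stack         0      0     0
    Private    4096    128  4224
    -------  ------ ------ -----
    Total      4097    128  4225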

 

  

OK, with these commands we can view the basic NUMA state and usage. As for NUMA and the CPU, the Linux operating system also has its own design for this area: by default Linux uses automatic NUMA balancing, i.e. the system automatically balances NUMA memory allocation on its own.

Of course, users can take control themselves. To turn it off or on, run:

     # echo 0 > /proc/sys/kernel/numa_balancing    (disable)

     # echo 1 > /proc/sys/kernel/numa_balancing    (enable)
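You can read the same file back to confirm the current state (assuming your kernel was built with automatic NUMA balancing support; 1 means on, 0 means off):

     # cat /proc/sys/kernel/numa_balancing
     1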

1.4 CPU binding operation

Having said all that: since our operating system and CPUs use the NUMA architecture, we can optimize KVM's CPU behavior by adjusting the guest's NUMA placement accordingly. In general, this is done through CPU binding.

So what does the operation look like in practice? Let us demonstrate with an example. On the single physical machine we saw earlier, KVM is installed and several virtual machines are running; we use the virsh list command to view the currently running VMs.

 

 

For example, to see how the vCPUs of the Win7-ent virtual machine map to physical CPUs, run:

   # virsh vcpuinfo Win7-ent

     

 

This VM has two vCPUs, and both are currently running on physical CPU 8; the CPU time used is 2964.6s. The last field is the CPU affinity: the string of ys represents the logical cores of the physical CPUs, one y per logical core. When every position is y, the vCPU can be scheduled onto any of this machine's 24 physical CPU cores.
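The screenshot is missing, so here is roughly what virsh vcpuinfo prints on a 24-core host (values illustrative, matching the description above):

     # virsh vcpuinfo Win7-ent
     VCPU:           0
     CPU:            8
     State:          running
     CPU time:       2964.6s
     CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyy

     VCPU:           1
     CPU:            8
     State:          running
     CPU time:       1422.3s
     CPU Affinity:   yyyyyyyyyyyyyyyyyyyyyyyy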

Of course, we can also enter virsh and run emulatorpin Win7-ent; this command gives a more precise view of which physical cores the virtual machine may use:

 

We can see that the VM can currently be scheduled onto CPUs 0-23.

So much for viewing the VM's CPU/NUMA scheduling information. If we want to bind the VM to fixed CPUs, we do the following: # virsh emulatorpin Win7-ent 18-23 --live. With this command, the win7 VM is bound to the six cores 18-23. (Strictly speaking, emulatorpin pins the VM's QEMU emulator threads; the per-vCPU binding command, vcpupin, is shown further below.)

We check it with emulatorpin Win7-ent:

 

We can also confirm it with virsh dumpxml Win7-ent:
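In the dumped XML the binding shows up under <cputune>; a sketch of what to look for (the cpuset value matches what was passed to emulatorpin):

     <cputune>
       <emulatorpin cpuset='18-23'/>
     </cputune>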

 

That is the method for binding a VM's vCPUs together as a group.

Some people may wonder: my VM has two vCPUs, say this dual-core win7; can the two vCPUs be bound to different physical CPUs? How is that done? It is indeed possible; we can bind each vCPU individually with the following commands:

    # virsh vcpupin Win7-ent 0 22

    # virsh vcpupin Win7-ent 1 23

    # virsh dumpxml Win7-ent

 

 

   # virsh vcpuinfo Win7-ent

     

 

 

 

OK, note the following: if you reboot the virtual machine, the binding configuration stays in effect; but if you shut it down, the CPU binding is lost. If we want the binding to survive a shutdown and restart, the parameters must be written into the VM's XML and saved, so they are not lost on shutdown. Note the following:

   # virsh edit vm1
   Add:
   <cputune>
      <vcpupin vcpu='0' cpuset='22'/>
      <vcpupin vcpu='1' cpuset='23'/>
   </cputune>
   then save and quit with :wq

     

 

 

OK, the above is the CPU binding technique. With it, we can pin a VM on a single physical host to specific physical CPUs. As for why this works, recall the NUMA principle discussed above: if the VM's CPUs are fixed, it will not wander off to remote nodes. There is also another scenario: on a host with many CPUs, if the first few CPUs are heavily loaded while the later ones sit mostly idle, we can use CPU binding to even out the CPU load.

That covers CPU binding; next let us talk about hot-adding CPUs.

 

1.5 CPU hot-add

First, what is hot-add? Hot-adding means adding CPUs to a virtual machine while it is running, without shutting it down. Note that hot-add only became available with Red Hat 7.0; it did not exist before that. So to enjoy this feature, both the KVM guest and the host must be at version 7.0 or later. Let us demonstrate the operation step by step.

Take the current virtual machine, a CentOS 7.1 guest. We check its current CPU count from inside the system with cat /proc/cpuinfo | grep "processor" | uniq | wc -l, and see that it currently has 2 CPUs:

 

Next, what does the maximum CPU allocation mean? It is the maximum number of CPUs reserved for the virtual machine. This setting is important: if you want to hot-add CPUs to the VM, it must be configured in advance. For example, if we write 4 here, then we can hot-add CPUs to the VM up to a total of 4; 4 is the ceiling.
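This maximum lives in the <vcpu> element of the guest's XML: the element's value is the ceiling, and the current attribute is how many vCPUs are active right now. A sketch with the values used in this example:

   <vcpu placement='static' current='2'>4</vcpu>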

So how do we actually hot-add? Let us give the virtual machine a third CPU from the host; it had two, and we add one more to make three:

    # virsh setvcpus VM3_CentOS7.1 3 --live

Then we go inside the guest and activate the new CPU:

echo 1 >/sys/devices/system/cpu/cpu2/online

 

Running the count command again, we find it has become three.

If you want to reduce the count, you can only deactivate the CPU from inside the guest:

    # echo 0 >/sys/devices/system/cpu/cpu2/online

 

But at the host level the VM's vCPU count is still three; that is, hot-removal is not supported. Running virsh vcpuinfo VM3_CentOS7.1 still shows 3:

 

 

Windows guests work the same way; add a third CPU directly from the host:

    # virsh setvcpus VM4_Win2008 3 --live

Then, without any action inside the guest, Windows automatically brings the third CPU online. A similar demonstration can be done on a Windows VM; readers can try the concrete steps themselves.

That concludes the CPU side of KVM optimization. To sum up, there are two points: CPU binding and CPU hot-add.

CPU binding requires first understanding NUMA technology, then making adjustments from the standpoint of the whole host's CPU resources.

Hot-add lets you raise a running VM's CPU performance with zero downtime when business pressure suddenly increases.

Reference links:

https://mp.weixin.qq.com/s?__biz=MzU0NDEyODkzMQ==&mid=2247494800&amp;idx=1&amp;sn=f8b280b15b3e13ec72afa940da60ed29&source=41#wechat_redirect

http://www.cnblogs.com/yubo/archive/2010/04/23/1718810.html

http://cenalulu.github.io/linux/numa/

 
