What does Linux swap do?

What does swap do?

Under Linux, swap plays a role similar to that of "virtual memory" (the page file) under Windows. When physical memory runs short, part of the disk is carved out as the swap partition and used as virtual memory, to cope with situations where memory capacity is insufficient.

SWAP means exchange. As the name suggests, when a process requests memory and the OS finds it has run out, the OS swaps temporarily unused data out of memory onto the swap partition; this is called swap-out. When a process later needs that data and the OS finds there is now free physical memory, it swaps the data from the swap partition back into physical memory; this is called swap-in.

Of course, swap space has an upper limit. Once swap is exhausted, the operating system triggers the OOM-killer mechanism, which kills the process consuming the most memory in order to free memory.
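A quick way to watch swap activity on a live system (a minimal sketch; output columns vary slightly between versions):

free -h      # total/used swap and free memory at a glance
vmstat 1     # si/so columns show swap-in/swap-out per second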

 

Why do database systems despise swap?

Apparently, the original intention of the swap mechanism is to ease the embarrassment of physical memory running out and a process being brutally OOM-killed. But frankly, almost all databases think poorly of swap, whether MySQL, Oracle, MongoDB or HBase. Why? Mainly for the following two reasons:

1. Database systems are generally sensitive to response latency. If swap is used in place of memory, database performance is bound to be unacceptable. For a system extremely sensitive to response latency, excessive latency is no different from the service being unavailable; in fact it is worse than unavailability, because under swap the process refuses to die, which means the system stays unavailable the whole time... Think about it the other way: if swap were not used and OOM struck directly, a system with master-slave failover would simply switch over, largely without users noticing.

2. Moreover, for a distributed system such as HBase, what we fear is not a node dying, but a node hanging. A dead node means, at worst, that a small fraction of requests is briefly unavailable and can be recovered by retrying. A hung node, however, drags down every request routed to it, pinning server-side resources and threads; this can block requests across the whole cluster and even bring the cluster down.

From these two points of view, it is entirely reasonable that databases dislike swap!

 

How swap works

Since databases look down on swap so much, shouldn't we just run the swapoff command and turn this disk-backed feature off altogether? No. Think about what turning it off completely would mean. No real production environment is that radical; this world is not always 0 or 1, and we usually choose something in between, leaning toward 0 in some cases and toward 1 in others. Obviously, on the question of swap, databases choose to use it as little as possible. Several requirements in the HBase official documentation implement exactly this policy: reduce the impact of swap as much as possible. To know your enemy, you must first figure out how Linux memory reclaim works, so that no suspect escapes our attention.

 

Let's first look at how swap is triggered.

 

In short, Linux triggers memory reclaim in two scenarios. One is at allocation time: when an allocation finds there is not enough free memory, reclaim is triggered immediately. The other is a background daemon (the kswapd process) that periodically checks system memory and actively triggers reclaim once available memory falls below a certain watermark. There is nothing to say about the first scenario; the focus is the second, illustrated below:


[Figure: free-memory watermarks (min/low/high) and kswapd reclaim behavior]
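The two trigger paths can be told apart with the kernel's reclaim counters (a minimal sketch; counter names differ slightly between kernel versions):

grep -E 'pgscan_kswapd|pgscan_direct' /proc/vmstat    # background vs. direct reclaim scans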


Here we must bring in the first parameter that deserves attention: vm.min_free_kbytes. It represents the minimum free memory the system keeps, watermark[min], and it also determines watermark[low] and watermark[high]. To a first approximation:


watermark[min] = min_free_kbytes
watermark[low] = watermark[min] * 5 / 4 = min_free_kbytes * 5 / 4
watermark[high] = watermark[min] * 3 / 2 = min_free_kbytes * 3 / 2
watermark[high] - watermark[low] = watermark[low] - watermark[min] = min_free_kbytes / 4

 

Clearly, these Linux watermarks are inseparable from the parameter min_free_kbytes. Its importance to the system is self-evident: it can be neither too large nor too small.

 

If min_free_kbytes is too small, the buffer zone between [min, low] is small, and while kswapd is reclaiming, an upper-layer application allocating memory too fast (typical culprit: a database) can easily push free memory below watermark[min]. The kernel then performs direct reclaim: it reclaims memory in the context of the allocating process itself, and only hands the freed pages over afterwards, which blocks the application and adds response latency. Of course, min_free_kbytes should not be too large either: that both shrinks the memory available to application processes, wasting system memory, and forces kswapd to spend a lot of time on reclaim. Doesn't this process resemble the old-generation collection trigger mechanism of Java's CMS garbage collector? Think of the parameter -XX:CMSInitiatingOccupancyFraction. The official documentation requires min_free_kbytes to be no less than 1G (set to 8G on large-memory systems), precisely so that direct reclaim is not triggered too easily.
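To inspect the current watermarks and raise min_free_kbytes (a sketch; the 8G figure follows the recommendation above and must be expressed in KB):

grep -E 'Node|min|low|high' /proc/zoneinfo    # per-zone watermarks, in pages
sysctl vm.min_free_kbytes                     # current value
sysctl -w vm.min_free_kbytes=8388608          # 8G expressed in KB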

 

At this point we have basically explained the Linux memory reclaim trigger mechanism and the first parameter of interest, vm.min_free_kbytes. Next, a brief look at what Linux memory reclaim actually reclaims. There are two main types:

1. The file cache. This is easy to understand: to avoid reading file data from disk on every access, the system keeps hot data in memory to improve performance. If the files were only read, reclaim merely needs to release this memory; the next time the data is needed it is simply read from disk again (similar to the HBase file cache). If the cached file data was also modified (dirty pages), reclaiming the memory requires writing the data back to disk first and then releasing it (similar to the MySQL file cache).

2. Anonymous memory. This memory has no backing carrier, unlike the file cache with its file on disk; typical examples are heaps and stacks. Such memory cannot simply be released or written back to a file when reclaimed, which is exactly why the swap-out mechanism exists: this type of memory is swapped out to disk and loaded back in when needed. Both reclaim targets are visible in /proc/meminfo, as the snippet below shows.
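File-backed pages show up under Cached and Dirty, anonymous pages under AnonPages (a minimal sketch; field names are stable across modern kernels):

grep -E 'Cached|Dirty|AnonPages|SwapTotal|SwapFree' /proc/meminfo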

 

Which algorithm Linux uses to decide which file cache or anonymous memory to reclaim does not concern us here; interested readers can look it up. But one question is worth thinking about: given that both types of memory are reclaimable, when both are available, how does Linux decide which type to reclaim? Or does it reclaim both? This brings out the second parameter of concern: swappiness. This value defines how aggressively the kernel uses swap: the higher the value, the more actively the kernel swaps; the lower the value, the less. It ranges from 0 to 100, with a default of 60. How is swappiness implemented? The exact mechanism is complex; simply put, swappiness achieves its effect by controlling whether reclaim takes more anonymous pages or more file cache. With swappiness equal to 100, anonymous memory and file cache are reclaimed with equal priority; the default of 60 means the file cache is reclaimed first. As to why the file cache is reclaimed preferentially, think about it for a moment (reclaiming clean file cache usually causes no IO and has little impact on system performance). For databases, swap is to be avoided as much as possible, so it should be set to 0. Note that setting it to 0 does not mean swap is never performed!
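Checking and pinning the value (a minimal sketch; the persistence step assumes the conventional /etc/sysctl.conf location):

cat /proc/sys/vm/swappiness                     # current value, default 60
sysctl -w vm.swappiness=0                       # apply immediately
echo 'vm.swappiness = 0' >> /etc/sysctl.conf    # persist across reboots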

 

So far we have covered the Linux memory reclaim trigger mechanism and reclaim targets in relation to swap, and explained the parameters min_free_kbytes and swappiness. Now let's look at another swap-related parameter: zone_reclaim_mode. The documentation says setting it to 0 turns off NUMA zone reclaim. What is this about? Mention NUMA and database people frown in unison; plenty of DBAs have been badly burned by it. Three small questions, briefly: What is NUMA? What does NUMA have to do with swap? What exactly does zone_reclaim_mode mean?

NUMA (Non-Uniform Memory Access) is defined relative to UMA; both are CPU architecture designs. Early CPUs were designed as UMA structures, as shown below (image from the web):



[Figure: UMA architecture]

 

To relieve the bottleneck that multiple cores hit when reading through the same memory channel, chip design engineers devised the NUMA structure, as shown below (image from the web):



[Figure: NUMA architecture]


This architecture nicely solves UMA's problem: each CPU has its own exclusive memory region. To achieve "memory isolation" between CPUs, two points of support are needed at the software level:

1. Memory allocation must happen within the exclusive memory region of the CPU where the requesting thread currently runs. If memory were allocated from another CPU's exclusive region, isolation would suffer to some degree, and memory access performance across the bus would inevitably drop somewhat.

2. In addition, once local memory (the exclusive memory) runs short, memory pages are preferentially evicted from local memory, rather than checking whether a remote memory region has free memory to borrow.

 

Isolation is indeed achieved well, but a problem follows: NUMA can lead to uneven memory usage across CPUs. One CPU's exclusive memory runs short and must be reclaimed frequently, possibly causing heavy swapping and severe jitter in system response latency, while the exclusive memory of other CPUs may sit largely idle. This produces a strange phenomenon: the free command shows the system still has free physical memory, yet swapping keeps happening, and the performance of some applications drops sharply. See teacher Ye Jinrong's MySQL case study: "Finding the culprit behind SWAP on a MySQL server."

 

So for applications with a small memory footprint, the problem NUMA brings is not prominent; on the contrary, local memory delivers a considerable performance gain. But for memory-hungry applications such as databases, the stability risks of NUMA's default policy are unacceptable. Database folks therefore strongly urge improving the NUMA defaults, and there are two aspects to improve:

1. Change the memory allocation policy from the default affinity mode to interleave mode, i.e., allocate memory pages round-robin across the different CPU zones. This evens out memory distribution and, to some extent, alleviates the strange phenomenon described above. MongoDB, for example, prompts you at startup to use the interleave allocation policy:

 

WARNING: You are running on a NUMA machine.
We suggest launching mongod like this to avoid performance problems:
numactl --interleave=all mongod [other options]


2. Improve the memory reclaim policy: here, at last, comes today's third protagonist, the parameter zone_reclaim_mode. It defines the reclaim strategies available under the NUMA architecture and can take the values 0/1/3/4: 0 means that when local memory is short, memory may be allocated from other memory regions; 1 means that when local memory is short, reclaim happens locally before re-allocating; 3 means local reclaim preferentially reclaims the file cache; 4 means local reclaim preferentially uses swap to reclaim anonymous memory. Clearly, the HBase-recommended setting zone_reclaim_mode=0 reduces the probability of swap to a certain extent.
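A sketch of checking the NUMA layout and applying the recommended setting (numactl and numastat ship in the numactl package on most distributions):

numactl --hardware                  # node count and per-node free memory
numastat                            # per-node allocation hits and misses
sysctl -w vm.zone_reclaim_mode=0    # allocate remotely instead of reclaiming locally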

 

It is not all about swap

So far we have discussed three swap-related system parameters and interpreted them in depth around Linux memory allocation, swap, NUMA and related topics. Beyond these, for database systems there are two more very important parameters that deserve attention:

1. IO scheduling policy: there are many detailed explanations of this topic online, so I will not elaborate and give only the conclusion. Generally, for OLTP databases on SATA disks, the deadline scheduler is the best choice; the snippet below shows how to check and switch it.
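(A sketch: "sda" is a placeholder device name; on newer multi-queue kernels the equivalent choice is mq-deadline.)

cat /sys/block/sda/queue/scheduler                # active scheduler shown in brackets
echo deadline > /sys/block/sda/queue/scheduler    # switch to deadline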

2. Turning off the THP (Transparent Huge Pages) feature. THP puzzled me for a long time, with two main doubts: first, whether THP and HugePages are the same thing; second, why HBase demands that THP be turned off. After going back and forth through the relevant documents many times, I finally found some clues. Four points explain the THP feature:

(1) What are HugePages?

There are plenty of explanations of HugePages online to look up. Briefly, computer memory is addressed through a page table (the memory's index table), and the system currently uses a 4KB page as the smallest unit of memory addressing. As memory grows, the page table keeps growing with it. On a machine with 256G of memory, small 4KB pages can make the page table alone reach about 4G. The page table must be cached in memory, and its hot entries in the CPU's cache (the TLB); if it is too large, many misses occur and memory addressing performance drops.

HugePages exist to solve this problem. HugePages manage memory with a large 2MB page size instead of the traditional small pages, so the page table stays small enough to fit in the CPU cache and misses are avoided. For instance, moving from 4KB to 2MB pages cuts the number of page-table entries by a factor of 512.

(2) What is THP (Transparent Huge Pages)?

HugePages is the theory of large pages; how does one actually use the HugePages feature? The system currently provides two ways: one is called Static Huge Pages, the other Transparent Huge Pages. The former, as the name suggests, is a static management policy: the user must manually configure the number of huge pages according to the system memory size, the system creates that many huge pages at startup, and the number never changes afterwards. Transparent Huge Pages, by contrast, is a dynamic management policy: it allocates huge pages to applications dynamically at run time and manages them, entirely transparently to the user and without any configuration. In addition, THP currently applies only to anonymous memory regions.
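The current THP mode can be read straight from sysfs; the bracketed entry is the active one (a minimal sketch):

cat /sys/kernel/mm/transparent_hugepage/enabled
# e.g. [always] madvise never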

(3) Why do HBase and other databases demand that THP be turned off?

THP is a dynamic management policy that allocates and manages huge pages at run time, so it carries some allocation latency, which is unacceptable for latency-chasing database systems. THP has many other drawbacks as well; see the article "why-tokudb-hates-transparent-hugepages".

(4) How much impact does turning THP off/on have on HBase read/write performance?

To verify how much impact turning THP off versus on actually has on HBase performance, I ran a simple test in a test environment: the test cluster has only one RegionServer, and the test load is a 1:1 read/write mix. THP has the options always and never; on some systems there is an additional option called madvise. You can use the command echo never/always > /sys/kernel/mm/transparent_hugepage/enabled to turn THP off/on. The test results are shown below:



[Figure: HBase read/write throughput with THP off (never) vs. on (always)]

 

As the figure shows, HBase performs best and most stably in the THP-off (never) scenario, while in the THP-on (always) scenario performance drops by about 30% compared with the off scenario and the curve jitters heavily. Clearly, remember to turn THP off on your HBase production line.

 

Summary

The performance of any database is tied to many factors: factors of the database itself, such as its configuration, how clients use it, capacity planning and table schema design, and also, critically, the underlying systems it depends on, such as the operating system and the JVM. Quite often, when a database performance problem resists diagnosis after checking left and right, it is time to examine whether the operating system is configured sensibly. This article started from several parameters required by the HBase official documentation and explained their specific meanings in detail. There is more to cover; interested readers can consult the articles referenced above.

 

Reprinted from: http://hbasefly.com/2017/05/24/hbase-linux/?lkfgjq=xbbdl2
