[Performance] Application of HugePages in general program optimization

Table of Contents

1. Background

2. Introduction to fingerprint-based music retrieval

3. Principle

4. The dilemma of small pages

5. Configuration and use of large page memory

6. Optimization effect of large page memory

7. Usage scenarios of large page memory

8. Summary

LD_PRELOAD usage


 

 

Original: https://blog.csdn.net/yutianzuijin/article/details/41912871

Today I will introduce a relatively little-known program performance optimization technique: HugePages. In short, it shrinks the page table by increasing the operating system's page size, thereby avoiding TLB (translation lookaside buffer) misses. Material on this topic is scarce, and most of what exists online covers its use with the Oracle database, which gives the illusion that the technique only applies there. In fact, huge page memory is a very general optimization with a wide range of applications; for some workloads it can bring up to a 50% performance improvement, which is a very noticeable effect. In this post I will walk through a concrete example of using huge page memory.

       Before diving in, it must be emphasized that huge pages have a limited scope of application. If a program consumes very little memory, or its memory access pattern has good locality, huge pages will bring little improvement. So if the optimization problem you face has either of these two characteristics, do not reach for huge pages. Later I will explain in detail why huge pages do not help in these two cases.

1.  Background

       Recently I have been working on a "listen and recognize" music project at the company; for details, see the post on fingerprint-based music retrieval, and the service is currently live on the Sogou voice cloud open platform. During development I ran into a very serious performance problem: single-threaded tests met the performance requirements, but under multi-threaded stress testing the most time-consuming part of the algorithm suddenly became several times slower. After careful debugging I found that the biggest culprit was the compiler option -pg; after removing it, things improved a lot, but the hot path was still about 2x slower than in the single-threaded case, which pushed the system's real-time factor above 1.0 and severely hurt responsiveness.

       Further analysis showed that the most time-consuming part of the system was access to the fingerprint database, and this part had essentially no room for conventional optimization; the only option seemed to be a machine with higher memory bandwidth. Switching to such a machine did bring a considerable improvement, but still not enough. Just as we were running out of ideas, I happened to see on Weibo that Dr. Chuntao Hong of MSRA had mentioned using huge page memory to optimize random array accesses, with very good results. I asked him for help, and in the end huge pages improved system performance further, bringing the real-time factor down to about 0.4. Goal achieved!

2. Introduction to fingerprint-based music retrieval

The retrieval process works much like a search engine: a music fingerprint plays the role of a query keyword, and the fingerprint database plays the role of the search engine's web index. Like a web index, the fingerprint database is organized as an inverted index, as shown below:

 

Figure 1 Fingerprint-based inverted index table

The difference is that each fingerprint is just an int-sized integer (only 24 bits are used in the figure), which carries very little information, so many fingerprints must be extracted to complete one match, roughly a few thousand per second of audio. For each fingerprint, we access the fingerprint database to fetch the corresponding inverted list, and then build a forward table keyed by music id to determine which song matches, as shown in the following figure:

 

Figure 2 The similarity of statistical matching

The final result is the piece of music with the highest score.

The current fingerprint database is about 60 GB, the result of extracting fingerprints from 250,000 songs. The inverted list for each fingerprint has variable length, capped at 7,500 entries. The forward table likewise covers the 250,000 songs, and each song can have at most 8,192 distinct time-difference bins. A single search generates about 1,000 fingerprints, sometimes more.

From the above, fingerprint-based music retrieval ("listen and recognize") has three stages: 1. extract fingerprints; 2. query the fingerprint database; 3. score by time difference. Under multi-threading, their time shares are roughly 1%, 80%, and 19%: most of the time goes to fingerprint database lookups. Worse, these lookups are random accesses with essentially no locality, so the cache misses constantly, conventional optimizations are ineffective, and the only recourse seemed to be a server with higher memory bandwidth.

However, it is precisely these characteristics (huge memory consumption of about 100 GB, random memory access, and memory access being the bottleneck) that make huge page memory particularly well suited to the performance bottleneck described above.

3. Principle

The principle behind huge pages involves how the operating system translates virtual addresses into physical addresses. To run multiple processes at once, the operating system gives each process its own virtual address space: 4 GB on a 32-bit system, and up to 2^64 bytes on a 64-bit system (in practice less than that). For a long time this confused me: would two processes accessing the same address, say 0x00000010, conflict? In fact, no. Each process's address space is virtual and distinct from physical memory, so the same virtual address in two processes maps to different physical addresses after translation. This translation is implemented through the page table, and the relevant background is the operating system's paged memory management.

Paged memory management divides a process's virtual address space into pages of equal size and numbers them; correspondingly, physical memory is divided into frames (blocks) of the same size, also numbered. Assuming each page is 4 KB, the paged address structure on a 32-bit system is:

 

To let a process find the physical frame backing each virtual page, the system maintains a mapping table for each process: the page table. The page table records, for every virtual page, the number of the physical frame that holds it, as shown in Figure 3. Once the page table is set up, the physical frame of any page can be found by a table lookup during execution.

The CPU has a page-table register that holds the start address and length of the current page table in memory. While a process is not running, these two values are kept in its PCB (process control block); when the scheduler dispatches the process, they are loaded into the page-table register.

When a process accesses data at a virtual address, the paging hardware automatically splits the effective address into a page number and an in-page offset, then uses the page number as an index into the page table; all of this is done in hardware. If the page number does not exceed the page-table length, the hardware adds the page-table start address to the product of the page number and the entry size to locate the page-table entry, reads the physical frame number from it, and loads it into the physical address register. The in-page offset from the effective address is copied into the offset field of the physical address register. This completes the translation from virtual address to physical address.

 

Figure 3 The role of the page table

Because the page table itself is stored in memory, every data access costs the CPU two memory accesses. The first access reads the page table to find the physical frame number of the page, which is concatenated with the in-page offset to form the physical address; the second access fetches the actual data from that address. Translating this way can therefore cut the machine's effective memory performance roughly in half.

To speed up address translation, a small, specialized cache with parallel lookup capability is added to the translation path: the TLB (translation lookaside buffer, sometimes called the "fast table"), which holds recently used page-table entries. The translation mechanism with a TLB is shown in Figure 4. Because of cost, the TLB cannot be made large; it typically holds only 16 to 512 entries.

This translation mechanism works very well for small and medium-sized programs: the TLB hit rate is high, so it costs little performance. But when a program consumes a lot of memory and the TLB hit rate drops, the trouble begins.

 

Figure 4 Address translation mechanism with a TLB

4. The dilemma of small pages

       Modern computer systems support very large virtual address spaces (2^32 to 2^64). In such an environment the page table becomes enormous: with 4 KB pages, a program occupying 40 GB of memory needs about 10M page-table entries (40 GB / 4 KB), and a flat table would also have to occupy contiguous space. To solve the contiguity problem, two- or three-level page tables are introduced. But that hurts performance even more: on a TLB miss, the number of memory accesses per lookup grows from two to three or four. And because the memory the program can reach is so large, if its access locality is poor the TLB will miss constantly, seriously degrading performance.

       Moreover, with some 10M page-table entries and a TLB that caches only a few hundred, even a program with good access behavior will miss the TLB often when its memory footprint is large. So is there a good way to eliminate TLB misses? Huge page memory! Suppose the page size is 1 GB: mapping 40 GB then takes only 40 page-table entries, and the TLB essentially never misses. Even when it does miss, the table is small enough for a single level, so a miss costs only two memory accesses. This is the fundamental reason huge pages can optimize program performance: the TLB almost never misses.

       Earlier we said that huge pages barely help programs with tiny memory footprints or good access locality; now we can see why. If a program uses only a few megabytes, it has very few page-table entries; the TLB can likely cache them all, and even a miss is cheap with a single-level table. If the program's access locality is good, then over any short window it touches neighboring memory, so TLB misses are rare anyway. In both cases the TLB rarely misses to begin with, and huge pages have no advantage to show.

5. Configuration and use of large page memory

       Much of the material online pairs huge pages with the Oracle database, creating the illusion that they can only be used there. From the analysis above we know that huge page memory is in fact a very general optimization whose mechanism is avoiding TLB misses. The concrete steps for applying it are described below.

 

1. Install libhugetlbfs library

       The libhugetlbfs library implements huge-page-backed memory allocation. It can be installed with apt-get or yum; if your distribution does not package it, you can also download it from the official website.

The use of libhugetlbfs in Linux: https://www.dazhuanlan.com/2019/11/22/5dd71081e318e/

2. Configure the grub startup file

      This step is critical: it determines the size of each huge page and how many huge pages are allocated. The specific operation is to edit the /etc/grub.conf file, as shown in Figure 5.

 

Figure 5 grub.conf startup script

Specifically, append these startup parameters to the end of the kernel line: transparent_hugepage=never default_hugepagesz=1G hugepagesz=1G hugepages=123. Of the four, the last two matter most. hugepagesz sets the size of each huge page; we set it to 1G, and 4K or 2M are other options (2M is the default huge-page size). If the operating system version is too old, the 1G setting may fail, so check your kernel version if it does. hugepages sets how many huge pages to reserve: our machine has 128 GB of RAM, and we dedicate 123 GB of it to huge pages. Note that reserved huge pages are invisible to conventional programs. For example, with only 5 GB of ordinary memory left on our system, starting a program that needs 10 GB in the conventional way will fail. After modifying grub.conf, reboot the system, then run cat /proc/meminfo | grep Huge to check whether the huge-page setting took effect. If it did, you will see output like the following:

 

Figure 6 Current consumption of large pages

Four of these values matter. HugePages_Total is the total number of huge pages. HugePages_Free is the number of huge pages still unused after programs have started. HugePages_Rsvd is the number of huge pages the system has reserved: more precisely, pages a program has requested from the system but that have not actually been assigned to it yet, because the program has performed no real reads or writes on them. Hugepagesize is the size of each huge page, 1 GB here.

       In our experiments we found that Free and Rsvd do not mean quite what their names suggest. If the huge pages we reserved are not enough to start the program, the system prints an error like this:

libhugetlbfs:WARNING: New heap segment map at 0x40000000 failed: Cannot allocate memory

At that point, looking at the four values again reveals something odd: HugePages_Free equals some value a, and HugePages_Rsvd equals the same a. This seems strange: there are apparently free huge pages left, yet the system reports an allocation failure. After many attempts, we concluded that the Rsvd pages are counted inside Free, so when Free equals Rsvd there are actually no huge pages available. Free minus Rsvd is what can still be allocated; in Figure 6, for example, 16 huge pages remain available.

How many huge pages should you reserve? It takes experimentation. One lesson we learned is that letting each worker thread allocate huge pages on its own is very wasteful; it is best to allocate all the space in the main thread and then hand pieces of it to the worker threads, which significantly reduces wasted huge pages.

 

3. Mount hugetlbfs

Execute mount to map the huge page memory onto an empty directory. For example:

 

if [ ! -d /search/music/libhugetlbfs ]; then
    mkdir /search/music/libhugetlbfs
fi
mount -t hugetlbfs hugetlbfs /search/music/libhugetlbfs

 

4. Run the application

To enable huge pages, you cannot start the application the usual way; launch it in the following format:

HUGETLB_MORECORE=yes LD_PRELOAD=libhugetlbfs.so ./your_program

This preloads the libhugetlbfs library ahead of the standard library; concretely, it replaces the standard malloc with a huge-page-backed malloc. From then on, the memory the program requests is huge page memory.

That is all: following the four steps above is enough to enable huge pages.

 

 

6. Optimization effect of large page memory

If your application's memory accesses are heavily out of order, huge page memory can bring a sizable gain. Our listen-and-recognize system happens to be exactly such an application, so the optimization effect is obvious. Below is the program's performance with and without huge pages on the 250,000-song library.

As the numbers show, with huge pages enabled the program's memory access time drops significantly, an improvement of nearly 50%, which meets our performance requirements.

7. Usage scenarios of large page memory

Every optimization method has its scope of application, and huge pages are no exception. As we have emphasized, only programs that consume huge amounts of memory, access it randomly, and are bottlenecked on memory access will see a significant gain from huge pages. In our listen-and-recognize system, memory consumption is close to 100 GB and accesses are entirely out of order, so the gain is large. The online examples built around the Oracle database are not unreasonable either: Oracle also consumes enormous amounts of memory, and database inserts, deletes, and updates lack locality. Under the hood, those operations are mostly B-tree operations, and tree operations generally have poor locality.

What kinds of programs have poor locality? In my experience, programs built on hashing or tree structures often have poor memory access locality; if such a program performs badly, huge pages are worth a try. Conversely, simple array traversal or breadth-first graph traversal has good memory access locality, and huge pages are unlikely to help. I tried enabling huge pages for the Sogou speech recognition decoder, hoping for a speedup, but the result was disappointing: no improvement at all. The decoder is essentially a breadth-first search over a graph, with good access locality, and memory access is not its bottleneck. In such cases huge pages can introduce overhead of their own and even degrade performance.

8. Summary

This post introduced the principle and usage of huge page memory in detail through the listen-and-recognize example. With the proliferation of big data, applications process ever more data with ever less regular access patterns, which creates opportunities for huge pages. So if your program runs slowly and fits the conditions above, give huge pages a try: the change is simple and low-risk, and it may pay off handsomely.

 

LD_PRELOAD usage

(English: https://catonmat.net/simple-ld-preload-tutorial )

LD_PRELOAD is an environment variable that controls dynamic library loading, and libraries it names are loaded with the highest priority. In general, the search order is LD_PRELOAD > LD_LIBRARY_PATH > /etc/ld.so.cache > /lib > /usr/lib. Programs frequently call external library functions; take malloc as an example. If we write our own malloc, compile it into a shared library, and load it via LD_PRELOAD, then calls to malloc in the program actually invoke our custom function. Let's illustrate with an example.

// test.c
#include <stdio.h>
#include <stdlib.h>

int main()
{
    int i = 0;
    for (; i < 5; ++i) {
        char *c = (char*)malloc(sizeof(char));
        if (NULL == c) {
            printf("malloc fails\n");
        }
        else {
            printf("malloc ok\n");
        }
    }

    return 0;
}

Compile and run, and the results are as follows:

$gcc -o test test.c
$./test
malloc ok
malloc ok
malloc ok
malloc ok
malloc ok

The program runs fine. Now let's make a small change and supply our own malloc.

// preload.c
#include <stdio.h>
#include <stdlib.h>

// A stand-in malloc: logs the requested size and deliberately returns NULL.
void* malloc(size_t size)
{
    printf("%s size: %lu\n", __func__, size);
    return NULL;
}

Then package the custom malloc as a dynamic library.

$gcc -shared -fpic -o libpreload.so preload.c

Then use LD_PRELOAD to load libpreload.so and see what happens:

$LD_PRELOAD=./libpreload.so ./test
malloc size: 1
malloc fails
malloc size: 1
malloc fails
malloc size: 1
malloc fails
malloc size: 1
malloc fails
malloc size: 1
malloc fails

As you can see, all 5 calls hit our custom malloc and return NULL. If you did not know LD_PRELOAD was at work, you could analyze for a long time without finding the cause. LD_PRELOAD is a double-edged sword: used well it helps us; in the wrong hands it can produce unpleasant surprises.

" Be aware of the LD_PRELOAD environment variable under UNIX " https://blog.csdn.net/haoel/article/details/1602108


Origin blog.csdn.net/bandaoyu/article/details/113559126