Design and Implementation of C++ High Concurrency Memory Pool

1. Overall design

1. Requirements analysis

Pooling is a common design technique in computing, and the memory pool is one of its most common forms. A memory pool can markedly improve the efficiency of allocating and releasing memory and reduce memory fragmentation. Traditional memory pools still have shortcomings, however. Compared with an ordinary memory pool, a high-concurrency memory pool has its own distinctive design, which solves several problems the traditional version leaves open.


1) Problems with using new/delete and malloc/free directly

new/delete performs dynamic memory management in C++, while malloc/free can be used in both C and C++. Under the hood, new/delete is implemented on top of malloc/free. Whichever interface is used, two problems remain:

Efficiency: frequently allocating and releasing memory on the heap costs system time on every call, which lowers the program's running efficiency. For a program that allocates and frees constantly, the repeated calls to new/malloc and delete/free add up quickly.

Memory fragmentation: repeatedly allocating small blocks "cuts" memory into pieces. Since memory is rarely released in the order it was allocated, frequent small allocations inevitably fragment the address space, producing the phenomenon of "memory is available, yet a large block cannot be allocated".

2) Advantages and disadvantages of common memory pool

To address the problems of using new/delete and malloc/free directly, the design idea of an ordinary memory pool is: reserve a large block of memory in advance, and "carve" a piece out of it whenever the program needs memory. This improves allocation and release efficiency, and allocating in large blocks also reduces memory fragmentation.

Advantages: allocation and release are faster, and memory fragmentation is reduced to a certain extent.

Disadvantages: in multi-threaded concurrent scenarios, allocation and release contend for a lock, which drags efficiency back down.

3) Problems to be solved by high concurrency memory pool

Based on the above, a high-concurrency memory pool needs to solve three problems:

  • Efficiency
  • Memory fragmentation
  • Lock contention when allocating and releasing memory under multi-threaded concurrency

2. Overall design idea

The overall framework of the high-concurrency memory pool consists of the following three parts, and the functions of each part are as follows:

  • Thread cache: each thread has its own thread cache, which removes lock contention between threads in highly concurrent, multi-threaded scenarios. The thread cache serves allocations of up to 64KB, and multiple threads run through it concurrently without taking any lock.
  • Central control cache: as the name implies, this is the central structure of the high-concurrency memory pool, responsible for scheduling memory. It cuts large blocks of memory into pieces for the thread caches, and reclaims excess memory from the thread caches, merging it and returning it to the page cache, so that memory is scheduled on demand and balanced across threads. It is the link that holds the whole project together. (Note: locking is required here. When several threads request or return memory to the central control cache at the same time there is a thread-safety issue, but this happens rarely and does not noticeably hurt performance; on balance the advantages outweigh the cost.)
  • Page cache: manages memory in units of pages and supplies large blocks to the central control cache. When the central control cache has no free memory objects, it obtains a block of whole pages from the page cache on demand. The page cache also reclaims memory from the central control cache and merges it to relieve fragmentation.

3. Application memory flow chart

 

2. Detailed design

1. Detailed analysis of the internal structure of each module

1) thread cache

Logical structure design

The main function of the thread cache is to serve each thread's memory requests of up to 64K. To manage this conveniently, unallocated and returned memory must be kept in some organized form so it can be reused; the usual scheme is a hash table in which memory blocks of each size are mapped to a slot and linked together. The smallest unit of allocation is the byte, and 64K = 1024 * 64 bytes, so byte-by-byte management would need a hash table with 1024 * 64 entries to map every possible size. To shorten the table, requests are aligned up to a small set of alignment numbers, which keeps the waste rate roughly within 12%. The specific structure is as follows:

The specific instructions are as follows:

  • An array provides the hash mapping; each slot stores a freelist that links memory objects of one size for easy management.
  • Each array slot therefore links memory objects of a different size.
  • The first slot's alignment number is 8, the second 16, and so on. An alignment number means that any size between the previous alignment number and this one maps to this slot: a request for 1 to 7 bytes finds the 8-byte objects at index 0, and a request for 9 to 16 bytes finds the 16-byte objects at index 1.
  • From this it follows that with 8-byte alignment at most 7 bytes are wasted (requesting 1 byte returns an 8-byte object). This waste inside an allocated block is internal fragmentation.
  • To keep fragmentation waste below roughly 12%, that is, to tolerate at most about 12% wasted memory, different alignment numbers are used for different size ranges.
  • Sizes 1~128 use 8-byte alignment, 129~1024 use 16-byte alignment, 1025~8*1024 use 128-byte alignment, and 8*1024+1~64*1024 use 1024-byte alignment. The worst-case waste rates are 15/144 ≈ 10.4% (request 129 bytes, receive 144), 127/1152 ≈ 11.0%, and 1023/9216 ≈ 11.1%, all around 12% or less (the smallest range can waste at most 7 bytes per object). The 8-byte-aligned range needs 16 buckets, indexes [0,15]; the 16-byte range needs 56 buckets, [16,71]; the 128-byte range needs 56 buckets, [72,127]; and the 1024-byte range needs 56 buckets, [128,183].
  • The structure of the hash map is as follows:

How to ensure that each thread is unique?
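Each thread gets its own cache through thread-local storage (TLS): a thread_local pointer is created on first use by that thread, so the fast path never locks. A minimal sketch, assuming a project-side ThreadCache class (Allocate and the variable names are illustrative):

//each thread sees its own copy of this pointer
static thread_local ThreadCache* pThreadCache = nullptr;

void* hcAlloc(size_t size)
{
	if (pThreadCache == nullptr)
		pThreadCache = new ThreadCache;	//created lazily on first use by this thread
	return pThreadCache->Allocate(size);	//Allocate() is an assumed project API
}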

How to apply for memory larger than 64k?

When the requested size exceeds 64K, the thread cache forwards the request straight to the page cache. The page cache itself holds at most 128 pages per block, so when a request exceeds 128 pages, the page cache in turn obtains the memory directly from the system.

2) central control cache

The central control cache is the communication bridge between the thread cache and the page cache, linking what comes before with what follows. It hands cut-up small blocks to the thread caches, and it also reclaims excess memory from the thread caches, merges it, and redistributes it to other thread caches, acting as the resource scheduler. Its structure is as follows:

The specific instructions are as follows:

  • The structure of the central control cache is still an array, whose slots hold objects of type span.
  • A span manages one block of memory. It contains a freelist that links the block, cut into small pieces of the slot's size; when a thread cache needs memory, the already-cut pieces are handed to it directly.
  • Initially every slot is empty. When a thread cache requests memory, the spanList slot obtains a large block from the page cache, cuts it up, and hangs it in the list. When that block is used up, a new one is requested, so a slot can hold several linked spans. A span toward the front may still hold memory because pieces handed out earlier have been released back while later spans were being consumed.
  • When all the memory of a span has been returned, the central control cache merges it back together and returns it to the page cache.
  • When the central control cache is empty, it requests memory from the page cache, at least one page at a time and always in whole pages (the page size is our own choice; 4K is used here).

Note that while there may be many thread caches, there is only one central control cache. For multiple thread cache objects to access one central control cache object, the central control cache must be designed as a singleton.

3) page cache

The page cache manages memory in units of pages. A hash maps each page count to the spans holding that many pages, up to 128 pages. The specific structure is as follows:

How the page cache allocates and releases memory:

  • When the central control cache asks the page cache for, say, 8 pages, the page cache first looks at the slot for 8-page spans. If it is empty, it keeps looking at 9, 10, ..., 128, and cuts 8 pages out of the first span it finds.
  • For example, if only the 54-page slot has a span, 8 pages are cut off and returned to the central control cache, and the remaining 54 - 8 = 46 pages are hung at the 46-page slot.
  • When the page cache has no memory at all, it requests a 128-page block from the system and hangs it at slot 128; subsequent requests from the central control cache are cut from that 128-page span.

2. Design details

1) thread cache

Calculating the corresponding _freelists index from the requested memory size

  • Sizes 1~8 all map to index 0, sizes 9~16 all map to index 1, and so on.
  • With 8-byte alignment the index can therefore be written as: ((size + (2^3 - 1)) >> 3) - 1;
  • If the requested size is 129, how is the index computed?
  • The first 128 bytes are 8-byte aligned, so: (((129 - 128) + (2^4 - 1)) >> 4) - 1 + 16
  • In the formula above, 16 accounts for the 16 slots at indexes 0~15 that use 8-byte alignment.

Code:

//compute the index within one alignment group from the size and the alignment shift
static inline size_t _Intex(size_t size, size_t alignmentShift)
{
	//alignmentShift is the bit width of the alignment number: for 8 = 2^3, alignmentShift = 3.
	//This turns the division into a >> shift, which is cheaper.
	return ((size + (1 << alignmentShift) - 1) >> alignmentShift) - 1;
}
//compute the _freelists index for a requested size
static inline size_t Index(size_t size)
{
	assert(size <= THREAD_MAX_SIZE);
 
	//bucket count of each alignment group: 8-, 16-, 128-, 1024-byte alignment
	static const int groupArray[4] = { 16, 56, 56, 56 };
 
	if (size <= 128)
	{
		//8-byte alignment, indexes [0,15]
		return _Intex(size, 3);
	}
	else if (size <= 1024)
	{
		//16-byte alignment, indexes [16,71]
		return _Intex(size - 128, 4) + groupArray[0];
	}
	else if (size <= 8192)
	{
		//128-byte alignment, indexes [72,127]
		return _Intex(size - 1024, 7) + groupArray[0] + groupArray[1];
	}
	else if (size <= 65536)
	{
		//1024-byte alignment, indexes [128,183]
		return _Intex(size - 8192, 10) + groupArray[0] + groupArray[1] + groupArray[2];
	}
 
	assert(false);
	return -1;
}

When the freelist requests memory from the central control cache, the requested size must first be aligned

When the requested size does not fall exactly on an alignment boundary, it must be rounded up: a request for 1 byte is aligned up to 8 bytes. How is the alignment done, and would it be acceptable to skip it?

The freelist index can in fact be computed without aligning first. But on the first allocation, the blocks cut up and hung at that index would then have the raw requested size rather than the aligned size, and later requests mapping to the same bucket would receive wrongly sized blocks, throwing memory management into confusion. The alignment works as follows:

  • The alignment numbers are 8 = 2^3, 16 = 2^4, 128 = 2^7, 1024 = 2^10; each has exactly one 1 bit in binary.
  • Within an alignment range, every size plus (alignment number - 1) is at least the target aligned value and still below the next aligned value.
  • Therefore compute size + alignment number - 1, then clear the low bits: the low 3 bits for 8-byte alignment, the low 4 bits for 16-byte alignment, and so on.
  • For example, with size = 2 and alignment number 8: size + 8 - 1 = 9, which is 1001 in binary; clearing the low three bits gives 1000, which is 8 in decimal, the aligned size.
The code expresses this as follows, where alignment is the alignment number:
(size + alignment - 1) & ~(alignment - 1);
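
As a sketch, the rounding can be wrapped in a small helper (the name _RoundUp is illustrative):

//round size up to a multiple of alignment; alignment must be a power of two
static inline size_t _RoundUp(size_t size, size_t alignment)
{
	return (size + alignment - 1) & ~(alignment - 1);
}
//e.g. _RoundUp(2, 8) == 8 and _RoundUp(129, 16) == 144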

Note: defining these small functions as inline removes the function-call (stack-frame) overhead.

How to "hang" small memory objects in the freelist linked list

The groundwork for this has already been laid. The minimum size of a single object is fixed at 8 bytes: a next pointer takes 4 bytes on a 32-bit system and 8 bytes on a 64-bit system, so an 8-byte minimum guarantees room to store the link pointer on either platform. So how does a small block of memory hold a pointer?

The address of the next block is stored directly in the first 4/8 bytes of the block itself; when the block is fetched, dereferencing it yields that address.

Access: *(void**)(mem)

Every fetch from or return to the freelist is then just a head insert or a head delete.
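
A minimal sketch of these head operations, with illustrative names; NextObj reads or writes the pointer embedded in the first 4/8 bytes of a block:

//access the embedded next pointer stored at the start of a block
static inline void*& NextObj(void* obj)
{
	return *(void**)obj;
}
//head-insert a returned block
static inline void Push(void*& head, void* obj)
{
	NextObj(obj) = head;
	head = obj;
}
//head-delete to hand a block out
static inline void* Pop(void*& head)
{
	void* obj = head;
	head = NextObj(obj);
	return obj;
}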

How much memory should be requested from the central control cache at a time?

The approach here is a "slow start": the first request fetches one object, the second fetches two, and so on, with the batch capped once it reaches a certain size (512). The small initial batch keeps a thread that only needs one object from hoarding extras, while the growing batch later reduces trips to the central control cache and improves efficiency.

If the slow-start batch size exceeds the number of objects currently held by the central control cache, the cache hands over as many as it has: strictly only one object is needed right now, so a short batch is fine. When it has none at all, it turns to the page cache.
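
A sketch of the batch-size upper bound the slow-start counter grows toward, assuming THREAD_MAX_SIZE is the 64K limit; the cap of 512 matches the figure above:

//upper bound on how many objects of memSize to fetch in one batch
static inline size_t NumMoveSize(size_t memSize)
{
	if (memSize == 0)
		return 0;
	size_t num = THREAD_MAX_SIZE / memSize;
	if (num < 2)
		num = 2;	//fetch at least 2, even for large objects
	if (num > 512)
		num = 512;	//cap the batch for tiny objects
	return num;
}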

When does the thread cache return the memory to the central control cache?

When a thread returns memory to the thread cache, the cache checks whether the corresponding _freelists slot now holds too much: once the number of cached objects of that size reaches the batch maximum, a batch is returned to the central control cache (see the sketch below).
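
A sketch of that return check, assuming the freelist tracks its length and its current slow-start bound; FreeList, Size, MaxSize, and ReleaseToCentralCache are illustrative names:

//return path of the thread cache (names illustrative)
void ThreadCache::Deallocate(void* ptr, size_t size)
{
	FreeList& list = _freelists[SizeClass::Index(size)];
	list.Push(ptr);
	//too many cached objects of this size: hand a batch back
	if (list.Size() >= list.MaxSize())
		ReleaseToCentralCache(list, size);
}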

2) Central Control Cache

SpanList structure

The most important role of the SpanList in the central control cache is managing large blocks of memory. It stores objects of the span class and manages them as a linked list. The structure is as follows:

In other words, a SpanList is essentially a linked list of spans. Since returned memory must later be attached back to the right span, and insertion should be cheap, the spanlist is designed as a doubly linked circular list with a sentinel head node.

Span structure

A Span stores the information of a large block of memory and manages it together with the SpanList; its unit of memory is the page (4K). Internally it keeps a freelist of objects of a fixed size. A span also serves as the node of a SpanList, and because spanList is a doubly linked circular list, span carries next and prev pointers.


struct Span
{
    PageID _pageId = 0;   // page number
    size_t _n = 0;        // number of pages
    Span* _next = nullptr;
    Span* _prev = nullptr;
    void* _list = nullptr;  // the big block, cut small and linked up, so returned memory is easy to re-link
    size_t _usecount = 0;   // use count; == 0 means every object has come back
    size_t _objsize = 0;    // size of a single cut-out object
};
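
For reference, a minimal sketch of the SpanList described above; the per-bucket mutex a real version carries is discussed in the locking section later:

//doubly linked circular list of spans with a sentinel head node
class SpanList
{
public:
	SpanList()
	{
		_head = new Span;
		_head->_next = _head;
		_head->_prev = _head;
	}
	//insert newSpan in front of pos
	void Insert(Span* pos, Span* newSpan)
	{
		Span* prev = pos->_prev;
		prev->_next = newSpan;
		newSpan->_prev = prev;
		newSpan->_next = pos;
		pos->_prev = newSpan;
	}
	//unlink pos only; the span itself is recycled elsewhere
	void Erase(Span* pos)
	{
		pos->_prev->_next = pos->_next;
		pos->_next->_prev = pos->_prev;
	}
private:
	Span* _head;
};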

When a spanList has no memory left, it must request memory from the PageCache. How much should it request at a time?

Memory is granted according to the size of the requested object: the smaller a single object, the fewer pages are granted; the larger the object, the more. How is "how much" measured?

It is determined by the upper limit on how many objects the thread cache may fetch from the central control cache in one batch: that limit times the object size gives the number of bytes to request, and shifting right by 12 bits (one 4K page) turns it into a page count.

//compute how many pages to request
static inline size_t NumMovePage(size_t memSize)
{
	//the most objects the thread cache may fetch at once; grant that many
	size_t num = NumMoveSize(memSize);
	//at this point nPage holds the total size in bytes
	size_t nPage = num*memSize;
	//shifting nPage right by PAGE_SHIFT divides by 2^PAGE_SHIFT, giving the page count
	nPage >>= PAGE_SHIFT;
 
	//grant at least one page (memory is requested in whole pages)
	if (nPage == 0)
		nPage = 1;
 
	return nPage;
}

What happens if cutting a block in the central control cache produces a fragment (a tail too small for one object)?

If this happens, the trailing fragment can only be discarded. In this program, however, it does not arise, because memory is requested at least one page at a time and 4096 is divisible by the object sizes this design hands out.

When does the central control cache return the memory to the page cache?

The thread cache returns its excess memory to the owning span in the central control cache's spanlist. Each span keeps a usecount recording how many of its objects have been handed out; when usecount falls to 0, every object has come back, so the span is returned to the page cache and merged into a larger span.
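
A sketch of that return path; the names ReleaseOneObject, ReleaseSpanToPageCache, and _spanLists are illustrative:

//called as objects come back; releases the span once every object has returned
void CentralControlCache::ReleaseOneObject(Span* span, size_t index)
{
	if (--span->_usecount == 0)
	{
		_spanLists[index].Erase(span);	//unlink from the bucket
		span->_list = nullptr;	//the cut-up freelist is no longer meaningful
		PageCache::GetInstance()->ReleaseSpanToPageCache(span);	//merge in the page cache
	}
}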

3) Page Cache

When a few pages are cut off a large span, how does the remaining memory get hung at the correct slot?

A span in the Page cache is never cut into objects; it remains whole pages, so the span's _list is unused here. Addresses are computed from page numbers: given the first page's number and the span's page count, the page numbers of both the cut-off pages and the remaining pages can be computed, and the remainder can be re-mapped under its new page number.

When cutting from a large span, do you use a head cut or a tail cut?
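
A head cut fits the descriptions above: take the first n pages out, and the remainder's page id simply moves forward by n. A minimal sketch, with illustrative names (_idSpanMap maps page ids to spans):

//cut n pages off the head of a larger span
Span* Split(Span* big, size_t n)
{
	Span* use = new Span;
	use->_pageId = big->_pageId;	//the head pages go out
	use->_n = n;
 
	big->_pageId += n;	//the remainder now starts n pages later
	big->_n -= n;
	//re-hang 'big' at the slot for big->_n pages and refresh its entries in _idSpanMap
	return use;
}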

How to calculate address by page number in Span?

Each page has a fixed size, and when memory is requested from the system, the system returns the starting address of a contiguous block. Dividing that starting address by the page size therefore yields the page number, and because the block is contiguous, the page numbers of its pages are consecutive.
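
With 4K pages (PAGE_SHIFT = 12) the division is a shift; a small sketch of the two conversions:

typedef size_t PageID;
static const size_t PAGE_SHIFT = 12;	//4K pages
 
//page number of the page containing ptr
static inline PageID PageIdOf(void* ptr)
{
	return (PageID)ptr >> PAGE_SHIFT;
}
//starting address of page id
static inline void* AddrOf(PageID id)
{
	return (void*)(id << PAGE_SHIFT);
}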

Page Cache requests memory from the system

As mentioned earlier, the Page Cache requests 128 pages of memory from the system at a time. One point needs explaining: since this project is itself an allocator, it should not rely on malloc or on the STL's data structures and library functions for these blocks, so the system call VirtualAlloc is used to request memory directly. VirtualAlloc is explained in detail below:

VirtualAlloc is a Windows API function that reserves or commits a region of pages in the virtual address space of the calling process; put simply, it requests memory.

The function declaration is as follows:

LPVOID VirtualAlloc(
	LPVOID lpAddress,        // address of the region to allocate
	SIZE_T dwSize,           // size of the allocation, in bytes
	DWORD  flAllocationType, // type of allocation
	DWORD  flProtect         // initial protection of the memory
);

Parameter Description:

  • LPVOID lpAddress: the desired starting address of the region. When committing a previously reserved block, lpAddress identifies that block. If NULL, the system chooses the location, rounded up to a 64-KB boundary.
  • SIZE_T dwSize: the size of the region to allocate or reserve, in bytes rather than pages; the system rounds the allocation up to the next page boundary.
  • DWORD flAllocationType: the allocation type; flags such as MEM_COMMIT, MEM_RESERVE, and MEM_TOP_DOWN can be specified and combined.
  • DWORD flProtect: the access protection of the allocated region.
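
A hedged usage example: reserve and commit 128 pages (4K each) of read-write memory in one call, as the page cache does when it is empty; the wrapper name is illustrative and error handling is minimal:

#include <windows.h>
#include <new>	//std::bad_alloc
 
//grab a 128-page block from the system
inline void* SystemAllocPages()
{
	void* mem = VirtualAlloc(nullptr, 128 * 4 * 1024,
		MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE);
	if (mem == nullptr)
		throw std::bad_alloc();
	return mem;
}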

Note: the PageCache keeps a map from pageId to Span. When memory is released, the pageId is computed from the address of the block being freed, the map yields the owning Span, and the Span records the single-object size, which determines whether the memory goes back to the central control cache or to the page cache.

How is memory released by the central control cache merged into larger blocks?

Using the page numbers in the span, find the spans covering the previous page and the following page and check whether each is free (none of its memory is out on loan). Free neighbours are merged, and the merged span is re-registered in the map.
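
A sketch of merging toward lower addresses, assuming an _idSpanMap from page id to Span* and some way (IsFree here) to tell whether a neighbour is free; names and details are illustrative only:

//try to absorb the released span into the span just before it
void PageCache::TryMergeBackward(Span* span)
{
	auto it = _idSpanMap.find(span->_pageId - 1);	//last page of the previous span
	if (it != _idSpanMap.end() && IsFree(it->second))
	{
		Span* prev = it->second;
		prev->_n += span->_n;	//absorb the released pages into the neighbour
		//re-register every page now covered by prev in _idSpanMap, then recycle span
	}
	//merging forward, with the span starting at span->_pageId + span->_n, is symmetric
}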

Note: PageCache and CentralControlCache are both set up as singletons, because all threads share one page cache and one central control cache for memory management.

A brief introduction to the singleton mode

  • Singleton mode, as the name implies, can only create one instance.
  • There are two implementations: Lazy implementation and Hungry implementation
  • Approach: Define the constructor and copy constructor as private and cannot be generated by default to prevent objects from being constructed outside the class; define a member of its own type, construct an object in the class, and provide an interface for external calls.
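
A minimal eager ("hungry man") singleton sketch matching the approach above; the body of the real PageCache class is omitted:

//eager singleton: the single instance is constructed before main() runs
class PageCache
{
public:
	static PageCache* GetInstance()
	{
		return &_inst;
	}
private:
	PageCache() = default;
	PageCache(const PageCache&) = delete;
	PageCache& operator=(const PageCache&) = delete;
 
	static PageCache _inst;
};
PageCache PageCache::_inst;	//constructed before main(), so creation needs no locking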

4) Locking problem

  • In both the central control cache and the page cache, multiple threads access the same critical resources, so locks are required.
  • In the central control cache, threads only contend when they touch memory objects of the same size, so one big lock is unnecessary; a per-slot "bucket lock" improves the program's running efficiency (under a coarse lock, waiting threads may be suspended). That is, a freelist slot's lock is taken only when that slot's memory is being changed.
  • In the page cache, requests and merges must be locked as a whole.
  • Plain mutexes are used throughout.

Note: a map is used for the mapping. The pagecache is locked, so concurrent writes to the map do not conflict, but a lookup interface is also exposed externally; one thread may therefore write the map while another reads it, which is a thread-safety problem. Locking every lookup would hurt performance, because this interface is called very frequently. tcmalloc stores the pageId-to-span mapping in a radix tree instead, which improves efficiency.

3. Test

1. Unit testing

//assumed headers: the project's public interface (name illustrative) plus the standard ones used below
#include "ConcurrentAlloc.h"	//hypothetical header exposing hcAlloc/hcFree/SizeClass
#include <thread>
#include <iostream>
using std::cout;
using std::endl;
 
void func1()
{
	for (size_t i = 0; i < 10; ++i)
	{
		hcAlloc(17);
	}
}
 
void func2()
{
	for (size_t i = 0; i < 20; ++i)
	{
		hcAlloc(5);
	}
}
 
//multi-threaded test
void TestThreads()
{
	std::thread t1(func1);
	std::thread t2(func2);
 
 
	t1.join();
	t2.join();
}
 
//index computation
void TestSizeClass()
{
	cout << SizeClass::Index(1035) << endl;
	cout << SizeClass::Index(1025) << endl;
	cout << SizeClass::Index(1024) << endl;
}
 
//allocation and free
void TestConcurrentAlloc()
{
	void* ptr0 = hcAlloc(5);
	void* ptr1 = hcAlloc(8);
	void* ptr2 = hcAlloc(8);
	void* ptr3 = hcAlloc(8);
 
	hcFree(ptr1);
	hcFree(ptr2);
	hcFree(ptr3);
}
 
//large-block allocation
void TestBigMemory()
{
	void* ptr1 = hcAlloc(65 * 1024);
	hcFree(ptr1);
 
	void* ptr2 = hcAlloc(129 * 4 * 1024);
	hcFree(ptr2);
}
 
//int main()
//{
//	//TestBigMemory();
//
//	//TestObjectPool();
//	//TestThreads();
//	//TestSizeClass();
//	//TestConcurrentAlloc();
//
//	return 0;
//}

2. Performance test

#include <vector>
#include <thread>
#include <atomic>
#include <cstdio>
#include <ctime>
 
void BenchmarkMalloc(size_t ntimes, size_t nworks, size_t rounds)
{
	//create nworks threads
	std::vector<std::thread> vthread(nworks);
	//summed from several threads at once, so the counters must be atomic
	std::atomic<size_t> malloc_costtime(0);
	std::atomic<size_t> free_costtime(0);
 
	//launch the worker threads
	for (size_t k = 0; k < nworks; ++k)
	{
		//capture k by value so each thread keeps its own copy
		vthread[k] = std::thread([&, k]() {
			std::vector<void*> v;
			v.reserve(ntimes);
 
			//run `rounds` rounds
			for (size_t j = 0; j < rounds; ++j)
			{
				size_t begin1 = clock();
				//each round performs ntimes allocations
				for (size_t i = 0; i < ntimes; i++)
				{
					v.push_back(malloc(16));
				}
				size_t end1 = clock();
 
				size_t begin2 = clock();
				for (size_t i = 0; i < ntimes; i++)
				{
					free(v[i]);
				}
				size_t end2 = clock();
				v.clear();
 
				malloc_costtime += end1 - begin1;
				free_costtime += end2 - begin2;
			}
		});
	}
 
	for (auto& t : vthread)
	{
		t.join();
	}
 
	printf("%u个线程并发执行%u轮次,每轮次malloc %u次: 花费:%u ms\n",
		nworks, rounds, ntimes, malloc_costtime);
 
	printf("%u个线程并发执行%u轮次,每轮次free %u次: 花费:%u ms\n",
		nworks, rounds, ntimes, free_costtime);
 
	printf("%u个线程并发malloc&free %u次,总计花费:%u ms\n",
		nworks, nworks*rounds*ntimes, malloc_costtime + free_costtime);
}
 
 
// allocations/frees per round, thread count, rounds
void BenchmarkConcurrentMalloc(size_t ntimes, size_t nworks, size_t rounds)
{
	std::vector<std::thread> vthread(nworks);
	std::atomic<size_t> malloc_costtime(0);
	std::atomic<size_t> free_costtime(0);
 
	for (size_t k = 0; k < nworks; ++k)
	{
		vthread[k] = std::thread([&]() {
			std::vector<void*> v;
			v.reserve(ntimes);
 
			for (size_t j = 0; j < rounds; ++j)
			{
				size_t begin1 = clock();
				for (size_t i = 0; i < ntimes; i++)
				{
					v.push_back(hcAlloc(16));
				}
				size_t end1 = clock();
 
				size_t begin2 = clock();
				for (size_t i = 0; i < ntimes; i++)
				{
					hcFree(v[i]);
				}
				size_t end2 = clock();
				v.clear();
 
				malloc_costtime += end1 - begin1;
				free_costtime += end2 - begin2;
			}
		});
	}
 
	for (auto& t : vthread)
	{
		t.join();
	}
 
	printf("%u个线程并发执行%u轮次,每轮次concurrent alloc %u次: 花费:%u ms\n",
		nworks, rounds, ntimes, malloc_costtime);
 
	printf("%u个线程并发执行%u轮次,每轮次concurrent dealloc %u次: 花费:%u ms\n",
		nworks, rounds, ntimes, free_costtime);
 
	printf("%u个线程并发concurrent alloc&dealloc %u次,总计花费:%u ms\n",
		nworks, nworks*rounds*ntimes, malloc_costtime + free_costtime);
}
 
int main()
{
	cout << "==========================================================" << endl;
	BenchmarkMalloc(100000, 4, 10);
	cout << endl << endl;
 
	BenchmarkConcurrentMalloc(100000, 4, 10);
	cout << "==========================================================" << endl;
 
	return 0;
}

Result comparison

