Why did newer Linux kernels switch process pid management from a bitmap to a radix tree?

When I first wrote about processes, the kernel version I was using was still 3.10. In that version, the allocated process pid numbers are stored in a bitmap. In versions 5.4 and 6.1, however, I found that pid management had moved from the bitmap to a radix tree. I then checked the version history: the kernel has not used the bitmap since Linux 4.15.

So today I will talk about why the Linux kernel replaced the bitmap with a radix tree, and then look at the performance effect of the change.

1. The old bitmap way to manage pid

The kernel needs to assign a process number to each process/thread.

If every in-use process number were stored in an ordinary int variable, it would consume a lot of memory. If the kernel supports a maximum of 65535 processes, storing these process IDs requires 65535 * 4 bytes = 262,140 bytes ≈ 256 KB.

A bitmap compresses this storage considerably. With a bitmap, one bit indicates whether the corresponding pid is in use. For a maximum of 65535 processes, 65535 bits / 8 ≈ 8,192 bytes = 8 KB of memory is enough. Compared with the 256 KB above, that is a 32-fold saving.

A small memory footprint has another big advantage: locality during traversal is very good, so the CPU cache hit rate is high and traversal is fast. For this reason, the kernel managed all process pids with a bitmap for a long time.
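To make the one-bit-per-pid idea concrete, here is a minimal userspace sketch. It is not kernel code: the names SKETCH_MAX_PIDS, pid_bits, mark_used and is_used are invented for illustration, and the limit is rounded up to 65536 so that it divides evenly into machine words.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_MAX_PIDS 65536u   /* assumed limit, rounded up from 65535 */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per possible pid: 65536 bits = 8192 bytes in total. */
static unsigned long pid_bits[SKETCH_MAX_PIDS / WORD_BITS];

/* Record that pid is in use. */
static void mark_used(unsigned pid)
{
    assert(pid < SKETCH_MAX_PIDS);
    pid_bits[pid / WORD_BITS] |= 1UL << (pid % WORD_BITS);
}

/* Return 1 if pid is recorded as in use, 0 otherwise. */
static int is_used(unsigned pid)
{
    assert(pid < SKETCH_MAX_PIDS);
    return (int)((pid_bits[pid / WORD_BITS] >> (pid % WORD_BITS)) & 1UL);
}
```

Calling mark_used(100) sets a single bit, and sizeof(pid_bits) comes out at 8192 bytes regardless of whether unsigned long is 32 or 64 bits wide.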

The core function that obtains a pid when the kernel creates a process is alloc_pid. In version 3.10, it allocates the pid by calling alloc_pidmap, the interface function of pidmap. The source code looks like this:

//file:kernel/pid.c
struct pid *alloc_pid(struct pid_namespace *ns)
{
 ...

 // A process may belong to several namespaces and needs a pid in each of them
 // alloc_pidmap is what actually allocates the integer pid
 tmp = ns;
 pid->level = ns->level;
 for (i = ns->level; i >= 0; i--) {
  nr = alloc_pidmap(tmp);
  pid->numbers[i].nr = nr;
  ...
 }
 ...
 return pid;
}

As mentioned earlier, the biggest benefit of the bitmap is that it saves memory. But it also has a significant drawback: allocating a new pid is computationally expensive. When there are a large number of processes, the search for a free bit may have to walk almost every bit in the bitmap.

// file:kernel/pid.c
static int alloc_pidmap(struct pid_namespace *pid_ns)
{
 ...
 map = &pid_ns->pidmap[pid/BITS_PER_PAGE];
 for (i = 0; i <= max_scan; ++i) {
  for ( ; ; ) {
   if (!test_and_set_bit(offset, map->page)) {
    atomic_dec(&map->nr_free);
    set_last_pid(pid_ns, last, pid);
    return pid;
   }
   offset = find_next_offset(map, offset);
   pid = mk_pid(pid_ns, map, offset);
   ...
  }
  ...
 }
}
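The cost of that search is easier to see in a stripped-down userspace sketch. This is not the kernel's alloc_pidmap: struct pidmap_sketch, alloc_id_linear, free_id and the 1024-ID limit are all assumptions made for illustration. It only keeps the essential behaviour, which is a bit-by-bit scan that starts after the last allocated ID and wraps around once.

```c
#include <assert.h>
#include <limits.h>

#define SKETCH_MAX_IDS 1024u     /* assumed small limit for the sketch */
#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

struct pidmap_sketch {
    unsigned long bits[SKETCH_MAX_IDS / WORD_BITS]; /* 1 = in use */
    unsigned last;                                  /* last ID handed out */
};

/*
 * Scan one bit at a time, starting after m->last and wrapping around.
 * Returns the new ID, or -1 if every ID is taken. In the worst case
 * the loop tests all SKETCH_MAX_IDS bits.
 */
static int alloc_id_linear(struct pidmap_sketch *m)
{
    for (unsigned step = 1; step <= SKETCH_MAX_IDS; step++) {
        unsigned id = (m->last + step) % SKETCH_MAX_IDS;
        unsigned long *word = &m->bits[id / WORD_BITS];
        unsigned long mask = 1UL << (id % WORD_BITS);

        if (!(*word & mask)) {
            *word |= mask;
            m->last = id;
            return (int)id;
        }
    }
    return -1;
}

/* Release an ID so that a later scan can find it again. */
static void free_id(struct pidmap_sketch *m, unsigned id)
{
    assert(id < SKETCH_MAX_IDS);
    m->bits[id / WORD_BITS] &= ~(1UL << (id % WORD_BITS));
}
```

When the map is nearly full, each allocation steps over many occupied bits before it finds a free one, and the loop runs to completion before it can report that nothing is free.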

In recent years server memory has kept growing, and machines with hundreds of GB are now common. At the same time, with the growth of container clouds, the number of processes running on a single server keeps increasing. The memory savings of bitmap-based pid management matter less and less, while the CPU cost of allocating a new pid becomes more and more noticeable.

2. Use radix tree to manage pid

In 2017, Gargi Sharma submitted a patch called "Replace PID bitmap allocation with IDR API", which replaced the bitmap with a radix tree. It was eventually merged in Linux 4.15. For details on this change, see https://lwn.net/Articles/735675/ or https://lore.kernel.org/lkml/f5104f457ed581e0ac032a68af03c5ba5cb94755.1506342921.git.gs051095@gmail.com/

Trees come in many variants, and the radix tree is one of them. Its most distinctive feature is that each level handles only one 6-bit segment of the key. So its fan-out is essentially fixed at 64 (2^6 = 64), except at the root node, and for a given key width the number of levels is fixed as well.

In the data structure definition of the radix tree node, there are several very important fields, namely shift, slots and tags.

//file:include/linux/xarray.h
struct xa_node {
 ...
 unsigned char shift;
 void __rcu *slots[XA_CHUNK_SIZE];
 union {
  unsigned long tags[XA_MAX_MARKS][XA_MARK_LONGS];
  ...
 };
};

shift indicates which 6-bit segment of the number this node covers. The default radix size in Linux is 6 bits. The lowest-level nodes therefore have a shift of 0, the level above them has a shift of 6, and the level above that has a shift of 12. In general, shift grows by 6 with each level from bottom to top.

slots is an array of pointers to the node's children. By default XA_CHUNK_SIZE is 64 in the kernel, so this is a *slots[64]. Each element points to a node on the next level down, or is NULL if there is no such child.

tags records the state of each index in the slots array, for example whether a slot has been allocated. Each tag is an array of type long. On a 64-bit system a long is 8 bytes, exactly 64 bits, which gives one bit per slot.

The description above is a bit abstract. For a better understanding, let's use a simple example to see what a radix tree looks like in memory.

The radix tree in the kernel is used to manage 32-bit integer IDs, but for simplicity and clarity, we use a radix tree composed of 16-bit integers as an example.

A 16-bit unsigned integer ranges from 0 to 65535. Suppose the integer IDs 100, 1000, 10000, 50000 and 60000 have been allocated. We will build a radix tree from these integers.

First, write each of these integers in binary:

100: 0000,000001,100100
1000: 0000,001111,101000
10000: 0010,011100,010000
50000: 1100,001101,010000
60000: 1110,101001,100000

Starting from the low-order end, the bits are grouped 6 at a time. Each 16-bit number is thus split into three segments, with only 4 bits left for the top segment.

In the radix tree, the root node stores the first segment of each number. If a value of that segment is in use, the slot at that index points to a child node; otherwise the slot is empty. In code, the segment is obtained by shifting the value right by the node's shift. The root node's shift is 12, so the value is shifted right by 12 bits.

For the integers 100, 1000, 10000, 50000 and 60000, their first segments are respectively 0000, 0000, 0010, 1100 and 1110 in binary, which are 0, 0, 2, 12 and 14 after conversion into decimal.

The slot index at the next level is the value of the middle 6 bits of each number, and that level's shift is 6. The slot index at the lowest level is the value of the last 6 bits, and its shift is 0. Splitting each of the integers above into these segments and writing them in decimal gives:

100: 0,1,36
1000: 0,15,40
10000: 2,28,16
50000: 12,13,16
60000: 14,41,32
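The segment values in this table can be reproduced with a shift and a mask. The helper below is a minimal sketch: slot_index and the SKETCH_ macros are hypothetical names, not kernel identifiers.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_MAP_SHIFT 6u                              /* bits per level */
#define SKETCH_MAP_MASK  ((1u << SKETCH_MAP_SHIFT) - 1)  /* 0x3f */

/* Slot index that id selects in a node whose shift is `shift`. */
static unsigned slot_index(uint32_t id, unsigned shift)
{
    assert(shift < 32);
    return (id >> shift) & SKETCH_MAP_MASK;
}
```

For example, slot_index(100, 12), slot_index(100, 6) and slot_index(100, 0) give 0, 1 and 36, matching the first row.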

Then the structure of the radix tree composed of 100, 1000, 10000, 50000, and 60000 is shown in the figure below.

Take the integer 100 as an example. Split into 6-bit segments, it becomes 0, 1, 36. Its first segment is 0, so a child node pointer is stored at index 0 of the root node's slots. Its second segment is 1, so a child node pointer is stored at index 1 of the second-level node's slots. The final value 100 is stored at index 36 of the third-level node's slots.

The radix tree is thus established.

To check whether an integer exists in this tree, or to allocate a new unused integer ID from it, you only need to visit one node on each of the three levels and check the tag bits there to see whether the corresponding slot is already occupied. Unlike the bitmap, there is no need to scan the entire bit array, so the computational cost drops considerably.
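The following userspace sketch shows how a per-node "free" mask lets allocation and lookup touch only one node per level. It is a simplified illustration, not the kernel's IDR or XArray code: struct node, new_node, alloc_in and is_allocated are invented names, the tree has a fixed three levels (shifts 12, 6 and 0), it always hands out the lowest free ID, and it never frees nodes.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define FANOUT 64u

struct node {
    uint64_t free;               /* bit i set: slot i still has a free ID */
    struct node *slots[FANOUT];  /* children; unused in leaf nodes */
};

static struct node *new_node(void)
{
    struct node *n = calloc(1, sizeof(*n));
    if (n)
        n->free = UINT64_MAX;    /* every slot starts out free */
    return n;
}

/* Index of the lowest set bit; mask must be nonzero. */
static unsigned lowest_set(uint64_t mask)
{
    unsigned i = 0;
    while (!(mask & 1)) {
        mask >>= 1;
        i++;
    }
    return i;
}

/* Allocate the lowest free ID below n. Returns -1 if full or out of memory. */
static long alloc_in(struct node *n, unsigned shift)
{
    if (n->free == 0)
        return -1;
    unsigned slot = lowest_set(n->free);
    if (shift == 0) {                        /* leaf: claim the slot */
        n->free &= ~(UINT64_C(1) << slot);
        return (long)slot;
    }
    if (!n->slots[slot] && !(n->slots[slot] = new_node()))
        return -1;
    long low = alloc_in(n->slots[slot], shift - 6);
    if (low < 0)
        return -1;
    if (n->slots[slot]->free == 0)           /* subtree is now full */
        n->free &= ~(UINT64_C(1) << slot);
    return ((long)slot << shift) | low;
}

/* Return 1 if id has been allocated, visiting one node per level. */
static int is_allocated(const struct node *root, uint32_t id)
{
    assert(id < (1u << 18));
    const struct node *n = root;
    for (unsigned shift = 12; shift > 0; shift -= 6) {
        n = n->slots[(id >> shift) & 63];
        if (!n)
            return 0;                        /* whole subtree unused */
    }
    return !((n->free >> (id & 63)) & 1);
}
```

Starting from a root created with new_node(), repeated calls to alloc_in(root, 12) return 0, 1, 2 and so on. Once a leaf fills up, its bit is cleared in the parent, so later allocations skip that subtree without looking inside it.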

The difference in the kernel is that its radix tree stores 32-bit integers, so covering the full range takes up to 6 levels of nodes (32 / 6, rounded up).
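The level count is simply the key width divided by the bits handled per level, rounded up. A small sketch of that arithmetic, where levels_needed is a hypothetical helper rather than a kernel function:

```c
#include <assert.h>

/* Number of tree levels needed to cover keys of `bits` bits. */
static unsigned levels_needed(unsigned bits, unsigned bits_per_level)
{
    assert(bits_per_level > 0);
    return (bits + bits_per_level - 1) / bits_per_level;
}
```

With 6 bits per level, levels_needed(16, 6) is 3, as in the example tree, and levels_needed(32, 6) is 6.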

With the radix tree in place, the kernel source changed as well. In the relatively recent 6.1 kernel, alloc_pid looks like the following and calls idr_alloc to obtain an unused process ID.

//file:kernel/pid.c
struct pid *alloc_pid(struct pid_namespace *ns, ...)
{
 ...

 // A process may belong to several namespaces and needs a pid in each of them
 // idr_alloc is what actually allocates the integer pid
 tmp = ns;
 pid->level = ns->level;
 for (i = ns->level; i >= 0; i--) {
  nr = idr_alloc(&tmp->idr, NULL, tid,
        tid + 1, GFP_ATOMIC);
  ...
  pid->numbers[i].nr = nr;
  pid->numbers[i].ns = tmp;
  tmp = tmp->parent;
 }
 ...
}

The core of the allocation is idr_get_free, which walks down the nodes of the radix tree and uses each node's tags, slots and other fields to find an unoccupied integer ID.

//file:lib/radix-tree.c
void __rcu **idr_get_free(struct radix_tree_root *root, ...)
{
 ...
 shift = radix_tree_load_root(root, &child, &maxindex);
 while (shift) {
  shift -= RADIX_TREE_MAP_SHIFT; // RADIX_TREE_MAP_SHIFT is 6
  ...

  // Scan the tag bitmap for the next available index
  offset = radix_tree_find_next_bit(node, IDR_FREE,
       offset + 1);
  start = next_index(start, node, offset);
 }
 ...
}

3. The performance effect of radix tree

With the principle covered, let's look at performance after replacing the bitmap with the radix tree. Here we quote the measurements provided by Gargi Sharma, the author of the patch, from https://lwn.net/Articles/735675/ .

Gargi Sharma measured the run time of ps and pstree with 10,000 processes.

ps:
        With IDR API    With bitmap
real    0m1.479s        0m2.319s
user    0m0.070s        0m0.060s
sys     0m0.289s        0m0.516s

pstree:
        With IDR API    With bitmap
real    0m1.024s        0m1.794s
user    0m0.348s        0m0.612s
sys     0m0.184s        0m0.264s

After the switch to the radix tree, both commands ran noticeably faster: real time dropped from 2.319s to 1.479s for ps (about 36% less) and from 1.794s to 1.024s for pstree (about 43% less).

Author of the original text: Developing Internal Skills