Analysis of Python's garbage collection mechanism

1. Garbage collection

Reference counting is the primary mechanism; mark-and-sweep and generational collection are auxiliary.

1.1 refchain, the housekeeper

In the CPython C source code there is a circular doubly linked list called refchain. Every object created in a Python program is added to this list, so refchain keeps track of all objects. Note that CPython maintains this list only in builds compiled with Py_TRACE_REFS (a debugging option); the reference counter described next exists in every build.

1.2 Reference counter

  • Every object in refchain has an ob_refcnt field that stores its reference counter, that is, the number of references currently pointing to the object.
  • When a value is referenced more than once, the data is not duplicated in memory; the reference counter is incremented by 1 instead. When a reference is removed, the counter is decremented by 1. Once the counter reaches 0, the object is removed from the refchain list and its memory is released (special cases such as caching are ignored for now).
age = 18
number = age  # reference counter of object 18: +1
del age          # reference counter of object 18: -1
def run(arg):
    print(arg)
run(number)   # +1 when the call starts, -1 once the function has returned
num_list = [11,22,number] # reference counter of object 18: +1
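The counter changes described above can be observed from Python with sys.getrefcount. This is a minimal sketch, assuming CPython. Payload is a hypothetical class introduced only for this illustration; it is used instead of 18 because small integers are cached by the interpreter, so their counts are not informative. Only differences between readings are compared, since the call itself temporarily adds a reference.

```python
import sys

class Payload:
    """Hypothetical placeholder class; any ordinary object would do."""

obj = Payload()
base = sys.getrefcount(obj)        # baseline (includes the call's own temporary reference)

alias = obj                        # a second name: counter + 1
after_alias = sys.getrefcount(obj)

container = [obj]                  # stored in a list: counter + 1
after_container = sys.getrefcount(obj)

del alias                          # name removed: counter - 1
container.clear()                  # list no longer holds it: counter - 1
after_cleanup = sys.getrefcount(obj)

print(after_alias - base, after_container - base, after_cleanup - base)
```

The absolute numbers vary between interpreter versions, which is why the sketch prints offsets from the baseline rather than raw counts.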

1.3 Mark-and-sweep & generational collection

Garbage collection based on reference counting is simple and convenient, but it cannot handle circular references, so some data is never collected. For example:

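v1 = [11,22,33]        # a list object is created in refchain; v1 refers to it, so its reference counter is 1.
v2 = [44,55,66]        # another list object is created in refchain; v2 refers to it, so its reference counter is 1.
v1.append(v2)        # append v2 to v1: the counter of the [44,55,66] object goes up by 1, to 2.
v2.append(v1)        # append v1 to v2: the counter of the [11,22,33] object goes up by 1, to 2.
del v1    # reference counter - 1
del v2    # reference counter - 1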
  • After the del operations in the code above, no variable refers to either list object, yet because the two lists reference each other their reference counters are not 0. They can never be used again, but they are never destroyed either. If a project contains a lot of code like this, memory consumption keeps growing until memory is exhausted and the program crashes.
  • To solve the circular reference problem, mark-and-sweep is introduced specifically for objects that may take part in circular references. These are the types that can contain other objects: lists, tuples, dictionaries, sets, custom classes, and so on.

Mark-and-sweep: a separate linked list holds the objects that may contain circular references (lists, tuples, dictionaries, sets, instances of custom classes, and so on). The objects in this list are checked for circular references, and where one is found, the reference counter of each party is decremented by 1.
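The effect of the cycle detector can be checked from Python with the gc and weakref modules. This is a minimal sketch, assuming CPython. Node is a hypothetical class used only for this illustration (plain lists cannot be weakly referenced), and the automatic collector is switched off for the duration so that only the explicit gc.collect() call reclaims the cycle.

```python
import gc
import weakref

class Node:
    """Hypothetical container type used only for this illustration."""
    def __init__(self):
        self.other = None

gc.disable()                  # keep the automatic collector out of the way
a, b = Node(), Node()
a.other, b.other = b, a       # a and b now reference each other
probe = weakref.ref(a)        # a weak reference does not keep `a` alive

del a, b                      # no names left, but each counter is still 1
alive_before = probe() is not None

found = gc.collect()          # force a full cyclic collection
alive_after = probe() is not None
gc.enable()

print(alive_before, alive_after, found > 0)   # True False True
```

Reference counting alone leaves the pair alive after del; the explicit collection is what frees it.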

Generational collection: an optimization of the linked list used by mark-and-sweep. The objects that may contain circular references are split across three linked lists, called generations 0, 1 and 2. Each generation holds objects and has a threshold. When the threshold is reached, every object in the corresponding linked list is scanned once: references that are part of a cycle are each decremented by 1, and objects whose reference counter drops to 0 are destroyed.

// C source for the generations
#define NUM_GENERATIONS 3
struct gc_generation generations[NUM_GENERATIONS] = {
    /* PyGC_Head,                                    threshold,    count */
    {{(uintptr_t)_GEN_HEAD(0), (uintptr_t)_GEN_HEAD(0)},   700,        0}, // generation 0
    {{(uintptr_t)_GEN_HEAD(1), (uintptr_t)_GEN_HEAD(1)},   10,         0}, // generation 1
    {{(uintptr_t)_GEN_HEAD(2), (uintptr_t)_GEN_HEAD(2)},   10,         0}, // generation 2
};

Note: threshold and count mean something different in generation 0 than in generations 1 and 2.

  • Generation 0: count is the number of objects in the generation 0 linked list, and threshold is the limit on that number. When it is exceeded, a generation 0 scan is performed.
  • Generation 1: count is the number of times generation 0 has been scanned, and threshold is the limit on that number. When it is exceeded, a generation 1 scan is performed.
  • Generation 2: count is the number of times generation 1 has been scanned, and threshold is the limit on that number. When it is exceeded, a generation 2 scan is performed.
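The thresholds and counters can be read, and the thresholds changed, through the standard gc module. A minimal sketch: the values 1000, 15, 15 are arbitrary and chosen only for illustration, and the values printed depend on the interpreter version and on what the program has allocated so far, so no output is shown.

```python
import gc

thresholds = gc.get_threshold()   # one threshold per generation
counts = gc.get_count()           # current counters for generations 0, 1, 2
print("thresholds:", thresholds)
print("counts:", counts)

# Thresholds can be changed at runtime; restore the original values afterwards.
gc.set_threshold(1000, 15, 15)
changed = gc.get_threshold()
gc.set_threshold(*thresholds)
restored = gc.get_threshold()
print("changed:", changed, "restored:", restored)
```

Raising the generation 0 threshold makes collections less frequent at the cost of letting more garbage accumulate between scans.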

1.4 Scenario simulation

This section walks through memory management and garbage collection step by step, at the level of the underlying C implementation.

Step 1: When the object age=19 is created, it is added to the refchain list.

Step 2: When the object num_list = [11,22] is created, the list object is added to both refchain and generation 0.

Step 3: When a newly created object pushes the number of objects in the generation 0 linked list above the threshold of 700, the objects in the linked list must be scanned and checked.

When generation 0 exceeds its threshold, the interpreter does not scan generation 0 right away; it first checks whether generations 2 and 1 have also exceeded their thresholds.

  • If neither generation 2 nor generation 1 has reached its threshold, only generation 0 is scanned and the count of generation 1 is incremented by 1.
  • If generation 2 has reached its threshold, the linked lists of generations 2, 1 and 0 are spliced together for a full scan, and the counts of generations 2, 1 and 0 are reset to 0.
  • If generation 1 has reached its threshold, the linked lists of generations 1 and 0 are spliced together and scanned, and the counts of generations 1 and 0 are reset to 0.

The main purpose of scanning the spliced linked list is to break circular references and destroy garbage. The detailed process is:

  • Scan the linked list and copy each object's reference counter into gc_refs, so that the original reference counter is left untouched.
  • Scan every object in the linked list again and check for circular references; where one exists, decrement the gc_refs of each object involved by 1.
  • Scan the linked list once more and move the objects whose gc_refs is 0 to the unreachable linked list; objects whose gc_refs is not 0 are promoted directly to the next generation's linked list.
  • Handle destructors and weak references for the objects in the unreachable linked list. Objects that cannot be destroyed are promoted to the next generation's linked list, and those that can be destroyed stay in the unreachable list. A destructor here means a __del__ method defined on the object, which must run before the object is destroyed.
  • Finally, every object in unreachable is destroyed and removed from the refchain list (ignoring the caching mechanism).

At this point, the garbage collection process is over.
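The gc_refs passes can be sketched in pure Python on a toy object graph. This is an illustration under stated assumptions, not the C implementation: find_unreachable and the name-based dictionaries are hypothetical, and the last pass also keeps objects that are reachable from an externally referenced object even if their own gc_refs dropped to 0, which the collector needs in order not to free live data.

```python
def find_unreachable(refcounts, edges):
    """Return the names of objects that only cycles keep alive.

    refcounts: name -> total reference count of that object
    edges:     name -> names of the tracked objects it refers to
    """
    # Pass 1: copy every reference counter into a scratch table (gc_refs).
    gc_refs = dict(refcounts)

    # Pass 2: subtract the references that come from inside the scanned set.
    for src, targets in edges.items():
        for dst in targets:
            gc_refs[dst] -= 1

    # Pass 3: anything still above zero is referenced from outside; it and
    # everything reachable from it must survive.
    reachable = set()
    stack = [name for name, n in gc_refs.items() if n > 0]
    while stack:
        name = stack.pop()
        if name not in reachable:
            reachable.add(name)
            stack.extend(edges.get(name, ()))
    return set(refcounts) - reachable

# v1 <-> v2 form an orphaned cycle; `live` is still bound to a variable
# and refers to `child`, whose only reference comes from `live`.
refcounts = {"v1": 1, "v2": 1, "live": 1, "child": 1}
edges = {"v1": ["v2"], "v2": ["v1"], "live": ["child"], "child": []}
print(sorted(find_unreachable(refcounts, edges)))   # ['v1', 'v2']
```

Here child also ends up with a gc_refs of 0, but it survives because it is reachable from live; only the v1/v2 cycle is reported as garbage.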

1.5 Cache mechanism

From the above you might conclude that an object is destroyed and its memory released as soon as its reference counter reaches 0. In practice it is not that blunt, because repeatedly creating and destroying objects would hurt performance. Python therefore introduces a caching mechanism.

For example, when the reference counter reaches 0 the object is not destroyed but placed in a linked list called free_list. When an object of the same type is created later, no new memory is allocated; instead, a previously cached object is taken from free_list and its internal value is reset.

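  • The float type maintains a free_list linked list that can cache up to 100 float objects.
  v1 = 3.14    # allocate memory for the float object and add it to the refchain list.
  print( id(v1) ) # memory address: 4436033488
  del v1    # reference counter - 1; at 0 the object is removed from refchain but not destroyed, it is added to float's free_list.
  v2 = 9.999    # take an object from free_list first and reset it to 9.999; allocate new memory only if free_list is empty.
  print( id(v2) ) # memory address: 4436033488
  # Note: when the reference counter reaches 0, free_list is checked first. If it is not full the object is cached; if it is full the object is destroyed.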
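  • The int type does not use a free_list. Instead it maintains a small_ints array that holds commonly used values (the small integer pool), covering the range -5 <= value < 257. When integers in this range are reused, no new memory is allocated.
  v1 = 38    # fetch the integer object 38 from the small_ints pool, add it to refchain and increment its reference counter.
  print( id(v1))  # memory address: 4514343712
  v2 = 38 # fetch the integer object 38 from the small_ints pool and increment the reference counter of the object in refchain.
  print( id(v2) ) # memory address: 4514343712
  # Note: when the interpreter starts, -5 to 256 are already placed in small_ints with their reference counters initialized to 1.
  # Code that uses one of these values takes the object from small_ints and increments its reference counter. The counters of
  # objects in small_ints never reach 0 (they start at 1), so these objects are never destroyed.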
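The small integer pool can be observed with identity comparisons. This is a minimal sketch, assuming CPython. The values are built with int("...") at runtime because equal literals in the same source file may be merged by the compiler, which would hide the effect. The result for 257 is typical rather than guaranteed, so it is printed without an expected value.

```python
a = int("256")      # built at runtime, so the compiler cannot merge the constants
b = int("256")
c = int("257")
d = int("257")

same_small = a is b     # both names point at the cached object for 256
same_large = c is d     # typically False on CPython: two separate objects

print(same_small, same_large)
```

Because the pool is an implementation detail, code should compare integers with == rather than is.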
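  • The str type maintains a unicode_latin1[256] array that caches single-character strings, including all ASCII characters, so they are not created again when used later.
  v1 = "A"
  print( id(v1) ) # output: 4517720496
  del v1
  v2 = "A"
  print( id(v2) ) # output: 4517720496
  # In addition, Python interns strings that contain only letters, digits and underscores (see Objects/codeobject.c in the source).
  # If such a string already exists in memory it is not created again; the existing address is reused. Unlike free_list, an interned
  # string does not stay alive forever: it can only be reused while it still exists in memory.
  v1 = "asdfg"
  v2 = "asdfg"
  print(id(v1) == id(v2)) # output: True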
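Strings built at runtime are generally not interned automatically, but they can be interned explicitly with sys.intern from the standard library. A minimal sketch; the sample words are arbitrary and chosen only for illustration.

```python
import sys

parts = ["garbage", "collector"]
s1 = " ".join(parts)          # built at runtime, contains a space
s2 = " ".join(parts)          # equal value, usually a distinct object

equal_values = s1 == s2       # True: same characters
i1 = sys.intern(s1)           # register the string in the intern table
i2 = sys.intern(s2)           # returns the already interned object
same_object = i1 is i2

print(equal_values, same_object)   # True True
```

Explicit interning is mainly useful for strings that are compared or used as dictionary keys very often, since identical interned strings share one object.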
  • The list type maintains a free_list array that can cache up to 80 list objects.
  v1 = [11,22,33]
  print( id(v1) ) # output: 4517628816
  del v1
  v2 = ["hello","world"]
  print( id(v2) ) # output: 4517628816
  • The tuple type maintains a free_list array with a capacity of 20. Each element of the array is a linked list that can hold up to 2000 tuple objects. When a tuple is cached, its length is used as the index to find the corresponding linked list in free_list, and the tuple is added to that list.
  v1 = (1,2)
  print( id(v1) )
  del v1  # the tuple has 2 elements, so the object is cached in the linked list at free_list[2].
  v2 = ("hahaha","Alex")  # no new memory is allocated; an object is taken from the linked list at free_list[2].
  print( id(v2) )
  • The dict type maintains a free_list array that can cache up to 80 dict objects.
  v1 = {"k1":123}
  print( id(v1) )  # output: 4515998128
  del v1
  v2 = {"name":"hahaha","age":18,"gender":"male"}
  print( id(v2) ) # output: 4515998128

 


Origin blog.csdn.net/qq_38082146/article/details/109365216