Python garbage collection mechanism (GC-GarbageCollection)

Because Python collects garbage automatically, you do not have to release objects manually after creating them. This spares developers from low-level memory management. Still, code written without an understanding of the collector can waste memory or run slower than it needs to, for example when circular references keep objects alive longer than expected.

There are many garbage collection algorithms, chiefly reference counting, mark-sweep, and generational collection. Python relies mainly on [reference counting], supplemented by [mark-clear] (that is, mark-sweep) and [generational collection].

1. Reference counting

The principle is that every object carries a reference counter (ob_refcnt in CPython) that records how many references currently point to it. When the counter drops to 0, the Python virtual machine destroys the object and reclaims its memory.

1. Reference count +1

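The reference counter of an object is incremented in the following four situations:

* The object is created: a = 14
* The object is referenced by another name: b = a
* The object is passed to a function as an argument: func(a)
* The object is stored as an element in a container: List = [a, "a", "b", 2]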

2. Reference count -1

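The reference counter of an object is decremented in the following four situations:

* A name bound to the object is explicitly destroyed: del a
* A name bound to the object is rebound to a new object: a = 26
* The object leaves its scope. For example, when func finishes, the reference counters of the objects bound to its local variables are decremented (this does not apply to global variables).
* The object is removed from a container, or the container itself is destroyed.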

3. Hands-on code

The current reference count of the object bound to a name can be read with getrefcount() from the sys module. Note that the call itself temporarily adds one reference, because the object is passed to getrefcount() as an argument.

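import sys


class A:
    def __init__(self):
        pass


# The "x + y" in each label is the author's expected arithmetic; the exact
# numbers printed can differ between CPython versions.
print("create object 0 + 1 =", sys.getrefcount(A()))

a = A()
print("create object and assign it 0 + 2 =", sys.getrefcount(a))

b = a
c = a
print("bind to 2 more names 2 + 2 =", sys.getrefcount(a))

b = None
print("rebind one name 4 - 1 =", sys.getrefcount(a))

del c
print("del one name 3 - 1 =", sys.getrefcount(a))

d = [a, a, a]
print("add to a list 3 times 2 + 3 =", sys.getrefcount(a))


def func(c):
    print("pass to a function 1 + 2 =", sys.getrefcount(c))


func(A())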

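4. The reference counting method has advantages and disadvantages:

4.1. Advantages
* Efficient.
* No pauses at runtime, that is, it works in real time: as soon as an object has no references, its memory is released, without waiting for a particular moment as other mechanisms do. A further benefit is that the cost of reclaiming memory is spread over normal execution.
* Objects have a deterministic lifetime.
* Easy to implement.
4.2. Disadvantages
* Every object needs space for its reference counter, which increases memory consumption.
* When a large object such as a dictionary is released, all the objects it references must be released recursively, which can take a long time.
* Circular references. This is the fatal weakness of reference counting. It cannot handle them at all, so another garbage collection algorithm must supplement it.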
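
The difference between immediate release and a circular reference can be observed with a small experiment. This is a minimal sketch that assumes a standard CPython build. The Node class and the variable names are hypothetical and exist only for this illustration. The weak references are used purely to check whether an object still exists.

```python
import gc
import weakref


class Node:
    """Hypothetical container class used only for this illustration."""

    def __init__(self):
        self.other = None


gc.disable()  # keep the cyclic collector from running on its own during the demo

# Case 1: no cycle. Dropping the last reference frees the object at once.
plain = Node()
plain_alive = weakref.ref(plain)  # a weak reference does not keep the object alive
del plain
freed_without_gc = plain_alive() is None

# Case 2: a reference cycle. Each object keeps the other's counter above zero.
x, y = Node(), Node()
x.other, y.other = y, x
cycle_alive = weakref.ref(x)
del x, y
still_alive_after_del = cycle_alive() is not None

gc.collect()  # run the cyclic collector explicitly
freed_after_collect = cycle_alive() is None

gc.enable()
print(freed_without_gc, still_alive_after_del, freed_after_collect)
```

On a standard CPython build the three flags should all be True: the plain object disappears as soon as its name is deleted, while the two objects in the cycle survive until the cyclic collector runs.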


2. Mark-clear

The "mark-sweep" algorithm is a garbage collection algorithm based on tracing GC. Reference counting cannot handle circular references: objects that reference each other in a cycle never see their counters reach 0, so they are never reclaimed.
Python uses the "mark-clear" (mark-and-sweep) algorithm to resolve circular references that container objects may create. (Only container objects, such as lists, dictionaries, tuples, and instances of user-defined classes, can form circular references. Simple types such as numbers and strings cannot. As an optimization, tuples that contain only simple types are ignored by the mark-and-sweep algorithm.)
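
Which objects the cyclic collector watches can be checked with gc.is_tracked() from the standard library. The following is a minimal sketch that assumes CPython; the Point class is hypothetical and exists only for the demonstration.

```python
import gc


class Point:
    """Hypothetical user-defined class for illustration."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


# Simple values are not tracked; containers and class instances are.
tracked = {
    "int": gc.is_tracked(42),
    "str": gc.is_tracked("hello"),
    "list": gc.is_tracked([1, 2, 3]),
    "instance": gc.is_tracked(Point(1, 2)),
}
print(tracked)
```

Tuples are deliberately left out of this sketch: whether a tuple of simple values is reported as tracked can depend on whether a collection has already examined it.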

The mark-clear algorithm is mainly used for potential circular reference problems. The algorithm is divided into 2 steps:

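* Mark phase. Treat all objects as nodes of a graph and build the graph from the reference relationships between them. Traverse the objects starting from the root nodes and mark every object visited as "reachable".
* Sweep phase. Traverse all objects and reclaim every object that is not marked as "reachable".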

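[Figure: object reference graph. A small black circle (the root) reaches object 1 directly and objects 2 and 3 indirectly; objects 4 and 5 are unreachable.]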

Objects are linked by references (pointers) into a directed graph: the objects are its nodes and the reference relationships are its edges. Starting from the root objects, the collector traverses objects along the directed edges. Reachable objects are marked as active, and unreachable objects are inactive objects to be cleared. Root objects are global variables, call stacks, and registers.
In the figure above, the small black circle is a global variable and therefore a root object. Starting from it, object 1 can be reached directly and is marked, and objects 2 and 3 can be reached indirectly and are marked as well. Objects 4 and 5 cannot be reached. So 1, 2, and 3 are active objects, and 4 and 5 are inactive objects that the GC will reclaim.
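
The two phases can be sketched in a few lines of Python. This is a toy model, not CPython's implementation: the graph is a plain dictionary, the function name mark_and_sweep is made up for this example, and the exact edges between objects 1 to 5 are an assumption, since the text only states which objects are reachable.

```python
def mark_and_sweep(graph, roots):
    """Return (reachable, garbage) for a graph given as {node: [referenced nodes]}."""
    reachable = set()
    stack = list(roots)
    # Mark phase: follow edges from the roots and mark every visited node.
    while stack:
        node = stack.pop()
        if node in reachable:
            continue
        reachable.add(node)
        stack.extend(graph[node])
    # Sweep phase: every node that was never marked is garbage.
    garbage = set(graph) - reachable
    return reachable, garbage


# Assumed edges: root -> 1 -> 2 -> 3, while 4 and 5 only reference each other.
graph = {
    "root": [1],
    1: [2],
    2: [3],
    3: [],
    4: [5],
    5: [4],
}
reachable, garbage = mark_and_sweep(graph, roots=["root"])
print(sorted(garbage))  # [4, 5]
```

Objects 4 and 5 reference each other, so reference counting alone would never free them, but neither is reachable from the root and both end up in the garbage set.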

As shown in the figure below, to track container objects the mark-clear algorithm requires each container object to maintain two additional pointers. These pointers link the container objects into a doubly linked list, so that objects can be inserted and deleted conveniently. The Python interpreter (CPython) maintains two such doubly linked lists: one stores the container objects that need to be scanned, and the other stores objects that are tentatively unreachable. In the figure, the two lists are named "Object to Scan" and "Unreachable". The figure shows the following situation: link1, link2, and link3 form a reference ring, and link1 is also referenced by a variable A (strictly speaking, by the name A). link4 references itself and so also forms a reference cycle. The figure also shows that, besides the variable ref_count that records the current reference count, each node has a gc_ref variable. gc_ref is a copy of ref_count, so its initial value equals the value of ref_count.

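[Figure: the "Object to Scan" and "Unreachable" linked lists, with link1, link2, and link3 forming a reference ring and link4 referencing itself.]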

When gc starts, it traverses the container objects in the "Object to Scan" list one by one and decrements the gc_ref of every object that the current object references. (When link1 is scanned, link1 references link2, so the gc_ref of link2 is decremented. Then link2 is scanned; since link2 references link3, the gc_ref of link3 is decremented, and so on.) After all objects in the "Object to Scan" list have been examined in this way, the ref_count and gc_ref of the objects in the two linked lists are as shown in the figure below. This step removes the contribution of circular references from the reference counts.

[Figure: ref_count and gc_ref of the objects in both linked lists after the first pass.]

Then gc scans all container objects again. If the gc_ref of an object is 0, the object is marked as GC_TENTATIVELY_UNREACHABLE and moved to the "Unreachable" list. This is what happens to link3 and link4 in the figure below.

[Figure: link3 and link4 moved to the "Unreachable" list.]

If the object's gc_ref is not 0, then the object will be marked as GC_REACHABLE. At the same time, when gc finds that a node is reachable, it will recursively mark all nodes that can be reached from this node as GC_REACHABLE, which is the situation encountered by link2 and link3 in the figure below.

[Figure: link2 and link3 marked as GC_REACHABLE.]

In addition to marking all reachable nodes as GC_REACHABLE, if the node is currently in the "Unreachable" list, it needs to be moved back to the "Object to Scan" list. The following figure shows the situation after link3 is moved back.

[Figure: the linked lists after link3 is moved back to the "Object to Scan" list.]

Once the second traversal has visited every object, the objects left in the "Unreachable" list are the ones that really need to be released. As shown in the figure above, only link4 remains in the "Unreachable" list at this point, and gc releases it immediately.
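
The walkthrough above can be condensed into a small simulation. This is a simplified sketch, not CPython's code: the function find_unreachable and the dictionary layout are invented for this example, the two linked lists are replaced by plain sets, and the ref_count values are assumptions derived from the description (link1 is referenced by A and by link3, link4 only by itself).

```python
def find_unreachable(objects):
    """objects maps name -> {"ref_count": int, "refs": [names it references]}."""
    # Step 1: gc_ref starts as a copy of ref_count.
    gc_ref = {name: info["ref_count"] for name, info in objects.items()}

    # Step 2: subtract every reference that comes from another tracked container.
    for info in objects.values():
        for target in info["refs"]:
            gc_ref[target] -= 1

    # Step 3: objects whose gc_ref is still positive are referenced from outside;
    # everything reachable from them is rescued as well.
    reachable = set()
    stack = [name for name, count in gc_ref.items() if count > 0]
    while stack:
        name = stack.pop()
        if name in reachable:
            continue
        reachable.add(name)
        stack.extend(objects[name]["refs"])

    # Whatever was never rescued corresponds to the "Unreachable" list.
    return set(objects) - reachable


objects = {
    "link1": {"ref_count": 2, "refs": ["link2"]},  # referenced by A and link3
    "link2": {"ref_count": 1, "refs": ["link3"]},
    "link3": {"ref_count": 1, "refs": ["link1"]},
    "link4": {"ref_count": 1, "refs": ["link4"]},  # self-reference
}
unreachable = find_unreachable(objects)
print(unreachable)  # {'link4'}
```

After the subtraction only link1 keeps a positive gc_ref, because of the outside reference from A. link2 and link3 are rescued through link1, and link4 is the only object left to release.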

The collection phase described above suspends the entire application; execution resumes only after marking and sweeping have finished.

3. Generational collection

Reclaiming circularly referenced objects suspends the entire application. To shorten these pauses, Python uses "generational collection", which trades space for time to make garbage collection more efficient.

Generational collection rests on a statistical observation: in most programs a large share of memory blocks is short-lived, while the rest live much longer, some of them for the entire run of the program. Short-lived objects usually account for 80% to 90% of the total. The idea is simple: the longer an object has existed, the less likely it is to be garbage, so the less often it needs to be examined. This reduces the number of objects traversed by the mark-clear algorithm and so speeds up garbage collection.

Python's gc defines three generations (0, 1, 2) for objects. Every new object starts in generation 0. If it survives a round of gc scanning, it is moved to generation 1, where it is scanned less often. If it survives another round there, it is moved to generation 2, where it is scanned least often.

When is a gc scan triggered? When the number of allocated objects minus the number of released objects in a generation reaches a certain threshold, a scan of that generation is triggered. Note that when a scan of one generation is triggered, all younger generations are scanned as well. That is, a scan of generation 2 also scans generations 0 and 1, and a scan of generation 1 also scans generation 0.

The threshold can be viewed and adjusted through the following two functions:

gc.get_threshold() # (threshold0, threshold1, threshold2).
gc.set_threshold(threshold0[, threshold1[, threshold2]])

The three parameters threshold0, threshold1, and threshold2 of set_threshold() work as follows. The GC records the number of objects allocated and the number of objects released since the last collection. When the difference between the two exceeds threshold0, a gc scan starts. Initially, only generation 0 is checked. Generation 1 is checked if generation 0 has been checked more than threshold1 times since generation 1 was last checked. Likewise, generation 2 is checked if generation 1 has been checked more than threshold2 times since generation 2 was last checked. get_threshold() returns the three values. The default has traditionally been (700, 10, 10), although it may differ between CPython versions.
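
The thresholds and the current counters can be inspected and adjusted as in the following minimal sketch. It uses only gc.get_threshold(), gc.get_count(), and gc.set_threshold() from the standard library. The value 1000 is an arbitrary choice for illustration, not a recommendation.

```python
import gc

original = gc.get_threshold()  # current (threshold0, threshold1, threshold2)
counts = gc.get_count()        # current collection counters, one per generation

# Raise threshold0 so that generation 0 is scanned less often.
gc.set_threshold(1000, original[1], original[2])
raised = gc.get_threshold()

gc.set_threshold(*original)    # restore the previous settings
restored = gc.get_threshold()

print(original, counts, raised, restored)
```

A higher threshold0 means fewer automatic scans at the cost of garbage in reference cycles staying in memory longer, which is the space-for-time trade-off described above.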

Summary
In Python, garbage collection relies mainly on reference counting. "Mark-clear" handles the circular references that container objects may create, and "generational collection" makes garbage collection more efficient by trading space for time.


Origin blog.csdn.net/TFATS/article/details/129859219