[LINUX-06-2] Python "garbage" collection

Foreword

In Python, everything is an object, and every variable assignment works by object reference. A running program needs space in memory to hold the temporary variables it generates; when the computation finishes, the results are written out to permanent storage. If the data volume is too large and memory is managed poorly, the program is prone to OOM (out of memory, colloquially "blowing up the memory"), and the operating system may abort it. Memory management matters even more on a server, where careless code easily causes a memory leak. A leak here does not mean an information security problem in which malicious programs read your memory, nor that memory has physically disappeared. It means that the program, through a design error, loses control of a section of memory it allocated and never frees it once it is no longer used, so that memory is wasted. In other words, that memory is beyond the reach of the gc.

Reference counting

Because everything in Python is an object, every variable you see is essentially a pointer to an object. When an object's reference count (the number of pointers to it) drops to 0, the object can never be reached again, so it becomes garbage and needs to be reclaimed. Put simply, no variable points to it any more.

import os
import psutil

# Show the current memory usage of this Python process
def show_memory_info(hint):
    pid = os.getpid()
    p = psutil.Process(pid)
    info = p.memory_full_info()
    # uss is reported in bytes; convert to MB
    memory = info.uss / 1024. / 1024
    print('{} memory used: {} MB'.format(hint, memory))

Consider a function func() that builds the list a as a local variable and calls show_memory_info() before and after. Once the list is created, memory usage quickly rises to about 433 MB; after the call returns, memory goes back to normal. The list a declared inside the function is a local variable, so when the function returns, the local variable's reference is removed. The reference count of the list object drops to 0, Python garbage collects it, and the large block of memory it occupied is released. The listing below changes one thing: it declares a as global.
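
The local-variable case can also be checked without measuring memory at all. The following is a minimal sketch, assuming CPython; BigList and make_local are hypothetical names introduced only for this illustration (a plain list cannot be weakly referenced, so a trivial subclass is used).

```python
import weakref


class BigList(list):
    """Hypothetical list subclass; plain lists cannot be weakly referenced."""


def make_local():
    a = BigList(range(1000))
    # A weak reference observes the object without keeping it alive.
    return weakref.ref(a)


watcher = make_local()

# The only strong reference was the local name `a`, which vanished on return.
local_was_freed = watcher() is None
print(local_was_freed)
```

On CPython this prints True: the object is reclaimed as soon as the function returns, with no explicit collection step.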

def func():
    show_memory_info('initial')
    global a
    a = [i for i in range(10000000)]
    show_memory_info('after a created')

func()
show_memory_info('finished')

########## Output ##########
initial memory used: 48.88671875 MB
after a created memory used: 433.94921875 MB
finished memory used: 433.94921875 MB

In this code, global a declares a as a global variable. Even after the function returns, the reference to the list still exists, so the object is not garbage collected and keeps occupying a lot of memory. Similarly, if we return the list and receive it in the main program, the reference still exists, garbage collection is not triggered, and the memory stays occupied:

def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    show_memory_info('after a created')
    return a

a = func()
show_memory_info('finished')

########## Output ##########
initial memory used: 47.96484375 MB
after a created memory used: 434.515625 MB
finished memory used: 434.515625 MB

How can you see how many references an object has? Use sys.getrefcount:

import sys

a = []
# Two references: one from a, one from getrefcount
print(sys.getrefcount(a))

def func(a):
    # Four references: a, the Python function call stack, the function parameter, and getrefcount
    print(sys.getrefcount(a))

func(a)
# Two references: one from a, one from getrefcount; the references created by the call to func no longer exist
print(sys.getrefcount(a))

########## Output ##########
2
4
2

A function call therefore adds two extra references: 1. the function call stack, 2. the function parameter.
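
Absolute counts are a CPython implementation detail and can differ between versions, so it is often easier to look at how the count changes. A small sketch, assuming CPython; the names b and holder are introduced only for this illustration:

```python
import sys

a = []
base = sys.getrefcount(a)       # includes the temporary reference held by getrefcount

b = a                           # a second name for the same list
with_alias = sys.getrefcount(a)

holder = [a]                    # a container slot also holds a reference
with_container = sys.getrefcount(a)

del b, holder                   # dropping both references restores the count
restored = sys.getrefcount(a)

print(with_alias - base, with_container - base, restored - base)
```

Each new name or container slot raises the count by one, and deleting it lowers the count again, so the three differences printed are 1, 2, and 0.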

As this shows, Python does not require you to release memory by hand as C does. It does, however, let you trigger a collection manually with gc.collect(). In the listing below, del a removes the reference and gc.collect() forces a collection:

import gc

show_memory_info('initial')
a = [i for i in range(10000000)]
show_memory_info('after a created')
del a
gc.collect()
show_memory_info('finish')
print(a)

########## Output ##########
initial memory used: 48.1015625 MB
after a created memory used: 434.3828125 MB
finish memory used: 48.33203125 MB
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
     11
     12 show_memory_info('finish')
---> 13 print(a)

NameError: name 'a' is not defined
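
For an object that is not part of a reference cycle, dropping the last reference should be enough on its own. The sketch below is an illustration, assuming CPython: it uses the standard-library tracemalloc module instead of psutil, a smaller list than the listing above, and switches the cyclic collector off so that only reference counting can be responsible for the result.

```python
import gc
import tracemalloc

gc.disable()                    # switch off the cyclic collector for this experiment
tracemalloc.start()

a = list(range(200_000))
with_list, _ = tracemalloc.get_traced_memory()

del a                           # no gc.collect() call follows
after_del, _ = tracemalloc.get_traced_memory()

tracemalloc.stop()
gc.enable()

print(f"with list: {with_list} bytes, after del: {after_del} bytes")
```

The traced memory falls back sharply right after del a, even though the collector is disabled and gc.collect() is never called.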

So far, Python's garbage collection mechanism looks very simple: once an object's reference count reaches 0, the object is reclaimed. But is a reference count of 0 a necessary and sufficient condition for an object to be reclaimed?

Circular references

If there are two objects that refer to each other and are no longer referenced by other objects, should they be garbage collected?

def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    b = [i for i in range(10000000)]
    show_memory_info('after a, b created')
    a.append(b)
    b.append(a)

func()
show_memory_info('finished')

########## Output ##########
initial memory used: 47.984375 MB
after a, b created memory used: 822.73828125 MB
finished memory used: 821.73046875 MB

The results clearly show that the two lists were not reclaimed. From the program's point of view, once the function ends, the local variables a and b no longer exist; but because the lists refer to each other, neither reference count is 0. There are two ways to deal with this:

1. Restructure the code to avoid the circular reference.

2. Trigger a collection manually with gc.collect():

import gc

def func():
    show_memory_info('initial')
    a = [i for i in range(10000000)]
    b = [i for i in range(10000000)]
    show_memory_info('after a, b created')
    a.append(b)
    b.append(a)

func()
gc.collect()
show_memory_info('finished')

########## Output ##########
initial memory used: 49.51171875 MB
after a, b created memory used: 824.1328125 MB
finished memory used: 49.98046875 MB
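
The same behaviour can be observed on a small scale without measuring memory. This is a minimal sketch, assuming CPython; Node is a hypothetical class used only to build the cycle, and automatic collection is disabled so that it cannot run in the middle of the experiment.

```python
import gc
import weakref


class Node:
    """Hypothetical container object used to build a reference cycle."""

    def __init__(self):
        self.partner = None


gc.disable()                    # keep an automatic collection from interfering

a, b = Node(), Node()
a.partner, b.partner = b, a     # a and b now refer to each other
watcher = weakref.ref(a)

del a, b                        # no names left, but each count is still 1
alive_before = watcher() is not None

found = gc.collect()            # the cycle detector finds the unreachable pair
alive_after = watcher() is not None

gc.enable()
print(alive_before, alive_after, found)
```

The object is still alive after del, and is only reclaimed once gc.collect() has run; the return value of gc.collect() is the number of unreachable objects it found.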

For circular references, Python has its own automatic garbage collection algorithms: 1. mark-sweep, 2. generational collection.

Mark-sweep

Mark-sweep works in two steps:

1. The GC marks all "active objects".

2. It reclaims the unmarked, "inactive" objects.

How does Python decide which objects are inactive? Graph theory's notion of unreachability helps here. In a directed graph, if we start traversing from a node and mark every node we pass through, the nodes still unmarked when the traversal ends are called unreachable. The program can no longer use these objects, so they should be reclaimed.

Traversing the full graph every time would be a huge waste of performance, however. In Python's garbage collection implementation, mark-sweep therefore maintains a doubly linked list and considers only container objects (list, dict, tuple, class instances), since only containers can take part in circular references.
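
Which objects the collector keeps track of can be checked with gc.is_tracked() from the standard library. A short sketch, assuming CPython; the variable names are introduced only for this illustration:

```python
import gc

# Atomic values cannot hold references, so the collector does not track them.
int_tracked = gc.is_tracked(42)
str_tracked = gc.is_tracked("text")

# A list can hold references to other objects, so it is tracked.
list_tracked = gc.is_tracked([])

print(int_tracked, str_tracked, list_tracked)
```

This prints False False True: integers and strings are left to reference counting alone, while a list is a candidate for cycle detection.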

(Figure: Python "garbage" collection)

In the figure, the small black circle stands for a global variable and is treated as the root object. Object 1 can be reached directly from the small black circle, so it is marked; objects 2 and 3 can be reached indirectly, so they are marked too; objects 4 and 5 cannot be reached. Objects 1, 2, and 3 are therefore active, while 4 and 5 are inactive and will be reclaimed by the GC.
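
The reachability idea can be sketched in a few lines. This is a toy illustration on a plain dictionary graph, not how CPython's collector is actually implemented; the function mark_and_sweep and the exact edges are assumptions that loosely follow the description of the figure.

```python
def mark_and_sweep(graph, roots):
    """Return (marked, swept) for a directed graph given as {node: [successors]}."""
    marked = set()
    stack = list(roots)
    # Mark phase: walk everything reachable from the roots.
    while stack:
        node = stack.pop()
        if node in marked:
            continue
        marked.add(node)
        stack.extend(graph.get(node, []))
    # Sweep phase: whatever was never marked is garbage.
    swept = set(graph) - marked
    return marked, swept


# Assumed edges: root -> 1 -> 2 -> 3, while 4 and 5 only refer to each other.
graph = {"root": [1], 1: [2], 2: [3], 3: [], 4: [5], 5: [4]}
marked, swept = mark_and_sweep(graph, ["root"])
print(sorted(swept))
```

Objects 4 and 5 refer to each other, so neither would ever reach a count of 0, yet both end up in the swept set because nothing reachable from the root points to them.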

Generational collection

Generational collection trades space for time. Python divides objects into sets according to how long they have survived, and each set is called a generation. There are 3 generations: the young generation (generation 0), the middle generation (generation 1), and the old generation (generation 2). They correspond to three linked lists, and the longer objects have survived, the less often their generation is collected.

Newly created objects are placed in the young generation. When the young generation's list reaches its limit (when the number of new objects minus the number of deleted objects tracked by the garbage collector reaches the threshold), a collection is triggered: objects that can be reclaimed are reclaimed, and the survivors are moved to the middle generation, and so on. Objects in the old generation are the longest-lived and may even survive for the whole life of the system. Generational collection is itself built on mark-sweep.

The idea behind it is that newly created objects are more likely to become garbage, while objects that have lived longer are more likely to keep living. Collecting old objects less often saves a lot of computation and improves Python's performance.

So, to answer the earlier question: a reference count of 0 is a sufficient but not necessary condition for an object to be reclaimed, because objects caught in a circular reference are reclaimed as well.
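
The thresholds and counters behind the generations can be inspected through the standard-library gc module. A brief sketch; the actual numbers depend on the Python version and on what the process has done so far, so none are shown here.

```python
import gc

# One threshold and one counter per generation (0, 1, 2).
thresholds = gc.get_threshold()
counts = gc.get_count()
print("thresholds:", thresholds)
print("counts:", counts)

# Collect only the youngest generation; the return value is the
# number of unreachable objects found.
found_young = gc.collect(0)
print("found in generation 0:", found_young)
```

gc.set_threshold() accepts the same three values if the collection frequency needs tuning, though the defaults are usually a reasonable starting point.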

Debugging

You can use objgraph to debug reference problems. I have not read its official documentation carefully, so I can only point you to it for reference. Two of its functions are very useful: 1. show_refs() 2. show_backrefs()
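
objgraph is a third-party package, and its calls are not reproduced here. As a rough standard-library stand-in for the kind of question show_backrefs() answers ("who still refers to this object?"), gc.get_referrers() can be used. This is a swapped-in technique for illustration, not objgraph's API; target and holder are hypothetical names.

```python
import gc

target = []
holder = [target]               # one known container that refers to target

# Ask the collector which tracked objects refer directly to target.
referrers = gc.get_referrers(target)
holder_found = any(r is holder for r in referrers)
print(holder_found)
```

The result also contains the namespace in which target is defined, so in real debugging the list usually needs filtering before it is useful.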

Original text from: https://developer.51cto.com/art/201912/607082.htm

The address of this article: https://www.linuxprobe.com/python-garbage-collection.html
