In-depth understanding of Python's object copy and memory layout

Foreword

This article looks at object copying in Python. Without further ado, let's go straight to the code. Do you know what each of the following program fragments prints?

a = [1, 2, 3, 4]
b = a
print(f"{a = } \t|\t {b = }")
a[0] = 100
print(f"{a = } \t|\t {b = }")

a = [1, 2, 3, 4]
b = a.copy()
print(f"{a = } \t|\t {b = }")
a[0] = 100
print(f"{a = } \t|\t {b = }")

a = [[1, 2, 3], 2, 3, 4]
b = a.copy()
print(f"{a = } \t|\t {b = }")
a[0][0] = 100
print(f"{a = } \t|\t {b = }")

import copy

a = [[1, 2, 3], 2, 3, 4]
b = copy.copy(a)
print(f"{a = } \t|\t {b = }")
a[0][0] = 100
print(f"{a = } \t|\t {b = }")

import copy

a = [[1, 2, 3], 2, 3, 4]
b = copy.deepcopy(a)
print(f"{a = } \t|\t {b = }")
a[0][0] = 100
print(f"{a = } \t|\t {b = }")

The rest of this article analyzes these programs in detail.

Memory layout of Python objects

First, here is a useful website that visualizes how data is logically laid out in memory: pythontutor.com/visualize.h…

We run the first snippet on this site:

As the visualization shows, a and b point to the same data object in memory, so the two values printed by the first snippet are always identical. How can we determine the memory address of an object? Python provides the built-in function id() for this; in CPython it returns the object's memory address:

a = [1, 2, 3, 4]
b = a
print(f"{a = } \t|\t {b = }")
a[0] = 100
print(f"{a = } \t|\t {b = }")
print(f"{id(a) = } \t|\t {id(b) = }")
# Output
# a = [1, 2, 3, 4] 	|	 b = [1, 2, 3, 4]
# a = [100, 2, 3, 4] 	|	 b = [100, 2, 3, 4]
# id(a) = 4393578112 	|	 id(b) = 4393578112

In fact, the memory layout shown above is not entirely accurate, although it does show how the objects relate to one another. Let's look more closely. In CPython, every variable can be thought of as a pointer that stores the memory address of the Python object it represents.
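
To make the "variable as pointer" idea concrete, here is a minimal sketch. It relies only on the built-in is operator, which tests whether two names refer to the same object, and on ==, which compares contents. The variable names holding the results are chosen just for this illustration.

```python
a = [1, 2, 3, 4]
b = a                      # b now refers to the same list object; nothing is copied
same_after_assignment = a is b

b = [1, 2, 3, 4]           # rebinding b points it at a new list with equal contents
same_after_rebinding = a is b
equal_after_rebinding = a == b

print(same_after_assignment, same_after_rebinding, equal_after_rebinding)
# True False True
```

Assignment never copies the object; it only changes which object a name points to.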

In Python, a list holds pointers to Python objects, not the actual data. The small snippet above can therefore be represented by the following diagram of the objects' layout in memory:

The variable a points to the list [1, 2, 3, 4] in memory. The list holds four items, each of which is a pointer, and these pointers point to the objects 1, 2, 3, and 4 in memory. You may wonder why this indirection is needed: the items are all integers, so why not store the integers directly in the list instead of storing pointers to them?

The reason is that a Python list can store any Python object. For example, the following program is legal:

data = [1, {1:2, 3:4}, {'a', 1, 2, 25.0}, (1, 2, 3), "hello world"]

The items in this list are, in order, an integer, a dictionary, a set, a tuple, and a string. Pointers are exactly what makes this possible: every pointer occupies the same amount of memory, so a plain array can store pointers to Python objects, and each pointer can then point to a real Python object of any type.
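
One way to observe that a list slot holds a reference rather than a private copy is the small sketch below. The settings dictionary is a hypothetical object introduced only for this illustration.

```python
settings = {"debug": True}          # hypothetical object, used only for illustration
data = [1, settings, "hello world"]

# The list slot and the name `settings` refer to the same dictionary.
slot_is_same_object = data[1] is settings

# Mutating the dictionary through the name is visible through the list slot.
settings["debug"] = False
print(slot_is_same_object, data[1])
# True {'debug': False}
```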

A small test

With that analysis in mind, what is the memory layout of the following code?

data = [[1, 2, 3], 4, 5, 6]
data_assign = data
data_copy = data.copy()

data_assign = data: we covered the memory layout of this kind of assignment earlier, but as a reminder, it means that data_assign and data point to the same data, that is, the same list.
data_copy = data.copy(): this statement makes a shallow copy of the list that data points to and makes data_copy point to the copy. A shallow copy copies each pointer stored in the list, not the data those pointers point to. In the memory layout diagram, data_copy points to a new list, but the pointers in that list point to the same objects as the pointers in data. In the diagram, data_copy is drawn with green arrows and data with black arrows.
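
The layout described above can be checked without a diagram. The following sketch uses only the is operator and len() on the same three variables:

```python
data = [[1, 2, 3], 4, 5, 6]
data_assign = data
data_copy = data.copy()

print(data_assign is data)        # True: plain assignment, same list
print(data_copy is data)          # False: the outer list is new
print(data_copy[0] is data[0])    # True: the inner list is shared

data_copy.append(7)               # grows only the copied outer list
print(len(data), len(data_copy))  # 4 5
```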

View the memory address of the object

In the previous section we analyzed the memory layout of objects. In this section we verify that analysis with a very handy tool that Python provides: the built-in id() function returns the memory address of an object, so id(a) is the memory address of the object that the variable a points to.

Look at the output of the following program:

a = [1, 2, 3]
b = a
print(f"{id(a) = } {id(b) = }")
for i in range(len(a)):
    print(f"{i = } {id(a[i]) = } {id(b[i]) = }")

According to our earlier analysis, a and b point to the same block of memory, that is, to the same Python object, so the id values printed for a and b are equal. The output is as follows:

id(a) = 4392953984 id(b) = 4392953984
i = 0 id(a[i]) = 4312613104 id(b[i]) = 4312613104
i = 1 id(a[i]) = 4312613136 id(b[i]) = 4312613136
i = 2 id(a[i]) = 4312613168 id(b[i]) = 4312613168

Take a look at the memory address of the shallow copy:

a = [[1, 2, 3], 4, 5]
b = a.copy()
print(f"{id(a) = } {id(b) = }")
for i in range(len(a)):
    print(f"{i = } {id(a[i]) = } {id(b[i]) = }")

As analyzed above, calling the list's own copy method makes a shallow copy: only the list's pointers are copied, not the objects they point to. If we walk through both lists and print the address of each element, a and b therefore give the same results. Unlike the previous example, however, the lists that a and b point to have different addresses, because the outer list itself was copied.

The output below confirms this:

id(a) = 4392953984 id(b) = 4393050112 # the two lists are different objects
i = 0 id(a[i]) = 4393045632 id(b[i]) = 4393045632 # same object in memory, so the addresses are equal; likewise below
i = 1 id(a[i]) = 4312613200 id(b[i]) = 4312613200
i = 2 id(a[i]) = 4312613232 id(b[i]) = 4312613232

The copy module

Python's standard library includes the copy module, which is used for copying objects. Its two main functions are copy.copy(x) and copy.deepcopy(x).

copy.copy(x) performs a shallow copy. For a list it does the same thing as the list's own x.copy() method: it constructs a new Python object and copies into it the references (pointers) held by x.
copy.deepcopy(x) performs a deep copy: it constructs a new object and then recursively visits each object inside x. Immutable objects such as integers and strings are not copied; for mutable objects, new memory is allocated and the contents are copied into it. (Mutable and immutable objects are covered in the next section.)
A deep copy therefore costs more than a shallow copy; when an object contains many sub-objects, it can take considerably more time and memory.
The difference between deep and shallow copies matters only for compound objects, that is, objects that contain other objects, such as lists, tuples, and class instances. This is closely related to the mutable and immutable objects discussed in the next section.
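
The sketch below illustrates the difference on two kinds of compound objects. Node is a hypothetical class written only for this example, and the tuple results reflect how the standard library's copy module behaves in CPython, so treat them as an observation rather than a language guarantee.

```python
import copy

class Node:
    """Hypothetical compound object used only for this illustration."""
    def __init__(self, children):
        self.children = children

node = Node([1, 2, 3])
shallow = copy.copy(node)
deep = copy.deepcopy(node)

print(shallow is node, shallow.children is node.children)  # False True
print(deep is node, deep.children is node.children)        # False False

frozen = (1, 2, 3)              # tuple holding only immutable items
mixed = ([1], 2)                # tuple holding a mutable list
print(copy.deepcopy(frozen) is frozen)        # True
print(copy.deepcopy(mixed)[0] is mixed[0])    # False
```

A tuple is immutable itself, but when it contains a mutable object, the deep copy still has to build a new tuple around the copied contents.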

Mutable and immutable objects and object copying

There are two main kinds of objects in Python: mutable and immutable. A mutable object is one whose content can be changed; an immutable object is one whose content cannot.

Mutable objects: lists (list), dictionaries (dict), sets (set), byte arrays (bytearray), and, by default, instances of user-defined classes.
Immutable objects: integers (int), floats (float), complex numbers (complex), strings (str), tuples (tuple), frozen sets (frozenset), and bytes (bytes).
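
A quick way to probe this classification for sequence types is to attempt an item assignment and see whether Python raises TypeError. The helper supports_item_assignment below is a hypothetical function written for this sketch, not part of any library.

```python
def supports_item_assignment(obj, value):
    """Return True if obj[0] = value succeeds, False if the type forbids it."""
    try:
        obj[0] = value
    except TypeError:
        return False
    return True

print(supports_item_assignment([1, 2, 3], 0))           # True
print(supports_item_assignment(bytearray(b"abc"), 0))   # True
print(supports_item_assignment((1, 2, 3), 0))           # False
print(supports_item_assignment("abc", 0))               # False
print(supports_item_assignment(b"abc", 0))              # False
```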

At this point you may wonder: can't integers and strings be modified?

a = 10
a = 100
a = "hello"
a = "world"

The code above is valid and raises no error, but what actually changes is the object that a points to. When a variable that points to an integer or a string is assigned a different integer or string, the variable is simply made to point to another object; the original object is not modified. We can verify this with the following code:

a = 10
print(f"{id(a) = }")
a = 100
print(f"{id(a) = }")
a = "hello"
print(f"{id(a) = }")
a = "world"
print(f"{id(a) = }")

The output of the above program is as follows:

id(a) = 4365566480
id(a) = 4365569360
id(a) = 4424109232
id(a) = 4616350128

As you can see, after each reassignment the variable points to a different object in memory (the address has changed). This is what immutable means: the variable can be reassigned, but the new value is a different object, not a modification of the original one.

Let's now take a look at how the memory address changes after the mutable object list is modified:

data = []
print(f"{id(data) = }")
data.append(1)
print(f"{id(data) = }")
data.append(1)
print(f"{id(data) = }")
data.append(1)
print(f"{id(data) = }")
data.append(1)
print(f"{id(data) = }")

The output of the above code is as follows:

id(data) = 4614905664
id(data) = 4614905664
id(data) = 4614905664
id(data) = 4614905664
id(data) = 4614905664

From this output we can see that adding data to the list (modifying it) does not change the address of the list itself. This is what mutable means: the object is modified in place.
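
The contrast between the two kinds of objects shows up clearly with augmented assignment. In this sketch a second reference is kept to each original object (old_number and old_items are names chosen for the illustration), so the comparison does not depend on specific id() values.

```python
number = 1000
old_number = number       # keep a second reference to the original int
number += 1               # builds a new int and rebinds the name
print(number is old_number, old_number)   # False 1000

items = [1]
old_items = items         # second reference to the same list
items += [2]              # list.__iadd__ extends the list in place
print(items is old_items, old_items)      # True [1, 2]
```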

We talked about deep copy and shallow copy earlier, let's analyze the following code now:

import copy

data = [1, 2, 3]
data_copy = copy.copy(data)
data_deep = copy.deepcopy(data)
print(f"{id(data ) = } | {id(data_copy) = } | {id(data_deep) = }")
print(f"{id(data[0]) = } | {id(data_copy[0]) = } | {id(data_deep[0]) = }")
print(f"{id(data[1]) = } | {id(data_copy[1]) = } | {id(data_deep[1]) = }")
print(f"{id(data[2]) = } | {id(data_copy[2]) = } | {id(data_deep[2]) = }")

The output of the above code is as follows:

id(data ) = 4620333952 | id(data_copy) = 4619860736 | id(data_deep) = 4621137024
id(data[0]) = 4365566192 | id(data_copy[0]) = 4365566192 | id(data_deep[0]) = 4365566192
id(data[1]) = 4365566224 | id(data_copy[1]) = 4365566224 | id(data_deep[1]) = 4365566224
id(data[2]) = 4365566256 | id(data_copy[2]) = 4365566256 | id(data_deep[2]) = 4365566256

At this point you may be confused: why do the elements of the deep copy point to the same memory objects as the elements of the shallow copy? From the previous section we know that a shallow copy copies references, so its elements are the same objects as the original's. But why is this also true of the deep copy? The reason is that the elements of the list are integers, which are immutable. Assigning to a slot of data or data_copy, for example data[0] = 100, makes that slot point to a different object and never modifies the original integer. Since an immutable object can never change, there is no need to allocate new memory to duplicate it, and sharing it is safe.
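
A minimal sketch of why this sharing is harmless: even though the deep copy's first slot initially refers to the very same integer object, assigning to that slot only rebinds it and leaves the original list untouched. The name shared_before is used just for this example.

```python
import copy

data = [1, 2, 3]
data_deep = copy.deepcopy(data)

shared_before = data[0] is data_deep[0]   # the int 1 is shared
data_deep[0] = 100                        # rebinds slot 0 of data_deep only
print(shared_before, data, data_deep)
# True [1, 2, 3] [100, 2, 3]
```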

Now let's look at a list of mutable objects:

import copy

data = [[1], [2], [3]]
data_copy = copy.copy(data)
data_deep = copy.deepcopy(data)
print(f"{id(data ) = } | {id(data_copy) = } | {id(data_deep) = }")
print(f"{id(data[0]) = } | {id(data_copy[0]) = } | {id(data_deep[0]) = }")
print(f"{id(data[1]) = } | {id(data_copy[1]) = } | {id(data_deep[1]) = }")
print(f"{id(data[2]) = } | {id(data_copy[2]) = } | {id(data_deep[2]) = }")

The output of the above code is as follows:

id(data ) = 4619403712 | id(data_copy) = 4617239424 | id(data_deep) = 4620032640
id(data[0]) = 4620112640 | id(data_copy[0]) = 4620112640 | id(data_deep[0]) = 4620333952
id(data[1]) = 4619848128 | id(data_copy[1]) = 4619848128 | id(data_deep[1]) = 4621272448
id(data[2]) = 4620473280 | id(data_copy[2]) = 4620473280 | id(data_deep[2]) = 4621275840

From this output we can see that when the list holds mutable objects, a deep copy creates brand-new objects for them: the element addresses in the deep copy differ from those in the original and in the shallow copy.
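
Instead of comparing long addresses by eye, the same check can be done programmatically. The helper shared_children below is hypothetical and exists only for this sketch; it reports, position by position, whether two lists hold the very same object.

```python
import copy

def shared_children(original, clone):
    """Hypothetical helper: for each position, is the same object in both lists?"""
    return [x is y for x, y in zip(original, clone)]

data = [[1], [2], [3]]
print(shared_children(data, copy.copy(data)))      # [True, True, True]
print(shared_children(data, copy.deepcopy(data)))  # [False, False, False]
```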

Code Fragment Analysis

After working through the sections above, the question posed at the beginning of this article should be easy. Let's analyze the code snippets one by one:

a = [1, 2, 3, 4]
b = a
print(f"{a = } \t|\t {b = }")
a[0] = 100
print(f"{a = } \t|\t {b = }")
print(f"{id(a) = } \t|\t {id(b) = }")

This one is simple. The two variables a and b point to the same list, so a change made through a is also visible through b. The output is as follows:

a = [1, 2, 3, 4] 	|	 b = [1, 2, 3, 4]
a = [100, 2, 3, 4] 	|	 b = [100, 2, 3, 4]
id(a) = 4614458816 	|	 id(b) = 4614458816

Let's take a look at the second code snippet:

a = [1, 2, 3, 4]
b = a.copy()
print(f"{a = } \t|\t {b = }")
a[0] = 100
print(f"{a = } \t|\t {b = }")

Because b is a shallow copy of a, a and b point to different lists, although the elements of the two lists initially point to the same objects. The statement a[0] = 100 does not modify the integer 1 (integers are immutable); it makes slot 0 of list a point to the integer object 100. Slot 0 of list b still points to 1, so b is unchanged. The output is therefore:

a = [1, 2, 3, 4] 	|	 b = [1, 2, 3, 4]
a = [100, 2, 3, 4] 	|	 b = [1, 2, 3, 4]

Let's take a look at the third fragment:

a = [[1, 2, 3], 2, 3, 4]
b = a.copy()
print(f"{a = } \t|\t {b = }")
a[0][0] = 100
print(f"{a = } \t|\t {b = }")

This is similar to the second snippet, except that a[0] is a mutable object (a list) that a and b share. The statement a[0][0] = 100 modifies that inner list in place; neither a[0] nor b[0] is made to point anywhere else, so the change made through a is also visible through b:

a = [[1, 2, 3], 2, 3, 4] 	|	 b = [[1, 2, 3], 2, 3, 4]
a = [[100, 2, 3], 2, 3, 4] 	|	 b = [[100, 2, 3], 2, 3, 4]
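
The second and third snippets differ only in whether a slot is rebound or a shared object is mutated. The following sketch puts both operations side by side on one shallow copy; the replacement value ["new"] is arbitrary.

```python
a = [[1, 2, 3], 2, 3, 4]
b = a.copy()

a[0][0] = 100            # mutates the shared inner list in place
print(b[0])              # [100, 2, 3]

a[0] = ["new"]           # rebinds slot 0 of a; b still holds the old inner list
print(a[0], b[0])        # ['new'] [100, 2, 3]
```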

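The fourth snippet uses copy.copy(a), which for a list does the same thing as a.copy(), so its output is identical to that of the third snippet. That leaves the last snippet: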

import copy

a = [[1, 2, 3], 2, 3, 4]
b = copy.deepcopy(a)
print(f"{a = } \t|\t {b = }")
a[0][0] = 100
print(f"{a = } \t|\t {b = }")

A deep copy creates in memory a new object equal to a[0] and makes b[0] point to it, so modifying a[0] does not affect b[0]. The output is as follows:

a = [[1, 2, 3], 2, 3, 4] 	|	 b = [[1, 2, 3], 2, 3, 4]
a = [[100, 2, 3], 2, 3, 4] 	|	 b = [[1, 2, 3], 2, 3, 4]

Demystifying Python objects

Let's take a brief look at how CPython implements the list data structure and what a list object contains:

typedef struct {
    PyObject_VAR_HEAD
    /* Vector of pointers to list elements.  list[0] is ob_item[0], etc. */
    PyObject **ob_item;

    /* ob_item contains space for 'allocated' elements.  The number
     * currently in use is ob_size.
     * Invariants:
     *     0 <= ob_size <= allocated
     *     len(list) == ob_size
     *     ob_item == NULL implies ob_size == allocated == 0
     * list.sort() temporarily sets allocated to -1 to detect mutations.
     *
     * Items must normally not be NULL, except during construction when
     * the list is not yet visible outside the function that builds it.
     */
    Py_ssize_t allocated;
} PyListObject;

Among the structures defined above:

allocated is the amount of allocated space, that is, the number of pointers the array can hold. When this space is used up, more memory has to be requested.
ob_item points to the array in memory that actually stores the pointers to Python objects. For example, the pointer to the first object in the list is list->ob_item[0], and the object itself is *(list->ob_item[0]).
PyObject_VAR_HEAD is a macro that defines a substructure inside the structure. The substructure is defined as follows:

typedef struct {
    PyObject ob_base;
    Py_ssize_t ob_size; /* Number of items in variable part */
} PyVarObject;

We will not discuss PyObject here and focus on ob_size, which is the number of items stored in the list. This differs from allocated: allocated is the capacity of the array that ob_item points to, while ob_size is the number of slots currently in use, so ob_size <= allocated always holds.
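
The gap between ob_size and allocated cannot be read directly from Python, but its effect can be observed in CPython with sys.getsizeof(), which reports the list object's size including its pointer array. The sketch below is CPython-specific, the exact byte counts vary between versions and platforms, and the names sizes and resize_count are chosen only for this example.

```python
import sys

data = []
sizes = []
for i in range(64):
    data.append(i)
    # Size of the list object in bytes, including the array of pointers.
    sizes.append(sys.getsizeof(data))

# Count how many appends actually changed the reported size.
resize_count = sum(1 for prev, cur in zip(sizes, sizes[1:]) if cur != prev)
print(len(data), resize_count)
```

The size changes on far fewer appends than there are items, because each reallocation reserves room for several future pointers (allocated grows ahead of ob_size).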

Now that we know the structure of a list, the memory layouts shown earlier should make sense: a list never stores the data itself, only pointers to it.

Summary

This article covered object copying and memory layout in Python, showed how to verify the analysis with object memory addresses, and finally looked at the structure CPython uses internally to implement lists, to help you understand the memory layout of list objects.

That is all for this article. I am a bully, see you in the next issue!