Python source code analysis 2: the string object PyStringObject


1. PyStringObject and PyString_Type

PyStringObject is an immutable object among the variable-length objects. Once a PyStringObject has been created, the string it holds cannot be changed. This immutability is what allows a PyStringObject to be used as a key of a PyDictObject, but it also makes some string operations much less efficient, for example concatenating many strings.

[stringobject.h]
typedef struct {
	PyObject_VAR_HEAD
	long ob_shash;
	int ob_sstate;
	char ob_sval[1];
} PyStringObject;

ob_sval is the character array that stores the actual string. Its real length is ob_size+1, because it holds a native C string and needs one extra byte for the terminator. Note, however, that the array is declared with a length of 1: ob_sval is really used as a pointer to the first byte of the character data. The function that creates the string allocates enough memory for ob_size+1 bytes of character data, and ob_sval[ob_size] == '\0' must always hold.

The ob_shash field of PyStringObject caches the hash value of the object, so the hash of a string does not have to be recomputed every time it is needed. If the hash of a PyStringObject has not been computed yet, ob_shash holds its initial value of -1. The ob_sstate field records whether the object has been interned.
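The layout and the hash cache can be tried out in a small standalone model. This is only a sketch: MiniString, mini_new, mini_chars and mini_hash are hypothetical names invented for illustration, and the hash function is a simple djb2-style hash rather than the one CPython uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for PyStringObject; not the real CPython type. */
typedef struct {
    size_t size;     /* plays the role of ob_size */
    long   hash;     /* plays the role of ob_shash; -1 means "not computed" */
    char   sval[1];  /* declared with length 1, really size + 1 bytes */
} MiniString;

static int hash_computations = 0;  /* counts real hash computations */

/* The character data starts at the address of sval. */
static char *mini_chars(MiniString *op)
{
    return (char *)op + offsetof(MiniString, sval);
}

/* Allocate the header and the character data in one block. */
static MiniString *mini_new(const char *str)
{
    size_t size = strlen(str);
    /* sizeof(MiniString) already includes one byte of sval,
       which pays for the terminator. */
    MiniString *op = malloc(sizeof(MiniString) + size);
    if (op == NULL)
        return NULL;
    op->size = size;
    op->hash = -1;
    memcpy(mini_chars(op), str, size + 1);  /* copies the '\0' as well */
    return op;
}

/* Return the cached hash, computing it only on first use. */
static long mini_hash(MiniString *op)
{
    if (op->hash != -1)
        return op->hash;
    unsigned long h = 5381;  /* illustrative hash, not CPython's algorithm */
    const char *p = mini_chars(op);
    for (size_t i = 0; i < op->size; i++)
        h = h * 33 + (unsigned char)p[i];
    hash_computations++;
    op->hash = (long)(h & 0x7fffffffUL);  /* never negative, so never -1 */
    return op->hash;
}
```

Calling mini_hash twice on the same object runs the loop only once; the second call returns the cached field.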

 PyString_Type:

PyTypeObject PyString_Type = {
    PyVarObject_HEAD_INIT(&PyType_Type, 0)
    "str",
    PyStringObject_SIZE,
    ... 
    &string_as_number,   /* tp_as_number */
    &string_as_sequence, /* tp_as_sequence */
    &string_as_mapping,  /* tp_as_mapping */
    (hashfunc)string_hash, /* tp_hash */
    ...
    string_str,          /* tp_str: points to the string_str function */
    ...
    string_methods,      /* tp_methods */
    ....
    string_new,          /* tp_new: creates new instances */
    PyObject_Del,        /* tp_free */
};

In the full definition of PyString_Type, tp_itemsize is set to sizeof(char). Every variable-length object type in Python must set the tp_itemsize field, which gives the size of one element stored by the object. tp_itemsize and ob_size together determine how much additional memory has to be allocated.
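The arithmetic behind this can be sketched as follows. The function var_object_size and its align parameter are hypothetical; the rounding step is modeled on the way CPython rounds variable-object sizes up to a pointer-size boundary, and the real macro may differ in detail.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes needed for a variable-length object:
   fixed part + nitems elements, rounded up to a multiple of align.
   align must be a power of two (for example sizeof(void *)). */
static size_t var_object_size(size_t basicsize, size_t itemsize,
                              size_t nitems, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t raw = basicsize + nitems * itemsize;
    return (raw + align - 1) & ~(align - 1);
}
```

For example, with a fixed part of 24 bytes, an item size of 1 and 6 characters, the raw size is 30 bytes, which rounds up to 32 with 8-byte alignment.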

Note that the tp_as_number, tp_as_sequence, and tp_as_mapping fields of the string type object are all set. This means that PyStringObject can respond to number, sequence, and mapping operations. (The number slots are mostly empty; they are there mainly so that the % operator can perform string formatting.)

2. PyStringObject object creation

PyObject *PyString_FromString(const char *str)
{
	register size_t size;
	register PyStringObject *op;
 
	assert(str != NULL);
	/* [1] Check the string length */
	size = strlen(str);
	if (size > PY_SSIZE_T_MAX - sizeof(PyStringObject)) {
		PyErr_SetString(PyExc_OverflowError,
			"string is too long for a Python string");
		return NULL;
	}
	/* [2] Handle the empty (null) string */
	if (size == 0 && (op = nullstring) != NULL) {
#ifdef COUNT_ALLOCS
		null_strings++;
#endif
		Py_INCREF(op);
		return (PyObject *)op;
	}
	/* [3] Handle a single character: fetch it from the cache pool */
	if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {
#ifdef COUNT_ALLOCS
		one_strings++;
#endif
		Py_INCREF(op);
		return (PyObject *)op;
	}
	/* [4] Create a new PyStringObject and initialize it */
	/* Inline PyObject_NewVar */
	op = (PyStringObject *)PyObject_MALLOC(sizeof(PyStringObject) + size);
	if (op == NULL)
		return PyErr_NoMemory();
	PyObject_INIT_VAR(op, &PyString_Type, size);
	op->ob_shash = -1;
	op->ob_sstate = SSTATE_NOT_INTERNED;
	Py_MEMCPY(op->ob_sval, str, size+1);
	.........
	return (PyObject *) op;
}

[1] Check the length of the string. If the string is too long, no object is created: an OverflowError is set and NULL is returned. On the win32 platform the limit, PY_SSIZE_T_MAX, is 2147483647.

[2] Handle the empty string. The nullstring pointer is initialized to NULL, so the first time a PyStringObject is created from an empty string, Python creates a new object for it, shares that object through the intern mechanism, and then makes nullstring point to the shared object. Later requests for an empty string return this object directly.

[4] Allocate memory for the object, including the extra space for the characters of the string, then set the hash value to -1 and the intern flag to SSTATE_NOT_INTERNED. Finally, copy the character array pointed to by str into the space maintained by the PyStringObject. For example, for the string "Python" the new object has ob_size 6, ob_shash -1, ob_sstate SSTATE_NOT_INTERNED, and ob_sval holding 'P', 'y', 't', 'h', 'o', 'n' followed by '\0'.

3. Intern mechanism

The purpose of the intern mechanism is this: once a string has been interned, only one PyStringObject with that content exists in the system while Python is running. When checking whether two PyStringObject objects are the same, if both are interned it is enough to check whether their PyObject* pointers are equal. The mechanism therefore saves space and simplifies the comparison of PyStringObject objects.
When a PyStringObject a is interned, Python first checks whether the interned dict already contains an object b whose string is the same as that of a. If it does, the PyObject pointer that pointed to a is made to point to b, the reference count of a is decreased by 1, and the reference count of b is increased by 1. If it does not, a itself is added to interned.

For PyStringObject objects handled by the intern mechanism, Python uses a special reference counting rule. When the PyObject pointer of a PyStringObject a is added to interned as both key and value, the reference count of a is increased twice. The designers decided that the pointers to a held inside interned should not count as valid references, so the interning code in PyString_InternInPlace then subtracts 2 from the reference count of a; otherwise a could never be deleted.
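The behavior described in this section can be sketched with a small model. Everything here is hypothetical (MiniStr, interned_table, ministr_intern_in_place); in particular a fixed-size array with a linear search stands in for the interned dict, which is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical model of an internable string object. */
typedef struct {
    int   refcnt;
    int   interned;
    char *text;
} MiniStr;

#define TABLE_MAX 64
static MiniStr *interned_table[TABLE_MAX];  /* stand-in for the interned dict */
static int interned_count = 0;

static MiniStr *ministr_new(const char *text)
{
    MiniStr *s = malloc(sizeof *s);
    if (s == NULL)
        return NULL;
    s->text = malloc(strlen(text) + 1);
    if (s->text == NULL) {
        free(s);
        return NULL;
    }
    strcpy(s->text, text);
    s->refcnt = 1;
    s->interned = 0;
    return s;
}

/* Drop one reference; an interned object leaves the table when it dies. */
static void ministr_decref(MiniStr *s)
{
    if (--s->refcnt > 0)
        return;
    if (s->interned) {
        for (int i = 0; i < interned_count; i++) {
            if (interned_table[i] == s) {
                interned_table[i] = interned_table[--interned_count];
                break;
            }
        }
    }
    free(s->text);
    free(s);
}

/* Intern *p in place: afterwards *p is the single shared object for its text. */
static void ministr_intern_in_place(MiniStr **p)
{
    MiniStr *a = *p;
    if (a->interned)
        return;
    for (int i = 0; i < interned_count; i++) {
        MiniStr *b = interned_table[i];
        if (strcmp(b->text, a->text) == 0) {
            b->refcnt++;        /* the caller now references b ... */
            *p = b;
            ministr_decref(a);  /* ... and gives up its reference to a */
            return;
        }
    }
    assert(interned_count < TABLE_MAX);
    interned_table[interned_count++] = a;
    a->refcnt += 2;  /* what storing a as key and as value would add */
    a->refcnt -= 2;  /* references held by the table are not counted */
    a->interned = 1;
}
```

After two equal strings have both been interned, the two pointers are identical, so comparing them is a single pointer comparison.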

4. Character buffer pool

Just as Python keeps an object pool for small integers, it also keeps a buffer pool for single characters. The Python designers gave PyStringObject an object pool named characters, an array with one slot for each possible character value.

The newly created character object is first interned, and the result of the intern operation is then cached in the character buffer pool characters. Caching the PyStringObject that corresponds to a character takes three steps:

  1. Create a PyStringObject object p;
  2. Perform an intern operation on p;
  3. Cache p in the character buffer pool.
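A rough model of the pool lookup is shown below. CharStr, char_string and allocations are hypothetical names, and step 2 (the intern operation) is left out to keep the sketch short; only the create-then-cache behavior is modeled.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical one-character string object. */
typedef struct {
    int  refcnt;
    char sval[2];  /* the character plus '\0' */
} CharStr;

static CharStr *characters[UCHAR_MAX + 1];  /* every slot starts as NULL */
static int allocations = 0;                 /* counts real allocations */

/* Return the shared object for c, creating and caching it on first use. */
static CharStr *char_string(char c)
{
    CharStr *op = characters[c & UCHAR_MAX];
    if (op != NULL) {
        op->refcnt++;  /* cache hit: hand out another reference */
        return op;
    }
    op = malloc(sizeof *op);  /* step 1: create the object */
    if (op == NULL)
        return NULL;
    allocations++;
    op->sval[0] = c;
    op->sval[1] = '\0';
    op->refcnt = 1;
    /* step 2 (intern) is omitted in this model */
    characters[c & UCHAR_MAX] = op;  /* step 3: cache it */
    op->refcnt++;  /* the pool keeps its own reference */
    return op;
}
```

Asking for the same character twice returns the same object and allocates memory only once.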

5. PyStringObject efficiency problem

Strings can be concatenated with + in Python, but this performs poorly. Because PyStringObject is immutable, every concatenation has to create a new PyStringObject. Concatenating N PyStringObject objects this way therefore needs N-1 memory allocations, which seriously hurts Python's execution efficiency.

static PyObject *
string_join(PyStringObject *self, PyObject *orig)
{
	char *sep = PyString_AS_STRING(self);
	const Py_ssize_t seplen = PyString_GET_SIZE(self);
	PyObject *res = NULL;
	char *p;
	Py_ssize_t seqlen = 0;
	size_t sz = 0;
	Py_ssize_t i;
	PyObject *seq, *item;
 
	seq = PySequence_Fast(orig, "");
	 .....
	/* [1] Walk the list and add up the lengths of all the strings */
 
	for (i = 0; i < seqlen; i++) {
		const size_t old_sz = sz;
		item = PySequence_Fast_GET_ITEM(seq, i);
		
		sz += PyString_GET_SIZE(item);
		if (i != 0)
			sz += seplen;
		
	}
	/* Create a PyStringObject of length sz */
	res = PyString_FromStringAndSize((char*)NULL, sz);
	if (res == NULL) {
		Py_DECREF(seq);
		return NULL;
	}
	/* Copy the strings in the list into the new PyStringObject */
	p = PyString_AS_STRING(res);
	for (i = 0; i < seqlen; ++i) {
		size_t n;
		item = PySequence_Fast_GET_ITEM(seq, i);
		n = PyString_GET_SIZE(item);
		Py_MEMCPY(p, PyString_AS_STRING(item), n);
		p += n;
		if (i < seqlen - 1) {
			Py_MEMCPY(p, sep, seplen);
			p += seplen;
		}
	}
 
	Py_DECREF(seq);
	return res;
}

The join operation first counts how many PyStringObject objects the list contains and adds up the total length of the strings they hold. It then allocates memory once and copies the strings of all the PyStringObject objects in the list into the newly allocated space. Joining N PyStringObject objects needs only one memory allocation, which saves N-2 allocations compared with the + operation, so the efficiency gain is obvious.
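The difference in allocation counts can be checked with plain C strings. This is a sketch under simplifying assumptions: concat_pairwise, join_once and counted_alloc are hypothetical helpers, and ordinary char buffers stand in for PyStringObject.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

static int alloc_count = 0;  /* how many buffers have been allocated */

static char *counted_alloc(size_t n)
{
    alloc_count++;
    return malloc(n);
}

/* "+" style: every step builds a brand-new string from the previous result. */
static char *concat_pairwise(const char **items, size_t n)
{
    assert(n >= 2);
    const char *acc = items[0];
    char *owned = NULL;  /* the intermediate result we have to free */
    for (size_t i = 1; i < n; i++) {
        size_t la = strlen(acc), lb = strlen(items[i]);
        char *next = counted_alloc(la + lb + 1);
        if (next == NULL) {
            free(owned);
            return NULL;
        }
        memcpy(next, acc, la);
        memcpy(next + la, items[i], lb + 1);  /* includes the '\0' */
        free(owned);
        owned = next;
        acc = next;
    }
    return owned;
}

/* join style: measure everything first, allocate once, then copy. */
static char *join_once(const char *sep, const char **items, size_t n)
{
    size_t seplen = strlen(sep), sz = 0;
    for (size_t i = 0; i < n; i++) {
        sz += strlen(items[i]);
        if (i != 0)
            sz += seplen;
    }
    char *res = counted_alloc(sz + 1);
    if (res == NULL)
        return NULL;
    char *p = res;
    for (size_t i = 0; i < n; i++) {
        size_t len = strlen(items[i]);
        memcpy(p, items[i], len);
        p += len;
        if (i + 1 < n) {
            memcpy(p, sep, seplen);
            p += seplen;
        }
    }
    *p = '\0';
    return res;
}
```

With four items, concat_pairwise allocates three buffers while join_once allocates one, matching the N-1 versus 1 count above.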
 

————————————————

reference:

  • Python source code analysis (Chen Ru)


Reposted from: blog.csdn.net/qq_19446965/article/details/128172085