In-depth understanding of the underlying implementation of PHP arrays

The PHP array is a versatile and powerful data structure: it can be a contiguous list or a map that stores key-value pairs. Compared to PHP5, the array implementation was substantially reworked in PHP7.

  • Semantics of arrays
  • The concept of an array
  • Implementation of PHP5 array
  • Implementation of PHP7 array
    - basic structure
    - initialization
    - difference between packed array and hash array
    - insert, update, search, delete
    - resolution of hash conflict
    - expansion and rehash operation
    - recursive protection of array

1. The semantics of arrays
Essentially, a PHP array is an ordered dictionary, which needs to satisfy two semantics at the same time.
Semantics 1: A PHP array is a dictionary that stores key-value pairs. The value can be found quickly through its key, which can be an integer or a string.
Semantics 2: A PHP array is ordered. The order is the insertion order: when traversing the array, elements are visited in the order they were inserted, not in an arbitrary order as in an ordinary dictionary.
To achieve Semantics 1, PHP uses a HashTable to store key-value pairs, but a HashTable by itself cannot guarantee Semantics 2. Different versions of PHP add extra design on top of the HashTable to keep it ordered, as described below.
2. The concept of an array
key: the key, through which the corresponding value can be retrieved quickly. Usually a number or a string.
value: the value, i.e. the target data. It can be a complex data structure.
bucket: the unit in which a HashTable stores data. A container that holds the key, the value, and auxiliary information.
slot: a HashTable has multiple slots. A bucket belongs to exactly one slot, and a slot can hold multiple buckets.
Hash function: it has to be provided by the implementation. When storing, the hash function is applied to the key to determine the slot.
Hash conflict: when several keys hash to the same slot, this is called a hash conflict. The usual ways to resolve conflicts are chaining and open addressing. PHP uses chaining: the buckets in the same slot are linked together in a linked list.
In its actual implementation, PHP extends the bucket and the hash function described above: a hash1 function generates an h value from the key, and a hash2 function then maps that h value to a slot.
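The two-step hashing can be sketched roughly as follows. This is only an illustration: the names hash1_string, hash1_int, and hash2 are hypothetical, and the "times 33" string hash is a stand-in for the engine's real string hash, whose details are not given in this article.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* hash1 for string keys: a simple "times 33" hash, used here only as a
 * stand-in (assumption) for the engine's real string hash. */
static uint64_t hash1_string(const char *key, size_t len)
{
    uint64_t h = 5381;
    for (size_t i = 0; i < len; i++) {
        h = h * 33 + (unsigned char)key[i];
    }
    return h;
}

/* hash1 for numeric keys: the key itself is used as the h value. */
static uint64_t hash1_int(uint64_t key)
{
    return key;
}

/* hash2: map an h value to a slot. table_size must be a power of two,
 * so that (table_size - 1) works as a bit mask. */
static uint32_t hash2(uint64_t h, uint32_t table_size)
{
    assert(table_size != 0 && (table_size & (table_size - 1)) == 0);
    return (uint32_t)(h & (table_size - 1));
}
```

With a table of 8 slots, the numeric keys 5 and 13 both land in slot 5, which is exactly the hash conflict case described above.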
The purpose of adding this intermediate h value:

  1. A key in the HashTable may be a number or a string, so the bucket has to store the two kinds of key separately. In the bucket, "h" holds the numeric key and "key" holds the string key. In fact, hash1 does nothing for a numeric key: the number itself is used as h.
  2. Every string key has an h value, which speeds up string comparison. To check whether two string keys are equal, first compare the h values of key1 and key2. Only if they are equal are the string lengths and contents compared. Otherwise the keys are known to be different immediately.
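The comparison shortcut from point 2 can be sketched like this. The type sketch_key and the function sketch_key_equals are hypothetical names for illustration and are not the engine's own structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical string key that caches its own hash value. */
typedef struct {
    uint64_t    h;    /* cached h value of the string */
    size_t      len;  /* string length in bytes */
    const char *val;  /* string contents */
} sketch_key;

/* Compare two keys: cheap integer checks first, memcmp only as a last step. */
static bool sketch_key_equals(const sketch_key *a, const sketch_key *b)
{
    assert(a != NULL && b != NULL);
    if (a->h != b->h) {
        return false;               /* different hash: cannot be equal */
    }
    if (a->len != b->len) {
        return false;               /* same hash, different length */
    }
    return memcmp(a->val, b->val, a->len) == 0;
}
```

Two different strings can still share an h value, so the final memcmp is required for correctness; the h check only lets most mismatches exit early.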
3. Implementation of PHP5 array
First, look at the definition of the PHP5 bucket and HashTable structures:
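typedef struct bucket {
    /* Sizes below assume a 32-bit build. */
    ulong h;                   /* 4 bytes: hash of the char *key, or the user-specified numeric index */
    uint nKeyLength;           /* 4 bytes: length of the string key; 0 for a numeric index */
    void *pData;               /* 4 bytes: address of the actual data (the value), usually a copy of the user data; for pointer data it points to pDataPtr, and the zval itself is stored elsewhere */
    void *pDataPtr;            /* 4 bytes: for pointer data, holds the pointer to the real value, and pData above points to this field */
    struct bucket *pListNext;  /* 4 bytes: next element in the whole hash table */
    struct bucket *pListLast;  /* 4 bytes: previous element in the whole hash table */
    struct bucket *pNext;      /* 4 bytes: next element of the doubly linked list in the same slot */
    struct bucket *pLast;      /* 4 bytes: previous element of the doubly linked list in the same slot */
    char arKey[1];             /* 1 byte: key string of this element; must be the last field so that the struct can be variable-length */
} Bucket;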

(1) Compared with the basic design, the bucket here contains three groups of fields:
arKey: corresponds to the key in the HashTable design and holds the string key.
h: corresponds to h in the HashTable design and holds the numeric key, or the h value of the string key.
pData and pDataPtr: correspond to the value in the HashTable design.
Normally the value is stored in memory pointed to by pData, and pDataPtr is NULL. But if the size of the value equals the size of a pointer, no extra memory is allocated: the value is stored directly in pDataPtr, and pData points to pDataPtr. This reduces memory fragmentation.
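The pData / pDataPtr arrangement can be sketched as below. This is a simplified illustration using malloc; the names sketch_bucket, sketch_store, and sketch_release are hypothetical, and only the two field names are taken from the structure above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Only the two value-related fields of the bucket are modelled here. */
typedef struct {
    void *pData;     /* always points at the stored value */
    void *pDataPtr;  /* holds the value itself when it is pointer-sized */
} sketch_bucket;

/* Store a copy of `value` in the bucket. Returns 0 on success, -1 on failure. */
static int sketch_store(sketch_bucket *b, const void *value, size_t size)
{
    assert(b != NULL && value != NULL && size > 0);
    if (size == sizeof(void *)) {
        /* Pointer-sized value: keep it inline, no extra allocation. */
        memcpy(&b->pDataPtr, value, sizeof(void *));
        b->pData = &b->pDataPtr;
        return 0;
    }
    /* Any other size: allocate separate memory and copy the value there. */
    b->pData = malloc(size);
    if (b->pData == NULL) {
        return -1;
    }
    memcpy(b->pData, value, size);
    b->pDataPtr = NULL;
    return 0;
}

/* Free the value, but only if it was allocated separately. */
static void sketch_release(sketch_bucket *b)
{
    if (b->pData != (void *)&b->pDataPtr) {
        free(b->pData);
    }
    b->pData = NULL;
    b->pDataPtr = NULL;
}
```

Because array values are usually stored as pointers to zvals, the pointer-sized branch is the common case, which is why this shortcut saves an allocation per element.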
(2) To implement the two semantics of the array, the bucket has four pointers, pListLast, pListNext, pLast, and pNext, which maintain two kinds of doubly linked lists. One is the global linked list, which links all buckets in insertion order; the entire HashTable has exactly one global linked list. The other is the local linked list: to resolve hash conflicts, each slot maintains a linked list of all buckets that collide in that slot. In other words, every bucket sits on two doubly linked lists. pLast and pNext point to the previous and next buckets of the local linked list, while pListLast and pListNext point to the previous and next buckets of the global linked list.
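A rough sketch of how one insertion touches both lists is shown below. It is deliberately reduced: keys are integers only, updates of existing keys are not handled, and the names sketch_node, sketch_table, sketch_insert, and sketch_destroy are hypothetical. Pushing the new bucket at the head of the slot chain is an assumption made for brevity.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_SLOTS 8u  /* fixed size for the sketch; must be a power of two */

typedef struct sketch_node {
    unsigned long h;
    int value;
    struct sketch_node *pListNext, *pListLast; /* global list, insertion order */
    struct sketch_node *pNext, *pLast;         /* local list of one slot */
} sketch_node;

typedef struct {
    unsigned int nTableMask;                   /* SKETCH_SLOTS - 1 */
    sketch_node *pListHead, *pListTail;
    sketch_node *arBuckets[SKETCH_SLOTS];
} sketch_table;

/* Insert a new element; one allocation per bucket, as in the PHP5 design. */
static sketch_node *sketch_insert(sketch_table *t, unsigned long h, int value)
{
    assert(t != NULL && t->nTableMask == SKETCH_SLOTS - 1);
    sketch_node *n = calloc(1, sizeof *n);
    if (n == NULL) {
        return NULL;
    }
    n->h = h;
    n->value = value;

    /* Local list: push at the head of the slot's collision chain. */
    unsigned int slot = (unsigned int)(h & t->nTableMask);
    n->pNext = t->arBuckets[slot];
    if (n->pNext != NULL) {
        n->pNext->pLast = n;
    }
    t->arBuckets[slot] = n;

    /* Global list: append at the tail to preserve insertion order. */
    n->pListLast = t->pListTail;
    if (t->pListTail != NULL) {
        t->pListTail->pListNext = n;
    } else {
        t->pListHead = n;
    }
    t->pListTail = n;
    return n;
}

/* Free every bucket by walking the global list. */
static void sketch_destroy(sketch_table *t)
{
    sketch_node *n = t->pListHead;
    while (n != NULL) {
        sketch_node *next = n->pListNext;
        free(n);
        n = next;
    }
    t->pListHead = t->pListTail = NULL;
    for (unsigned int i = 0; i < SKETCH_SLOTS; i++) {
        t->arBuckets[i] = NULL;
    }
}
```

Traversal in insertion order follows pListHead and pListNext only, while a lookup follows arBuckets[slot] and pNext only, so the two semantics are served by two independent sets of pointers.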

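typedef struct _hashtable {
    uint nTableSize;           /* 4: number of slots in the hash table; initial value 8, doubled on every resize */
    uint nTableMask;           /* 4: nTableSize - 1, an optimization for computing the slot index */
    uint nNumOfElements;       /* 4: number of elements currently stored; count() returns this value directly */
    ulong nNextFreeElement;    /* 4: next free numeric index */
    Bucket *pInternalPointer;  /* 4: current traversal position, used for element iteration (one reason foreach is faster than for) */
    Bucket *pListHead;         /* 4: pointer to the first element of the array */
    Bucket *pListTail;         /* 4: pointer to the last element of the array */
    Bucket **arBuckets;        /* 4: array of pointers, one per slot of the hash array */
    dtor_func_t pDestructor;   /* 4: callback run when an element is deleted, used to release resources */
    zend_bool persistent;      /* 1: how Bucket memory is allocated; if TRUE, the operating system's own allocator is used, otherwise PHP's memory allocator */
    unsigned char nApplyCount; /* 1: number of times this hash table is currently being visited recursively (guards against repeated recursion) */
    zend_bool bApplyProtection;/* 1: whether repeated access is disallowed; when disallowed, at most 3 levels of recursion are permitted */
#if ZEND_DEBUG
    int inconsistent;          /* 4 */
#endif
} HashTable;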

Some explanations:
nTableMask: the mask. It is always equal to nTableSize - 1, that is 2^n - 1, so its low n bits are all 1. In the hashing process described above, the key is converted into an h value by the hash1 function, and the h value is converted into a slot by the hash2 function. Here the hash2 function is slot = h & nTableMask, and arBuckets[slot] then gives the head pointer of that slot's linked list.
pListHead / pListTail: to implement the second semantics (ordering) of the array, the HashTable maintains a global linked list. These two pointers point to the head and the tail of the global linked list.
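The lookup path through nTableMask and arBuckets can be sketched as follows, assuming integer keys only. The names sketch_entry and sketch_find are hypothetical; only the formula slot = h & nTableMask and the pNext chain come from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Reduced bucket: just the fields a lookup needs. */
typedef struct sketch_entry {
    unsigned long h;
    int value;
    struct sketch_entry *pNext;  /* next bucket in the same slot */
} sketch_entry;

/* Find the bucket with h value `h`, or NULL if it is not in the table. */
static sketch_entry *sketch_find(sketch_entry **arBuckets,
                                 unsigned int nTableSize,
                                 unsigned long h)
{
    /* The mask trick only works when the size is a power of two. */
    assert(nTableSize != 0 && (nTableSize & (nTableSize - 1)) == 0);
    unsigned int nTableMask = nTableSize - 1;

    /* hash2: pick the slot, then walk that slot's collision chain. */
    for (sketch_entry *p = arBuckets[h & nTableMask]; p != NULL; p = p->pNext) {
        if (p->h == h) {
            return p;
        }
    }
    return NULL;
}
```

Because nTableSize is a power of two, h & nTableMask gives the same result as h % nTableSize but avoids a division.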

This is a good point to analyze why PHP7 rewrote the array implementation.

  1. Each bucket requires its own memory allocation.
  2. The values in the key-value pairs are all zvals. Each bucket therefore has to maintain the pointer pDataPtr pointing to the zval and the pointer pData pointing to pDataPtr.
  3. To guarantee the two semantics of the array, each bucket has to maintain 4 pointers to other buckets.
The above reasons lead to poor performance.
4. Implementation of PHP7 array
Since a HashTable is used and hash conflicts are resolved by chaining, a linked list seems unavoidable. And to guarantee ordering, a global linked list seems necessary as well. From this point of view, the PHP5 design looks hard to improve.
In fact, PHP7 also uses chaining, but its "chain" is a different kind of chain. The linked list in PHP5 is a physical linked list: the predecessor and successor relationships between buckets are maintained through real pointers. The linked list in PHP7 is a logical linked list: all buckets are allocated in one contiguous array, and the relationships are no longer maintained through pointers. Each bucket only stores the index of the next bucket in the array. Because the memory is contiguous, a bucket can be located quickly by its index, and that is enough to traverse the buckets on a chain.
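The idea of a logical linked list can be sketched as follows. This is a simplified model, not the engine's actual memory layout: the separate slots array, the next field, the fixed size, and all sketch7_* names are assumptions chosen for readability, and the mask arithmetic reuses the PHP5 formula.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SIZE    8u          /* fixed size for the sketch; power of two */
#define SKETCH_INVALID UINT32_MAX  /* marks "no bucket" in a chain */

typedef struct {
    uint64_t h;
    int      value;
    uint32_t next;  /* index of the next bucket in the same slot */
} sketch_bucket7;

typedef struct {
    uint32_t       slots[SKETCH_SIZE];  /* index of the first bucket per slot */
    sketch_bucket7 data[SKETCH_SIZE];   /* contiguous buckets, insertion order */
    uint32_t       used;                /* number of buckets handed out */
} sketch_table7;

static void sketch7_init(sketch_table7 *t)
{
    for (uint32_t i = 0; i < SKETCH_SIZE; i++) {
        t->slots[i] = SKETCH_INVALID;
    }
    t->used = 0;
}

/* Append a new element in the next unused bucket; returns its index. */
static uint32_t sketch7_append(sketch_table7 *t, uint64_t h, int value)
{
    assert(t->used < SKETCH_SIZE);      /* no resizing in this sketch */
    uint32_t idx  = t->used++;
    uint32_t slot = (uint32_t)(h & (SKETCH_SIZE - 1));
    t->data[idx].h     = h;
    t->data[idx].value = value;
    t->data[idx].next  = t->slots[slot]; /* link in front of the old chain */
    t->slots[slot]     = idx;
    return idx;
}

/* Look up h by following indices instead of pointers. Returns 1 if found. */
static int sketch7_find(const sketch_table7 *t, uint64_t h, int *out)
{
    uint32_t idx = t->slots[h & (SKETCH_SIZE - 1)];
    while (idx != SKETCH_INVALID) {
        if (t->data[idx].h == h) {
            *out = t->data[idx].value;
            return 1;
        }
        idx = t->data[idx].next;
    }
    return 0;
}
```

Insertion order needs no extra list at all in this model: walking data[0] to data[used - 1] already visits the elements in the order they were inserted.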
Now let's look at the underlying structure of the PHP7 array.
Basic structure:
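typedef struct _Bucket {
    zval              val;      /* corresponds to the value in the HashTable design */
    zend_ulong        h;        /* corresponds to h in the HashTable design: the numeric key, or the h value of the string key */
    zend_string      *key;      /* corresponds to the key in the HashTable design */
} Bucket;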

Buckets fall into three states: unused, valid, and invalid.
Unused: initially all buckets are unused.
Valid: the bucket stores valid data.
Invalid: when the data in a bucket is deleted, the valid bucket becomes an invalid bucket.
In memory, valid and invalid buckets are interleaved, but all of them come before the unused buckets. An insertion always takes an unused bucket. When there are too many invalid buckets and few valid buckets, a rehash is performed on the whole bucket array so that the sparse valid buckets become contiguous and compact. During this process some invalid buckets are reused and become valid, while the valid and invalid buckets left behind are released and become unused again.
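The compaction part of a rehash can be sketched as below. This only shows how valid buckets are moved to the front; rebuilding the collision chains, which a real rehash also has to do, is left out. The names sketch_cell and sketch_compact are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    int  value;
    bool valid;  /* false once the element has been deleted */
} sketch_cell;

/* Move all valid cells to the front, preserving their relative order.
 * `num_used` is the count of valid plus invalid cells.
 * Returns the new number of used cells (all of them valid). */
static uint32_t sketch_compact(sketch_cell *data, uint32_t num_used)
{
    assert(data != NULL || num_used == 0);
    uint32_t write = 0;
    for (uint32_t read = 0; read < num_used; read++) {
        if (!data[read].valid) {
            continue;                   /* skip deleted cells */
        }
        if (write != read) {
            data[write] = data[read];   /* reuse an earlier invalid cell */
        }
        write++;
    }
    /* Everything from `write` onward is free for new insertions again. */
    for (uint32_t i = write; i < num_used; i++) {
        data[i].valid = false;
    }
    return write;
}
```

Because cells are only ever moved toward the front and never reordered, the insertion order of the surviving elements is unchanged after compaction.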

typedef struct _zend_array {
    zend_refcounted_h  gc;
    union {
        struct {
            ZEND_ENDIAN_LOHI_4(
                zend_uchar    flags,
                zend_uchar    nApplyCount,  /* recursion protection during traversal */
                zend_uchar    nIteratorsCount,
                zend_uchar    consistency)
        } v;
        uint32_t flags;
    } u;
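    uint32_t          nTableMask;           /* mask used to compute the storage slot from the hash value; unlike PHP5's nTableSize - 1, PHP7 stores a negated value derived from nTableSize */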
    Bucket           *arData;               /* holds the actual data (the bucket array) */
    uint32_t          nNumUsed;             /* number of buckets in arData that are already used */
    uint32_t          nNumOfElements;       /* number of elements in the hash table */
    uint32_t          nTableSize;           /* size of the hash table; always a power of 2 (8, 16, 32, 64...), minimum 8, maximum depends on the machine */
    uint32_t          nInternalPointer;     /* used for HashTable traversal */
    zend_long         nNextFreeElement;     /* next free numeric index */
    dtor_func_t       pDestructor;          /* destructor */
} HashTable;

u.flags is a 32-bit unsigned integer with a value range of 0 ~ 2^32 - 1, and u.v.flags is an 8-bit unsigned character with a value range of 0 ~ 255.
u.v.flags: each bit expresses one flag of the HashTable. There are the following 6 flags, corresponding to the lowest 6 bits of u.v.flags.

#define HASH_FLAG_PERSISTENT       (1 << 0)  // whether persistent memory is used (not the memory pool)
#define HASH_FLAG_APPLY_PROTECTION (1 << 1)  // whether recursive traversal protection is enabled
#define HASH_FLAG_PACKED           (1 << 2)  // whether this is a packed array
#define HASH_FLAG_INITIALIZED      (1 << 3)  // whether the table has been initialized
#define HASH_FLAG_STATIC_KEYS      (1 << 4)  // whether all keys are long keys or interned string keys
#define HASH_FLAG_HAS_EMPTY_IND    (1 << 5)  // whether empty indirect values exist
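Working with these bit flags can be sketched as follows. The macro values are copied from the listing above; the helper functions sketch_has_flag, sketch_set_flag, and sketch_clear_flag are hypothetical names for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HASH_FLAG_PERSISTENT       (1 << 0)
#define HASH_FLAG_APPLY_PROTECTION (1 << 1)
#define HASH_FLAG_PACKED           (1 << 2)
#define HASH_FLAG_INITIALIZED      (1 << 3)
#define HASH_FLAG_STATIC_KEYS      (1 << 4)
#define HASH_FLAG_HAS_EMPTY_IND    (1 << 5)

/* True if `flag` (a single bit) is set in `flags`. */
static bool sketch_has_flag(uint32_t flags, uint32_t flag)
{
    assert(flag != 0 && (flag & (flag - 1)) == 0);  /* exactly one bit */
    return (flags & flag) != 0;
}

/* Return `flags` with `flag` switched on. */
static uint32_t sketch_set_flag(uint32_t flags, uint32_t flag)
{
    return flags | flag;
}

/* Return `flags` with `flag` switched off. */
static uint32_t sketch_clear_flag(uint32_t flags, uint32_t flag)
{
    return flags & ~flag;
}
```

Since each flag occupies its own bit, several properties such as "packed" and "initialized" can be stored and tested independently in a single byte.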
Initialization:
(1) Allocate memory for the HashTable structure and initialize each field.
(2) Allocate memory for the bucket array and update the affected fields.
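The two initialization steps can be sketched roughly as follows. This is a simplified model with plain malloc and calloc: the structure sketch_array keeps only a few of the real fields, and the names sketch_new, sketch_real_init, sketch_round_size, and sketch_free are hypothetical. Treating step (2) as a separate, later call is an assumption of this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define SKETCH_MIN_SIZE         8u
#define SKETCH_FLAG_INITIALIZED (1u << 3)

typedef struct { uint64_t h; int value; } sketch_slot;

typedef struct {
    uint32_t     flags;
    uint32_t     nTableSize;
    uint32_t     nNumUsed;
    uint32_t     nNumOfElements;
    int64_t      nNextFreeElement;
    sketch_slot *arData;
} sketch_array;

/* Round a requested size up to a power of two, with a minimum of 8. */
static uint32_t sketch_round_size(uint32_t n)
{
    uint32_t size = SKETCH_MIN_SIZE;
    while (size < n && size < (UINT32_MAX / 2 + 1)) {
        size <<= 1;
    }
    return size;
}

/* Step (1): allocate the header and initialize every field. */
static sketch_array *sketch_new(uint32_t requested)
{
    sketch_array *ht = malloc(sizeof *ht);
    if (ht == NULL) {
        return NULL;
    }
    ht->flags            = 0;
    ht->nTableSize       = sketch_round_size(requested);
    ht->nNumUsed         = 0;
    ht->nNumOfElements   = 0;
    ht->nNextFreeElement = 0;
    ht->arData           = NULL;  /* bucket array not allocated yet */
    return ht;
}

/* Step (2): allocate the bucket array and mark the table as initialized. */
static int sketch_real_init(sketch_array *ht)
{
    assert(ht != NULL && !(ht->flags & SKETCH_FLAG_INITIALIZED));
    ht->arData = calloc(ht->nTableSize, sizeof *ht->arData);
    if (ht->arData == NULL) {
        return -1;
    }
    ht->flags |= SKETCH_FLAG_INITIALIZED;
    return 0;
}

static void sketch_free(sketch_array *ht)
{
    if (ht != NULL) {
        free(ht->arData);
        free(ht);
    }
}
```

Keeping the two steps separate means an array that is created but never filled only costs the small header, not a full bucket array.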

Finally, note that the underlying implementation also covers the difference between packed arrays and hash arrays, the insert/delete/find/update operations, hash conflict resolution, capacity expansion, and the rehash operation. Interested readers can continue to study these topics in depth.

Origin blog.csdn.net/weixin_43885417/article/details/101118471