Diving into Python dictionaries

In Python, dictionaries are implemented as hash tables. A dictionary is backed by an array, and the array index for a key is derived from the key's hash value. The purpose of the hash function is to distribute keys evenly across the array. Because different keys can produce the same hash value, collisions can occur; a well-chosen hash function keeps the number of collisions low.

Dictionaries are indexed by keys, so a dictionary can also be viewed as an associative array that maps each key to a value. Below we add three key/value pairs to a dictionary:

d = {'a': 1, 'b': 2}
d['c'] = 3
print(d)
# {'a': 1, 'b': 2, 'c': 3}

These values can be accessed via:

d = {'a': 1, 'b': 2, 'c': 3}
print(d['a'])
# 1
print(d['b'])
# 2
print(d['c'])
# 3
print(d['d'])
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# KeyError: 'd'

Since the key 'd' does not exist, a KeyError exception is raised.
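When a key may be missing, the exception can be avoided. The following is a minimal sketch using only built-in dict operations; the default value 0 is chosen purely for illustration.

```python
d = {'a': 1, 'b': 2, 'c': 3}

# Membership test: does not raise for a missing key.
has_d = 'd' in d

# get() returns a default instead of raising KeyError.
value = d.get('d', 0)

# Catching the exception explicitly is also possible.
try:
    d['d']
    missing = False
except KeyError:
    missing = True

print(has_d, value, missing)
```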

1. Hash tables

As described above, the array index for a key is computed from the key's hash value, and a more sophisticated hash function can reduce the number of collisions. Python does not use such a function: its most important hash functions (for strings and integers) are very regular in common cases:

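print(list(map(hash, (0, 1, 2, 3))))
# [0, 1, 2, 3]
print(list(map(hash, ("namea", "nameb", "namec", "named"))))
# [-1658398457, -1658398460, -1658398459, -1658398462]
# (string hashes as produced by a 32-bit Python 2 build; Python 3 randomizes string hashes per process)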

In the rest of this article, we only consider strings as keys. In CPython 2.x, the hash function for strings is defined like this (in pseudocode):

arguments: string object
returns: hash
function string_hash:
    if hash cached:
        return it
    set len to string's length
    initialize var p pointing to 1st char of string object
    set x to value pointed by p left shifted by 7 bits
    while len > 0:
        set var x to (1000003 * x) xor value pointed by p
        increment pointer p
        decrement len
    set x to x xor length of string object
    cache x as the hash so we don't need to calculate it again
    return x as the hash

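If you run hash('a') in Python 2, string_hash() is executed behind the scenes and returns 12416037344 (here we assume a 64-bit platform). Python 3 hashes strings differently and randomizes the result per process by default, so the value will not match there.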
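The pseudocode above can be sketched in Python. This is an illustration under stated assumptions rather than CPython's actual code: it takes a bytes object, emulates C integer wrap-around with a mask (the bits parameter is introduced here for that purpose), and leaves out hash caching.

```python
def string_hash(data: bytes, bits: int = 64) -> int:
    """Sketch of the CPython 2.x string hash for a byte string."""
    if not data:
        return 0
    mask = (1 << bits) - 1            # emulate C integer wrap-around
    x = (data[0] << 7) & mask
    for byte in data:
        x = ((1000003 * x) ^ byte) & mask
    x ^= len(data)
    if x >= 1 << (bits - 1):          # reinterpret as a signed C long
        x -= 1 << bits
    return -2 if x == -1 else x       # -1 is reserved as an error code in C


print(string_hash(b'a'))  # 12416037344
```

With the default of 64 bits this reproduces the value quoted above for 'a'.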

If an array of length x is used to store the key/value pairs, the index of a slot (the unit that stores one key/value pair) is computed by masking the hash with x-1. This works because x is always a power of two, and it makes computing the index very fast. The resizing mechanism (described in detail below) keeps the probability of finding an empty slot high, so in most cases this simple computation is all that is needed. If the array used by the dictionary has length 8, the index of key 'a' is hash('a') & 7 = 0; similarly, the index of 'b' is 3 and the index of 'c' is 2. The index of 'z' is also 3, the same as that of 'b', so there is a collision.

As this shows, Python's hash function behaves ideally when the keys are consecutive, which matters because this kind of data is common. However, as soon as we add the key 'z' there is a collision, because this key is not adjacent to the others.
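The index computation described above can be sketched as follows. The helper name slot_index is hypothetical, and the hash value is the one quoted in the text for 'a' (Python 2, 64-bit), hard-coded here because Python 3 would produce a different value.

```python
def slot_index(hash_value: int, table_size: int) -> int:
    """Map a hash to a slot index; table_size must be a power of two."""
    if table_size <= 0 or table_size & (table_size - 1):
        raise ValueError("table_size must be a power of two")
    return hash_value & (table_size - 1)


HASH_A = 12416037344  # hash('a') as quoted in the text
print(slot_index(HASH_A, 8))                # 0
print(slot_index(HASH_A, 8) == HASH_A % 8)  # True
```

For a power-of-two length, masking gives the same result as the modulo operation but needs only a single bitwise AND.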

Of course, we could also resolve collisions by chaining, that is, by keeping in each array slot a linked list of the key/value pairs whose keys map to that index. But walking the list increases lookup time, and the time complexity is no longer O(1). The next section describes the method Python's dictionary uses to resolve collisions.
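For comparison, here is a minimal sketch of the chaining alternative mentioned above. It is not how Python's dict works; the class ChainedTable is hypothetical, and a Python list stands in for each linked list.

```python
class ChainedTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, size=8):
        self.buckets = [[] for _ in range(size)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def set(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self._bucket(key):     # linear scan of the chain
            if k == key:
                return v
        raise KeyError(key)


table = ChainedTable()
table.set(3, 'x')
table.set(11, 'y')   # 11 % 8 == 3, so this shares a bucket with key 3
print(table.get(3), table.get(11))
```

The scan in get() is what makes lookups slower as chains grow.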

2. Open addressing

Open addressing is a method that uses probing to handle collisions. In the example above, key 'z' collides at index 3, which is already taken in the array, so we need to search for an index that is not yet in use. Adding and looking up a key/value pair both take O(1) time on average.

The search for a free slot follows a probing sequence (described here as quadratic probing, although the recurrence below is closer to a pseudo-random sequence), computed as follows:

j = (5*j) + 1 + perturb;
perturb >>= PERTURB_SHIFT;
use j % 2**i as the next table index;

Recurring on 5*j+1 quickly magnifies small differences in the bits of the hash that did not affect the initial index. The variable perturb makes the other bits of the hash take part in the sequence as well.

Out of curiosity, let's look at the probing sequence when the array length is 32 and j = 3: 3 -> 11 -> 19 -> 29 -> 5 -> 6 -> 16 -> 31 -> 28 -> 13 -> 2…

For more information about the probing sequence, see the source code of dictobject.c. The beginning of that file contains a detailed description of the probing mechanism.
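The recurrence can be tried out with a small generator of probe indices. This is a sketch, assuming a non-negative hash and a power-of-two table size; the function name probe_sequence is hypothetical, and PERTURB_SHIFT is set to 5, the value defined in dictobject.c.

```python
PERTURB_SHIFT = 5  # value defined in CPython's dictobject.c


def probe_sequence(hash_value, table_size, count):
    """Return the first `count` slot indices probed for a non-negative hash."""
    mask = table_size - 1              # table_size must be a power of two
    perturb = hash_value
    j = hash_value & mask              # initial index
    indices = [j]
    while len(indices) < count:
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
        indices.append(j & mask)       # same as j % 2**i
    return indices


print(probe_sequence(0, 8, 8))  # [0, 1, 6, 7, 4, 5, 2, 3]
```

Once perturb has been shifted down to zero, the remaining recurrence 5*j+1 visits every slot of a power-of-two table, so a free slot is always found if one exists.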

Let's take a look at the Python internal code with an example.

3. The dictionary's C structures

The following C structure stores one key/value pair of the dictionary (also called an entry); it holds the hash value, the key, and the value. PyObject is the base structure of all Python objects. These definitions come from CPython 2.x; later versions lay dictionaries out differently.

typedef struct {
    Py_ssize_t me_hash;
    PyObject *me_key;
    PyObject *me_value;
} PyDictEntry;

The following is the data structure corresponding to the dictionary. Among them, ma_fill is the total number of active slots and dummy slots. When a key/value pair in an active slot is deleted, the slot is marked as a dummy slot. ma_used is the total number of active slots. The ma_mask value is the length of the array minus 1, which is used to calculate the index of the slot. ma_table is the array itself, and ma_smalltable is the initial array with a length of 8.

typedef struct _dictobject PyDictObject;
struct _dictobject {
    PyObject_HEAD
    Py_ssize_t ma_fill;
    Py_ssize_t ma_used;
    Py_ssize_t ma_mask;
    PyDictEntry *ma_table;
    PyDictEntry *(*ma_lookup)(PyDictObject *mp, PyObject *key, long hash);
    PyDictEntry ma_smalltable[PyDict_MINSIZE];
};

Dictionary initialization

The PyDict_New() function is called when a dictionary is first created. Some lines in the source code have been deleted here, and the C language code has been converted to pseudocode to highlight several key concepts.

returns new dictionary object
function PyDict_New:
    allocate new dictionary object
    clear dictionary's table
    set dictionary's number of used slots + dummy slots (ma_fill) to 0
    set dictionary's number of active slots (ma_used) to 0
    set dictionary's mask (ma_mask) to dictionary size - 1 = 7
    set dictionary's lookup function to lookdict_string
    return allocated dictionary object
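To make the fields concrete, the two structures and the initialization step can be mirrored in Python. This is only an illustrative model: the class names DictEntry and DictObject and the function new_dict are hypothetical, None stands in for a NULL key, and ma_lookup and ma_smalltable are left out.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

PyDict_MINSIZE = 8


@dataclass
class DictEntry:
    me_hash: int = 0
    me_key: Optional[Any] = None      # None stands in for NULL (unused slot)
    me_value: Optional[Any] = None


@dataclass
class DictObject:
    ma_fill: int = 0                   # active + dummy slots
    ma_used: int = 0                   # active slots
    ma_mask: int = PyDict_MINSIZE - 1  # table length - 1
    ma_table: List[DictEntry] = field(
        default_factory=lambda: [DictEntry() for _ in range(PyDict_MINSIZE)]
    )


def new_dict() -> DictObject:
    """Rough counterpart of PyDict_New(): an empty dictionary with 8 slots."""
    return DictObject()


d = new_dict()
print(d.ma_fill, d.ma_used, d.ma_mask, len(d.ma_table))
```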

Adding an item

Adding a new key/value pair calls PyDict_SetItem(), which takes a pointer to the dictionary object together with the key and the value. It first checks whether the key is a string and then computes the hash; if the key's hash has already been computed and cached, the cached value is used directly. It then calls insertdict() to add the new key/value pair. If the total number of active slots and dummy slots exceeds 2/3 of the array length, the array is resized. Why 2/3? Mainly to ensure that the probing sequence can find a free slot quickly. The resizing function is described later.

arguments: dictionary, key, value
returns: 0 if OK or -1
function PyDict_SetItem:
    if key's hash cached:
        use hash
    else:
        calculate hash
    call insertdict with dictionary object, key, hash and value
    if key/value pair added successfully and capacity over 2/3:
        call dictresize to resize dictionary's table
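The 2/3 test can be written with integer arithmetic only. In this sketch the function name needs_resize is hypothetical, and the use of >= (rather than a strict >) is my assumption based on the CPython 2.7 source, not something stated in the text.

```python
def needs_resize(ma_fill: int, table_size: int) -> bool:
    """True once active + dummy slots reach 2/3 of the table length."""
    # Cross-multiplying avoids floating-point division.
    return ma_fill * 3 >= table_size * 2


print(needs_resize(5, 8), needs_resize(6, 8))  # False True
```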

insertdict() uses the lookup function lookdict_string() to find a free slot; this is the same function used to look up keys. lookdict_string() computes the slot index from the hash and the mask. If the key is not found at index = hash & mask, it probes using the recurrence described earlier until it finds the key or a free slot. If no matching key is found and a dummy slot was encountered along the way, the first dummy slot is returned, so previously deleted slots are reused in preference.
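The lookup behaviour described above, including the preference for dummy slots, can be sketched as follows. This is a simplified model rather than the real lookdict_string(): the table is a Python list, the DUMMY marker and the function name find_slot are hypothetical, and the hash is assumed to be non-negative.

```python
PERTURB_SHIFT = 5
DUMMY = object()   # hypothetical marker for a deleted slot


def find_slot(table, key, hash_value):
    """Return the index where `key` lives, or where it should be inserted.

    Each slot is None (never used), DUMMY (deleted) or a
    (hash, key, value) tuple. Assumes hash_value >= 0, a power-of-two
    table length, and at least one None slot.
    """
    mask = len(table) - 1
    j = hash_value & mask
    perturb = hash_value
    first_dummy = None
    while True:
        slot = table[j & mask]
        if slot is None:
            # Key is absent: prefer reusing a dummy slot seen on the way.
            return first_dummy if first_dummy is not None else j & mask
        if slot is DUMMY:
            if first_dummy is None:
                first_dummy = j & mask
        elif slot[0] == hash_value and slot[1] == key:
            return j & mask
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT


table = [None] * 8
table[3] = DUMMY
print(find_slot(table, 'k', 3))  # 3: the dummy slot is reused
```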

Now suppose we add the following key/value pairs: {'a': 1, 'b': 2, 'z': 26, 'y': 25, 'c': 5, 'x': 24}. The following happens:

First, a dictionary structure with an internal table of size 8 is allocated. The pairs are then inserted one at a time: as computed earlier, 'a' goes to slot 0 and 'b' to slot 3, while 'z' also maps to index 3 and has to be placed in a free slot found by probing.

After all six insertions, 6 of the 8 slots are in use, which exceeds 2/3 of the total capacity. dictresize() is therefore called to allocate a larger array, and the entries are copied to the new table.

In our example, dictresize() is called with a minimum size of 4 times the number of active slots, i.e. minused = 24 = 4 * ma_used. When the number of active slots is very large (greater than 50,000), the minimum is instead twice the number of active slots, i.e. 2 * ma_used. Why 4 times? Mainly to reduce the number of resize calls while making the table significantly sparser.

The length of the new table must be greater than 24. To compute it, the length is doubled repeatedly, starting from the current length of 8, until it exceeds 24: 8 -> 16 -> 32. The final length is 32.
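The size calculation can be sketched as below. The function name new_table_size is hypothetical, and starting the doubling from PyDict_MINSIZE is my assumption based on the CPython 2.7 source; in the example this is the same as the current length of 8.

```python
PyDict_MINSIZE = 8


def new_table_size(ma_used: int) -> int:
    """Smallest power-of-two length strictly greater than minused."""
    factor = 2 if ma_used > 50000 else 4
    minused = factor * ma_used
    size = PyDict_MINSIZE
    while size <= minused:
        size <<= 1          # 8 -> 16 -> 32 -> ...
    return size


print(new_table_size(6))  # 32
```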

This is how resizing works: a new table of length 32 is allocated, and the entries from the old table are inserted into the new table using the new mask, which is 31.
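The whole walkthrough can be replayed with a small simulation. It is a sketch under assumptions: the helpers string_hash, insert and layout are hypothetical, keys are non-empty bytes objects hashed with the Python 2 style function on 64 bits, and the resize is done by hand instead of through a threshold check.

```python
PERTURB_SHIFT = 5


def string_hash(data: bytes) -> int:
    """CPython 2.x style string hash on 64 bits, kept unsigned for probing."""
    mask = (1 << 64) - 1
    x = (data[0] << 7) & mask
    for byte in data:
        x = ((1000003 * x) ^ byte) & mask
    return x ^ len(data)


def insert(table, key, value):
    """Store (hash, key, value) in the first free slot on the probe path."""
    h = string_hash(key)
    mask = len(table) - 1
    j, perturb = h & mask, h
    while table[j & mask] is not None and table[j & mask][1] != key:
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
    table[j & mask] = (h, key, value)


def layout(table):
    """Map each occupied slot index to the key stored there."""
    return {i: slot[1] for i, slot in enumerate(table) if slot is not None}


pairs = [(b'a', 1), (b'b', 2), (b'z', 26), (b'y', 25), (b'c', 5), (b'x', 24)]
small = [None] * 8
for key, value in pairs:
    insert(small, key, value)
print(layout(small))

# 6 of 8 slots are used, so grow to 32 slots and reinsert with mask 31.
large = [None] * 32
for slot in small:
    if slot is not None:
        insert(large, slot[1], slot[2])
print(layout(large))
```

In this model the keys 'z', 'y' and 'x' each need probing in the 8-slot table, while in the 32-slot table every key lands directly at hash & 31.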

4. Delete item

PyDict_DelItem() is called when an item is deleted. It first computes the hash of the key, then calls the lookup function to find the entry, and finally marks the slot as a dummy slot.

Suppose we delete the key 'c' from the dictionary: its slot is not emptied but marked as a dummy slot, so that probe sequences passing through it are not cut short.

Note that deleting an item never triggers a resize, even if the number of active slots ends up much smaller than the length of the array. However, if a key/value pair is added after deletions, the array may shrink: the decision to resize is based on the total number of active and dummy slots, while the new length is computed from the number of active slots only.
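Why a deleted slot is marked rather than emptied can be seen in a small sketch. The DUMMY marker and the functions lookup and delete are hypothetical stand-ins for the C code, and hashes are assumed to be non-negative.

```python
PERTURB_SHIFT = 5
DUMMY = object()  # hypothetical marker standing in for CPython's dummy key


def lookup(table, key, hash_value):
    """Return the index of `key`, or None if it is absent."""
    mask = len(table) - 1
    j, perturb = hash_value & mask, hash_value
    while table[j & mask] is not None:      # stop only at a never-used slot
        slot = table[j & mask]
        if slot is not DUMMY and slot[1] == key:
            return j & mask
        j = (5 * j) + 1 + perturb
        perturb >>= PERTURB_SHIFT
    return None


def delete(table, key, hash_value):
    index = lookup(table, key, hash_value)
    if index is None:
        raise KeyError(key)
    table[index] = DUMMY                    # mark the slot, do not empty it


# Two keys with the same hash: 'first' sits at slot 3, 'second' was probed to 0.
table = [None] * 8
table[3] = (3, 'first', 1)
table[0] = (3, 'second', 2)
delete(table, 'first', 3)
print(lookup(table, 'second', 3))  # 0: still reachable through the dummy
```

Had slot 3 been reset to empty, the lookup for 'second' would have stopped there and reported the key as missing.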

Origin blog.csdn.net/weixin_61587867/article/details/132252555