Implementing the HashMap Data Structure in Python


Today's article is about the hashmap, a data structure that every Java engineer is said to know. Why every Java engineer? Because it is simple, and because you can hardly land a job without it: almost every interview asks about it, so it has basically become standard equipment.

In today's article we will unravel several mysteries. For example: why are hashmap's get and put operations even faster than a red-black tree's? What is the relationship between a hashmap and the hash algorithm? What parameters does a hashmap have, and what are they for? Is a hashmap thread safe? And how do we keep a hashmap balanced?

Let's take a look at the basic structure of a hashmap with these questions in mind.

Basic structure

The data structure of a hashmap is actually not difficult; its structure is very clear, and I can explain it in one sentence: it looks just like an adjacency list. Although the two serve quite different purposes, their structures are exactly the same. To put it bluntly, it is a fixed-length array in which each element is the head node of a linked list.

We will call this array headers: each element in it is the head node of a linked list, and starting from that head node we can traverse the whole list. The array has a fixed length, but the linked lists are variable-length, so adding, deleting, modifying, and looking up elements is essentially done through the linked lists.

This is the basic structure of a hashmap. If you are asked about it in an interview, you can answer directly: it is essentially an array whose elements are linked lists.
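
To make that concrete, here is a minimal sketch of the skeleton, using ordinary Python lists as stand-ins for the linked lists we will build later:

# a fixed-length array of 6 buckets; each bucket collects the
# elements whose keys are assigned to it
headers = [[] for _ in range(6)]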

The role of hash

Now that we understand the basic structure of a hashmap, let's move on to the next question: what does such a structure have to do with hashing?

It is not hard to guess if we think through a scenario. Suppose we already have a hashmap, and a new piece of data needs to be stored. Say the array has length 6, which means there are 6 linked lists to choose from; which linked list should we put the new element in?

You might say: the shortest one, of course, so that the lengths of the linked lists stay balanced. That sounds good, but there is a problem. Storing is convenient, but reading becomes a big problem: when we stored the element we knew it went into the shortest list, but when we read it back we have no idea which list was shortest at the time, and the whole structure has very likely changed beyond recognition. So we cannot decide where a node goes based on a dynamic quantity; we must decide based on a static one.

That static quantity is the hash value. The essence of a hash algorithm is a mapping: it maps a value of arbitrary structure to an integer. Ideally, different values would map to different results and the same value would always map to the same result, i.e. a one-to-one correspondence between values and integers. But since the range of integers we map into is finite while the space of possible values is unbounded, there must be unequal values that map to the same result. This situation is called a hash collision.
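
We can see this directly in Python. Fold the hash values of several distinct keys into a small number of buckets, and by the pigeonhole principle at least one bucket must receive more than one key:

# map 7 distinct keys into just 6 buckets; some bucket is
# guaranteed to end up holding two or more keys -- a collision
buckets = {}
for key in ['apple', 'banana', 'cherry', 'date', 'elderberry', 'fig', 'grape']:
    buckets.setdefault(hash(key) % 6, []).append(key)
print(buckets)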

In a hashmap, hash collisions are nothing to worry about, because we never required different keys to map to different hash values in the first place. We only use the hash value to decide which linked list a node should be stored in. As long as the hash function is fixed, the same key always produces the same hash value, so when we query we can follow the same logic: compute the key's hash value and find the corresponding linked list.

In Python this is especially convenient, because the language provides a built-in hash function. We only need two lines of code to find the linked list corresponding to a key:

hash_key = hash(key) % len(self.headers)
linked_list = self.headers[hash_key]
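
One caveat worth knowing: for str and bytes keys, Python salts hash values per process (controlled by PYTHONHASHSEED), so the bucket a key lands in can differ between runs. That is fine for our purposes, since all we need is stability within a single run. Note also that Python's % always returns a non-negative result for a positive modulus, so even a negative hash value yields a valid index:

# stable within a single process, even though it may differ between runs
bucket = hash('apple') % 6
assert bucket == hash('apple') % 6

# a negative hash value still produces a valid bucket index
assert (-7) % 6 == 5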

Get and put implementation

Once we understand the role of the hash function, most of the hashmap's problems are solved, because what remains is adding, deleting, modifying, and looking up within a linked list. For example, when we want to find a value by key, the hash function tells us which linked list it lives in, and all that remains is to traverse that list to find it.

We can implement this in the LinkedList class; it is just a simple traversal:

def get_by_key(self, key):
    # start from the first real node after the sentinel head
    cur = self.head.succ
    while cur != self.tail:
        if cur.key == key:
            return cur
        cur = cur.succ
    return None

With node lookup in the linked list available, the query logic of the hashmap follows immediately, because it essentially does only two things: find the right linked list from the hash value, then traverse that list to find the node.

We can also easily implement:

def get(self, key):
    hash_key = self.get_hash_key(key)
    linked_list = self.headers[hash_key]
    node = linked_list.get_by_key(key)
    return node

Once get is implemented, writing put follows naturally, because put's logic mirrors get's: instead of only searching, we either modify the node we find or add a new one:

def put(self, key, val):
    node = self.get(key)
    # if the key already exists, just update its value
    if node is not None:
        node.val = val
    else:
        # otherwise append a new node to the corresponding linked list
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        node = Node(key, val)
        linked_list.append(node)

Guarantee of complexity

With get and put both implemented, is the hashmap complete? Obviously not, because one very important thing remains: guaranteeing the hashmap's complexity.

A brief analysis reveals a major problem with this implementation: the array of linked lists has a fixed length. No matter how long the array is, once we store enough elements, each linked list will be assigned a large number of them. And since traversing a linked list takes O(n) time, how can we guarantee constant-time queries?

Beyond that, there is another problem: skewed hash values. For example, we may have 100 linked lists, but most of our data's hash values are 0 after taking them modulo 100. A large amount of data then piles up in bucket 0 while the other buckets stay empty. How do we avoid this?

Whether the cause is too much data or an uneven distribution, the symptom is the same: at least one bucket holds too much data, degrading efficiency. To handle this, the hashmap includes a check: once the number of elements in some bucket exceeds a threshold, a reset is triggered. A reset doubles the number of linked lists in the hashmap and redistributes all of the data. The threshold is controlled by a parameter called load_factor: when a bucket holds at least load_factor * capacity elements, the reset mechanism fires. With the defaults used in the full code below (capacity 16, load_factor 5), that means a bucket triggers a reset once it reaches 80 nodes.

Adding the reset logic, the put function becomes:

def put(self, key, val):
    hash_key = self.get_hash_key(key)
    linked_list = self.headers[hash_key]
    # if the bucket exceeds the threshold
    if linked_list.size >= self.load_factor * self.capacity:
        # reset and redistribute all the data
        self.reset()
        # re-hash the element we are about to insert into its new bucket
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
    node = linked_list.get_by_key(key)
    if node is not None:
        node.val = val
    else:
        node = Node(key, val)
        linked_list.append(node)

The logic of reset is also very simple: we double the length of the array, then read out the original data node by node and re-hash each node into its new bucket.

def reset(self):
    # double the length of the bucket array
    headers = [LinkedList() for _ in range(self.capacity * 2)]
    cap = self.capacity
    # double capacity as well (get_hash_key depends on it)
    self.capacity = self.capacity * 2
    for i in range(cap):
        linked_list = self.headers[i]
        nodes = linked_list.get_list()
        # move the original nodes one by one into the new linked lists
        for u in nodes:
            hash_key = self.get_hash_key(u.key)
            head = headers[hash_key]
            head.append(u)
    self.headers = headers

The threshold effectively bounds our worst-case query time: we can treat it as a (fairly large) constant, so the efficiency of put and get is protected whether the problem is a flood of insertions or an uneven hash distribution.

Details and refinements

If you read the hashmap source code in the JDK, you will find that a hashmap's capacity, i.e. the number of linked lists, is always a power of 2. Why is that?

The reason is simple. As in our logic above, after computing the hash value with the hash function, we take it modulo the capacity: hash(key) % capacity, which is exactly what the earlier code does.

The small problem is that the modulo operation is slow, dozens of times slower than addition, subtraction, or multiplication at the machine level. To optimize this, we make the capacity a power of 2, so that hash(key) & (capacity - 1) can replace hash(key) % capacity; when capacity is a power of 2, the two computations are equivalent. Bitwise operations are among the fastest operations a computer performs, so this saves a lot of computation.
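
We can check this equivalence directly in Python. Note that Python's & treats negative integers by their two's-complement representation, so the identity holds even when hash returns a negative value:

capacity = 16   # a power of 2
for key in ['apple', 42, (1, 2), 'banana']:
    h = hash(key)
    # masking with capacity - 1 keeps the low bits, which is exactly
    # the remainder after dividing by a power of 2
    assert h % capacity == h & (capacity - 1)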

Finally, let's talk about thread safety. Is a hashmap thread safe? The answer is simple: of course not. There are no locks or other mutual-exclusion mechanisms, so thread A can be modifying a node while thread B reads that same node, and problems arise easily, especially around a long-running operation like reset. If a query arrives during a reset, it may fail to find data that actually exists. So a hashmap is not thread safe and cannot be used in concurrent scenarios.
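
If you did need to share it between threads, the simplest (though not the most efficient) remedy is to guard every operation with a lock. The wrapper below is only an illustrative sketch, not part of the hashmap itself:

import threading

class SynchronizedHashMap:
    # a naive thread-safe wrapper: a single lock serializes all operations
    def __init__(self, inner):
        self.inner = inner
        self.lock = threading.Lock()

    def put(self, key, val):
        with self.lock:
            self.inner.put(key, val)

    def get(self, key):
        with self.lock:
            return self.inner.get(key)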

Finally, here is the complete implementation of the hashmap:

class Node:
    def __init__(self, key, val, prev=None, succ=None):
        self.key = key
        self.val = val
        # predecessor
        self.prev = prev
        # successor
        self.succ = succ

    def __repr__(self):
        return str(self.val)


class LinkedList:
    def __init__(self):
        # sentinel head and tail nodes simplify insertion and deletion
        self.head = Node(None, 'header')
        self.tail = Node(None, 'tail')
        self.head.succ = self.tail
        self.tail.prev = self.head
        self.size = 0

    def append(self, node):
        # insert the node at the tail of the list (just before the tail sentinel)
        prev = self.tail.prev
        node.prev = prev
        node.succ = prev.succ
        prev.succ = node
        node.succ.prev = node
        self.size += 1

    def delete(self, node):
        # unlink the node from the list
        prev = node.prev
        succ = node.succ
        succ.prev, prev.succ = prev, succ
        self.size -= 1

    def get_list(self):
        # return all nodes as a Python list for convenient traversal upstream
        ret = []
        cur = self.head.succ
        while cur != self.tail:
            ret.append(cur)
            cur = cur.succ
        return ret

    def get_by_key(self, key):
        cur = self.head.succ
        while cur != self.tail:
            if cur.key == key:
                return cur
            cur = cur.succ
        return None



class HashMap:
    def __init__(self, capacity=16, load_factor=5):
        self.capacity = capacity
        self.load_factor = load_factor
        self.headers = [LinkedList() for _ in range(capacity)]

    def get_hash_key(self, key):
        return hash(key) & (self.capacity - 1)

    def put(self, key, val):
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        if linked_list.size >= self.load_factor * self.capacity:
            self.reset()
            hash_key = self.get_hash_key(key)
            linked_list = self.headers[hash_key]
        node = linked_list.get_by_key(key)
        if node is not None:
            node.val = val
        else:
            node = Node(key, val)
            linked_list.append(node)

    def get(self, key):
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        node = linked_list.get_by_key(key)
        return node.val if node is not None else None

    def delete(self, key):
        # note: get returns the value, so we look up the node directly
        hash_key = self.get_hash_key(key)
        linked_list = self.headers[hash_key]
        node = linked_list.get_by_key(key)
        if node is None:
            return False
        linked_list.delete(node)
        return True

    def reset(self):
        headers = [LinkedList() for _ in range(self.capacity * 2)]
        cap = self.capacity
        self.capacity = self.capacity * 2
        for i in range(cap):
            linked_list = self.headers[i]
            nodes = linked_list.get_list()
            for u in nodes:
                hash_key = self.get_hash_key(u.key)
                head = headers[hash_key]
                head.append(u)
        self.headers = headers
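
A quick usage example to tie everything together; the expected outputs follow directly from the implementation above:

m = HashMap()
m.put('a', 1)
m.put('b', 2)
m.put('a', 3)            # overwrites the old value for 'a'

print(m.get('a'))        # 3
print(m.get('missing'))  # None
print(m.delete('b'))     # True
print(m.get('b'))        # None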

That's all for today's article. I sincerely hope you all gain something every day.


Origin: blog.csdn.net/chinesehuazhou2/article/details/108786521