[Algorithms and Data Structures 09] Hash table: a powerful tool for efficient search


Foreword:

Previously, we studied linear lists, arrays, strings, and trees. They share one drawback: searching for data by value requires traversing all or part of the data. Is there a way to skip the comparison process and make searching by value even more efficient?

The answer is yes! In this lesson, we introduce a powerful tool for efficient search: the hash table.



One, what is a hash table

The name "hash table" comes from the English word "hash"; the structure is also commonly called a hash map. A hash table is a special data structure that is very different from the ones we have learned before, such as arrays, linked lists, and trees.

1.1 Principle of Hash Table

A hash table is a data structure that uses a hash function to organize data so that insertion and search are fast. The core idea of a hash table is to use a hash function to map keys to buckets. More specifically:

  • When we insert a new key, the hash function will determine which bucket the key should be allocated to and store the key in the corresponding bucket;
  • When we want to search for a key, the hash table will use the same hash function to find the corresponding bucket, and only search in a specific bucket.

Here is a simple example to help us understand:

[Figure: keys mapped to five buckets by the hash function y = x % 5]
In the example, we use y = x % 5 as the hash function. Let's use it to walk through the insertion and search strategy:

  • Insertion: We pass each key through the hash function and map it to the corresponding bucket. For example, 1987 is assigned to bucket 2 and 24 is assigned to bucket 4.
  • Search: We pass the key through the same hash function and search only in that bucket. For example, to search for 23, we map 23 to 3 and search in bucket 3. Since 23 is not in bucket 3, it is not in the hash table.
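
The insertion and search strategy above can be sketched in a few lines of Python. This is only a minimal illustration: the names bucket_index, insert, and contains are made up for this sketch, and the keys used are just the ones mentioned in the text, not the full set from the figure.

```python
NUM_BUCKETS = 5  # matches the hash function y = x % 5 from the text


def bucket_index(key):
    """Hash function from the example: map an integer key to a bucket."""
    return key % NUM_BUCKETS


def insert(buckets, key):
    """Store the key in the bucket chosen by the hash function."""
    bucket = buckets[bucket_index(key)]
    if key not in bucket:  # avoid storing the same key twice
        bucket.append(key)


def contains(buckets, key):
    """Search only the one bucket the key could be in."""
    return key in buckets[bucket_index(key)]


buckets = [[] for _ in range(NUM_BUCKETS)]
for key in (1987, 24, 2):  # illustrative keys only
    insert(buckets, key)

print(buckets)                 # [[], [], [1987, 2], [], [24]]
print(contains(buckets, 24))   # True
print(contains(buckets, 23))   # False: bucket 3 is empty
```

Note that a search never looks at more than one bucket, which is where the speed-up over a full traversal comes from.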

1.2 Design a hash function

The hash function is the most important component of a hash table: it maps each key to a particular bucket. In the previous example, we used y = x % 5 as the hash function, where x is the key and y is the index of the assigned bucket.

The hash function depends on the range of key values and the number of buckets. Here are some examples of hash functions:
[Figure: examples of hash functions]

The design of hash functions is an open question. The idea is to distribute keys across the buckets as uniformly as possible. Ideally, a perfect hash function would be a one-to-one mapping between keys and buckets. In most cases, however, the hash function is not perfect, and a trade-off is required between the number of buckets and the capacity of each bucket.

Of course, we can also design hash functions ourselves. The common methods are:

  • Direct addressing method. The hash function is a linear function from keys to addresses, for example H(key) = a * key + b, where a and b are fixed constants.
  • Digit analysis method. Suppose each key in the key set consists of s digits (k1, k2, ..., ks). Several evenly distributed digit positions are extracted from the key to form the hash address.
  • Mid-square method. If certain digits repeat with very high frequency in every position of the keys, we can first square the key to amplify the differences, and then take the middle digits of the square as the final storage address.
  • Folding method. If the key has many digits, we can split it into several parts of equal length and take their sum (discarding the carry) as the hash address.
  • Division-remainder method. Choose a number p in advance and take the remainder of the key divided by p. That is, the address is key % p.
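
To make the five methods concrete, here is a rough sketch of each one for non-negative integer keys. The function names, the default parameters (a, b, positions, width, part_len, p), and the sample keys are all assumptions chosen for illustration; real designs pick them based on the actual key set.

```python
def direct_addressing(key, a=1, b=0):
    """Direct addressing: H(key) = a * key + b."""
    return a * key + b


def digit_analysis(key, positions=(1, 3)):
    """Digit analysis: build the address from selected digit positions.

    `positions` are 0-based indices into the decimal string of the key;
    the default choice is arbitrary and only for illustration.
    """
    digits = str(key)
    return int("".join(digits[i] for i in positions))


def mid_square(key, width=2):
    """Mid-square: square the key and keep `width` digits from the middle."""
    squared = str(key * key)
    start = (len(squared) - width) // 2
    return int(squared[start:start + width])


def folding(key, part_len=3):
    """Folding: split the digits into parts, add them, drop the carry."""
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % (10 ** part_len)


def division_remainder(key, p=11):
    """Division-remainder: H(key) = key % p."""
    return key % p


print(direct_addressing(7, a=2, b=3))   # 17
print(digit_analysis(138462))           # 34 (digits at positions 1 and 3)
print(mid_square(1234))                 # 22 (1234 * 1234 = 1522756)
print(folding(123456789))               # 368 (123 + 456 + 789 = 1368, carry dropped)
print(division_remainder(23))           # 1
```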

Two, resolving hash collisions

Ideally, if our hash function were a perfect one-to-one mapping, we would not need to deal with collisions. Unfortunately, in most cases collisions are almost inevitable. For example, with our previous hash function (y = x % 5), both 1987 and 2 are assigned to bucket 2, which is a hash collision.

The following questions should be considered when resolving hash collisions:

  • How should the values in the same bucket be organized?
  • What if too many values are assigned to the same bucket?
  • How do we search for the target value in a specific bucket?

So once a conflict occurs, how do we resolve it?

There are two commonly used methods: open addressing and chaining (the chain address method).

2.1 Open addressing method

When a key collides with another key, a probing technique is used to generate a probe sequence of slots in the hash table, and the slots in this sequence are examined in turn. The key is inserted into the first empty slot encountered.

The most commonly used probing method is linear probing. For example, suppose we have the set of keys {12, 13, 25, 23} and the hash function key % 11. The keys 12, 13, and 25 can be inserted directly, at addresses 1, 2, and 3 respectively. When 23 is inserted, its hash address is 23 % 11 = 1.

However, address 1 is already occupied, so we probe the following addresses in sequence. Addresses 2 and 3 are also occupied; address 4 is the first empty one, so 23 is inserted there:
[Figure: linear probing places 23 at address 4]
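
The linear probing example above can be reproduced with a short sketch. The helper names probe_insert and probe_search are hypothetical, and deletion is deliberately left out: removing keys from an open-addressing table needs extra care (for example tombstone markers) so that later searches do not stop too early.

```python
TABLE_SIZE = 11
EMPTY = None


def probe_insert(table, key):
    """Insert key with linear probing; return the slot index used."""
    start = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (start + step) % TABLE_SIZE  # wrap around at the end
        if table[slot] is EMPTY or table[slot] == key:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


def probe_search(table, key):
    """Return the slot holding key, or -1 if key is absent."""
    start = key % TABLE_SIZE
    for step in range(TABLE_SIZE):
        slot = (start + step) % TABLE_SIZE
        if table[slot] is EMPTY:   # an empty slot ends the probe sequence
            return -1
        if table[slot] == key:
            return slot
    return -1


table = [EMPTY] * TABLE_SIZE
slots = [probe_insert(table, key) for key in (12, 13, 25, 23)]
print(slots)                     # [1, 2, 3, 4]
print(probe_search(table, 23))   # 4
print(probe_search(table, 34))   # -1 (probes 1, 2, 3, 4, then hits empty slot 5)
```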

2.2 Chain address method (separate chaining)

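Records with the same hash address are stored in one linked list. For example, suppose we have the set of keys {12, 13, 25, 23, 38, 84, 6, 91, 34} and the hash function key % 11. Keys 12, 23, and 34 all hash to address 1 and share one list, as do 25 and 91 at address 3:
[Figure: chaining with the keys above and the hash function key % 11]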
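
As a rough sketch of the chain address method, the code below builds one singly linked list per address for the keys in the example. The Node class and the helpers chain_insert and chain_keys are names invented here; appending at the tail of the chain is also an assumption (inserting at the head is equally common), and duplicate keys are not checked.

```python
class Node:
    """One record in a chain (singly linked list)."""

    def __init__(self, key, next_node=None):
        self.key = key
        self.next = next_node


def chain_insert(table, key):
    """Append key to the end of the chain at its hash address."""
    address = key % len(table)
    if table[address] is None:
        table[address] = Node(key)
        return
    node = table[address]
    while node.next is not None:
        node = node.next
    node.next = Node(key)


def chain_keys(table, address):
    """Collect the keys stored in one chain, in insertion order."""
    result = []
    node = table[address]
    while node is not None:
        result.append(node.key)
        node = node.next
    return result


table = [None] * 11
for key in (12, 13, 25, 23, 38, 84, 6, 91, 34):
    chain_insert(table, key)

print(chain_keys(table, 1))   # [12, 23, 34]
print(chain_keys(table, 3))   # [25, 91]
print(chain_keys(table, 0))   # []
```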

Three, the application of hash table

3.1 Basic operation of hash table

In many high-level languages, the hash function and collision handling are already encapsulated as a black box at the lower level, so developers do not need to design them. In other words, the hash table maps keys to addresses, and data can be found by key in constant time on average.

Implementation details, such as which hash function is used, how collisions are handled, or even the hash address of a particular record, are not something developers need to worry about. Next, from a practical development point of view, let's look at how a hash table adds, deletes, and looks up data.

Adding or deleting data in a hash table does not require shifting the other data afterwards (as an array would), so these operations are fast.

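The detailed process of a hash table lookup is: for a given key, the hash address H(key) is computed with the hash function.

  • If the slot at that hash address is empty, the search is unsuccessful.
  • Otherwise, the stored key is compared with the given key. If they match, the search is successful; if not, the search continues according to the collision handling strategy until the key is found or no candidates remain.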

Although the details of a hash table lookup are fairly involved, many high-level languages hide them as a black box, so developers do not need to write the underlying code and can simply call the relevant functions.
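
For instance, Python's built-in dict is a hash table, and adding, deleting, and looking up data are each a single call. The variable name scores and its contents are made up for this example.

```python
scores = {}                   # an empty hash table (Python dict)

scores["alice"] = 90          # add
scores["bob"] = 75

print(scores.get("alice"))    # 90
print(scores.get("carol"))    # None: the key is absent
print("bob" in scores)        # True

del scores["bob"]             # delete
print("bob" in scores)        # False
print(len(scores))            # 1
```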

3.2 Advantages and disadvantages of hash tables

  • Advantages: It provides very fast insert, delete, and search operations. However much data there is, inserting and deleting take close to constant time on average. For search, a hash table is typically faster than a tree: a balanced search tree needs O(log n) time per lookup, while a hash table needs O(1) on average.
  • Disadvantages: The data in a hash table has no concept of order, so the elements cannot be traversed in a fixed order (such as from smallest to largest). When the processing order of the data matters, a hash table is not a good choice. In addition, keys in a hash table cannot be repeated, so a hash table is also a poor fit for data with many duplicate keys.
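
Both disadvantages can be seen with Python's dict. The sample data is hypothetical. Note that a Python dict (3.7 and later) remembers insertion order, but that is still not sorted order, so an ordered traversal needs an explicit sort. Storing a list per key is one possible workaround for repeated keys, not the only one.

```python
ages = {"tom": 31, "ann": 25, "bob": 28}   # hypothetical sample data

# Traversing in key order requires an explicit sort, which costs O(n log n).
ordered = sorted(ages)
print(ordered)        # ['ann', 'bob', 'tom']

# Keys are unique: writing to an existing key replaces the old value.
ages["tom"] = 32
print(len(ages))      # 3
print(ages["tom"])    # 32

# To keep repeated entries, one option is to store a list per key.
visits = {}
for user in ["ann", "bob", "ann"]:
    visits.setdefault(user, []).append("visit")
print(len(visits["ann"]))   # 2
```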

Four, design a hash map

4.1 Design requirements

Requirements:

Design a hash map without using any built-in hash table library. Specifically, the design should include the following functions:

  • put(key, value): Insert the (key, value) pair into the hash map. If the key already exists in the map, update its value.
  • get(key): Return the value corresponding to the given key. If the key is not in the map, return -1.
  • remove(key): If the key exists in the map, delete its key-value pair.

Example:

MyHashMap hashMap = new MyHashMap();
hashMap.put(1, 1);          
hashMap.put(2, 2);         
hashMap.get(1);            // returns 1
hashMap.get(3);            // returns -1 (not found)
hashMap.put(2, 1);         // updates the existing value
hashMap.get(2);            // returns 1
hashMap.remove(2);         // removes the mapping for key 2
hashMap.get(2);            // returns -1 (not found)

Note:

 All values are in the range [0, 1000000].
 The total number of operations is in the range [1, 10000].
 Do not use a built-in hash table library.

4.2 Design Ideas

The hash table is a common data structure available in many languages, for example dict in Python, unordered_map in C++ (map is an ordered, tree-based container rather than a hash table), and HashMap in Java. Its defining feature is that a value can be accessed quickly by its key.

The simplest idea is to use the modulo operation as the hash method. To reduce the probability of hash collisions, a prime number is usually chosen as the modulus, such as 2069.
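
A small experiment hints at why a prime modulus helps when the keys follow a pattern. The key set below (multiples of 10) and the helper buckets_used are assumptions for illustration; the small moduli 10 and 11 stand in for a non-prime and a prime table size.

```python
def buckets_used(keys, modulus):
    """Count how many distinct buckets the keys land in."""
    return len({key % modulus for key in keys})


keys = range(0, 1000, 10)   # 100 hypothetical keys, all multiples of 10

print(buckets_used(keys, 10))   # 1: every key collides in bucket 0
print(buckets_used(keys, 11))   # 11: the keys spread over all buckets
```

With modulus 10 the keys share a common factor with the table size and pile up in one bucket; a prime modulus such as 11 shares no factor with the step between keys, so the keys spread out.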

Define an array as the storage space, and compute the array index with the hash method. To resolve hash collisions (keys that differ but map to the same index), each slot holds a bucket that stores all the key-value pairs mapped to it. Buckets can be implemented as arrays or linked lists. The implementation below uses Python lists.

Define the hash table methods get(), put(), and remove(). The addressing process is as follows:

  • For a given key, use the hash method to generate its hash code, and use the hash code to locate the storage slot. Every hash code leads to a bucket that stores the value for that key.
  • After finding the bucket, traverse it to check whether the key-value pair already exists.


4.3 Practical case

Python implementation is as follows:

class Bucket:
    def __init__(self):
        self.bucket = []

    def get(self, key):
        for (k, v) in self.bucket:
            if k == key:
                return v
        return -1

    def update(self, key, value):
        found = False
        for i, kv in enumerate(self.bucket):
            if key == kv[0]:
                self.bucket[i] = (key, value)
                found = True
                break

        if not found:
            self.bucket.append((key, value))

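    def remove(self, key):
        for i, kv in enumerate(self.bucket):
            if key == kv[0]:
                del self.bucket[i]
                # keys are unique within a bucket, so stop after deleting;
                # this also avoids iterating over a list that was just modified
                break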


class MyHashMap(object):

    def __init__(self):
        """
        Initialize your data structure here.
        """
        # better to be a prime number, less collision
        self.key_space = 2069
        self.hash_table = [Bucket() for i in range(self.key_space)]


    def put(self, key, value):
        """
        value will always be non-negative.
        :type key: int
        :type value: int
        :rtype: None
        """
        hash_key = key % self.key_space
        self.hash_table[hash_key].update(key, value)


    def get(self, key):
        """
        Returns the value to which the specified key is mapped, or -1 if this map contains no mapping for the key
        :type key: int
        :rtype: int
        """
        hash_key = key % self.key_space
        return self.hash_table[hash_key].get(key)


    def remove(self, key):
        """
        Removes the mapping for the specified key if this map contains a mapping for the key
        :type key: int
        :rtype: None
        """
        hash_key = key % self.key_space
        self.hash_table[hash_key].remove(key)


# Your MyHashMap object will be instantiated and called as such:
# obj = MyHashMap()
# obj.put(key,value)
# param_2 = obj.get(key)
# obj.remove(key)

Complexity analysis:

  • Time complexity: Each method takes O(N/K) time, where N is the number of all possible keys and K is the number of predefined buckets in the hash table (here K = 2069). We assume that the keys are evenly distributed across the buckets, so the average bucket size is N/K. In the worst case an entire bucket must be traversed, so the time complexity is O(N/K).
  • Space complexity: O(K+M), where K is the number of predefined buckets in the hash table and M is the number of keys inserted into it.
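
To get a feel for the N/K figure, the sketch below counts bucket sizes for evenly distributed keys. K = 2069 comes from the implementation above; N = 10000 is a hypothetical number of stored keys chosen for this illustration.

```python
K = 2069      # number of buckets, as in the implementation above
N = 10000     # hypothetical number of stored keys

sizes = [0] * K
for key in range(N):          # evenly distributed keys: 0, 1, ..., N-1
    sizes[key % K] += 1

print(N / K)                   # average bucket size, about 4.83
print(min(sizes), max(sizes))  # 4 5
```

Under this even distribution no bucket holds more than 5 entries, so each get, put, or remove scans at most a handful of pairs.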

That's all for today. I hope it helps with your studies!


Make it a habit: like first, then read! Your support is my biggest motivation to keep writing!


Origin blog.csdn.net/wjinjie/article/details/108773366