Principles and implementation of hash tables

This article introduces the principles and implementation of the hash table, a common data structure. Owing to the limits of my own knowledge, there are bound to be inaccurate or unclear parts; I hope readers will point them out :)

Overview

A symbol table is a data structure that stores key-value pairs. The ordinary array can be seen as a special symbol table: the "key" is the array index and the value is the corresponding array element. In other words, when all keys in a symbol table are small integers, we can implement the symbol table with an array, using the index as the key and the element at that index as the value. This works only when all keys are small integers, though; otherwise an enormous array would be needed. The hash table is an "upgrade" of this strategy that supports arbitrary keys without placing heavy restrictions on them. To look up a key in a symbol table based on a hash table, we perform the following steps:

  • First, we use a hash function to convert the given key into an "array index". Ideally, different keys would be converted to different indexes, but in practice we will encounter distinct keys that convert to the same index. This situation is called a collision; methods for resolving collisions are introduced in detail later.
  • After getting the index, we can access the corresponding key-value pair through it, just like accessing an array.
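
In code, the two steps look roughly like this (a schematic sketch, not a complete implementation; hash, buckets, and Value are placeholder names):

int index = hash(key);        // step 1: hash the key to an array index
Value value = buckets[index]; // step 2: access the bucket at that index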

This is the core idea of the hash table, and it is a classic example of a space-time trade-off. If space were unlimited, we could use a huge array and key it directly by the keys themselves; since space would be no constraint, keys could take arbitrary values, and any lookup would need just one ordinary array access. Conversely, if lookup time were unlimited, we could store all key-value pairs in a linked list, minimizing space at the cost of sequential search. In practice both time and space are limited, so we must trade one against the other, and the hash table strikes a good balance between them. A further advantage of the hash table is that we can shift this balance simply by tuning the parameters of the hash algorithm, without changing any other part of the code.

Hash function

Before introducing hash functions, let's define a few basic terms. Inside a hash table, key-value pairs are stored in buckets. The "array index" mentioned earlier is the bucket number, which determines which bucket of the hash table a given key is stored in. The number of buckets a hash table has is called its capacity.

Now suppose our hash table has M buckets, numbered 0 to M-1. The job of the hash function is to convert any given key into an integer in [0, M-1]. We have two basic requirements for a hash function: it should be fast to compute, and it should spread the keys across the buckets as evenly as possible. Different types of keys call for different hash functions to achieve a good distribution.
The hash function we use should satisfy the uniform hashing assumption as far as possible. The following definition of the uniform hashing assumption comes from Sedgewick's "Algorithms":

(Uniform hashing assumption) The hash function we use distributes all keys uniformly and independently among the integers between 0 and M-1.

There are two keywords in this definition. The first is uniform: the bucket number computed for each key has M candidate values, and uniformity requires that each of these M values be chosen with equal probability. The second is independent: the bucket number chosen for one key is independent of the bucket numbers chosen for all other keys. Together, uniformity and independence ensure that key-value pairs are distributed across the hash table as evenly as possible, avoiding the situation where many pairs hash to the same bucket while many buckets sit empty.
Clearly, designing a hash function that satisfies the uniform hashing assumption is not easy. The good news is that we usually don't need to design one ourselves, because efficient implementations grounded in probability and statistics already exist. For example, many commonly used Java classes override the hashCode method (by default, the hashCode method of the Object class returns a value derived from the object's memory address), which returns a hash code for an object of that type. We then typically take this hashCode modulo the number of buckets M to obtain a bucket number. Let's look at a few Java classes to see how hash functions are implemented for different data types.

The hashCode method of the String class

The hashCode method of the String class is as follows:

public int hashCode() { 
  int h = hash; 
  if (h == 0 && value.length > 0) { 
    char val[] = value; 
    for (int i = 0; i < value.length; i++) { 
      h = 31 * h + val[i]; 
    } 
    hash = h; 
  } 
  return h;
}

The value field used in hashCode is a char[] array holding the characters of the string. At the very start of the method, the field hash is assigned to h; hash caches a previously computed hashCode, so if the hashCode of this string object has already been computed, we simply return the cached result instead of recomputing it. This caching strategy only works for immutable objects, because the hashCode of an immutable object can never change.
From the code we can see that if h is 0, we are computing the hashCode for the first time, and the body of the if statement performs the actual computation. Suppose our string object str contains 4 characters, and let ck denote the k-th character of the string (counting from 0); then the hashCode of str equals: 31 * (31 * (31 * c0 + c1) + c2) + c3.
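
As a quick check, here is a minimal sketch that recomputes the hashCode of a four-character string by hand and compares it with the built-in result:

String str = "abcd";
int h = 0;
for (int i = 0; i < str.length(); i++) {
  h = 31 * h + str.charAt(i);            // same recurrence as String.hashCode
}
// h == 31 * (31 * (31 * 'a' + 'b') + 'c') + 'd'
System.out.println(h == str.hashCode()); // true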

The hashCode method of numeric types

Here we take Integer and Double as examples to introduce the general implementation of the hashCode method of numeric types.
The hashCode method of the Integer class is as follows:

public int hashCode() { 
  return Integer.hashCode(value);
}
public static int hashCode(int value) { 
  return value;
}

Here value is the int wrapped by the Integer object, so the hashCode method of the Integer class simply returns the wrapped value itself.

Let's look at the hashCode method of the Double class again:

@Override
public int hashCode() { 
  return Double.hashCode(value);
}
public static int hashCode(double value) { 
  long bits = doubleToLongBits(value); 
  return (int)(bits ^ (bits >>> 32));
}

We can see that the hashCode method of the Double class first converts its value into a long bit pattern, and then returns the XOR of the lower 32 bits and the upper 32 bits as the hashCode.
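
A quick illustration of this folding using the public API (a small sketch): XORing the two halves mixes information from the entire 64-bit pattern into the 32-bit result:

double d = 3.14;
long bits = Double.doubleToLongBits(d);
int hi = (int) (bits >>> 32);                        // upper 32 bits
int lo = (int) bits;                                 // lower 32 bits
System.out.println((hi ^ lo) == Double.hashCode(d)); // true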

The hashCode method of the Date class

The data types introduced so far can all be viewed as numeric (a String can be viewed as an array of integers), so how do we compute the hashCode of a non-numeric object? Here we take the Date class as a brief example. The hashCode method of the Date class is as follows:

public int hashCode() { 
  long ht = this.getTime(); 
  return (int) ht ^ (int) (ht >> 32);
}

We can see that its hashCode implementation is very simple: it returns the XOR of the low 32 bits and the high 32 bits of the timestamp encapsulated by the Date object. The lesson from Date's hashCode is that, for a non-numeric type, we should select the instance fields that distinguish instances of the class and use them as inputs to the computation. For Date, two objects representing the same point in time are usually considered equal and must therefore have the same hashCode. Note the general contract here: if two objects are equivalent (that is, equals returns true), their hashCodes must be equal. The converse does not hold: two objects with the same hashCode are not necessarily equal.
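
A quick demonstration of this contract with java.util.Date (a minimal sketch):

Date d1 = new Date(1000L);  // both represent the same instant
Date d2 = new Date(1000L);
System.out.println(d1.equals(d2));                  // true
System.out.println(d1.hashCode() == d2.hashCode()); // true: equal objects must agree on hashCode
// The converse fails: unequal objects may happen to share a hashCode.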

Obtain the bucket number from hashCode

Earlier we saw several ways to compute an object's hashCode. Once we have the hashCode, how do we turn it into a bucket number? A straightforward way is to take the hashCode modulo the capacity (the number of buckets) and use the remainder as the bucket number. But in Java, hashCode returns a signed int, so using the returned value directly could yield a negative number, and a bucket number obviously cannot be negative. We therefore first convert the returned hashCode into a non-negative integer, then take it modulo the capacity to get the key's bucket number. The code is as follows:

private int hash(K key) {
  return (key.hashCode() & 0x7fffffff) % M;
}
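
Masking with 0x7fffffff clears the sign bit. Note that Math.abs is not a safe substitute here, because of int overflow at the extreme value:

System.out.println(Math.abs(Integer.MIN_VALUE));    // -2147483648: overflow keeps it negative
System.out.println(Integer.MIN_VALUE & 0x7fffffff); // 0: sign bit cleared, result non-negative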

Now that we know how to obtain a bucket number from a key, let's turn to the second ingredient of hash table lookup: dealing with collisions.

Use the zipper method (separate chaining) to handle collisions

Different collision-resolution methods yield different hash table implementations. The first we introduce is the zipper method, known in English as separate chaining. In a hash table implemented this way, each bucket stores a linked list, and initially all lists are empty. When a key is hashed to a bucket, it becomes the first node of that bucket's list; if another key is later hashed to the same bucket (that is, a collision occurs), it becomes the second node, and so on. Thus, with M buckets holding N key-value pairs, the average list length per bucket is N/M. To look up a key, we first locate its bucket via the hash function, which takes O(1) time; we then compare the given key with the keys of the nodes in that bucket sequentially, which takes O(N/M) time on average. The lookup operation therefore costs O(N/M) overall, and since we can usually keep N within a constant multiple of M, hash table lookup runs in O(1) time; by the same reasoning, insertion is also O(1).

With this description in hand, implementing a hash table based on separate chaining is straightforward. For simplicity, we use SeqSearchST (shown below) as the linked list inside each bucket. The reference code is as follows:

public class ChainingHashMap<K, V> { 
  private int num;                 // total number of key-value pairs in the table
  private int capacity;            // number of buckets
  private SeqSearchST<K, V>[] st;  // array of linked-list symbol tables

  public ChainingHashMap(int initialCapacity) { 
    capacity = initialCapacity; 
    // Generic array creation: allocate a raw SeqSearchST[] and cast.
    // (Casting a new Object[] would fail at runtime with ClassCastException.)
    st = (SeqSearchST<K, V>[]) new SeqSearchST[capacity]; 
    for (int i = 0; i < capacity; i++) { 
      st[i] = new SeqSearchST<>(); 
    } 
  } 
  
  private int hash(K key) { 
    return (key.hashCode() & 0x7fffffff) % capacity; 
  } 
  
  public V get(K key) { 
    return st[hash(key)].get(key); 
  }

  public void put(K key, V value) { 
    int i = hash(key); 
    if (st[i].get(key) == null) { 
      num++;                       // a new key is being added
    } 
    st[i].put(key, value); 
  }
} 

In the implementation above, the number of buckets is fixed. When we know in advance that the number of key-value pairs to insert will stay within a constant multiple of the number of buckets, a fixed bucket count is perfectly workable. But if the number of key-value pairs can grow far beyond the number of buckets, we need the ability to resize dynamically. The ratio of the number of key-value pairs to the number of buckets is called the load factor. In general, the smaller the load factor, the shorter the search time but the greater the space usage; a larger load factor lengthens searches but reduces space usage. For example, HashMap in the Java standard library is a hash table based on separate chaining, and its default load factor is 0.75. HashMap resizes according to the relation threshold = capacity * loadFactor, where capacity is the number of buckets and threshold is the maximum number of key-value pairs it holds before growing; loadFactor and capacity are specified by the user at construction time or take default values. When the number of key-value pairs in the HashMap exceeds the threshold, the number of buckets is increased.
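
For illustration, the arithmetic of HashMap's default resize trigger (a sketch; variable names simplified):

int capacity = 16;          // HashMap's default initial capacity
float loadFactor = 0.75f;   // HashMap's default load factor
int threshold = (int) (capacity * loadFactor); // 12
// Once the number of entries exceeds this threshold, HashMap doubles its capacity.
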
The code above also uses SeqSearchST, which is a symbol table implementation based on a linked list. It supports adding key-value pairs, and it looks up a given key by sequential search. Its code is as follows:

public class SeqSearchST<K, V> { 
  private Node first; 
  
  private class Node { 
    K key; 
    V val; 
    Node next; 
    public Node(K key, V val, Node next) { 
      this.key = key; 
      this.val = val; 
      this.next = next; 
    } 
  } 

  public V get(K key) { 
    for (Node node = first; node != null; node = node.next) { 
      if (key.equals(node.key)) { 
        return node.val; 
      } 
    } 
    return null; 
  } 

  public void put(K key, V val) { 
    //First check whether the key already exists in the table 
    Node node; 
    for (node = first; node != null; node = node.next) { 
      if (key.equals(node.key)) { 
        node.val = val; 
        return; 
      } 
    } 
    //The key is not in the table yet; insert a new node at the head 
    first = new Node(key, val, first); 
  }
}
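
A short usage sketch of the chaining implementation above:

ChainingHashMap<String, Integer> map = new ChainingHashMap<>(16);
map.put("apple", 1);
map.put("banana", 2);
map.put("apple", 3);                   // same key: the value is updated
System.out.println(map.get("apple"));  // 3
System.out.println(map.get("pear"));   // null: key not present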

Use linear probing to handle collisions

Basic principles and implementation

Linear probing is a concrete method belonging to another hash table implementation strategy, called open addressing. The main idea of open addressing is to store N key-value pairs in an array of size M, where M > N, and to use the empty positions in the array to resolve collisions.

The main idea of linear probing is: when a collision occurs (a key is hashed to an array position that already holds a different key-value pair), we examine the next position in the array. This process is called linear probing, and a probe can produce one of three results:

  • Hit: the key at this position equals the key being searched for;
  • Miss: the position is empty;
  • Neither: the key at this position differs from the key being searched for, so we continue probing.

To look up a key, we first obtain an array index from the hash function, then check whether the key at that position equals the given key. If it differs, we keep probing the next position (wrapping around to the beginning of the array if we reach the end) until we either find the key or encounter an empty position. One consequence of this process: if we insert a new key when the array is full, we fall into an infinite loop, so the array must never be allowed to fill up.
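
For example (with assumed values): suppose the capacity is M = 8, hash(key) = 6, slots 6 and 7 already hold other keys, and slot 0 is empty. A search for key probes index 6 (different key), then index 7 (different key), then wraps around to index 0; finding that slot empty, it reports a miss. An insertion of key would place it in slot 0.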

With these principles understood, implementing a hash table based on linear probing is not difficult. Here we use the array keys to store the keys of the hash table and the array values to store the values; the elements at the same index of the two arrays together form one key-value pair. The code is as follows:

public class LinearProbingHashMap<K, V> { 
  private int num; // number of key-value pairs in the hash table 
  private int capacity; 
  private K[] keys; 
  private V[] values; 

  public LinearProbingHashMap(int capacity) { 
    keys = (K[]) new Object[capacity]; 
    values = (V[]) new Object[capacity]; 
    this.capacity = capacity; 
  } 

  private int hash(K key) { 
    return (key.hashCode() & 0x7fffffff) % capacity; 
  } 
  
  public V get(K key) { 
    int index = hash(key); 
    while (keys[index] != null && !key.equals(keys[index])) { 
      index = (index + 1) % capacity; 
    } 
    return values[index]; // if the key exists in the table, its value is returned; otherwise null is returned here 
  }
  
  public void put(K key, V value) { 
    int index = hash(key); 
    while (keys[index] != null && !key.equals(keys[index])) { 
      index = (index + 1) % capacity; 
    } 
    if (keys[index] == null) {   // reached an empty slot: the key is new
      keys[index] = key; 
      values[index] = value; 
      num++; 
      return; 
    } 
    values[index] = value;       // key already present: just update its value
  }
}
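
A short usage sketch; with Integer keys the hashCode is the value itself, so with capacity 4 the keys 1 and 5 both hash to index 1, and the second insertion is placed by probing (values chosen to force a collision):

LinearProbingHashMap<Integer, String> map = new LinearProbingHashMap<>(4);
map.put(1, "one");              // hash(1) = 1 % 4 = 1 -> stored in slot 1
map.put(5, "five");             // hash(5) = 5 % 4 = 1 -> collision, probes to slot 2
System.out.println(map.get(5)); // five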

Dynamically adjust the array size

The implementation above fixes the array size at construction time and does not support dynamic resizing. In practice, as the load factor (the ratio of key-value pairs to array size) approaches 1, the time complexity of search approaches O(n); and at load factor exactly 1 the while loop above becomes an infinite loop. Clearly we want neither O(n) searches nor infinite loops, so the array must grow dynamically to keep search time constant. Conversely, when the number of key-value pairs is small and space is tight, the array can also be shrunk, depending on the actual situation.

To resize the array dynamically, we only need to add the following check at the beginning of the put method above:

if (num == capacity / 2) { 
  resize(2 * capacity); 
}

The logic of the resize method is also very simple:

private void resize(int newCapacity) { 
  LinearProbingHashMap<K, V> hashmap = new LinearProbingHashMap<>(newCapacity); 
  for (int i = 0; i < capacity; i++) { 
    if (keys[i] != null) { 
      hashmap.put(keys[i], values[i]); 
    } 
  } 
  keys = hashmap.keys; 
  values = hashmap.values; 
  capacity = hashmap.capacity; 
}

Regarding the relationship between the load factor and search performance, here is a conclusion from "Algorithms" (Sedgewick et al.):

In a linear-probing hash table of size M containing N = a*M keys (a is the load factor), if the hash function satisfies the uniform hashing assumption, the average number of probes required for search hits and search misses is ~ 1/2 * (1 + 1/(1-a)) and ~ 1/2 * (1 + 1/(1-a)^2), respectively.

From this conclusion, the main takeaway is that when a is about 1/2, a search hit requires about 1.5 probes on average and a search miss about 2.5. Another point is that as a approaches 1, the accuracy of these estimates degrades; but in practice we never let the load factor get close to 1. To maintain good performance, we should keep a no greater than 1/2.
