"Algorithm 4" hash table implementation notes

First, the hash function

The hash function converts a key into an array index. A good hash function should be fast to compute and should distribute keys evenly. For a hash table of size M, the hash function must be able to convert any key into an integer from 0 to M-1. Different key types need different hash functions. Many commonly used classes in Java override the hashCode method so that each data type gets a hash function suited to it.
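
As a minimal sketch of such a function, assuming string keys and a table size m chosen by the caller, Horner's method treats the string as a base-31 number and reduces it modulo m after every character. The class and method names are illustrative and do not come from the original notes.

```java
// Illustrative sketch: maps a string key to an index in [0, m-1].
final class StringHashSketch {
    // Horner's method: treat the string as a base-31 number and reduce
    // modulo m after every character, so the running value stays below m.
    // Assumes m is small enough that 31 * h + c cannot overflow an int.
    static int hash(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            h = (31 * h + key.charAt(i)) % m;
        }
        return h;
    }

    public static void main(String[] args) {
        int m = 997; // table size; a prime, as used later in these notes
        for (String key : new String[] {"apple", "banana", "cherry"}) {
            System.out.println(key + " -> " + hash(key, m));
        }
    }
}
```

Every result lies between 0 and m-1, and equal strings always map to the same index, which is the minimum a hash table needs from its hash function.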

Second, the hash table based on the zipper method (separate chaining)

Ideally, a hash function would convert different keys to different index values, but in practice this is impossible: collisions will occur, so we need a way to handle them.
A direct method is to let each index of an array of size M point to a linked list. Each node in a list stores a key-value pair whose key hashes to the index of that list. This is the zipper method, also known as separate chaining.

The following is the basic data structure:

public class SeparateChainingHashST<Key,Value> {
    /**
     * Total number of key-value pairs
     */
    private int N;
    /**
     * Size of the hash table (number of linked lists)
     */
    private int M;

    private SequentialSearchST<Key,Value>[] st;

    public SeparateChainingHashST() {
        this(997);
    }
    public SeparateChainingHashST(int M) {
        this.M=M;
        st=new SequentialSearchST[M];
        for (int i = 0; i < M; i++) {
            // Initialize a linked list for every index in the array
            st[i]=new SequentialSearchST<>();
        }
    }
}    

SequentialSearchST is the unordered linked list implemented in the earlier notes on sequential search:

public class SequentialSearchST<Key, Value> {
    /**
     * First node of the list
     */
    private Node first;
    private int size;


    private class Node {
        private Key key;
        private Value value;
        private Node next;

        public Node(Key key, Value value, Node next) {
            this.key = key;
            this.value = value;
            this.next = next;
        }
    }

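    /**
     * Looks up the value for a key by walking the list node by node until an
     * equal key is found.
     * @param key the key to search for
     * @return the associated value, or null if the key is absent
     */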
    public Value get(Key key) {
        for (Node x = first; x != null; x = x.next) {

            if (key.equals(x.key)) {
                return x.value;
            }
        }
        return null;
    }

    /**
     * Adds a key-value pair.
     * @param key the key
     * @param value the value
     */
    public void put(Key key, Value value) {
        for (Node x = first; x != null; x = x.next) {
            // The key already exists: update its value
            if (key.equals(x.key)) {
                x.value = value;
                return;
            }
        }
        // The key does not exist: add a new node at the head of the list
        first = new Node(key, value, first);
        size++;
    }
    public boolean isEmpty() {
        return size == 0;
    }

    public int size() {
        return size;
    }
    public boolean contains(Key key) {
        if (key == null) throw new IllegalArgumentException("argument to contains() is null");
        return get(key) != null;
    }

    /**
     * Deletes the node that holds the given key.
     * @param key the key to delete
     */
    public void delete(Key key) {
        if (key == null) throw new IllegalArgumentException("argument to delete() is null");
        first = delete(first, key);
    }

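    /**
     * Searches recursively until an equal key is found, then unlinks that node.
     * @param x the head of the (sub)list to search
     * @param key the key to delete
     * @return the head of the (sub)list after the deletion
     */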
    private Node delete(Node x, Key key) {
        if (x == null) return null;
        if (key.equals(x.key)) {
            size--;
            return x.next;
        }
        x.next = delete(x.next, key);
        return x;
    }
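    public Iterable<Key> keys()  {
        // Queue is assumed to be the book's queue class with enqueue();
        // java.util.Queue is an interface and cannot be instantiated like this
        Queue<Key> queue=new Queue<>();
        // Walk the list with a local cursor; advancing first itself would empty the list
        for (Node x = first; x != null; x = x.next) {
            queue.enqueue(x.key);
        }
        return queue;
    }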
}

Hash calculation:
As mentioned above, every data type in Java provides a hashCode() method. It returns a 32-bit integer, but what we need is an array index, so we combine hashCode() with the remainder method to produce an integer from 0 to M-1. Because the value returned by hashCode() is signed and may be negative, we first mask it with 0x7fffffff to turn it into a 31-bit non-negative integer and then take the remainder modulo M, where M is a fairly large prime number.

 private int hash(Key key){
        return (key.hashCode() & 0x7fffffff) % M;
    }
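
To see why the mask is needed, the following sketch compares the plain remainder, Math.abs, and the masked version on negative hash codes. The class name and the helper index are illustrative; only the expression inside index is taken from the hash method above.

```java
// Illustrative sketch: why hash() masks the sign bit before taking the remainder.
final class HashIndexSketch {
    // Same expression as hash() above, with the hash code passed in directly.
    static int index(int hashCode, int m) {
        return (hashCode & 0x7fffffff) % m;
    }

    public static void main(String[] args) {
        int m = 997;
        // In Java the remainder takes the sign of the dividend, so a negative
        // hash code gives a negative "index".
        System.out.println(-7 % m);
        // Math.abs does not help for Integer.MIN_VALUE, which has no positive
        // counterpart and is returned unchanged.
        System.out.println(Math.abs(Integer.MIN_VALUE) % m);
        // Masking clears the sign bit, so the result is always in [0, m-1].
        System.out.println(index(-7, m));
        System.out.println(index(Integer.MIN_VALUE, m));
    }
}
```

The first two printed values are negative and would throw ArrayIndexOutOfBoundsException if used as array indices; the last two are valid indices.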

With this in place, we can insert, delete, and look up data.
Insertion implementation:
First calculate the hash value of the key to obtain an array index, then insert the key-value pair into the linked list at that index.

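 public void put(Key key,Value value){
        if (key==null){
            throw new IllegalArgumentException("key is null");
        }
        // A null value means "remove this key"
        if (value==null){
            delete(key);
            return;
        }
        // Keep the average list length between 2 and 8; this is the upper bound 8
        if (N>=8*M){
            resize(M*2);
        }
        int i=hash(key);
        // Count the pair only if the key is new to the table
        if (!st[i].contains(key)){
            N++;
        }
        st[i].put(key,value);
    }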

Deletion implementation:
First calculate the hash value of the key to find the linked list it belongs to, then delete the key if it exists in that list.

 /**
     * Deletes the key-value pair for the given key
     * @param key the key to delete
     */
    public void delete(Key key) {
        if (key == null) throw new IllegalArgumentException("argument to delete() is null");

        int i = hash(key);
        if (st[i].contains(key)){
            N--;
        }
        st[i].delete(key);
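        // Keep the average list length between 2 and 8; this is the lower bound 2.
        // The M>1 check keeps the table from shrinking to size 0.
        if (M>1 && N>0 && N<=M*2){
            resize(M/2);
        }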

    }
 public boolean contains(Key key) {
        if (key == null) throw new IllegalArgumentException("key is null");
        return get(key) != null;
    }    

Get value:
Calculate the hash value of the key and return the value stored in the corresponding linked list.

public Value get(Key key){
        if (key == null) return null;

        return st[hash(key)].get(key);
    }

We use M linked lists to store N keys, so no matter how the keys are distributed in the table, the average list length is always N/M.
One advantage of the zipper method is that it is forgiving about the choice of M: if there are more keys than expected, searches only take somewhat longer than they would with a larger array; if there are fewer than expected, some space is wasted but searches are very fast.
Therefore, when memory is plentiful, you can choose M large enough for searches to take constant time; when memory is tight, choosing the largest M you can afford still makes searches about M times faster than sequential search in a single list.
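
The N/M claim can be checked with a short sketch that only counts how many keys land in each list. It is not the SeparateChainingHashST class above; the class name, method names, and sample keys are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: counts the keys per list instead of storing them.
final class ChainLengthSketch {
    // Returns the length of each of the m lists for the given distinct keys.
    static int[] chainLengths(List<?> keys, int m) {
        int[] lengths = new int[m];
        for (Object key : keys) {
            lengths[(key.hashCode() & 0x7fffffff) % m]++;
        }
        return lengths;
    }

    // Average list length; always N / M, whatever the distribution looks like.
    static double average(int[] lengths) {
        long total = 0;
        for (int len : lengths) total += len;
        return (double) total / lengths.length;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add(i * 7919); // arbitrary distinct sample keys
        int[] lengths = chainLengths(keys, 97);
        int longest = 0;
        for (int len : lengths) longest = Math.max(longest, len);
        System.out.println("average = " + average(lengths) + ", longest = " + longest);
    }
}
```

The average is fixed by N and M alone; what a good hash function improves is how close the longest list stays to that average.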

Dynamically adjust the array:
After a deletion, if the average length N/M is at most 2, the array is halved: M / 2.
Before an insertion, if N/M has reached 8, the array is doubled: M * 2.

 private void resize(int capacity){
        SeparateChainingHashST<Key,Value>hashST=new SeparateChainingHashST<>(capacity);
        for (int i = 0; i < M; i++) {
            for (Key key:st[i].keys()){
                if (key!=null){
                    hashST.put(key,st[i].get(key));
                }
            }
        }
        this.M=hashST.M;
        this.N=hashST.N;
        this.st=hashST.st;
    }

Third, the hash table based on the linear detection method (linear probing)

Another way to implement a hash table is to use an array of size M to store N key-value pairs (M > N) and rely on the empty slots in the array to resolve collisions. All methods based on this strategy are called open-addressing hash tables.
The simplest open-addressing method is linear detection: when a collision occurs (the slot at the key's hash value is already occupied by a different key), we check the next position in the table (index + 1). If that slot is taken as well, we keep probing forward until we find an empty position and insert the key-value pair there.
The data structure is as follows:
It uses a Key[] to store the keys and a Value[] to store the value corresponding to each key.

public class LinearProbingHashST<Key, Value> {
    private Key[] keys;
    private Value[] values;

    /**
     * Number of key-value pairs
     */
    private int N;
    /**
     * Size of the linear-probing table
     */
    private int M;

    public LinearProbingHashST(int M) {
        this.M = M;
        keys = (Key[]) new Object[this.M];
        values = (Value[]) new Object[this.M];
    }
}    

Insert operation:
We calculate the hash value of the key to be inserted and check whether that index is already occupied. If the occupying key equals the key being inserted, we update its value; otherwise we probe the following positions until we find either the key or an empty position.

    public void put(Key key, Value value) {
        if (key == null) throw new IllegalArgumentException("first argument to put() is null");
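        // A null value means "remove this key"
        if (value==null){
            delete(key);
            return;
        }
        // Keep the usage rate N/M at or below 1/2; as it approaches 1, the number of probes grows very large
        if (N>=M/2){
            resize(M*2);
        }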
        int i;
        for (i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) {
                // The key to insert already exists: update its value and return
                values[i] = value;
                return;
            }
        }
        // Found an empty position: insert the key-value pair
        keys[i] = key;
        values[i] = value;
        N++;
    }
     private int hash(Key key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

Query operation:
Calculate the hash value of the key. If that position is occupied by a different key, continue probing forward. The traversal runs from hash(key) to the next null position: if the key is found along the way, return its value; otherwise return null.

 public Value get(Key key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) {
                return values[i];
            }
        }
        return null;
    }

For a hash table as follows (a dash marks an empty slot):

index   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
key     P   M   -   -   A   C   S   H   L   -   E   -   -   -   R   X
value  10   9   -   -   8   4   0   5  11   -  12   -   -   -   3   7

The hash value of A is 4, and the hash value of H is also 4. When H was inserted, index 4 was already taken by A and indices 5 and 6 by C and S, so linear probing moved H to index 7. When we look up H, we therefore start at index 4, find that the key there does not match, and keep checking the following indices in order. If the search reached the empty slot at index 9 without a match, H would not be in the table; but H sits at index 7, before index 9, so the search returns the corresponding value 5.
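
The layout described above can be reproduced with a small fixed-size sketch. Only the hash values of A and H (both 4) come from the text; the other hash values and the insertion order S E A R C H E X A M P L E are assumptions chosen to be consistent with the table, and the class is illustrative rather than the LinearProbingHashST shown earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: a fixed-size linear-probing table with hand-picked hash values.
final class ProbeTraceSketch {
    static final int M = 16;
    static final Map<String, Integer> HASH = new HashMap<>();
    static {
        // A -> 4 and H -> 4 are from the text; the rest are assumed values.
        String[] ks = {"S", "E", "A", "R", "C", "H", "X", "M", "P", "L"};
        int[]    hs = { 6,  10,   4,  14,   5,   4,  15,   1,  14,   6};
        for (int i = 0; i < ks.length; i++) HASH.put(ks[i], hs[i]);
    }

    final String[] keys = new String[M];
    final Integer[] values = new Integer[M];

    // Probe forward from the key's hash until the key or an empty slot is found.
    void put(String key, int value) {
        int i = HASH.get(key);
        while (keys[i] != null && !keys[i].equals(key)) i = (i + 1) % M;
        keys[i] = key;
        values[i] = value;
    }

    // Probe forward from the key's hash; an empty slot means the key is absent.
    // Only keys listed in HASH are supported.
    Integer get(String key) {
        for (int i = HASH.get(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i].equals(key)) return values[i];
        }
        return null;
    }

    // Inserts the assumed input sequence, using each key's position as its value.
    static ProbeTraceSketch build() {
        ProbeTraceSketch t = new ProbeTraceSketch();
        String[] input = "S E A R C H E X A M P L E".split(" ");
        for (int v = 0; v < input.length; v++) t.put(input[v], v);
        return t;
    }

    public static void main(String[] args) {
        ProbeTraceSketch t = build();
        for (int i = 0; i < M; i++) {
            System.out.println(i + ": " + (t.keys[i] == null ? "-" : t.keys[i] + " = " + t.values[i]));
        }
        System.out.println("get(H) = " + t.get("H"));
    }
}
```

In this sketch H ends up at index 7 after probing past A, C, and S, and a lookup of H walks the same path from index 4.
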
Delete operation:
For the delete operation, we cannot simply set the slot of the deleted key to null.
Take the same hash table as above. If C is deleted this way, a later search for H, whose hash value is 4, stops at index 5 because that slot is now empty. It returns null and wrongly concludes that H is not in the hash table, although H is still there at index 7.

Therefore, after deleting a key, all keys between the position after the deleted key and the next empty position must be re-inserted to avoid this error.

  public void delete(Key key) {
        if (!contains(key)) {
            return;
        }
        int i=hash(key);
        // Probe linearly to find the index of the key to delete
        while (!keys[i].equals(key)){
            i=(i+1)%M;
        }
        // Clear the slot
        keys[i]=null;
        values[i]=null;
        // Re-insert every key from position i+1 up to the next empty position
        i=(i+1)%M;
        while (keys[i]!=null){
            Key oldKey=keys[i];
            Value oldValue=values[i];
            keys[i]=null;
            values[i]=null;

            N--;
            put(oldKey,oldValue);
            i=(i+1)%M;
        }
        N--;
        if (N>0 && N <= M/8){
            resize(M/2);
        }
    }
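
The difference between simply clearing a slot and clearing it with re-insertion can be seen in a small self-contained sketch. It assumes int keys with hash(k) = k % 16 and a table that never fills up; the class and method names are illustrative and this is not the LinearProbingHashST above.

```java
// Illustrative sketch: naive deletion versus deletion with re-insertion.
final class ProbeDeleteSketch {
    static final int M = 16;
    final Integer[] keys = new Integer[M];

    static int hash(int key) { return (key & 0x7fffffff) % M; }

    void add(int key) {
        int i = hash(key);
        while (keys[i] != null && keys[i] != key) i = (i + 1) % M;
        keys[i] = key;
    }

    boolean contains(int key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i] == key) return true;
        }
        return false;
    }

    // Wrong: only clears the slot, which can cut a run of colliding keys in two.
    void naiveDelete(int key) {
        for (int i = hash(key); keys[i] != null; i = (i + 1) % M) {
            if (keys[i] == key) { keys[i] = null; return; }
        }
    }

    // Clears the slot, then re-inserts every key up to the next empty slot.
    void delete(int key) {
        if (!contains(key)) return;
        int i = hash(key);
        while (keys[i] != key) i = (i + 1) % M;
        keys[i] = null;
        for (i = (i + 1) % M; keys[i] != null; i = (i + 1) % M) {
            int moved = keys[i];
            keys[i] = null;
            add(moved);
        }
    }

    public static void main(String[] args) {
        ProbeDeleteSketch naive = new ProbeDeleteSketch();
        ProbeDeleteSketch fixed = new ProbeDeleteSketch();
        for (int k : new int[] {4, 20, 36}) { // all three keys hash to 4
            naive.add(k);
            fixed.add(k);
        }
        naive.naiveDelete(20);
        fixed.delete(20);
        System.out.println("naive still finds 36: " + naive.contains(36));
        System.out.println("fixed still finds 36: " + fixed.contains(36));
    }
}
```

After the naive deletion the search for 36 stops at the emptied slot at index 5, while the re-inserting version moves 36 into that slot and still finds it.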

In addition, let α = N/M; we call α the usage rate of the hash table.
The following conclusion is given in "Algorithm 4":

In a linear-detection hash table of size M containing N keys, if the hash function distributes all keys uniformly and independently between 0 and M-1, then the average numbers of probes required for search hits and for search misses are approximately:
$\frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right)$ and $\frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)$, respectively.

That is, when the hash table is almost full (α approaches 1), the number of probes required for a search becomes huge, but when the usage rate α is at most 1/2, the expected number of probes is at most about 1.5 for a hit and 2.5 for a miss, so we must keep α no greater than 1/2.
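
To get a feel for these numbers, the two estimates can be evaluated for a few usage rates. This is a direct transcription of the formulas into Java; the class and method names are illustrative.

```java
// Illustrative sketch: evaluates the two probe-count estimates for a usage rate alpha.
final class ProbeCostSketch {
    // Estimated probes for a search hit: 1/2 * (1 + 1/(1 - alpha)).
    static double hitProbes(double alpha) {
        return 0.5 * (1 + 1 / (1 - alpha));
    }

    // Estimated probes for a search miss: 1/2 * (1 + 1/(1 - alpha)^2).
    static double missProbes(double alpha) {
        return 0.5 * (1 + 1 / ((1 - alpha) * (1 - alpha)));
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.25, 0.5, 0.75, 0.9}) {
            System.out.println("alpha = " + alpha
                    + "  hit = " + hitProbes(alpha)
                    + "  miss = " + missProbes(alpha));
        }
    }
}
```

At α = 1/2 the estimates are 1.5 and 2.5, while at α = 0.9 a search miss is already estimated at about 50 probes.
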
Based on this, we need to dynamically resize the array before insertion and after deletion:

Check before inserting:

        // Keep the usage rate N/M from exceeding 1/2; as it approaches 1, the number of probes grows very large
        if (N>=M/2){
            resize(M*2);
        }

Check after deletion:
This ensures that the amount of memory used always stays within a constant factor of the number of key-value pairs in the table.

        // Halve the array if N/M is 12.5% or less
        if (N>0 && N <= M/8){
            resize(M/2);
        }

Origin blog.csdn.net/weixin_43696529/article/details/104731252