Data structure: Hash table

Hash Table Basics

What is a hash function?

Reading or modifying an array element by its index takes O(1) time. If an object's "key" can be mapped to a number, we can exploit this property by converting the key into an array index. This conversion is exactly what a hash function does.

Thoughts on converting real-life "keys" into indexes

Suppose a class has 30 students with student numbers 1 to 30. The student number minus one can then serve as the array index, and an array of length 30 holds the information of all 30 students. This conversion from "key" to index is about as simple as it gets.
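The student-number case can be sketched in a few lines. This is only an illustration: the class name StudentRoster and its methods are made up for this example, and it assumes student numbers run from 1 to 30.

```java
class StudentRoster {
    // One slot per student; student numbers are assumed to run from 1 to 30.
    private final String[] names = new String[30];

    // The "hash function" here is trivial: student number minus one.
    static int indexOf(int studentNumber) {
        return studentNumber - 1;
    }

    void put(int studentNumber, String name) {
        names[indexOf(studentNumber)] = name;
    }

    String get(int studentNumber) {
        return names[indexOf(studentNumber)];
    }

    public static void main(String[] args) {
        StudentRoster roster = new StudentRoster();
        roster.put(1, "Alice");
        roster.put(30, "Bob");
        System.out.println(roster.get(1));  // Alice
        System.out.println(roster.get(30)); // Bob
    }
}
```

Both put and get touch exactly one array slot, which is where the O(1) behavior comes from.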

In most cases, however, the data we process is more complex. Suppose we want to store residents' information, where each resident's unique identifier is an 18-digit ID number. Such a number far exceeds the int range, so it cannot be used directly as an array index. Even if it could, the array would need an enormous amount of space, and every slot that corresponds to a number with 17 or fewer digits would sit unused, which is a huge waste.

What's more, some unique identifiers have no direct relationship with numbers at all. The most common case is strings. Take student information as an example again: if we use the student's name as the "key", the key is a string. How do we design a hash function that converts a string into a number? This is the first problem to consider when designing a hash table.

With the student number as the key, the hash function produces a unique index for every key, and the index range is small enough to store conveniently in an array. For other data types such as strings, dates, and floating-point numbers, however, it is hard to guarantee that every key maps to a different index. When two different keys produce the same index after conversion by our hash function, we call it a "hash conflict" (also known as a hash collision). This is the second problem to solve when designing a hash table.
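A hash conflict is easy to reproduce with Java's built-in String.hashCode. The sketch below uses the well-known pair "Aa" and "BB"; the helper indexFor and the capacity 97 are illustrative choices, not part of any library.

```java
class HashConflictDemo {
    // Maps a key to a slot in a table with `capacity` slots (illustrative helper).
    static int indexFor(String key, int capacity) {
        return (key.hashCode() & 0x7fffffff) % capacity;
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112 and 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println("Aa".hashCode());     // 2112
        System.out.println("BB".hashCode());     // 2112
        System.out.println(indexFor("Aa", 97));  // 75
        System.out.println(indexFor("BB", 97));  // 75
    }
}
```

The two keys are different strings, yet they land on the same index, so the table needs some way to keep both.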

Thinking about time and space

Hash tables fully embody a classic idea in algorithm design: trading space for time. Take the ID number example above. If we could allocate an array with one slot for every possible 18-digit number (on the order of 10^18 slots), looking up a resident would take O(1) time. At the other extreme, if we can only allocate an array with a single slot, every key is converted to the same index and all data conflicts. If we then store the conflicting data in a structure such as a linked list, a lookup takes O(n) time.

These are the two extremes: in one, the space is huge and the time cost is tiny; in the other, the space is tiny and the time cost is large. A hash table is a balance between time and space.

Hash function design

A hash function design should follow these principles:

Consistency: if a == b, then hash(a) == hash(b).
Efficiency: the calculation is fast and simple.
Uniformity: the more evenly the resulting indexes are distributed, the better.

So how to design a hash function?

This depends on the specific problem, because many fields have their own specialized hash function designs. This article uses Java's int as the index type and looks at how to hash a few common kinds of keys:

Small-range positive integers can be used directly as array indexes. Small-range negative integers can be shifted by an offset: for example, adding 100 to every number in the interval [-100, 100] maps it to [0, 200].
For large-range integers such as the long type, a common approach is to take the remainder (modulo). For example, how do we turn an 18-digit ID number into a smaller int? One option is to keep only the last 4 digits, that is, to take the number modulo 10000. In practice, however, the modulus is usually a prime number: taking the remainder modulo a prime helps avoid unevenly distributed indexes and makes better use of all the digits of the large integer. There is number theory behind this that we don't need to dig into, but we can check the uniformity, and the lower chance of hash conflicts, with an example:

Number	Modulo 4 (not prime)	Modulo 7 (prime)
10	2	3
20	0	6
30	2	2
40	0	5
50	2	1
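The comparison in the table can be reproduced with a short program. This is a sketch for this particular set of numbers only; the class and method names are made up for illustration, and a different data set could behave differently.

```java
import java.util.HashSet;
import java.util.Set;

class ModulusComparison {
    // Counts how many distinct indexes the keys occupy under the given modulus.
    static int distinctIndexes(int[] keys, int modulus) {
        Set<Integer> seen = new HashSet<>();
        for (int key : keys) {
            seen.add(key % modulus);
        }
        return seen.size();
    }

    public static void main(String[] args) {
        int[] keys = {10, 20, 30, 40, 50};
        for (int key : keys) {
            System.out.println(key + "\t" + (key % 4) + "\t" + (key % 7));
        }
        System.out.println(distinctIndexes(keys, 4)); // 2: only indexes 0 and 2 are used
        System.out.println(distinctIndexes(keys, 7)); // 5: every key gets its own index
    }
}
```

With modulus 4 the five keys crowd into two indexes, while modulus 7 gives each key its own index.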

How do we design a hash function for strings? A string can also be treated as a large integer. Each character is treated as a digit, and with 26 letters the string becomes a base-26 number. For example:

100 in decimal can be written as 1 * 10² + 0 * 10¹ + 0 * 10⁰

Strings work the same way. For example, the word "code" can be written as c * 26³ + o * 26² + d * 26¹ + e * 26⁰, where c, o, d, and e stand for the numeric values assigned to those letters in base 26. The hash function can then be designed as:

hash(code) = (c * 26³ + o * 26² + d * 26¹ + e * 26⁰) % M, where M is a prime number.
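A minimal sketch of this formula follows. It assumes the key contains only lowercase letters, with a = 0 through z = 25 (the original does not fix the letter values), and it picks M = 97 purely for illustration. The polynomial is evaluated with Horner's rule, taking the remainder at every step so the intermediate value stays small.

```java
class Base26Hash {
    // Assumes lowercase 'a'..'z' only, mapped to 0..25. For int arithmetic to be
    // safe, m should be small enough that m * 26 does not overflow an int.
    static int hash(String key, int m) {
        int h = 0;
        for (int i = 0; i < key.length(); i++) {
            char ch = key.charAt(i);
            if (ch < 'a' || ch > 'z') {
                throw new IllegalArgumentException("only lowercase a-z is supported: " + key);
            }
            // Horner's rule: ((c * 26 + o) * 26 + d) * 26 + e, reduced modulo m each step.
            h = (h * 26 + (ch - 'a')) % m;
        }
        return h;
    }

    public static void main(String[] args) {
        // c=2, o=14, d=3, e=4: 2*26^3 + 14*26^2 + 3*26 + 4 = 44698, and 44698 % 97 = 78
        System.out.println(hash("code", 97)); // 78
    }
}
```

Reducing at each step gives the same result as reducing once at the end, but it avoids overflow for long strings.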

hashCode in Java

Java provides the hashCode method so that we can obtain a hash value for any object. For existing classes, you can call hashCode directly. For your own classes, you can override hashCode to define how the value is computed.

/**
 * Created by SunnyDay on 2022/05/06 17:55
 */
public class Student {

    private int age;
    private String name;
    private String sex;

    // Mainly used to compute the hash value. Assumes name and sex are non-null.
    @Override
    public int hashCode() {
        int M = 31;
        int hash = 0;
        hash = hash * M + age;
        hash = hash * M + name.hashCode();
        hash = hash * M + sex.hashCode();
        return hash;
    }

    // When two objects have the same hash value, equals decides whether they are really equal.
    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (null == obj) return false;
        if (obj.getClass() != this.getClass()) return false;

        Student another = (Student) obj;
        return this.age == another.age &&
                this.name.equals(another.name) &&
                this.sex.equals(another.sex);
    }
}
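As a point of comparison, the same idea can be written with the helpers in java.util.Objects, which also tolerate null fields. This is an alternative sketch, not the class above: the name StudentKey and its constructor are made up for this example.

```java
import java.util.Objects;

final class StudentKey {
    private final int age;
    private final String name;
    private final String sex;

    StudentKey(int age, String name, String sex) {
        this.age = age;
        this.name = name;
        this.sex = sex;
    }

    // Objects.hash combines the fields with a multiplier of 31 and treats null as 0.
    @Override
    public int hashCode() {
        return Objects.hash(age, name, sex);
    }

    @Override
    public boolean equals(Object obj) {
        if (this == obj) return true;
        if (obj == null || obj.getClass() != getClass()) return false;
        StudentKey other = (StudentKey) obj;
        return age == other.age
                && Objects.equals(name, other.name)
                && Objects.equals(sex, other.sex);
    }

    public static void main(String[] args) {
        StudentKey a = new StudentKey(18, "Tom", "M");
        StudentKey b = new StudentKey(18, "Tom", "M");
        System.out.println(a.equals(b));                  // true
        System.out.println(a.hashCode() == b.hashCode()); // true
    }
}
```

The two objects are equal, so the consistency rule requires them to have the same hash value, which the last line checks.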

Note, however, that Java's hashCode method returns a 32-bit signed int, so the value may be negative. Turning it into a valid array index is the job of our own hash table. This division of labor is reasonable: a hash table usually takes the hash value modulo a prime number, and that prime is usually the capacity of the table. A class knows nothing about the table it will be stored in, so it cannot know the prime, and the index cannot be computed when the class is defined. This is the design consideration behind Java's hashCode.

Implementing a hash table

First, think about how to design the hash table. We need to solve two problems:

Design of hash function

We can obtain a hash value from Java's hashCode method, but the value may be negative, so we have to handle that ourselves. The index into the hash table is then derived from the capacity of the array.

First, obtain a hash value from Java's hashCode method.
Second, make the hash value non-negative (hashCode returns an int that may be negative).
Finally, take the result modulo the array capacity to obtain the index (the capacity is usually a prime number, which helps spread the indexes evenly).
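The three steps can be sketched as a single expression. The class and method names below are illustrative; the mask 0x7fffffff clears the sign bit, which is used here because Math.abs does not help for Integer.MIN_VALUE.

```java
class IndexFromHashCode {
    // Clears the sign bit, then maps the non-negative value into [0, capacity).
    static int indexFor(int hashCode, int capacity) {
        return (hashCode & 0x7fffffff) % capacity;
    }

    public static void main(String[] args) {
        // Math.abs(Integer.MIN_VALUE) overflows and stays negative.
        System.out.println(Math.abs(Integer.MIN_VALUE) < 0);   // true
        System.out.println(indexFor(Integer.MIN_VALUE, 97));   // 0
        System.out.println(indexFor("hello".hashCode(), 97));  // some index in 0..96
    }
}
```

Math.floorMod(hashCode, capacity) is another way to get a non-negative index; it maps negative values to different slots than the mask does, but either choice works as long as the table uses it consistently.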

Resolution of hash conflicts

Even with a well-chosen prime in the modulo operation, hash conflicts still occur, so they have to be resolved. The most commonly used solution is the chain address method (separate chaining): every array slot holds a collection of all the elements whose keys map to that index.
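Before moving to the full implementation, here is a minimal sketch of the chain address method using java.util.LinkedList for each slot. The class ChainedTable, its String keys, and its int values are simplifications chosen for this example.

```java
import java.util.LinkedList;

class ChainedTable {
    private static final class Entry {
        final String key;
        int value;

        Entry(String key, int value) {
            this.key = key;
            this.value = value;
        }
    }

    private final LinkedList<Entry>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedTable(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int hash(String key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length;
    }

    // Overwrites the value if the key is already in the chain, otherwise appends.
    void put(String key, int value) {
        LinkedList<Entry> chain = buckets[hash(key)];
        for (Entry e : chain) {
            if (e.key.equals(key)) {
                e.value = value;
                return;
            }
        }
        chain.add(new Entry(key, value));
    }

    // Returns the stored value, or null if the key is absent.
    Integer get(String key) {
        for (Entry e : buckets[hash(key)]) {
            if (e.key.equals(key)) {
                return e.value;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(1); // one slot forces every key to conflict
        table.put("Aa", 1);
        table.put("BB", 2);
        System.out.println(table.get("Aa")); // 1
        System.out.println(table.get("BB")); // 2
    }
}
```

With a capacity of 1 every key conflicts, yet both values are still found, because equals is used to tell the keys in a chain apart.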


First Edition: Basic Implementation

Before Java 8, each slot of HashMap held a linked list. Starting with Java 8, when the conflicts in a slot reach a certain level, its linked list is converted into a red-black tree.

With the chain address method we don't have to write our own linked list nodes. Any lookup structure can serve as the container for a slot, and Java's TreeMap is backed by a red-black tree, so we can use a TreeMap for each address and write a first version:

import java.util.TreeMap;

/**
 * Created by SunnyDay on 2022/05/06 14:23
 * Custom hash table based on TreeMap.
 * Keys must be Comparable because every address holds a TreeMap.
 */
public class MyHashTable<K extends Comparable<K>, V> {

    private TreeMap<K, V>[] hashTable; // each address holds a TreeMap (red-black tree)
    private int M; // capacity
    private int size;

    @SuppressWarnings("unchecked")
    public MyHashTable(int M) {
        this.M = M;
        this.size = 0;
        hashTable = new TreeMap[M];
        for (int i = 0; i < M; i++) {
            hashTable[i] = new TreeMap<>();
        }
    }

    /**
     * Default constructor; the default capacity is 97.
     */
    public MyHashTable() {
        this(97);
    }

    /**
     * Calculate the index: clear the sign bit, then take the remainder modulo M.
     */
    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public int getSize() {
        return size;
    }

    /**
     * Add an element, or overwrite the value if the key already exists.
     */
    public void add(K key, V value) {
        TreeMap<K, V> map = hashTable[hash(key)];
        if (map.containsKey(key)) {
            map.put(key, value);
        } else {
            map.put(key, value);
            size++;
        }
    }

    /**
     * Delete an element and return its value, or null if the key is absent.
     */
    public V remove(K key) {
        TreeMap<K, V> map = hashTable[hash(key)];
        V element = null;
        if (map.containsKey(key)) {
            element = map.remove(key);
            size--;
        }
        return element;
    }

    /**
     * Check whether the target key exists.
     */
    public boolean containKey(K key) {
        return hashTable[hash(key)].containsKey(key);
    }

    /**
     * Query the value for the target key.
     */
    public V get(K key) {
        return hashTable[hash(key)].get(key);
    }
}

Time complexity analysis: suppose the table has M addresses and holds N elements.

If each address were an ordinary linked list, an operation would take O(N/M) on average and O(N) in the worst case, when every element falls into the same address.

The version above uses a TreeMap, a balanced tree, for each address, so an operation takes O(log(N/M)) on average and O(log N) in the worst case.

Second Edition: Dynamic Spatial Processing of Arrays

As mentioned earlier, hash table operations should be O(1). The analysis above, however, shows that the cost depends on N/M, the average number of elements per address. With M fixed at the initial array capacity, N/M grows without bound as N grows, so the time complexity cannot stay close to O(1). If we resize the array dynamically instead, we can keep N/M bounded and the time complexity close to O(1).

Since the chains in the chain address method never become "full", we cannot trigger expansion the way ArrayList does, when its array runs out of room. Instead, we can use rules like these:

When the average load of each address reaches a certain level, expand the capacity. For example: expand when N/M >= upperTol (N: total number of elements, M: array capacity, upperTol: upper limit on the average load).
When the average load of each address falls below a certain level, reduce the capacity. For example: shrink when N/M < lowerTol (N: total number of elements, M: array capacity, lowerTol: lower limit on the average load).

import java.util.Map;
import java.util.TreeMap;

/**
 * Created by SunnyDay on 2022/05/06 14:23
 * Custom hash table based on TreeMap, with dynamic resizing.
 * Keys must be Comparable because every address holds a TreeMap.
 */
public class MyHashTable<K extends Comparable<K>, V> {

    // Resize thresholds and initial capacity.
    private static final int upperTol = 10;
    private static final int lowerTol = 2;
    private static final int initCapacity = 7;

    private TreeMap<K, V>[] hashTable; // each address holds a TreeMap (red-black tree)
    private int M;
    private int size;

    @SuppressWarnings("unchecked")
    public MyHashTable(int M) {
        this.M = M;
        this.size = 0;
        hashTable = new TreeMap[M];
        for (int i = 0; i < M; i++) {
            hashTable[i] = new TreeMap<>();
        }
    }

    /**
     * Default constructor; the default capacity is initCapacity (7).
     */
    public MyHashTable() {
        this(initCapacity);
    }

    /**
     * Calculate the index: clear the sign bit, then take the remainder modulo M.
     */
    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public int getSize() {
        return size;
    }

    /**
     * Add an element, or overwrite the value if the key already exists.
     */
    public void add(K key, V value) {
        TreeMap<K, V> map = hashTable[hash(key)];
        if (map.containsKey(key)) {
            map.put(key, value);
        } else {
            map.put(key, value);
            size++;
            // size is N; this is equivalent to size / M >= upperTol,
            // written with multiplication instead of division.
            if (size >= upperTol * M) {
                resize(2 * M);
            }
        }
    }

    /**
     * Delete an element and return its value, or null if the key is absent.
     */
    public V remove(K key) {
        TreeMap<K, V> map = hashTable[hash(key)];
        V element = null;
        if (map.containsKey(key)) {
            element = map.remove(key);
            size--;
            // M / 2 > 0 would suffice, but since the table has an initial capacity
            // we never shrink below it: M / 2 >= initCapacity.
            if (size < lowerTol * M && M / 2 >= initCapacity) {
                resize(M / 2);
            }
        }
        return element;
    }

    /**
     * Check whether the target key exists.
     */
    public boolean containKey(K key) {
        return hashTable[hash(key)].containsKey(key);
    }

    /**
     * Query the value for the target key.
     */
    public V get(K key) {
        return hashTable[hash(key)].get(key);
    }

    @SuppressWarnings("unchecked")
    private void resize(int newM) {
        // Allocate the new array.
        TreeMap<K, V>[] newHashTable = new TreeMap[newM];
        for (int i = 0; i < newM; i++) {
            newHashTable[i] = new TreeMap<>();
        }

        // Update M first so that hash() computes indexes for the new capacity.
        int oldM = M;
        this.M = newM;

        // Move every entry from the old array into the new one.
        for (int i = 0; i < oldM; i++) {
            for (Map.Entry<K, V> entry : hashTable[i].entrySet()) {
                newHashTable[hash(entry.getKey())].put(entry.getKey(), entry.getValue());
            }
        }
        // Point the table at the new array.
        this.hashTable = newHashTable;
    }
}

It can be seen that the average number of elements per address now stays roughly between lowerTol and upperTol. Since we choose lowerTol and upperTol ourselves, they are constants, so the average cost of an operation is bounded by a small constant and the time complexity approaches O(1).
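Resizing itself costs O(N), but it happens rarely. The rough simulation below counts how many elements get moved by expansions while n elements are inserted, under the same rule as above (start at capacity 7, double when size >= upperTol * M). It is only a sketch: it assumes an insert-only workload, ignores shrinking, and the class and method names are made up for this example.

```java
class ResizeCost {
    // Returns the total number of elements moved by expansions during n insertions.
    static long elementsMoved(int n, int initCapacity, int upperTol) {
        long moved = 0;
        long m = initCapacity;
        for (int size = 1; size <= n; size++) {
            if (size >= upperTol * m) {
                moved += size; // a resize re-inserts every element currently stored
                m *= 2;
            }
        }
        return moved;
    }

    public static void main(String[] args) {
        // Expansions happen at sizes 70, 140, 280, and 560: 70 + 140 + 280 + 560 = 1050
        System.out.println(elementsMoved(1000, 7, 10));      // 1050
        System.out.println(elementsMoved(1_000_000, 7, 10)); // 1146810
    }
}
```

In both runs the total number of moves stays below 2n, so spread over all insertions the resizing adds only a constant amount of work per operation.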

Third Edition: Array Dynamic Space Optimization

In the version above, every expansion sets the capacity to 2 * M, which is always an even number rather than a prime, and this can lead to unevenly distributed indexes. This can be optimized further: always take the new capacity from a table of prime numbers.


import java.util.Map;
import java.util.TreeMap;

/**
 * Created by SunnyDay on 2022/05/06 14:23
 * Custom hash table based on TreeMap, with prime capacities.
 * Keys must be Comparable because every address holds a TreeMap.
 */
public class MyHashTable<K extends Comparable<K>, V> {

    // Prime numbers within the int range, each roughly twice the previous one.
    private static final int[] capacity = {
            53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593,
            49157, 98317, 196613, 393241, 786433, 1572869, 3145739, 6291469, 12582917, 25165843,
            50331653, 100663319, 201326611, 402653189, 805306457, 1610612741};

    // Resize thresholds.
    private static final int upperTol = 10;
    private static final int lowerTol = 2;

    // Index into the capacity array; starts at the first element.
    // This is an instance field so that every table tracks its own capacity.
    private int capacityIndex = 0;

    private TreeMap<K, V>[] hashTable; // each address holds a TreeMap (red-black tree)
    private int M;
    private int size;

    @SuppressWarnings("unchecked")
    public MyHashTable() {
        this.M = capacity[capacityIndex];
        this.size = 0;
        hashTable = new TreeMap[M];
        for (int i = 0; i < M; i++) {
            hashTable[i] = new TreeMap<>();
        }
    }

    /**
     * Calculate the index: clear the sign bit, then take the remainder modulo M.
     */
    private int hash(K key) {
        return (key.hashCode() & 0x7fffffff) % M;
    }

    public int getSize() {
        return size;
    }

    /**
     * Add an element, or overwrite the value if the key already exists.
     */
    public void add(K key, V value) {
        TreeMap<K, V> map = hashTable[hash(key)];
        if (map.containsKey(key)) {
            map.put(key, value);
        } else {
            map.put(key, value);
            size++;
            // Expand only while a larger prime is available (avoids going out of bounds).
            // The long arithmetic avoids int overflow for the largest capacities.
            if (size >= (long) upperTol * M && capacityIndex + 1 < capacity.length) {
                capacityIndex++;
                resize(capacity[capacityIndex]);
            }
        }
    }

    /**
     * Delete an element and return its value, or null if the key is absent.
     */
    public V remove(K key) {
        TreeMap<K, V> map = hashTable[hash(key)];
        V element = null;
        if (map.containsKey(key)) {
            element = map.remove(key);
            size--;
            // Shrink only while a smaller prime is available.
            if (size < (long) lowerTol * M && capacityIndex - 1 >= 0) {
                capacityIndex--;
                resize(capacity[capacityIndex]);
            }
        }
        return element;
    }

    /**
     * Check whether the target key exists.
     */
    public boolean containKey(K key) {
        return hashTable[hash(key)].containsKey(key);
    }

    /**
     * Query the value for the target key.
     */
    public V get(K key) {
        return hashTable[hash(key)].get(key);
    }

    @SuppressWarnings("unchecked")
    private void resize(int newM) {
        // Allocate the new array.
        TreeMap<K, V>[] newHashTable = new TreeMap[newM];
        for (int i = 0; i < newM; i++) {
            newHashTable[i] = new TreeMap<>();
        }

        // Update M first so that hash() computes indexes for the new capacity.
        int oldM = M;
        this.M = newM;

        // Move every entry from the old array into the new one.
        for (int i = 0; i < oldM; i++) {
            for (Map.Entry<K, V> entry : hashTable[i].entrySet()) {
                newHashTable[hash(entry.getKey())].put(entry.getKey(), entry.getValue());
            }
        }
        // Point the table at the new array.
        this.hashTable = newHashTable;
    }
}

Summary

Takeaways

With dynamic resizing, the amortized time complexity of hash table operations is O(1) on average.
The hash table loses the order of the elements.

Other solutions to hash conflicts

Open addressing: this involves the concept of a load factor. With a suitably chosen load factor, the time complexity is also O(1).

Linear probing (step forward by 1 each time)
Quadratic probing (the offset is a square: +1², +2², +3², ...)
Double hashing
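Linear probing, the simplest of these, can be sketched as follows. This is a minimal illustration for int keys with a fixed capacity: the class name is made up, there is no resizing, and deletion is left out because it needs extra care in open addressing.

```java
class LinearProbingSet {
    private final Integer[] slots; // null marks an empty slot
    private int size;

    LinearProbingSet(int capacity) {
        slots = new Integer[capacity];
    }

    private int home(int key) {
        return (key & 0x7fffffff) % slots.length;
    }

    // Returns true if the key was added, false if it was already present.
    boolean add(int key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[i] == null) {
                slots[i] = key;
                size++;
                return true;
            }
            if (slots[i] == key) {
                return false;
            }
            i = (i + 1) % slots.length; // conflict: try the next slot (+1 each time)
        }
        throw new IllegalStateException("table is full");
    }

    // Returns the slot that holds the key, or -1 if the key is absent.
    int indexOf(int key) {
        int i = home(key);
        for (int probes = 0; probes < slots.length && slots[i] != null; probes++) {
            if (slots[i] == key) {
                return i;
            }
            i = (i + 1) % slots.length;
        }
        return -1;
    }

    int getSize() {
        return size;
    }

    public static void main(String[] args) {
        LinearProbingSet set = new LinearProbingSet(7);
        set.add(3);
        set.add(10); // 10 % 7 = 3 is taken, so it moves to slot 4
        set.add(17); // 17 % 7 = 3, slots 3 and 4 are taken, so it moves to slot 5
        System.out.println(set.indexOf(3));  // 3
        System.out.println(set.indexOf(10)); // 4
        System.out.println(set.indexOf(17)); // 5
    }
}
```

All three keys share the home slot 3, and each conflict is resolved by stepping to the next free slot. The fuller the array gets, the longer these probe sequences become, which is why the load factor matters for open addressing.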

Rehashing: when a conflict occurs, compute the index again with a different hash function.

Coalesced hashing: combines the chain address method with open addressing.