Are you still worrying about hash tables?

Today's topic, the hash table, is best read together with the Map covered in the previous section.

Contents

hash table

1. Concept

1.1 Conflict

1.2 Avoid conflict

1.3 Hash function design

1.3.1 Common Hash Functions

Direct addressing method -- (commonly used)

Division remainder method -- (commonly used)

Mid-square method -- (understand)

Folding method -- (understand)

Random number method -- (understand)

Mathematical analysis method -- (understand)

1.4 Load factor adjustment (focus on mastering)

1.5 Conflict resolution

1.5.1 Closed hashing

Linear probing

Quadratic probing

1.5.2 Open hash/hash bucket (focus on mastering)

1.5.3 Solutions for serious conflicts

1.6 Performance Analysis


hash table

1. Concept

① Where the hash table comes from:
Ideally, a data structure would let us fetch the element we are searching for directly from the table in one step, without any comparisons. If we build a storage structure in which a function (hashFunc) maps each element's key to its storage location, a search can locate the element quickly through that function.
② When inserting an element, use this function to compute the storage location from the element's key and store the element there.
③ When searching for an element, run the same computation on its key, treat the result as the storage location, and compare the key of the element stored there. If the keys are equal, the search succeeds.
The conversion function used in this approach is called a hash function, and the resulting structure is called a hash table (HashTable).
A diagram illustrates this briefly (each key is mapped to its position by the hash function):
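The insert and search steps above can be sketched in a few lines. This is only an illustration: the class name DirectSlotTable and the choice of hashFunc(key) = key % capacity are assumptions, keys are assumed non-negative, and collisions are deliberately ignored (a later key simply overwrites an earlier one in the same slot).

```java
// Minimal sketch: hashFunc(key) gives the slot directly, no collision handling.
class DirectSlotTable {
    private final Integer[] slots;

    DirectSlotTable(int capacity) {
        slots = new Integer[capacity];
    }

    // Assumed hash function: maps a non-negative key to a slot index.
    int hashFunc(int key) {
        return key % slots.length;
    }

    // Insert: compute the location from the key and store the key there.
    void insert(int key) {
        slots[hashFunc(key)] = key;
    }

    // Search: compute the same location, then compare the stored key.
    boolean contains(int key) {
        Integer stored = slots[hashFunc(key)];
        return stored != null && stored == key;
    }

    public static void main(String[] args) {
        DirectSlotTable table = new DirectSlotTable(10);
        table.insert(4);
        table.insert(17);
        System.out.println(table.contains(17)); // true
        System.out.println(table.contains(7));  // false: slot 7 holds 17, not 7
    }
}
```

Note that the search still needs one key comparison at the computed slot, because a different key may live there.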

 

1.1 Conflict

① What is a collision:

For the keys k_i and k_j (i != j) of two data elements, k_i != k_j but Hash(k_i) == Hash(k_j). That is, different keys produce the same hash address through the same hash function. This phenomenon is called a hash collision (or hash conflict).
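A tiny example makes the definition concrete. The hash function used here (key % capacity) and the capacity of 10 are assumptions chosen for illustration only.

```java
class CollisionDemo {
    // Assumed hash function for illustration: key modulo table capacity.
    static int hash(int key, int capacity) {
        return key % capacity;
    }

    public static void main(String[] args) {
        int capacity = 10;
        System.out.println(hash(4, capacity));  // 4
        System.out.println(hash(14, capacity)); // 4, the same address as key 4
    }
}
```

Keys 4 and 14 are different, yet both map to address 4, so they collide.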

② What if a key is negative?

An array index cannot be negative. If negative keys appear during storage, shift every key by the absolute value of the smallest key before using it as an index, so that the smallest key maps to index 0 and every index is non-negative.
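The offset idea can be sketched as follows. The names NegativeKeys, countByOffset and indexFor are hypothetical, and the sketch assumes a non-empty input whose range (max - min) is small enough to allocate an array for. Math.floorMod from the standard library is shown as an alternative when the keys are reduced modulo the table length.

```java
class NegativeKeys {
    // Counts occurrences of each key; index 0 corresponds to the smallest key.
    static int[] countByOffset(int[] keys) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int k : keys) {
            min = Math.min(min, k);
            max = Math.max(max, k);
        }
        int[] counts = new int[max - min + 1];
        for (int k : keys) {
            counts[k - min]++; // subtracting min shifts negative keys to index >= 0
        }
        return counts;
    }

    // Alternative: floorMod always returns a value in [0, capacity).
    static int indexFor(int key, int capacity) {
        return Math.floorMod(key, capacity);
    }

    public static void main(String[] args) {
        int[] counts = countByOffset(new int[]{-3, 0, 5, -3});
        System.out.println(counts[0]);        // 2: key -3 appears twice
        System.out.println(indexFor(-3, 10)); // 7, whereas -3 % 10 is -3
    }
}
```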

1.2 Avoid conflict

Since the capacity of the underlying array of a hash table is usually smaller than the number of possible keys, collisions are inevitable. What we can do is reduce the collision rate as much as possible.

1.3 Hash function design

① The domain of the hash function must include every key that needs to be stored, and if the hash table has m addresses, its range must lie between 0 and m-1.
② The addresses computed by the hash function should be distributed evenly over the whole space.
③ The hash function should be relatively simple.

1.3.1 Common Hash Functions

Direct addressing method -- (commonly used)

① Operation:
Take a linear function of the key as the hash address: Hash(Key) = A*Key + B
② Features: simple and uniform. Disadvantage: the distribution of the keys must be known in advance. Usage scenario: suitable when the keys cover a relatively small, contiguous range (the hash function is linear, so the addresses are spread uniformly).
③ Example problem: 387. First Unique Character in a String - LeetCode (leetcode-cn.com) https://leetcode-cn.com/problems/first-unique-character-in-a-string/
a. Store by ASCII code. Because the ASCII code of 'a' is 97, we subtract 97 from each letter to make the best use of space, so the 26 lowercase letters map to indices 0~25.
b. Store each character.
c. Each time a character is stored, its count is incremented by 1. Finally, scan the string in its original order and return the first character whose count is exactly 1.

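The code is as follows:

class Solution {
    public int firstUniqChar(String s) {
        if (s == null) return -1;
        // counts[c - 'a'] is how often letter c occurs ('a' is 97; input is lowercase a-z)
        int[] counts = new int[26];
        for (int i = 0; i < s.length(); i++) {
            counts[s.charAt(i) - 'a']++;
        }
        // Scan in string order; the first character seen exactly once is the answer
        for (int i = 0; i < s.length(); i++) {
            if (counts[s.charAt(i) - 'a'] == 1) {
                return i;
            }
        }
        return -1;
    }
}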

 

Division remainder method -- (commonly used)

① Operation: Let m be the number of addresses the hash table allows. Take as the divisor the prime number p that is not greater than m and closest to (or equal to) m, then convert the key into a hash address with the hash function Hash(key) = key % p (p <= m).
② Features: Disadvantage: it can waste space (for example, if the table length is 5 and p is 3, slots 3 and 4 are never used).
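A possible sketch of this method: find the largest prime p not greater than m, then take key % p. The class and method names are hypothetical, the primality test is simple trial division, and keys are assumed non-negative.

```java
class DivisionHash {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // Largest prime p with p <= m.
    static int largestPrimeAtMost(int m) {
        for (int p = m; p >= 2; p--) {
            if (isPrime(p)) return p;
        }
        throw new IllegalArgumentException("m must be at least 2");
    }

    // Hash(key) = key % p, where p is the largest prime not greater than m.
    static int hash(int key, int m) {
        return key % largestPrimeAtMost(m);
    }

    public static void main(String[] args) {
        System.out.println(largestPrimeAtMost(10)); // 7
        System.out.println(hash(25, 10));           // 4, since 25 % 7 == 4
    }
}
```

With m = 10 the divisor is 7, so addresses 7, 8 and 9 are never produced. This is the wasted space mentioned above.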

Mid-square method -- (understand)

Suppose the key is 1234. Its square is 1522756, and the middle 3 digits, 227, are taken as the hash address. If the key is 4321, its square is 18671041, and the middle 3 digits, 671 (or 710), are taken as the hash address. The mid-square method is suitable when the distribution of the keys is unknown and the keys do not have many digits.
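The two worked examples can be reproduced with a short sketch. The name MidSquareHash is hypothetical. The sketch assumes a non-negative key and a small digit count, and when the middle is ambiguous (an even-length square with an odd number of digits requested) it picks the left-leaning window, which gives 671 rather than 710.

```java
class MidSquareHash {
    // Squares the key and returns the middle `digits` decimal digits.
    static int hash(int key, int digits) {
        String square = Long.toString((long) key * key);
        if (square.length() <= digits) {
            return Integer.parseInt(square); // too short to cut: use it whole
        }
        int start = (square.length() - digits) / 2;
        return Integer.parseInt(square.substring(start, start + digits));
    }

    public static void main(String[] args) {
        System.out.println(hash(1234, 3)); // 227, from 1522756
        System.out.println(hash(4321, 3)); // 671, from 18671041
    }
}
```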

Folding method -- (understand)

The folding method splits the key from left to right into several parts with the same number of digits (the last part may be shorter), adds the parts together, and takes the last few digits of the sum, as many as the table length requires, as the hash address. The folding method is suitable when the distribution of the keys does not need to be known in advance and the keys have many digits.
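As a rough sketch of folding: split the decimal digits into groups, add the groups, and keep the low-order digits with a modulo. The names FoldingHash, partDigits and tableSize are assumptions, the key is assumed non-negative, and tableSize is assumed to be a power of ten so that the modulo keeps exactly the last few digits.

```java
class FoldingHash {
    // Splits the key into groups of partDigits digits (left to right),
    // sums the groups, and keeps the low-order digits via % tableSize.
    static int hash(long key, int partDigits, int tableSize) {
        String digits = Long.toString(key);
        long sum = 0;
        for (int i = 0; i < digits.length(); i += partDigits) {
            int end = Math.min(i + partDigits, digits.length());
            sum += Long.parseLong(digits.substring(i, end)); // last group may be shorter
        }
        return (int) (sum % tableSize);
    }

    public static void main(String[] args) {
        // 123 + 456 + 789 = 1368, last three digits are 368
        System.out.println(hash(123456789L, 3, 1000)); // 368
        // 123 + 456 + 78 = 657
        System.out.println(hash(12345678L, 3, 1000));  // 657
    }
}
```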

Random number method -- (understand)

Choose a random function and take its value for the key as the hash address, that is, H(key) = random(key), where random is a random number function. This method is usually used when the keys differ in length.
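One way to read H(key) = random(key) is as a pseudo-random generator seeded with the key, so the same key always yields the same address. The sketch below makes that assumption and uses java.util.Random; the class name RandomHash is hypothetical.

```java
import java.util.Random;

class RandomHash {
    // Seeding with the key makes the "random" address repeatable for that key.
    static int hash(long key, int tableSize) {
        return new Random(key).nextInt(tableSize);
    }

    public static void main(String[] args) {
        int first = hash(42L, 10);
        int second = hash(42L, 10);
        System.out.println(first == second); // true: same key, same address
    }
}
```

The repeatability matters: a lookup must recompute exactly the address that the insert used.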

Mathematical analysis method -- (understand)

Suppose there are n keys of d digits each, and each digit position can hold r different symbols. The r symbols do not necessarily occur with the same frequency at every position: at some positions they are spread evenly, with every symbol equally likely, while at others the distribution is uneven and only a few symbols occur often. Depending on the size of the hash table, we can pick several positions where the symbols are evenly distributed and use them as the hash address. For example, when the keys are phone numbers whose leading digits are largely the same, the last four digits can be used as the hash address.

 

1.4 Load factor adjustment (focus on mastering)

① What the load factor is and how to calculate it:

Load factor = number of elements stored in the table / length of the hash table

② The relationship between load factor and collision rate:

The larger the load factor, the higher the collision rate. The number of keys already stored cannot be changed, so to lower the collision rate we lower the load factor by increasing the length of the hash table.
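A small sketch of the load factor check, using the usual definition (stored elements divided by table length). The threshold 0.75 matches the DEFAULT_LOAD_FACTOR constant used in the HashBuck code later in this article; the class and method names here are hypothetical.

```java
class LoadFactorCheck {
    static final double MAX_LOAD_FACTOR = 0.75;

    // load factor = stored elements / table length
    static double loadFactor(int size, int capacity) {
        return (double) size / capacity;
    }

    static boolean needsResize(int size, int capacity) {
        return loadFactor(size, capacity) >= MAX_LOAD_FACTOR;
    }

    public static void main(String[] args) {
        System.out.println(needsResize(7, 10)); // false: 0.7 is below 0.75
        System.out.println(needsResize(8, 10)); // true: 0.8 reaches the threshold
        System.out.println(loadFactor(8, 20));  // 0.4 after doubling the table
    }
}
```

Doubling the table length halves the load factor for the same number of keys, which is why growing the table lowers the collision rate.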

1.5 Conflict resolution

Two common ways to resolve hash collisions are: closed hashing and open hashing

1.5.1 Closed hashing

① What is closed hashing?
It is also called open addressing. When a hash collision occurs and the hash table is not full, there must be an empty slot somewhere in the table, so the key can be stored in the "next" empty slot after the collision position.
② How does closed hashing find that slot?
By linear probing or quadratic probing.

Linear probing

① What is linear probing:

Starting from the position where the collision occurred, probe the following positions one by one until the next empty position is found.

② Operations with linear probing:

On insertion, compute the element's position in the hash table with the hash function. If that position is empty, insert the new element directly. If it already holds an element (a hash collision), use linear probing to find the next empty position and insert the new element there.

In short, find the next empty slot.

③ Disadvantage: colliding elements may pile up next to each other.
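The insertion and search procedure for linear probing might look like the sketch below. It is a minimal illustration: the class name LinearProbingTable is hypothetical, the hash function is assumed to be key modulo capacity, the table never grows, and deletion is left out (it would need a "deleted" marker so that later searches do not stop early).

```java
class LinearProbingTable {
    private final Integer[] slots;

    LinearProbingTable(int capacity) {
        slots = new Integer[capacity];
    }

    // Stores the key and returns the index where it ended up.
    int insert(int key) {
        int index = Math.floorMod(key, slots.length);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[index] == null) {
                slots[index] = key;              // empty slot: store here
                return index;
            }
            if (slots[index] == key) {
                return index;                    // key already present
            }
            index = (index + 1) % slots.length;  // collision: try the next slot
        }
        throw new IllegalStateException("table is full");
    }

    boolean contains(int key) {
        int index = Math.floorMod(key, slots.length);
        for (int probes = 0; probes < slots.length && slots[index] != null; probes++) {
            if (slots[index] == key) return true;
            index = (index + 1) % slots.length;
        }
        return false; // reached an empty slot or scanned the whole table
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(10);
        System.out.println(table.insert(4));  // 4
        System.out.println(table.insert(14)); // 5: slot 4 is taken
        System.out.println(table.insert(5));  // 6: its home slot 5 was taken by 14
        System.out.println(table.insert(44)); // 7: slots 4, 5 and 6 are taken
    }
}
```

Key 5 does not collide with anything by hash value, yet it is pushed out of its home slot by 14. That is the pile-up described in the disadvantage above.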

 

Quadratic probing

① How quadratic probing works:

The next position to try is given by H_i = (H_0 + i^2) % m, or H_i = (H_0 - i^2) % m, where i = 1,2,3..., H_0 is the position obtained by applying the hash function Hash(x) to the element's key, and m is the size of the table.

For the example used in linear probing above, if inserting 44 causes a collision, the table after applying this method looks like this:

② Important conclusion:
When the table length is a prime number and the load factor a does not exceed 0.5, a new entry can always be inserted, and no position is probed twice. So as long as half of the positions in the table are empty, the table never becomes full. A full table need not be considered when searching, but on insertion the load factor a must be kept at or below 0.5; if it would exceed that, the table must be expanded.
Therefore, the biggest defect of closed hashing is its relatively low space utilization, which is also a defect of hashing in general.
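Here is a hedged sketch of quadratic probing insertion. It assumes the common probe sequence H_i = (H_0 + i^2) % m with H_0 = key mod m, a prime table length, and it refuses an insert that would push the load factor above 0.5 instead of growing the table. The class name QuadraticProbingTable is hypothetical.

```java
class QuadraticProbingTable {
    private final Integer[] slots;
    private int size;

    // The capacity is assumed to be a prime number.
    QuadraticProbingTable(int primeCapacity) {
        slots = new Integer[primeCapacity];
    }

    // Stores the key and returns the index where it ended up.
    int insert(int key) {
        int home = Math.floorMod(key, slots.length); // H_0
        for (int i = 0; i < slots.length; i++) {
            int index = (int) ((home + (long) i * i) % slots.length); // H_i
            if (slots[index] == null) {
                if (2 * (size + 1) > slots.length) {
                    throw new IllegalStateException("load factor would exceed 0.5");
                }
                slots[index] = key;
                size++;
                return index;
            }
            if (slots[index] == key) {
                return index; // key already present
            }
        }
        throw new IllegalStateException("no free slot found");
    }

    public static void main(String[] args) {
        QuadraticProbingTable table = new QuadraticProbingTable(11);
        System.out.println(table.insert(4));  // 4
        System.out.println(table.insert(15)); // 5: (4 + 1) % 11
        System.out.println(table.insert(26)); // 8: (4 + 4) % 11
        System.out.println(table.insert(37)); // 2: (4 + 9) % 11
    }
}
```

All four keys share the home slot 4, but the growing step sizes spread them across the table instead of packing them into neighbouring slots.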

1.5.2 Open hash/hash bucket (focus on mastering)

① What is a hash bucket?
Open hashing is also called separate chaining (the chain address method). First, the hash function computes a hash address for every key in the set. Keys with the same address belong to the same subset, and each subset is called a bucket. The elements in a bucket are linked together in a singly linked list, and the head node of each list is stored in the hash table.
② How are hash buckets stored? (Linked storage)

③ If the load factor becomes too large and the table must be expanded, what happens to the data already stored? (Every node in every linked list must be rehashed.)

The figure below shows the table after its capacity is doubled:

The code is as follows:

public class HashBuck {

    static class Node {
        public int key;
        public int val;
        public Node next;

        public Node(int key, int val) {
            this.key = key;
            this.val = val;
        }
    }

    public Node[] array;
    public int usedSize;

    public static final double DEFAULT_LOAD_FACTOR = 0.75;

    public HashBuck() {
        this.array = new Node[10];
    }

    /**
     * Inserts the key, or updates its val if the key already exists.
     * @param key the key to store
     * @param val the value associated with the key
     */
    public void put(int key, int val) {
        // 1. Find the bucket for the key (floorMod keeps the index non-negative for negative keys)
        int index = Math.floorMod(key, this.array.length);
        // 2. Walk the list in this bucket; if the key is already there, update its val
        Node cur = array[index];
        while (cur != null) {
            if (cur.key == key) {
                cur.val = val; // update the val
                return;
            }
            cur = cur.next;
        }
        // 3. The key is not present: insert a new node at the head of the list
        Node node = new Node(key, val);
        node.next = array[index];
        array[index] = node;
        this.usedSize++;
        // 4. After a successful insert, check the current load factor
        if (loadFactor() >= DEFAULT_LOAD_FACTOR) {
            resize();
        }
    }

    private void resize() {
        Node[] newArray = new Node[array.length * 2];
        for (int i = 0; i < array.length; i++) {
            Node cur = array[i];
            while (cur != null) {
                // Recompute the index against the new length
                int index = Math.floorMod(cur.key, newArray.length);
                // Move cur to the head of the list at that index in the new array
                Node curNext = cur.next;
                cur.next = newArray[index]; // link to the existing list first
                newArray[index] = cur;      // then make cur the new head
                cur = curNext;
            }
        }
        array = newArray;
    }

    private double loadFactor() {
        return 1.0 * usedSize / array.length;
    }

    /**
     * Returns the val stored for the key.
     * @param key the key to look up
     * @return the val, or -1 if the key is not present
     */
    public int get(int key) {
        // 1. Find the bucket for the key
        int index = Math.floorMod(key, this.array.length);
        // 2. Walk the list in this bucket and return the val of the matching key
        Node cur = array[index];
        while (cur != null) {
            if (cur.key == key) {
                return cur.val;
            }
            cur = cur.next;
        }
        return -1;
    }
}
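The HashBuck class above has put and get but no delete. A deletion for chained buckets could be sketched as follows. This is a separate, simplified class (hypothetical name ChainedBuckets) with a fixed number of buckets and no resizing, written only to show how a node is unlinked from its list.

```java
class ChainedBuckets {
    static class Node {
        final int key;
        int val;
        Node next;

        Node(int key, int val, Node next) {
            this.key = key;
            this.val = val;
            this.next = next;
        }
    }

    private final Node[] buckets = new Node[10];
    private int size;

    void put(int key, int val) {
        int index = Math.floorMod(key, buckets.length);
        for (Node cur = buckets[index]; cur != null; cur = cur.next) {
            if (cur.key == key) {
                cur.val = val;
                return;
            }
        }
        buckets[index] = new Node(key, val, buckets[index]); // head insertion
        size++;
    }

    boolean containsKey(int key) {
        for (Node cur = buckets[Math.floorMod(key, buckets.length)]; cur != null; cur = cur.next) {
            if (cur.key == key) return true;
        }
        return false;
    }

    // Unlinks the node holding the key; returns false if the key is absent.
    boolean remove(int key) {
        int index = Math.floorMod(key, buckets.length);
        Node prev = null;
        for (Node cur = buckets[index]; cur != null; prev = cur, cur = cur.next) {
            if (cur.key == key) {
                if (prev == null) {
                    buckets[index] = cur.next; // removing the head of the list
                } else {
                    prev.next = cur.next;      // bypass the node in the middle or tail
                }
                size--;
                return true;
            }
        }
        return false;
    }

    int size() {
        return size;
    }
}
```

For example, after putting keys 4, 14 and 24 (all in bucket 4), removing 14 leaves 4 and 24 reachable and reduces the size to 2.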
④ hashCode (when the key is a reference type, hashCode() turns it into a valid integer)

At this point, however, printing the hashCode of two objects that hold the same data gives different values.

So we override hashCode (together with equals).


hashCode and equals

① If two objects have the same hashCode, equals is not necessarily true (it only means they land in the same bucket, and one bucket can chain several nodes).

② If equals is true, the hashCode must be the same (they are the same key, so they must land in the same bucket).
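Both rules can be observed with plain strings. The strings "Aa" and "BB" are a well-known pair with the same String.hashCode() value (2112) that are nevertheless not equal. The class name and the helper sameBucket are hypothetical, added only for illustration.

```java
class HashCodeVsEquals {
    // True when both keys would land in the same bucket of a table with `capacity` slots.
    static boolean sameBucket(Object x, Object y, int capacity) {
        return Math.floorMod(x.hashCode(), capacity) == Math.floorMod(y.hashCode(), capacity);
    }

    public static void main(String[] args) {
        String a = "Aa";
        String b = "BB";
        System.out.println(a.hashCode() == b.hashCode()); // true: both are 2112
        System.out.println(a.equals(b));                   // false: same bucket, different keys

        String c = new String("Aa");
        System.out.println(a.equals(c));                   // true
        System.out.println(a.hashCode() == c.hashCode()); // true: equal keys, equal hash codes
    }
}
```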

The code is as follows:

import java.util.Objects;

class Person {
    public String ID;

    public Person(String ID) {
        this.ID = ID;
    }

    // Two Person objects are equal when their IDs are equal
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (o == null || getClass() != o.getClass()) return false;
        Person person = (Person) o;
        return Objects.equals(ID, person.ID);
    }

    // Must be consistent with equals: equal IDs give equal hash codes
    @Override
    public int hashCode() {
        return Objects.hash(ID);
    }

    @Override
    public String toString() {
        return "Person{" +
                "ID='" + ID + '\'' +
                '}';
    }
}

public class HashBuck2<K, V> {

    static class Node<K, V> {
        public K key;
        public V val;
        public Node<K, V> next;

        public Node(K key, V val) {
            this.val = val;
            this.key = key;
        }
    }

    @SuppressWarnings("unchecked")
    public Node<K, V>[] array = (Node<K, V>[]) new Node[10];
    public int usedSize;

    // hashCode() may be negative, so floorMod keeps the index inside the array
    private int indexFor(K key) {
        return Math.floorMod(key.hashCode(), array.length);
    }

    public void put(K key, V val) {
        int index = indexFor(key);
        Node<K, V> cur = array[index];
        while (cur != null) {
            if (cur.key.equals(key)) {
                cur.val = val; // key already present: update the val
                return;
            }
            cur = cur.next;
        }
        // Key not present: insert a new node at the head of the list
        Node<K, V> node = new Node<>(key, val);
        node.next = array[index];
        array[index] = node;
        this.usedSize++;
    }

    public V get(K key) {
        int index = indexFor(key);
        Node<K, V> cur = array[index];
        while (cur != null) {
            if (cur.key.equals(key)) {
                return cur.val;
            }
            cur = cur.next;
        }
        return null;
    }

    public static void main(String[] args) {
        Person person1 = new Person("123");
        Person person2 = new Person("123");

        HashBuck2<Person, String> hashBuck2 = new HashBuck2<>();
        hashBuck2.put(person1, "bit");

        // person2 equals person1 and has the same hashCode, so this prints "bit"
        System.out.println(hashBuck2.get(person2));
    }
}


1.5.3 Solutions for serious conflicts

A hash bucket effectively turns the search problem over one large set into search problems over many small sets. If collisions are severe, searching those small sets performs poorly too. In that case we can transform the small-set search problem once more, for example:
1. Put another hash table behind each bucket
2. Put a search tree behind each bucket
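The second option can be sketched with java.util.TreeMap (a red-black tree) standing in for the search tree behind each bucket. This is only an illustration under that assumption: the class name TreeBucketMap is hypothetical, and the table has a fixed number of buckets.

```java
import java.util.TreeMap;

class TreeBucketMap {
    private final TreeMap<Integer, String>[] buckets;

    @SuppressWarnings("unchecked")
    TreeBucketMap(int capacity) {
        buckets = new TreeMap[capacity];
    }

    void put(int key, String val) {
        int index = Math.floorMod(key, buckets.length);
        if (buckets[index] == null) {
            buckets[index] = new TreeMap<>(); // create the tree on first use
        }
        buckets[index].put(key, val);         // O(log k) for k keys in this bucket
    }

    String get(int key) {
        TreeMap<Integer, String> bucket = buckets[Math.floorMod(key, buckets.length)];
        return bucket == null ? null : bucket.get(key);
    }

    public static void main(String[] args) {
        TreeBucketMap map = new TreeBucketMap(4);
        map.put(1, "a");
        map.put(5, "b"); // same bucket as key 1
        map.put(9, "c"); // same bucket again
        System.out.println(map.get(5)); // b
        System.out.println(map.get(2)); // null
    }
}
```

Even when many keys fall into one bucket, a lookup inside that bucket costs O(log k) instead of the O(k) of a linked list.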

1.6 Performance Analysis

In practice, the collision rate of a hash table is kept quite low and the table is very efficient, so we usually treat insertion / deletion / lookup in a hash table as O(1) time.

Thanks for watching~

 


Origin blog.csdn.net/weixin_58850105/article/details/123371418