Everything about hashing is here!

Foreword

> This article is part of the album http://dwz.win/HjK; click to unlock more knowledge of data structures and algorithms.

Hello, my name is Tong.

In the last section, we learned how to build a high-performance queue in Java, which involved a lot of low-level knowledge. I wonder how much of it stuck?

In this section, I want to relearn everything about hashing with you: hashes, hash functions, and hash tables.

How exactly are these three related?

Why does the Object class need to have a hashCode() method? What does it have to do with the equals() method?

How do you write a high-performance hash table?

Can the red-black tree in HashMap in Java be replaced with other data structures?

What is a hash?

Hash refers to the process of converting an input of any length into a fixed-length output by means of some algorithm. The output is called a Hash value, or Hash code. The algorithm is called a Hash algorithm, or Hash function. The process itself is generally called hashing, or computing the Hash. In Chinese, "Hash" has several different translations, all of which mean the same thing.

Since the output has a fixed length, the set of possible inputs is infinite while the set of possible outputs is finite. Inevitably, some different inputs will produce the same output. For this reason, a Hash algorithm is generally irreversible.
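To make "different inputs, same output" concrete, here is a small sketch using Java's built-in String.hashCode(). The class name HashCollisionDemo and the helper collide() are made up for illustration; the two strings are a commonly cited colliding pair.

```java
final class HashCollisionDemo {
    /** True if x and y are different values that nevertheless share a hash code. */
    static boolean collide(Object x, Object y) {
        return !x.equals(y) && x.hashCode() == y.hashCode();
    }

    public static void main(String[] args) {
        // 'A' * 31 + 'a' = 65 * 31 + 97 = 2112
        System.out.println("Aa".hashCode()); // 2112
        // 'B' * 31 + 'B' = 66 * 31 + 66 = 2112
        System.out.println("BB".hashCode()); // 2112
        System.out.println(collide("Aa", "BB")); // true
    }
}
```

Given only the value 2112, there is no way to tell whether the input was "Aa" or "BB", which is the sense in which the mapping cannot be reversed.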

So, what are the uses of the Hash algorithm?

Purpose of hash algorithm

A Hash algorithm is a general notion, or rather a kind of idea. It has no fixed formula. Any algorithm that satisfies the definition above can be called a Hash algorithm.

Generally speaking, it has the following uses:

  1. Password protection: for example, storing an MD5 + salt digest of the password instead of the password itself;
  2. Fast lookups: for example, hash tables, which let you find elements quickly;
  3. Digital signatures: for example, signing calls between systems so that tampering with the data can be detected;
  4. File verification: for example, when you download a Tencent game there is usually a published MD5 value. After downloading the installation package, you compute its MD5 value and compare it with the official one to tell whether the file was damaged or tampered with during the download.
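The file verification case can be sketched with the standard java.security.MessageDigest API. The class ChecksumDemo, the helpers md5Hex() and matches(), and the sample bytes are hypothetical; a real check would read the downloaded file and take the expected value from the publisher's page.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class ChecksumDemo {
    /** Returns the MD5 digest of the data as 32 lowercase hex characters. */
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is one of the algorithms every Java platform must provide
            throw new IllegalStateException(e);
        }
    }

    /** True if the downloaded bytes hash to the published checksum. */
    static boolean matches(byte[] downloaded, String publishedMd5) {
        return md5Hex(downloaded).equalsIgnoreCase(publishedMd5);
    }

    public static void main(String[] args) {
        byte[] original = "installer-bytes".getBytes(StandardCharsets.UTF_8);
        // Stands in for the value shown on the download page (an assumption)
        String published = md5Hex(original);
        byte[] damaged = "installer-bytez".getBytes(StandardCharsets.UTF_8);

        System.out.println(matches(original, published)); // true
        System.out.println(matches(damaged, published));  // false
    }
}
```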

Well, speaking of Hash algorithms, or Hash functions: in Java, Object, the parent class of all classes, has a Hash function of its own, the hashCode() method. Why does the Object class need to define such a method?

> Strictly speaking, the Hash algorithm and the Hash function are still a bit different, I believe you can distinguish them according to the context.

Let's see what the comments in the JDK source code say.

Roughly translated, the comment says: return a Hash value for this object; the method exists to better support hash tables, such as HashMap. Put simply, this method is there for hash tables such as HashMap.

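// By default, the value is typically derived from the object's internal address (not required by the specification)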
public native int hashCode();

At this point, we have to mention another method in the Object class - equals().

// By default, directly compares whether the two references point to the same object (the same address)
public boolean equals(Object obj) {
    return (this == obj);
}

What is the entanglement between hashCode() and equals()?

Generally speaking, hashCode() can be regarded as a weak comparison. Going back to the essence of Hash, which maps arbitrary inputs to fixed-length outputs, the following situations arise:

  1. If the inputs are the same, the outputs must be the same;
  2. If the inputs are different, the outputs may be the same or different;
  3. If the outputs are the same, the inputs may or may not be the same;
  4. If the outputs are different, the inputs must be different;

equals(), on the other hand, strictly compares whether two objects are equal. So if equals() returns true for two objects, their hashCode() values must be equal. What happens if they are not?

If equals() returns true but the hashCode() values differ, imagine using these two objects as keys of a HashMap. They are likely to land in different slots, so the HashMap ends up holding two equal keys, which is not allowed. This is why, if you override the equals() method, you must also override the hashCode() method.
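As a minimal sketch of the rule, here is a made-up Point class that overrides both methods consistently, so that two equal points behave as one HashMap key. Point and EqualsHashCodeDemo are illustrative names, not JDK classes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

final class Point {
    final int x;
    final int y;

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y;
    }

    // Built from the same fields as equals(), so equal points share a hash code
    @Override
    public int hashCode() {
        return Objects.hash(x, y);
    }
}

final class EqualsHashCodeDemo {
    public static void main(String[] args) {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "first");
        map.put(new Point(1, 2), "second"); // equal key, so the value is replaced

        System.out.println(map.size());               // 1
        System.out.println(map.get(new Point(1, 2))); // second
    }
}
```

If hashCode() were removed from Point, the two equal keys would most likely hash to different slots and the map could end up with two entries.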

For example, in the String class, we all know that its equals() method compares whether the contents of two strings are equal, not the addresses of the two strings. The following is its equals() method:

public boolean equals(Object anObject) {
    if (this == anObject) {
        return true;
    }
    if (anObject instanceof String) {
        String anotherString = (String)anObject;
        int n = value.length;
        if (n == anotherString.value.length) {
            char v1[] = value;
            char v2[] = anotherString.value;
            int i = 0;
            while (n-- != 0) {
                if (v1[i] != v2[i])
                    return false;
                i++;
            }
            return true;
        }
    }
    return false;
}

So, for the following two string objects, use equals() to compare them as equal, but their memory addresses are not the same:

String a = new String("123");
String b = new String("123");
System.out.println(a.equals(b)); // true
System.out.println(a == b); // false

Here, if String did not override hashCode(), a and b would return different hash codes, which would badly interfere with using String as a HashMap key. So String overrides the hashCode() method as well:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

This algorithm is also very simple. As a formula, where s[i] is the i-th character and n is the length of the string: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1].
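To check the formula by hand, the sketch below recomputes the hash of "123" and compares it with the built-in String.hashCode(). The class StringHashDemo and the method polynomialHash() are illustrative names; the loop mirrors the one in String.hashCode() without the cached field.

```java
final class StringHashDemo {
    /** Evaluates s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] with int overflow. */
    static int polynomialHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // '1' = 49, '2' = 50, '3' = 51
        // 49 * 31^2 + 50 * 31 + 51 = 47089 + 1550 + 51 = 48690
        System.out.println(polynomialHash("123")); // 48690
        System.out.println("123".hashCode());      // 48690
    }
}
```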

Well, since the hash table is mentioned many times here, let's take a look at how the hash table evolves step by step.

Hash table evolution history

Array

Before talking about hash tables, let's take a look at the originator of data structures - arrays.

The array is simple enough that I won't say much about it; see the figure below.

Array subscripts generally start from 0, elements are stored in sequence, and a specified element is found by searching in sequence as well.

For example, finding the element 4 by searching from the beginning takes 3 comparisons.

Early hash table

The shortcoming of the array was discussed above: to find an element, you can only search from the beginning or the end until one matches, so the average time complexity is O(n).

So, is there any way to quickly find elements using an array?

The smart programmer brothers thought of a way: run the element through a hash function and use the resulting value to determine the element's position in the array, so that the time complexity drops to O(1).

For example, suppose there are 4 elements: 3, 5, 4, and 1. Before they are put into the array, their positions are calculated by a hash function and they are placed at exactly those positions, instead of being stored one after another as in a plain array (where a position is found by index rather than by element value).

If the array we allocate has length 8, we can define the hash function as hash(x) = x % 8, so each element ends up at the position given by its own value: 3 at position 3, 5 at position 5, 4 at position 4, and 1 at position 1.

At this time, we look for the element 4 again, and first calculate its hash value as hash(4) = 4 % 8 = 4, so you can directly return the element at position 4.
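A minimal sketch of this early hash table, assuming non-negative int elements and, for now, no two elements with the same hash value. The class name SimpleHashTable and its methods are invented for this example.

```java
final class SimpleHashTable {
    private static final int CAPACITY = 8;
    private final Integer[] slots = new Integer[CAPACITY];

    /** hash(x) = x % 8, as in the text; assumes x is non-negative. */
    static int hash(int x) {
        return x % CAPACITY;
    }

    /** Stores x directly at its hashed position (collisions are ignored here). */
    void put(int x) {
        slots[hash(x)] = x;
    }

    /** One hash computation and one array access: O(1). */
    boolean contains(int x) {
        Integer stored = slots[hash(x)];
        return stored != null && stored == x;
    }

    public static void main(String[] args) {
        SimpleHashTable table = new SimpleHashTable();
        for (int x : new int[] {3, 5, 4, 1}) {
            table.put(x);
        }
        System.out.println(hash(4));           // 4
        System.out.println(table.contains(4)); // true
        System.out.println(table.contains(2)); // false
    }
}
```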

Evolved hash table

Everything looks perfect, but then element 13 arrives and needs to be inserted into the hash table. Its hash value is hash(13) = 13 % 8 = 5. What? Its calculated position is also 5, but position 5 is already occupied. What should we do?

This is a hash collision.

Why do hash collisions occur?

Because the array we allocate has a finite length, mapping an unlimited range of values onto a finite array is bound to produce conflicts sooner or later; that is, multiple elements get mapped to the same position.

Well, now that there is a hash collision, we have to solve it. We must!

How?

Linear probing

Since position 5 already has an owner, element 13 gives in and moves back one position, to position 6. This is linear probing: when there is a conflict, move backwards one position at a time until an empty position is found.

Then a new element 12 arrives, with hash value hash(12) = 12 % 8 = 4. What? This time it has to move back 3 times, to position 7, before it finds a free position, which makes insertion inefficient. Lookups suffer in the same way: the element at position 4 is not the one we are looking for, so we keep moving back until we reach position 7.
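The walk-through above can be replayed with a small linear probing sketch. LinearProbingTable and its methods are hypothetical names; the sketch assumes non-negative int elements, wraps around at the end of the array, and does not handle duplicates or deletion.

```java
final class LinearProbingTable {
    private final Integer[] slots;

    LinearProbingTable(int capacity) {
        slots = new Integer[capacity];
    }

    /** Inserts x and returns the index it landed at, or -1 if the table is full. */
    int put(int x) {
        int n = slots.length;
        int home = x % n;
        for (int i = 0; i < n; i++) {
            int idx = (home + i) % n; // step one position at a time, wrapping around
            if (slots[idx] == null) {
                slots[idx] = x;
                return idx;
            }
        }
        return -1;
    }

    /** Returns the index holding x, or -1 if x is absent. */
    int indexOf(int x) {
        int n = slots.length;
        int home = x % n;
        for (int i = 0; i < n; i++) {
            int idx = (home + i) % n;
            if (slots[idx] == null) return -1; // an empty slot ends the probe sequence
            if (slots[idx] == x) return idx;
        }
        return -1;
    }

    public static void main(String[] args) {
        LinearProbingTable table = new LinearProbingTable(8);
        for (int x : new int[] {3, 5, 4, 1}) {
            table.put(x);
        }
        System.out.println(table.put(13));     // 6: position 5 is taken
        System.out.println(table.put(12));     // 7: positions 4, 5 and 6 are taken
        System.out.println(table.indexOf(12)); // 7, found after 3 extra steps
    }
}
```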

Quadratic probing

Linear probing has a big drawback: conflicting elements tend to pile up together. For example, with 12 placed at position 7, element 14 then conflicts as well, runs off the end of the array, and has to start again from position 0. The conflicting elements cluster together, which is bad for lookups and bad for inserting new elements.

At this point, another smart programmer brother came up with a new idea: quadratic probing. When there is a conflict, instead of checking positions one by one, take the original hash value plus i squared, with i = 1, 2, 3, ..., until an empty position is found.

Taking the example above, inserting element 12 goes like this: position 4 is occupied, position 4 + 1^2 = 5 is occupied, and position (4 + 2^2) % 8 = 0 is free, so 12 is placed at position 0.

This makes it possible to quickly find empty places to place new elements, and there will be no accumulation of conflicting elements.

However, now a new element 20 arrives. Where do you put it?

It turns out it cannot be placed anywhere.

Studies have shown that in a hash table using quadratic probing, once more than half of the positions are filled, a new element may fail to find a place.
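Here is a sketch of quadratic probing that reproduces the situation with element 20. QuadraticProbingTable is an invented name, and wrapping with % capacity is an assumption carried over from the linear probing example. Because (i + n)^2 and i^2 leave the same remainder modulo n, trying i = 0 .. n-1 already covers every position the probe sequence can ever reach.

```java
final class QuadraticProbingTable {
    private final Integer[] slots;

    QuadraticProbingTable(int capacity) {
        slots = new Integer[capacity];
    }

    /** Inserts x and returns its index, or -1 if no reachable position is free. */
    int put(int x) {
        int n = slots.length;
        int home = x % n;
        for (int i = 0; i < n; i++) {
            int idx = (home + i * i) % n; // home, home+1, home+4, home+9, ...
            if (slots[idx] == null) {
                slots[idx] = x;
                return idx;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        QuadraticProbingTable table = new QuadraticProbingTable(8);
        for (int x : new int[] {3, 5, 4, 1}) {
            table.put(x);
        }
        System.out.println(table.put(13)); // 6: 5 is taken, 5 + 1 is free
        System.out.println(table.put(12)); // 0: 4 and 5 are taken, (4 + 4) % 8 = 0 is free
        // For home position 4 the only reachable slots are 4, 5 and 0, all taken,
        // even though positions 2 and 7 are still empty.
        System.out.println(table.put(20)); // -1
    }
}
```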

Therefore, a new concept is introduced: expansion.

What is expansion?

When the number of stored elements reaches x% of the total capacity, the table needs to be expanded. This x% is called the expansion factor (more commonly known as the load factor).

Obviously, the larger the expansion factor the better, since it means higher space utilization of the hash table.

So, unfortunately, quadratic probing cannot meet our goal. Its expansion factor is too small, only 0.5, so half of the space is wasted.

Time for the programmer brothers to show how smart they are again. After a 996 brainstorm, they came up with a new way to implement a hash table: the linked list method.

Linked list method

Isn't this just about resolving conflicts? If there is a conflict, I will not store the element in the array slot itself. Instead, I use a linked list to chain together all the elements that map to the same array index, so the space can be fully used, ahahaha~~

Hehehe, perfect.

Really perfect? Suppose I am a hacker and I keep inserting elements x with x % 8 = 4. You will find that almost all of the elements end up in the same linked list. In the end your hash table degenerates into a linked list, and querying and inserting elements becomes O(n).
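A sketch of the linked list method, together with the degenerate case just described. ChainedHashTable and chainLength() are hypothetical names, and the table has a fixed capacity with no expansion.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedHashTable {
    private final List<LinkedList<Integer>> buckets = new ArrayList<>();

    ChainedHashTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    /** Appends x to the chain at its hashed index unless it is already there. */
    void put(int x) {
        LinkedList<Integer> chain = buckets.get(x % buckets.size());
        if (!chain.contains(x)) {
            chain.add(x);
        }
    }

    /** Walks one chain, so the cost is proportional to that chain's length. */
    boolean contains(int x) {
        return buckets.get(x % buckets.size()).contains(x);
    }

    int chainLength(int index) {
        return buckets.get(index).size();
    }

    public static void main(String[] args) {
        ChainedHashTable table = new ChainedHashTable(8);
        // The "hacker" input: every element satisfies x % 8 == 4
        for (int x = 4; x < 4 + 8 * 100; x += 8) {
            table.put(x);
        }
        System.out.println(table.chainLength(4)); // 100, everything is in one chain
        System.out.println(table.chainLength(0)); // 0
    }
}
```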

Of course there is a way out. What is the expansion factor for, after all?

For example, if the expansion factor is set to 1, then when the number of elements reaches 8 the capacity is doubled to 16. Half of the elements stay at position 4 and the other half move to position 12, which relieves the pressure on the hash table.
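The split between positions 4 and 12 can be verified with a few lines. ResizeSplitDemo and countPerIndex() are illustrative names; the eight keys are simply the first eight non-negative integers with x % 8 == 4.

```java
final class ResizeSplitDemo {
    /** Counts how many keys map to each index for a table of the given capacity. */
    static int[] countPerIndex(int[] keys, int capacity) {
        int[] counts = new int[capacity];
        for (int key : keys) {
            counts[key % capacity]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] keys = {4, 12, 20, 28, 36, 44, 52, 60};

        int[] before = countPerIndex(keys, 8);
        System.out.println(before[4]); // 8, one long chain

        int[] after = countPerIndex(keys, 16);
        System.out.println(after[4]);  // 4 (keys 4, 20, 36, 52)
        System.out.println(after[12]); // 4 (keys 12, 28, 44, 60)
    }
}
```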

However, this is still not perfect: one linked list has merely become two.

This time the smart programmer brothers upgraded to a 9127 brainstorm, and finally came up with a new structure: the linked list tree method.

Linked list tree method

The expansion above does solve part of the problem: while the number of elements is relatively small, the overall search and insertion efficiency will not be too low.

However, the hackers keep attacking and the number of elements keeps growing. Once it grows large enough, search and insertion become very slow again.

So, let's think about it another way. Since the linked list is inefficient, I will upgrade it. How about upgrading it to a red-black tree once the linked list gets long?

Sounds good. No sooner said than done.

Well, not bad. Mom no longer has to worry about me being attacked by hackers. The query efficiency of a red-black tree is O(log n), much better than the O(n) of a linked list.
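The idea of upgrading a long chain can be sketched without writing a red-black tree by hand, because java.util.TreeSet is backed by TreeMap, which is a red-black tree. TreeifyingBucket is a hypothetical class for a single bucket. The threshold of 8 borrows the value of HashMap's TREEIFY_THRESHOLD constant, but the switching condition here is simplified and is not how HashMap is actually implemented.

```java
import java.util.Collection;
import java.util.LinkedList;
import java.util.TreeSet;

final class TreeifyingBucket {
    // Assumed threshold for this sketch, same number as HashMap's TREEIFY_THRESHOLD
    static final int TREEIFY_THRESHOLD = 8;

    private Collection<Integer> items = new LinkedList<>();

    void add(int x) {
        if (items.contains(x)) {
            return;
        }
        items.add(x);
        // Once the list grows past the threshold, copy it into a red-black tree
        if (items instanceof LinkedList && items.size() > TREEIFY_THRESHOLD) {
            items = new TreeSet<>(items);
        }
    }

    /** O(n) while the bucket is a list, O(log n) once it is a tree. */
    boolean contains(int x) {
        return items.contains(x);
    }

    boolean isTree() {
        return items instanceof TreeSet;
    }

    public static void main(String[] args) {
        TreeifyingBucket bucket = new TreeifyingBucket();
        for (int x = 4; x <= 60; x += 8) {
            bucket.add(x); // 8 elements
        }
        System.out.println(bucket.isTree()); // false
        bucket.add(68);                      // the 9th element
        System.out.println(bucket.isTree()); // true
        System.out.println(bucket.contains(36)); // true
    }
}
```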

So, is this the end?

Not so fast. You still have to move half of the elements every time you expand. One tree gets split into two trees? Is that really good?

It is tough being a programmer brother. After a 12127 brainstorm this time, they finally came up with something new: consistent Hash.

Consistent Hash

Consistent Hash is used more in distributed systems. For example, suppose a Redis cluster is deployed with four nodes. We define the range of all hash values as 0~2^32 and place a quarter of the elements on each node.

> This is just an example. Redis Cluster follows a similar idea, but the specific values are different.

Now suppose you need to add a node to Redis, say node5, and place it between node3 and node4. You only need to move the elements that fall between node3 and node5 from node4 to node5; all other elements stay where they are.

In this way, expansion is faster, relatively few elements are affected, and most requests barely notice it.
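A minimal sketch of such a ring, using a TreeMap keyed by position. ConsistentHashRing, addNode() and nodeFor() are invented names, and the node positions (multiples of 2^30, with node5 halfway between node3 and node4) are assumptions chosen to match the example. Each key belongs to the first node at or after its hash value, wrapping around to the first node on the ring.

```java
import java.util.Map;
import java.util.TreeMap;

final class ConsistentHashRing {
    // Ring position -> node name, kept sorted by position
    private final TreeMap<Long, String> ring = new TreeMap<>();

    void addNode(long position, String name) {
        ring.put(position, name);
    }

    /** Returns the owning node; assumes at least one node has been added. */
    String nodeFor(long keyHash) {
        Map.Entry<Long, String> owner = ring.ceilingEntry(keyHash);
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing();
        ring.addNode(1L << 30, "node1");
        ring.addNode(2L << 30, "node2");
        ring.addNode(3L << 30, "node3");
        ring.addNode(4L << 30, "node4"); // 2^32

        long key = 3_500_000_000L; // falls between node3 and node4
        System.out.println(ring.nodeFor(key)); // node4

        ring.addNode(7L << 29, "node5"); // halfway between node3 and node4
        System.out.println(ring.nodeFor(key));            // node5, this key moved
        System.out.println(ring.nodeFor(4_000_000_000L)); // node4, unaffected
        System.out.println(ring.nodeFor(100L));           // node1, unaffected
    }
}
```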

Well, that's it for the evolutionary history of the hash table, have you got it?

Postscript

In this section, we relearned hashes, hash functions, and hash tables together. In Java, the ultimate form of HashMap is array + linked list + red-black tree.

It is said that this red-black tree can also be replaced with other data structures, such as a skip list. Did you know?

In the next section, let's talk about the skip list data structure and use it to rewrite HashMap. If you want to get the latest articles, come and follow me!

> Follow the official account owner "Tong Ge Read Source Code" to unlock more source code, basic and architecture knowledge.
