Original | I said that I understand collections, and the interviewer asked me why the load factor of HashMap is not set to 1?!

This is Hollis's 254th original post
Author | Hollis
Source | Hollis (ID: hollischuang)
In Java fundamentals, the collection classes are a key piece of knowledge and are used constantly in day-to-day development; List and Map, for example, appear all over typical code.
Personally, I think the JDK engineers put a great deal of optimization into the implementation of HashMap. If you asked which part of the JDK source code hides the most clever details, I would place HashMap in at least the top five.
Precisely because of this, many details are easily overlooked. Today we will focus on one of them, namely:
Why is the load factor of HashMap set to 0.75 instead of 1, and not 0.5? What are the considerations behind this?
Don't underestimate this question: the load factor is a very important concept in HashMap and a common talking point in interviews for senior positions.
Moreover, this value can be set explicitly, and some people set it incorrectly. For example, under my recent article "The Alibaba Java Development Manual recommends setting the initial capacity when creating a HashMap, but how much is appropriate?", a reader responded like this:
Since some people do try to modify the load factor, is it appropriate to change it to 1? And why doesn't HashMap use 1 as the default load factor?

What is loadFactor

First, let's introduce what the load factor (loadFactor) is. If you already know this part, you can skip this section.
We know that when a HashMap is first created, its capacity is specified (if not specified explicitly, the default is 16; see "Why is the default capacity of HashMap 16?"). As we keep putting elements into the HashMap, it may reach the point where a resizing mechanism is needed.
Resizing (expansion) simply means enlarging the capacity of the HashMap:

void addEntry(int hash, K key, V value, int bucketIndex) {
    // if the number of elements has reached the threshold and the target bucket
    // is already occupied, double the table size and recompute the bucket index
    if ((size >= threshold) && (null != table[bucketIndex])) {
        resize(2 * table.length);
        hash = (null != key) ? hash(key) : 0;
        bucketIndex = indexFor(hash, table.length);
    }
    createEntry(hash, key, value, bucketIndex);
}

From this code we can see that, while adding elements to a HashMap, if the number of elements (size) exceeds the threshold, the map automatically resizes, and after resizing it also has to rehash the existing elements, i.e. redistribute the elements from the old buckets into the new buckets.
In HashMap, threshold = loadFactor * capacity.
loadFactor is the load factor, which indicates how full the HashMap is allowed to get. Its default value is 0.75f, meaning that by default, when the number of elements in a HashMap reaches 3/4 of its capacity, the map resizes automatically. (For details, see "Those concepts in HashMap that never seem quite clear".)
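
To make the relationship concrete, here is a minimal sketch (illustrative code, not part of the JDK; the class name ThresholdDemo is made up) that computes the threshold for the default capacity and load factor:

public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;          // default initial capacity
        float loadFactor = 0.75f;   // default load factor
        int threshold = (int) (capacity * loadFactor);
        // prints 12: roughly speaking, the 13th put into a default HashMap triggers a resize
        System.out.println("resize is triggered once size exceeds " + threshold);
    }
}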

Why resize

Remember we said earlier that during a resize HashMap not only enlarges its capacity but also has to rehash! This process is therefore quite time-consuming, and the more elements in the Map, the longer it takes.
Rehashing means hashing all existing elements again and recalculating which bucket each one should be placed in.
So, has anyone wondered: since resizing is so troublesome, why do it at all? Isn't HashMap just an array of linked lists? Without resizing it could keep storing elements indefinitely. Why resize?
This is actually related to hash collisions.
Hash collision

We know that HashMap is implemented on top of a hash function, and hash functions have the following basic property: if two inputs produce different hash values under the same hash function, the inputs must be different; but if they produce the same hash value, the inputs are not necessarily the same.
The phenomenon where two different input values produce the same hash value under the same hash function is called a collision.
Important measures of a hash function's quality are its collision probability and how collisions are resolved.
There are many ways to resolve hash collisions; one of the more common is separate chaining (the chained-address method), which is also the method HashMap adopts. For details, see "The most thorough analysis of hash() in Map on the whole internet".
HashMap combines an array with linked lists, taking advantage of both: we can think of it as an array of linked lists.
HashMap is implemented on top of this array-of-linked-lists data structure.
When we put an element into a HashMap, we first locate which linked list in the array it belongs to, and then append the element to that linked list.
When we get an element from a HashMap, we also first locate the linked list in the array, and then traverse that list element by element until we find the one we need.
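
To make the idea concrete, here is a minimal separate-chaining sketch (purely illustrative; the class ChainedMap and its methods are made up and are not the real HashMap implementation):

import java.util.LinkedList;

// A toy "array of linked lists": put/get first locate a bucket by hash,
// then walk the linked list stored in that bucket.
class ChainedMap<K, V> {
    private static class Node<K, V> {
        final K key;
        V value;
        Node(K key, V value) { this.key = key; this.value = value; }
    }

    private final LinkedList<Node<K, V>>[] buckets;

    @SuppressWarnings("unchecked")
    ChainedMap(int capacity) {
        buckets = new LinkedList[capacity];
        for (int i = 0; i < capacity; i++) {
            buckets[i] = new LinkedList<>();
        }
    }

    private int indexFor(K key) {
        return (key.hashCode() & 0x7fffffff) % buckets.length; // locate the bucket
    }

    void put(K key, V value) {
        for (Node<K, V> n : buckets[indexFor(key)]) {
            if (n.key.equals(key)) { n.value = value; return; } // key already present: overwrite
        }
        buckets[indexFor(key)].add(new Node<>(key, value));     // otherwise append to the chain
    }

    V get(K key) {
        for (Node<K, V> n : buckets[indexFor(key)]) {           // traverse the chain one by one
            if (n.key.equals(key)) return n.value;
        }
        return null;
    }
}

The longer the chains get, the more get degenerates into a plain linked-list traversal, which is exactly the problem described next.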
However, if there are too many collisions in a HashMap, the array of linked lists degenerates into what is effectively a single linked list, and query speed drops sharply.
So, in order to keep HashMap lookups fast, we need to make sure that collisions do not become too frequent.
Resizing to avoid hash collisions

So how can we effectively avoid hash collisions?
Let's think about it in reverse first: what do you think causes more hash collisions in a HashMap?
There are two situations:
1. The capacity is too small. The smaller the capacity, the higher the probability of collision; with more wolves and less meat, there is bound to be competition.
2. The hash algorithm is not good enough. An unreasonable algorithm may pile elements into the same bucket or just a few buckets; an uneven distribution also leads to contention.
Therefore, resolving hash collisions in HashMap starts from these two aspects.
Both points are well reflected in HashMap. Combining the two approaches, i.e. enlarging the array at the right time and then using a suitable hash algorithm to decide which bucket each element goes to, greatly reduces the probability of collisions and avoids the problem of slow lookups.

Why the default loadFactor is 0.75

At this point we know that loadFactor is an important concept in HashMap: it represents the maximum degree of fullness this HashMap may reach.
To avoid hash collisions, a HashMap needs to resize at the right moment, namely when the number of elements reaches a critical value which, as mentioned earlier, is tied to loadFactor. In other words, setting a reasonable loadFactor can effectively reduce hash collisions.
So, what is an appropriate value for loadFactor?
In the JDK source code this value is currently 0.75:

/**
 * The load factor used when none specified in constructor.
 */

static final float DEFAULT_LOAD_FACTOR = 0.75f;
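
Incidentally, the load factor can be overridden via HashMap's standard two-argument constructor (initialCapacity, loadFactor); the snippet below only illustrates that API and is not a recommendation to change the default:

import java.util.HashMap;
import java.util.Map;

public class ExplicitLoadFactor {
    public static void main(String[] args) {
        // with these arguments the map resizes once its size exceeds 32 * 0.75 = 24
        Map<String, Integer> map = new HashMap<>(32, 0.75f);
        map.put("hello", 1);
        System.out.println(map);
    }
}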
So, why 0.75? What considerations lie behind it? Why not 1 or 0.8? Why not 0.5, but 0.75?
In the official JDK documentation, there is such a description:

As a general rule, the default load factor (.75) offers a good tradeoff between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most of the operations of the HashMap class, including get and put).

Roughly translated: as a general rule, the default load factor (0.75) offers a good trade-off between time and space costs. Higher values decrease the space overhead but increase the lookup cost (reflected in most operations of the HashMap class, including get and put).
Imagine that we set the load factor to 1 and keep the default initial capacity of 16: that would mean a HashMap has to be completely "full" before it resizes.
In that case, the best outcome is that all 16 elements land in 16 different buckets after hashing; otherwise hash collisions are inevitable. And the more elements there are, the higher the probability of collisions and the slower lookups become.

The mathematical basis for 0.75

In addition, we can use a bit of mathematical reasoning to see roughly what a suitable value would be.
Assume we want a given bucket to be empty or non-empty with equal probability 0.5. Let s denote the capacity (the number of buckets) and n the number of keys that have been added. By the binomial distribution, the probability that a given bucket is still empty is:

P(0) = C(n, 0) * (1/s)^0 * (1 - 1/s)^(n - 0) = (1 - 1/s)^n

Therefore, a bucket is more likely to be empty than not (P(0) > 0.5) as long as the number of keys satisfies:

n < log(2) / log(s / (s - 1))

As s tends to infinity, if the number of keys n is chosen so that P(0) = 0.5, then n/s quickly approaches log(2):

log(2) ≈ 0.693...

So a reasonable load factor value is about 0.7.
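
As a quick sanity check, here is a small illustrative snippet (the class name LoadFactorEstimate is made up) that evaluates the bound above for a few table sizes and shows n/s approaching ln(2):

public class LoadFactorEstimate {
    public static void main(String[] args) {
        for (int s : new int[]{16, 256, 4096, 1 << 20}) {
            // largest n for which a bucket is still more likely empty than not
            double n = Math.log(2) / Math.log((double) s / (s - 1));
            System.out.printf("s = %7d  ->  n/s = %.4f%n", s, n / s);
        }
    }
}

For s = 16 the ratio is already about 0.67, and it converges to ln(2) ≈ 0.693 as s grows.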
Of course, this derivation does not appear in the official Java documentation, and we have no way to verify whether such a consideration was actually made, just as we cannot know what Lu Xun was thinking when he wrote an essay; we can only speculate. This particular speculation comes from Stack Overflow (https://stackoverflow.com/questions/10901752/what-is-the-significance-of-load-factor-in-hashmap).

Why it had to be 0.75

In theory, we believe the load factor should not be too large, otherwise there will be many hash collisions, and it should not be too small, otherwise a lot of space is wasted.
Through the mathematical reasoning above, we calculated that a reasonable value is around 0.7.
So why was 0.75 chosen in the end?
Recall the formula we mentioned earlier: threshold = loadFactor * capacity.
As discussed in "Why is the default capacity of HashMap 16?", HashMap's resizing mechanism guarantees that the capacity is always a power of 2.
So, to ensure that loadFactor * capacity (the threshold) is an integer, 0.75 (i.e. 3/4) is the more reasonable choice, because 3/4 multiplied by any power of two used as a capacity yields an integer.
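
A quick illustrative check (not JDK code) of that alignment, comparing 0.75 with 0.7 for the capacities HashMap actually uses:

public class ThresholdAlignment {
    public static void main(String[] args) {
        for (int capacity = 16; capacity <= 128; capacity <<= 1) {
            // 0.75 * a power of two is always whole; 0.7 * a power of two usually is not
            System.out.printf("capacity = %3d  0.75 * capacity = %6.2f  0.7 * capacity = %6.2f%n",
                    capacity, 0.75 * capacity, 0.7 * capacity);
        }
    }
}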

Summary

HashMap is a key-value (KV) structure. To speed up queries and insertions, its underlying data structure is an array of linked lists.
Locating an element requires a hash algorithm, and the collision-resolution method HashMap uses is separate chaining. This approach, however, has an extreme case.
If the probability of hash collisions in a HashMap is high, the HashMap degenerates into a linked list (not literally, but operations behave as if they were manipulating a linked list directly), and we know the biggest drawback of a linked list is slow lookups: you have to traverse from the head, element by element.
Therefore, to avoid large numbers of hash collisions, a HashMap needs to resize at the right moment.
The resize condition is that the number of elements reaches a critical value. In HashMap, that critical value is computed as:

threshold = loadFactor * capacity

The load factor represents the maximum fullness the array is allowed to reach; this value should be neither too large nor too small.
If loadFactor is too large, for example equal to 1, there will be a high probability of hash collisions, which greatly slows down queries.
If loadFactor is too small, for example equal to 0.5, frequent resizing will waste a great deal of space.
So the value needs to sit between 0.5 and 1. According to the mathematical estimate above, a reasonable value is around log(2) ≈ 0.693.
In addition, to keep resizing efficient, HashMap's capacity has a fixed requirement: it must be a power of 2.
And if loadFactor is 3/4, then its product with the capacity can be an integer.
Therefore, under normal circumstances we do not recommend modifying loadFactor unless there is a special reason.
For example, if I know for certain that my Map will only ever hold 5 key-value pairs and will never change, I could consider specifying loadFactor.
But in fact I don't recommend even that; we can achieve the same goal by specifying the capacity instead. For details, see "The Alibaba Java Development Manual recommends setting the initial capacity when creating a HashMap, but how much is appropriate?"
Reference materials:
https://stackoverflow.com/questions/10901752/what-is-the-significance-of-load-factor-in-hashmap
https://docs.oracle.com/javase/6/docs/api/java/util/HashMap.html
https://preshing.com/20110504/hash-collision-probabilities/
About the author: Hollis, a person with a unique pursuit of Coding, currently a technical expert at Alibaba and a personal tech blogger whose technical articles have tens of millions of reads across the internet; co-author of the book "Three Courses for Programmers".

