Why is the initial capacity of HashMap a power of 2?

Collections are used constantly in day-to-day Java development, and as the typical key-value (KV) data structure, HashMap is certainly no stranger to Java developers.

In daily development, we often create a HashMap as follows:

Map<String, String> map = new HashMap<String, String>();

However, have you ever noticed that in the code above we did not specify a capacity for the HashMap? So what is the default capacity of a newly created HashMap, and why?

This article will analyze this problem.

What is capacity

In Java, there are two basic data structures for storing data: arrays and linked lists. Arrays are easy to index into but expensive to insert into and delete from; linked lists are the opposite: insertion and deletion are cheap, but lookup is expensive. HashMap combines the two and takes advantage of both; we can think of it as an array of linked lists. (Since Java 8, a long bucket list is converted to a red-black tree, but that does not matter for this discussion.)

In HashMap, there are two fields that are easy to confuse: size and capacity. capacity is the capacity of the map, i.e. the number of buckets, while size is the number of elements currently stored in the map.

A simple analogy makes this easier to understand: if a HashMap is a "bucket", then capacity is how many slots the bucket currently has, and size is how many elements it already contains.


For example, consider the following code:

import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.util.HashMap;
import java.util.Map;

Map<String, String> map = new HashMap<String, String>();
map.put("hollis", "hollischuang");

// capacity() is a package-private method of HashMap in JDK 8, so we read
// it via reflection. (On JDK 9+ this may additionally require
// --add-opens java.base/java.util=ALL-UNNAMED.)
Class<?> mapType = map.getClass();
Method capacity = mapType.getDeclaredMethod("capacity");
capacity.setAccessible(true);
System.out.println("capacity : " + capacity.invoke(map));

// size is a private field; read it the same way.
Field size = mapType.getDeclaredField("size");
size.setAccessible(true);
System.out.println("size : " + size.get(map));

Output result:

capacity : 16
size : 1

Above we created a new HashMap, put one element into it, and then printed its capacity and size via reflection: the capacity is 16 and the number of stored elements is 1.

From this example we can see that when we create a HashMap without specifying a capacity, we get a Map with a default capacity of 16. So where does this default come from, and why this particular number?

Capacity and hash

To understand the reason for this default capacity, we first need to know what the capacity is used for.

We know that capacity is the number of "buckets" in a HashMap. When we put an element into a HashMap, some algorithm has to work out which bucket it should go into. This process is called hashing, and it corresponds to the hash method in HashMap.


The job of the hash method is to locate the position of a KV pair in the bucket array based on its key. In other words, the input of the hash method is a Key of type Object, and the output is an array index of type int. If you were asked to design this method, how would you do it?

In fact, a simple approach would be to call the key's hashCode() method, which returns an integer, and then take that number modulo the capacity of the HashMap.
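As a rough illustration of that naive idea (this is not the actual JDK implementation, and the class and method names here are made up), a sketch might look like this:

```java
public class NaiveIndex {
    // A naive bucket-index function: take the key's hashCode modulo the
    // capacity. Math.abs is applied to the remainder to guard against a
    // negative hashCode(). Again: this is NOT how the JDK does it.
    static int naiveIndexFor(Object key, int capacity) {
        return Math.abs(key.hashCode() % capacity);
    }

    public static void main(String[] args) {
        // Every key maps to a bucket index in [0, capacity).
        System.out.println(naiveIndexFor("hollis", 16));
    }
}
```

Any key is mapped into the range [0, capacity), which is all a bucket-index function fundamentally has to do.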

If it were really that simple, the capacity of HashMap could be set much more freely. But for efficiency and other reasons, the actual implementation of HashMap's hash is a bit more involved.

Implementation of hash

Next, let me introduce how the hash method is implemented in HashMap. (This part draws on my earlier article analyzing hash() in Map; many of the articles on the Internet analyzing HashMap's hash method are derived from it.)

Concretely, it is implemented by two methods: int hash(Object k) and int indexFor(int h, int length).

hash: this method converts the Object key into an integer.

indexFor: this method converts the integer produced by hash into an index in the bucket array.

To stay focused, we will only look at the indexFor method here. Let's look at the Java 7 implementation first (Java 8 no longer has a separate method for this, but the index is computed with the same algorithm):

static int indexFor(int h, int length) {
    return h & (length-1);
}

indexFor maps the hashcode to an index in the bucket array. Of its two parameters, h is the element's hashcode value and length is the capacity of the HashMap. So what does return h & (length - 1) mean?

In fact, it is just a modulo operation. The reason Java uses a bitwise AND (&) here instead of the modulo operator (%) is, above all, efficiency.

A bitwise AND (&) is much faster than a modulo operation (%): & is a single cheap CPU instruction, while % requires an integer division, which takes many more cycles.

So why can a bitwise AND (&) stand in for the modulo operation (%)? The equivalence is:

X % 2^n = X & (2^n - 1)

Suppose n is 3. Then 2^3 = 8, which in binary is 1000, and 2^3 - 1 = 7, which is 0111.

In that case, X & (2^3 - 1) keeps just the last three binary digits of X.

From a binary point of view, X / 8 is equivalent to X >> 3: shifting X right by 3 bits gives the quotient of X / 8, while the three bits that are shifted out are exactly the remainder, X % 8.

If the explanation above is not clear, don't worry; you only need to remember the trick, or try a few examples yourself.

6 % 8 = 6, and 6 & 7 = 6

10 % 8 = 2, and 10 & 7 = 2

Therefore, return h & (length - 1) implements the modulo operation, as long as length is guaranteed to be a power of 2.
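The equivalence is easy to check in code. A minimal sketch (the class and method names are mine, not from the JDK):

```java
public class ModVsAnd {
    // Returns true if h & (length - 1) equals h % length for every
    // non-negative h below limit; length must be a power of 2.
    static boolean agree(int length, int limit) {
        for (int h = 0; h < limit; h++) {
            if ((h & (length - 1)) != (h % length)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Check against HashMap's default capacity of 16.
        System.out.println(agree(16, 1000)); // true
        // The examples from the text.
        System.out.println(6 & 7);  // 6, same as 6 % 8
        System.out.println(10 & 7); // 2, same as 10 % 8
    }
}
```

Try a length that is not a power of 2 (say, 10) and agree will return false, which is exactly why HashMap must keep its capacity a power of 2.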

To recap: a bitwise AND is far cheaper than a division, so it is more efficient than a modulo operation, and HashMap therefore uses the bitwise AND when computing the array index at which an element is stored. The substitution is only valid because the capacity of a HashMap is always 2^n.

So, given that it must be 2^n, why 16 in particular? Why not 4, 8, or 32?

As for the choice of this particular default, the JDK gives no official explanation, and I have not found any authoritative material about it online. (If anyone has relevant authoritative sources or ideas, please leave a comment.)

My inference is that it is an empirical value. Given that some default power of 2 must be chosen as the initial value, there is a trade-off between efficiency and memory usage: the value should be neither too small nor too large.

If it is too small, the map may need to expand frequently, hurting performance; if it is too large, it wastes memory.

So 16 was adopted as a reasonable empirical value.

In JDK 8 the default capacity is defined as static final int DEFAULT_INITIAL_CAPACITY = 1 << 4; // aka 16. Writing 16 as 1 << 4 deliberately reminds developers that this value must be a power of 2. Interestingly, the aka 16 comment is also new in 1.8.

Next, let's look at how HashMap guarantees that its capacity is always 2^n. What happens if the user sets a capacity himself?

HashMap enforces this in the two places where its capacity can change: when an initial capacity is specified, and when the map is expanded.

Specifying an initial capacity

When we set an initial capacity via HashMap(int initialCapacity), HashMap does not necessarily use the value we pass in directly; instead it computes a new value from it, in order to keep hashing efficient. (1 -> 1, 3 -> 4, 7 -> 8, 9 -> 16)

JDK 1.7 and JDK 1.8 round up this capacity at different times. In JDK 1.8 the capacity is computed when the HashMap constructor is called; in JDK 1.7 this does not happen until the first put operation.

Take a look at how the JDK finds the smallest power of 2 greater than or equal to the value passed in:

int n = cap - 1;
n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;
return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;

The purpose of the algorithm above is simple: take the capacity value passed in by the user (cap in the code) and compute the smallest power of 2 greater than or equal to it.

Look at how the values change in the following examples and you may spot the pattern: 5 -> 8, 9 -> 16, 19 -> 32, 37 -> 64. The conversion happens in two steps.

Step 1: 5 -> 7
Step 2: 7 -> 8

Step 1: 9 -> 15
Step 2: 15 -> 16

Step 1: 19 -> 31
Step 2: 31 -> 32

Step 1 corresponds to this part of the code:

n |= n >>> 1;
n |= n >>> 2;
n |= n >>> 4;
n |= n >>> 8;
n |= n >>> 16;

Step 2 corresponds to this part:

return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;

Step 2 is straightforward: it checks the limits and then adds 1 to the value obtained in Step 1.

How should we understand Step 1? It repeatedly shifts the binary value to the right and ORs the result with the original value. The goal is, starting from the highest 1 bit of the number, to set every lower bit to 1.

Take a binary number and run it through the formula to see what it does:

1100 1100 1100 >>> 1 = 0110 0110 0110
1100 1100 1100 | 0110 0110 0110 = 1110 1110 1110
1110 1110 1110 >>> 2 = 0011 1011 1011
1110 1110 1110 | 0011 1011 1011 = 1111 1111 1111
1111 1111 1111 >>> 4 = 0000 1111 1111
1111 1111 1111 | 0000 1111 1111 = 1111 1111 1111

After a few unsigned right shifts and bitwise OR operations, 1100 1100 1100 has become 1111 1111 1111. Adding 1 gives 1 0000 0000 0000, which is the smallest power of 2 greater than 1100 1100 1100.
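The bit-smearing of Step 1 can also be traced in code. This small sketch (class name BitSmear is mine) runs the shifts on the example value 1100 1100 1100:

```java
public class BitSmear {
    // Step 1 of the algorithm: OR the value with itself shifted right
    // by 1, 2, 4, 8 and 16 bits. This fills every bit position below
    // the highest set bit with 1s.
    static int smear(int n) {
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return n;
    }

    public static void main(String[] args) {
        int n = 0b1100_1100_1100; // the example from the text, 3276
        System.out.println(Integer.toBinaryString(smear(n))); // 111111111111
        System.out.println(smear(n) + 1); // 4096, i.e. 2^12 (Step 2)
    }
}
```

Five shifts are enough because an int is 32 bits wide: 1 + 2 + 4 + 8 + 16 shifts can propagate the highest bit across the whole word.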

We have now explained Step 1 and Step 2: together they convert a number into the smallest power of 2 greater than that number.

However, there is a special case the shifts alone cannot handle: numbers that are already powers of 2. Applying only the shift-and-OR steps to 4 would yield 8. The JDK solves this with the int n = cap - 1 at the start of the method: subtracting 1 first ensures that a value which is already a power of 2 is returned unchanged. (For a more detailed walk-through, see my earlier article analyzing hash() in Map.)

In short, based on the initial capacity passed in by the user, HashMap uses unsigned right shifts and bitwise ORs to compute the smallest power of 2 that is greater than or equal to that number.
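Putting both steps together with the initial cap - 1, here is a standalone sketch mirroring JDK 8's HashMap.tableSizeFor (MAXIMUM_CAPACITY is 1 << 30 in the JDK; the wrapper class name is mine):

```java
public class TableSize {
    static final int MAXIMUM_CAPACITY = 1 << 30;

    // Mirrors JDK 8's HashMap.tableSizeFor: returns the smallest power
    // of 2 that is >= cap. The initial cap - 1 is what keeps an exact
    // power of 2 (e.g. 8) from being doubled to 16.
    static int tableSizeFor(int cap) {
        int n = cap - 1;
        n |= n >>> 1;
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return (n < 0) ? 1 : (n >= MAXIMUM_CAPACITY) ? MAXIMUM_CAPACITY : n + 1;
    }

    public static void main(String[] args) {
        System.out.println(tableSizeFor(1)); // 1
        System.out.println(tableSizeFor(7)); // 8
        System.out.println(tableSizeFor(8)); // 8 (already a power of 2)
        System.out.println(tableSizeFor(9)); // 16
    }
}
```

Note how tableSizeFor(8) returns 8, not 16: that is the cap - 1 special case at work.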

Expansion

In addition to being set at initialization, the capacity of a HashMap can also change during expansion.

HashMap has an expansion (resize) mechanism: when the number of elements in the map (size) exceeds a threshold, the map automatically expands.

In HashMap, threshold = loadFactor * capacity.

loadFactor is the load factor, which indicates how full the HashMap is allowed to get; its default value is 0.75f. One nice property of 0.75 is that it is exactly 3/4, and since capacity is a power of 2, the product of the two numbers is an integer.

For a HashMap with the default settings, expansion is triggered when its size exceeds 12 (16 * 0.75).
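The threshold arithmetic for the default case is easy to verify; a minimal sketch (constant values taken from the JDK, class name mine):

```java
public class ThresholdDemo {
    public static void main(String[] args) {
        int capacity = 16;        // DEFAULT_INITIAL_CAPACITY in the JDK
        float loadFactor = 0.75f; // DEFAULT_LOAD_FACTOR in the JDK
        // threshold = capacity * loadFactor; with a power-of-2 capacity
        // and 0.75 = 3/4, the product is a whole number.
        int threshold = (int) (capacity * loadFactor);
        System.out.println(threshold); // 12: a resize is triggered once size exceeds 12
    }
}
```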

The following is a fragment of the expansion method (resize) in HashMap:

if ((newCap = oldCap << 1) < MAXIMUM_CAPACITY &&
        oldCap >= DEFAULT_INITIAL_CAPACITY)
    newThr = oldThr << 1; // double threshold

As the code shows, the size of the table doubles on expansion. After this step, the entries are redistributed into the new table; that part is not the focus of this article, so it is omitted.

So, when the number of elements (size) in the HashMap exceeds the threshold, the map automatically expands to twice its previous capacity, i.e. from 16 to 32, 64, 128, and so on.

Therefore, by ensuring that the initial capacity is a power of 2 and doubling the capacity on every expansion, HashMap guarantees that its capacity is always a power of 2.

Summary

HashMap is a data structure in which each element is hashed during put in order to compute the exact position where it will be stored.

The hash operation takes the hashCode of the target element's key and reduces it modulo the capacity of the Map. To make this modulo cheap, the JDK engineers replaced it with a bitwise AND, which requires the capacity of the Map to be a power of 2.

For the default capacity, neither too large nor too small is appropriate, so 16 was adopted as a reasonable empirical value.

To guarantee that the capacity of the Map is a power of 2 in all cases, HashMap enforces this in two places.

First, if the user specifies an initial capacity, HashMap computes the smallest power of 2 greater than or equal to that number and uses it as the initial capacity.

Second, on expansion the capacity is doubled, so 4 becomes 8, 8 becomes 16, and so on.

By analyzing why the default capacity of HashMap is 16, this article dug into the underlying principles of HashMap. From the code we can see that the JDK engineers pushed bit manipulation to the limit and tried every means to optimize for efficiency. Well worth learning from!


Origin: blog.csdn.net/small_love/article/details/112539974