Why did JDK 9 change the underlying implementation of String from char[] to byte[]?

Pay attention to the public account: IT brother, read a dry technical article every day, you will find a different self after a year.

If you are no longer stuck on Java 8, you probably noticed long ago that the String class now stores its content in a byte[] instead of a char[]. Why was this done?

To get straight to the point: the main purpose of moving from char[] to byte[] is to reduce the memory occupied by strings. A side benefit of lower memory usage is that garbage collection runs less often.

1. Why optimize String to save memory space?

The jmap tool can show statistics for the objects in the heap, as well as ClassLoader information and the finalizer queue. Here we run jmap -histo:live pid | head -n 10 to list the classes that occupy the most memory, with the instance count and total bytes for each.

Here is the result for a running instance of my Programming Meow project (based on Java 8).

[Screenshot: output of jmap -histo:live for the running instance]

Among them, there are 17638 String objects, occupying 423312 bytes of memory, ranking third.

Since String in Java 8 is still backed by a char[], the number one memory consumer is the char array.

There are 17673 char[] objects, occupying 1621352 bytes of memory, ranking first.

In other words, optimizing String to save memory is well worth the effort. Optimizing a class that is used far less often than String would bring little benefit.

2. Why can byte[] save memory space?

As we all know, a char occupies two bytes in the JVM and holds one UTF-16 code unit, with a value between '\u0000' (0) and '\uffff' (65,535), inclusive.

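That is, representing a String with a char[] means that even a character that fits in a single byte still takes up two bytes.

In real-world development, single-byte characters are still used more often than double-byte ones.

Of course, changing char[] to byte[] is not enough on its own. It has to be combined with Latin-1 encoding, which represents each character with a single byte and therefore saves space compared with UTF-16.

In other words, for: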

String name = "jack";  


a string like this needs only 4 bytes with Latin-1 encoding.

But for:

String name = "小二";  


a string like this has no alternative: it must be encoded as UTF16.
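The byte counts above can be checked with a small sketch. It does not read the private byte[] inside String; it only measures how many bytes each encoding needs for the same text, which is an assumption about what the internal array holds. EncodedSizeDemo and its method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

class EncodedSizeDemo {
    // Bytes needed when every character is stored as one Latin-1 byte.
    static int latin1Size(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1).length;
    }

    // Bytes needed when every char is stored as two bytes.
    // UTF_16BE is used so that no byte-order mark is added to the count.
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // The two-character Chinese sample, written with Unicode escapes
        // so the source file compiles regardless of its encoding.
        String chinese = "\u5c0f\u4e8c";

        System.out.println(latin1Size("jack")); // 4
        System.out.println(utf16Size("jack"));  // 8
        System.out.println(utf16Size(chinese)); // 4
    }
}
```

For "jack", the Latin-1 layout halves the payload compared with two bytes per char. For the Chinese sample nothing is saved, but nothing is lost either.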

To tell the two encodings apart, the String source code in JDK 9 adds a coder field.

/**  
 * The identifier of the encoding used to encode the bytes in  
 * {@code value}. The supported values in this implementation are  
 *  
 * LATIN1  
 * UTF16  
 *  
 * @implNote This field is trusted by the VM, and is a subject to  
 * constant folding if String instance is constant. Overwriting this  
 * field after construction will cause problems.  
 */  
private final byte coder;  


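Java automatically picks the encoding based on the string's content, either Latin-1 or UTF16.

In other words, after the move from char[] to byte[], a Chinese character takes two bytes and a character in pure English text takes one byte. Before the change, Chinese and English characters both took two bytes.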
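As a rough illustration of that choice, the sketch below picks a coder by checking whether every char fits in one byte. This is not the JDK's actual code: CoderSketch, chooseCoder and payloadBytes are hypothetical names, and only the constant values 0 and 1 are meant to mirror the LATIN1 and UTF16 constants in the String source.

```java
class CoderSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // Hypothetical helper: the compact layout is only possible when
    // every char has a value of 0xFF or below.
    static byte chooseCoder(char[] chars) {
        for (char c : chars) {
            if (c > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Bytes of character data the string would need under the chosen coder.
    static int payloadBytes(char[] chars) {
        return chooseCoder(chars) == LATIN1 ? chars.length : chars.length * 2;
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("jack".toCharArray()));         // 4
        System.out.println(payloadBytes("\u5c0f\u4e8c".toCharArray())); // 4
        // One non-Latin-1 char switches the whole string to two bytes per char.
        System.out.println(payloadBytes("jack\u5c0f".toCharArray()));   // 10
    }
}
```

The last line shows that the coder applies to the string as a whole: a single character outside Latin-1 means the English letters next to it also take two bytes each.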

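3. Why use UTF-16 instead of UTF-8?

In UTF-8, characters 0 to 127 are represented with 1 byte, using the same encoding as ASCII. Only characters numbered 128 and above use 2, 3, or 4 bytes.

  • If there is only one byte, its highest bit is 0;

  • If there are multiple bytes, the first byte starts with as many consecutive 1 bits as there are bytes in the encoding, and each remaining byte starts with 10.

The concrete forms are:

  • 0xxxxxxx: one byte;

  • 110xxxxx 10xxxxxx: two-byte form (starts with two 1s);

  • 1110xxxx 10xxxxxx 10xxxxxx: three-byte form (starts with three 1s);

  • 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx: four-byte form (starts with four 1s).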
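The bit patterns in this list translate directly into mask checks on the first byte. The following is a minimal sketch; Utf8LeadByte and sequenceLength are illustrative names, not part of any JDK API.

```java
import java.nio.charset.StandardCharsets;

class Utf8LeadByte {
    // Returns the length of the UTF-8 sequence that starts with this byte,
    // or -1 if the byte cannot start a sequence.
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;                  // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;     // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;     // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;     // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;     // 11110xxx
        return -1;                            // 10xxxxxx or an invalid lead byte
    }

    public static void main(String[] args) {
        // U+0061, U+00E9, U+5C0F, U+1F600 (the last one as a surrogate pair).
        String[] samples = {"a", "\u00e9", "\u5c0f", "\uD83D\uDE00"};
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(utf8.length + " byte(s), lead byte says "
                    + sequenceLength(utf8[0]));
        }
    }
}
```

For the four samples, the encoded lengths are 1, 2, 3 and 4 bytes, and the lead byte alone predicts each of them.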

In other words, UTF-8 is variable-length, which is very inconvenient for a class like String that offers random access methods. Random access here means methods such as charAt and substring: you can pass any index, and String must return the result. If each character occupies a varying number of bytes, random access has to walk the string from the beginning, adding up the length of each character, to find the one you want.
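To make that cost concrete, here is a sketch of what locating the n-th character in a UTF-8 byte array would involve. It assumes well-formed UTF-8 and does no validation; Utf8Scan and its methods are hypothetical names used only for this example.

```java
import java.nio.charset.StandardCharsets;

class Utf8Scan {
    // Length of the sequence starting at a lead byte (well-formed input assumed,
    // so a continuation byte is never passed in).
    static int lengthFromLead(byte lead) {
        int b = lead & 0xFF;
        if (b < 0x80) return 1;
        if (b < 0xE0) return 2;
        if (b < 0xF0) return 3;
        return 4;
    }

    // Byte offset where the character at the given index starts.
    // Every earlier character has to be visited, so the cost grows with index.
    static int byteOffsetOf(byte[] utf8, int index) {
        int offset = 0;
        for (int seen = 0; seen < index; seen++) {
            offset += lengthFromLead(utf8[offset]);
        }
        return offset;
    }

    public static void main(String[] args) {
        // 'a' (1 byte), U+5C0F (3 bytes), 'b' (1 byte)
        byte[] utf8 = "a\u5c0fb".getBytes(StandardCharsets.UTF_8);
        System.out.println(byteOffsetOf(utf8, 0)); // 0
        System.out.println(byteOffsetOf(utf8, 1)); // 1
        System.out.println(byteOffsetOf(utf8, 2)); // 4
    }
}
```

With a fixed-width layout the same lookup is a single multiplication, which is why a fixed unit size suits charAt and substring.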

Some readers may then ask: isn't UTF-16 variable-length as well? Can't a character take 4 bytes there too?

Indeed, UTF-16 uses 2 or 4 bytes to store a character.

  • For characters with Unicode code points in the range 0 to FFFF, UTF-16 uses two bytes.

  • For characters whose Unicode code points range from 10000 to 10FFFF, UTF-16 uses four bytes. Specifically, 10000 is subtracted from the code point and the remaining 20 bits are split into two halves: the upper 10 bits are stored in a two-byte value between D800 and DBFF, and the lower 10 bits are stored in a two-byte value between DC00 and DFFF.
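The four-byte case can be worked through by hand for one code point. The sketch below splits U+1F600 into its two halves and compares the result with Character.highSurrogate and Character.lowSurrogate from the standard library; SurrogateSketch, high and low are names invented for the example.

```java
class SurrogateSketch {
    // Upper 10 bits of (codePoint - 0x10000), placed in the D800..DBFF range.
    static char high(int codePoint) {
        int v = codePoint - 0x10000;
        return (char) (0xD800 + (v >> 10));
    }

    // Lower 10 bits of (codePoint - 0x10000), placed in the DC00..DFFF range.
    static char low(int codePoint) {
        int v = codePoint - 0x10000;
        return (char) (0xDC00 + (v & 0x3FF));
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println(Integer.toHexString(high(cp))); // d83d
        System.out.println(Integer.toHexString(low(cp)));  // de00
        // The JDK helpers should agree with the manual split.
        System.out.println(high(cp) == Character.highSurrogate(cp)); // true
        System.out.println(low(cp) == Character.lowSurrogate(cp));   // true
    }
}
```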

But in Java a char is 2 bytes, so a character that needs 4 bytes is simply stored as two chars. All String operations work in units of Java chars: charAt returns the char at a given char index, substring returns the substring between two char indexes, and even length returns the number of chars.

So UTF-16 can be regarded as a fixed-length encoding in the Java world.
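A short example shows what working in units of char means for a character outside the 0 to FFFF range. It uses only standard String and Character methods; the class name CharUnitsDemo and the sample string are chosen for illustration.

```java
class CharUnitsDemo {
    // 'a' followed by U+1F600, which is stored as two chars.
    static final String S = "a\uD83D\uDE00";

    public static void main(String[] args) {
        System.out.println(S.length());                             // 3
        System.out.println(S.codePointCount(0, S.length()));        // 2
        // charAt(1) returns only the first half of the pair.
        System.out.println(Character.isHighSurrogate(S.charAt(1))); // true
        // codePointAt reassembles the full character from both halves.
        System.out.println(Integer.toHexString(S.codePointAt(1)));  // 1f600
        // substring also counts chars, so it can cut a pair in half.
        System.out.println(S.substring(1, 2).length());             // 1
    }
}
```

Indexing stays constant-time because every unit is one char, but an index does not always correspond to one whole character; codePointAt and codePointCount exist for code that needs the latter.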


Origin juejin.im/post/7078475591099351048