03 - Do not underestimate String performance optimization: hundreds of megabytes of memory can hold data that once needed tens of gigabytes

The String object is the most frequently used object type, yet its performance problems are the most easily overlooked. As a core data type in the Java language, String objects often take up more heap memory than any other type. Using strings efficiently can therefore improve the overall performance of a system.

Next, we will look in depth at how the String object is implemented, what its characteristics are, and how to optimize it in practice.

Before I start, I want to ask you a small question, one that I often ask candidates when I am interviewing. Although it is a cliché, the error rate is still high. Some candidates do get the answer right, but very few can explain the reasoning behind it. The question is as follows:

Three objects are created in three different ways and then compared in pairs. Are the two objects in each pair the same object? The code is as follows:

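String str1 = "abc";
String str2 = new String("abc");
String str3 = str2.intern();
// == compares references: does each pair point to the same object?
System.out.println(str1 == str2);
System.out.println(str2 == str3);
System.out.println(str1 == str3);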

Think about the answer first, and about why you would answer that way. I hope that after today's lesson you can get full marks.

1. How is the String object implemented?

Over successive Java releases, the engineers at Sun (and later Oracle) have optimized the String object many times to save memory and improve its performance. Let's take a look at how the implementation evolved, as shown in the figure below:

1.1. In Java 6 and earlier versions

The String object wraps a char array. It has four main member variables: the char array, an offset, a character count, and a hash value.

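The String object uses the offset and count fields to locate its characters inside the char[] array. This lets several String objects share one array quickly and cheaply, which saves memory, but it can also cause memory leaks: a short substring keeps a reference to the whole original array, so a very large array cannot be garbage collected as long as the small substring is still in use.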
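The sharing described above can be illustrated with a simplified model. This is only a sketch, not the JDK source: the class name SharedArrayString is hypothetical, and the cached hash field is left out for brevity.

```java
// Simplified, hypothetical model of the Java 6 String layout (not the JDK source).
class SharedArrayString {
    final char[] value;  // backing array, possibly shared with other instances
    final int offset;    // index in value where this string starts
    final int count;     // number of characters in this string

    SharedArrayString(char[] value, int offset, int count) {
        this.value = value;
        this.offset = offset;
        this.count = count;
    }

    // Java 6-style substring: nothing is copied, the result points at the same array.
    SharedArrayString substring(int begin, int end) {
        return new SharedArrayString(value, offset + begin, end - begin);
    }

    @Override
    public String toString() {
        return new String(value, offset, count);
    }

    public static void main(String[] args) {
        char[] big = new char[1_000_000];
        java.util.Arrays.fill(big, 'x');
        SharedArrayString huge = new SharedArrayString(big, 0, big.length);
        SharedArrayString tiny = huge.substring(0, 2);
        // The two-character substring still references the million-char array.
        System.out.println(tiny + " keeps " + tiny.value.length + " chars reachable");
    }
}
```

As long as tiny is reachable, the large array cannot be collected, which is the leak scenario mentioned above.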

1.2. From Java 7 to Java 8

Java made some changes to the String class: the offset and count fields were removed. One advantage is that each String object occupies slightly less memory. In addition, String.substring no longer shares the char[] array but copies the characters it needs, which solves the memory leak that this method could cause.

1.3. Starting from Java 9

The engineers changed the char[] field to a byte[] field and added a new field, coder, which identifies the encoding of the bytes.

Why did the engineers make this change?

We know that a char occupies 16 bits, that is, 2 bytes. Storing characters that fit in a single byte this way wastes half of the space. To save memory, the String class in JDK 9 stores its characters in a byte array, whose elements are 8 bits (1 byte) each.

The new coder field is needed because operations such as computing the length of a string or calling indexOf() must know how the bytes are encoded. The coder field has two possible values, 0 and 1: 0 represents Latin-1 (single-byte encoding) and 1 represents UTF-16. If every character in the string fits in Latin-1, coder is 0; otherwise it is 1.
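To make the role of coder concrete, here is a minimal sketch of the decision it encodes. It is not the JDK implementation: the class CompactStringSketch and its methods are hypothetical, and it only estimates the size of the character payload, ignoring object headers.

```java
// Hypothetical sketch of the Latin-1 / UTF-16 decision, not the JDK source.
class CompactStringSketch {
    static final byte LATIN1 = 0;
    static final byte UTF16 = 1;

    // A string can use Latin-1 only if every char fits in one byte (<= 0xFF).
    static byte coderFor(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Bytes needed for the characters: 1 per char for Latin-1, 2 per char for UTF-16.
    static int payloadBytes(String s) {
        return coderFor(s) == LATIN1 ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));        // all chars fit in Latin-1
        System.out.println(payloadBytes("h\u4f60\u597d")); // contains non-Latin-1 chars
    }
}
```

Note that a single character outside Latin-1 switches the whole string to two bytes per character.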

2. The immutability of String objects

Having looked at the implementation of the String object, have you noticed that the String class is declared final, and that its char array field is declared final as well?

We know that a final class cannot be inherited. The char[] field is both private and final, and String exposes no method that modifies it, so a String object cannot be changed. This property is called the immutability of the String object: once a String object has been created, it cannot be changed.

2.1. What are the benefits of Java doing this?

First, it keeps String objects safe. If String objects were mutable, they could be modified maliciously after being checked or shared.

Second, it guarantees that the hash value never changes, so the hash can be cached and a String can reliably serve as a key in a container such as HashMap.

Third, it makes a string constant pool possible. In Java there are usually two ways to create a string object: one is to write a string constant, such as String str = "abc"; the other is to create a string with new, such as String str = new String("abc").

With the first approach, the JVM first checks whether the string is already in the string constant pool. If it is, the JVM returns a reference to that object; otherwise it creates a new string in the constant pool. This avoids repeatedly creating string objects with the same value and saves memory.

With String str = new String("abc"), the process is different. First, when the class file is compiled, the constant string "abc" is placed in the class file's constant structure, and when the class is loaded, "abc" is created in the constant pool. Second, when new is executed, the JVM calls the String constructor, which refers to the "abc" string in the constant pool and creates a new String object in heap memory. Finally, str refers to that heap object.
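The difference between the two creation styles can be checked with a short example. The class and method names here are only illustrative.

```java
// Illustrative check of literal vs. new String(...) creation.
class StringPoolDemo {
    // Two identical literals resolve to the same pooled object.
    static boolean literalsShareOneObject() {
        String a = "abc";
        String b = "abc";
        return a == b;
    }

    // new String(...) always creates a separate heap object with an equal value.
    static boolean newCreatesDistinctObject() {
        String a = "abc";
        String b = new String("abc");
        return a != b && a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(literalsShareOneObject());   // true
        System.out.println(newCreatesDistinctObject()); // true
    }
}
```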

2.2. Classic counterexample

In everyday programming, we may assign "hello" to a String variable str and later assign "world" to it, after which the value of str is "world". The value of str has clearly changed, so why do I still say that String objects are immutable?

First, let me explain the difference between an object and an object reference. Java beginners often misunderstand this, especially those who have switched from PHP to Java. In Java, == compares whether two references point to the same object; to judge whether the values of two objects are equal, you need to use the equals method.

The answer is that str is just a reference to a String object, not the object itself. The object occupies a region of memory, and str is a reference that points to its address. In the example above, the first assignment creates a "hello" object and makes str point to it; the second assignment creates a new "world" object and makes str point to that one instead, while the "hello" object still exists in memory.

That is to say, str is not an object, but just an object reference. The real object is still in memory, unchanged.
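A small example makes the distinction visible. The class name ReferenceVsObject and the variable alias are only for illustration.

```java
// Reassigning a reference does not modify the object it used to point to.
class ReferenceVsObject {
    static String originalAfterReassignment() {
        String str = "hello";
        String alias = str;  // second reference to the same "hello" object
        str = "world";       // only the reference str is redirected
        return alias;        // the "hello" object is untouched
    }

    static String originalAfterConcat() {
        String str = "hello";
        String result = str.concat(" world"); // returns a new String object
        return str + "|" + result;
    }

    public static void main(String[] args) {
        System.out.println(originalAfterReassignment()); // hello
        System.out.println(originalAfterConcat());       // hello|hello world
    }
}
```

Even methods that appear to change a string, such as concat, return a new object and leave the original as it was.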

3. Optimization of String objects

After understanding the implementation principle and characteristics of the String object, let's combine the actual scene to see how to optimize the use of the String object and what points need to be paid attention to during the optimization process.

3.1. How to build a super large string?

In the process of programming, string concatenation is very common. I said earlier that String objects are immutable. If we use String objects to add and splice the strings we want, will multiple objects be generated? For example the following code:

String str= "ab" + "cd" + "ef";

Reading the code literally, we would expect an "ab" object to be created first, then an "abcd" object, and finally an "abcdef" object. In theory, this code is inefficient.

But when the code actually runs, only one object is created. Why? Is our theoretical analysis wrong? If we look at the compiled code, we find that the compiler has automatically optimized this line as follows:

String str= "abcdef";
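This folding can be observed from code, because a concatenation of literals is a compile-time constant and therefore ends up as a single pooled string. The following sketch (class name illustrative) also shows that the same concatenation through a non-final variable is performed at run time and yields a new object.

```java
// Compile-time folding of literal concatenation vs. run-time concatenation.
class ConstantFoldingDemo {
    static boolean constantsAreFolded() {
        String str = "ab" + "cd" + "ef"; // folded by the compiler into "abcdef"
        return str == "abcdef";          // same pooled object
    }

    static boolean runtimeConcatIsNewObject() {
        String ab = "ab";                // not final, so not a compile-time constant
        String str = ab + "cd" + "ef";   // concatenated at run time
        return str != "abcdef" && str.equals("abcdef");
    }

    public static void main(String[] args) {
        System.out.println(constantsAreFolded());       // true
        System.out.println(runtimeConcatIsNewObject()); // true
    }
}
```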

What we have just seen is the concatenation of string constants. Now let's look at the concatenation of string variables.

String str = "abcdef";

for (int i = 0; i < 1000; i++) {
    str = str + i;
}

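If you decompile the class produced from this code (by a Java 8 or earlier compiler), you can see that the compiler has rewritten it as well: string concatenation with + is turned into StringBuilder calls, which is more efficient than creating intermediate strings by hand. From Java 9 onward, javac emits an invokedynamic call instead of explicit StringBuilder code, but each loop iteration still builds a new String.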

String str = "abcdef";
 
for(int i=0; i<1000; i++) {
    str = (new StringBuilder(String.valueOf(str))).append(i).toString();
}

In short, even when + is used to concatenate strings, the compiler can turn it into StringBuilder calls. But if you look more carefully, you will notice that the compiler-generated code creates a new StringBuilder instance on every pass through the loop, which also hurts performance.

So when concatenating strings, especially in a loop, I suggest you use StringBuilder explicitly to improve performance.

If the concatenation must be thread-safe in multi-threaded code, you can use StringBuffer. Note, however, that because StringBuffer is synchronized and therefore involves lock contention, it performs worse than StringBuilder.
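The recommendation above can be sketched as follows: the loop reuses one StringBuilder instead of creating a new one per iteration. The class and method names are illustrative, and buildWithPlus is included only to show that both approaches produce the same result.

```java
// One StringBuilder for the whole loop vs. a new String on every iteration.
class ConcatDemo {
    static String buildWithBuilder(int n) {
        StringBuilder sb = new StringBuilder("abcdef");
        for (int i = 0; i < n; i++) {
            sb.append(i);        // appends in place, no intermediate String
        }
        return sb.toString();    // a single String is created at the end
    }

    static String buildWithPlus(int n) {
        String str = "abcdef";
        for (int i = 0; i < n; i++) {
            str = str + i;       // creates a new String on every iteration
        }
        return str;
    }

    public static void main(String[] args) {
        System.out.println(buildWithBuilder(3)); // abcdef012
        System.out.println(buildWithBuilder(1000).equals(buildWithPlus(1000)));
    }
}
```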

3.2. How to use String.intern to save memory?

After talking about building strings, let's discuss the storage of String objects. Let's look at a case first.

Every time a user posts a status on Twitter, a piece of location information is generated. Based on the estimated number of Twitter users at the time, the servers needed 32 GB of memory to store this location information.

public class Location {
    private String city;
    private String region;
    private String countryCode;
    private double longitude;
    private double latitude;
} 

Because many users share the same location details, such as country, province, and city, this part of the information can be moved into a separate class to reduce duplication. The code is as follows:

public class SharedLocation {

    private String city;
    private String region;
    private String countryCode;
}

public class Location {

    private SharedLocation sharedLocation;
    double longitude;
    double latitude;
}

This optimization reduces the storage size to about 20 GB. For data held in memory, however, that is still very large. What else can be done?

This case comes from a Twitter engineer's speech at the QCon Global Software Development Conference. The solution they thought of was to use String.intern to save memory space and optimize the storage of String objects.

The approach is to call the intern method of String every time a value is assigned. If an equal string already exists in the constant pool, that object is reused and a reference to it is returned, so the original object can be garbage collected. For location data with a very high repetition rate, this reduced the storage size from 20 GB to a few hundred megabytes.

SharedLocation sharedLocation = new SharedLocation();

// intern() each value so that equal strings share one pooled object
sharedLocation.setCity(messageInfo.getCity().intern());
sharedLocation.setRegion(messageInfo.getRegion().intern());
sharedLocation.setCountryCode(messageInfo.getCountryCode().intern());

// setter names are assumed; the classes above do not show them
Location location = new Location();
location.setSharedLocation(sharedLocation);
location.setLongitude(messageInfo.getLongitude());
location.setLatitude(messageInfo.getLatitude());
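The effect of interning on duplicated values can be checked with a self-contained sketch. It is a simplification under stated assumptions: SharedLocation is reduced to a single field, and the helper freshCopy is hypothetical, standing in for a value parsed from an incoming message.

```java
// Simplified sketch: equal city names end up sharing one String object after intern().
class InternedLocationSketch {
    static final class SharedLocation {
        private String city;
        void setCity(String city) { this.city = city; }
        String getCity() { return city; }
    }

    // Hypothetical stand-in for message parsing: yields a fresh heap String each time.
    static String freshCopy(String value) {
        return new String(value);
    }

    static SharedLocation withoutIntern(String rawCity) {
        SharedLocation location = new SharedLocation();
        location.setCity(rawCity);          // keeps the per-message copy alive
        return location;
    }

    static SharedLocation withIntern(String rawCity) {
        SharedLocation location = new SharedLocation();
        location.setCity(rawCity.intern()); // keeps only the pooled instance
        return location;
    }

    public static void main(String[] args) {
        SharedLocation a = withIntern(freshCopy("Hangzhou"));
        SharedLocation b = withIntern(freshCopy("Hangzhou"));
        System.out.println(a.getCity() == b.getCity()); // true: one shared object

        SharedLocation c = withoutIntern(freshCopy("Hangzhou"));
        SharedLocation d = withoutIntern(freshCopy("Hangzhou"));
        System.out.println(c.getCity() == d.getCity()); // false: two separate copies
    }
}
```

With millions of messages and only a limited set of distinct cities, regions, and country codes, keeping one object per distinct value instead of one per message is where the memory saving comes from.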

For a better understanding, let's review the principle through a simple example:

String a = new String("abc").intern();
String b = new String("abc").intern();

if (a == b) {
    System.out.print("a==b");
}

Output result:

a==b

A string constant is placed in the constant pool by default. A string created with new lives in heap memory: the literal passed to the constructor is still created in the constant pool, a separate String object is created in the heap from it, and the reference to the heap object is returned.

When the intern method is called, it checks whether the string constant pool already contains a string equal to this object. If not, the string is added to the constant pool and a reference to it is returned; if so, a reference to the string already in the pool is returned. The original object in heap memory can then be reclaimed by the garbage collector once nothing references it.

After understanding the principle, let's take a look at the above example together.

When a is created, an object is created in heap memory; in addition, when the class is loaded, a string object for the literal "abc" is created in the constant pool. Calling intern then looks in the constant pool for a string equal to "abc" and, since there is one, returns a reference to it.

When b is created, another object is created in the heap. Because the "abc" string already exists in the constant pool, it is not created again. Calling intern looks in the constant pool for a string equal to "abc", finds it, and returns the reference directly. The objects in heap memory will be garbage collected because nothing references them. So a and b refer to the same object.
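With this in mind, the opening question can be checked directly. The sketch below wraps the three comparisons in a small class (the name InternQuiz is illustrative) and prints the results.

```java
import java.util.Arrays;

// The opening question: which pairs refer to the same object?
class InternQuiz {
    static boolean[] compare() {
        String str1 = "abc";             // pooled literal
        String str2 = new String("abc"); // separate object in the heap
        String str3 = str2.intern();     // reference to the pooled "abc"
        return new boolean[] { str1 == str2, str2 == str3, str1 == str3 };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(compare())); // [false, false, true]
    }
}
```

str2 is the only reference to the heap object, so it matches neither of the others, while str1 and str3 both point to the pooled string.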

Below I use a picture to summarize how String objects are created and where they are allocated in memory:

One thing to note is that the intern method must be used with the actual scenario in mind. The constant pool is implemented much like a hash table, and the more data a hash table stores, the longer lookups can take. If too many strings are interned, the string constant pool itself becomes a burden.

3.3. How to use the string segmentation method?

Finally, I want to talk about splitting strings, which is also very common in everyday coding.

The split() method uses regular expressions to provide its powerful splitting features, but the performance of regular expressions is unpredictable, and improper use can cause backtracking, which may lead to high CPU usage.

So we should use split() with care. In many cases we can use String.indexOf() instead of split() to split a string.

If indexOf() really cannot meet your needs, pay attention to the backtracking problem when you use split().
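As a minimal sketch of the indexOf() approach, the following splits a string on a single delimiter character without any regular expression. The class IndexOfSplit and its method are hypothetical helpers, not part of the JDK, and unlike String.split() this version keeps trailing empty fields.

```java
import java.util.ArrayList;
import java.util.List;

// Splitting on a single character with indexOf(), no regular expression involved.
class IndexOfSplit {
    static List<String> split(String text, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int idx;
        // Find each delimiter in turn and cut out the text before it.
        while ((idx = text.indexOf(delimiter, start)) != -1) {
            parts.add(text.substring(start, idx));
            start = idx + 1;
        }
        parts.add(text.substring(start)); // remainder after the last delimiter
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,,c", ',')); // [a, b, , c]
    }
}
```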

4. Summary

In this lecture, we saw that optimizing String performance can improve the overall performance of a system. For the same reason, successive Java versions have optimized the String object by changing its member variables to save memory.

We also looked closely at the immutability of String objects. It is this property that makes the string constant pool possible, which saves further memory by avoiding repeated creation of string objects with the same value.

But because of this same property, when we concatenate long strings we need to use StringBuilder explicitly to get good performance. Finally, we can also use the intern method so that strings created at run time reuse objects with the same value in the constant pool, thereby saving memory.


Origin blog.csdn.net/qq_34272760/article/details/131808431