03 | String performance optimization should not be underestimated, hundreds of megabytes of memory can easily store dozens of gigabytes of data

Starting with this second module, I will take you through performance optimization in Java programming. Today we begin with the most basic topic: String optimization.
String is one of the most frequently used object types, yet its performance problems are the most easily overlooked.
As the most important data type in the Java language, String objects often take up more memory than any other kind of object. Using strings efficiently can improve overall system performance.
Next, we will look in depth at three aspects: how the String object is implemented, its characteristics, and how to optimize its use in practice.
Before starting, I would like to ask you a short question, one I often put to candidates when I am interviewing. Although it is a cliche, the error rate is still very high. Some candidates do get the right answer, but few can explain the reasoning behind it. The question is as follows:
Three String objects are created in three different ways and then compared in pairs. Are the two objects in each pair equal? The code is as follows:

    String str1 = "abc";
    String str2 = new String("abc");
    String str3 = str2.intern();
    System.out.println(str1 == str2);
    System.out.println(str2 == str3);
    System.out.println(str1 == str3);

Think about the answer and the reason behind it first. I hope that after today's lesson you can get full marks.

How is the String object implemented?

In the Java language, Sun (now Oracle) engineers have optimized the String object many times to save memory and improve its performance. The changes across versions are summarized below:
[Figure: changes to the String implementation across Java versions]
1. In Java 6 and earlier, a String object wraps a char array and has four main member variables: the char array value, the offset offset, the character count count, and the hash value hash.
A String object locates its characters in the char[] array through the offset and count attributes. This lets several strings share one array quickly and saves memory, but it can easily cause memory leaks, because a short string produced by substring() keeps the entire original array reachable.
2. From Java 7 to Java 8, the String class no longer has the offset and count variables. A String object therefore occupies slightly less memory, and String.substring() no longer shares the char[], which solves the memory leak that method could cause.
3. Starting from Java 9, the char[] field is replaced by a byte[] field, and a new attribute, coder, records which encoding the bytes use.
Why did the engineers make this change?
We know that a char occupies 16 bits (2 bytes). Storing characters that fit in a single byte in a char array therefore wastes half the space. To save memory, the String class in JDK 9 stores strings in a byte array, whose elements are 8 bits (1 byte) each.
The new coder attribute is needed whenever the length of the string is computed or a method such as indexOf() is called, because the result depends on how many bytes each character occupies. coder has two possible values: 0 represents Latin-1 (single-byte encoding) and 1 represents UTF-16. If the string contains only Latin-1 characters, coder is 0; otherwise it is 1.
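To make the coder rule more concrete, here is a minimal sketch. It does not read the private coder field of String. `chooseCoder` and `payloadBytes` are hypothetical helpers that only mimic the rule described above, and the byte count ignores object headers.

```java
class CoderSketch {
    static final byte LATIN1 = 0; // same values as described in the text
    static final byte UTF16 = 1;

    // Hypothetical helper: returns LATIN1 only if every char fits in one byte.
    static byte chooseCoder(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return UTF16;
            }
        }
        return LATIN1;
    }

    // Approximate size of the character data in bytes under that rule.
    static int payloadBytes(String s) {
        return chooseCoder(s) == LATIN1 ? s.length() : s.length() * 2;
    }

    public static void main(String[] args) {
        System.out.println(payloadBytes("hello"));        // 5
        System.out.println(payloadBytes("\u4f60\u597d")); // 4 (two Chinese characters)
    }
}
```

Under this rule a pure ASCII string needs half the bytes it needed with a char array, which is where the saving comes from.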

Immutability of String objects

After looking at the implementation, you may have noticed that the String class is declared final, and the char array field inside it is also final.
A final class cannot be inherited. The char[] field is both private and final, and String exposes no method that modifies it, so a String object cannot be changed. This feature is called the immutability of String: once a String object has been created, it cannot be changed.

What are the advantages of Java doing this?
First, it keeps String objects secure. If String objects were mutable, they could be maliciously modified.
Second, it guarantees that the hash value never changes once computed, so a container such as HashMap can safely use strings as keys for its key-value caching.
Third, it makes the string constant pool possible. In Java there are usually two ways to create a string object. One is to use a string constant, such as String str = "abc"; the other is to create it with new, such as String str = new String("abc");.
When the first method is used, the JVM first checks whether the object is already in the string constant pool. If so, it returns a reference to that object; otherwise it creates a new string in the constant pool. This avoids repeatedly creating string objects with the same value and saves memory.
With String str = new String("abc"), three things happen. First, when the class file is compiled, the constant string "abc" is placed in the class file's constant structure, and when the class is loaded, "abc" is created in the constant pool. Second, when new is executed, the JVM invokes the String constructor, which references the "abc" string in the constant pool and creates a new String object in heap memory. Finally, str refers to that String object.
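The difference between the two creation methods can be observed directly. The following is a small sketch; the class and method names are only illustrative.

```java
class CreationSketch {
    // Two identical literals resolve to the same pooled object.
    static boolean literalsShareOneObject() {
        String a = "abc";
        String b = "abc";
        return a == b;
    }

    // new String(...) always creates a separate heap object with an equal value.
    static boolean newCreatesSeparateObject() {
        String a = "abc";
        String b = new String("abc");
        return a != b && a.equals(b);
    }

    public static void main(String[] args) {
        System.out.println(literalsShareOneObject());   // true
        System.out.println(newCreatesSeparateObject()); // true
    }
}
```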

Here is a classic counterexample that you might think of.
In everyday code, you assign "hello" to a String variable str and then assign "world" to it. The value of str has now become "world", so it has clearly changed. Why do I still say that String objects are immutable?
First, let me explain the difference between an object and an object reference. Java beginners often misunderstand this, especially those who switch from PHP to Java. In Java, == compares whether two references point to the same object; to check whether the values of two objects are equal, you need the equals method.
str is only a reference to a String object, not the object itself. An object occupies a region of memory, and str holds a reference to that memory address. In the example just mentioned, the first assignment creates a "hello" object and str points to its memory address; the second assignment creates a new "world" object and str now points to "world", but the "hello" object still exists in memory.
In other words, str is not an object, only an object reference. The real object is still in memory and has not been changed.
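A second reference makes this visible. In the sketch below, `alias` is an extra variable introduced only for illustration; it keeps pointing at the "hello" object after str is reassigned.

```java
class ReferenceSketch {
    // Returns { value seen through str, value seen through alias }.
    static String[] reassign() {
        String str = "hello";
        String alias = str; // second reference to the same "hello" object
        str = "world";      // str now points to a different object
        return new String[] { str, alias };
    }

    public static void main(String[] args) {
        String[] result = reassign();
        System.out.println(result[0]); // world
        System.out.println(result[1]); // hello
    }
}
```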

Optimization of String objects

Now that we understand the implementation and characteristics of the String object, let's look at real scenarios to see how to optimize its use and what to watch out for along the way.

1. How to construct a very large string?

String concatenation is very common in programming. I mentioned earlier that String objects are immutable. If we build the string we want by adding String objects together, will multiple objects be generated? Consider the following code:

String str = "ab" + "cd" + "ef";

Analyzing the code suggests that an "ab" object is generated first, then an "abcd" object, and finally an "abcdef" object. In theory, this code is inefficient.
But when we actually run it, only one object is generated. Why? Is our analysis wrong? Looking at the compiled code, you will find that the compiler has automatically optimized this line as follows:

String str="abcdef";
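One way to check this without decompiling is to compare references. The sketch below relies on the fact that compile-time constant expressions are folded and interned; the second method shows that a non-final variable prevents the folding. The names are illustrative.

```java
class ConstantFoldingSketch {
    // All operands are literals, so the compiler folds them into one constant.
    static boolean foldedAtCompileTime() {
        String folded = "ab" + "cd" + "ef";
        return folded == "abcdef";
    }

    // ab is not final, so the concatenation happens at run time
    // and produces a new object with an equal value.
    static boolean variableConcatIsNewObject() {
        String ab = "ab";
        String joined = ab + "cdef";
        return joined != "abcdef" && joined.equals("abcdef");
    }

    public static void main(String[] args) {
        System.out.println(foldedAtCompileTime());       // true
        System.out.println(variableConcatIsNewObject()); // true
    }
}
```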

That was the concatenation of string constants. Now let's look at the concatenation of string variables.

    String str = "Str test";
    for (int i = 0; i < 1000; ++i) {
        str = str + i;
    }

After compiling the above code, you can see that the compiler has optimized it as well. Java prefers StringBuilder when concatenating strings, which improves efficiency. (This is what javac produces up to Java 8; since Java 9 it emits an invokedynamic call to StringConcatFactory instead.) The decompiled code looks like this:

String str = "Str test";
for (int i = 0; i < 1000; i++) {
    str = (new StringBuilder(String.valueOf(str))).append(i).toString();
}

**To sum up:** even if the + sign is used for concatenation, the compiler can optimize it into StringBuilder. But if you look more carefully, you will find that the optimized code creates a new StringBuilder instance on every loop iteration, which also hurts performance.
So when concatenating strings, I suggest you still use StringBuilder explicitly.
If the concatenation happens in multithreaded code where thread safety matters, you can use StringBuffer. Note, however, that because StringBuffer is thread-safe and involves lock contention, it performs worse than StringBuilder.
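As a minimal sketch of the explicit version, the loop above can be written with a single StringBuilder that is reused across iterations. The method names are illustrative, and `withPlus` is kept only to show that both produce the same result.

```java
class BuilderSketch {
    // Concatenation with +: a new String is produced on every iteration.
    static String withPlus(int n) {
        String str = "Str test";
        for (int i = 0; i < n; i++) {
            str = str + i;
        }
        return str;
    }

    // Explicit StringBuilder: one builder for the whole loop.
    static String withBuilder(int n) {
        StringBuilder sb = new StringBuilder("Str test");
        for (int i = 0; i < n; i++) {
            sb.append(i);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(withBuilder(3));                        // Str test012
        System.out.println(withPlus(1000).equals(withBuilder(1000))); // true
    }
}
```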

2. How to use String.intern() to save memory?

After talking about constructing strings, let's discuss the storage of String objects. Let's look at a case first.
Every time a user posts a status message on Twitter, a piece of address information is generated. Based on the estimated number of Twitter users at the time, the servers needed 32G of memory to store this address information.

public class Location {
    private String city;
    private String region;
    private String countryCode;
    private double longitude;
    private double latitude;
}

Considering that many users share the same address information, such as country, province, and city, this part can be extracted into a separate class to reduce duplication. The code is as follows:

public class SharedLocation {
    private String city;
    private String region;
    private String countryCode;
}

public class Location {
    private SharedLocation sharedLocation;
    private double longitude;
    private double latitude;
}

After this optimization, the data shrinks to about 20G. But that is still very large for data kept in memory. What can be done?
This case comes from a talk by a Twitter engineer at the QCon Global Software Development Conference. Their solution was to use String.intern() to save memory and optimize the storage of String objects.
Specifically, the intern method of String is called on every assignment. If an equal value already exists in the constant pool, that object is reused and its reference is returned, so the original object can be garbage collected. For highly repetitive address information, this reduces the storage size from 20G to a few hundred megabytes.

// The setter and getter names are assumed; the classes above only show the fields.
SharedLocation sharedLocation = new SharedLocation();
sharedLocation.setCity(messageInfo.getCity().intern());
sharedLocation.setRegion(messageInfo.getRegion().intern());
sharedLocation.setCountryCode(messageInfo.getCountryCode().intern());
Location location = new Location();
location.setSharedLocation(sharedLocation);
location.setLongitude(messageInfo.getLongitude());
location.setLatitude(messageInfo.getLatitude());
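The effect of interning on duplicates can be illustrated at a small scale. The sketch below is not the Twitter code: `loadCities` is a hypothetical helper that simulates city names parsed from incoming messages, and `distinctObjects` counts how many separate String objects end up being referenced.

```java
import java.util.IdentityHashMap;
import java.util.Map;

class InternSketch {
    // Counts distinct objects by identity (==), not by equals().
    static int distinctObjects(String[] values) {
        Map<String, Boolean> seen = new IdentityHashMap<>();
        for (String v : values) {
            seen.put(v, Boolean.TRUE);
        }
        return seen.size();
    }

    // new String(...) stands in for a value parsed from a message.
    static String[] loadCities(int n, boolean intern) {
        String[] cities = new String[n];
        for (int i = 0; i < n; i++) {
            String city = new String(i % 2 == 0 ? "Beijing" : "Shanghai");
            cities[i] = intern ? city.intern() : city;
        }
        return cities;
    }

    public static void main(String[] args) {
        System.out.println(distinctObjects(loadCities(1000, false))); // 1000
        System.out.println(distinctObjects(loadCities(1000, true)));  // 2
    }
}
```

With interning, the 1000 entries reference only two String objects, and the temporary copies become eligible for garbage collection.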

In order to better understand, let's review the principle through a simple example:

    String a = "abc";
    String b = (new String("abc")).intern();
    if (a == b) {
        System.out.print("a==b");
    }

Output result:

	a==b

For a string constant, the object is placed in the constant pool. For a string created with new, the literal passed to the constructor is first created in the constant pool, then a separate String object is created in heap memory from it, and the reference to the heap object is returned.
If the intern method is called, it checks whether the string constant pool already contains a string equal to this object. If not, the string is added to the constant pool and a reference to it is returned; if so, the reference to the string already in the pool is returned. The original heap object can then be reclaimed by the garbage collector once nothing references it.
After understanding the principle, let's go back to the example above.
When a is created, the literal "abc" has already been placed in the constant pool during class loading, so a refers directly to the pooled object.
When b is created, new String("abc") creates an object in the heap; since "abc" is already in the constant pool, it is not created there again. Calling intern then looks in the constant pool for a string equal to this object, finds "abc", and returns a reference to it. The heap object is no longer referenced by anything and will be garbage collected. So a and b refer to the same object.
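The same reasoning answers the question from the beginning of the lecture. The sketch below wraps that code in a method (the class and method names are illustrative) so the three comparisons can be inspected together.

```java
import java.util.Arrays;

class InternQuizSketch {
    static boolean[] compare() {
        String str1 = "abc";             // reference to the pooled object
        String str2 = new String("abc"); // separate object in the heap
        String str3 = str2.intern();     // reference to the pooled object
        return new boolean[] { str1 == str2, str2 == str3, str1 == str3 };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(compare())); // [false, false, true]
    }
}
```

Only str1 and str3 point to the pooled "abc"; str2 is a different object with an equal value.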
The figure below summarizes how String objects are created and where they are allocated in memory:
[Figure: creation and memory allocation of String objects]
When using the intern method, pay attention to the actual scenario. The constant pool is implemented much like a HashTable, so the more entries it holds, the longer lookups in crowded buckets take. Interning a very large number of distinct strings therefore increases the burden on the whole string constant pool.

3. How to use the string splitting method?

Finally, let's talk about splitting strings, which is also very common in everyday coding. The split() method uses regular expressions to provide its powerful splitting capability, but the performance of regular expressions is unstable. Improper use can trigger catastrophic backtracking (also known as the backtracking trap), which may drive CPU usage very high.
So use split() with care. When splitting on a fixed delimiter, String.indexOf() can do the job instead. If indexOf() really cannot meet the requirement, watch out for backtracking in the pattern you pass to split().
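A minimal sketch of splitting with indexOf(), assuming a single-character delimiter, might look like this. The `split` helper is hypothetical, and unlike String.split() it keeps trailing empty pieces.

```java
import java.util.ArrayList;
import java.util.List;

class IndexOfSplitSketch {
    // Splits text on a single character without using regular expressions.
    static List<String> split(String text, char delimiter) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        int pos;
        while ((pos = text.indexOf(delimiter, start)) != -1) {
            parts.add(text.substring(start, pos));
            start = pos + 1;
        }
        parts.add(text.substring(start)); // last piece, possibly empty
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("a,b,,c", ',')); // [a, b, , c]
    }
}
```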

Summary

In this lecture, we saw that good String optimization can improve overall system performance. Following this idea, successive Java versions changed the member variables of String to save memory.
We also looked at the immutability of String, which makes the string constant pool possible. Because of this same feature, long string concatenation should use StringBuilder explicitly to perform well. Finally, we can use the intern method to let strings created at run time reuse objects with the same value in the constant pool, thereby saving memory.
To close, here is a personal point of view: a thousand-mile dike can collapse because of an ant nest. In daily programming, not understanding something as small as a string and using it inappropriately often leads to production incidents.