[Chat and Miscellaneous Talk] An article explaining the essence of JVM tuning

1. What is JVM

Definition of JVM

Before talking about JVM tuning, you first need to know what the JVM is. JVM is the abbreviation of Java Virtual Machine. The JVM is a specification for a computing device: a fictitious computer realized by simulating various computer functions on an actual machine.

Why add a JVM between the program and the operating system? Java is a highly abstract language that provides features such as automatic memory management. These features cannot be implemented directly on top of the operating system, so the JVM acts as a translation layer. With the JVM as an abstraction layer, Java achieves cross-platform portability: as long as the JVM guarantees that a .class file executes correctly, the same file runs on Linux, Windows, MacOS, and other platforms. The point of Java's cross-platform design is "compile once, run everywhere", and the JVM is indispensable to that. For example, if we download a jar package of a given version from a Maven repository, it runs everywhere without being recompiled on each platform. Today's JVM-hosted languages, such as Clojure, JRuby, and Groovy, are all compiled to .class files in the end; the Java platform maintainers only need to keep the JVM's handling of class files stable for these extension languages to run seamlessly on top of the JVM.

If Java is a cross-platform language, then the JVM is a cross-language platform. According to incomplete statistics, more than a hundred languages can run directly on the JVM. Of course, the JVM itself is a specification (what does the specification define? If you are interested, have a look at Oracle's official site: https://docs.oracle.com/javase/specs/index.html), with different concrete implementations on different operating systems.

Why can these languages run on the JVM? Because source files written in them can be compiled into class files, and class files can be handed straight to the JVM to run. From the JVM's perspective, it does not matter what language you are: as long as you can be turned into a class file, the JVM can run you.

Common JVM implementations

Hotspot

Oracle's official implementation, originally acquired from Sun. Unsurprisingly, it is also the most commonly used virtual machine. After installation, type java -version on the command line to see the relevant information.

The bottom line shows that the current JVM is the 64-bit Server version of Hotspot. It is worth mentioning that mixed mode means a mix of interpreted execution and compiled execution. People often ask whether Java is an interpreted or a compiled language; by default it is a hybrid of the two, which the JVM tells you right in this output. The so-called mixed mode interprets code at first, and when hot code is detected (a method called many times, a loop executed many times, and so on), compiles it. This behavior can be selected with parameters:

-Xmixed: the default mixing mode

-Xint: use interpretation mode, fast startup, slow execution

-Xcomp: Use pure compilation mode, fast execution, slow startup

JRockit

BEA's implementation, once claimed to be the fastest JVM in the world; it was later acquired by Oracle and merged into Hotspot.

J9

IBM's implementation.

Microsoft VM

Microsoft's implementation.

TaobaoVM

Essentially a deeply customized version of Hotspot. Naturally, a company the size of Alibaba customizes it to fit its own needs.

LiquidVM

A virtual machine that targets the hardware directly: it does not sit on an operating system but interfaces with the hardware itself, so its execution efficiency soars.

Azul Zing

Known as the industry benchmark for garbage collection, and said to be particularly expensive. But for that much money, the service is not bad: according to its official website, it achieves garbage collection pauses within 1ms, which is incredible. Azul's garbage collection algorithm was later absorbed and optimized by Hotspot, from which ZGC was born.

2. Know the class file

generate class file

Create a class using Java

After compiling it, a class file is generated. Open this class file with IDEA's BinEd plug-in and you can see its content as hexadecimal

What is this thing? In essence, a class file is a binary byte stream. Its data types are u1, u2, u4, u8 and _info (the _info naming comes from the Hotspot source code): u1 is one byte, u2 is two bytes, and so on. These data types are only a logical division; the underlying binary is just 0s and 1s, and this byte stream is ultimately interpreted and run by the JVM.

The composition of the class file

The role of magic is to tell the JVM that this is a class file;

minor_version and major_version indicate the version number of the class file. Here 34 is hexadecimal; converted to decimal it is 52. JDK7 defaults to 51, JDK8 defaults to 52;

constant_pool_count refers to the number of constants in the constant pool;

access_flags refers to modifiers such as public, final.....

interface_count refers to the number of interfaces;

fields_count refers to the number of member variables;

.....

See the figure below for the meaning of each item
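To make these fields concrete, here is a minimal sketch that reads the magic number and the version numbers by hand (the file name Hello.class is an assumption; point it at any compiled class file):

import java.io.DataInputStream;
import java.io.FileInputStream;

public class ReadClassHeader {
    public static void main(String[] args) throws Exception {
        try (DataInputStream in = new DataInputStream(new FileInputStream("Hello.class"))) {
            int magic = in.readInt();           // u4 magic, should print CAFEBABE
            int minor = in.readUnsignedShort(); // u2 minor_version
            int major = in.readUnsignedShort(); // u2 major_version, 52 means JDK8
            System.out.printf("magic=%X minor=%d major=%d%n", magic, minor, major);
        }
    }
}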

Use tools to view the contents of class files

The JDK's built-in tools also include the javap command for analyzing the content of a class file

Enter the directory where the class file is located, and analyze the class file
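For example, assuming the compiled file is Hello.class, running javap -v Hello.class prints the version numbers, the constant pool, the access flags, and the bytecode of every method; adding -p additionally shows private members.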

In addition, I personally find the JClassLib plug-in for IDEA the best to use; it analyzes the contents of a class file in great detail.

3. Know the class loader

Now that we know what is inside a class file, how is the class file loaded next, and what is this class loader?

class file loading process

According to the official specification definition, in general, the loading process is divided into three steps: loading, linking, and initializing.

loading

It is to load the class file into memory.

linking

The process of verification checks whether the class file conforms to the defined standard. For example, if the first 4 bytes of the file are not CA FE BA BE, it is not a valid class file and verification fails.

The process of preparation assigns default values, not initial values. For int a = 8, this step assigns the default value 0, not the initial value 8.

The process of resolution resolves symbolic references to classes, methods, fields, and so on into direct references: the symbolic references in the constant pool become direct references such as pointers and offsets to memory addresses.

initializing

This step assigns the initial values and runs the static code blocks.
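A small demo of the difference between preparation and initialization; the class name T_InitOrder is made up for illustration. The static block observes the default value that preparation assigned, before initialization overwrites it (the qualified name T_InitOrder.a is used because a forward reference by simple name would not compile):

package com.feenix.jvm;

public class T_InitOrder {
    static {
        // at this point preparation has set the default value, so this prints 0
        System.out.println("in static block, a = " + T_InitOrder.a);
    }

    static int a = 8;   // initialization then assigns 8, in textual order

    public static void main(String[] args) {
        System.out.println("in main, a = " + a);  // prints 8
    }
}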

class loader

In the loading step mentioned above, the class file is loaded into memory by a class loader, that is, a ClassLoader. When a class file is loaded, two things are produced: the binary content itself is placed into memory, and at the same time a Class object is generated that points to that content.

The ClassLoader itself is also hierarchical, and different classes will be loaded using different ClassLoaders.

When getClassLoader() returns null, it means you have reached the top-level Bootstrap class loader. Bootstrap is implemented in C++ and has no corresponding Java class, so null is all it can return.
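A quick way to print the hierarchy (class names assume JDK8; in JDK9+ the extension loader was replaced by the platform loader):

public class T_LoaderLevels {
    public static void main(String[] args) {
        ClassLoader app = T_LoaderLevels.class.getClassLoader();
        System.out.println(app);                           // sun.misc.Launcher$AppClassLoader
        System.out.println(app.getParent());               // sun.misc.Launcher$ExtClassLoader
        System.out.println(app.getParent().getParent());   // null: Bootstrap is C++, no Java object
        System.out.println(String.class.getClassLoader()); // null: core classes belong to Bootstrap
    }
}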

Parental Delegation Mechanism

So how do so many types of class loaders work together?

1) Generally, when a class is to be loaded, the custom loader first checks whether it has already loaded this class. If it has, the class does not need to be loaded again; if not, the process moves on;

2) If the custom loader finds the class not yet loaded, it does not load it immediately; it first asks its parent loader, the App ClassLoader, whether it has loaded the class. If yes, "loaded" is returned; if not, the parent in turn asks its own parent loader;

3) If even the topmost Bootstrap class loader has no record of the class, Bootstrap checks whether the class should be loaded by itself. If it is not a class Bootstrap is responsible for, it delegates back down to its child, the Extension ClassLoader; if the class does not belong to the Extension ClassLoader either, it continues down to the App ClassLoader, until a matching loader is found to load the class;

4) After this round trip, the class is fully loaded into memory. If no loader can load it in the end, a ClassNotFoundException is thrown;

This is the so-called parental delegation mechanism: the child asks up to the parent, then the parent delegates back down to the child. Why make it so complicated, instead of just loading directly?

The fundamental reason is safety. Without the parental delegation mechanism, a custom class loader could load anything: you could write your own String type to override the String type provided by the JDK, and that String could do whatever it wanted, so everyone using your String type would run your code. A secondary reason is saving resources: files that have already been loaded need not be loaded again, only looked up.
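The delegation described above lives in ClassLoader.loadClass; a simplified sketch of the JDK8 implementation (bookkeeping and timing code omitted) shows the check-up-then-find-down order:

protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
    synchronized (getClassLoadingLock(name)) {
        Class<?> c = findLoadedClass(name);            // 1) already loaded by this loader?
        if (c == null) {
            try {
                if (parent != null) {
                    c = parent.loadClass(name, false);  // 2) delegate up to the parent first
                } else {
                    c = findBootstrapClassOrNull(name); // 3) top of the chain: ask Bootstrap
                }
            } catch (ClassNotFoundException e) {
                // the parent could not load it; fall through
            }
            if (c == null) {
                c = findClass(name);                    // 4) delegate back down: this loader tries itself
            }
        }
        if (resolve) {
            resolveClass(c);
        }
        return c;
    }
}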

custom class loader

You only need to extend ClassLoader, override the findClass method to load the file into memory, and then call the defineClass method to turn the bytes into a Class object.

package com.feenix.jvm.c2_classloader;

import com.feenix.jvm.Hello;

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;

public class T006_MyClassLoader extends ClassLoader {

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        File f = new File("c:/test/", name.replace(".", "/").concat(".class"));
        try {
            FileInputStream fis = new FileInputStream(f);
            ByteArrayOutputStream baos = new ByteArrayOutputStream();
            int b = 0;

            while ((b = fis.read()) != -1) { // read() returns -1 at EOF; a class file may legally contain 0x00 bytes
                baos.write(b);
            }

            byte[] bytes = baos.toByteArray();
            baos.close();
            fis.close();

            return defineClass(name, bytes, 0, bytes.length);
        } catch (Exception e) {
            e.printStackTrace();
        }

        return super.findClass(name); //throws ClassNotFoundException
    }

    public static void main(String[] args) throws Exception {
        ClassLoader l = new T006_MyClassLoader();
        Class<?> clazz = l.loadClass("com.feenix.jvm.Hello");
        Class<?> clazz1 = l.loadClass("com.feenix.jvm.Hello");

        System.out.println(clazz == clazz1);

        Hello h = (Hello) clazz.newInstance();
        h.m();

        System.out.println(l.getClass().getClassLoader());
        System.out.println(l.getParent());

        System.out.println(getSystemClassLoader());
    }
}

Basically, when writing a framework, a custom ClassLoader is indispensable: Spring has its own ClassLoader, and so does Tomcat. With a custom ClassLoader implemented, hot deployment can be realized.

4. Java Memory Model (JMM)

Modern CPU architectures keep getting more complex, essentially to squeeze out more efficiency: higher speeds and more and more layers of cache. The CPU computes many times faster than memory or disk, so something as slow as a disk cannot interact with the CPU directly. Instead, data on disk is first read into memory, then from memory into the caches... and finally into the registers, where the CPU uses it for computation.

A problem emerges from the figure above. After data is read from the L3 cache into L2, and since L2 is internal to each CPU, a motherboard with two CPUs gives each its own L2 cache, so the data read from L3 is no longer shared. If both CPUs modify that data, there is a data inconsistency problem.

From the hardware side, the simplest, crudest fix is a bus lock that locks L3 outright: while one CPU accesses data in L3, the other can only wait until the lock is released. Old CPUs used this method, but because it is so inefficient, newer CPUs no longer rely on it alone.

MESI Cache consistency protocol

Modern CPUs guarantee hardware data consistency through a cache coherence protocol. There are many concrete implementations of the protocol; Intel uses MESI, so people usually say "the MESI protocol" when discussing this. Cached content is tagged with one of four states, M, E, S, and I, and the data is handled differently depending on the state.

This protocol keeps the caches in each CPU consistent. Note that MESI does not completely remove the need for bus locking: current CPUs have not abandoned the bus lock and still use it when necessary. So data consistency today is actually implemented via cache locks (MESI and friends) plus bus locks.

Cache line sharing issues and cache line alignment

When reading data from disk into memory, it is wasteful to read only the exact bytes needed each time. To improve read efficiency, a contiguous run of content around the data is read in all at once. That contiguous run is called a cache line, and most cache lines are 64 bytes long.

However, if two different data items in the same cache line are held by two different CPUs, a false-sharing problem arises in which they interfere with each other. Suppose x and y are on the same cache line, the first CPU only uses x, and the second only uses y. After the first CPU modifies x, it notifies the second CPU that the cache line has changed, and the second must reload the line; likewise, after the second CPU modifies y, the first must reload the line.

False sharing wastes a serious amount of efficiency. In theory, as long as the two data items are not placed on the same cache line, data operations should get faster.

Before optimization:

package com.feenix.juc.c_028_FalseSharing;

public class T01_CacheLinePadding {
    private static class T {
        public volatile long x = 0L;
    }

    public static T[] arr = new T[2];

    static {
        arr[0] = new T();
        arr[1] = new T();
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 1000_0000L; i++) {
                arr[0].x = i;
            }
        });

        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 1000_0000L; i++) {
                arr[1].x = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println((System.nanoTime() - start) / 100_0000); // elapsed time in ms
    }
}

Optimized:

package com.feenix.juc.c_028_FalseSharing;

public class T02_CacheLinePadding {
    private static class Padding {
        public volatile long p1, p2, p3, p4, p5, p6, p7;
    }

    private static class T extends Padding {
        public volatile long x = 0L;
    }

    public static T[] arr = new T[2];

    static {
        arr[0] = new T();
        arr[1] = new T();
    }

    public static void main(String[] args) throws Exception {
        Thread t1 = new Thread(() -> {
            for (long i = 0; i < 1000_0000L; i++) {
                arr[0].x = i;
            }
        });

        Thread t2 = new Thread(() -> {
            for (long i = 0; i < 1000_0000L; i++) {
                arr[1].x = i;
            }
        });

        final long start = System.nanoTime();
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println((System.nanoTime() - start) / 100_0000); // elapsed time in ms
    }
}

This optimization is called cache line alignment (padding): deliberately occupying the whole cache line, trading space for time. Many open-source authors use it; it is a handy trick when chasing extreme speed.

command out of order problem

As mentioned earlier, statements in code are broken down into multiple instructions for execution. The CPU executes instructions very fast, and to improve throughput it will execute another instruction while one is waiting, provided the two instructions have no dependency on each other.

The out-of-order problem can also be proved with code:

package com.feenix.jvm.c3_jmm;

public class T04_Disorder {
    private static int x = 0, y = 0;
    private static int a = 0, b = 0;

    public static void main(String[] args) throws InterruptedException {
        int i = 0;
        for (; ; ) {
            i++;
            x = 0;
            y = 0;
            a = 0;
            b = 0;
            Thread one = new Thread(new Runnable() {
                public void run() {
                    // thread one starts first, so let it wait briefly for thread two; adjust the wait to your machine's performance
                    shortWait(100000);
                    a = 1;
                    x = b;
                }
            });

            Thread other = new Thread(new Runnable() {
                public void run() {
                    b = 1;
                    y = a;
                }
            });
            one.start();
            other.start();
            one.join();
            other.join();
            String result = "Round " + i + ": (" + x + "," + y + ")";
            if (x == 0 && y == 0) {
                System.err.println(result);
                break;
            } else {
                //System.out.println(result);
            }
        }
    }


    public static void shortWait(long interval) {
        long start = System.nanoTime();
        long end;
        do {
            end = System.nanoTime();
        } while (start + interval >= end);
    }
}

The wait can be very long; my own machine ran more than a million iterations before producing an out-of-order result

instruction order guarantee

In many cases, the operation of the program needs to ensure that the instructions cannot be executed out of order, so how to solve this out of order problem?

hardware memory barrier

From the hardware side, instruction ordering is guaranteed through the CPU's memory barriers. Taking Intel CPUs as an example, three instructions provide the guarantee:

1. sfence: the write operation before the sfence instruction must be completed before the write operation after the sfence instruction;

2. lfence: the read operation before the lfence instruction must be completed before the read operation after the lfence instruction;

3. mfence: The read and write operations before the mfence instruction must be completed before the read and write operations after the mfence instruction;

atomic instruction

For example, the "lock ..." prefix on x86 is a full barrier: it locks the memory subsystem during execution and enforces ordering, even across multiple CPUs. Software locks usually use memory barriers or atomic instructions to achieve variable visibility and preserve program order.

JVM specification

LoadLoad barrier: For such a statement Load1; LoadLoad; Load2, before the data to be read by Load2 and subsequent read operations is accessed, the data to be read by Load1 is guaranteed to be read completely;

StoreStore barrier: For such a statement Store1; StoreStore; Store2, before Store2 and subsequent write operations are executed, the write operation of Store1 is guaranteed to be visible to other processors;

LoadStore barrier: For such a statement Load1; LoadStore; Store2, before Store2 and subsequent write operations are flushed out, the data to be read by Load1 is guaranteed to be read completely;

StoreLoad barrier: For such a statement Store1; StoreLoad; Load2, before Load2 and all subsequent read operations are executed, the write to Store1 is guaranteed to be visible to all processors;

volatile

To guarantee instruction ordering from the code side, just add the volatile modifier. So what exactly does volatile do? At the bytecode level, it merely adds an ACC_VOLATILE flag

Nothing more can be seen at that level, so we look at the JVM level, where barriers are added around reads and writes of the volatile memory area:

StoreStoreBarrier - volatile write operations - StoreLoadBarrier

LoadLoadBarrier - volatile read operation - LoadStoreBarrier

The order of instructions is guaranteed through the barrier specification defined by the JVM.
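A minimal sketch of what this buys you in practice (class name made up for illustration): without volatile, the worker thread below may never observe the write from the main thread on some JVMs; with volatile, the write is flushed and visible:

public class T_VolatileVisibility {
    private static volatile boolean running = true; // try removing volatile: the loop may never stop

    public static void main(String[] args) throws InterruptedException {
        new Thread(() -> {
            while (running) { /* busy wait */ }
            System.out.println("worker stopped");
        }).start();

        Thread.sleep(100);
        running = false; // volatile write: becomes visible to the worker thread
    }
}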

synchronized

At the bytecode level, an ACC_SYNCHRONIZED flag is added, along with the monitorenter and monitorexit instructions. At the JVM level, C and C++ code calls the synchronization mechanisms provided by the operating system. At the hardware level, it comes down to the lock cmpxchg instruction.

Java object model

object creation process

1. Class loading;
2. Class linking (verification, preparation, resolution);
3. Class initializing;
4. Applying for object memory;
5. Assigning default values to member variables;
6. Assigning initial values to member variables and executing the constructor statements;

The storage layout of objects in memory

Ordinary object:
1. Object header: markword, 8 bytes;
2. ClassPointer: 4 bytes if -XX:+UseCompressedClassPointers is enabled, 8 bytes if not;
3. Instance data: a reference is 4 bytes if -XX:+UseCompressedOops is enabled, 8 bytes if not;
4. Padding: aligned to a multiple of 8;

Array object:
1. Object header: markword, 8 bytes;
2. ClassPointer: 4 bytes if -XX:+UseCompressedClassPointers is enabled, 8 bytes if not;
3. Array length: 4 bytes;
4. Array data;
5. Padding: aligned to a multiple of 8;

Whether the parameters mentioned above are enabled can be checked with java -XX:+PrintCommandLineFlags -version

object size

Observing the size of an object in Java is troublesome; unlike C or C++, there is no sizeof to call directly. In Java it can only be done through the Agent mechanism: when a class file is loaded into the JVM, the agent can intercept it, and the Instrumentation interface it receives can also report the size of an object.

package com.feenix.jvm.agent;

import java.lang.instrument.Instrumentation;

public class ObjectSizeAgent {
    private static Instrumentation inst;

    public static void premain(String agentArgs, Instrumentation _inst) {
        inst = _inst;
    }

    public static long sizeOf(Object o) {
        return inst.getObjectSize(o);
    }
}

Create META-INF/MANIFEST.MF in the src directory

Manifest-Version: 1.0
Created-By: feenix.com
Premain-Class: com.feenix.jvm.agent.ObjectSizeAgent

ObjectSizeAgent has the fixed premain entry method. Package it into a jar and add that jar to the project that needs the agent; the JVM then passes in the Instrumentation automatically, so inst.getObjectSize can be called directly to get the size of an object.

The Agent jar is attached at runtime by adding the parameter:
-javaagent:C:\work\ijprojects\ObjectSize\out\artifacts\ObjectSize_jar\ObjectSize.jar

package com.feenix.jvm.c3_jmm;

import com.feenix.jvm.agent.ObjectSizeAgent;

public class T03_SizeOfAnObject {
    public static void main(String[] args) {
        System.out.println(ObjectSizeAgent.sizeOf(new Object()));
        System.out.println(ObjectSizeAgent.sizeOf(new int[] {}));
        System.out.println(ObjectSizeAgent.sizeOf(new P()));
    }

    private static class P {
                        // 8 bytes _markword
                        // 4 bytes class pointer (compressed)
        int id;         // 4 bytes
        String name;    // 4 bytes (compressed oop)
        int age;        // 4 bytes

        byte b1;        // 1 byte
        byte b2;        // 1 byte

        Object o;       // 4 bytes (compressed oop)
        byte b3;        // 1 byte
    }
}

Adding up the member variables gives 31 bytes, and the final padding must round up to a multiple of 8, so the total size of this object is 32 bytes.

What exactly does the object header include?

The object header is actually quite complicated, and every version implements it differently. To find the exact details of a given version you have to consult the original specification or the source; for example, JDK8's implementation is in the Hotspot source code:

This is a C++ source file. You can see how the 32-bit and 64-bit layouts are implemented, how many bits are unused, and what each bit represents. Let's look specifically at the 64-bit layout

Why is the GC age capped at 15? Because only 4 bits record it, so the maximum is 15, and that cap is not adjustable.

About the hashcode: if a class does not override hashCode, the default is generated by os::random and can be obtained via System.identityHashCode. Once generated, the JVM records it in the markword. When is it generated? When the un-overridden hashCode method or System.identityHashCode is called.

Note that calling the hashcode of a lock object causes the object's biased lock or lightweight lock to be upgraded. The hashcode is only generated when one of those two methods is called: in the lock-free state it is stored in the markword, and under a heavyweight lock it is stored in the corresponding monitor, but a biased lock has nowhere to store this information, so the lock must be upgraded.

How does a reference locate an object? There are two approaches:

1. Handle pool

2. Direct pointer

Each has its pros and cons. Hotspot uses the direct pointer approach: the handle pool is more efficient during garbage collection, while the direct pointer is more efficient for access.

5. JVM instruction set

When a class file is loaded into the JVM, it enters the JVM's execution engine, and at run time its data lives in the run-time data areas

Program Counter: stores the position of the current instruction;

while (not end) {
    fetch the instruction at the position stored in the PC;
    execute the instruction;
    PC++;
}

JVM Stack : Each thread has its own stack, and each method corresponds to a stack frame. The stack frame contains:

① Local Variable Table
② Operand Stack: for long values, store and load are atomic in most virtual machine implementations (JLS 17.7), so no volatile is needed for that alone
③ Dynamic Linking
④ return address: when a() calls b(), this records where b() returns to

Direct Memory : memory in kernel space that the JVM can access directly, managed by the operating system; used by NIO to improve efficiency and achieve zero copy;

Heap : shared among threads in the JVM;

Method Area : The method area is shared by all threads.
Before JDK1.8 (exclusive), Method Area refers to Perm Space; string constants are located in Perm Space, and FGC will not clean it up.
From JDK1.8 (inclusive) on, Method Area refers to Meta Space; string constants are located in the Heap and FGC will clean them up;

Run-time Constant Pool : One item in the class file is the constant pool, which is stored here when running;

JVM Stack

To sum up: each thread has its own PC, VM Stack, and Native Method Stack, while the Heap and the Method Area are shared between threads.

Well, with the above knowledge in place, let's look at an interview question:

    public static void main(String[] args) {
        int i = 8;
        i = i++;
        System.out.println(i);
    }

Take a look at the internal structure of this class file,

As said above, a method corresponds to a stack frame, and the stack frame contains a local variable table. For this main method, two variables are used: the parameter args and i, circled in the figure above.

The operand stack is not shown explicitly; you can only picture it yourself from the instructions, one by one.

What is shown in the figure above is the order of operations after the code in the method is converted into instructions:
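Since the original figure is not reproduced here, this is roughly what javap -c prints for that main method (constant-pool indexes may differ):

 0: bipush        8
 2: istore_1
 3: iload_1
 4: iinc          1, 1
 7: istore_1
 8: getstatic     #2      // java/lang/System.out
11: iload_1
12: invokevirtual #3      // java/io/PrintStream.println(int)
15: return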

bipush 8 pushes the number 8 onto the operand stack; to put it bluntly, a push;

istore_1 pops the value at the top of the stack and stores it into the local variable table at index 1 (which, per the figure above, is the variable i in the method). Bluntly, it assigns 8 to i, completing the code int i = 8;

iload_1 takes the value of the variable at index 1 in the local variable table and pushes it, so 8 is pushed onto the operand stack again;

iinc 1 by 1 adds 1 to the variable at index 1 in the local variable table. The value in the local variable table is now 9, but the value on the stack is still 8;

istore_1 pops the 8 at the top of the stack and assigns it to i in the local variable table, so the i that was 9 a step ago becomes 8 again;

Therefore, the final result i is 8.

6. Garbage collection mechanism

what is garbage

Objects without references are garbage.

C / C++
- memory is managed manually: malloc/free, new/delete
- forget to release: memory leak, eventually out of memory
- release twice and you get extremely hard-to-debug bugs: one thread's memory inexplicably freed by another
- development efficiency is very low

Java / Python / Go / Js / Kotlin / Scala
- languages with managed memory
- GC, the Garbage Collector: application threads only allocate, and the collector reclaims
- greatly lowers the barrier to entry for programmers

Rust
- super efficient
- no manual memory management, and no GC either
- huge learning curve (ownership)
- once the program compiles, memory bugs are essentially ruled out

how to locate garbage

reference counting

The object header records the number of references to the object; when the count reaches 0, the object is treated as garbage. The problem: this approach cannot handle circular references. If three or five objects reference each other but nothing else references the group, the whole group is actually garbage, yet none of their counts is 0.

root searching (reachability analysis)

Which objects are roots? The JVM specification defines them: JVM stack references, native method stack references, the run-time constant pool, references inside the method area... Basically, the objects a program necessarily needs right after it starts can be regarded as root objects.

garbage removal algorithm

Mark-Sweep

The mark-sweep algorithm is relatively simple: find the garbage, mark it, then clear what was marked. It suits the case where many objects survive, but it requires two scans, so execution is slow, and it tends to leave fragmentation.

Copying

Memory is split in two, and the live objects are copied to the clean half; once the copy completes, the old messy half is cleared wholesale. This algorithm suits the case where few objects survive: it only scans once, is efficient, and produces no fragmentation. The downsides are equally obvious: half the space is wasted, and moving objects requires adjusting the references to them.

Mark-Compact

Live objects are all compacted together, so there is no fragmentation and no halving of memory, which makes allocation easy. The problems: it still scans twice and moves objects, so execution is slow.

Heap memory logical partition

Hotspot's older collectors use a generational algorithm; by the time of G1 the generations are only logical, not physical; and ZGC abandons generational planning entirely.

In the generational algorithm, memory is divided into a young generation and an old generation, 1:3 by default. The young generation is further divided into one eden area and two survivor areas, 8:1:1 by default; the old generation is usually called tenured. Different areas use different algorithms: the young generation (eden plus survivor) is collected with the copying algorithm, while the old generation generally uses mark-sweep or mark-compact to clean up garbage.

1. When an object is created, allocation on the stack is tried first; if it cannot be allocated on the stack, the object enters the eden area;

2. After one garbage collection in eden, survivors move to Survivor1; after another garbage collection, from Survivor1 to Survivor2, then back and forth between the two. Every garbage collection survived adds 1 to the age in the object's header;

3. When the object's age is old enough, it enters the old generation.

The garbage collection triggered when young-generation space runs out is called Minor GC / Young GC, YGC for short;

The garbage collection triggered when the old generation can no longer allocate is called Major GC / Full GC, FGC for short. When FGC triggers, the entire heap is collected.

Which objects are allocated on the stack, and which are allocated thread-locally? (See the sketch after this list.)

▪ Stack allocation
- thread-private small objects
- no escape from the method
- supports scalar replacement
- needs no tuning

▪ Thread Local Allocation Buffer (TLAB)
- occupies eden, 1% by default
- lets each thread claim allocation space without competing for eden, improving efficiency
- small objects
- needs no tuning
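A sketch of escape analysis at work (class and method names here are made up for illustration): the Point allocated in sum() never escapes the method, so with escape analysis and scalar replacement enabled, which modern HotSpot does by default, the heap allocation can be eliminated entirely. Compare runs with -XX:-DoEscapeAnalysis versus -XX:+DoEscapeAnalysis:

public class T_EscapeDemo {
    static class Point { int x, y; }

    static int sum() {
        Point p = new Point(); // never escapes sum(): eligible for scalar replacement / stack allocation
        p.x = 1;
        p.y = 2;
        return p.x + p.y;
    }

    public static void main(String[] args) {
        long total = 0;
        for (int i = 0; i < 100_000_000; i++) {
            total += sum();
        }
        System.out.println(total);
    }
}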

From the JDK's birth, the first garbage collector was Serial (single-threaded). Later, to make up for the low efficiency of single-threaded collection, Parallel Scavenge (multi-threaded) was born. CMS was introduced in the later JDK1.4 releases and can be called a milestone GC: it started the era of concurrent collection. However, CMS has so many problems that no JDK version defaults to it.

Common Garbage Collector

Serial

When Serial is garbage collected, all worker threads in the program are stopped.

When the JDK was born, JVM heaps were small, tens of megabytes, and a single collection thread did not take long to STW. But with the development of hardware, computer memory got bigger and bigger. Just imagine a single thread reclaiming tens or hundreds of gigabytes of space: the STW pause would stretch unbearably long.

Parallel Scavenge

If one thread is too slow, multiple threads improve throughput. So the simple, crude fix for long single-threaded STW is to add threads: many hands make light work. Up to and including JDK8, PS is the default garbage collector; that is, if no collector is specified when going live, Parallel Scavenge is what you get.

Parallel New

It can be considered a newer version of Parallel Scavenge; the difference is some enhancements that allow it to be used together with CMS.

CMS

Functionally, CMS can be considered a milestone among garbage collectors. Before CMS, whenever any collector was collecting, the worker threads had to stop and wait for collection to finish; during that STW period the program could do nothing but freeze, unresponsive. The great thing about CMS is that collection threads and worker threads run at the same time: you do your work, I collect my garbage, truly Concurrent Mark Sweep! However, CMS has so many problems that no version makes it the default; it can only be enabled by hand.

CMS took a long time from published paper to first release, so it was genuinely hard to build. It is essentially a mark-sweep algorithm and runs in roughly four phases:

1. Initial mark. A single thread marks only the root objects; the worker threads wait during this phase;

2. Concurrent marking. Multiple threads mark the so-called garbage objects while the worker threads keep working;

3. Remark. Multiple threads mark the garbage newly produced by the worker threads during phase 2; the worker threads wait during this phase;

4. Concurrent sweeping. The garbage marked in phase 3 is cleaned up while the worker threads keep working;

Strictly speaking, the worker threads still stop in phases 1 and 3, producing STW. But those two pauses are very short; the truly long phase is 2, over 80% of the total time, and that phase lets the worker threads keep running, so the remaining STW is negligible overall.

As mentioned, CMS has many problems. For example, during phase 4 the worker threads generate new garbage while the sweep runs; this is called floating garbage and can only be handled in the next collection. That is not serious. The truly painful point is:

Since CMS is Mark-Sweep, fragmentation is inevitable. When fragmentation reaches the point where the CMS old generation can no longer satisfy an allocation, the old generation falls back to Serial Old for collection. Imagine a machine with tens or hundreds of gigabytes of memory, and a single thread doing mark-compact over that huge space: you can imagine the STW time! CMS was designed precisely to cut STW waits on large heaps, yet it can trigger the worst pause itself. Picture an old lady sweeping Tiananmen Square alone with a broom...

Generally speaking there is no particularly good cure. You can try lowering the threshold that triggers CMS, i.e. keep enough spare space in the old generation: -XX:CMSInitiatingOccupancyFraction defaults to 92; lowering this value makes CMS start earlier and keeps more free space in the old generation.

There is a very good question: an information website with 500,000 PV (serving documents read from disk into memory) ran on a 32-bit server with a 1.5G heap. Users complained the site was slow, so the company upgraded to a 64-bit server with a 16G heap. The result: users reported severe freezes, with even lower efficiency than before. Why?

It is actually simple. With the small heap, lots of browsing data loaded into memory meant insufficient memory, frequent GC, long STW, and slow responses. After the hardware upgrade, the bigger heap means each FGC takes far longer. How to solve it? Simple: use G1. The future certainly belongs to G1 and ZGC.

7. Understand JVM tuning

JVM common command line parameters

First, the conventions Hotspot uses for JVM parameters:
1. Parameters starting with - are standard parameters that all JVMs should support;
2. Parameters starting with -X are non-standard parameters; each JVM implementation may differ, and they are not common to all JVMs;
3. Parameters starting with -XX are unstable parameters that may be removed in the next version;

Next, a simple piece of code to experiment with JVM parameter settings:

package com.feenix.jvm.c5_gc;

import java.util.LinkedList;
import java.util.List;

public class T01_HelloGC {

    public static void main(String[] args) {
        System.out.println("----- Hello ----- Feenix ----- GC ----->");
        List<byte[]> list = new LinkedList<>();
        for ( ; ; ) {
            byte[] b = new byte[1024 * 1024]; // keep allocating 1MB blocks until the heap overflows
            list.add(b);
        }
        }
    }

}

java -XX:+PrintCommandLineFlags T01_HelloGC

When no parameters are specified, the JVM will calculate an initial heap size data based on the size of the memory:

-XX:InitialHeapSize=60777536

-XX:MaxHeapSize=972440576

java -Xmn10M -Xms40M -Xmx60M -XX:+PrintCommandLineFlags -XX:+PrintGC T01_HelloGC

Here the command sets the minimum heap to 40M and the maximum to 60M. Generally these two should be set to the same size so the heap does not resize elastically: usage starts at the minimum heap and expands when memory runs short, up to the maximum. That elastic growth does nothing except waste precious system resources, so pin the size and let it be.

java -XX:+UseConcMarkSweepGC -XX:+PrintCommandLineFlags T01_HelloGC

I wanted to watch the CMS collection process, but the JDK on my machine is 11; CMS was deprecated in JDK 9 and removed entirely in JDK 14.

GC log output

The mainstream GC today is still PS+PO, so let's use that collector's log as an example: java -Xmn10M -Xms40M -Xmx60M -XX:+UseParallelOldGC -XX:+PrintCommandLineFlags -XX:+PrintGCDetails T01_HelloGC

.....

-XX:+UseParallelOldGC Use PS+PO garbage collector

-XX:+PrintGCDetails Print garbage collection log details

Common Garbage Collector Parameter Combinations

-XX:+UseSerialGC

This setting is equivalent to using Serial New (DefNew) + Serial Old and generally suits small programs. It is not chosen by default; HotSpot picks a collector automatically based on the machine configuration and JDK version;

-XX:+UseParNewGC

This setting is equivalent to using ParNew + Serial Old. The combination is rarely used and has been dropped in some versions;

-XX:+UseConcMarkSweepGC

The setting of this parameter is equivalent to using ParNew + CMS + Serial Old;

-XX:+UseParallelGC

The setting of this parameter is equivalent to using Parallel Scavenge + Parallel Old, which is also the garbage collector combination used by JDK1.8 by default;

-XX:+UseParallelOldGC

The setting of this parameter is equivalent to using Parallel Scavenge + Parallel Old, the same as -XX:+UseParallelGC;

-XX:+UseG1GC

The setting of this parameter is equivalent to using G1;

Basic Concepts of Tuning

Before talking about tuning, understand two basic concepts:

1. Throughput : user code execution time / (user code execution time + garbage collection time)

2. Response time : The shorter the STW, the better the response time
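As a worked example: if out of every 100 minutes of run time the application spends 99 minutes executing user code and 1 minute in GC, the throughput is 99 / (99 + 1) = 99%.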

For the so-called tuning, first decide what you pursue: throughput first, or response time first? Or: given a required response time, how much throughput is needed? Scientific computing and data mining projects generally put throughput first, and PS+PO is the recommended combination; website projects generally put response time first, and G1 is recommended.

In fact, tuning is not such a mysterious thing. To put it bluntly, it is system optimization. GC tuning is basically classified into three major parts:

1. Carry out JVM planning and pre-tuning according to requirements;

2. Optimize the JVM operating environment, such as slowness, lag, etc.;

3. Solve various problems that occur during the operation of the JVM, such as OOM;

Tuning starts with planning

Pre-planning may be the hardest step in tuning. Many managers don't understand it at all and just demand "support a million concurrent users". Generally, a million concurrency means TPS at the million level; for an e-commerce website, that means a million orders placed per second. To put that in perspective: Taobao's annual Double Eleven peaks at roughly 500,000 to 600,000 orders per second, and supposedly only 12306 during the Spring Festival rush reaches higher.

Therefore, tuning must start from the business scenario; tuning without a business scenario is meaningless. Remember there is no best garbage collector, only the most suitable one. The following steps can guide the tuning estimate:

1. Get familiar with the business scenario and decide whether tuning pursues response time or throughput;
2. Choose the garbage collector combination according to that goal;
3. Calculate the memory requirement;
4. Select the CPU: within budget, the stronger the better;
5. Set the space sizes and tenuring age for the generations;
6. Set the log parameters. This step is fairly involved; a production environment normally carries many of them: -Xloggc:/opt/xxx/logs/xxx-xxx-gc-%t.log -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=5 -XX:GCLogFileSize=20M -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCCause

Tuning case

Case 1: Vertical e-commerce, with a maximum of one million orders per day, what server configuration is required for the order processing system?

This question is somewhat amateurish, because many different server configurations can support it. Analyzed carefully: with one million orders per day, order traffic will have dense and sparse periods. Find the peak; suppose it is 1,000 orders/second. Then estimate how much memory one order needs. It varies by business, but an order rarely exceeds 512KB; at 512KB, 1,000 concurrent orders need roughly 500M of memory. A more professional way to frame the question: design for a response time within 100ms, and pressure-test against that.

Case 2: How should 12306 support the large-scale ticket grabbing during the Spring Festival?

12306 is probably the flash-sale site with the largest concurrency in China, claiming a peak of 1 million concurrent users. Traffic that high can no longer be handled from the angle of single-machine performance. A typical approach:
start with CDN, using caches in different regions of the country so that Beijing users hit Beijing servers and Shanghai users hit Shanghai servers. Behind the CDN sits a tier of LVS, behind the LVS a tier of Nginx, and behind Nginx the business servers.

Suppose that after distribution through all these tiers, each machine still has to support 1,000 to 10,000 concurrent requests (the single-machine 10k problem). How to handle it?
An ordinary e-commerce order: place the order -> the order system (IO) decrements inventory -> wait for user payment;
A possible 12306 model: place the order -> decrement inventory and record the order asynchronously at the same time (Redis or Kafka) -> wait for payment;

Case 3: The system CPU is often 100%, how to tune it?

Let's simulate this kind of problem with a piece of code; it is more intuitive with an example

package com.feenix.jvm.c5_gc;

import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

/**
 * Read credit data from the database, apply the model, and record and transmit the results
 */

public class T15_FullGC_Problem01 {

    private static class CardInfo {
        BigDecimal price = new BigDecimal(0.0);
        String name = "Feenix";
        int age = 5;
        Date birthdate = new Date();

        public void m() {

        }
    }

    private static ScheduledThreadPoolExecutor executor = new ScheduledThreadPoolExecutor(50,
            new ThreadPoolExecutor.DiscardOldestPolicy());

    public static void main(String[] args) throws Exception {
        executor.setMaximumPoolSize(50);

        for (; ; ) {
            modelFit();
            Thread.sleep(100);
        }
    }

    private static void modelFit() {
        List<CardInfo> taskList = getAllCardInfo();
        taskList.forEach(info -> {
            // do something: each call here schedules yet another never-cancelled periodic task, so tasks pile up without bound
            executor.scheduleWithFixedDelay(() -> {
                //do sth with info
                info.m();

            }, 2, 3, TimeUnit.SECONDS);
        });
    }

    private static List<CardInfo> getAllCardInfo() {
        List<CardInfo> taskList = new ArrayList<>();

        for (int i = 0; i < 100; i++) {
            CardInfo ci = new CardInfo();
            taskList.add(ci);
        }

        return taskList;
    }
}

Throw this Java file into the Linux environment, use javac to compile it into a class file, and run: java -Xms200M -Xmx200M -XX:+PrintGCDetails T15_FullGC_Problem01

1. Use the [top] command to find out which process occupies the most CPU resources;

It can be clearly seen that the 3503 process in the current system occupies the most resources, as high as 60.8%!

2. Use the [top -Hp ${process ID}] command to find out which thread in the process occupies the most CPU resources;

3. Use [jstack ${process ID}] to print out the stack information of all threads under the process;

The key in jstack's output is each thread's running state, e.g. java.lang.Thread.State: WAITING (parking). When many threads sit in the WAITING state for a long time, the program itself likely has a problem: multiple threads waiting for one mutex to be released, which may be a lock or a resource. In the jstack output, find the contended resource and which thread holds its lock; that thread's state is most likely RUNNABLE.

Case 4: yml configuration file setting server.max-http-header-size=100000000

This parameter means that every connection Tomcat accepts allocates a request-header buffer of 100000000 bytes, so memory is exhausted quickly. The symptom of this case is very clear: in the logs and heap dumps, http-related objects (http11OutputBuffer) dominate. With enough concurrent requests, no heap size can keep up, which easily leads to OOM.

Case 5: Lambda expression causes method area (MethodArea) overflow problem

When a lambda expression is used, a class file is generated dynamically. When lambda expressions keep being used in a loop, classes keep being generated and thrown into the method area, eventually overflowing it. How the method area is cleaned differs per garbage collector: some collectors do not clean it at all, and some only under particularly harsh conditions.

"C:\Program Files\Java\jdk1.8.0_181\bin\java.exe" -XX:MaxMetaspaceSize=9M -XX:+PrintGCDetails "-javaagent:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2019.1\lib\idea_rt.jar=49316:C:\Program Files\JetBrains\IntelliJ IDEA Community Edition 2019.1\bin" -Dfile.encoding=UTF-8 -classpath "C:\Program Files\Java\jdk1.8.0_181\jre\lib\charsets.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\deploy.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\access-bridge-64.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\cldrdata.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\dnsns.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\jaccess.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\jfxrt.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\localedata.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\nashorn.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunec.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunjce_provider.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunmscapi.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\sunpkcs11.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\ext\zipfs.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\javaws.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jce.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jfr.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jfxswt.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\jsse.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\management-agent.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\plugin.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\resources.jar;C:\Program Files\Java\jdk1.8.0_181\jre\lib\rt.jar;C:\work\ijprojects\JVM\out\production\JVM;C:\work\ijprojects\ObjectSize\out\artifacts\ObjectSize_jar\ObjectSize.jar" com.mashibing.jvm.gc.LambdaGC
[GC (Metadata GC Threshold) [PSYoungGen: 11341K->1880K(38400K)] 11341K->1888K(125952K), 0.0022190 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
[Full GC (Metadata GC Threshold) [PSYoungGen: 1880K->0K(38400K)] [ParOldGen: 8K->1777K(35328K)] 1888K->1777K(73728K), [Metaspace: 8164K->8164K(1056768K)], 0.0100681 secs] [Times: user=0.02 sys=0.00, real=0.01 secs] 
[GC (Last ditch collection) [PSYoungGen: 0K->0K(38400K)] 1777K->1777K(73728K), 0.0005698 secs] [Times: user=0.00 sys=0.00, real=0.00 secs] 
[Full GC (Last ditch collection) [PSYoungGen: 0K->0K(38400K)] [ParOldGen: 1777K->1629K(67584K)] 1777K->1629K(105984K), [Metaspace: 8164K->8156K(1056768K)], 0.0124299 secs] [Times: user=0.06 sys=0.00, real=0.01 secs] 
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at sun.instrument.InstrumentationImpl.loadClassAndStartAgent(InstrumentationImpl.java:388)
	at sun.instrument.InstrumentationImpl.loadClassAndCallAgentmain(InstrumentationImpl.java:411)
Caused by: java.lang.OutOfMemoryError: Compressed class space
	at sun.misc.Unsafe.defineClass(Native Method)
	at sun.reflect.ClassDefiner.defineClass(ClassDefiner.java:63)
	at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:399)
	at sun.reflect.MethodAccessorGenerator$1.run(MethodAccessorGenerator.java:394)
	at java.security.AccessController.doPrivileged(Native Method)
	at sun.reflect.MethodAccessorGenerator.generate(MethodAccessorGenerator.java:393)
	at sun.reflect.MethodAccessorGenerator.generateSerializationConstructor(MethodAccessorGenerator.java:112)
	at sun.reflect.ReflectionFactory.generateConstructor(ReflectionFactory.java:398)
	at sun.reflect.ReflectionFactory.newConstructorForSerialization(ReflectionFactory.java:360)
	at java.io.ObjectStreamClass.getSerializableConstructor(ObjectStreamClass.java:1574)
	at java.io.ObjectStreamClass.access$1500(ObjectStreamClass.java:79)
	at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:519)
	at java.io.ObjectStreamClass$3.run(ObjectStreamClass.java:494)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.io.ObjectStreamClass.<init>(ObjectStreamClass.java:494)
	at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:391)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1134)
	at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1548)
	at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1509)
	at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1432)
	at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1178)
	at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:348)
	at javax.management.remote.rmi.RMIConnectorServer.encodeJRMPStub(RMIConnectorServer.java:727)
	at javax.management.remote.rmi.RMIConnectorServer.encodeStub(RMIConnectorServer.java:719)
	at javax.management.remote.rmi.RMIConnectorServer.encodeStubInAddress(RMIConnectorServer.java:690)
	at javax.management.remote.rmi.RMIConnectorServer.start(RMIConnectorServer.java:439)
	at sun.management.jmxremote.ConnectorBootstrap.startLocalConnectorServer(ConnectorBootstrap.java:550)
	at sun.management.Agent.startLocalManagementAgent(Agent.java:137)

Case 6: Overriding finalize triggers frequent GC

Programmers coming from C++ override finalize(), causing it to be triggered frequently. In C++ memory is reclaimed manually, so cleanup lives in destructors; C++ programmers who do not understand Java override finalize() the same way. Here the finalize() logic took a long time, about 200 ms per object, objects piled up waiting to be finalized, and GC was triggered frequently.
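A hedged sketch of the anti-pattern (the class name and the 200 ms figure are illustrative):

public class FinalizeGC {
    @Override
    protected void finalize() throws Throwable {
        Thread.sleep(200); // ~200 ms of "destructor" work stalls the single finalizer thread
    }

    public static void main(String[] args) {
        for (;;) {
            new FinalizeGC(); // dead objects queue up for finalization faster than they drain
        }
    }
}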

Case 7: Memory usage stays below 10%, yet the GC log shows frequent Full GC

Somewhere in the code, System.gc() is being called explicitly to force garbage collection. (This can be neutralized with -XX:+DisableExplicitGC, listed among the parameters below.)
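A trivial repro sketch (assuming nothing else allocates; each call shows up in the log as a Full GC even though the heap is nearly empty):

public class ExplicitGC {
    public static void main(String[] args) throws InterruptedException {
        while (true) {
            System.gc();       // explicit request triggers a Full GC regardless of heap occupancy
            Thread.sleep(1000);
        }
    }
}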

Tuning tools

jconsole

jconsole is a tool that ships with the JDK; it is available on any machine with the JDK installed, in the JDK's bin directory.

However, to use this tool, the target program must be started with the relevant parameters: java -Djava.rmi.server.hostname=192.168.160.129 -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=11111 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Xms200M -Xmx200M -XX:+PrintGCDetails T15_FullGC_Problem01

After the project is successfully started, double-click jconsole.exe to run the tool and fill in the relevant information

After a successful connection, you can see the corresponding running information

As a graphical management tool, jconsole is naturally not used much on Linux systems, and its graphics are dated. The JDK also provides another tool, jvisualvm, located in the same directory as jconsole. For usage details see: https://www.cnblogs.com/liugh/p/7620336.html

jmap

jmap is a command-line tool that ships with the JDK. It lists the objects in a running program, so a pipe filter is usually appended to view only the top few dozen entries: jmap -histo ${pid} | head -20

If the counts of certain objects are obviously abnormal, use the command jmap -dump:format=b,file=xxx ${PID} to export the heap to a file for detailed viewing and analysis, then trace the abnormally numerous classes back to the code that creates them. This requires being very familiar with the business code: a large project contains an enormous number of classes and objects, so locating the culprit is difficult.

For online systems in particular, the heap is very large, and while jmap -dump executes it hits the program hard; the business is essentially paralyzed and unresponsive, so do not run jmap -dump lightly. The usual practice is to have the heap exported to a file automatically when an OOM occurs, via startup parameters (-XX:+HeapDumpOnOutOfMemoryError together with -XX:HeapDumpPath), and analyze it afterwards. The exception is when there are plenty of redundant server instances: taking one of them aside for a dump has no effect on users.

arthas

arthas is an online monitoring and diagnosis tool open-sourced by Alibaba. It gives a global, real-time view of application load, memory, GC and thread state, and can diagnose business problems without modifying application code: viewing the arguments, return values and exceptions of method calls, measuring method execution time, inspecting class-loading information, and more, which greatly improves the efficiency of online troubleshooting. The tool can be downloaded from GitHub: https://github.com/alibaba/arthas, and a carefully prepared Chinese document is also available: https://github.com/alibaba/arthas/blob/master/README_CN.md

This is the best Java online diagnostic tool I have used so far. It does not need a graphical interface mounted remotely, which saves precious resources on Linux servers.

Download it from GitHub and unzip it locally; you will see these jar packages.

Just start arthas-boot.jar directly: java -jar arthas-boot.jar

Before starting arthas, make sure there is at least one Java program already running. On startup, arthas finds all running Java programs by itself and lists them with their process IDs;

To monitor a particular process, enter its list number. For example, if process 4531 is listed as [1] in arthas, just type 1 and press Enter;

Once the target process is chosen, arthas attaches itself to it: Try to attach process 4531, and after attaching succeeds: Attach process 4531 success. You can then observe the process with arthas commands. Some of the more commonly used ones are introduced below:

help  lists some commands commonly used in arthas

jvm  displays the detailed configuration information of the JVM where the current process is located

thread  lists all thread information in the process

thread ${ID}   prints detailed stack information in the thread

dashboard  shows system resource usage on the command line, similar to the Linux top command

heapdump  exports a heap dump file; the output location and file name can be given after the command
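For example (the path is illustrative): heapdump /tmp/dump.hprof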

After the file is exported successfully, you can analyze the file through the jvisualvm tool that comes with the JDK

jad  decompile

Given a class (by name), this command decompiles the loaded bytecode back to source, online.

Some people may wonder: the source files are in my own hands, why bother decompiling class files? Online decompilation has its own uses: 1. locating problems in dynamically generated classes such as proxies; 2. inspecting the code of third-party classes; 3. confirming that the latest committed version is the one actually running.
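A usage example (the class name is hypothetical): jad com.example.demo.UserService prints the decompiled source of that loaded class, along with the classloader that loaded it.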

redefine   hot replacement

For large projects running online, the importance of hot replacement is self-evident. The feature currently has restrictions, though: you can only change the body of a method that already runs; you cannot change method names or fields.

 

The result of running TT is very simple.
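A hedged reconstruction of the demo class (the name TT comes from the text; the body is assumed):

public class TT {
    public void m() {
        System.out.println("old implementation"); // edit this line, recompile, then redefine
    }

    public static void main(String[] args) throws InterruptedException {
        TT tt = new TT();
        for (;;) {
            tt.m();          // keeps printing, so the effect of redefine is visible immediately
            Thread.sleep(1000);
        }
    }
}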

Start arthas, decompile and view the current code of TT

Modify the content of TT.java directly; after the change, recompile it and run the redefine command: redefine /usr/local/app/TT.class

As you can see, the hot replacement has taken effect; essentially this should be the new class bytes being reloaded into the JVM through its class-redefinition mechanism.

8. The future should be dominated by G1

Garbage First is actually not a new garbage collector; it was introduced back in Java 7 update 4. Before ZGC appeared, the official recommendation was already to use G1 instead of CMS. G1's biggest feature is the idea of partitioning (removing physical generations while keeping logical ones), weakening the concept of generations, making reasonable use of resources in each collection cycle, and fixing many defects of other collectors. It targets multi-core, large-memory machines, and in most cases it can meet a specified GC pause time while keeping throughput high.

G1 memory model

Before G1, the memory model was generally two contiguous spaces: the young generation and the old generation. G1 adopts the idea of regions: it divides the whole heap into many equally sized memory areas, and object allocation uses these segments one by one. So in terms of heap usage, G1 does not require object storage to be physically contiguous, only logically contiguous; and a region is not permanently bound to one generation, it can switch between serving the young generation and the old generation. The region size can be specified with -XX:G1HeapRegionSize=n. By default, the whole heap is divided into 2048 regions.

When an object's size reaches or exceeds half the region size, it is called a humongous object (Humongous Object). Humongous objects are expensive to move, and a single region may not even hold one, so they are allocated directly in the old generation; the contiguous space they occupy is called a humongous region (Humongous Region). G1 has an internal optimization: once a humongous object is found to have no references, it can be reclaimed directly during a young-generation collection cycle.

A humongous object exclusively occupies one or more contiguous regions: the first is marked Starts Humongous and the adjacent contiguous ones Continues Humongous. Because contiguous space is required and finding it may involve scanning the whole heap, the cost of placing a humongous object is very high, and applications should avoid creating them where possible.
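A hedged illustration: with a 1 MB region size, any object of 512 KB or more becomes humongous (the flags are standard G1 options; the class is illustrative):

// Run with: java -XX:+UseG1GC -XX:G1HeapRegionSize=1m -XX:+PrintGCDetails HumongousDemo
public class HumongousDemo {
    public static void main(String[] args) {
        // 600 KB is more than half of a 1 MB region, so this array is
        // allocated directly in a humongous region of the old generation.
        byte[] big = new byte[600 * 1024];
        System.out.println(big.length);
    }
}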

In short, G1 uses memory in units of Regions, while object allocation is tracked in units of Cards. When G1 collects, it gives priority to the regions with the fewest live objects, that is, the most garbage, which is why it is called Garbage First.

G1 basic concepts

Card Table

As mentioned above, G1's memory model divides the heap into regions, and each region is further divided into 512-byte cards (Cards), which are the minimum granularity at which heap memory is tracked. The cards of all regions are recorded in the Global Card Table. An allocated object occupies several physically contiguous cards. When looking for references to objects in a region, the cards are used to locate the referenced objects, and each collection only needs to process the cards of the relevant regions.

During a YGC, everyone knows the goal: clear the garbage out of the young generation and move the surviving objects from Eden into a Survivor space. But actually tracing the live objects is not an easy task. Suppose there are many objects in both the young and old generations: how do you decide which are alive and which are garbage? The root-reachability algorithm finds live objects starting from the roots, but a root may point to an object in the old generation, and that old-generation object may in turn point back into the young generation. This creates a serious problem: to identify the garbage of the young generation alone, you would have to traverse the entire old generation, which is very inefficient.

With cards introduced, if an object in a card holds a reference back into the young generation, that card is marked dirty. A bitmap records which cards in memory are dirty: 0 means clean, 1 means dirty. Then only the dirty cards need to be scanned, which saves a great deal of work and greatly improves efficiency.
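A toy illustration of the idea (not HotSpot code; all names are made up): one byte per 512-byte card, plus a write barrier that dirties the card covering the field being written:

class ToyCardTable {
    static final int CARD_SHIFT = 9;   // 2^9 = 512-byte cards
    final byte[] cards;                // 0 = clean, 1 = dirty

    ToyCardTable(long heapBytes) {
        cards = new byte[(int) (heapBytes >> CARD_SHIFT)];
    }

    // Called by the write barrier whenever a reference field is stored.
    void markDirty(long fieldOffsetInHeap) {
        cards[(int) (fieldOffsetInHeap >> CARD_SHIFT)] = 1;
    }

    // YGC then scans only the dirty cards instead of the whole old generation.
    boolean isDirty(int cardIndex) {
        return cards[cardIndex] == 1;
    }
}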

CSet(Collection Set)

The CSet is the set of regions to be collected. Live data in the CSet is moved to other available regions during GC. Regions in the CSet can come from Eden, Survivor space, or the old generation. The CSet occupies less than 1% of the whole heap.

RSet(Remembered Set)

The RSet records references from objects in other Regions to objects in this Region. Its value is that the collector does not need to scan the whole heap to find who references the objects in the current region; scanning the RSet is enough. This structure is key to G1's efficient collection, and it also works hand in hand with the tri-color marking algorithm described below.

Dynamic young/old generation ratio

Typically the young generation occupies 5% to 60% of the heap. This ratio does not need to be specified manually, and it is recommended not to, because it is the lever G1 uses to meet its predicted pause time. G1 tracks the duration of each STW pause: if the STW target is set to 100 ms but some pause reaches 200 ms, G1 dynamically adjusts this ratio to bring pauses back toward the target.
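For example (standard G1 flags; the class name is illustrative): java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 -Xms4g -Xmx4g MyApp lets G1 size the young generation by itself while aiming for 100 ms pauses.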

G1 log

[GC pause (G1 Evacuation Pause) (young) (initial-mark), 0.0022907 secs]
   [Parallel Time: 0.6 ms, GC Workers: 4]
      [GC Worker Start (ms): Min: 112208.2, Avg: 112208.4, Max: 112208.5, Diff: 0.3]
      [Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 0.9]
      [Update RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Processed Buffers: Min: 0, Avg: 0.2, Max: 1, Diff: 1, Sum: 1]
      [Scan RS (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Object Copy (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.1, Diff: 0.1, Sum: 0.2]
         [Termination Attempts: Min: 1, Avg: 1.0, Max: 1, Diff: 0, Sum: 4]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
      [GC Worker Total (ms): Min: 0.2, Avg: 0.3, Max: 0.4, Diff: 0.3, Sum: 1.2]
      [GC Worker End (ms): Min: 112208.6, Avg: 112208.7, Max: 112208.7, Diff: 0.0]
   [Code Root Fixup: 0.0 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 0.3 ms]
   [Other: 1.4 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 1.1 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.2 ms]
      [Humongous Register: 0.0 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 0.0 ms]
   [Eden: 0.0B(1024.0K)->0.0B(1024.0K) Survivors: 0.0B->0.0B Heap: 19.5M(20.0M)->19.5M(20.0M)]
 [Times: user=0.01 sys=0.00, real=0.00 secs] 
[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0000311 secs]
[GC concurrent-mark-start]
[Full GC (Allocation Failure)  19M->19M(20M), 0.0517014 secs]
   [Eden: 0.0B(1024.0K)->0.0B(1024.0K) Survivors: 0.0B->0.0B Heap: 19.5M(20.0M)->19.5M(20.0M)], [Metaspace: 3900K->3900K(1056768K)]
 [Times: user=0.06 sys=0.00, real=0.06 secs] 
[Full GC (Allocation Failure)  19M->19M(20M), 0.0414986 secs]
   [Eden: 0.0B(1024.0K)->0.0B(1024.0K) Survivors: 0.0B->0.0B Heap: 19.5M(20.0M)->19.5M(20.0M)], [Metaspace: 3900K->3900K(1056768K)]
 [Times: user=0.06 sys=0.00, real=0.04 secs] 
[GC concurrent-mark-abort]

In this log, all three stages of G1 garbage collection are printed:

[GC pause (G1 Evacuation Pause) (young) (initial-mark), 0.0022907 secs]
A young-generation collection (YGC). Evacuation Pause is the pause during which surviving objects are copied out (no idea who picked such an odd name). initial-mark marks the start of the mixed-collection cycle, meaning the old generation is being brought into collection as well; not every young-generation collection prints this.

[Parallel Time: 0.6 ms, GC Workers: 4]
GC Workers: 4 indicates that 4 GC worker threads are collecting together.

[Ext Root Scanning (ms): Min: 0.0, Avg: 0.2, Max: 0.4, Diff: 0.4, Sum: 0.9]
Ext Root Scanning means the search starts from the root objects.

[Eden: 0.0B(1024.0K)->0.0B(1024.0K) Survivors: 0.0B->0.0B Heap: 19.5M(20.0M)->19.5M(20.0M)]
After the collection completes, the before->after usage of each area is printed.

[GC concurrent-root-region-scan-start]
[GC concurrent-root-region-scan-end, 0.0000311 secs]
[GC concurrent-mark-start]

The concurrent stages of the mixed-collection cycle.

[Full GC (Allocation Failure) 19M->19M(20M), 0.0517014 secs]
   [Eden: 0.0B(1024.0K)->0.0B(1024.0K) Survivors: 0.0B->0.0B Heap: 19.5M(20.0M)->19.5M(20.0M)], [Metaspace: 3900K->3900K(1056768K)]
 [Times: user=0.06 sys=0.00, real=0.06 secs] 

When evacuation cannot proceed (no free region remains to copy survivors into), G1 ultimately falls back to a Full GC.

A few questions about G1

Will G1 generate FGC?

Obviously yes. Although G1 keeps reclaiming regions dynamically, when objects are allocated faster than they can be collected, garbage collection fails, objects can no longer be allocated, and an FGC occurs. Before JDK 10, G1's FGC was serial; it was later optimized to run in parallel. For G1, the ultimate tuning goal is to never trigger an FGC.

What to do if G1 produces FGC

1. Expand memory;
2. Improve CPU performance (faster collection: with the business producing objects at a fixed rate, collecting garbage faster frees memory sooner);
3. Lower the MixedGC trigger threshold so MixedGC happens earlier (the default is 45%);

The default MixedGC trigger is objects occupying more than 45% of the whole heap, at which point MixedGC starts. This ratio can be specified manually with -XX:InitiatingHeapOccupancyPercent.

MixedGC can be thought of, simply and roughly, as a CMS cycle: initial mark (STW), concurrent mark, final mark / remark (STW), then screening and evacuation (STW, parallel). The screening step picks the regions most worth collecting and copies their live objects straight into empty regions; compaction happens during the copy, so fragmentation is nowhere near as severe as with CMS.

Disadvantages of G1

Finally, compared with CMS, G1 does not have an all-round, overwhelming advantage. For example, while the application runs, G1 is higher than CMS both in the memory footprint created by garbage collection and in the extra execution overhead imposed on the running program. Empirically, CMS is likely to outperform G1 in small-heap applications, while G1 plays to its strengths with large heaps; the break-even point lies around 6-8 GB.

Moreover, G1 needs remembered sets (concretely, card tables) to record references between the young and old generations. These structures can consume a large amount of memory in G1, possibly 20% of the whole heap capacity or even more. And maintaining the remembered sets is expensive, adding execution overhead and hurting efficiency.

Given G1's design characteristics, the following problems exist:
1. Pause times are still long: G1 pauses typically reach tens to hundreds of milliseconds. That number sounds small, but while collection runs the application cannot serve requests for those tens or hundreds of milliseconds, which fails to meet real needs in some scenarios, especially ones with strict user-experience requirements;

2. Memory utilization is not high: handling reference relationships requires extra memory, generally around 1% to 20% of the whole heap;

3. The supported heap size is limited: G1 does not suit systems with very large memory; in particular, on systems with more than 100 GB of memory, pause times grow as the heap grows;

A few words about ZGC

As a new-generation garbage collector, ZGC set three goals at design time:
1. Support TB-scale heaps;
2. Keep pause times within 10 ms;
3. Reduce program throughput by no more than 15%;

ZGC has in fact met its design goals, supporting heaps of up to 4 TB (reportedly up to 16 TB now). In real-world tests the pause time usually stays below 10 ms, and pauses do not lengthen as memory grows.

Simply put, ZGC runs concurrently everything that can be run concurrently. ZGC was developed on the basis of G1. G1 already marks concurrently, so marking no longer affects pause time; G1's pauses come mainly from the copying algorithm used during collection (YGC and mixed collection). Copying means transferring an object to a new space and updating all other objects' references to it, and in practice the transfer involves memory allocation plus copying the object's member fields, which is very time-consuming.

In G1, object transfer runs in parallel but inside an STW pause; ZGC transfers objects concurrently, which is how it keeps pauses below 10 ms. G1 is concurrent only while MARKING, whereas ZGC made object copying and compaction, collection-set selection, and many other phases concurrent (running alongside application threads). That is the secret of its short STW times.

9. Concurrent marking algorithm

The difficulty of concurrent marking is that the objects' reference relationships keep changing while they are being marked.

Tri-color marking algorithm

Both CMS and G1 use tri-color marking for concurrent marking. Objects are logically divided into three colors: white means not yet marked; gray means the object itself is marked but its member fields are not yet scanned; black means both the object and its member fields have been marked.

Omissions can occur in certain situations:

Assume the objects are related as in the figure of the original article: A has been marked black, B gray, and D white, with B holding the reference to D. Now, at the same moment:
1. Object A points to object D;
2. Object B no longer points to object D;

When conditions 1 and 2 occur together, then because A is already black its member fields will not be scanned again, and D can no longer be reached through B, so D is missed: it is mistaken for garbage and reclaimed. To solve the missed-mark problem, it is enough to break either of the two conditions above:

1. Incremental update: when A is made to point to D, re-mark A as gray. On the next scan, the gray A's children are scanned again, D is found, and the miss is prevented;

2. SATB (snapshot at the beginning): when B's reference to D is about to disappear, push that reference onto the GC's stack to guarantee D can still be scanned; the marker later pops these references and traverses from them to find would-be-missed objects (both remedies are sketched below);
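A toy sketch of both remedies (illustrative only, not HotSpot code; all names are made up):

import java.util.ArrayDeque;
import java.util.Deque;

class TriColor {
    enum Color { WHITE, GRAY, BLACK }

    static class Obj { Color color = Color.WHITE; Obj field; }

    static final Deque<Obj> satbQueue = new ArrayDeque<>();

    // Incremental update: breaks condition 1 ("a black object now points to a white one").
    static void writeIncrementalUpdate(Obj holder, Obj newRef) {
        if (holder.color == Color.BLACK && newRef != null && newRef.color == Color.WHITE) {
            holder.color = Color.GRAY; // holder will be rescanned, so newRef is found
        }
        holder.field = newRef;
    }

    // SATB: breaks condition 2 ("the gray object's reference to the white one disappears").
    static void writeSATB(Obj holder, Obj newRef) {
        Obj old = holder.field;
        if (old != null && old.color == Color.WHITE) {
            satbQueue.push(old); // the old referent stays reachable to the marker
        }
        holder.field = newRef;
    }
}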

To solve missed marks, G1 chose the SATB algorithm, which in terms of efficiency is much faster than incremental update. Moreover, G1's RSet design dovetails with it: whenever a reference changes, the RSet is maintained, so the marker only needs to traverse the RSet instead of scanning the whole heap for references to white objects, which greatly improves efficiency. The combination of SATB and the RSet is seamless.

The RSet does, however, cost something on the assignment side. Because of it, every time a reference is assigned to an object field, extra work is done: an extra record is made in the RSet. In GC this is called a write barrier; note that it is not the memory write barrier at the CPU/JVM level.

10. Common parameters of GC

-Xmn : young generation size 
-Xms : initial (minimum) heap size 
-Xmx : maximum heap size 
-Xss : thread stack size

-XX:+UseTLAB   use TLAB, enabled by default

-XX:+PrintTLAB  print TLAB usage

-XX:TLABSize  set TLAB size

-XX:+DisableExplicitGC  makes System.gc() a no-op

-XX:+PrintGC  print GC log

-XX:+PrintGCDetails  print GC detailed log

-XX:+PrintHeapAtGC   print heap details at each GC

-XX:+PrintGCTimeStamps  Print system timestamps during GC

-XX:+PrintGCApplicationConcurrentTime  print how long the application ran between GC pauses

-XX:+PrintGCApplicationStoppedTime  prints the duration of application suspension during GC

-XX:+PrintReferenceGC  records how many references of different reference types are recycled

-verbose:class  print the detailed class-loading process

-XX:+PrintVMOptions  print JVM runtime parameters

-XX:+PrintFlagsInitial / -XX:+PrintFlagsFinal  print the initial / final values of all JVM flags, usage: java -XX:+PrintFlagsFinal -version | grep G1

-Xloggc:/opt/log/gc.log  write the GC log to the specified path and file name

-XX:MaxTenuringThreshold  promotion age threshold, maximum 15

-XX:PreBlockSpin  lock spin count

-XX:CompileThreshold  hot code detection parameters

Common parameters of the Parallel collector

-XX:SurvivorRatio  ratio of Eden to one Survivor space (default 8)

-XX:PreTenureSizeThreshold  objects larger than this threshold are allocated directly in the old generation

-XX:ParallelGCThreads   number of parallel collector threads (also applies to CMS), generally set equal to the number of CPU cores

-XX:+UseAdaptiveSizePolicy  automatically selects the size ratio of each area

Common parameters of CMS

-XX:+UseConcMarkSweepGC  specifies to use the CMS garbage collector

-XX:ParallelCMSThreads  Number of CMS threads

-XX:CMSInitiatingOccupancyFraction  the old-generation occupancy percentage at which CMS collection starts, default around 68%; if SerialOld stop-the-world fallbacks happen frequently, lower it (CMS will then collect more often)

-XX:+UseCMSCompactAtFullCollection  compact the old generation during FGC

-XX:CMSFullGCsBeforeCompaction  how many FGCs to allow before compacting

-XX:+CMSClassUnloadingEnabled  allow CMS to unload classes (collect the method area)

-XX:CMSInitiatingPermOccupancyFraction  the Perm occupancy percentage at which Perm collection starts (used before JDK 1.8)

-XX:GCTimeRatio   sets the target percentage of program run time spent on GC

-XX:MaxGCPauseMillis  pause-time goal; it is a best-effort target: GC will try various means to reach it, such as shrinking the young generation

G1 common parameters

-XX:+UseG1GC  specifies to use the G1 garbage collector

-XX:MaxGCPauseMillis  pause-time goal; a best-effort target: G1 will try to adjust the number of young regions to reach it

-XX:GCPauseIntervalMillis  GC interval time

-XX:G1HeapRegionSize  region size; it is recommended to increase it step by step: 1, 2, 4, 8, 16, 32 MB. As the size grows, garbage survives longer and GC intervals lengthen, but each GC takes longer; ZGC improved on this

-XX:G1NewSizePercent  The minimum proportion of the new generation, the default is 5%

-XX:G1MaxNewSizePercent   The maximum proportion of the new generation, the default is 60%

-XX:GCTimeRatio  GC time suggestion ratio, G1 will adjust the heap space according to this value

-XX:ConcGCThreads  number of concurrent GC threads

-XX:InitiatingHeapOccupancyPercent  heap occupancy percentage at which G1 starts a concurrent collection cycle

Origin blog.csdn.net/FeenixOne/article/details/128510771