Detailed explanation of class file structure

A note up front: learning the class file structure is not like learning the JVM memory structure or the garbage collectors, which pay off directly when we write code. Once we know the JVM memory structure, we think more thoroughly about virtual machine parameters and notice room for optimization in our code. Once we know the garbage collectors, we can pick the collector that best fits the program's throughput needs and set appropriate parameters for the server's hardware configuration. Learning the class file is more about knowing not only what happens but why: it shows us how the code we write actually runs in the JVM. This material is relatively dry and wordy. It is split into two parts that describe the organization of the class file in detail: the class file structure and the bytecode instructions.

1. Class file structure

Although this part is dry, the good news is that it has changed very little since JDK 1.0, so once you master it you are more or less done. Unlike the JVM memory layout or the garbage collectors, it is not constantly revised, because every JDK release must stay compatible with earlier ones: a program compiled for an older version has to keep running on a newer JDK. That requirement keeps the class file structure relatively stable. Let's look at the individual structures of the class file and what they do.

What is a class file?
The JVM does not recognize .java files; all it can execute is class files. The javac compiler converts .java files into class files, which the virtual machine can then execute. Because every virtual machine executes class files, a Java program compiled once can run on other machines, which is the "compile once, run anywhere" property of the Java language. A class file is a binary stream organized in units of 8-bit bytes. Each data item (magic number, version, ...) is arranged in strict order with no separators in between. A data item that needs more than one byte is split into several bytes and stored with the high-order byte first. In a word, the class file is the file that holds, after compilation, the information the JVM can recognize; its unit of data is the 8-bit byte, and the data items are packed in order without gaps.
What are unsigned numbers, tables, and sets?
Before talking about the specific structure of the class file, we must first introduce these three structures, because all the information in a class file is stored in them.
Unsigned number: as the name implies, an unsigned number is a piece of data without a sign, and it is a basic data type. The types u1, u2, u4, and u8 stand for unsigned numbers of 1, 2, 4, and 8 bytes. They are used to store numbers, index references, quantity values, or UTF-8 encoded string values. The unsigned number is the smallest unit of data stored in a class file.
Table: a table is a data type composed of multiple unsigned numbers (a data item composed of other tables is also called a table). By convention, table names end with "_info". Tables are mainly used to describe composite data with hierarchical relationships. The entire class file can itself be regarded as one table, with all kinds of information stored hierarchically.
Set: a set is not a new type; it is a series of unsigned numbers or tables. It is called a set because it stores several of them. Since there are several items, the virtual machine needs to know how many are stored, so a u2 value called the capacity counter sits at the start of the set and records the number of items. As the name suggests, a set is used to store multiple data items of the same type.
Summarizing these three structures: the smallest stored structure is the unsigned number. Multiple unsigned numbers form a table, and multiple unsigned numbers or tables of the same type plus a preceding capacity counter form a set. In essence, all of them are built from unsigned numbers.
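The three structures can be sketched in a few lines of Java. This is only an illustration: the class name UnsignedReaderSketch, the helper names, and the sample bytes are made up for this example, and ByteBuffer is used because it reads multi-byte values high-order byte first by default, matching the class file layout.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: reading unsigned numbers and a "set" from raw bytes.
class UnsignedReaderSketch {
    static int readU1(ByteBuffer buf) { return buf.get() & 0xFF; }
    static int readU2(ByteBuffer buf) { return buf.getShort() & 0xFFFF; }
    static long readU4(ByteBuffer buf) { return buf.getInt() & 0xFFFFFFFFL; }

    // A set of u2 items: a u2 capacity counter followed by that many u2 values.
    static int[] readU2Set(ByteBuffer buf) {
        int count = readU2(buf);
        int[] items = new int[count];
        for (int i = 0; i < count; i++) {
            items[i] = readU2(buf);
        }
        return items;
    }

    public static void main(String[] args) {
        byte[] data = {
            (byte) 0xFF,                                      // u1
            0x01, 0x02,                                       // u2, high byte first
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, // u4
            0x00, 0x02, 0x00, 0x05, 0x00, 0x07                // set: counter 2, items 5 and 7
        };
        ByteBuffer buf = ByteBuffer.wrap(data);
        System.out.println(readU1(buf));                      // 255
        System.out.println(readU2(buf));                      // 258
        System.out.println(Long.toHexString(readU4(buf)));    // cafebabe
        System.out.println(readU2Set(buf).length);            // 2
    }
}
```

The masks (0xFF, 0xFFFF, 0xFFFFFFFFL) are needed because Java's own byte, short, and int are signed.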
Magic number and version number of the class file
Magic number: the first data item is the magic number, and its only job is identification. It is stored as a u4 at the start of the class file so that the virtual machine can tell whether the file is a class file it can execute. Some people will ask: isn't there already a file suffix to identify the file type, so isn't the magic number redundant? It is not. Identifying files by magic number is safer, because a suffix can be changed at will. The magic number of class files is 0xCAFEBABE. Class files are not the only ones with magic numbers: files with common suffixes such as jpg, jpeg, png, gif, zip, and jar have them too. As shown in the figure below, we open a class file with the WinHex hex editor to check whether it starts with 0xCAFEBABE, and it clearly does.

Version number: the version number is stored right after the magic number and occupies four bytes in total. The first u2 stores the minor version and the second u2 stores the major version. The minor version was effectively unused from JDK 1.2 up to JDK 12. Starting with JDK 12, if a preview feature is used in the .java file, the minor version in the generated class file is set to 65535, and that is currently its only use. The major version is the more important item, since it identifies the JDK version. JDK 1.0 and JDK 1.1 use the value 45, and each major JDK release after that increases the value by 1, so for the commonly used JDK 8 the value is 52. Why do we need to record the JDK version? Because the virtual machine must be backward compatible: files compiled for an earlier version must be executable on the current virtual machine. In addition, the virtual machine refuses to execute class files whose version is higher than its own. In other words, a JDK 8 virtual machine cannot execute class files compiled for JDK 9, but it can execute class files compiled by JDK 7, JDK 6, and earlier.
The minor version and major version are shown below in a file opened in a hex editor. Converting 0x34 to decimal gives 52.

Summarizing the magic number and version number: this content is determined by the JDK that compiled the file and does not vary from file to file (before JDK 12). Also, as mentioned above, the class file is a binary stream organized in 8-bit bytes, and the magic number and version number are the data items that occupy its first eight bytes.
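A minimal sketch of reading those first eight bytes is shown below. The class name ClassHeaderSketch and the hand-written sample header are assumptions made for this example; the mapping "major version minus 44" simply restates the rule above (45 for JDK 1.0 and 1.1, plus 1 for each later major release).

```java
import java.nio.ByteBuffer;

// Illustrative sketch: checking the magic number and reading the version.
class ClassHeaderSketch {
    static final int MAGIC = 0xCAFEBABE;

    // Returns {minor, major}; throws if the magic number does not match.
    static int[] readVersion(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);   // big-endian by default
        if (buf.getInt() != MAGIC) {
            throw new IllegalArgumentException("not a class file");
        }
        int minor = buf.getShort() & 0xFFFF;            // first u2
        int major = buf.getShort() & 0xFFFF;            // second u2
        return new int[] {minor, major};
    }

    // 45 covers JDK 1.0 and 1.1 (result 1); 52 maps to 8.
    static int toJavaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        // Hand-written header: magic, minor 0, major 0x34.
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34};
        int[] version = readVersion(header);
        System.out.println("minor=" + version[0] + " major=" + version[1]
                + " -> Java " + toJavaRelease(version[1]));
    }
}
```

With the sample header this prints minor=0 major=52 -> Java 8. To try it on a real file, the bytes could be loaded with java.nio.file.Files.readAllBytes.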
Constant pool
What is the constant pool?
The constant pool mentioned here is the class constant pool. The constant pools we commonly hear about are the class constant pool, the runtime constant pool, and the string constant pool, and these are three different things. Briefly: the class constant pool is the static data stored in the class file; the runtime constant pool lives in the method area and holds the literals and symbolic references of the class file after it is loaded; the string constant pool lives in the heap and is dedicated to string constants. Back to the topic: what is the (class) constant pool? As the name implies, it is a pool for storing constants. All the constants in the class file are stored in this pool. It is the resource warehouse of the class file and the structure most closely tied to the other items. It mainly stores two kinds of constants: literals and symbolic references. Literals are the constants, strings, and so on defined in the class (at run time, a local variable such as an int ends up in the local variable table, and a string goes into the string constant pool). Symbolic references are the fully qualified names of classes and interfaces, the names and descriptors of methods and fields, method handles and method types, and similar information. That is the constant pool.
Characteristics of the constant pool
The constant pool is the first table-structured data in the class file. As mentioned earlier, a table is composed of multiple unsigned numbers or multiple tables, and the constant pool is composed of many tables. Because it holds multiple tables, it starts with a u2 capacity counter that records the size of the constant pool. This capacity counter differs from the others: counting starts at 1 rather than 0, and each number is the index of a constant. Index 0 does not point to any table; it expresses "not referencing any item in the constant pool". In the figure below the constant pool count is 0x2B, which is 43 in decimal, meaning there are 42 table structures in the constant pool.

We can run javap -verbose followed by the class file name to view the bytecode content of the class file. Let's check whether the constant pool of this file really has 42 constants. The output is as follows, and it clearly shows a total of 42 constants (tables) in the constant pool.

First, an explanation of the constant pool listing above. The first column (#1, #2, ...) is the index of the entry. The third column (Methodref, String, Fieldref, ...) is the table type, and the fourth column holds the index numbers that make up the information stored in the current table. The fifth column, after the double slash, shows the concrete value stored in the current structure and works like a comment. From the picture above we can see not only that there are 42 constant tables in the constant pool, but also that they include Utf8, String, Fieldref, Methodref, and other table types. So which table structures can appear in the constant pool?
What are the table structures in the constant pool?
There are 17 table structures in the constant pool (as of JDK 13), used to store the literals and symbolic references of the class. When the virtual machine resolves the class, this information is loaded by index number. These 17 table types cover all the information a Java class needs. All the tables are as follows:

Because each table structure is laid out differently, it is not realistic to introduce every one of them. In my opinion there is no need to master each table structure in depth; it is enough to know that these tables exist and what each one stores. The detailed descriptions of the table structures are attached below for reference when needed:
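To show why the per-table layouts matter, here is a rough sketch that walks a constant pool and collects the tag of every entry. The class name ConstantPoolSketch and the tiny hand-built sample are assumptions for illustration; the tag numbers and entry sizes follow the constant pool section of the JVM specification as I understand it, so treat the switch as a reading aid and check it against the specification before relying on it.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: walking the constant pool and listing the tag of each entry.
class ConstantPoolSketch {
    static void skip(ByteBuffer buf, int n) { buf.position(buf.position() + n); }

    static List<Integer> readTags(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        buf.position(8);                                  // skip magic number and version
        int count = buf.getShort() & 0xFFFF;              // capacity counter
        List<Integer> tags = new ArrayList<>();
        for (int index = 1; index < count; index++) {     // indexes run from 1 to count - 1
            int tag = buf.get() & 0xFF;                   // every table starts with a u1 tag
            tags.add(tag);
            switch (tag) {
                case 1:                                   // Utf8: u2 length, then the bytes
                    skip(buf, buf.getShort() & 0xFFFF);
                    break;
                case 3: case 4:                           // Integer, Float
                case 9: case 10: case 11: case 12:        // Fieldref, Methodref, InterfaceMethodref, NameAndType
                case 17: case 18:                         // Dynamic, InvokeDynamic
                    skip(buf, 4);
                    break;
                case 5: case 6:                           // Long, Double
                    skip(buf, 8);
                    index++;                              // they occupy two index slots
                    break;
                case 7: case 8: case 16: case 19: case 20: // Class, String, MethodType, Module, Package
                    skip(buf, 2);
                    break;
                case 15:                                  // MethodHandle
                    skip(buf, 3);
                    break;
                default:
                    throw new IllegalArgumentException("unknown tag " + tag);
            }
        }
        return tags;
    }

    // Hand-built sample: header, counter 4, then Utf8 "m", Utf8 "I", NameAndType #1:#2.
    static byte[] sample() {
        return new byte[] {
            (byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34,
            0x00, 0x04,
            0x01, 0x00, 0x01, 0x6D,
            0x01, 0x00, 0x01, 0x49,
            0x0C, 0x00, 0x01, 0x00, 0x02
        };
    }

    public static void main(String[] args) {
        System.out.println(readTags(sample()));           // [1, 1, 12]
    }
}
```

The sample has a capacity counter of 4 but only three entries, which is the same "count minus one" rule as the 43 versus 42 case above.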
Access flags
The u2 right after the constant pool holds the access flags. They record the access information of the current class or interface: whether it is public, whether it is abstract, whether it is final, and so on. There are 9 flags in total, as shown in the figure below.

The ACC_SUPER flag is special: it must be set for classes compiled after JDK 1.0.2, so the minimum value of the access flags is 0x0020. If several flags are set, how is that represented? The values of the flags that are set are added together (each flag is a distinct bit, so this is the same as a bitwise OR), and the sum is the value stored in the access flags. From that value we can easily work out which flags are set.
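Working backwards from a stored value to the individual flags is a plain bit test. The sketch below is illustrative only: the class name ClassAccessFlagsSketch is invented, and the nine masks are the class-level flag values from the JVM specification.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: decoding the class-level access flags from their u2 value.
class ClassAccessFlagsSketch {
    static final int[] MASKS = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000, 0x8000
    };
    static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE", "ACC_ABSTRACT",
        "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM", "ACC_MODULE"
    };

    static List<String> decode(int accessFlags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((accessFlags & MASKS[i]) != 0) {          // the flag's bit is set
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0021));   // [ACC_PUBLIC, ACC_SUPER]
        System.out.println(decode(0x0601));   // [ACC_PUBLIC, ACC_INTERFACE, ACC_ABSTRACT]
    }
}
```

For example, an ordinary public class stores 0x0001 + 0x0020 = 0x0021.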
Class index, parent class index, and interface index collection
These three items determine the fully qualified name of the current class, the fully qualified name of its parent class, and the fully qualified names of the interfaces it implements. At this point someone will surely ask: isn't this information already declared in the constant pool? It is, and these items are all symbolic references into it. As mentioned earlier, the constant pool is a resource warehouse, so the fully qualified names of the class, the parent class, and the interfaces all refer to entries in the constant pool. The class index and the parent class index are each stored as one u2. Java allows a class to implement multiple interfaces, so the interface index collection uses multiple u2 values, one per interface.
Field table collection
This is data with a collection structure. We have already introduced the definition of a collection: multiple unsigned numbers or tables of the same structure, plus a preceding capacity counter. The field table collection follows exactly this pattern: a capacity counter plus multiple field tables.
What is a field table?
The field table describes the variables declared in a class or interface. In this context, variables means class variables and instance variables, not local variables, so the field table never stores local variables. A field table has three main parts, each occupying a u2: the access flags, the simple name of the field (field name), and the field descriptor (which describes the field type). As shown in the figure below:

The figure above is the structure diagram of a field table. Both field tables and method tables may be followed by an attribute table collection. That collection is introduced in detail later; for now you only need to know that it supplements a field table or method table and does not exist on its own. The three parts of the field table each deserve a separate mention.
Access flags: these are very similar to the access flags of the class itself, and both indicate access modifiers. The access flags of the field table are listed below. From the corresponding u2 value we can easily work out the modifiers of the field that the field table describes.

Field simple name (name_index): name_index stores an index into the constant pool, and the constant at that index is the simple name of the field.
Field descriptor (descriptor_index): descriptor_index stores an index into the constant pool, and the constant at that index holds the identifier character(s) of the descriptor. The following shows how types map to identifier characters:

As shown above, if a field is declared with type byte, the field descriptor actually stored in the constant pool is B.
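The mapping can also be read in the other direction, from descriptor back to Java type. The following is a small sketch under the usual descriptor rules (one letter per primitive type, L...; for a class, one leading [ per array dimension); the class name DescriptorSketch and the method name toJavaType are invented for this example.

```java
// Illustrative sketch: turning a field descriptor back into a Java type name.
class DescriptorSketch {
    static String toJavaType(String descriptor) {
        int dims = 0;
        while (descriptor.charAt(dims) == '[') {          // each '[' is one array dimension
            dims++;
        }
        String element = descriptor.substring(dims);
        String base;
        switch (element.charAt(0)) {
            case 'B': base = "byte"; break;
            case 'C': base = "char"; break;
            case 'D': base = "double"; break;
            case 'F': base = "float"; break;
            case 'I': base = "int"; break;
            case 'J': base = "long"; break;
            case 'S': base = "short"; break;
            case 'Z': base = "boolean"; break;
            case 'L':                                     // Lfully/qualified/Name;
                base = element.substring(1, element.length() - 1).replace('/', '.');
                break;
            default:
                throw new IllegalArgumentException("bad descriptor: " + descriptor);
        }
        StringBuilder result = new StringBuilder(base);
        for (int i = 0; i < dims; i++) {
            result.append("[]");
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(toJavaType("B"));                    // byte
        System.out.println(toJavaType("Ljava/lang/String;"));   // java.lang.String
        System.out.println(toJavaType("[[I"));                  // int[][]
    }
}
```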
To summarize the field table collection: as the above shows, the structure of a field table is fairly involved, so here is an example that illustrates it more concretely. As shown below:

The figure above shows the information of one field table. fields_count is 1, meaning the capacity counter is 1 and there is only one field table. access_flags is 0x0002, which tells us the field is private. name_index stores an index into the constant pool, and in the second picture above we can see that the simple name of the field is m. descriptor_index also stores an index into the constant pool, and the second picture shows that this constant is I. Looking I up in the descriptor table tells us it stands for int, so this field table describes an instance variable private int m.
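The same three pieces of information are visible at run time through reflection. The sketch below is not a class file parser; it only shows that a field declared as private int m reports modifier value 0x0002, type int, and name m. The class name FieldInfoSketch is an assumption for this example.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Illustrative sketch: the field from the walkthrough, inspected through reflection.
class FieldInfoSketch {
    private int m;

    static String describe(String fieldName) {
        try {
            Field field = FieldInfoSketch.class.getDeclaredField(fieldName);
            int flags = field.getModifiers();
            return String.format("access_flags=0x%04X %s %s %s",
                    flags, Modifier.toString(flags), field.getType().getName(), field.getName());
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("m"));   // access_flags=0x0002 private int m
    }
}
```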
Method table collection
If you have mastered the field table collection, you have in effect mastered the method table collection too. The field table stores a fair amount of information and takes some time to learn, but once you have it, the method table is no problem, because the two are basically the same. The method table collection consists of a capacity counter and multiple method tables, each with its own attribute table collection. A method table has the same three parts as a field table: access flags, method name index, and method descriptor index. The difference is that almost every method has its own attribute table collection, because the code of the method is stored in the Code attribute table (methods without code, such as abstract or native methods, are rare).
The following are all the access flags in the method table:

I will not go through the three parts of the method table in detail again here, since they are no different from those of the field table.
Attribute table collection
The attribute table collection is the last item of the class file structure to introduce. It does not exist on its own; it is used together with field tables and method tables as a supplement to them. A lot of information is stored here. For example, the compiled code of an ordinary method is stored in the Code attribute table, and the exceptions declared by a method are stored in the Exceptions attribute table. There are many such attributes, and a compiler implementation may also define new attributes of its own. Information that is not stored in the structures introduced above is basically found in the attribute tables. All the attribute tables are listed below for reference. In my opinion it is not necessary to know what every attribute table does; we only need to know what kind of information is stored here.

2. Bytecode instruction introduction

If you read the part above carefully, I admire you, because it really is dry. And if you did get through it, don't worry: there is another part below, and it is relatively dry too.

What are bytecode instructions?
A bytecode instruction follows the general shape of a machine instruction and consists of two parts: ① an opcode and ② operands. The opcode is a one-byte number with a predefined meaning that stands for a particular operation. The operands usually follow the opcode, and there can be zero or more of them, much like the input parameters of a method. Together they form a bytecode instruction. The JVM uses an architecture built around the operand stack, so most instructions consist of the opcode alone: their operands are taken from the operand stack, and the opcode by itself completes one virtual machine instruction.
Types of opcodes
Every program we write is ultimately carried out by opcodes. Think of what our programs usually do: object creation, type conversion, addition, subtraction, multiplication, and division, method calls, exception handling, method synchronization, code block synchronization, and so on. All of these operations are carried out by opcodes working together with operands. In addition, opcode names are tied to the data type wherever possible, so the name tells you what the opcode does. For example, the opcodes for addition, subtraction, multiplication, and division on the int type are iadd, isub, imul, and idiv, so we know at a glance that they operate on int. Here are several commonly used opcodes.
Object creation related instructions
Ordinary objects: new
Array objects: newarray, anewarray, multianewarray
Instructions that set and read variable values: getfield, putfield, getstatic (class variable), putstatic (class variable)
Instructions that store a value from the operand stack into an array: bastore, castore, sastore, iastore, fastore, ...
Instruction that gets the array length: arraylength
As long as a Java feature is not implemented as syntactic sugar, there is a corresponding instruction for it. There are a lot of these instructions and I don't need to post them all; in my opinion it is enough to master the common ones. A few more common instructions are listed below.
Instructions related to data operations
Add: iadd, fadd, ladd, dadd
Subtract: isub, fsub, lsub, dsub
Multiply: imul, fmul, lmul, dmul
Divide: idiv, fdiv, ldiv, ddiv
Remainder: irem, frem, lrem, drem
Negate: ineg, fneg, lneg, dneg
byte, short, boolean, and char all use the int instructions; none of the four has arithmetic instructions of its own. On the operand stack and in the local variable table they are held as int, and only when stored in fields and arrays do they keep their own type.
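The effect of "small types use the int instructions" is visible directly in Java source. The following sketch (class and method names are invented for illustration) shows that adding two byte values produces an int, which must be narrowed explicitly to get a byte back.

```java
// Illustrative sketch: byte arithmetic is carried out as int arithmetic.
class SmallTypeArithmeticSketch {
    static byte addBytes(byte a, byte b) {
        // a + b has type int, so a cast is required to store it in a byte.
        return (byte) (a + b);
    }

    static String sumType(byte a, byte b) {
        // Boxing the sum shows its static type: Integer, not Byte.
        Object sum = a + b;
        return sum.getClass().getSimpleName();
    }

    public static void main(String[] args) {
        System.out.println(addBytes((byte) 100, (byte) 27));    // 127
        System.out.println(addBytes((byte) 100, (byte) 100));   // -56, since 200 does not fit in a byte
        System.out.println(sumType((byte) 1, (byte) 2));        // Integer
    }
}
```

Removing the cast in addBytes makes the class fail to compile, because the expression a + b is an int.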
Method call and method return instructions
These are relatively important. Resolution and dispatch in Java are closely tied to the method call instructions, so this part is worth remembering. The instructions are described in detail below (resolution and dispatch involve too much to explain here).
① invokevirtual instruction: used to call an instance method of an object, dispatching on the actual type of the object (virtual method dispatch).
To test this instruction, we wrote the following code:
```
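public class TestSuper {

	// Explicit no-arg constructor; the compiler inserts the super() call.
	public TestSuper() {
	}

	// Instance method that main calls on a TestSuper object.
	public void test() {
	}

	public static void main(String[] args) {
		TestSuper testSuper = new TestSuper();
		testSuper.test();
	}

}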
```
According to the description of the invokevirtual instruction, it should be used when the line testSuper.test(); is executed. We then use the javap command to look at the bytecode of the class, as shown in the following figure:

From the picture above, it is easy to see that the invokevirtual instruction is used when calling the test method.
② invokeinterface instruction: used to call an interface method. At run time it searches the object that implements the interface and finds the appropriate method to call. It can be verified the same way as above, so the work is not repeated here.
③ invokespecial instruction: used to call instance methods that need special handling, such as constructors, private methods, superclass constructors, and superclass methods. This is also easy to verify, but let's make a small digression here. We all know that this and super are keywords in Java. The this keyword is implemented by implicit parameter passing, so what about super? I won't verify here that this is an implicit parameter, nor that super is not one; if you want to dig into it, check the relevant material or message me privately. For now I'll just state that super is not implemented by implicit parameter passing. Look again at the code used for the invokevirtual instruction, which has a constructor, and at the bytecode corresponding to that constructor:

The line marked in red in the figure is the underlying implementation of the super keyword. From this instruction we can see that it calls the no-arg constructor of Object. After compilation, super becomes an invokespecial instruction that calls the parent class constructor, carrying whatever arguments are needed. That is how super is implemented; it is not the case, as some people say, that both this and super are passed implicitly.
④ invokestatic instruction: as the name suggests, this instruction is used specifically to call class (static) methods. The verification is not repeated here.
⑤ invokedynamic instruction: used to resolve, dynamically at run time, the method referenced by the call site qualifier and then execute it. The lambda expressions we use in JDK 8 are implemented with this instruction.
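The five instructions can be lined up against ordinary source code. In the sketch below, the comments say which invoke instruction javac typically emits for each call; the class and method names are invented for illustration, and the exact output can vary with the compiler version (for example, newer javac releases may compile string concatenation itself through invokedynamic, and may call private methods with invokevirtual). Running javap -c on the compiled class is the way to confirm it.

```java
import java.util.function.Supplier;

// Illustrative sketch: source constructs and the invoke instruction each typically compiles to.
class InvokeKindsSketch {
    interface Greeter {
        String greet();
    }

    static class Base implements Greeter {
        public String greet() { return "base"; }
    }

    static class Child extends Base {
        Child() {
            super();                                   // invokespecial Base.<init>
        }

        @Override
        public String greet() {
            return "child+" + super.greet();           // invokespecial Base.greet
        }
    }

    static String helper() { return "static"; }

    static String run() {
        Base viaClass = new Child();                   // new, then invokespecial Child.<init>
        Greeter viaInterface = viaClass;
        Supplier<String> lambda = () -> "lambda";      // invokedynamic creates the Supplier
        return viaClass.greet()                        // invokevirtual Base.greet, dispatched to Child
                + " " + viaInterface.greet()           // invokeinterface Greeter.greet
                + " " + helper()                       // invokestatic
                + " " + lambda.get();                  // invokeinterface Supplier.get
    }

    public static void main(String[] args) {
        System.out.println(run());                     // child+base child+base static lambda
    }
}
```

Both greet() calls print the same text because dispatch is on the actual type of the object (Child), whichever instruction starts the call.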
Synchronization instructions
We often use synchronization when writing code, most commonly through the synchronized keyword. The virtual machine provides instructions that correspond to this keyword, but method synchronization does not rely on instructions. Instead, the method table has an access flag, ACC_SYNCHRONIZED, that indicates whether the method is synchronized. If it is, the thread executing the method must hold a lock, which is released when the method finishes, giving other threads a chance to acquire it. Code block synchronization, however, does need bytecode instructions: the virtual machine provides two instructions, monitorenter and monitorexit, to support the synchronized keyword.
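The difference between the two forms can be observed without reading bytecode, because the method-level flag is exposed through reflection. The sketch below uses invented names (SyncSketch, hasSyncFlag); the comments about monitorenter and monitorexit describe what the compiler is expected to emit and can be confirmed with javap -c.

```java
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

// Illustrative sketch: a synchronized method versus a synchronized block.
class SyncSketch {
    private int counter;
    private final Object lock = new Object();

    // Marked with ACC_SYNCHRONIZED in the method table; no monitor instructions in the body.
    synchronized void incrementMethod() {
        counter++;
    }

    // No method-level flag; the block is expected to compile to monitorenter/monitorexit.
    void incrementBlock() {
        synchronized (lock) {
            counter++;
        }
    }

    static boolean hasSyncFlag(String methodName) {
        try {
            Method method = SyncSketch.class.getDeclaredMethod(methodName);
            return Modifier.isSynchronized(method.getModifiers());
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasSyncFlag("incrementMethod"));   // true
        System.out.println(hasSyncFlag("incrementBlock"));    // false
    }
}
```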

3. Full text summary

This article first introduced the structure of the class file, which tells us how and where each piece of information is stored after the compiler has compiled our code. It then introduced some commonly used bytecode instructions. Bytecode instructions mostly correspond to what is inside method bodies, because methods are where we actually implement operations, so most bytecode instructions appear in the Code attribute table of a method table in the class file. With this we can understand what the underlying implementation of a method looks like, for example how the various method calls are carried out and how addition, subtraction, multiplication, and division are implemented. Here is a simple example showing how the bytecode instructions carry out the operations in a method. The code is as follows:

```
public class TestByteCode {
    public static void main(String[] args) {
        int a = 1;
        int b = 2;
        int c = a + b;
        System.out.println(c);
    }
}
```

Let's look at the method table information of the main method, as follows:
iconst_1 pushes the int constant 1 onto the operand stack, and istore_1 pops it and stores it in slot 1 of the method's local variable table, that is, in the local variable a.
iconst_2 pushes the int constant 2 onto the operand stack, and istore_2 pops it and stores it in slot 2 of the local variable table, that is, in the local variable b.
iload_1 and iload_2 load a and b onto the operand stack to prepare for the calculation. iadd adds these two int values, and istore_3 stores the result, 3, into c in the local variable table. After the println call the method returns (LineNumberTable is attribute information and has nothing to do with this).
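To make the walkthrough concrete, here is a toy simulation of those instructions with an operand stack and a local variable table. It is a teaching sketch, not how the JVM is implemented: the class name StackMachineSketch is invented, instructions are given as strings, and only the four instruction families from this example are supported.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch: a toy operand stack plus local variable table.
class StackMachineSketch {
    static int[] run(String... instructions) {
        Deque<Integer> stack = new ArrayDeque<>();      // operand stack
        int[] locals = new int[4];                      // slot 0 would hold args in main
        for (String instruction : instructions) {
            String[] parts = instruction.split("_");    // e.g. "istore_1" -> ["istore", "1"]
            switch (parts[0]) {
                case "iconst":                          // push a constant
                    stack.push(Integer.parseInt(parts[1]));
                    break;
                case "istore":                          // pop into a local variable slot
                    locals[Integer.parseInt(parts[1])] = stack.pop();
                    break;
                case "iload":                           // push a local variable
                    stack.push(locals[Integer.parseInt(parts[1])]);
                    break;
                case "iadd":                            // pop two ints, push their sum
                    stack.push(stack.pop() + stack.pop());
                    break;
                default:
                    throw new IllegalArgumentException("unsupported: " + instruction);
            }
        }
        return locals;
    }

    public static void main(String[] args) {
        int[] locals = run("iconst_1", "istore_1", "iconst_2", "istore_2",
                "iload_1", "iload_2", "iadd", "istore_3");
        System.out.println("a=" + locals[1] + " b=" + locals[2] + " c=" + locals[3]);   // a=1 b=2 c=3
    }
}
```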