Bytecode-level analysis of the class file structure

In this lesson, we analyze the class file structure at the bytecode level. Let's start with an interview question:

Is there a limit to the length of a String in Java?

In everyday development we often use String to declare strings, such as String str = "abc", but you may never have wondered whether the string constant after the equals sign has a length limit. To answer this question thoroughly, you need to understand today's topic: class files.

The ins and outs of class

Java can achieve "compile once, run everywhere", and the class file deserves most of the credit. To give the Java language good cross-platform capabilities, Java provides an intermediate code that can be used on all platforms: bytecode class files (.class files). With bytecode, no matter what the platform is (Mac, Windows, Linux, and so on), as long as a virtual machine is installed, the bytecode can be run directly.

Bytecode also decouples the Java virtual machine from the Java language. This may not be obvious at first, so what does this decoupling refer to?

In fact, the Java virtual machine was not designed only to run the Java language. Today it already supports many languages other than Java, such as Groovy, JRuby, Jython, and Scala. These languages can be supported because they too compile to bytecode files that the JVM can parse and execute. The virtual machine does not care which language the bytecode was compiled from. As shown below:

Look at the class file from God's perspective

Seen as a whole, a class file contains only two kinds of data structure: unsigned numbers and tables.

  • Unsigned number: the basic data type. u1, u2, u4, and u8 represent unsigned numbers of 1 byte, 2 bytes, 4 bytes, and 8 bytes respectively. Unsigned numbers can be used to describe numbers, index references, quantity values, or strings (UTF-8 encoded).

  • Table: a composite data type composed of multiple unsigned numbers or other tables as data items. By convention, the names of all tables in the class file end with "_info". In fact, the entire class file is essentially one table.

The relationship between the two can be represented by the following picture:

A table can contain unsigned numbers as well as other tables. In pseudocode it could look like this:

// Unsigned numbers
u1 = byte[1];
u2 = byte[2];
u4 = byte[4];
u8 = byte[8];

// Table
class_table {
    // A table can contain unsigned numbers
    u1 tag;
    u2 index2;
    ...
    // A table can also contain other tables
    method_table mt;
    ...
}
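As a minimal sketch of how the unsigned types above can be read from raw bytes, the following Java helper decodes u1, u2, and u4 values. The class name UnsignedNumbers and its method names are made up for this illustration; the only fact relied on is that class files store multi-byte values most significant byte first.

```java
class UnsignedNumbers {
    // u1: one byte; the mask keeps values 128..255 positive
    static int u1(byte[] b, int off) {
        return b[off] & 0xFF;
    }

    // u2: two bytes, most significant byte first (class files are big-endian)
    static int u2(byte[] b, int off) {
        return (u1(b, off) << 8) | u1(b, off + 1);
    }

    // u4: four bytes; returned as long so the value stays unsigned
    static long u4(byte[] b, int off) {
        return ((long) u2(b, off) << 16) | u2(b, off + 2);
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x34};
        System.out.println(Long.toHexString(u4(data, 0))); // cafebabe
        System.out.println(u2(data, 4));                   // 52
    }
}
```

Java has no unsigned primitive types apart from char, which is why each helper widens the result to the next larger signed type.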

class file structure

As we just said, a class file contains only two kinds of data structure: unsigned numbers and tables. These unsigned numbers and tables make up the various structures in the class file, which are packed one after another in a predetermined order with no gaps between adjacent items. As shown below:

When the JVM loads a class file, it parses the file according to the structure in the figure above, loads it into memory, and allocates the corresponding space. For how much space each structure occupies, refer to the following figure:

At this point the relationship between unsigned numbers, tables, and the structures above may seem confusing. A simple analogy helps: the human body is composed of H, O, C, N, and other elements, and these elements form the organs of the body according to certain rules. The unsigned numbers and tables in the class file correspond to the elements, and the structures in the class structure diagram correspond to the organs. The order of these organs is also strictly fixed; after all, eyes cannot grow on the buttocks.

Case Analysis

With these concepts clear, let's look at the details of the structures above through a Java code example. First write a simple Java source file, Test.java, as follows:

import java.io.Serializable;

public class Test implements Serializable, Cloneable {

    private int num = 1;

    public int add(int i) {
        int j = 10;
        num = num + i;
        return num;
    }
}

Compile it with javac to generate the bytecode file Test.class. Then open the class file in a hexadecimal editor; the displayed content is as follows:

The picture above shows hexadecimal digits, and every two digits represent one byte. At first glance there is no pattern, but from the JVM's point of view these bytes are arranged according to strict rules. Next, let's look at how the JVM parses them step by step.

Magic number

As shown in the figure above, the four bytes at the beginning of the class file are its magic number, a fixed value: 0xCAFEBABE. The magic number is the mark of a class file, that is, the criterion for deciding whether a file is in class format. If the first four bytes are not 0xCAFEBABE, the file is not a class file and cannot be recognized or loaded by the JVM.
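The magic number check can be sketched in a few lines of Java. The class and method names here are hypothetical; the sketch relies only on java.nio.ByteBuffer reading big-endian integers by default.

```java
import java.nio.ByteBuffer;

class MagicNumber {
    static final int MAGIC = 0xCAFEBABE;

    // True if the first four bytes are the class file magic number
    static boolean isClassFile(byte[] bytes) {
        if (bytes.length < 4) {
            return false; // too short to contain a magic number
        }
        return ByteBuffer.wrap(bytes).getInt() == MAGIC;
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE};
        byte[] zipHeader = {0x50, 0x4B, 0x03, 0x04};
        System.out.println(isClassFile(classHeader)); // true
        System.out.println(isClassFile(zipHeader));   // false
    }
}
```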

Version number

The four bytes following the magic number are the version number of the class file. The first two bytes, 0000, are the minor version number (minor_version), and the last two bytes, 0034, are the major version number (major_version), which is 52 in decimal. In other words, the major version of this class file is 52 and the minor version is 0, so the full version number is 52.0, which corresponds to JDK 1.8.
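The following sketch reads the two version fields from the first eight bytes of a class file. The names are hypothetical, and the major - 44 mapping is a rule of thumb that holds for Java 5 (major 49) and later, not something taken from the figure.

```java
import java.nio.ByteBuffer;

class ClassVersion {
    // minor_version: bytes 4..5, right after the 4-byte magic number
    static int minor(byte[] header) {
        return ByteBuffer.wrap(header).getShort(4) & 0xFFFF;
    }

    // major_version: bytes 6..7
    static int major(byte[] header) {
        return ByteBuffer.wrap(header).getShort(6) & 0xFFFF;
    }

    // Rule of thumb: major 49 is Java 5, 52 is Java 8, and so on
    static int javaRelease(int major) {
        return major - 44;
    }

    public static void main(String[] args) {
        byte[] header = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00, 0x00, 0x00, 0x34};
        System.out.println(major(header) + "." + minor(header)); // 52.0
        System.out.println("Java " + javaRelease(major(header))); // Java 8
    }
}
```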

Constant pool (emphasis)

Immediately following the version number is a table called the constant pool (cp_info). The constant pool stores all kinds of information about the class, such as the class name, the parent class name, method names, parameter names, and parameter types, all in the form of tables.

Each item in the constant pool is a table, and there are 14 table types, as shown in the following table:

Every item in the constant pool starts with a u1 tag value. The tag identifies the table type: when the JVM parses the class file, it uses this value to determine which table the current data structure is. Each of the 14 table types has its own structure, so I will not introduce them one by one. Instead I will use CONSTANT_Class_info and CONSTANT_Utf8_info as examples, because the other tables are similar.

First, the specific structure of the CONSTANT_Class_info table is as follows:

table CONSTANT_Class_info {
    u1  tag = 7;
    u2  name_index;
}

Explanation:

  • tag: Occupies one byte. A value of 7 means that this is a CONSTANT_Class_info table.

  • name_index: An index value, which can be understood as a pointer to the constant whose index in the constant pool is name_index. For example, name_index = 2 points to the second constant in the constant pool.

Next, look at the specific structure of the CONSTANT_Utf8_info table as follows:

table CONSTANT_Utf8_info {
    u1  tag;
    u2  length;
    u1[] bytes;
}

Explanation:

  • tag: The value is 1, which means this is a CONSTANT_Utf8_info table.

  • length: The length of the u1[] array. For example, length = 5 means that the next 5 bytes are the data.

  • bytes: An array of u1 values whose length is given by the length field above.

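A String literal declared in Java code is ultimately stored in the class file as a CONSTANT_Utf8_info table. Because length is a u2, the bytes array can hold at most 65535 bytes, the largest value a u2 can represent. In practice javac rejects a string literal whose length is 65535 or more with a "constant string too long" error, so the longest literal that compiles is 65534 characters, and fewer if the text contains characters that need more than one byte in the class file's UTF-8 encoding. This limit applies only to string constants stored in the class file; a String built at run time is limited by the int length of its backing array instead. Refer to  Java String maximum length analysis .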
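The u2 length prefix can be observed directly from Java. DataOutputStream.writeUTF writes a 2-byte length followed by modified UTF-8 bytes, the same layout as the length and bytes fields described above, and throws UTFDataFormatException when the encoded form exceeds 65535 bytes. The sketch below uses that behaviour as a stand-in for the class file writer; the class and method names are hypothetical, and it does not model javac's own stricter check.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UTFDataFormatException;

class Utf8Limit {
    static final int U2_MAX = 0xFFFF; // 65535, the largest value a u2 can hold

    // True if s can be stored behind a u2 length prefix
    static boolean fitsInU2Length(String s) {
        try (DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream())) {
            out.writeUTF(s); // 2-byte length, then modified UTF-8 bytes
            return true;
        } catch (UTFDataFormatException tooLong) {
            return false; // encoded form needs more than 65535 bytes
        } catch (IOException e) {
            throw new IllegalStateException(e); // not expected for an in-memory stream
        }
    }

    public static void main(String[] args) {
        String max = new String(new char[U2_MAX]).replace('\0', 'a');
        System.out.println(fitsInU2Length(max));       // true
        System.out.println(fitsInU2Length(max + "a")); // false
    }
}
```

Note that the limit counts encoded bytes, not characters: a string of 30000 Chinese characters needs about 90000 bytes and does not fit either.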

As you can see, the tables inside the constant pool also reference one another. The following figure illustrates the relationship between the CONSTANT_Class_info and CONSTANT_Utf8_info tables:

Now that we understand the data structures inside the constant pool, let's walk through the parsing of the example code. Developers define all kinds of Java classes with different methods and parameters, so the number of elements in the constant pool is not fixed. The class file therefore places a 2-byte counter in front of the constant pool to record its size for the current class. As shown below:

The value 001d in the red box is 29 in decimal, so the constant pool counter is 29. The constant at index 0 is reserved by the JVM for special purposes, so the actual number of constants in Test.class is the counter minus 1, which is 28.
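The counter and the tag-based dispatch can be sketched as a small constant pool walker. This is an illustrative sketch, not how any particular JVM is implemented: the class name is hypothetical, it only knows the 14 tag values used up to Java 8, and it only reports each entry's index and tag. The payload sizes come from the JVM specification.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class ConstantPoolWalker {
    // Returns one "index: tag N" string per constant pool entry
    static List<String> walk(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes);
        buf.position(8);                              // skip magic (4) and version (2 + 2)
        int count = buf.getShort() & 0xFFFF;          // constant pool counter
        List<String> entries = new ArrayList<>();
        for (int index = 1; index < count; index++) { // index 0 is reserved
            int tag = buf.get() & 0xFF;
            int size;                                 // payload bytes after the tag
            switch (tag) {
                case 1: size = buf.getShort() & 0xFFFF; break; // Utf8: u2 length, then bytes
                case 7: case 8: case 16: size = 2; break;      // Class, String, MethodType
                case 15: size = 3; break;                      // MethodHandle
                case 3: case 4: case 9: case 10: case 11: case 12: case 18:
                    size = 4; break;                           // Integer, Float, the refs, NameAndType, InvokeDynamic
                case 5: case 6: size = 8; break;               // Long, Double
                default: throw new IllegalArgumentException("unknown tag " + tag);
            }
            buf.position(buf.position() + size);
            entries.add(index + ": tag " + tag);
            if (tag == 5 || tag == 6) {
                index++;                              // Long and Double occupy two slots
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        // Hand-written header with a counter of 3: a Class entry and the Utf8 entry "Test"
        byte[] bytes = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 0x34,
                0, 3, 7, 0, 2, 1, 0, 4, 'T', 'e', 's', 't'};
        System.out.println(walk(bytes)); // [1: tag 7, 2: tag 1]
    }
}
```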

The first constant, as follows:

After converting 0a into decimal, it is 10. By looking at the 14 tables of the constant pool, it can be found that the table type with tag=10 is CONSTANT_Methodref_info, so the first constant type in the constant pool is the method reference table. Its structure is as follows:

CONSTANT_Methodref_info {
    u1 tag = 10;
    u2 class_index;        // points to the class that owns this method
    u2 name_type_index;    // points to the name and type of this method
}

That is to say, the 2 bytes after "0a" point to the class this method belongs to, and the next 2 bytes point to the name and type of this method. Their values are:

  • 0006: Decimal 6, means pointing to the sixth constant in the constant pool.

  • 0015: Decimal 21, indicating that it points to the 21st constant in the constant pool.
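The five bytes of this first constant (0a 0006 0015) can be decoded with a short sketch. The class name MethodrefDecoder is hypothetical, and the result is returned as a plain int array to keep the example small.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class MethodrefDecoder {
    // Decodes a 5-byte CONSTANT_Methodref_info into {tag, class_index, name_type_index}
    static int[] decode(byte[] entry) {
        ByteBuffer buf = ByteBuffer.wrap(entry);
        int tag = buf.get() & 0xFF;
        if (tag != 10) {
            throw new IllegalArgumentException("not a Methodref, tag = " + tag);
        }
        int classIndex = buf.getShort() & 0xFFFF;    // class that owns the method
        int nameTypeIndex = buf.getShort() & 0xFFFF; // name and type of the method
        return new int[] {tag, classIndex, nameTypeIndex};
    }

    public static void main(String[] args) {
        byte[] firstConstant = {0x0a, 0x00, 0x06, 0x00, 0x15};
        System.out.println(Arrays.toString(decode(firstConstant))); // [10, 6, 21]
    }
}
```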

At this point, the interpretation of the first constant is completed. Immediately following is the second constant, as follows:

The tag 09 denotes the field reference table CONSTANT_Fieldref_info, whose structure is as follows:

CONSTANT_Fieldref_info {
    u1 tag;
    u2 class_index;        // points to the class that owns this field
    u2 name_type_index;    // points to the name and type of this field
}

Here too the tag is followed by 4 bytes holding two indexes:

  • 0005: Point to the fifth constant in the constant pool.

  • 0016: Point to the 22nd constant in the constant pool.

So far we have parsed two constants in the constant pool. The remaining 26 constants are parsed in the same way, so we won't analyze them one by one here. In fact, we can use the javap command to view the contents of the class constant pool:

javap -v Test.class

After the above command is executed, the displayed results are as follows:

As we just analyzed, the first constant in the constant pool is of type Methodref, pointing to constants with subscript 6 and subscript 21. The constant type of subscript 21 is NameAndType, and its corresponding data structure is as follows:

CONSTANT_NameAndType_info {
    u1 tag;
    u2 name_index;    // points to the name string of a field or method
    u2 type_index;    // points to the type string of a field or method
}

The name_index and type_index of the NameAndType entry at index 21 point to constants 13 and 14, that is, "<init>" and "()V". The resolution of the first constant in the constant pool and its final value are therefore as shown in the following figure:

Following the references layer by layer, we can see that the first constant in the constant pool of Test.class refers to the default (no-argument) constructor of Object.
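The layer-by-layer lookup can be sketched with a tiny in-memory constant pool. The indexes 1, 6, 13, 14, and 21 come from the text above; index 24 for the Utf8 entry holding the class name is an assumption made for this sketch, since the text does not state it. The class name and the Map-based pool representation are also hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class MethodrefResolver {
    // Utf8 entries are stored as String, every other entry as the int[] of indexes it holds
    static String utf8(Map<Integer, Object> pool, int index) {
        return (String) pool.get(index);
    }

    static String resolveMethodref(Map<Integer, Object> pool, int index) {
        int[] methodref = (int[]) pool.get(index);          // {class_index, name_type_index}
        int[] classInfo = (int[]) pool.get(methodref[0]);   // {name_index}
        int[] nameAndType = (int[]) pool.get(methodref[1]); // {name_index, type_index}
        return utf8(pool, classInfo[0]) + "."
                + utf8(pool, nameAndType[0]) + ":" + utf8(pool, nameAndType[1]);
    }

    static Map<Integer, Object> samplePool() {
        Map<Integer, Object> pool = new HashMap<>();
        pool.put(1, new int[] {6, 21});   // Methodref
        pool.put(6, new int[] {24});      // Class; name_index 24 is assumed, not from the text
        pool.put(13, "<init>");           // Utf8
        pool.put(14, "()V");              // Utf8
        pool.put(21, new int[] {13, 14}); // NameAndType
        pool.put(24, "java/lang/Object"); // Utf8
        return pool;
    }

    public static void main(String[] args) {
        System.out.println(resolveMethodref(samplePool(), 1)); // java/lang/Object.<init>:()V
    }
}
```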

Access flags (access_flags)

The two bytes immediately after the constant pool are the access flags, as shown in the following figure:

The access flag represents the access information of the class or interface, such as: whether the class file is a class or an interface, whether it is defined as public, whether it is abstract, if it is a class, whether it is declared as final, etc. The various access flags are as follows:

The Test class we defined is an ordinary Java class, not an interface, enumeration, or annotation. It is declared public but is neither final nor abstract, so its access_flags value is 0x0021 (the combination of 0x0001, ACC_PUBLIC, and 0x0020, ACC_SUPER).
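Since access_flags is a bit mask, it can be decoded by testing each bit in turn. The sketch below uses the class-level flag values from the JVM specification; the class name ClassAccessFlags is made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class ClassAccessFlags {
    private static final int[] MASKS = {
        0x0001, 0x0010, 0x0020, 0x0200, 0x0400, 0x1000, 0x2000, 0x4000};
    private static final String[] NAMES = {
        "ACC_PUBLIC", "ACC_FINAL", "ACC_SUPER", "ACC_INTERFACE",
        "ACC_ABSTRACT", "ACC_SYNTHETIC", "ACC_ANNOTATION", "ACC_ENUM"};

    // Lists the names of all flags whose bit is set in the given value
    static List<String> decode(int flags) {
        List<String> set = new ArrayList<>();
        for (int i = 0; i < MASKS.length; i++) {
            if ((flags & MASKS[i]) != 0) {
                set.add(NAMES[i]);
            }
        }
        return set;
    }

    public static void main(String[] args) {
        System.out.println(decode(0x0021)); // [ACC_PUBLIC, ACC_SUPER]
    }
}
```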

Class index, parent class index, and interface counter

The 2 bytes after the access flags are the class index, the 2 bytes after the class index are the parent class index, and the 2 bytes after the parent class index are the interface counter. As shown below:

It can be seen that the class index points to the fifth constant in the constant pool, the parent class index points to the sixth constant in the constant pool, and the number of implemented interfaces is 2. Review the data in the constant pool again:

It can be seen from the figure that the fifth and sixth constants are both CONSTANT_Class_info table types, and the classes they represent are "Test" and "Object" respectively. Look at the interface counter again, because the value of the interface counter is 2, which means that this class implements 2 interfaces. Check that the 4 bytes after the interface counter are:

  • 0007: Point to the 7th constant in the constant pool, it can be seen from the figure that the value of the 7th constant is "Serializable".

  • 0008: Point to the 8th constant in the constant pool, it can be seen from the figure that the value of the 8th constant is "Cloneable".

To sum up, the following conclusions can be drawn: the current class is Test, which inherits from the Object class and implements the two interfaces "Serializable" and "Cloneable".
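The ten bytes discussed in this section (0005 0006 0002 0007 0008) can be read with the following sketch. The class name and the flat int array result are choices made for the example only.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class ClassIndexes {
    // Returns {class index, parent class index, interface indexes...}
    static int[] read(byte[] bytes) {
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        int thisClass = buf.getShort() & 0xFFFF;
        int superClass = buf.getShort() & 0xFFFF;
        int interfaceCount = buf.getShort() & 0xFFFF; // interface counter
        int[] result = new int[2 + interfaceCount];
        result[0] = thisClass;
        result[1] = superClass;
        for (int i = 0; i < interfaceCount; i++) {
            result[2 + i] = buf.getShort() & 0xFFFF;  // one u2 index per interface
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] bytes = {0, 5, 0, 6, 0, 2, 0, 7, 0, 8};
        System.out.println(Arrays.toString(read(bytes))); // [5, 6, 7, 8]
    }
}
```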

field table

Immediately after the interface index set is the field table. The main function of the field table is to describe the variables declared in the class or interface. The fields here include class-level variables and instance variables, but do not include local variables declared inside methods.

Similarly, the number of variables in a class is not fixed, so a counter is used to represent the number of variables before the field table collection, as shown below:

0002 means that 2 variables (called fields in the class file) are declared in the class, and the field counter will be followed by the data structure of the 2 field tables.

The specific structure of the field table is as follows:

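field_info {
    u2  access_flags;        // access flags of the field
    u2  name_index;          // index of the field name (the variable name)
    u2  descriptor_index;    // index of the field descriptor (the variable type)
    u2  attributes_count;    // attribute counter
    attribute_info attributes;
}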

Let's continue by parsing the field table in Test.class. Its structure is shown in the following figure:

field access flag

Variables in Java classes can also carry modifiers such as public, private, final, and static. Therefore, when parsing a field, you first need to check its access flags, which are as follows:

The access flag value in the field table structure diagram is 0002, which means the field is private. The name index points to the ninth constant in the constant pool, and the descriptor index points to the tenth constant. The 9th and 10th constants are "num" and "I" respectively, as follows:

Therefore, we can know that there is a variable named num in the class whose type is int. The same is true for the analysis process of the second variable, so I won't introduce it too much.
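The descriptor "I" is one of a small set of type codes defined by the JVM specification. A minimal sketch of turning a field descriptor back into a Java type name could look like this; the class and method names are hypothetical.

```java
class FieldDescriptor {
    // Converts a field descriptor such as "I" or "[Ljava/lang/String;" to a Java type name
    static String toJavaType(String d) {
        switch (d.charAt(0)) {
            case 'B': return "byte";
            case 'C': return "char";
            case 'D': return "double";
            case 'F': return "float";
            case 'I': return "int";
            case 'J': return "long";
            case 'S': return "short";
            case 'Z': return "boolean";
            case 'L': // object type: L, the internal class name, then ';'
                return d.substring(1, d.length() - 1).replace('/', '.');
            case '[': // each leading '[' adds one array dimension
                return toJavaType(d.substring(1)) + "[]";
            default:
                throw new IllegalArgumentException("bad descriptor: " + d);
        }
    }

    public static void main(String[] args) {
        System.out.println(toJavaType("I"));                  // int
        System.out.println(toJavaType("Ljava/lang/String;")); // java.lang.String
    }
}
```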

Precautions:

  1. Fields inherited from parent classes or parent interfaces will not be listed in the field table collection.

  2. In order to maintain access to the outer class in the inner class, a field pointing to the outer class instance will be automatically added.

For the above two cases, it is recommended that you define a class to view and manually analyze it.
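One quick way to try both cases without a hexadecimal editor is reflection: Class.getDeclaredFields returns only the fields declared by the class itself, not inherited ones. The sketch below is a hypothetical demo class. It assumes compilation with javac, which keeps the synthetic outer-instance field (conventionally named this$0) when the inner class actually uses the outer instance, as Inner does here.

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

class DeclaredFieldsDemo {
    static class Parent { int inherited; }
    static class Child extends Parent { int own; }

    int outerValue = 1;

    class Inner {
        int read() { return outerValue; } // uses the enclosing instance
    }

    // Names of the fields declared by the class itself, in no guaranteed order
    static List<String> declaredFieldNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Field f : type.getDeclaredFields()) {
            names.add(f.getName() + (f.isSynthetic() ? " (synthetic)" : ""));
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(declaredFieldNames(Child.class)); // contains own but not inherited
        System.out.println(declaredFieldNames(Inner.class)); // the compiler-added outer reference
    }
}
```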

method table

The field table is followed by the method table. As you can probably guess, the method table also starts with a counter, because the number of methods in a class is not fixed, as shown in the figure:

The figure above shows that there are two methods in Test.class, but we declared only one method, add, in Test.java. Why? Because the default constructor is also included in the method table.

The structure of the method table is as follows:

method_info {
    u2  access_flags;        // access flags of the method
    u2  name_index;          // index of the method name
    u2  descriptor_index;    // index of the method descriptor (its type)
    u2  attributes_count;    // method attribute counter
    attribute_info attributes;
}

As you can see, the method also has its own access flag, as follows:

Let's mainly look at the add method, as follows:

From the figure, we can read the following field values for the add method:

  1. access_flags  = 0x0001  means the access permission is public.

  2. name_index  = 0x0011  points to the 17th constant in the constant pool, which is "add".

  3. descriptor_index  = 0x0012  points to the 18th constant in the constant pool, which is "(I)I": the method takes one int parameter and returns an int.
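A method descriptor lists the parameter types in parentheses, followed by the return type, so add(int) returning int is written (I)I. The following sketch splits a descriptor into its parts; the class name is hypothetical and the input is assumed to be a well-formed descriptor.

```java
import java.util.ArrayList;
import java.util.List;

class MethodDescriptor {
    // Splits "(I)I" into the parameter descriptors, with the return descriptor last
    static List<String> split(String descriptor) {
        List<String> parts = new ArrayList<>();
        int i = 1; // skip '('
        while (descriptor.charAt(i) != ')') {
            int start = i;
            while (descriptor.charAt(i) == '[') {
                i++; // array dimensions
            }
            if (descriptor.charAt(i) == 'L') {
                i = descriptor.indexOf(';', i); // object type ends at ';'
            }
            i++;
            parts.add(descriptor.substring(start, i));
        }
        parts.add(descriptor.substring(i + 1)); // return type follows ')'
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("(I)I")); // [I, I]
        System.out.println(split("()V"));  // [V]
    }
}
```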

attribute table

When parsing fields and methods earlier, we saw a table called attribute_info in their structures. This is the attribute table.

The attribute table does not have a fixed structure, and various attributes only need to satisfy the following structure:

attribute_info {
    u2 name_index;
    u4 attribute_length;
    u1[] info;
}
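Because every attribute starts with the same header, a parser can skip attributes it does not understand. Note that in the JVM specification the length field is a u4, not a u2. The sketch below reads one header and jumps over the body; the class and method names are hypothetical, and it assumes the attribute is smaller than 2 GB so the length fits in an int.

```java
import java.nio.ByteBuffer;

class AttributeSkipper {
    // Reads one attribute header at the buffer position, skips its body,
    // and returns {name_index, attribute_length}
    static long[] skipAttribute(ByteBuffer buf) {
        int nameIndex = buf.getShort() & 0xFFFF;     // u2 name index into the constant pool
        long length = buf.getInt() & 0xFFFFFFFFL;    // u4 attribute_length
        buf.position(buf.position() + (int) length); // jump over info[attribute_length]
        return new long[] {nameIndex, length};
    }

    public static void main(String[] args) {
        // name_index = 15, length = 2, two body bytes, then one byte of following data
        byte[] bytes = {0x00, 0x0f, 0, 0, 0, 2, 0x11, 0x22, 0x7f};
        ByteBuffer buf = ByteBuffer.wrap(bytes);
        long[] header = skipAttribute(buf);
        System.out.println(header[0] + " " + header[1] + " " + buf.position()); // 15 2 8
    }
}
```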

The JVM predefines many attribute tables. Here we will focus on the Code attribute table.

  • Code attribute table

We can continue to analyze the method table just now:

As you can see, the method descriptor index is followed by the attributes of the "add" method. 0x0001 is the attribute counter, which means there is only one attribute. 0x000f is the attribute name index. Looking it up in the constant pool shows that it is a Code attribute table, as shown below:

In the Code attribute table, the most important part is a series of bytecode instructions. Running javap -v Test.class shows the bytecode of each method. The following figure shows the bytecode instructions of the add method:

When the JVM executes the add method, it uses this series of instructions to perform corresponding operations.
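To give a feel for how a byte sequence maps to instructions, here is a deliberately tiny disassembler sketch. It covers only eight opcodes (values taken from the JVM instruction set), and the sample bytes in main are what javac would typically emit for the start of add; they are an assumption, not copied from the figure. The class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class MiniDisassembler {
    static List<String> disassemble(byte[] code) {
        List<String> lines = new ArrayList<>();
        int pc = 0;
        while (pc < code.length) {
            int opcode = code[pc] & 0xFF;
            String text;
            int size = 1; // opcode byte plus operand bytes
            switch (opcode) {
                case 0x10: text = "bipush " + code[pc + 1]; size = 2; break; // signed byte operand
                case 0x1b: text = "iload_1"; break;
                case 0x2a: text = "aload_0"; break;
                case 0x3d: text = "istore_2"; break;
                case 0x60: text = "iadd"; break;
                case 0xac: text = "ireturn"; break;
                case 0xb4: text = "getfield #" + u2(code, pc + 1); size = 3; break;
                case 0xb5: text = "putfield #" + u2(code, pc + 1); size = 3; break;
                default:
                    throw new IllegalArgumentException("opcode not covered by this sketch: " + opcode);
            }
            lines.add(pc + ": " + text);
            pc += size;
        }
        return lines;
    }

    // Two-byte constant pool index, most significant byte first
    private static int u2(byte[] b, int off) {
        return ((b[off] & 0xFF) << 8) | (b[off + 1] & 0xFF);
    }

    public static void main(String[] args) {
        // Assumed bytes for "int j = 10;" followed by loading this and reading a field
        byte[] code = {0x10, 0x0a, 0x3d, 0x2a, (byte) 0xb4, 0x00, 0x02};
        for (String line : disassemble(code)) {
            System.out.println(line);
        }
    }
}
```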

Summary:

In this lesson, we looked at what the data structure of a class file looks like, and used Test.class to walk through how the Java virtual machine parses a bytecode file. The constant pool is the key part: it acts as the resource warehouse of the class file, and the other structures all point into it to some degree. In practice we rarely open a .class file directly in a hexadecimal editor; commands such as javap and other tools let us view the data structures inside the class. Still, working through the bytes by hand is a great help in understanding how the JVM parses a class file and in remembering its structure.

Origin blog.csdn.net/gqg_guan/article/details/132366496