Across the whole process of developing and running a Java program, the stages that involve character encoding are:
- the .java source file;
- the .class bytecode file;
- runtime;
- output.
The encoding of a .java source file is either chosen by the user or falls back to the system default encoding, which depends on the language settings of the operating system. Encoding at this stage cannot be unified: each source file may use a different encoding.
Then we need to compile the .java source file into a .class file with javac. By default, however, javac reads the source file using the default encoding of the operating system. If my operating system defaults to GBK but my source file is saved as UTF-8, then javac decodes the UTF-8 bytes as GBK and garbled characters appear. In that case you need to pass javac the following parameter (I use jdk 1.8): -encoding utf-8
public class Test {
    public static void main(String[] args) {
        String a="你好,世界";
        System.out.println(a);
    }
}
Microsoft Windows [Version 10.0.16299.371]
(c) 2017 Microsoft Corporation. All rights reserved.
C:\Users\MasterVing>D:
D:\>javac Test.java
Test.java:4: error: unmappable character for encoding GBK
String a="浣犲ソ锛屼笘鐣?";
^
1 error
Microsoft Windows [Version 10.0.16299.371]
(c) 2017 Microsoft Corporation. All rights reserved.
C:\Users\MasterVing>D:
D:\>javac -encoding utf-8 Test.java
D:\>java Test
你好,世界
Of course, if we use an IDE, the IDE will add this -encoding parameter for us automatically.
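The garbling can also be reproduced without javac. The following is a minimal sketch, not what javac does internally: it takes the UTF-8 bytes of the greeting from Test.java and decodes them once with the wrong charset and once with the right one. The class name SourceEncodingDemo and the helper misread are made up for illustration, and the sketch assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class SourceEncodingDemo {
    // Hypothetical helper: encodes text as UTF-8, then decodes those bytes
    // with another charset, mimicking a reader that assumes the wrong encoding.
    public static String misread(String text, Charset assumed) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, assumed);
    }

    public static void main(String[] args) {
        // The greeting from Test.java, written with Unicode escapes so that
        // this file compiles the same way under any source encoding.
        String original = "\u4f60\u597d\uff0c\u4e16\u754c";

        // Assumption: the GBK charset is available in this JDK.
        Charset gbk = Charset.forName("GBK");

        // UTF-8 bytes read as GBK: garbled text.
        System.out.println(misread(original, gbk));
        // UTF-8 bytes read as UTF-8: the original text.
        System.out.println(misread(original, StandardCharsets.UTF_8));
    }
}
```

Whether the two printed lines display correctly still depends on the console encoding, which is the output stage discussed later.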
The encoding of the .class files that javac generates is unified: the strings stored in them use the Modified UTF-8 encoding (see the official documentation).
4.3. Descriptors
A descriptor is a string representing the type of a field or method. Descriptors are represented in the class file format using modified UTF-8 strings (§4.4.7) and thus may be drawn, where not further constrained, from the entire Unicode codespace.
While the program is running, strings use the UTF-16 encoding.
In the final output stage, the encoding can again be chosen freely; there is no mandatory requirement.
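As a rough illustration of choosing the output encoding, the sketch below wraps an output stream in a PrintStream with an explicit charset name and counts the bytes that come out. The class name OutputEncodingDemo and the helper encodedLength are hypothetical, and the GBK line assumes that charset is available in the JDK in use.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class OutputEncodingDemo {
    // Hypothetical helper: prints text through a PrintStream that uses the
    // named charset and returns how many bytes were written.
    public static int encodedLength(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unsupported charset: " + charsetName, e);
        }
        return sink.size();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Two Chinese characters, written as Unicode escapes.
        String greeting = "\u4f60\u597d";

        System.out.println(encodedLength(greeting, "UTF-8")); // 3 bytes per character
        System.out.println(encodedLength(greeting, "GBK"));   // 2 bytes per character

        // The same constructor can wrap System.out to fix the output encoding.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(greeting);
    }
}
```

The same string produces different bytes depending on the charset given to the stream, so the console or file that receives them has to be read with the matching encoding.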
UTF-8 and Modified UTF-8
UTF-8 is a variable-length encoding that uses at least 1 byte per character (for example, English letters) and at most 4 bytes (the original design allowed up to 6 bytes, but RFC 3629 limits it to 4). Chinese characters generally occupy 3 bytes.
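The byte counts are easy to check from Java. This is a small sketch with a made-up class name (Utf8LengthDemo) and helper (utf8Length); the sample characters are my own choice and are written as Unicode escapes so the file does not depend on the source encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    // Hypothetical helper: number of bytes the string occupies in standard UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // English letter: 1 byte
        System.out.println(utf8Length("\u00e9"));       // Latin letter with accent: 2 bytes
        System.out.println(utf8Length("\u4f60"));       // Chinese character: 3 bytes
        System.out.println(utf8Length("\uD83D\uDE00")); // U+1F600, outside the BMP: 4 bytes
    }
}
```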
Modified UTF-8 is a variant of the UTF-8 encoding that differs from standard UTF-8 in the following three points:
- The null character '\u0000' is encoded in 2 bytes instead of 1, so that no embedded null byte appears in the encoded string;
- Only the 1-byte to 3-byte formats are used;
- Supplementary characters are represented as surrogate pairs.
The differences between this format and the standard UTF-8 format are the following:
1. The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
2. Only the 1-byte, 2-byte, and 3-byte formats are used.
3. Supplementary characters are represented in the form of surrogate pairs.
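These differences can be observed with DataOutputStream.writeUTF, which writes a 2-byte length followed by the string in Modified UTF-8. The sketch below compares that output with standard UTF-8 for the null character and for one supplementary character. The class name ModifiedUtf8Demo and its helper methods are hypothetical, and U+1F600 is just a sample character I picked.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Encodes s as Modified UTF-8 via DataOutputStream.writeUTF and drops
    // the 2-byte length prefix that writeUTF puts in front of the data.
    public static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    // Encodes s as standard UTF-8 for comparison.
    public static byte[] standardUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Formats bytes as space-separated upper-case hex pairs.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String nul = "\u0000";
        String supplementary = "\uD83D\uDE00"; // U+1F600 as a surrogate pair

        System.out.println(hex(standardUtf8(nul)));           // 00
        System.out.println(hex(modifiedUtf8(nul)));           // C0 80
        System.out.println(hex(standardUtf8(supplementary))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary))); // ED A0 BD ED B8 80
    }
}
```

The null character takes 2 bytes instead of 1, and the supplementary character takes two 3-byte sequences (one per surrogate) instead of a single 4-byte sequence.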
UTF-16
UTF-16 is a variable-length encoding that represents each character with one or two 16-bit code units (one code unit is 2 bytes).
Therefore, a UTF-16 encoded character occupies 2 bytes or 4 bytes.
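In Java, a char is one UTF-16 code unit, so the 2-byte and 4-byte cases show up as strings of length 1 and length 2. A minimal sketch, with a hypothetical class name (Utf16Demo) and sample characters of my own choosing; UTF-16BE is used so that no byte-order mark is added to the byte count:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Demo {
    // Number of 16-bit code units (Java chars) in the string.
    public static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the string.
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Number of bytes in big-endian UTF-16, without a byte-order mark.
    public static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u4f60";                 // a Chinese character in the BMP
        String supplementary = "\uD83D\uDE00"; // U+1F600, needs a surrogate pair

        // One code unit, one character, 2 bytes.
        System.out.println(codeUnits(bmp) + " " + codePoints(bmp) + " " + utf16Bytes(bmp));
        // Two code units, one character, 4 bytes.
        System.out.println(codeUnits(supplementary) + " " + codePoints(supplementary)
                + " " + utf16Bytes(supplementary));
    }
}
```

This is also why String.length() can be larger than the number of characters a reader would count.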