[Java] Encoding at each stage

Across the whole process of developing and running a Java program, character encoding comes into play at the following stages:

  1. .java source file;
  2. .class bytecode file;
  3. Runtime;
  4. Output.

The encoding of a .java source file is either specified by the user or defaults to the system encoding, which depends on the language settings of the operating system. Encoding cannot be unified at this stage: each source file may use a different encoding.

Next we use javac to compile the .java source file into a .class file. By default, however, javac reads the source file using the default encoding of the operating system. If the operating system's default encoding is GBK but the source file is saved as UTF-8, javac decodes the UTF-8 bytes as GBK and produces garbled characters. In that case you need to pass the following parameter to javac (I am using JDK 1.8): -encoding utf-8

public class Test {

    public static void main(String[] args) {
        String a="你好，世界";
        System.out.println(a);
    }   
}
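
Compiling it with plain javac on a Windows system whose default encoding is GBK fails:

Microsoft Windows [Version 10.0.16299.371]
(c) 2017 Microsoft Corporation. All rights reserved.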

C:\Users\MasterVing>D:

D:\>javac Test.java
Test.java:4: error: unmappable character for encoding GBK
                String a="浣犲ソ锛屼笘鐣?";
                                 ^
1 error

Adding the parameter makes compilation succeed, and the program prints the expected text:

C:\Users\MasterVing>D:

D:\>javac -encoding utf-8 Test.java

D:\>java Test
你好，世界

Of course, if we use an IDE, the IDE will automatically add this parameter for us.
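
The garbling seen in the javac error can be reproduced in a few lines. The following is a minimal sketch, not what javac does internally: it encodes a string as UTF-8 and decodes the bytes with a different charset. The class name MojibakeDemo and the method misread are made up for illustration, and the sketch assumes the runtime ships the GBK charset (it checks first). The text is written with Unicode escapes so the file compiles under any source encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes those bytes with another charset.
    // This mimics reading a UTF-8 source file with the wrong encoding.
    static String misread(String text, Charset wrongCharset) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, wrongCharset);
    }

    public static void main(String[] args) {
        // The first two characters of the string literal in Test.java.
        String original = "\u4f60\u597d";

        // GBK is an extended charset and may be missing from trimmed-down runtimes.
        if (!Charset.isSupported("GBK")) {
            System.out.println("GBK is not available in this runtime");
            return;
        }

        String garbled = misread(original, Charset.forName("GBK"));

        // Print code points rather than the characters themselves, so the
        // result does not depend on the console encoding.
        System.out.println("original length: " + original.length());
        System.out.println("garbled length:  " + garbled.length());
        garbled.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Two 3-byte UTF-8 characters become three 2-byte GBK characters, which is why the garbled text in the error message is longer than the original.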

The encoding of the .class files that javac generates is unified: strings in them are stored in Modified UTF-8 (see the official documentation).

4.3. Descriptors
A descriptor is a string representing the type of a field or method. Descriptors are represented in the class file format using modified UTF-8 strings (§4.4.7) and thus may be drawn, where not further constrained, from the entire Unicode codespace.

While the program is running, UTF-16 is used: String and char values are sequences of UTF-16 code units.

In the final output stage, the encoding can again be chosen freely; no particular encoding is required.
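
As a rough illustration of choosing the output encoding, the sketch below prints the same string through a PrintStream configured with different charsets and counts the bytes produced. The class name OutputEncodingDemo and the helper printWith are hypothetical; an in-memory stream stands in for the console or a file.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

class OutputEncodingDemo {

    // Prints the text through a PrintStream that uses the named charset
    // and returns the bytes that reached the underlying stream.
    static byte[] printWith(String text, String charsetName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            PrintStream out = new PrintStream(sink, true, charsetName);
            out.print(text);
            out.flush();
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        String text = "\u4f60\u597d"; // two Chinese characters, as Unicode escapes

        System.out.println(printWith(text, "UTF-8").length);    // 6
        System.out.println(printWith(text, "UTF-16BE").length); // 4
    }
}
```

The same constructor can wrap System.out, for example new PrintStream(System.out, true, "UTF-8"), when the console is known to expect a specific encoding.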

UTF-8 and Modified UTF-8

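UTF-8 is a variable-length encoding. A character occupies at least 1 byte (for example, English letters) and at most 4 bytes under the current standard (RFC 3629); the original design allowed sequences of up to 6 bytes. Chinese characters generally occupy 3 bytes.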
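
The byte counts can be checked directly with String.getBytes. This is a small sketch with a made-up class name (Utf8LengthDemo) and sample characters chosen for illustration; they are written as Unicode escapes so the file does not depend on the source encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {

    // Number of bytes the string occupies when encoded as standard UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1: English letter
        System.out.println(utf8Length("\u00e9"));       // 2: Latin letter with accent (U+00E9)
        System.out.println(utf8Length("\u4f60"));       // 3: Chinese character (U+4F60)
        System.out.println(utf8Length("\ud83d\ude00")); // 4: supplementary character (U+1F600)
    }
}
```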

Modified UTF-8 is a variant of the UTF-8 encoding that differs from standard UTF-8 in the following three points:

  1. The null character '\u0000' is encoded in 2 bytes instead of 1 byte, so that an encoded string never contains an embedded null byte;
  2. Only the 1-byte, 2-byte, and 3-byte formats are used;
  3. Supplementary characters are represented as surrogate pairs, with each surrogate encoded separately.

In the words of the official documentation:

The differences between this format and the standard UTF-8 format are the following:
  1. The null byte '\u0000' is encoded in 2-byte format rather than 1-byte, so that the encoded strings never have embedded nulls.
  2. Only the 1-byte, 2-byte, and 3-byte formats are used.
  3. Supplementary characters are represented in the form of surrogate pairs.
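
These differences can be observed from Java itself, because DataOutputStream.writeUTF writes strings in modified UTF-8 (preceded by a 2-byte length). The sketch below compares its output with standard UTF-8; the class ModifiedUtf8Demo and its helper methods are names invented for this example.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ModifiedUtf8Demo {

    // Returns the modified UTF-8 bytes of s, without the 2-byte length
    // prefix that writeUTF puts in front of them.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(sink)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] withLength = sink.toByteArray();
        return Arrays.copyOfRange(withLength, 2, withLength.length);
    }

    // Formats bytes as space-separated uppercase hex, e.g. "C0 80".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String nul = "\u0000";                 // the null character
        String supplementary = "\ud83d\ude00"; // U+1F600 as a surrogate pair

        System.out.println(hex(nul.getBytes(StandardCharsets.UTF_8)));           // 00
        System.out.println(hex(modifiedUtf8(nul)));                              // C0 80
        System.out.println(hex(supplementary.getBytes(StandardCharsets.UTF_8))); // F0 9F 98 80
        System.out.println(hex(modifiedUtf8(supplementary)));                    // ED A0 BD ED B8 80
    }
}
```

The supplementary character takes 4 bytes in standard UTF-8 but 6 bytes in modified UTF-8, since each of the two surrogates is written in the 3-byte format.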

UTF-16

UTF-16 is a variable-length encoding in which each character is represented by one or two 16-bit code units (one code unit is 2 bytes).

Therefore, a UTF-16 encoded character occupies either 2 bytes or 4 bytes.
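
Since Java strings are UTF-16 at runtime, the 2-byte versus 4-byte distinction shows up in the String API. In this small sketch (the class name Utf16Demo and the sample characters are chosen for illustration), length() counts 16-bit code units while codePointCount counts characters.

```java
import java.nio.charset.StandardCharsets;

class Utf16Demo {

    // Number of bytes the string occupies in UTF-16 (big-endian, no byte order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String bmp = "\u4f60";                 // one code unit (U+4F60)
        String supplementary = "\ud83d\ude00"; // two code units (U+1F600)

        System.out.println(bmp.length());                        // 1
        System.out.println(bmp.codePointCount(0, bmp.length())); // 1
        System.out.println(utf16Bytes(bmp));                     // 2

        System.out.println(supplementary.length());                                  // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
        System.out.println(utf16Bytes(supplementary));                               // 4
    }
}
```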
