Java character set encoding

Java's default (internal) character set is Unicode: a char takes two bytes (1 byte = 8 bits).

Details:

Character set encoding

Unicode is a "character set"; UTF-8 is a set of "encoding rules" (the most widely used implementation of Unicode).

Character set: assigns a unique ID (code point) to each character.

Encoding rules: the rules for converting code points into byte sequences (what is actually stored).
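
To make the distinction concrete, here is a minimal sketch that prints the code point of one character and then the bytes that two different encoding rules produce for it. The class name CodePointVsBytes and the helper hex are illustrative names, not part of any library; the character is written as a Unicode escape so the result does not depend on how the source file is saved.

```java
import java.nio.charset.StandardCharsets;

class CodePointVsBytes {
    // Format a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6D4B"; // the character 测
        // The character set gives the character its ID (code point).
        System.out.println(String.format("code point: U+%04X", s.codePointAt(0))); // code point: U+6D4B
        // The encoding rules decide which bytes represent that ID.
        System.out.println("UTF-8:    " + hex(s.getBytes(StandardCharsets.UTF_8)));    // UTF-8:    E6 B5 8B
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // UTF-16BE: 6D 4B
    }
}
```

The same code point yields three bytes under UTF-8 and two under UTF-16BE: one character set, two encodings.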

 

 

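Encoding                  English / bytes   Chinese / bytes
UTF-8 (variable length)   1                 3
UTF-16                    2                 2 (4 for characters outside the Basic Multilingual Plane)
GBK                       1                 2
ISO8859-1                 1                 not representable (Java's getBytes substitutes '?', 1 byte)
Unicode (Java char)       2                 2 (also punctuation)
ASCII                     1                 not representable (Java's getBytes substitutes '?', 1 byte)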
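
The byte counts can be checked directly. The following is a small sketch, not a definitive benchmark: EncodingSizes and byteLength are illustrative names, UTF-16BE is used so that no byte order mark is counted, and GBK is assumed to be available in the runtime (the loop skips any charset that is not). For charsets that cannot represent a Chinese character at all, getBytes substitutes a replacement byte, so the count reported is that of the substitute.

```java
import java.nio.charset.Charset;

class EncodingSizes {
    // Number of bytes produced when text is encoded with the named charset.
    static int byteLength(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String english = "a";
        String chinese = "\u6D4B"; // 测
        String[] names = {"UTF-8", "UTF-16BE", "GBK", "ISO-8859-1", "US-ASCII"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            System.out.println(name + ": English=" + byteLength(english, name)
                    + ", Chinese=" + byteLength(chinese, name));
        }
    }
}
```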

 

How Java handles encoding:

Encoding issues arise in two places: inside the JVM and outside the JVM.

  1. Compiling Java source files into class files

A Java source file may be saved in any of several encodings (UTF-8 is common). The compiler decodes the source using the encoding it is told to use (the javac -encoding option, otherwise the platform default) and writes a class file in a fixed, platform-independent format: string constants are stored in the class file in a modified UTF-8 form and are represented in memory as Unicode (UTF-16). Once the source has been compiled into a .class file, the encoding of the original source file no longer matters.

Thus, for a string defined in Java code as String s = "characters"; no matter which encoding the source file used before compilation, the compiled class represents it the same way: as Unicode.
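
As a rough illustration, the sketch below prints the UTF-16 code units that a string literal consists of at run time. LiteralCodeUnits and codeUnits are illustrative names; the literals use Unicode escapes, which is one way to make a source file independent of its own encoding.

```java
class LiteralCodeUnits {
    // Render each UTF-16 code unit of a string as four hex digits.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String bmp = "\u6D4B";                 // 测, inside the Basic Multilingual Plane
        String supplementary = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(bmp));           // 6D4B
        System.out.println(codeUnits(supplementary)); // D83D DE00
        System.out.println(supplementary.length());   // 2
        System.out.println(supplementary.codePointCount(0, supplementary.length())); // 1
    }
}
```

A character outside the Basic Multilingual Plane takes two char values, which is why length() and codePointCount() can disagree.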

  2. Encoding inside the JVM

Inside the JVM, characters are uniformly represented in Unicode. When characters move from inside the JVM to the outside (for example, when they are stored as the contents of a file in the file system), they are transcoded using a specific encoding scheme. So all encoding conversion happens only at the boundary with the local environment, which is where the various input/output streams come into play.

When the JVM loads a class file, it decodes the string constants correctly, so the original definition String s = "characters"; is represented in memory in Unicode.
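
The boundary conversion can be sketched with an in-memory stream standing in for a file. This is only an illustration under that assumption: BoundaryTranscoding, write, and read are illustrative names, and a real program would typically wrap a file or socket stream instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BoundaryTranscoding {
    // Leaving the JVM: the writer encodes characters into bytes.
    static byte[] write(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, charset)) {
            writer.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Entering the JVM: the reader decodes bytes back into characters.
    static String read(byte[] bytes, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), charset)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "a\u6D4B"; // "a" followed by 测
        byte[] stored = write(text, StandardCharsets.UTF_8);
        System.out.println(stored.length);                                     // 4
        System.out.println(read(stored, StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

The string itself never changes; only the bytes written at the boundary depend on the charset passed to the writer and reader.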

 

Question

In Java, how many bytes does one character occupy?

Or, in more detail: in Java, how many bytes does an English character occupy? How many bytes does a Chinese character occupy?

 

Java uses Unicode to represent characters. A Java char is 2 bytes, so an English or a Chinese character held in a char occupies 2 bytes (characters outside the Basic Multilingual Plane need two chars). In other encodings, the number of bytes per character varies.

The following code verifies this:

public class CharBytes {
    public static void main(String[] args) {
        String str = "测";
        char x = '测';
        byte[] byteStr = str.getBytes();   // encoded with the platform default charset
        byte[] byteChar = charToByte(x);   // the two raw bytes of the char
        System.out.println("byteStr :" + byteStr.length); // byteStr :3 (when the default charset is UTF-8)
        System.out.println("byteChar:" + byteChar.length); // byteChar:2
    }

    // Build the byte array of a char by shifting (high byte first)
    public static byte[] charToByte(char c) {
        byte[] b = new byte[2];
        b[0] = (byte) ((c & 0xFF00) >> 8);
        b[1] = (byte) (c & 0xFF);
        return b;
    }
}

 

Getting the system encoding

// requires: import java.nio.charset.Charset;
System.out.println("System default encoding: " + System.getProperty("file.encoding")); // result: UTF-8
System.out.println("System default charset: " + Charset.defaultCharset()); // result: UTF-8
System.out.println("System default language: " + System.getProperty("user.language")); // result: zh

 

getBytes() in detail

Here is an explanation of the getBytes() method used above.

In Java, String's getBytes() method returns a byte array in the operating system's default encoding. This means the bytes returned can differ from one operating system to another!

1. str.getBytes();  // with no charset in the parentheses, the platform default charset is used (Charset.defaultCharset(), which corresponds to System.getProperty("file.encoding"))

2. str.getBytes("charset");  // with a charset specified, the Unicode characters stored internally are encoded into a byte array in that charset

3. String s2 = new String(str.getBytes("utf-8"), "gbk");  // new String(bytes, "gbk") interprets the byte array as GBK and converts it to Unicode in memory; because these bytes were produced as UTF-8, decoding them as GBK yields garbled text
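
The effect of matching and mismatching charsets can be seen in a short sketch. DecodeMismatch and transcode are illustrative names, and the GBK branch assumes that charset is available in the runtime, so it is guarded.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeMismatch {
    // Encode text with one charset, then decode the resulting bytes with another.
    static String transcode(String text, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = text.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u6D4B"; // 测

        // Same charset on both sides: the text survives the round trip.
        String same = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(same.equals(original)); // true

        // Different charsets: the bytes are misread and the text is garbled.
        if (Charset.isSupported("GBK")) {
            String garbled = transcode(original, StandardCharsets.UTF_8, Charset.forName("GBK"));
            System.out.println(garbled.equals(original)); // false
        }
    }
}
```

Passing an explicit charset on both the getBytes side and the new String side is what keeps the result independent of the platform default.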

 

Extension

Question:

"a".getBytes("Unicode").length // result: 4

It was said above that a Unicode character occupies two bytes, so why is the result 4 bytes rather than 2?

Why "Unicode" gives 4 bytes

Looping over the resulting byte array with a for loop (for the character a) prints:

-2 -1 0 97 

 

There are two extra bytes at the front, -2 -1 (0xFE 0xFF), which are the BOM (byte order mark).

Unicode is a character set. When "Unicode" is used directly as an encoding name, Java transcodes using UTF-16. UTF-16 comes in two byte orders, UTF-16LE (little-endian) and UTF-16BE (big-endian), so a receiver, for example during network transmission, cannot tell which order was used; a BOM header is therefore added to indicate the byte order. The BOM is a special character with code point U+FEFF, named "ZERO WIDTH NO-BREAK SPACE". According to section 3.2 of RFC 2781, a stream whose first two bytes are FE FF is big-endian, and one that begins with FF FE is little-endian.

 

About UTF-16: UTF-16 has two byte orders, big-endian and little-endian:
UTF-16 Big Endian: begins with FE FF (the identifier U+FEFF); this is the order Java selects by default when encoding
UTF-16 Little Endian: begins with FF FE (U+FFFE has no meaning as a character in UCS-2)
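
On the decoding side, the BOM rule can be sketched as a check of the first two bytes. BomSniffer and byteOrder are illustrative names, and the check covers only the two UTF-16 marks described here, not BOMs of other encodings.

```java
class BomSniffer {
    // Report the byte order announced by a leading UTF-16 BOM, if there is one.
    static String byteOrder(byte[] data) {
        if (data.length >= 2) {
            int b0 = data[0] & 0xFF; // mask so the byte is read as an unsigned value
            int b1 = data[1] & 0xFF;
            if (b0 == 0xFE && b1 == 0xFF) {
                return "big-endian";
            }
            if (b0 == 0xFF && b1 == 0xFE) {
                return "little-endian";
            }
        }
        return "no BOM";
    }

    public static void main(String[] args) {
        System.out.println(byteOrder(new byte[] {-2, -1, 0, 97})); // big-endian
        System.out.println(byteOrder(new byte[] {-1, -2, 97, 0})); // little-endian
        System.out.println(byteOrder(new byte[] {0, 97}));         // no BOM
    }
}
```

Java bytes are signed, so 0xFE and 0xFF print as -2 and -1; masking with 0xFF recovers the unsigned values used in the specification.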

 

So when you convert to bytes using just "Unicode", Java encodes as UTF-16 in big-endian order and adds the two extra BOM bytes FE FF.
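
The three UTF-16 variants can be compared side by side using the constants in StandardCharsets. This is a small sketch; Utf16Variants and bytesOf are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf16Variants {
    // Encode text with the given charset and show the signed byte values.
    static String bytesOf(String text, Charset charset) {
        return Arrays.toString(text.getBytes(charset));
    }

    public static void main(String[] args) {
        System.out.println(bytesOf("a", StandardCharsets.UTF_16));   // [-2, -1, 0, 97]
        System.out.println(bytesOf("a", StandardCharsets.UTF_16BE)); // [0, 97]
        System.out.println(bytesOf("a", StandardCharsets.UTF_16LE)); // [97, 0]
    }
}
```

Only the plain UTF-16 charset writes a BOM when encoding; the variants that name their byte order explicitly do not.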

Solution:

Use the UnicodeBigUnmarked encoding (big-endian UTF-16 without a BOM):

"a".getBytes("UnicodeBigUnmarked").length // result: 2

 


Origin www.cnblogs.com/scChen/p/12508571.html