Java character set encoding
Java's internal character set is Unicode: a char occupies two bytes (1 byte = 8 bits).
Details:
Character set encoding
Unicode is a "character set"; UTF-8 is a set of "encoding rules" (it is the most widely used implementation of Unicode).
Character set: assigns a unique ID (code point) to each character.
Encoding rules: the rules for converting a code point into a sequence of bytes (that is, how it is stored).
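To make the distinction concrete, here is a minimal sketch that prints the code point of one character and the bytes two different encodings produce for it. The class name CodePointDemo and the helper hex are made up for this illustration; the character is written as the escape \u6D4B (测) so that the encoding of the source file does not affect the result.

```java
import java.nio.charset.StandardCharsets;

public class CodePointDemo {
    // Hypothetical helper: formats bytes as space-separated upper-case hex.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u6D4B"; // the character 测
        // The character set gives the character one ID (its code point) ...
        System.out.printf("code point: U+%04X%n", s.codePointAt(0));
        // ... and each set of encoding rules turns that ID into different bytes.
        System.out.println("UTF-8   : " + hex(s.getBytes(StandardCharsets.UTF_8)));    // E6 B5 8B
        System.out.println("UTF-16BE: " + hex(s.getBytes(StandardCharsets.UTF_16BE))); // 6D 4B
    }
}
```

The code point stays the same (U+6D4B); only the stored bytes change with the encoding.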
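Encoding | English (bytes) | Chinese (bytes)
UTF-8 (variable length) | 1 | 3
UTF-16 | 2 | 2 (4 for supplementary characters)
GBK | 1 | 2
ISO8859-1 | 1 | 1 (cannot represent Chinese; Java encodes it as '?')
Unicode | 2 | 2 (also punctuation)
ASCII | 1 | 1 (cannot represent Chinese; Java encodes it as '?')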
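The byte counts can be checked with a short sketch like the one below. The class name ByteCountDemo and the method byteCount are invented for this example. GBK is an extended charset and is only assumed to be present in the running JDK, so the loop skips any charset the runtime does not support. UTF-16BE is used instead of UTF-16 so that no byte order mark is counted.

```java
import java.nio.charset.Charset;

public class ByteCountDemo {
    // Hypothetical helper: number of bytes the text occupies in the named charset.
    public static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String english = "a";
        String chinese = "\u6D4B"; // 测, written as an escape
        String[] names = {"UTF-8", "UTF-16BE", "GBK", "ISO-8859-1", "US-ASCII"};
        for (String name : names) {
            if (!Charset.isSupported(name)) {
                System.out.println(name + ": not available in this runtime");
                continue;
            }
            // Charsets that cannot represent the character substitute '?' (one byte).
            System.out.printf("%-10s English=%d Chinese=%d%n",
                    name, byteCount(english, name), byteCount(chinese, name));
        }
    }
}
```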
How Java handles it:
There are two sides to the encoding problem: inside the JVM and outside the JVM.
1. Compiling Java source files into class files
The Java source file may be in any of several encodings (UTF-8 is the most common). The compiler reads the source using the given encoding (the -encoding option of javac, otherwise the platform default), converts the text to Unicode, and produces a platform-independent .class file. String constants are stored in the class file in a modified UTF-8 form and become UTF-16 strings once loaded. After compilation, the encoding of the original source file no longer matters.
Thus, for a string defined in Java code such as String s = "characters";
no matter which encoding the .java file used before compilation, the result after compiling to a class is the same: a Unicode representation.
2. Encoding inside the JVM
Inside the JVM, characters are uniformly represented in Unicode. When characters move from inside the JVM to the outside (for example, when they are stored as the contents of a file in the file system), they are transcoded using a specific encoding scheme. So all encoding conversion happens only at the boundary with the local system, which is where the various input/output streams come into play.
When the JVM loads a class file, it decodes the string constants correctly, so the original definition String s = "characters"; is represented in memory as Unicode (UTF-16).
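The boundary conversion described above can be sketched as follows. This is only an illustration: the class name BoundaryDemo and the method sizeOnDisk are hypothetical, and a temporary file stands in for "the outside". The same in-memory string is written with two different charsets and read back with the matching one.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BoundaryDemo {
    // Hypothetical helper: writes the text to a temp file in the given charset,
    // checks that decoding with the same charset restores it, and returns the file size.
    public static long sizeOnDisk(String text, Charset charset) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                // Leaving the JVM: Unicode characters are encoded into bytes.
                Files.write(tmp, text.getBytes(charset));
                // Entering the JVM: bytes are decoded back into Unicode characters.
                String back = new String(Files.readAllBytes(tmp), charset);
                if (!back.equals(text)) {
                    throw new IllegalStateException("round trip failed for " + charset);
                }
                return Files.size(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String s = "\u6D4B"; // 测
        System.out.println("UTF-8 on disk   : " + sizeOnDisk(s, StandardCharsets.UTF_8) + " bytes");
        System.out.println("UTF-16LE on disk: " + sizeOnDisk(s, StandardCharsets.UTF_16LE) + " bytes");
    }
}
```

Passing the charset explicitly at every read and write keeps the result independent of the platform default.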
Question
In Java, how many bytes does one character occupy?
Or, in more detail: in Java, how many bytes is one English character? How many bytes is one Chinese character?
Java uses Unicode to represent characters. A Java char is 2 bytes, so a Chinese or an English character each takes 2 bytes in memory; in other encodings, the number of bytes per character differs.
The following code verifies this:
public static void main(String[] args) {
    String str = "测";
    char x = '测';
    // Encodes with the platform default charset (UTF-8 on the author's system)
    byte[] byteStr = str.getBytes();
    byte[] byteChar = charToByte(x);
    System.out.println("byteStr :" + byteStr.length);  // byteStr :3
    System.out.println("byteChar:" + byteChar.length); // byteChar:2
}

// Get the byte array of a char by bit shifting
public static byte[] charToByte(char c) {
    byte[] b = new byte[2];
    b[0] = (byte) ((c & 0xFF00) >> 8); // high byte
    b[1] = (byte) (c & 0xFF);          // low byte
    return b;
}
Getting the system encoding
// Requires: import java.nio.charset.Charset;
System.out.println("system default encoding: " + System.getProperty("file.encoding"));   // result: UTF-8
System.out.println("system default character encoding: " + Charset.defaultCharset());    // result: UTF-8
System.out.println("system default language: " + System.getProperty("user.language"));   // result: zh
The getBytes() method in detail
The getBytes() method used above also deserves an explanation.
In Java, String.getBytes() returns a byte array in the platform's default encoding. This means that the result may differ from one operating system to another!
1. str.getBytes(); if no charset is given in the parentheses, the default charset is used (System.getProperty("file.encoding")), that is, the platform default encoding.
2. str.getBytes("charset"); // with a charset specified, the Unicode characters stored internally are encoded into a byte array in that charset
3. String s2 = new String(str.getBytes("utf-8"), "gbk"); // the byte array is decoded as GBK and converted back to Unicode in memory; the bytes must really be GBK-encoded, otherwise the resulting string is garbled
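The three cases above can be tried out with a sketch like this one. The class name GetBytesDemo and the method roundTrip are made up for the example. ISO-8859-1 is used as the mismatched charset in place of GBK, only because it is guaranteed to exist in every Java runtime; the effect of decoding with the wrong charset is the same in kind.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GetBytesDemo {
    // Hypothetical helper: encode with one charset, decode with another.
    public static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u6D4B"; // 测

        // Same charset on both sides: the string survives unchanged.
        String same = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println("matching charsets equal: " + same.equals(original));

        // Mismatched charsets: three UTF-8 bytes are read as three Latin-1 characters.
        String garbled = roundTrip(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);
        System.out.println("garbled length: " + garbled.length());

        // ISO-8859-1 maps every byte to a character, so this particular mistake is reversible.
        String repaired = roundTrip(garbled, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        System.out.println("repaired equal: " + repaired.equals(original));
    }
}
```

Not every mismatch can be undone this way: a charset that replaces invalid byte sequences with a substitute character loses the original bytes.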
Extension
Question:
"a".getBytes("Unicode").length // result: 4
We said above that a Unicode character occupies two bytes, so why is the result 4 bytes rather than 2?
Why "Unicode" gives 4 bytes
Loop over the resulting byte array with a for loop (still using the character a) and print each byte:
-2 -1 0 97
The array starts with an extra -2 -1, which is the byte order mark (BOM).
Unicode is a character set, not an encoding. When "Unicode" is used directly as a charset name, Java encodes with UTF-16. UTF-16 comes in two byte orders, UTF-16LE (little-endian) and UTF-16BE (big-endian), and a receiver, for example at the other end of a network transmission, cannot otherwise tell which one was used, so a BOM is added at the front. The BOM is the character U+FEFF, named "ZERO WIDTH NO-BREAK SPACE". According to section 3.2 of RFC 2781, text that starts with the two bytes FE FF is big-endian, and text that starts with FF FE is little-endian.
To summarize UTF-16: it has two byte orders, big-endian and little-endian:
UTF-16 Big Endian: the stream starts with FE FF, which serves as the identification mark
UTF-16 Little Endian: the stream starts with FF FE (the byte-swapped value U+FFFE is not a valid character, so the order is unambiguous). When encoding, Java chooses big-endian by default.
So when you convert to bytes with "Unicode", Java encodes as UTF-16 in big-endian order and prepends the two BOM bytes FE FF (printed as -2 -1 because Java bytes are signed), giving 4 bytes in total.
Solution:
Use the UnicodeBigUnmarked encoding (big-endian UTF-16 without a BOM):
"a".getBytes("UnicodeBigUnmarked").length // result: 2
References:
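The BOM behaviour can be seen side by side with a small sketch. The class name BomDemo and the method signedBytes are invented for this illustration; the charset names UTF-16, UTF-16BE and UTF-16LE are standard charsets that every Java runtime supports.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class BomDemo {
    // Hypothetical helper: the encoded bytes as Java prints them (signed values).
    public static String signedBytes(String text, String charsetName) {
        return Arrays.toString(text.getBytes(Charset.forName(charsetName)));
    }

    public static void main(String[] args) {
        // UTF-16 writes a big-endian BOM (FE FF, shown as -2 -1) before the data.
        System.out.println("UTF-16  : " + signedBytes("a", "UTF-16"));   // [-2, -1, 0, 97]
        // With the byte order fixed by the charset name, no BOM is written.
        System.out.println("UTF-16BE: " + signedBytes("a", "UTF-16BE")); // [0, 97]
        System.out.println("UTF-16LE: " + signedBytes("a", "UTF-16LE")); // [97, 0]
    }
}
```

Naming the byte order explicitly (UTF-16BE or UTF-16LE) is the portable way to get exactly two bytes per char with no BOM.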
http://bbs.itheima.com/thread-101106-1-1.html
https://blog.csdn.net/lcfeng1982/article/details/6830584