Character set encoding in Java

The concepts of ASCII, Unicode and UTF-8 are not repeated here; their definitions can be found on Wikipedia.

For a quick start, I recommend Ruan Yifeng's blog post "字符编码笔记:ASCII,Unicode 和 UTF-8" (Notes on character encoding: ASCII, Unicode and UTF-8).

This article describes how characters are stored and read in Java.

A series of questions raised by one Chinese character, "Zhang" (张)

  • First, determine the Unicode code point of "Zhang" (张): in hexadecimal it is U+5F20
  • Converting hexadecimal 5F20 to binary gives 0101 1111 0010 0000

  • Look at the rules for converting Unicode to UTF-8:
    • A character in the ASCII range is encoded as a single byte; a character beyond the ASCII range is encoded as several bytes. The benefit is that a Unicode file containing only ASCII characters is stored with one byte per character, so it is no different from an ordinary ASCII file and reads back the same way. UTF-8 is therefore compatible with existing ASCII files.
    • For a character beyond the ASCII range, the high bits of the first byte indicate the length of the encoding: a first byte of the form 110xxxxx means a 2-byte character, 1110xxxx means a 3-byte character, and so on. Every following byte has the form 10xxxxxx. The x positions are filled with the binary digits of the code point, starting from the least significant bit on the right, and any unused high x bits are set to 0. A character must be encoded with the shortest multi-byte sequence that is sufficient to hold it. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the total number of bytes in the sequence.

Because U+5F20 falls in the range U+0800 ~ U+FFFF, the character "Zhang" takes 3 bytes in UTF-8. Applying the conversion rules above (1110xxxx 10xxxxxx 10xxxxxx), the UTF-8 encoding of "Zhang" is 1110 0101 1011 1100 1010 0000, that is, E5 BC A0 in hexadecimal.
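The derivation above can be sketched in code. This is only an illustration, not part of the original article: the class name `Utf8ThreeByteSketch` and the method `encodeThreeBytes` are hypothetical, and the method handles only the 3-byte range U+0800 ~ U+FFFF (it does not reject surrogate code points). The result is compared with what `String.getBytes(StandardCharsets.UTF_8)` produces.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, written only to illustrate the 3-byte rule.
class Utf8ThreeByteSketch {

    // Encodes a code point in U+0800..U+FFFF as 1110xxxx 10xxxxxx 10xxxxxx.
    static byte[] encodeThreeBytes(int codePoint) {
        if (codePoint < 0x0800 || codePoint > 0xFFFF) {
            throw new IllegalArgumentException("outside the 3-byte range");
        }
        byte b1 = (byte) (0xE0 | (codePoint >> 12));         // 1110 + top 4 bits
        byte b2 = (byte) (0x80 | ((codePoint >> 6) & 0x3F)); // 10 + middle 6 bits
        byte b3 = (byte) (0x80 | (codePoint & 0x3F));        // 10 + low 6 bits
        return new byte[] {b1, b2, b3};
    }

    public static void main(String[] args) {
        byte[] manual = encodeThreeBytes(0x5F20);
        byte[] library = "\u5F20".getBytes(StandardCharsets.UTF_8);

        StringBuilder hex = new StringBuilder();
        for (byte b : manual) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());          // E5 BC A0
        System.out.println(Arrays.equals(manual, library)); // true
    }
}
```

Writing the character as the escape `\u5F20` keeps the example independent of the encoding of the source file itself.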

This is consistent with the encoding of "Zhang" that we see when viewing the file's bytes directly (the default encoding is UTF-8).

  • Reading characters with FileReader
File file = new File(System.getProperty("user.dir") + "/src/main/java/com/dsying/IO/a.txt");

// FileReader decodes the bytes with the default charset (UTF-8 here)
try (Reader reader = new FileReader(file)) {
    System.out.println(reader.read()); // 24352
}

24352 is indeed the code point of the Chinese character "Zhang" in decimal (0x5F20 = 24352).
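The result 24352 depends on the bytes being decoded as UTF-8. As a minimal sketch, assuming the file contains exactly the three bytes E5 BC A0, the same read can be reproduced from an in-memory byte array with the charset named explicitly. The class `ExplicitCharsetReadSketch` and the method `firstChar` are hypothetical names used only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: decode UTF-8 bytes into a char without touching a file.
class ExplicitCharsetReadSketch {

    // Returns the first char decoded from the given bytes, or -1 if empty.
    static int firstChar(byte[] data) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return reader.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed file content: the UTF-8 bytes derived earlier.
        byte[] utf8 = {(byte) 0xE5, (byte) 0xBC, (byte) 0xA0};
        int c = firstChar(utf8);
        System.out.println(c);                      // 24352
        System.out.println(Integer.toHexString(c)); // 5f20
    }
}
```

Naming the charset explicitly avoids relying on whatever default the running JVM happens to use.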

  • Reading bytes with FileInputStream
File file = new File(System.getProperty("user.dir") + "/src/main/java/com/dsying/IO/a.txt");

// read() returns the next byte as an int in the range 0 to 255
try (FileInputStream is = new FileInputStream(file)) {
    System.out.println(is.read()); // 229
}

229 in hexadecimal is indeed E5, which is the first byte of the UTF-8 encoding of "Zhang".

  • Reading a byte array with FileInputStream
File file = new File(System.getProperty("user.dir") + "/src/main/java/com/dsying/IO/a.txt");

try (FileInputStream is = new FileInputStream(file)) {
    // Read up to 3 bytes in one call
    byte[] bytes = new byte[3];
    is.read(bytes);
    System.out.println(Arrays.toString(bytes)); // [-27, -68, -96]
}

Why are the bytes in memory negative? The first byte, E5, is 229 in decimal, so why is -27 stored instead?

This is because a byte holds only 1 byte, that is, 8 binary digits, and Java's byte is signed, so it can only represent a number between -128 and 127. A byte clearly cannot hold 229, so the value overflows and wraps around to 229 - 256 = -27.

How, then, is -27 converted back to 229?

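The binary form of a negative number is obtained from the binary form of the positive number by inverting every bit and adding one (two's complement).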

  • 27 in binary is 0001 1011
  • Inverting every bit gives 1110 0100 (as an int, all higher bits are 1)
  • Adding 1 gives 1110 0101 (as an int, all higher bits are 1)
  • So -27 in binary is 1110 0101 (as an int, all higher bits are 1)

Then 1110 0101 & 0xff yields 1110 0101 with all higher bits set to 0, which is 229.
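The sign behaviour described above can be checked with a few lines of Java. This is a small sketch; the class `SignedByteSketch` and the method `toUnsigned` are hypothetical names for illustration, while `Byte.toUnsignedInt` is the standard library equivalent of the `& 0xff` mask.

```java
// Hypothetical sketch: how the byte E5 appears as -27 and is masked back to 229.
class SignedByteSketch {

    // Widening a byte to int copies the sign bit into the upper 24 bits;
    // masking with 0xFF clears them again.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 0xE5;                          // 229 does not fit, wraps to -27
        System.out.println(b);                         // -27
        System.out.println(Integer.toBinaryString(b)); // 11111111111111111111111111100101
        System.out.println(toUnsigned(b));             // 229
        System.out.println(Byte.toUnsignedInt(b));     // 229
    }
}
```

This also explains the earlier difference between the two reads: `InputStream.read()` returns an int from 0 to 255, while `read(byte[])` stores the same bits in signed byte slots.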

How many bytes does the char type occupy in Java? If it is 2 bytes, why is getBytes().length sometimes greater than 2?

I asked someone about this, and their answer was given as a screenshot.
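The screenshot is not reproduced here, so the following is only a sketch of the general Java behaviour and not the answer the author received. A char is a 2-byte UTF-16 code unit in memory, whereas `getBytes` re-encodes the string into the requested charset, so the byte count depends on that charset. The class `CharSizeSketch` and the method `utf8Length` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: char size in memory versus encoded byte length.
class CharSizeSketch {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String zhang = "\u5F20"; // the character with code point U+5F20

        System.out.println(Character.BYTES);  // 2, size of one char in memory
        System.out.println(zhang.length());   // 1, a single char
        System.out.println(utf8Length(zhang)); // 3, bytes E5 BC A0
        System.out.println(zhang.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(utf8Length("A"));  // 1, ASCII stays one byte
    }
}
```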

Origin juejin.im/post/5e008b10518825126131d001