Character encoding and decoding (Part 1: the relationship between Java byte streams and character streams, with source code analysis)

1. From binary code to characters

  • First, two facts about computers:
    • Every byte is made up of binary digits
    • Every character is ultimately represented by binary code; depending on the encoding used, one character may take up several bytes (and bytes are, of course, binary as well)
  • Different encoding / decoding specifications represent a character with different binary codes, for example:
    • In ASCII, the binary code '01101000' represents the character 'h'
    • In GBK, the binary code '11001010 11000000' represents the character '世'
    • In UTF-8, the binary code '11100100 10111000 10010110' represents the character '世'
  • With an encoding / decoding specification (ASCII, GBK, UTF-8, etc.), we can convert between characters and binary code
    • Encoding: converting characters into binary code (which the computer can process, for storage and network transmission)
    • Decoding: converting binary code into characters (so that humans can read it)
  • Clearly, encoding with one specification and decoding with another is likely to produce wrong data, for example:
    • Encoding the character '世' with GBK and then decoding it with UTF-8 (compare the different binary codes for '世' above)
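
The byte values behind the binary codes above can be checked directly. The following is a minimal sketch, assuming the running JDK ships the GBK charset (most full JDK builds do). The class name `EncodingCompare` and the helper `toHex` are illustrative names, not part of the original article.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingCompare {

    // Formats each byte as two lowercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so that negative bytes print as 00..ff
            // instead of being sign-extended to 32 bits.
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String shi = "\u4e16"; // the character '世', written as an escape
        Charset gbk = Charset.forName("GBK"); // assumption: GBK is available

        System.out.println("ASCII h      : " + toHex("h".getBytes(StandardCharsets.US_ASCII)));
        System.out.println("GBK   U+4E16 : " + toHex(shi.getBytes(gbk)));
        System.out.println("UTF-8 U+4E16 : " + toHex(shi.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The three lines should show `68`, `ca c0` and `e4 b8 96`: the same character yields different bytes under GBK and UTF-8, which is why the two cannot be mixed.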

2. Example: encoding / decoding with UTF-8

  • First, we need to know that UTF-8 works in units of 8 bits (1 byte), and that it is a variable-length multi-byte encoding:
    • An English letter is represented by 1 byte
    • A Chinese character is usually represented by 3 bytes

2.1 UTF-8 encoding

  • Let's write some Java code to see this in practice
    String content = "hello 世界";
    byte[] bytes = content.getBytes("UTF-8"); // encode with UTF-8 into a byte array (i.e. binary code)
    for (int i = 0; i < bytes.length; i++) {
        byte current = bytes[i];
        System.out.println(
                i + " -> " +
                "decimal: " + current +
                ", hex: " + Integer.toHexString(current) +
                ", binary: " + Integer.toBinaryString(current)
        );
    }
    
  • The output is as follows
    0 -> decimal: 104, hex: 68, binary: 1101000
    1 -> decimal: 101, hex: 65, binary: 1100101
    2 -> decimal: 108, hex: 6c, binary: 1101100
    3 -> decimal: 108, hex: 6c, binary: 1101100
    4 -> decimal: 111, hex: 6f, binary: 1101111
    5 -> decimal: 32, hex: 20, binary: 100000
    6 -> decimal: -28, hex: ffffffe4, binary: 11111111111111111111111111100100
    7 -> decimal: -72, hex: ffffffb8, binary: 11111111111111111111111110111000
    8 -> decimal: -106, hex: ffffff96, binary: 11111111111111111111111110010110
    9 -> decimal: -25, hex: ffffffe7, binary: 11111111111111111111111111100111
    10 -> decimal: -107, hex: ffffff95, binary: 11111111111111111111111110010101
    11 -> decimal: -116, hex: ffffff8c, binary: 11111111111111111111111110001100
    
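  • Explanation
    • The first 6 bytes are English letters and a symbol (the space). As in ASCII, each character is encoded as 1 byte, e.g.
      • 'h' -> '1101000'
      • 'e' -> '1100101'
    • The last 6 bytes are Chinese characters, each encoded as 3 bytes, e.g.
      • '世' -> '11111111111111111111111111100100' and '11111111111111111111111110111000' and '11111111111111111111111110010110'
    • The long strings of leading 1s are not part of the encoding: a Java byte is signed, and Integer.toBinaryString takes an int, so a negative byte is sign-extended to 32 bits before printing. Only the low 8 bits are the actual byte (here '11100100', '10111000', '10010110')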
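
To print exactly the 8 bits of each byte, the value can be masked with `0xFF` before formatting. The following is a minimal sketch of that idea; `ByteBits` and `toBinary8` are hypothetical names introduced only for illustration.

```java
import java.nio.charset.StandardCharsets;

class ByteBits {

    // Returns the 8 bits of a byte as a string, zero-padded on the left.
    static String toBinary8(byte b) {
        // b & 0xFF widens to int but keeps only the low 8 bits,
        // so negative bytes are not sign-extended.
        String bits = Integer.toBinaryString(b & 0xFF);
        StringBuilder sb = new StringBuilder();
        for (int i = bits.length(); i < 8; i++) {
            sb.append('0');
        }
        return sb.append(bits).toString();
    }

    public static void main(String[] args) {
        // "hello 世界", with the Chinese characters written as escapes
        byte[] bytes = "hello \u4e16\u754c".getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < bytes.length; i++) {
            System.out.println(i + " -> " + toBinary8(bytes[i]));
        }
    }
}
```

With this helper every line is 8 characters wide, which makes the UTF-8 lead byte (`1110xxxx`) and continuation bytes (`10xxxxxx`) easy to spot.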

2.2 UTF-8 decoding

  • Next, let's build the binary codes of the two characters by hand and decode them as UTF-8
    byte shi1 = 0b11111111111111111111111111100100;
    byte shi2 = 0b11111111111111111111111110111000;
    byte shi3 = 0b11111111111111111111111110010110;
    byte[] shi = {shi1, shi2, shi3};
    System.out.println(new String(shi, "UTF-8"));
    
    byte jie1 = 0b11111111111111111111111111100111;
    byte jie2 = 0b11111111111111111111111110010101;
    byte jie3 = 0b11111111111111111111111110001100;
    byte[] jie = {jie1, jie2, jie3};
    System.out.println(new String(jie, "UTF-8"));
    
  • The output is as follows
    世
    界
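
What the decoder does with such a three-byte sequence can be sketched by hand. In UTF-8, a three-byte character has the bit layout `1110xxxx 10xxxxxx 10xxxxxx`, and the `x` bits joined together form the Unicode code point. The helper below is an illustration only: `Utf8ThreeByte` and `decode3` are made-up names, and it checks only the marker bits, not every rule a real decoder enforces.

```java
class Utf8ThreeByte {

    // Decodes one three-byte UTF-8 sequence into a Unicode code point.
    static int decode3(byte b0, byte b1, byte b2) {
        // Check the marker bits: 1110xxxx 10xxxxxx 10xxxxxx
        if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80) {
            throw new IllegalArgumentException("not a three-byte UTF-8 sequence");
        }
        // Keep the payload bits (4 + 6 + 6) and join them.
        return ((b0 & 0x0F) << 12) | ((b1 & 0x3F) << 6) | (b2 & 0x3F);
    }

    public static void main(String[] args) {
        int shi = decode3((byte) 0xE4, (byte) 0xB8, (byte) 0x96);
        int jie = decode3((byte) 0xE7, (byte) 0x95, (byte) 0x8C);
        System.out.println("U+" + Integer.toHexString(shi).toUpperCase()
                + " " + new String(Character.toChars(shi)));
        System.out.println("U+" + Integer.toHexString(jie).toUpperCase()
                + " " + new String(Character.toChars(jie)));
    }
}
```

The two sequences decode to U+4E16 and U+754C, the code points of 世 and 界.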
    

2.3 Mismatched encoding and decoding

  • What happens if we first encode with GBK and then decode with UTF-8?
    String content = "hello 世界";
    
    byte[] gbkBytes = content.getBytes("GBK");
    String utf8Str = new String(gbkBytes, "UTF-8");
    System.out.println(utf8Str);
    System.out.println(content);
    
  • The output is as follows
    hello ����
    hello 世界
    
  • Clearly, because GBK and UTF-8 encode English characters in the same way, the 'hello ' part comes out correctly; but because they encode Chinese characters differently, the rest is garbled
  • Furthermore, what if a GBK-encoded text file is decoded as ISO-8859-1, then re-encoded as ISO-8859-1 and finally saved?
    • After decoding as ISO-8859-1, the printed characters are obviously garbled
    • The stored result, however, is fine, because ISO-8859-1 is a single-byte encoding that uses the entire space of a single byte. Any byte stream can therefore be decoded as ISO-8859-1 and re-encoded without losing data. (MySQL's Latin1 encoding, historically its default, is one such case)
  • Sample code for the GBK and ISO-8859-1 round trip
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    
    public class Demo02 {
    
        public static void main(String[] args) throws Exception {
            String content = "hello 世界";
            String path = "./test.txt";
    
            // Encode the string with GBK and save it
            byte[] gbkBytes = content.getBytes("GBK");
            saveByteArray(path, gbkBytes, 0, gbkBytes.length);
    
            // Read the file's bytes back
            byte[] bytes = new byte[1024];
            int len = readByteArray(path, bytes);
            System.out.println("len = " + len);
    
            // Decode with GBK and print
            String gbkStr = new String(bytes, 0, len, "GBK");
            System.out.println("gbkStr = " + gbkStr);
    
            // Decode with ISO-8859-1 and print (garbled on screen)
            String isoStr = new String(bytes, 0, len, "iso-8859-1");
            System.out.println("isoStr = " + isoStr);
    
            // Re-encode that string with ISO-8859-1 and save it; the bytes are unchanged
            byte[] isoBytes = isoStr.getBytes("iso-8859-1");
            saveByteArray(path, isoBytes, 0, isoBytes.length);
        }
    
        public static void saveByteArray(String path, byte[] bytes, int off, int len) throws Exception {
            // try-with-resources closes the stream even if the write fails
            try (FileOutputStream fos = new FileOutputStream(path)) {
                fos.write(bytes, off, len); // honor off and len instead of writing the whole array
            }
        }
    
        public static int readByteArray(String path, byte[] bytes) throws Exception {
            try (FileInputStream fis = new FileInputStream(path)) {
                return fis.read(bytes); // number of bytes actually read
            }
        }
    
    }
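
The two behaviors described in this section, lossy decoding with the wrong multi-byte charset and lossless round-tripping through ISO-8859-1, can also be checked in memory without a file. The following is a minimal sketch; `CharsetChecks`, `roundTrips` and `isWellFormed` are hypothetical names, and the availability of the GBK charset is an assumption about the running JDK.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CharsetChecks {

    // True if decoding and then re-encoding with the same charset
    // gives back exactly the original bytes.
    static boolean roundTrips(byte[] bytes, Charset charset) {
        byte[] again = new String(bytes, charset).getBytes(charset);
        return Arrays.equals(bytes, again);
    }

    // True if the bytes are well-formed in the given charset.
    // new String(...) silently substitutes U+FFFD for bad input;
    // a decoder configured with REPORT throws instead.
    static boolean isWellFormed(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "hello 世界" encoded with GBK (assumption: GBK is available)
        byte[] gbkBytes = "hello \u4e16\u754c".getBytes(Charset.forName("GBK"));

        System.out.println("ISO-8859-1 round trip: "
                + roundTrips(gbkBytes, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 round trip:      "
                + roundTrips(gbkBytes, StandardCharsets.UTF_8));
        System.out.println("well-formed UTF-8:     "
                + isWellFormed(gbkBytes, StandardCharsets.UTF_8));
    }
}
```

The ISO-8859-1 round trip preserves the bytes, while the UTF-8 round trip does not, because each malformed byte has already been replaced by U+FFFD during decoding. A strict decoder is one way to detect a mismatch early instead of storing garbled text.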
    

3. Notes on common character encodings

  • ASCII

    • American Standard Code for Information Interchange
    • 7 bits represent one character
    • 128 characters in total
  • ISO-8859-1

    • An extension of ASCII
    • 8 bits represent one character, using the whole byte
    • 256 characters in total
  • GB2312

    • GB stands for guobiao (national standard); a Chinese character set encoding
    • 2 bytes represent one Chinese character
    • 6,763 Chinese characters in total
  • GBK

    • An extension of GB2312 that can represent more characters
    • 2 bytes represent one Chinese character
    • 21,003 Chinese characters in total
  • GB18030

    • An extension of GBK, the most complete Chinese character set encoding
    • Variable-length multi-byte encoding: 1, 2 or 4 bytes represent one character
    • More than 70,000 Chinese characters in total
  • BIG5

    • Developed in Taiwan, mainly used to encode Traditional Chinese characters
    • 2 bytes represent one Chinese character
    • 13,060 Chinese characters in total
  • Unicode

    • Maintained by the Unicode Consortium in step with the ISO/IEC 10646 standard; it unifies the world's characters in a single character set
    • Originally designed so that 2 bytes represent one character
    • Aims to represent all the characters of the world
    • If only English characters are used, a fixed 2 bytes per character wastes space
  • UTF (Unicode Transformation Format)

    • Unicode Transformation Format: the concrete encodings that implement Unicode and address its wasted-space problem
    • UTF-8, UTF-16, UTF-16LE (little endian), UTF-16BE (big endian), UTF-32
  • UTF-8

    • Variable-length multi-byte encoding: 1 to 4 bytes represent one character
      • 1 byte represents a US-ASCII character
      • 2 bytes represent a character from scripts such as Latin with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, etc.
      • 3 bytes represent most other commonly used characters (CJK text, Southeast Asian scripts, some Middle Eastern scripts, etc.)
      • 4 bytes represent rarely used characters of other languages
  • UTF-8-BOM (Byte Order Mark)

    • Unicode encodings whose byte order matters use a BOM to identify it; a UTF-8-BOM file begins with EF BB BF
    • UTF-16 and UTF-32 are read 2 bytes or 4 bytes at a time, so a BOM is needed to determine the byte order
    • UTF-8 is read 1 byte at a time, so it has no byte order problem and does not need a BOM to identify endianness
    • Recommendation: use UTF-8, preferably without a BOM
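
The UTF-8 byte counts and the BOM bytes listed above can be confirmed with a few lines of Java. The following is a minimal sketch; the class `Utf8Notes` and its helpers `utf8Length` and `hasUtf8Bom` are illustrative names, and the sample characters are written as Unicode escapes so the source file's own encoding does not matter.

```java
import java.nio.charset.StandardCharsets;

class Utf8Notes {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // True if the byte array starts with the UTF-8 BOM, EF BB BF.
    static boolean hasUtf8Bom(byte[] b) {
        return b.length >= 3
                && (b[0] & 0xFF) == 0xEF
                && (b[1] & 0xFF) == 0xBB
                && (b[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        System.out.println("h       -> " + utf8Length("h"));            // US-ASCII letter
        System.out.println("U+00E9  -> " + utf8Length("\u00e9"));       // Latin e with acute
        System.out.println("U+4E16  -> " + utf8Length("\u4e16"));       // CJK character
        System.out.println("U+1F600 -> " + utf8Length("\uD83D\uDE00")); // emoji (surrogate pair)

        // U+FEFF is the BOM character; encoded as UTF-8 it becomes EF BB BF.
        byte[] withBom = "\uFEFFhi".getBytes(StandardCharsets.UTF_8);
        System.out.println("has BOM -> " + hasUtf8Bom(withBom));
    }
}
```

The four sample characters take 1, 2, 3 and 4 bytes respectively. A check like `hasUtf8Bom` is useful when reading files produced by tools that write a BOM, since Java's UTF-8 decoder keeps U+FEFF as an ordinary character at the start of the text.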

4. The relationship between Java byte streams and character streams (source code analysis)

  • Let's start with character input streams and look at java.io.FileReader. It extends java.io.InputStreamReader, and its constructors are as follows
    public FileReader(String fileName) throws FileNotFoundException {
        super(new FileInputStream(fileName));
    }
    
    public FileReader(File file) throws FileNotFoundException {
        super(new FileInputStream(file));
    }
    
    public FileReader(FileDescriptor fd) {
        super(new FileInputStream(fd));
    }
    
  • Every one of its constructors creates a new FileInputStream, and FileInputStream extends InputStream
  • FileInputStream opens the file and obtains the input byte stream through the native method open0; the code is as follows
    public FileInputStream(File file) throws FileNotFoundException {
        String name = (file != null ? file.getPath() : null);
        SecurityManager security = System.getSecurityManager();
        if (security != null) {
            security.checkRead(name);
        }
        if (name == null) {
            throw new NullPointerException();
        }
        if (file.isInvalid()) {
            throw new FileNotFoundException("Invalid file path");
        }
        fd = new FileDescriptor();
        fd.attach(this);
        path = name;
        open(name);
    }
    
    private void open(String name) throws FileNotFoundException {
        open0(name);
    }
    
    private native void open0(String name) throws FileNotFoundException;
    
  • The FileReader constructor then calls the constructor of its parent class InputStreamReader (via super), passing in the FileInputStream
  • InputStreamReader is in fact a wrapper around InputStream that enhances it with character decoding. In its constructor, it uses a StreamDecoder to decode the InputStream
  • The InputStreamReader constructor code is as follows
    public InputStreamReader(InputStream in) {
        super(in);
        try {
            sd = StreamDecoder.forInputStreamReader(in, this, (String)null); // ## check lock object
        } catch (UnsupportedEncodingException e) {
            // The default encoding should always be available
            throw new Error(e);
        }
    }
    
  • StreamDecoder's forInputStreamReader method checks whether a Charset name (e.g. UTF-8) was specified; if not, it uses the default Charset. If the system supports the resulting Charset, it returns a StreamDecoder; otherwise it throws UnsupportedEncodingException. The code is as follows
    public static StreamDecoder forInputStreamReader(InputStream var0, Object var1, String var2) throws UnsupportedEncodingException {
        String var3 = var2;
        if (var2 == null) {
            var3 = Charset.defaultCharset().name();
        }
    
        try {
            if (Charset.isSupported(var3)) {
                return new StreamDecoder(var0, var1, Charset.forName(var3));
            }
        } catch (IllegalCharsetNameException var5) {
        }
    
        throw new UnsupportedEncodingException(var3);
    }
    
  • In InputStreamReader, the StreamDecoder is assigned to the field sd, and all subsequent read-related methods simply delegate to the corresponding methods of sd, as follows
    public String getEncoding() {
        return sd.getEncoding();
    }
    
    public int read() throws IOException {
        return sd.read();
    }
    
    public int read(char cbuf[], int offset, int length) throws IOException {
        return sd.read(cbuf, offset, length);
    }
    
    public boolean ready() throws IOException {
        return sd.ready();
    }
    
    public void close() throws IOException {
        sd.close();
    }
    
  • This makes the relationship between Java byte streams and character streams clear: a character stream is a byte stream enhanced with encoding / decoding, and underneath it still reads from or writes to a byte stream
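
The same layering can be written out explicitly in user code: open a byte stream, then wrap it in an InputStreamReader with a chosen charset. The following is a minimal sketch under stated assumptions: `ReaderOverStream` and `readAll` are hypothetical names, and the demo writes to a temporary file rather than any file from the article.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReaderOverStream {

    // Reads a whole file as text by wrapping a byte stream in a
    // character stream that decodes with the given charset.
    static String readAll(Path path, Charset charset) throws IOException {
        try (InputStream in = new FileInputStream(path.toFile());
             Reader reader = new InputStreamReader(in, charset)) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("charset-demo", ".txt");
        try {
            // "hello 世界" stored as UTF-8 bytes
            Files.write(path, "hello \u4e16\u754c".getBytes(StandardCharsets.UTF_8));

            // Same bytes, two different decoders
            System.out.println(readAll(path, StandardCharsets.UTF_8));
            System.out.println(readAll(path, StandardCharsets.ISO_8859_1));
        } finally {
            Files.delete(path);
        }
    }
}
```

Passing the charset explicitly avoids depending on the platform default that the single-argument InputStreamReader constructor falls back to. The bytes on disk are identical in both reads; only the decoding step layered on top of the byte stream differs.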

Origin blog.csdn.net/alionsss/article/details/103789906