Garbled Chinese text when reading with Java byte streams

1. Everything is bytes

  All file data (text, images, video, and so on) is stored as binary digits grouped into bytes, and it is transmitted the same way. Any file can therefore be moved through a byte stream. When working with streams, always keep in mind that whatever kind of stream object you use, the data underneath is binary. Chinese text is no exception: it too is stored byte by byte. If everything is just bytes, why does garbled text still appear?

2. Bytes and characters

Byte: one byte equals eight bits, the equivalent of eight on/off switches.
Character: internally, Java represents text as UTF-16, so a char is 2 bytes (a character outside the Basic Multilingual Plane needs two chars, that is 4 bytes). Files, however, are usually encoded as UTF-8, which is a variable-length encoding: ASCII characters take one byte, most Chinese characters take three bytes, and rarer ones such as those in CJK Extension B take four bytes. A reader must therefore consume exactly as many bytes as the current character occupies, otherwise the data is damaged. In short, how many bytes make up one character depends on the encoding.
  The key question: when reading a file, how does the system know how many bytes belong to one character? In UTF-8 the leading bits of the first byte say so: 0xxxxxxx is a one-byte character, 110xxxxx starts a two-byte sequence, 1110xxxx a three-byte sequence, and 11110xxx a four-byte sequence. Every following byte of a sequence has the form 10xxxxxx.
  So the conclusion is that handling Chinese text with a byte stream does not necessarily produce garbled output. It depends on how you process the bytes.
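The lead-byte rule can be checked directly in Java. The following is a minimal sketch, not part of the original post: the class name Utf8LengthDemo and the helper sequenceLength are made up for illustration, and the sample characters are written as Unicode escapes so that the example does not depend on the encoding of the source file.

```java
import java.nio.charset.StandardCharsets;

public class Utf8LengthDemo {
    /**
     * Hypothetical helper: returns how many bytes a UTF-8 sequence occupies,
     * judging only by the leading bits of its first byte. Returns -1 for a
     * continuation byte (10xxxxxx) or an invalid lead byte.
     */
    public static int sequenceLength(byte first) {
        int b = first & 0xFF;              // treat the byte as unsigned
        if ((b & 0x80) == 0x00) return 1;  // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2;  // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3;  // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4;  // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        String[] samples = {
            "A",                                     // ASCII letter
            "\u4e2d",                                // U+4E2D, a common Chinese character
            new String(Character.toChars(0x20000))   // U+20000, in CJK Extension B
        };
        for (String s : samples) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            System.out.println("UTF-8 bytes: " + utf8.length
                    + ", length from lead byte: " + sequenceLength(utf8[0])
                    + ", Java chars: " + s.length());
        }
    }
}
```

The three samples occupy 1, 3, and 4 bytes in UTF-8 respectively, while the last one needs two Java chars because it lies outside the Basic Multilingual Plane.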

3. When garbling does not occur: copy all the bytes, then read

  When you read a file through a byte stream and write those bytes straight into another file, nothing is garbled. You are only copying a sequence of 0s and 1s from one file to another, and no decoding happens along the way. Once all the bytes of the Chinese text have been written, whatever later opens the file decodes them as a whole and shows the Chinese characters correctly.

Preparation:

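import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
/**
 * @author RuiMing Lin
 * @date 2020-03-13 22:09
 */
public class Demo3 {
    public static void main(String[] args) throws Exception{
        File fis = new File("fis.txt");
        fis.createNewFile();    // create an empty file fis.txt (the source)
        File fos = new File("fos.txt");
        fos.createNewFile();    // create an empty file fos.txt (the copy target)
        // try-with-resources closes the stream even if write() throws
        try (FileOutputStream fileOutputStream = new FileOutputStream(fis)) {
            // Write "我爱你,中国!" to fis.txt, explicitly as UTF-8.
            // The comma and exclamation mark are full-width, so every character is 3 bytes.
            fileOutputStream.write("我爱你,中国!".getBytes(StandardCharsets.UTF_8));
        }
    }
}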

Copy fis.txt to fos.txt:

import java.io.FileInputStream;
import java.io.FileOutputStream;
/**
 * @author RuiMing Lin
 * @date 2020-03-13 22:09
 */
public class Demo3 {
    public static void main(String[] args) throws Exception{
        // try-with-resources closes both streams automatically
        try (FileInputStream fileInputStream = new FileInputStream("fis.txt");
             FileOutputStream fileOutputStream = new FileOutputStream("fos.txt")) {
            int b;  // read() returns the next byte as 0-255, or -1 at end of file
            while ((b = fileInputStream.read()) != -1){
                fileOutputStream.write(b);    // write one byte at a time; writing several bytes at once also works
            }
        }
    }
}

Result: fos.txt contains the same text as fis.txt, with no garbled characters.
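Copying several bytes at a time is just as safe, even when the buffer size cuts through the middle of a character, because the bytes are never decoded on the way. The sketch below illustrates this with in-memory streams instead of fis.txt and fos.txt; the class name BufferedCopyDemo, the copy helper, and the 4-byte buffer size are assumptions chosen for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BufferedCopyDemo {
    /** Hypothetical helper: copies every byte from in to out through a buffer, without decoding. */
    public static void copy(InputStream in, OutputStream out, int bufferSize) throws IOException {
        byte[] buffer = new byte[bufferSize];
        int len;                            // number of bytes actually read in this round
        while ((len = in.read(buffer)) != -1) {
            out.write(buffer, 0, len);      // write only the bytes that were read
        }
    }

    public static void main(String[] args) throws IOException {
        // The sample sentence from this post, written as Unicode escapes
        // (full-width comma U+FF0C and exclamation mark U+FF01 included).
        String text = "\u6211\u7231\u4f60\uff0c\u4e2d\u56fd\uff01";
        byte[] original = text.getBytes(StandardCharsets.UTF_8);

        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // A 4-byte buffer deliberately splits the 3-byte characters across reads.
        copy(new ByteArrayInputStream(original), sink, 4);

        // The copy is byte-for-byte identical, so decoding it afterwards gives the original text.
        System.out.println(Arrays.equals(original, sink.toByteArray()));                        // true
        System.out.println(new String(sink.toByteArray(), StandardCharsets.UTF_8).equals(text)); // true
    }
}
```

With real files, the same copy method could be called with a FileInputStream and a FileOutputStream.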

4. When garbling occurs: decoding while reading

When you decode a byte stream into text chunk by chunk while you are still reading it, garbled output is very likely!

import java.io.FileInputStream;
import java.nio.charset.StandardCharsets;
/**
 * @author RuiMing Lin
 * @date 2020-03-13 22:21
 */
public class Demo4 {
    public static void main(String[] args) throws Exception{
        try (FileInputStream fileInputStream = new FileInputStream("fis.txt")) {
            int len;
            byte[] bytes = new byte[3];     // assume each Chinese character takes 3 bytes
            while ((len = fileInputStream.read(bytes)) != -1){
                // decode only the len bytes just read; print() adds no line break
                System.out.print(new String(bytes, 0, len, StandardCharsets.UTF_8));
            }
        }
    }
}

Output:

我爱你,中国!

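  At first glance nothing is garbled. That is only because every character and punctuation mark in "我爱你,中国!" takes exactly three bytes in UTF-8 (the comma and exclamation mark must be the full-width Chinese forms), so each 3-byte read happens to contain one whole character.
  Change the text so that its characters no longer all take three bytes, for example by adding a one-byte ASCII letter or an ASCII comma, and the 3-byte chunks stop lining up with character boundaries. Every chunk that cuts a character in half is decoded on its own as invalid UTF-8, and the output is garbled.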
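The misalignment can be reproduced without any file. This is a minimal sketch under stated assumptions: the class name ChunkDecodeDemo and the helper decodeInChunks are invented for the example, the bytes come from an in-memory array rather than fis.txt, and the test text simply puts one ASCII letter in front of three Chinese characters.

```java
import java.nio.charset.StandardCharsets;

public class ChunkDecodeDemo {
    /**
     * Hypothetical helper that mimics the flawed approach: split the bytes
     * into fixed-size chunks and decode every chunk independently.
     */
    public static String decodeInChunks(byte[] data, int chunkSize) {
        StringBuilder sb = new StringBuilder();
        for (int offset = 0; offset < data.length; offset += chunkSize) {
            int len = Math.min(chunkSize, data.length - offset);
            sb.append(new String(data, offset, len, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aligned = "\u6211\u7231\u4f60";   // three characters, 3 bytes each
        String shifted = "a" + aligned;          // one 1-byte letter shifts every boundary

        String ok = decodeInChunks(aligned.getBytes(StandardCharsets.UTF_8), 3);
        String broken = decodeInChunks(shifted.getBytes(StandardCharsets.UTF_8), 3);

        System.out.println(ok.equals(aligned));        // true: chunks match character boundaries
        System.out.println(broken.equals(shifted));    // false: characters were cut in half
        System.out.println(broken.contains("\uFFFD")); // true: U+FFFD replacement characters appear
    }
}
```

Java's String constructor replaces malformed byte sequences with the replacement character U+FFFD, which is what shows up as garbled symbols on screen.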

5. Why character streams do not produce garbled text

  From the UTF-8 encoding rules described above, a character stream decides how many bytes to read from the leading bits of the first byte. When it meets the pattern 1110xxxx 10xxxxxx 10xxxxxx, it knows the character occupies three bytes and does not stop until all three have been read; only then does it decode them into one Chinese character. This works as long as the stream is given the encoding the file was actually written in.
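In Java, InputStreamReader is the standard bridge from a byte stream to a character stream. The sketch below is an illustration rather than the original author's code: the class name ReaderDecodeDemo and the readAll helper are made up, and an in-memory byte array stands in for fis.txt. It reads the same kind of mixed text that breaks 3-byte chunk decoding, using a buffer of 3 chars.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReaderDecodeDemo {
    /** Hypothetical helper: reads all text from a byte source through a UTF-8 character stream. */
    public static String readAll(InputStream in, int charBufferSize) throws IOException {
        StringBuilder sb = new StringBuilder();
        // The reader decodes complete UTF-8 sequences and holds back incomplete ones
        // until the remaining bytes arrive.
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buffer = new char[charBufferSize];
            int len;                          // number of chars (not bytes) read in this round
            while ((len = reader.read(buffer)) != -1) {
                sb.append(buffer, 0, len);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "a\u6211\u7231\u4f60";  // one ASCII letter followed by three Chinese characters
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(readAll(new ByteArrayInputStream(utf8), 3).equals(text)); // true
    }
}
```

For a real file, the same helper could be given a FileInputStream opened on fis.txt. Passing the charset explicitly avoids relying on the platform default.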

Please point out any mistakes! If this post helped, a like is appreciated. Discussion in the comments or by private message is welcome.

Origin blog.csdn.net/Orange_minger/article/details/104850611