Bytes and Characters in Java

While recently reading up on Java IO, I found that my understanding of bytes and characters was not good enough, so I am writing this summary as a record.

1. Bytes

A byte (Byte) is the basic unit of measurement for computer data storage. A bit is the smallest unit of information, and 8 bits make up one byte, so a single byte can distinguish 256 integer values (0 to 255). The byte is therefore the basic unit for storing and transmitting computer data. The characters discussed later are also stored as bytes, and different character encodings use different numbers of bytes per character.

In Java, besides this storage meaning, byte is also a primitive data type. It occupies one byte in memory and represents an integer in the range -128 to 127:

byte a = -128;
byte b = 127;

In general, byte has two meanings in Java:

  • A unit of storage
  • A primitive data type representing an integer from -128 to 127 (see the small example after this list)
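
To make the second meaning concrete, here is a small sketch (the class name is just for illustration) showing that a value outside the byte range needs an explicit cast and wraps around modulo 256:

import java.util.Arrays;

public class ByteDemo {
    public static void main(String[] args) {
        byte a = -128;          // smallest value a byte can hold
        byte b = 127;           // largest value a byte can hold
        byte c = (byte) 200;    // 200 does not fit; only the low 8 bits are kept, giving 200 - 256 = -56
        System.out.println(a + ", " + b + ", " + c);  // prints: -128, 127, -56
    }
}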

2. Characters

What the computer stores underneath is bytes; characters are the symbols designed for display. All the text, digits, and symbols shown on the screen are characters decoded from bytes. A character, then, is a symbol meant for people to read: stored bytes are turned into something humans can understand. The core of the relationship between storage and display is the mapping between bytes and characters, and this mapping is what we usually call an encoding.
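
As a minimal sketch of this mapping, the same text can be encoded to bytes and decoded back with an explicit charset (UTF-8 is just the chosen example, and the class name is for illustration):

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class MappingDemo {
    public static void main(String[] args) {
        String text = "Hi";
        // encoding: characters -> bytes, according to the chosen charset
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(bytes));           // [72, 105]
        // decoding: bytes -> characters, using the same charset
        String decoded = new String(bytes, StandardCharsets.UTF_8);
        System.out.println(decoded);                           // Hi
    }
}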

2.1 The origin of encoding

Why do we need encodings at all? As noted above, the computer stores data as bytes, and a single byte can distinguish 256 integers. The natural idea is to define 256 states and map them to 256 characters. But human languages have far more than 256 symbols, so people combine several bytes to represent one symbol, and the encoding problem becomes the problem of how bytes are combined and interpreted.

2.2 Common encoding formats

Today there are many encoding formats, such as the common ASCII, ISO-8859-1, GB2312, GBK, UTF-8, UTF-16, and so on.

ASCII is the most basic encoding. Standard ASCII defines 128 characters using the low 7 bits of a byte; it covers the symbols of English text, but the set of characters it can represent is very limited.

ISO-8859-1 is an extension of ASCII. It uses all 8 bits of a byte, so it can represent 256 characters; it is backward compatible with ASCII and contains most Western European symbols.

GB2312 is a double-byte encoding, meaning it uses two bytes to represent a symbol; it contains 6,763 Chinese characters.

GBK is an extension of GB2312. It is also a double-byte encoding, can represent 21,003 Chinese characters, and is backward compatible with GB2312.

...

As encoding standards multiplied, more and more countries defined their own encodings for their own languages. This flourishing of standards made information exchange in the Internet era very inconvenient: exchanging data between systems that used different encodings required different decoding schemes, otherwise the result was garbled text. So a universal character encoding scheme, Unicode, was standardized to accommodate all of the world's scripts and symbols. Unicode is a character set: it assigns a number to every human character, but how that number is stored as bytes is left to the implementation. The popular implementations are UTF-8 and UTF-16; there is also UTF-32.

UTF-32 stores every Unicode character in 4 bytes (32 bits). Decoding is efficient, but space is wasted.

UTF-8 is a variable-length encoding that stores each character in 1 to 4 bytes (the original design allowed up to 6). English characters take one byte, so it is backward compatible with ASCII; Chinese characters generally take three bytes; and so on, which saves a fair amount of space.

UTF-16 sits in between. Some characters use two bytes and the rest use four. Since even ASCII characters take two bytes, UTF-16 is not compatible with ASCII.

In everyday use UTF-8 is the most common choice, both because it is backward compatible with ASCII and because it saves space to a certain extent.
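
As a small illustration of these differences, the sketch below prints how many bytes one ASCII character and one Chinese character occupy under a few encodings (UTF-16BE is used to avoid the byte-order mark; GBK availability depends on the JDK's installed charsets):

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LengthDemo {
    public static void main(String[] args) {
        String ascii = "A";
        String cjk = "中";
        System.out.println(ascii.getBytes(StandardCharsets.UTF_8).length);    // 1
        System.out.println(ascii.getBytes(StandardCharsets.UTF_16BE).length); // 2
        System.out.println(cjk.getBytes(StandardCharsets.UTF_8).length);      // 3
        System.out.println(cjk.getBytes(StandardCharsets.UTF_16BE).length);   // 2
        System.out.println(cjk.getBytes(Charset.forName("GBK")).length);      // 2
    }
}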

2.3 Encoding and decoding in Java IO streams

How does Java encode and decode? Encoding and decoding mainly happen during the conversion between characters and bytes. When characters are displayed, the bytes in memory are decoded into symbols; when a file is stored or data is transmitted, characters are encoded into bytes.

Decoding

Decoding is the process of converting bytes into characters, which is what happens when we read a file or receive data from the network.

In Java we read character data from a file with FileReader, which extends InputStreamReader. Inside InputStreamReader, the decoding is done by a StreamDecoder.

// InputStreamReader.java
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import sun.nio.cs.StreamDecoder;

public class InputStreamReader extends Reader {
    // decoder that turns bytes into characters according to the specified charset
    private final StreamDecoder sd;

    // the decoder (and thus the encoding scheme) is specified through dec
    public InputStreamReader(InputStream in, CharsetDecoder dec) {
        super(in);
        if (dec == null)
            throw new NullPointerException("charset decoder");
        sd = StreamDecoder.forInputStreamReader(in, this, dec);
    }

    // read one character, returned as an int (4 bytes)
    public int read() throws IOException {
        return sd.read();
    }

}

From the InputStreamReader source above we can see that:

  • When reading from the input stream, the byte-to-character conversion is done by StreamDecoder
  • The encoding scheme can be set through the constructor
  • A read character is returned as an int, i.e. 4 bytes

The code above is only part of the source. The encoding scheme can be specified in several ways: the constructor can take a charset name as a String, a Charset object, or a CharsetDecoder as shown above. If no encoding scheme is passed, the default charset of the current environment is used.
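
As a minimal usage sketch, the following reads a text file and decodes its bytes as UTF-8 (the file name demo.txt is hypothetical):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class ReadDemo {
    public static void main(String[] args) throws IOException {
        // the charset passed to InputStreamReader controls how bytes are decoded into characters
        try (Reader reader = new InputStreamReader(new FileInputStream("demo.txt"), StandardCharsets.UTF_8)) {
            int c;
            // read() returns one decoded character as an int, or -1 at end of stream
            while ((c = reader.read()) != -1) {
                System.out.print((char) c);
            }
        }
    }
}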

Encoding

Encoding is similar: when storing a file or writing data to the network, we convert characters into bytes before writing them.

In Java we write character data to a file with FileWriter, which extends OutputStreamWriter. Inside OutputStreamWriter, the encoding is done by a StreamEncoder.

// OutputStreamWriter.java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import sun.nio.cs.StreamEncoder;

public class OutputStreamWriter extends Writer {
    // encoder that turns characters into bytes according to the specified charset
    private final StreamEncoder se;

    // the encoder (and thus the encoding scheme) is specified through enc
    public OutputStreamWriter(OutputStream out, CharsetEncoder enc) {
        super(out);
        if (enc == null)
            throw new NullPointerException("charset encoder");
        se = StreamEncoder.forOutputStreamWriter(out, this, enc);
    }

    // write one character, passed in as an int
    public void write(int c) throws IOException {
        se.write(c);
    }
}

From the source we can see that:

  • When writing to the output stream, the character-to-byte conversion is done by StreamEncoder
  • The encoding scheme can be specified through the constructor
  • Characters are written as int values
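
As a matching usage sketch, the following writes a string to a file encoded as UTF-8 (the file name demo.txt is hypothetical):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteDemo {
    public static void main(String[] args) throws IOException {
        // the charset passed to OutputStreamWriter controls how characters are encoded into bytes
        try (Writer writer = new OutputStreamWriter(new FileOutputStream("demo.txt"), StandardCharsets.UTF_8)) {
            writer.write("Java bytes and characters: 字节与字符");  // mixed ASCII and Chinese text
        }
    }
}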


Source: www.cnblogs.com/zhengshuangxi/p/11057972.html