Character encoding

ASCII code:

Inside the computer, all data is represented in binary. Each binary digit (bit) has two states, 0 and 1, so 8 bits combine into 2^8 = 256 states; a group of 8 bits is called a byte. One byte can therefore represent 256 different states, each of which can be assigned to a symbol, giving 256 possible symbols, from 00000000 to 11111111.
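The byte arithmetic above can be checked directly; a minimal sketch using Python's built-in integer formatting:

```python
# A byte has 8 bits, each with 2 states, so 2**8 = 256 combinations.
states = 2 ** 8
print(states)             # 256 distinct values in one byte

# The full range runs from 00000000 to 11111111:
print(format(0, "08b"))   # 00000000
print(format(255, "08b")) # 11111111
```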

ASCII code: in the 1960s, the United States developed a character encoding that fixed the mapping between English characters and bit patterns; it is called ASCII. ASCII defines 128 characters in total: for example, the space character "SPACE" is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control characters that cannot be printed) occupy only the low 7 bits of a byte; the most significant bit is uniformly set to 0.
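The ASCII values quoted above can be verified with Python's built-in `ord` and `format` (a minimal sketch, no external libraries):

```python
# ord() returns a character's numeric code; for ASCII characters it
# matches the table described above.
print(ord(" "))                  # space -> 32
print(ord("A"))                  # uppercase A -> 65
print(format(ord("A"), "08b"))   # 01000001: the leading bit is 0 for ASCII
```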

Disadvantages:

It cannot represent all characters.

The same code represents different characters in different encodings: for example, 130 represents é in a French encoding, while in a Hebrew encoding it represents the letter Gimel (ג).
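The general problem, that a single byte value maps to different letters under different legacy encodings, can be demonstrated in Python. (The byte 0xE2 is chosen here purely for illustration: in Latin-1 it is â, while in ISO-8859-8, a Hebrew encoding, it is Gimel.)

```python
raw = bytes([0xE2])              # one and the same byte...
print(raw.decode("latin-1"))     # ...is 'â' in a Western European encoding
print(raw.decode("iso8859_8"))   # ...but 'ג' (Gimel) in a Hebrew encoding
```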

Unicode encoding:

Unicode code: it can contain all of the characters in the world, but with a fixed-length representation some storage space is wasted; in the basic Unicode code table each character takes two bytes.

Garbled text (mojibake): there are many encodings in the world, and the same binary data can be interpreted as different symbols under different encodings. Therefore, to open a text file you must know its encoding; reading it with the wrong encoding produces garbled text.
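Garbling can be reproduced deliberately by decoding bytes with the wrong codec:

```python
data = "中".encode("utf-8")      # the character 中 as 3 bytes: E4 B8 AD
print(data.decode("utf-8"))      # correct codec -> 中
print(data.decode("latin-1"))    # wrong codec -> mojibake ('ä¸' plus a soft hyphen)
```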

Unicode: a code that includes all the symbols in the world. Each symbol is given a unique code, so text kept in Unicode is never garbled.

The disadvantage of Unicode: Unicode only assigns a binary code (a code point) to each symbol; it does not specify how that code should be stored. As a result, Unicode cannot be distinguished from ASCII: the computer cannot tell whether three bytes represent one symbol or three separate symbols. In addition, we know that one byte is enough for English letters; if Unicode uniformly required three or four bytes per symbol, every letter would carry two or three bytes that are all zero, which is a great waste of storage space.
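The distinction between a code point and its storage can be sketched in Python: the code point is a fixed number, while the byte layout depends entirely on the encoding chosen.

```python
ch = "中"
print(hex(ord(ch)))             # code point U+4E2D -- a number, no storage rule
print(ch.encode("utf-8"))       # 3 bytes in UTF-8
print(ch.encode("utf-16-be"))   # 2 bytes in UTF-16 (big-endian, no BOM)
print(ch.encode("utf-32-be"))   # 4 bytes in fixed-width UTF-32
print("A".encode("utf-32-be"))  # ASCII 'A' padded with three zero bytes: the waste
```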

UTF-8:

UTF-8: an optimization of Unicode storage; it can represent all the characters in the world, using one byte for ASCII letters and three bytes for common Chinese characters.

UTF-8 is the most widely used implementation of Unicode on the Internet.

UTF-8 is a variable-length encoding. It can use one to six bytes per symbol in the original design (modern standards limit this to four), and the byte length varies with the symbol.
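The varying lengths are easy to observe; this sketch encodes one character of each width from one to four bytes:

```python
# One character each of 1, 2, 3, and 4 UTF-8 bytes.
for ch in ("A", "é", "中", "𝄞"):
    encoded = ch.encode("utf-8")
    print(ch, len(encoded), encoded)
```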

UTF-8 encoding rules:

1) For a single-byte UTF-8 encoding, the most significant bit of the byte is 0 and the remaining 7 bits encode the character (equivalent to ASCII).

2) For a multi-byte UTF-8 encoding of n bytes, the first n bits of the first byte are 1, the (n+1)-th bit is 0, and the remaining bits of that byte encode the character. In every byte after the first, the highest two bits are "10" and the remaining 6 bits encode the character.
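The two rules can be checked by hand-encoding a code point and comparing with Python's built-in codec. A sketch for the three-byte case (n = 3), assuming code points in U+0800..U+FFFF; `utf8_3byte` is a hypothetical helper name:

```python
def utf8_3byte(cp: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF per rule 2 with n = 3."""
    assert 0x0800 <= cp <= 0xFFFF
    b1 = 0b11100000 | (cp >> 12)           # first byte: 1110xxxx
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)   # continuation byte: 10xxxxxx
    b3 = 0b10000000 | (cp & 0x3F)          # continuation byte: 10xxxxxx
    return bytes([b1, b2, b3])

cp = ord("中")                   # U+4E2D
print(utf8_3byte(cp))            # hand-built bytes: E4 B8 AD
print("中".encode("utf-8"))      # Python's codec agrees
```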

Origin www.cnblogs.com/zhangze-lifetime/p/11598406.html