On the development of various encodings

Disclaimer: This is an original article by the blogger, released under the CC 4.0 BY-SA license. If you reproduce it, please attach the original source link and this statement.

Original link: https://blog.csdn.net/qq_41076931/article/details/102742163

1) Unicode

Unicode is a universal character set. It assigns a unique code to every character of every writing system in the world, so that text can be used across languages and across platforms.

Unicode is only a set of symbols. It specifies a binary code for each symbol, but it does not specify how that code should be stored. For example, the Chinese character 严 ("strict") has the hexadecimal Unicode code 4E25, which is 15 bits long in binary (100111000100101), so representing this symbol takes at least 2 bytes. Symbols with larger codes may require 3 or 4 bytes.
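The arithmetic in this example can be checked with a few Python built-ins. This is only an illustrative sketch; the variable names are not from the article.

```python
# Inspect the Unicode code of the character 严 (U+4E25).
ch = "\u4e25"                      # the character 严
code_point = ord(ch)               # integer value of its Unicode code

bits = bin(code_point)[2:]         # binary digits without the "0b" prefix
min_bytes = (code_point.bit_length() + 7) // 8   # whole bytes needed to hold the bits

print(hex(code_point))             # 0x4e25
print(bits)                        # 100111000100101
print(code_point.bit_length())     # 15
print(min_bytes)                   # 2
```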

2) UTF-8 encoding

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (a character takes two or four bytes) and UTF-32 (a character takes four bytes), but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the implementations of Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length depends on the symbol. The UTF-8 encoding rules are simple; there are only two: a) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, UTF-8 and ASCII are therefore identical. b) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.
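The two rules can be sketched as a small hand-written encoder. This is a minimal illustration, not a replacement for Python's built-in codec; the function name `utf8_encode` is hypothetical, the range thresholds (0x80, 0x800, 0x10000) follow from how many payload bits each length offers, and the sketch does not reject surrogate code points as a strict encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code by hand, following rules a) and b)."""
    if code_point < 0x80:                 # rule a): 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                # 110xxxxx 10xxxxxx
        n = 2
    elif code_point < 0x10000:            # 1110xxxx 10xxxxxx 10xxxxxx
        n = 3
    elif code_point < 0x110000:           # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        n = 4
    else:
        raise ValueError("code point outside the Unicode range")

    # Following bytes: prefix 10 plus 6 payload bits each, taken from the low end.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n leading 1 bits, then a 0, then the remaining payload bits.
    lead_marker = (0xFF << (8 - n)) & 0xFF
    first = lead_marker | code_point
    return bytes([first] + tail[::-1])


print(utf8_encode(0x4E25).hex())          # e4b8a5
```

Comparing the result with `"严".encode("utf-8")` is a quick way to confirm that the sketch agrees with the standard codec.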
Reading UTF-8 bytes follows these rules:
If the most significant bit (bit 8) of a byte is 0, the byte is an ASCII character (00 - 7F). As this shows, every ASCII-encoded text is also valid UTF-8.
If a byte begins with 11, the number of consecutive 1s gives the number of bytes in the character. For example, 110xxxxx is the first byte of a two-byte UTF-8 character.
If a byte begins with 10, it is not the first byte of a character; scan backward through the preceding bytes to find the first byte of the current character.
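These three rules translate directly into bit tests. The helper names below (`byte_kind`, `lead_length`, `char_start`) are hypothetical and chosen only for this sketch, which assumes well-formed UTF-8 input.

```python
def byte_kind(b: int) -> str:
    """Classify a single UTF-8 byte by its leading bits."""
    if b & 0b10000000 == 0:
        return "ascii"            # 0xxxxxxx
    if b & 0b11000000 == 0b10000000:
        return "continuation"     # 10xxxxxx
    return "lead"                 # 11xxxxxx


def lead_length(b: int) -> int:
    """Count the consecutive leading 1 bits of a first byte."""
    n = 0
    while b & 0b10000000:
        n += 1
        b = (b << 1) & 0xFF
    return n


def char_start(data: bytes, i: int) -> int:
    """Step backward from index i until a byte that is not 10xxxxxx is found."""
    while i > 0 and byte_kind(data[i]) == "continuation":
        i -= 1
    return i


data = "A\u4e25".encode("utf-8")          # "A严" as bytes: 41 e4 b8 a5
print([byte_kind(b) for b in data])
print(lead_length(data[1]))               # length of the character starting at index 1
print(char_start(data, 3))                # index of the first byte of the character at index 3
```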

Note: UTF-8 is a way of storing text in files. Encoding and decoding take place only at the boundary, that is, where the various input/output streams do their work.
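A minimal sketch of that boundary, assuming an in-memory `io.BytesIO` buffer as a stand-in for a real file: the program works with a string, and bytes appear only when the text passes through a stream.

```python
import io

text = "\u4e25A"                   # "严A": in memory this is a str, not bytes
buffer = io.BytesIO()              # stands in for a file on disk

# Writing: the text stream encodes str -> bytes at the boundary.
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(text)
writer.flush()
raw = buffer.getvalue()            # the bytes that would be stored

# Reading: the text stream decodes bytes -> str at the boundary.
reader = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8")
restored = reader.read()

print(len(text), len(raw))         # 2 characters, 4 stored bytes
print(raw.hex())                   # e4b8a541
```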
As we know, inside a computer all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this unit is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. It is called ASCII and is still in use today. ASCII defines codes for 128 characters; for example, the space (SPACE) is 32 (binary 00100000) and the capital letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the last 7 bits of a byte, and the leading bit is always set to 0.
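The figures in these two paragraphs can be reproduced with a short sketch; the variable names are illustrative only.

```python
states_per_byte = 2 ** 8           # eight bits, two states each
ascii_symbols = 2 ** 7             # ASCII uses only the last 7 bits

space_bits = format(ord(" "), "08b")
a_bits = format(ord("A"), "08b")

# Every byte of ASCII-encoded text has its leading bit set to 0.
ascii_bytes = "ASCII text".encode("ascii")
high_bit_clear = all(b < 0x80 for b in ascii_bytes)

print(states_per_byte, ascii_symbols)      # 256 128
print(ord(" "), space_bits)                # 32 00100000
print(ord("A"), a_bits)                    # 65 01000001
print(high_bit_clear)                      # True
```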

3) Non-ASCII encodings
128 symbols are enough to encode English, but not enough to represent other languages. For example, French places diacritical marks above some letters, which ASCII cannot represent. As a result, some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is coded as 130 (binary 10000010). In this way, the encoding systems used in European countries could represent up to 256 symbols.

However, a new problem arises here. Different countries use different letters, so although they all use 256-symbol encodings, the letters those codes represent are not the same. For example, 130 stands for é in the French encoding, for the letter Gimel (ג) in the Hebrew encoding, and for yet another symbol in the Russian encoding. In all of these encodings, however, the symbols for 0-127 are the same; only the range 128-255 differs.
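The article does not name the specific encodings it has in mind. As an assumption for illustration, the sketch below uses three DOS code pages that ship with Python (`cp850` for Western Europe, `cp862` for Hebrew, `cp866` for Russian) and decodes the same byte 130 with each.

```python
raw = bytes([130])                 # the single byte 10000010

# Assumed code pages; the article does not say which ones it refers to.
codecs_to_try = ("cp850", "cp862", "cp866")
meanings = {name: raw.decode(name) for name in codecs_to_try}
for name, symbol in meanings.items():
    print(name, symbol, f"U+{ord(symbol):04X}")

# Bytes 0-127 decode identically in all three.
low_half = bytes(range(128))
same_low_half = (
    low_half.decode("cp850") == low_half.decode("cp862") == low_half.decode("cp866")
)
print(same_low_half)
```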

As for the scripts of Asian countries, they use even more symbols; there are roughly 100,000 Chinese characters alone. One byte can represent only 256 symbols, which is definitely not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per character, so in theory it can represent up to 256 x 256 = 65536 symbols.

Chinese encoding needs a dedicated discussion, which this note does not go into. It is only pointed out here that although these encodings also use multiple bytes per symbol, the GB family of Chinese character encodings is unrelated to the Unicode and UTF-8 discussed below.

4) Unicode
As mentioned in the previous section, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; reading it with the wrong encoding produces garbled text. Why is e-mail so often garbled? Because the sender and the recipient use different encodings. Imagine an encoding that includes all the symbols of the world, with each symbol given a unique code; the garbled-text problem would disappear. This is Unicode: as its name suggests, it is an encoding for all symbols. Unicode is of course a very large set; at its present size it can hold more than one million symbols. Each symbol has a different code: for example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严. The specific symbol tables can be looked up at Unicode.org or in a dedicated Chinese character table.
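A short sketch of both points: decoding bytes with the wrong encoding produces garbled text, and each Unicode symbol has its own code. Latin-1 is used here as the "wrong" encoding only because it never raises an error; that choice is an assumption for illustration.

```python
original = "\u4e25\u683c"                  # "严格"
stored = original.encode("utf-8")          # the sender writes UTF-8 bytes

garbled = stored.decode("latin-1")         # the recipient guesses the wrong encoding
correct = stored.decode("utf-8")           # the recipient uses the right encoding

print(repr(garbled))                       # six unrelated Latin-1 characters
print(correct == original)                 # True

# The three codes quoted in the text.
for ch in ("\u0639", "\u0041", "\u4e25"):
    print(f"U+{ord(ch):04X}", ch)
```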

5) Unicode issues
It should be noted that Unicode is only a set of symbols. It specifies the binary code of each symbol, but it does not specify how that binary code should be stored.

For example, the Chinese character 严 has the hexadecimal Unicode code 4E25, which is 15 bits long in binary (100111000100101), so representing this symbol takes at least 2 bytes. Symbols with larger codes may require 3 or 4 bytes.

Two serious problems arise here. The first is: how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that, as we already know, one byte is enough for an English letter. If Unicode required every symbol to be represented by three or four bytes, each English letter would have to be preceded by two or three bytes of zeros. That is a huge waste of storage: text files would grow to three or four times their size, which is unacceptable.
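The storage cost of a fixed width can be illustrated with UTF-32, which uses four bytes for every symbol. UTF-32 is used here only as a stand-in for the hypothetical "three or four bytes for everything" scheme described above.

```python
text = "hello world"

ascii_size = len(text.encode("ascii"))        # 1 byte per letter
fixed4_size = len(text.encode("utf-32-be"))   # 4 bytes per letter, no byte order mark

padded_a = "A".encode("utf-32-be")            # three zero bytes, then 0x41

print(ascii_size, fixed4_size)                # 11 44
print(padded_a.hex())                         # 00000041
```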

These problems had two consequences: 1) Several ways of storing Unicode appeared, which means there are many different binary formats that can represent Unicode. 2) Unicode could not be popularized for a long time, until the Internet emerged.
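As an illustration of the first consequence, the same Unicode code U+4E25 has a different byte layout in each storage format. The dictionary name `layouts` is just for this sketch.

```python
ch = "\u4e25"                      # 严, Unicode code 4E25

formats = ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")
layouts = {name: ch.encode(name).hex() for name in formats}

for name, hex_bytes in layouts.items():
    print(f"{name:10} {hex_bytes}")
```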

6) UTF-16

UTF-16 is usually treated as a fixed-width encoding with 2 bytes per character (strictly speaking, characters outside the Basic Multilingual Plane take 4 bytes). Advantage: it is simple. Drawback: Western text swells to 200% of its single-byte size. In addition, the boundaries between characters are hard to find in a byte stream and easy to split in the wrong place, because the prefix problem was not considered in its design. Huffman coding handles this well.
UTF-8 is a variable-width encoding with lengths from 1 to 4 bytes (1 to 3 bytes for characters in the Basic Multilingual Plane). Advantage: thanks to its well-designed prefixes, character boundaries are easy to find and splitting errors do not occur. Drawback: CJK characters expand to 150% of their two-byte size.

UTF-16 uses two bytes for nearly every character. This is simple and convenient to represent, but it has a drawback: a large share of those two-byte characters could be represented with a single byte, so storage doubles. With network bandwidth still limited today, this also increases network traffic without any need. UTF-8 instead uses a variable-length technique in which each code range has its own codeword length. Depending on its type, a character takes 1 to 4 bytes (the original design allowed up to 6).
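The 200% and 150% figures can be checked by encoding a Western and a CJK sample both ways. The helper `sizes` and the sample strings are assumptions made for this sketch.

```python
def sizes(text: str):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-be"))


western = "encoding"
cjk = "\u7f16\u7801\u95ee\u9898"           # "编码问题"

western_utf8, western_utf16 = sizes(western)
cjk_utf8, cjk_utf16 = sizes(cjk)

print(western_utf8, western_utf16)         # 8 16: UTF-16 is 200% of UTF-8
print(cjk_utf8, cjk_utf16)                 # 12 8: UTF-8 is 150% of UTF-16
```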

Speaking of UTF, Unicode (Universal Code) must be mentioned. ISO set out to create a new super-language dictionary through which all the world's languages can be translated into one another. One can imagine how complex this dictionary is; for the detailed Unicode specification, refer to the corresponding documentation. Unicode is the foundation of Java and XML. The following describes how Unicode is stored in a computer.
UTF-16 defines how Unicode characters are stored and accessed in a computer. It represents the Unicode transformation format with two-byte units; two bytes are 16 bits, hence the name UTF-16. For characters in the Basic Multilingual Plane, every character takes exactly two bytes regardless of which character it is, while characters outside that plane take two such units. Representing characters this way is very convenient and greatly simplifies string operations, which is an important reason why Java uses UTF-16 as its in-memory character storage format.
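A minimal sketch of how UTF-16 turns a Unicode code into 16-bit units, including the two-unit (surrogate pair) case for codes above FFFF. The function name `utf16_units` is hypothetical, and the sketch does not validate its input.

```python
def utf16_units(code_point: int) -> list:
    """Split one Unicode code into its 16-bit UTF-16 units."""
    if code_point < 0x10000:
        return [code_point]                 # one unit, two bytes
    offset = code_point - 0x10000           # 20 bits remain
    high = 0xD800 | (offset >> 10)          # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)         # bottom 10 bits
    return [high, low]                      # two units, four bytes


def units_to_bytes(units: list) -> bytes:
    """Serialize units in big-endian byte order."""
    return b"".join(u.to_bytes(2, "big") for u in units)


print([hex(u) for u in utf16_units(0x4E25)])     # ['0x4e25']
print([hex(u) for u in utf16_units(0x1F600)])    # ['0xd83d', '0xde00']
```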

7) GBK

Its full name is "Chinese Internal Code Extension Specification". It is a new Chinese character internal code specification drawn up by the State Bureau of Technical Supervision for Windows 95. It was created to extend GB2312 with more characters. Its encoding range is 8140 ~ FEFE (excluding XX7F), for a total of 23,940 code positions, and it can represent 21,003 Chinese characters. Its encoding is compatible with GB2312: characters encoded with GB2312 can be decoded with GBK without producing garbled text.
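The compatibility claim can be tried with Python's built-in `gb2312` and `gbk` codecs. The traditional character 嚴 is chosen here as an assumed example of a character that GBK adds beyond GB2312.

```python
text = "\u4e25\u683c"                      # "严格", both characters are in GB2312

gb2312_bytes = text.encode("gb2312")
decoded_by_gbk = gb2312_bytes.decode("gbk")        # GBK reads GB2312 bytes unchanged
same_bytes = gb2312_bytes == text.encode("gbk")

extra = "\u56b4"                           # 嚴, assumed example of a GBK-only character
try:
    extra.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
gbk_extra = extra.encode("gbk")

print(decoded_by_gbk == text, same_bytes)  # True True
print(in_gb2312, len(gbk_extra))           # False 2
```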

8) GB2312

Its full name is "Chinese Coded Character Set for Information Interchange, Basic Set". It is a double-byte encoding whose overall range is A1-F7. Within it, A1-A9 is the symbol area, containing 682 symbols in total, and B0-F7 is the Chinese character area, containing 6763 characters.
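The two areas can be probed with Python's `gb2312` codec. This sketch assumes the ranges above refer to the first byte and that the second byte runs from A1 to FE; the count it prints depends on the codec's tables, so no exact figure is asserted here.

```python
hanzi = "\u4e25".encode("gb2312")          # 严, expected in the Chinese character area
symbol = "\u3001".encode("gb2312")         # 、 (ideographic comma), expected in the symbol area

print(hanzi.hex(), symbol.hex())


def decodable(first: int, second: int) -> bool:
    """True if the codec assigns a character to this byte pair."""
    try:
        bytes([first, second]).decode("gb2312")
        return True
    except UnicodeDecodeError:
        return False


# Count assigned positions in the Chinese character area.
# Assumption: second byte ranges over A1..FE.
hanzi_count = sum(
    decodable(first, second)
    for first in range(0xB0, 0xF8)         # B0..F7
    for second in range(0xA1, 0xFF)        # A1..FE
)
print(hanzi_count)                         # compare with the 6763 quoted above
```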

9) GB18030

Its full name is "Chinese Coded Character Set for Information Interchange". It is a mandatory national standard in China. A character may be encoded in one, two, or four bytes, and the encoding is compatible with GB2312. Although it is the national standard, it is not widely used in practical systems.
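The three lengths can be observed with Python's `gb18030` codec. The sample characters are assumptions picked for this sketch: an ASCII letter, a GB2312 character, and an emoji from outside the Basic Multilingual Plane.

```python
samples = {
    "ascii": "A",
    "gb2312_hanzi": "\u4e25",      # 严
    "emoji": "\U0001F600",         # outside the Basic Multilingual Plane
}
lengths = {name: len(ch.encode("gb18030")) for name, ch in samples.items()}
print(lengths)

# Compatibility: a GB2312 character keeps the same bytes in GB18030.
compatible = "\u4e25".encode("gb18030") == "\u4e25".encode("gb2312")
print(compatible)
```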

10) ISO-8859-1
128 characters are obviously not enough, so the ISO organization developed a series of standards that extend ASCII: ISO-8859-1 ~ ISO-8859-15. Of these, ISO-8859-1 covers the characters of most Western European languages and is the most widely used. ISO-8859-1 is still a single-byte encoding and can represent 256 characters in total.
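A small sketch of the single-byte property using Python's `iso-8859-1` codec: each of the 256 byte values decodes to exactly one character, and anything outside that set cannot be encoded. Note that é is byte 233 here, not the 130 quoted earlier for a different encoding.

```python
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-1")   # one character per byte

e_acute = "\u00e9".encode("iso-8859-1")    # é as a single byte

try:
    "\u4e25".encode("iso-8859-1")          # 严 is not a Western European character
    hanzi_fits = True
except UnicodeEncodeError:
    hanzi_fits = False

print(len(decoded))                        # 256
print(e_acute[0])                          # 233
print(hanzi_fits)                          # False
```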

11) ASCII code

Anyone who has studied computing knows ASCII. It has 128 codes in total, represented by the lower 7 bits of a byte. Codes 0 to 31 are control characters such as line feed and carriage return (delete is code 127); codes 32 to 126 are printable characters, which can be typed on a keyboard and displayed.
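The split between control and printable codes can be listed with `str.isprintable`, which treats the ASCII space as printable. The list names are illustrative.

```python
control = [i for i in range(128) if not chr(i).isprintable()]
printable = [i for i in range(128) if chr(i).isprintable()]

print(len(control), control[0], control[-1])        # 33 0 127
print(len(printable), printable[0], printable[-1])  # 95 32 126
print("".join(chr(i) for i in printable))
```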
