The relationship between Unicode and UTF-8

1. ASCII code

As we know, inside a computer all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this group of eight bits is called a byte. In other words, one byte can represent 256 different states, each of which can correspond to one symbol, giving 256 symbols from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. It is called ASCII and is still in use today.

ASCII defines codes for 128 characters in total. For example, the space character SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the low 7 bits of a byte; the leading bit is always set to 0.
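The two codes quoted above can be checked with a few lines of Python. This is only a small sketch; the variable names are chosen for illustration.

```python
# Look up the ASCII code of a few characters and show it as 8 binary digits.
for ch in [" ", "A"]:
    code = ord(ch)                # numeric code of the character
    bits = format(code, "08b")    # zero-padded to a full byte
    print(repr(ch), code, bits)

# Every ASCII code (0-127) fits in 7 bits, so the leading bit of the byte is 0.
all_ascii_fit = all(format(c, "08b")[0] == "0" for c in range(128))
print(all_ascii_fit)
```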

2. Non-ASCII encodings

128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. So some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French was encoded as 130 (binary 10000010). In this way, the encoding systems used in European countries could represent up to 256 symbols.

This, however, created a new problem. Different countries have different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols for 0 to 127 are the same; only the range 128 to 255 differs.
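The effect can be reproduced in Python by decoding the single byte 130 under several legacy 8-bit code pages. The text does not name the encodings it has in mind; the code pages cp437, cp862 and cp866 used below are an assumption, picked because they match the description (é, Gimel, and a Cyrillic letter).

```python
raw = bytes([130])  # the single byte 10000010

# The same byte decoded under three legacy code pages (assumed, see above).
decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, ch in decoded.items():
    print(name, ch, f"U+{ord(ch):04X}")

# Bytes 0-127 decode to the same character in all three code pages.
same_low_half = all(
    bytes([b]).decode("cp437")
    == bytes([b]).decode("cp862")
    == bytes([b]).decode("cp866")
    for b in range(128)
)
print(same_low_half)
```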

Asian languages use even more symbols; Chinese alone has as many as 100,000 characters. One byte can represent only 256 symbols, which is certainly not enough, so several bytes must be used for one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per character and can therefore represent at most 256 x 256 = 65536 symbols in theory.

Chinese encodings deserve a separate discussion and are not covered in this note. The only point to make here is that, although these encodings also use several bytes per symbol, the GB family of Chinese encodings has nothing to do with the Unicode and UTF-8 discussed below.

3. Unicode

As the previous section explained, there are many encodings in the world, and the same binary value can be interpreted as different symbols. To open a text file you must therefore know its encoding; reading it with the wrong encoding produces garbled text. Why do e-mails so often appear garbled? Because the sender and the recipient use different encodings.
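A minimal sketch of how such garbling arises: the sender encodes text one way and the receiver decodes the same bytes another way. The pairing of UTF-8 and Latin-1 here is only an illustrative assumption; any mismatched pair behaves similarly.

```python
text = "é"
data = text.encode("utf-8")        # sender writes UTF-8: bytes C3 A9
garbled = data.decode("latin-1")   # receiver wrongly assumes Latin-1
correct = data.decode("utf-8")     # receiver uses the right encoding

print(data.hex())   # c3a9
print(garbled)      # Ã©
print(correct)      # é
```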

Now imagine an encoding that includes every symbol in the world. If each symbol is given a unique code, the garbling problem disappears. This is Unicode: as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large set; its current size can hold more than one million symbols. Each symbol has a different code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严. The full symbol tables can be looked up on Unicode.org or in dedicated character tables.
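The three code points mentioned above can be looked up with Python's standard unicodedata module. This is a small sketch for checking the correspondence, not a replacement for the official tables.

```python
import unicodedata

# Map each code point to its character and its official Unicode name.
for cp in (0x0639, 0x0041, 0x4E25):
    ch = chr(cp)
    print(f"U+{cp:04X}", ch, unicodedata.name(ch))
```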

4. The problem with Unicode

Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is a full 15 bits in binary (100111000100101), so representing this symbol takes at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
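The bit count is easy to verify. The sketch below only does the arithmetic; it says nothing yet about how the bytes are actually laid out in storage.

```python
cp = 0x4E25
bits = format(cp, "b")          # binary digits without leading zeros
n_bits = cp.bit_length()        # number of significant bits
min_bytes = (n_bits + 7) // 8   # whole bytes needed to hold them

print(bits, n_bits, min_bytes)  # 100111000100101 15 2
```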

This raises two serious problems. The first is how to tell Unicode apart from ASCII: how does the computer know that three bytes represent one symbol rather than three separate symbols? The second is that, as we already know, one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would be preceded by two or three bytes of zeros, which is a huge waste of storage; text files would become two or three times larger, which is unacceptable.

The consequences were: 1) several ways of storing Unicode emerged, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not spread for a long time, until the Internet appeared.

5. UTF-8

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (a character takes two or four bytes) and UTF-32 (a character takes four bytes), but they are hardly used on the Internet. To repeat, the relationship is this: UTF-8 is one implementation of Unicode.

The key feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length varies with the symbol.
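The variable length is easy to observe with Python's built-in encoder. The four sample characters below are chosen for illustration, one for each possible length.

```python
samples = ["A", "é", "严", "😀"]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X}", ch, n)
```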

The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to ASCII.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All the remaining bits hold the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
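The table translates almost directly into code. The function below is a hand-written sketch for illustration (the name utf8_encode is hypothetical); in practice Python's built-in str.encode("utf-8") does this job. The sketch does not reject the surrogate range D800-DFFF, which a real encoder would refuse.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the table row by row."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    if cp <= 0x7F:      # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:     # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0b111111)])
    if cp <= 0xFFFF:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0b111111),
                      0b10000000 | (cp & 0b111111)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0b11110000 | (cp >> 18),
                  0b10000000 | ((cp >> 12) & 0b111111),
                  0b10000000 | ((cp >> 6) & 0b111111),
                  0b10000000 | (cp & 0b111111)])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x4E25).hex())  # e4b8a5
```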

With this table, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte is a character on its own; if the first bit is 1, the number of consecutive 1s indicates how many bytes the current character occupies.
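The reading rule can be sketched as follows. Both function names are hypothetical, and the sketch assumes well-formed input: it rejects an invalid leading byte but does not check that the following bytes really start with 10.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if first_byte & 0b10000000 == 0:
        return 1                      # leading 0: a single-byte character
    count = 0
    mask = 0b10000000
    while first_byte & mask:          # count the consecutive leading 1s
        count += 1
        mask >>= 1
    if count == 1 or count > 4:       # 10xxxxxx is a continuation byte
        raise ValueError("not a valid first byte")
    return count


def split_utf8(data: bytes) -> list:
    """Cut a UTF-8 byte string into one chunk per character."""
    chunks, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks


print(split_utf8("A严".encode("utf-8")))  # [b'A', b'\xe4\xb8\xa5']
```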

Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding is carried out.

The Unicode code of 严 is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严, fill the x positions of the format from back to front, padding the leftover positions with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
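The same fill-in procedure can be traced step by step with bit strings. This sketch handles only the three-byte case of the example; padding the code to 16 bits is equivalent to filling from the back and putting 0 in the leftover positions.

```python
cp = 0x4E25
bits = format(cp, "016b")   # 16 slots: 4 + 6 + 6 x positions in the format
high, mid, low = bits[:4], bits[4:10], bits[10:]

# Attach the fixed prefixes 1110, 10 and 10 to the three groups.
encoded_bits = ["1110" + high, "10" + mid, "10" + low]
encoded = bytes(int(b, 2) for b in encoded_bits)

print(" ".join(encoded_bits))   # 11100100 10111000 10100101
print(encoded.hex().upper())    # E4B8A5
```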

6. Converting between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
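As one possible illustration, Python's built-in functions convert in both directions; other languages offer similar facilities.

```python
code_point = 0x4E25

# Unicode code point -> UTF-8 bytes
utf8_bytes = chr(code_point).encode("utf-8")
print(utf8_bytes.hex().upper())   # E4B8A5

# UTF-8 bytes -> Unicode code point
back = ord(utf8_bytes.decode("utf-8"))
print(f"{back:04X}")              # 4E25
```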

On Windows, one of the simplest ways to convert is to use the built-in Notepad program, notepad.exe. After opening a file, choose the "Save As" command in the "File" menu; a dialog box pops up with an "Encoding" drop-down list at the bottom.


It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode here refers to the UCS-2 encoding used by notepad.exe, which stores a character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian is the counterpart of the previous option, in big endian format. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding discussed in the previous section.

After choosing the encoding, click the "Save" button and the file's encoding is converted immediately.
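The same conversion can be scripted instead of done by hand in Notepad. The sketch below is an assumption about how one might do it in Python; the helper name convert_file and the file names are hypothetical, and the demo runs on a temporary file.

```python
import os
import tempfile


def convert_file(src, dst, src_encoding, dst_encoding):
    """Re-save a text file under a different encoding."""
    with open(src, "r", encoding=src_encoding, newline="") as f:
        text = f.read()
    with open(dst, "w", encoding=dst_encoding, newline="") as f:
        f.write(text)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "gb.txt")
    dst = os.path.join(tmp, "utf8.txt")
    with open(src, "wb") as f:
        f.write(b"\xd1\xcf")              # the character 严 in GB2312
    convert_file(src, dst, "gb2312", "utf-8")
    with open(dst, "rb") as f:
        converted = f.read()

print(converted.hex().upper())            # E4B8A5
```

Passing newline="" keeps the original line endings untouched, so only the character encoding changes.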

7. Little endian and Big endian

As mentioned in the previous section, the UCS-2 format can store Unicode codes directly (for code points up to 0xFFFF). Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes, one byte 4E and the other 25. When stored, if 4E comes first and 25 second, that is the Big endian way; if 25 comes first and 4E second, that is the Little endian way.
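The two byte orders can be shown directly with int.to_bytes. For a code point up to 0xFFFF such as this one, the result matches Python's utf-16-be and utf-16-le codecs.

```python
cp = 0x4E25

big = cp.to_bytes(2, byteorder="big")        # high byte 4E first
little = cp.to_bytes(2, byteorder="little")  # low byte 25 first

print(big.hex().upper())     # 4E25
print(little.hex().upper())  # 254E
```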

These two odd names come from "Gulliver's Travels" by Jonathan Swift. In the book, a civil war breaks out in Lilliput over whether an egg should be cracked open at the big end (Big-endian) or at the little end (Little-endian). Over this matter six wars were fought; one emperor lost his life and another lost his throne.

Storing the first byte first is the "big end" way (Big endian); storing the second byte first is the "little end" way (Little endian).

Naturally, a question arises: how does a computer know which of the two ways a file is encoded?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. This character is named "zero width no-break space" and is represented by FEFF. It is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file uses the big endian way; if the first two bytes are FF FE, the file uses the little endian way.
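This check is straightforward to sketch in Python, whose codecs module provides the two marks as constants. The function name detect_byte_order is hypothetical, and the sketch only covers the two-byte case described above (it does not try to distinguish UTF-32, whose little endian mark also begins with FF FE).

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of a file from its first two bytes."""
    if data[:2] == codecs.BOM_UTF16_BE:   # FE FF
        return "big endian"
    if data[:2] == codecs.BOM_UTF16_LE:   # FF FE
        return "little endian"
    return "unknown"


print(detect_byte_order(bytes.fromhex("FEFF4E25")))  # big endian
print(detect_byte_order(bytes.fromhex("FFFE254E")))  # little endian
```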

8. Examples

Here is an example.

Open Notepad (notepad.exe), create a new text file whose content is the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the "Hex" function of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is encoded as the two bytes D1 CF, which is the GB2312 encoding of 严. This also implies that GB2312 is stored with the high byte first (big endian).

2) Unicode: the encoding is the four bytes FF FE 25 4E, where FF FE indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the encoding is the four bytes FE FF 4E 25, where FE FF indicates big endian storage.

4) UTF-8: the encoding is the six bytes EF BB BF E4 B8 A5. The first three bytes EF BB BF indicate that this is UTF-8, and the last three, E4 B8 A5, are the actual encoding of 严, stored in the same order as the encoding itself.
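The four byte sequences above can be reproduced without Notepad. The sketch below assumes that Python's gb2312, utf-16-le, utf-16-be and utf-8-sig codecs correspond to the four Notepad options; the dictionary labels are only descriptive.

```python
import codecs

ch = "严"

# Bytes that each Notepad option would write for this one character.
saved = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),   # utf-8-sig prepends EF BB BF
}

for name, data in saved.items():
    print(name, " ".join(f"{b:02X}" for b in data))
```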

(End)

 

Reference: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

Origin www.cnblogs.com/still-smile/p/11595731.html