[Coding] Completely understand the relationship between ASCII, Unicode, and UTF-8

Inside a computer, every character is ultimately represented as some combination of the binary digits 0 and 1. We therefore need a specification that enumerates the characters and states which combination of 0s and 1s corresponds to each one. Such a specification is a character set.

 

ASCII

ASCII stands for "American Standard Code for Information Interchange". First published in the 1960s, this specification defines the binary codes for 128 characters. Since 128 = 2^7, seven bits are enough to represent all of them, so each ASCII character occupies only 1 byte (1 Byte = 8 bits).

For example, the ASCII code for the uppercase letter A is 01000001.

If a text file stores 100 characters in ASCII encoding, its content is exactly 100 bytes.
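These numbers are easy to verify from a Python prompt (a minimal sketch, assuming any Python 3 interpreter; it is not part of the original article):

print(ord("A"))                      # 65
print(format(ord("A"), "08b"))       # 01000001 -- the ASCII code for 'A'

hundred = "A" * 100                  # 100 ASCII characters
print(len(hundred.encode("ascii")))  # 100 -- exactly 100 bytes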

ASCII covers only Latin letters, digits, and a set of symbols (including control characters such as line breaks and tabs). But the world's languages contain many more characters that computer systems need to process (for example, tens of thousands of Chinese characters), so a character set far larger than ASCII had to be developed, one large enough to include every character in the world. That character set is Unicode.

 

Unicode

Unicode is the world's largest character set. Compared with ASCII, it greatly expands the encoding width to 16 or even 32 bits, which means it can theoretically hold up to 2^32 ≈ 4.3 billion characters. Unicode contains letters and symbols from almost every language, CJK ideographs, emoji, and more. For example, the Chinese character "我" ("I") corresponds to the Unicode value 0110001000010001 in binary, or 6211 in hexadecimal. Most text transferred and displayed on the Internet today uses Unicode. Its lowest 7 bits are fully compatible with ASCII: written as 16-bit Unicode, the capital letter A is 0000000001000001.
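The code points above can be checked in Python, for example:

print(hex(ord("我")))             # 0x6211
print(format(ord("我"), "016b"))  # 0110001000010001
print(format(ord("A"), "016b"))   # 0000000001000001 -- low 7 bits match ASCII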

 

UTF-8

Unicode's coverage is very broad, but if we used 16 or even 32 bits to store and transmit every symbol, then for Western users whose text is mostly ASCII, a large share of those bits would be nothing but zero padding, wasting storage and bandwidth. For this reason, UTF-8 was invented: it uses a variable number of bytes to represent the characters in Unicode.
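The padding cost is easy to see by comparing byte counts for an ASCII-only string (a small Python sketch; "utf-32-be" is used here just to avoid the byte-order mark):

s = "hello"
print(len(s.encode("utf-8")))      # 5  -- one byte per ASCII character
print(len(s.encode("utf-32-be")))  # 20 -- four bytes each, mostly zero padding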

How does UTF-8 achieve this?

1. A character that can be represented in a single byte has its first bit set to 0; the remaining 7 bits are the character's ASCII code.

2. A character that needs n ≥ 2 bytes has a first byte that begins with n consecutive 1s (one for each byte the character occupies) followed by a 0; every subsequent byte begins with 10.

As shown below (table taken from Ruan Yifeng's blog):

Unicode code point range (hex)  | UTF-8 encoding (binary)
--------------------------------+----------------------------------------
0000 0000 - 0000 007F           | 0xxxxxxx
0000 0080 - 0000 07FF           | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF           | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF           | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
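The ranges in the table translate directly into a simple check. The sketch below (a hypothetical Python helper, not from the original post; Python's own encoder does this for you) maps a code point to the number of UTF-8 bytes it needs:

def utf8_length(code_point: int) -> int:
    # Byte count per the code point ranges in the table above.
    if code_point <= 0x7F:
        return 1
    elif code_point <= 0x7FF:
        return 2
    elif code_point <= 0xFFFF:
        return 3
    elif code_point <= 0x10FFFF:
        return 4
    raise ValueError("not a valid Unicode code point")

print(utf8_length(ord("A")))       # 1
print(utf8_length(ord("我")))      # 3
print(len("我".encode("utf-8")))   # 3 -- matches the built-in encoder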


The "I" coding method UTF-8 to operate as follows:
"I" Unicode code is 6211, corresponding to table 3 byte code section. Inserted in the corresponding position control bit Unicode binary code, to give
1110 0110 10 001000 10 010001
written in hexadecimal it is E68891, thus obtaining a UTF-8 encoding.
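The same steps can be reproduced with a few bit operations (a minimal Python sketch covering only the three-byte case):

cp = ord("我")                             # 0x6211
byte1 = 0b11100000 | (cp >> 12)            # 1110xxxx <- top 4 bits
byte2 = 0b10000000 | ((cp >> 6) & 0x3F)    # 10xxxxxx <- middle 6 bits
byte3 = 0b10000000 | (cp & 0x3F)           # 10xxxxxx <- low 6 bits
print(bytes([byte1, byte2, byte3]).hex())  # e68891
print("我".encode("utf-8").hex())          # e68891 -- same as the built-in encoder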

Of course, for the average developer the encoding details are of secondary importance; knowing roughly how it works is enough.


So when an HTML page includes a <meta charset="utf-8"> tag, the computer knows that whenever it finds the bytes E6 88 91 in the page's byte stream, it should treat them as the character "我". But if the page is not actually encoded in UTF-8 but in, say, GB2312 (which has nothing to do with UTF-8), you get garbled text: the computer misinterprets the meaning of those bytes.
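The mismatch can be reproduced directly (a small Python sketch; the exact garbled output depends on the decoder, so errors="replace" is used):

data = "我".encode("utf-8")                     # b'\xe6\x88\x91'
print(data.decode("utf-8"))                     # 我 -- correct
print(data.decode("gb2312", errors="replace"))  # mojibake, not 我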




Source: www.cnblogs.com/leegent/p/11097484.html