A Deeper Understanding of ASCII, Unicode, and UTF-8 Encoding

1. Why do we need encoding?

  A computer can only handle 0s and 1s (that is, two electrical states, high and low), so every letter, digit, and special character we want it to process must be translated into the 0s and 1s it understands. How that translation happens, and by what rules, is exactly the question that encoding answers: clever people invented a series of encoding schemes that map characters to numbers. The first of these was ASCII; Unicode, and the UTF-8 encoding derived from it, came later.

2. The evolution of encoding formats

  The world's first general-purpose computer was born at the University of Pennsylvania, and the earliest computer users were American, so the oldest code for information interchange was also born in the United States: ASCII (American Standard Code for Information Interchange). The essence of ASCII is a correspondence between numbers and characters. For example, the capital letter "A" corresponds to the decimal number 65 (we use decimal here for ease of understanding; octal and hexadecimal work the same way), and the decimal number 65 is represented in the computer as 01000001. A computer cannot store characters directly, only 0s and 1s, so the letter "A" is actually stored as 01000001: eight bits, i.e. one byte. The same is true of every other character; each corresponds to a number, and the standard reference is the ASCII table. Simple enough, so why did people go on to develop Unicode? Because ASCII is an American standard, it covers only A-Z, a-z, 0-9, plus control characters and some special symbols, 128 characters in total (codes 0-127). As computers spread, those 128 symbols could no longer meet people's needs, so IBM used codes 128 to 255 to supplement ASCII with accented letters, Greek letters, box-drawing symbols, and the like; this part of the encoding is called extended ASCII.
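The character-to-number correspondence described above can be checked directly. Here is a minimal sketch (not from the original post) using Python's built-in `ord()`, `chr()`, and `format()`:

```python
# Character -> number: the ASCII code of "A"
code = ord("A")
print(code)                  # 65

# The 8-bit pattern actually stored in memory for "A"
print(format(code, "08b"))   # 01000001

# Number -> character: going back the other way
print(chr(65))               # A
```

Running this confirms the mapping the ASCII table defines: "A" is 65, stored as the byte 01000001.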
  There are hundreds of languages in the world, and clearly standard ASCII plus extended ASCII still cannot meet every country's encoding needs. Chinese characters such as "汉字", for example, cannot be represented in ASCII at all; if an editor is set to ASCII encoding, Chinese text will come out garbled. So China developed the GB2312 encoding and Japan created Shift_JIS, but this meant that an application used in different regions had to bundle a different encoding for each one, which is clearly unwise. Thus Unicode came into being: one character set that assigns a code to every character, forming a single unified encoding shared by all countries and regions. A Unicode code point is usually represented in two bytes; some rarer characters need two to four bytes.
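The garbling problem the paragraph above describes comes from one character having different byte sequences under different encodings. A small sketch (illustrative, using Python's bundled `gb2312` codec) shows the same character "汉" producing different bytes under GB2312 and UTF-8, and how decoding with the wrong scheme yields nonsense:

```python
han = "汉"
gb = han.encode("gb2312")      # bytes under China's GB2312 encoding
utf8 = han.encode("utf-8")     # bytes under UTF-8
print(gb)
print(utf8)                    # b'\xe6\xb1\x89'

# Round-tripping with the right codec recovers the character...
print(gb.decode("gb2312"))     # 汉

# ...but decoding GB2312 bytes as if they were Latin-1 produces mojibake:
print(gb.decode("latin-1"))
```

This is exactly what happens when an editor opens a file with the wrong encoding setting.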
  Take the character "A" as an example again. Its ASCII code is 01000001; to represent "A" in two-byte Unicode, just pad it with leading zeros: 00000000 01000001. Now we can also represent the Chinese character "汉" in Unicode, as 01101100 01001001 (code point U+6C49). It is not hard to see that if all characters are encoded in Unicode, the garbling problem is solved. But a new problem appears: in text that mixes English letters and Chinese characters, each English letter also takes two bytes (16 bits), which obviously wastes storage space. Is there an encoding that is just as universal but saves more space? Of course: some clever people invented UTF-8. UTF-8 is a variable-length encoding. Why is it called UTF-8, and what does the 8 mean? The 8 stands for one byte, i.e. 8 bits, but it does not mean that UTF-8 represents every character with one byte. Rather, in UTF-8 the byte is the smallest unit by which a character's size can change: different characters occupy variable amounts of space, and each character may take 1 byte, or 2, 3, even 4 bytes.
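The variable-length property just described is easy to observe. A short sketch (illustrative, not from the original post): an ASCII letter stays one byte in UTF-8, an accented Latin letter takes two, and "汉" (U+6C49) takes three, so mixed text wastes no space on English letters:

```python
for ch in ("A", "é", "汉"):
    encoded = ch.encode("utf-8")
    # character, its Unicode code point, byte length in UTF-8, raw bytes
    print(ch, hex(ord(ch)), len(encoded), encoded)

# A   0x41    1 b'A'
# é   0xe9    2 b'\xc3\xa9'
# 汉  0x6c49  3 b'\xe6\xb1\x89'
```

Compare this with a fixed two-byte Unicode representation, where "A" would always cost two bytes.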



Origin: blog.csdn.net/weixin_34314962/article/details/91384919