Character encoding: Unicode, UTF-8 and ASCII

ASCII encoding

Because computers were invented by Americans, at first only 128 characters (codes 0 to 127) were encoded: uppercase and lowercase English letters, digits, and some symbols. This encoding table is called ASCII. For example, the code for uppercase A is 65, and the code for lowercase z is 122.
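
As a quick check of these numbers, here is a minimal Python sketch. Python is used only for illustration, since the text above does not name a language. The variable names are made up for this example; ord() and chr() are Python built-ins.

```python
# Look up the codes of a few characters with the built-in ord().
code_upper_a = ord("A")
code_lower_z = ord("z")

# chr() goes the other way: from a code back to the character.
char_from_65 = chr(65)

# Every ASCII code fits in 7 bits, so it is never larger than 127.
all_ascii = all(ord(c) <= 127 for c in "Hello, world! 123")

print(code_upper_a, code_lower_z, char_from_65, all_ascii)
```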

But one byte is clearly not enough to handle Chinese. At least two bytes are needed, and the encoding must not conflict with ASCII, so China developed the GB2312 encoding to cover Chinese characters.

As you can imagine, there are hundreds of languages in the world. Japan encoded Japanese in Shift_JIS, and South Korea encoded Korean in Euc-kr. Each country has its own standard, so conflicts are inevitable, and text that mixes several languages ends up displayed as garbled characters. Hence Unicode came into being. Unicode unifies all languages into a single encoding, so the garbling problem goes away.
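
The garbling described above can be reproduced in a few lines. The following is only a sketch in Python, assuming its standard gb2312 and shift_jis codecs are available. The sample text and variable names are chosen for this example.

```python
text = "中文"

# Encode with the Chinese GB2312 standard: two bytes per character.
gb_bytes = text.encode("gb2312")

# Decode the same bytes with the wrong table (Japanese Shift_JIS).
# errors="replace" keeps the sketch from raising on invalid byte sequences.
garbled = gb_bytes.decode("shift_jis", errors="replace")

# Decoding with the matching table restores the original text.
restored = gb_bytes.decode("gb2312")

print(len(gb_bytes), repr(garbled), restored)
```

The bytes themselves never change; only the table used to interpret them does, which is why the same file can look fine on one machine and garbled on another.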

Unicode encoding

The Unicode standard is constantly evolving, but the most common form uses two bytes to represent a character (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.

Now, take a look at the difference between ASCII encoding and Unicode encoding: ASCII encoding is 1 byte, while Unicode encoding is usually 2 bytes.

The Chinese character 中 is beyond the range of ASCII; its Unicode encoding is decimal 20013, binary 01001110 00101101.

The letter A is encoded in ASCII as decimal 65, binary 01000001.

As you can guess, to express the ASCII encoding of A in Unicode you only need to pad it with zeros in front. The Unicode encoding of A is therefore 00000000 01000001.
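
The numbers above can be checked with a short Python sketch. It assumes the character in question is 中 (the one whose code is 20013). The names a_bits and zhong_bits are made up for this example.

```python
# Print the code and its 16-bit binary form, split into two bytes.
for ch in ("A", "中"):
    code_point = ord(ch)
    bits = format(code_point, "016b")
    print(ch, code_point, bits[:8], bits[8:])

# Keep the 16-bit strings for comparison with the text.
a_bits = format(ord("A"), "016b")
zhong_bits = format(ord("中"), "016b")

# Big-endian UTF-16 yields exactly these two bytes for each character.
a_two_bytes = "A".encode("utf-16-be")
zhong_two_bytes = "中".encode("utf-16-be")
```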

However, if the text you write is basically all in English, using Unicode encoding requires twice as much storage space as ASCII encoding, which is very uneconomical in terms of storage and transmission.
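
To make the cost concrete, here is a rough Python sketch. It uses the utf-16-be codec as a stand-in for the two-byte form described above. That choice is an assumption of this example, as is the sample sentence.

```python
english = "The quick brown fox jumps over the lazy dog"

# One byte per character in ASCII.
ascii_size = len(english.encode("ascii"))

# Two bytes per character in the two-byte form (big-endian, no byte-order mark).
two_byte_size = len(english.encode("utf-16-be"))

print(ascii_size, two_byte_size)
```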

Therefore, in the spirit of economy, a variable-length way of encoding Unicode was introduced: UTF-8.

UTF-8 encoding

UTF-8 encodes each Unicode character into 1 to 4 bytes, depending on the size of its code number (the original design allowed up to 6 bytes, but the current standard limits it to 4). Common English letters are encoded into 1 byte, Chinese characters usually into 3 bytes, and only very rare characters into 4 bytes. If the text you are transferring contains a lot of English characters, encoding it in UTF-8 saves space.
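
A small Python sketch can show the variable length directly. The three sample characters are picked for this example. The last one, U+20BB7, is a rarely used character that Python's UTF-8 codec encodes in 4 bytes.

```python
# An English letter, a common Chinese character, and a rare character.
samples = ["A", "中", "\U00020BB7"]

# Number of bytes each character takes once encoded as UTF-8.
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, size in utf8_lengths.items():
    print(repr(ch), size, ch.encode("utf-8").hex(" "))
```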

UTF-8 encoding has the added benefit that ASCII encoding can actually be seen as part of UTF-8 encoding, so a lot of legacy software that only supports ASCII encoding can continue to work under UTF-8 encoding.
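
This compatibility is easy to verify. The sketch below is in Python, with sample strings invented for this example. It also shows that the relationship only goes one way.

```python
ascii_text = "plain ASCII text"

# Pure ASCII text produces identical bytes under both encodings.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Bytes written by an ASCII-only program can be read as UTF-8 unchanged.
legacy_bytes = b"legacy output"
decoded = legacy_bytes.decode("utf-8")

# The reverse does not hold: UTF-8 bytes of a Chinese character are not ASCII.
try:
    "中".encode("utf-8").decode("ascii")
    ascii_can_read_chinese = True
except UnicodeDecodeError:
    ascii_can_read_chinese = False

print(same_bytes, decoded, ascii_can_read_chinese)
```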

 

In computer memory, Unicode is used uniformly; when text needs to be saved to disk or transmitted, it is converted to UTF-8.

When editing with Notepad, the UTF-8 characters read from the file are converted to Unicode characters and held in memory. When you save after editing, the Unicode is converted back to UTF-8 and written to the file.
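
The same read, edit, save cycle can be sketched in Python. This does not show how Notepad itself is implemented. The file name and contents are hypothetical, and a temporary directory is used so that nothing real is overwritten.

```python
import os
import tempfile

# Hypothetical file, created here only so the sketch is self-contained.
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "wb") as f:
    f.write("中文 note".encode("utf-8"))

# Reading: UTF-8 bytes on disk become a Unicode string in memory.
with open(path, encoding="utf-8") as f:
    content = f.read()

# Editing happens on the in-memory string.
content += " (edited)"

# Saving: the string is converted back to UTF-8 bytes.
with open(path, "w", encoding="utf-8") as f:
    f.write(content)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as f:
    saved_bytes = f.read()

print(content, len(saved_bytes))
```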

 

When browsing the web, the server converts the dynamically generated Unicode content to UTF-8 before sending it to the browser.

That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.
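
The exchange can be imitated without any real server or browser. In the Python sketch below, the page content and the headers dictionary are hypothetical stand-ins. The only point is that both sides must agree on the charset.

```python
# Server side: the page exists as a Unicode string in memory.
page = '<html><head><meta charset="UTF-8" /></head><body>你好, world</body></html>'

# Before sending, the server converts it to UTF-8 bytes and declares the charset.
body = page.encode("utf-8")
headers = {"Content-Type": "text/html; charset=UTF-8"}

# Browser side: read the declared charset and decode the bytes with it.
charset = headers["Content-Type"].split("charset=")[1]
rendered = body.decode(charset)

print(charset, rendered == page)
```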
