A beginner's understanding of ASCII, Unicode, and UTF-8 encoding

       The earliest computers were designed with 8 bits to a byte, so the largest integer a single byte can represent is 255 (binary 11111111 = decimal 255). Representing a larger integer requires more bytes: two bytes can represent at most 65535, and four bytes at most 4294967295.

       At first, only 128 characters (codes 0 to 127) were defined: uppercase and lowercase English letters, digits, and some symbols. This table is called the ASCII encoding. For example, the code of the capital letter A is 65, and the code of the lowercase letter z is 122. No other languages were covered.
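The ASCII codes above are easy to check in Python, where `ord()` returns a character's code and `chr()` goes the other way:

```python
# ASCII maps each character to a number in the range 0..127.
print(ord('A'))  # 65
print(ord('z'))  # 122
print(chr(65))   # A
```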

       Unicode unifies all languages into a single character set, so that mixing text from different languages no longer produces garbled output.

       The Unicode standard is constantly evolving, but the most common form represents a character with two bytes (four bytes for rarer characters outside the Basic Multilingual Plane). Modern operating systems and most programming languages support Unicode directly.

       The difference between ASCII and Unicode: an ASCII character occupies 1 byte, while a Unicode character usually occupies 2 bytes.

       UTF-8 is a "variable-length encoding": it encodes a Unicode character into 1 to 4 bytes depending on the size of its code point (the original design allowed up to 6 bytes, but the current standard, RFC 3629, limits it to 4). Common English letters are encoded as 1 byte, Chinese characters usually as 3 bytes, and only rare characters need 4 bytes. If the text being transferred contains a lot of English characters, encoding it as UTF-8 saves space.
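The variable lengths are easy to verify by encoding a few characters and counting the resulting bytes:

```python
# UTF-8 byte lengths for characters of increasing code point.
print(len('A'.encode('utf-8')))   # 1 byte: ASCII letter
print(len('中'.encode('utf-8')))  # 3 bytes: common Chinese character
print(len('😀'.encode('utf-8')))  # 4 bytes: character outside the BMP
```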

       The following summarizes how character encodings are commonly used in computer systems:

       In memory, text is handled uniformly as Unicode; when it needs to be saved to the hard disk or transmitted, it is converted to UTF-8.

       When editing with Notepad, the UTF-8 bytes read from the file are decoded into Unicode characters in memory. After editing, the Unicode characters are encoded back to UTF-8 and written to the file when saving.
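The same read-decode, edit, encode-write cycle can be sketched in Python, where `open(..., encoding='utf-8')` performs the conversions automatically (the file name here is just an example):

```python
import os
import tempfile

path = os.path.join(tempfile.gettempdir(), 'demo.txt')

# Saving: Unicode text in memory is encoded to UTF-8 bytes on disk.
with open(path, 'w', encoding='utf-8') as f:
    f.write('中文 text')

# Reading: UTF-8 bytes on disk are decoded back into Unicode characters.
with open(path, 'r', encoding='utf-8') as f:
    print(f.read())  # 中文 text

os.remove(path)
```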


When browsing the web, the server encodes the dynamically generated Unicode content as UTF-8 before transmitting it to the browser. That is why webpage source code often contains something like <meta charset="UTF-8" />, which indicates that the page uses UTF-8 encoding.
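At its core, the server-side step is just a string-to-bytes conversion; a minimal sketch (the HTML fragment is illustrative):

```python
# Dynamically generated page content is a Unicode string in memory...
html = '<meta charset="UTF-8" /><p>你好</p>'

# ...but only bytes can travel over the network, so it is encoded first.
body = html.encode('utf-8')
print(type(body).__name__)  # bytes
```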

