String and encoding


 

1. How do computers that can only process numbers save characters?  

 

      Because computers can only process numbers, if you want to process text, you must first convert the text to numbers before you can process it.

      The earliest computers were designed with 8 bits as a byte, so the largest integer that a byte can represent is 255 (binary 11111111 = decimal 255). If you want to represent a larger integer, more bytes must be used.

     For example, the largest integer that can be represented by two bytes is 65535, and the largest integer that can be represented by 4 bytes is 4294967295.
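
These limits are easy to check in Python's interactive interpreter (a minimal sketch using nothing but built-in arithmetic):

    >>> 2 ** 8 - 1        # largest integer in one byte
    255
    >>> 2 ** 16 - 1       # largest integer in two bytes
    65535
    >>> 2 ** 32 - 1       # largest integer in four bytes
    4294967295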

 

2. Americans invented the computer, so the earliest character set, ASCII, has only 127 characters

 

Since the computer was invented in the United States, only 127 characters were encoded into the computer at first: uppercase and lowercase English letters, digits, and some symbols. This encoding table is called ASCII. For example, the code for the uppercase letter A is 65, and the code for the lowercase letter z is 122.
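
In Python, for example, the built-in ord() and chr() functions convert between a character and its numeric code, so the ASCII values above can be checked directly (a minimal sketch):

    >>> ord('A')    # code of the uppercase letter A
    65
    >>> ord('z')    # code of the lowercase letter z
    122
    >>> chr(65)     # and back from the code to the character
    'A'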

But one byte is obviously not enough to handle Chinese; at least two bytes are needed, and the encoding must not conflict with ASCII. So China formulated the GB2312 encoding to include Chinese characters.

 

3. Countries all over the world want to encode their own characters

 

As you can imagine, there are hundreds of languages in the world. Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and each country has its own standard, so conflicts inevitably arise. As a result, text that mixes multiple languages ends up displayed as garbled characters.
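
Such garbling comes from decoding bytes with the wrong codec. A rough Python sketch of the effect (the exact garbled output depends on the bytes involved):

    >>> data = '中文'.encode('gb2312')   # bytes written under the Chinese standard
    >>> data.decode('gb2312')            # read back with the right codec
    '中文'
    >>> data.decode('shift_jis')         # read back with the Japanese codec: garbled
    'ﾖﾐﾈﾄ'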

 

4. Unicode, the "Esperanto" of character sets

 

Hence, Unicode came into being. Unicode unifies all languages into one encoding, so there are no more garbled-character problems.

The Unicode standard is constantly evolving, but the most common form uses two bytes to represent a character (4 bytes for very rare characters).

Modern operating systems and most programming languages directly support Unicode.

Now, take a look at the difference between ASCII encoding and Unicode encoding: ASCII encoding is 1 byte, while Unicode encoding is usually 2 bytes.

The letter A in ASCII is 65 in decimal and 01000001 in binary;

Character 0 in ASCII is 48 in decimal and 00110000 in binary. Note that the character '0' is different from the integer 0;

The Chinese character 中 exceeds the range of ASCII encoding. Its Unicode encoding is 20013 in decimal and 01001110 00101101 in binary.

You can guess that if ASCII-encoded A is converted to Unicode, you only need to pad zeros in front. Therefore, the Unicode encoding of A is 00000000 01000001.
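
In Python you can inspect these code points directly with ord() (a minimal sketch):

    >>> ord('A')           # same value in ASCII and Unicode
    65
    >>> ord('0')           # the character '0', not the integer 0
    48
    >>> ord('中')
    20013
    >>> hex(ord('中'))
    '0x4e2d'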

 

5. UTF-8, a variable-length encoding that saves more storage space than Unicode

 

However, a new problem appears: if everything is unified into Unicode encoding, the garbled-character problem disappears, but if the text you write is basically all English, Unicode encoding needs twice as much storage space as ASCII, which is very uneconomical for storage and transmission.

Therefore, in the spirit of saving space, UTF-8, an encoding that turns Unicode into a "variable-length encoding", appeared. UTF-8 encodes a Unicode character into 1-6 bytes depending on the size of its number: commonly used English letters are encoded into 1 byte, Chinese characters usually take 3 bytes, and only very rare characters are encoded into 4-6 bytes. If the text you want to transfer contains a lot of English characters, encoding it in UTF-8 saves space:

    Character    ASCII        Unicode               UTF-8
    A            01000001     00000000 01000001     01000001
    中           x (none)     01001110 00101101     11100100 10111000 10101101

It can also be seen from the table above that UTF-8 encoding has an extra benefit: ASCII encoding can actually be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work under UTF-8 encoding.
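
The rows of the table can be reproduced in Python by encoding a str to UTF-8 bytes and decoding them back (a minimal sketch):

    >>> 'A'.encode('utf-8')          # 1 byte, identical to the ASCII encoding
    b'A'
    >>> '中'.encode('utf-8')         # 3 bytes: 11100100 10111000 10101101
    b'\xe4\xb8\xad'
    >>> b'\xe4\xb8\xad'.decode('utf-8')
    '中'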

 

6. How computers work with character encodings

 

After figuring out the relationship between ASCII, Unicode and UTF-8, we can summarize how character encodings are commonly handled in computer systems:

In memory, Unicode encoding is used uniformly; when the text needs to be saved to the hard disk or transmitted, it is converted to UTF-8 encoding.

 

When editing with Notepad, the UTF-8 text read from the file is converted to Unicode characters in memory. After editing, when you save, the Unicode is converted back to UTF-8 and written to the file.
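
The same round trip can be sketched in Python: a str in memory is Unicode, and the encoding parameter of open() controls how it is converted to bytes on disk (the file name notes.txt is just an example):

    >>> f = open('notes.txt', 'w', encoding='utf-8')
    >>> f.write('Hello, 中文')   # the str (Unicode) is encoded to UTF-8 when written
    9
    >>> f.close()
    >>> open('notes.txt', 'r', encoding='utf-8').read()   # UTF-8 bytes decoded back to a str
    'Hello, 中文'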

 

When browsing the web, the server converts the dynamically generated Unicode content to UTF-8 and transmits it to the browser.

So you see that the source code of many web pages will have information similar to <meta charset="UTF-8" />, indicating that the web page is encoded in UTF-8.
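
On the server side this boils down to an encode() call before the bytes go over the network; a minimal sketch (the HTML string here is only an example, not a real framework API):

    >>> html = '<html><head><meta charset="UTF-8" /></head><body>你好, world</body></html>'
    >>> body = html.encode('utf-8')      # the bytes actually sent to the browser
    >>> body.decode('utf-8') == html     # the browser decodes them back
    True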
