Character encoding: the difference between ANSI, Unicode, and UTF-8

Introduction

To let computers support more languages, a character outside ASCII is usually represented by 2 bytes whose combined value lies in the range 0x8000~0xFFFF (that is, the first byte is 0x80 or above). For example, on a Chinese operating system that uses ANSI encoding, the Chinese character "中" is stored as the two bytes [0xD6, 0xD0].
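
The byte values above can be checked with a minimal Python sketch. It assumes that Python's built-in "gbk" codec is a fair stand-in for the ANSI code page of a Simplified Chinese Windows system; the variable names are only illustrative.

```python
# Encode the character "中" with GBK, used here as a stand-in for the
# ANSI code page of a Simplified Chinese Windows system (an assumption).
ch = "中"
gbk_bytes = ch.encode("gbk")

# Show each byte in hexadecimal.
hex_bytes = [f"0x{b:02X}" for b in gbk_bytes]
print(hex_bytes)  # ['0xD6', '0xD0']
```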

Different countries and regions formulated their own standards, which produced encodings such as GB2312, GBK, GB18030, Big5, and Shift_JIS. These extended encodings, which use multiple bytes to represent one character, are collectively called ANSI encoding. On a Simplified Chinese Windows system, ANSI means GBK; on a Traditional Chinese Windows system, ANSI means Big5; on a Japanese Windows system, ANSI means Shift_JIS.

Different ANSI encodings are incompatible with each other. When information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded file.
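
To make the incompatibility concrete, the following sketch decodes the same two bytes under three different code pages. The codec names "gbk", "big5", and "shift_jis" are Python's built-in codecs, used here as assumed stand-ins for the three Windows ANSI code pages mentioned above.

```python
# The two bytes that GBK uses for "中".
raw = "中".encode("gbk")

# Decode the same bytes under three different "ANSI" code pages.
# errors="replace" keeps the sketch from raising on invalid sequences.
decoded = {
    name: raw.decode(name, errors="replace")
    for name in ("gbk", "big5", "shift_jis")
}

for name, text in decoded.items():
    print(f"{name:10} -> {text!r}")
```

Only the GBK decoding gives back "中"; the other two code pages read the same bytes as different characters, which is why a file saved on one system can appear garbled on another.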

An ANSI encoding uses one byte for English (ASCII) characters and two bytes for Chinese characters; GB18030 additionally uses four bytes for some characters.
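
A small sketch of these byte lengths, again assuming Python's "gbk" and "gb18030" codecs as stand-ins for the Chinese ANSI encodings. The helper name byte_length is hypothetical.

```python
def byte_length(text: str, encoding: str) -> int:
    """Return how many bytes `text` occupies in `encoding`."""
    return len(text.encode(encoding))

print(byte_length("A", "gbk"))   # 1
print(byte_length("中", "gbk"))  # 2
# U+20000 lies outside the Basic Multilingual Plane;
# GB18030 encodes it with four bytes.
print(byte_length("\U00020000", "gb18030"))  # 4
```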

The difference

First, a small experiment:

In a folder, save a text file containing the sentence "Today's weather is very good" three times, as ANSI, Unicode, and UTF-8 txt files. Then right-click the folder and select "Search (E)...".

Searching for the word "weather" finds the ANSI-encoded and Unicode-encoded txt files, but not the UTF-8-encoded file.

The reason

1. A Chinese operating system defaults to ANSI encoding, and the txt files it generates are ANSI-encoded by default, so the ANSI file can be found.

2. Unicode is the international universal character code, so the Unicode file can be found.

3. UTF-8 is a kind of "workaround" or "bridge" encoding for transmitting Unicode text over networks (mainly web pages); it reduces the amount of data sent over the network. Because it is aimed at transmission rather than local storage, the operating system's search in this experiment does not find the UTF-8 txt file.
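
The experiment can be approximated at the byte level with the sketch below. It does not reproduce the Windows search tool; it only shows that the same keyword becomes a different byte sequence in each encoding. The Chinese sample sentence, the mapping of "ANSI" to gbk and "Unicode" to utf-16-le, and the helper naive_search are all assumptions made for illustration.

```python
# Hypothetical sample text standing in for "Today's weather is very good".
sentence = "今天天气很好"
keyword = "天气"  # "weather"

# Codec names chosen as assumptions for the three save options:
# ANSI -> gbk, Unicode -> utf-16-le, UTF-8 -> utf-8.
encodings = {"ANSI": "gbk", "Unicode": "utf-16-le", "UTF-8": "utf-8"}

# The bytes each saved file would contain (ignoring any byte order mark).
files = {label: sentence.encode(codec) for label, codec in encodings.items()}

def naive_search(data: bytes, needle: str, codec: str) -> bool:
    """Byte-level search: encode the keyword with one codec and look for it."""
    return needle.encode(codec) in data

for label, data in files.items():
    hits = {c: naive_search(data, keyword, c) for c in encodings.values()}
    print(label, data.hex(" "), hits)
```

A search tool that only tried the ANSI and UTF-16 byte patterns of the keyword would miss the UTF-8 file, as in the experiment. Whether the operating system's search actually works this way is an assumption of this sketch.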

According to the original intent of UTF-8's designers, the flow should be:

Endpoint (Unicode) -> transmission (UTF-8) -> endpoint (Unicode)

Later, however, many web developers used UTF-8 directly when developing web pages:

Endpoint (UTF-8) -> transmission (UTF-8) -> endpoint (UTF-8)

Therefore, the encoding shown in the browser is "Unicode (UTF-8)". Because the browser lists Unicode and UTF-8 side by side, many users (and even many programmers) mistakenly believe that Unicode = UTF-8. In fact, Unicode assigns each character a number, while UTF-8 is one way of storing those numbers as bytes. According to the original intent of UTF-8's designers, using UTF-8 directly when developing web pages was a misuse, and early browsers did not support parsing UTF-8. However, popular practice is a powerful force, and Microsoft had to go along with it and support UTF-8 parsing in its browser.

The catch is that the spread of UTF-8 was driven by website developers; in other words, website developers "expanded" its use. Website developers, however, have no influence over the developers of document formats. Therefore, Word documents and some internationally used document formats still use Unicode encoding rather than UTF-8.

For example, the Unicode code point of "严" is 4E25, while its UTF-8 encoding is E4B8A5. The two are different.
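
How 4E25 becomes E4B8A5 can be sketched by hand. The function below (utf8_encode_3byte, a hypothetical name) applies the three-byte UTF-8 bit pattern to the code point of the character 严 and compares the result with Python's own encoder. It is a simplified illustration that only covers the three-byte range.

```python
def utf8_encode_3byte(code_point: int) -> bytes:
    """Encode a code point in U+0800..U+FFFF with the 3-byte UTF-8 pattern
    1110xxxx 10xxxxxx 10xxxxxx. Surrogates are not checked in this sketch."""
    if not 0x0800 <= code_point <= 0xFFFF:
        raise ValueError("this sketch only handles U+0800..U+FFFF")
    b1 = 0xE0 | (code_point >> 12)          # top 4 bits
    b2 = 0x80 | ((code_point >> 6) & 0x3F)  # middle 6 bits
    b3 = 0x80 | (code_point & 0x3F)         # low 6 bits
    return bytes([b1, b2, b3])

cp = ord("严")
print(f"U+{cp:04X}")          # U+4E25
encoded = utf8_encode_3byte(cp)
print(encoded.hex().upper())  # E4B8A5

# The hand-rolled result matches Python's built-in UTF-8 encoder.
assert encoded == "严".encode("utf-8")
```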

Although the files (txt and xml) generated on Chinese and Japanese operating systems are all labeled ANSI, on a Simplified Chinese system ANSI means GB2312 (and its extension GBK), while on a Japanese system ANSI means the JIS encoding (Shift_JIS). Different ANSI encodings are incompatible with each other, so when information is exchanged internationally, text in two different languages cannot be stored in the same piece of ANSI-encoded text.

Conclusion

Using Unicode encoding for international documents (txt and xml) is the proper practice: both operating systems and browsers can "understand" Unicode. Browsers were pressured into "understanding" UTF-8, but the operating system sometimes recognizes only Unicode encoding.

The difference between Unicode and Unicode big endian: when you eat an egg, do you start from the small end or the big end? The two options differ only in byte order. "Unicode" stores the low-order byte of each 16-bit unit first (little endian), while "Unicode big endian" stores the high-order byte first. Following the common practice and choosing plain Unicode is fine.
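
The byte-order difference can be seen in a short sketch. It assumes that the two save options correspond to Python's "utf-16-le" and "utf-16-be" codecs, and uses the character 严 (U+4E25) as an example.

```python
import codecs

ch = "严"  # U+4E25

little = ch.encode("utf-16-le")  # low-order byte first
big = ch.encode("utf-16-be")     # high-order byte first

print(little.hex(" "))  # 25 4e
print(big.hex(" "))     # 4e 25

# A byte order mark at the start of a file tells a reader which order is used.
print(codecs.BOM_UTF16_LE.hex(" "))  # ff fe
print(codecs.BOM_UTF16_BE.hex(" "))  # fe ff
```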



Origin blog.csdn.net/whatday/article/details/113765527