ASCII, Unicode, UTF-8, UTF-16, GBK, GB2312, ANSI and Other Encodings Explained

Preface

In the many different byte encoding schemes, you can see traces of the early history of computing.

ASCII

ASCII comes in two flavors, standard ASCII and extended ASCII, explained separately here.

  1. Standard ASCII code

    Standard ASCII uses one byte, but only the lower 7 bits; the highest bit is always 0. One byte could represent 256 different values, yet ASCII defines only 128 symbols. These 128 symbols are the 26 English letters in upper and lower case, the digits 0-9, 33 non-printable control characters (codes 0-31 and DEL), and the punctuation symbols we can see on the keyboard.

    An ASCII lookup table is available at: http://ascii.911cha.com/

  2. Extended ASCII (EASCII)

    Extended ASCII still uses one byte, but puts the previously idle highest bit to use, doubling the range to 256 values. The symbols EASCII adds beyond ASCII include table-drawing symbols, calculation symbols, Greek letters, and special Latin symbols.
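As a quick sanity check of the 7-bit limit, here is a small sketch using only Python's standard library:

```python
# Standard ASCII: code points 0-127, one byte per character.
print(ord('A'))             # 65 -- the code point of 'A'
print('A'.encode('ascii'))  # b'A' -- exactly one byte

# Anything outside 0-127 cannot be encoded as ASCII:
try:
    '严'.encode('ascii')
except UnicodeEncodeError:
    print('not representable in 7-bit ASCII')
```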


Unicode and UTF-8

Key point: Unicode is not a specific implementation. It only assigns each character a corresponding number (its code point) and does not dictate how that number is stored. UTF-8 happens to be the most widely used implementation of Unicode.

UTF-8's encoding rules are as follows:

1) For single-byte symbols, the first bit is set to 0 and the remaining 7 bits hold the symbol's Unicode code point. Therefore, for English letters, UTF-8 and ASCII produce identical bytes.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n+1)-th bit is set to 0, and the first two bits of every subsequent byte are set to 10. All the remaining bits, not mentioned above, are filled with the symbol's Unicode code point.

According to this rule, parsing UTF-8 is easy: if the first bit of a byte is 0, it is a single-byte character; if the first bit is 1, the number of consecutive leading 1s tells you how many bytes the current character occupies.
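The parsing rule above can be sketched directly in Python; `utf8_char_len` is a hypothetical helper name, not a standard-library function:

```python
def utf8_char_len(first_byte: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if first_byte >> 7 == 0:           # 0xxxxxxx -> single byte (ASCII)
        return 1
    if first_byte >> 5 == 0b110:       # 110xxxxx -> 2-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:      # 1110xxxx -> 3-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:     # 11110xxx -> 4-byte sequence
        return 4
    raise ValueError("not a valid UTF-8 leading byte")

data = '严'.encode('utf-8')    # b'\xe4\xb8\xa5'
print(utf8_char_len(data[0]))  # 3 -- this character occupies three bytes
```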

UTF-8 is backward compatible with ASCII while supporting a different byte length for each symbol, which is why it is called a variable-length encoding. This has a significant advantage: a plain-English document can still use one byte per character (if every character took two bytes, every English document would double in size compared with ASCII, an unacceptable price).

So if we only know that a character is UTF-8 encoded, we do not yet know how many bytes it occupies.
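A short illustration of the variable length: the only way to know a character's byte count is to actually encode it.

```python
# Each character's UTF-8 length depends on its code point:
for ch in ('A', 'é', '严', '😀'):
    print(ch, len(ch.encode('utf-8')))
# 'A' takes 1 byte, 'é' 2 bytes, '严' 3 bytes, and the emoji 4 bytes.
```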

GB2312, GBK and Related Encodings

GB2312 is a character encoding that represents each Chinese character with two bytes. Its full name is the "Code of Chinese Graphic Character Set for Information Interchange — Primary Set", a national standard of the People's Republic of China released by the China State Bureau of Standards and implemented on May 1, 1981.

GBK is an extension of GB2312-80. Since GB2312-80 contains only 6763 Chinese characters, many characters were left out: some simplified characters standardized after GB2312-80 was published, some characters used in personal names (such as the 镕 "rong" in former Chinese Premier Zhu Rongji's name), traditional characters used in Taiwan and Hong Kong, Japanese and Korean characters, and so on. So Microsoft used the unassigned code space of GB2312-80 to cover all the characters of GB13000.1-93, producing the GBK encoding.

The national standard GB18030-2005, "Information Technology — Chinese Coded Character Set", is the most important Chinese character encoding standard after GB2312-1980 and GB13000-1993. Its main feature is that, on the basis of GB18030-2000, it adds the characters of CJK Unified Ideographs Extension B.

The most commonly used today are still GBK and GB2312.
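Python's built-in codecs make the superset relationship easy to verify; this sketch assumes, as the text above describes, that 镕 lies outside GB2312 but inside GBK:

```python
# A common character encodes identically under GB2312 and GBK,
# since GBK is backward compatible with GB2312:
assert '汉'.encode('gb2312') == '汉'.encode('gbk')

# 镕 (from Zhu Rongji's name) is not in GB2312...
try:
    '镕'.encode('gb2312')
except UnicodeEncodeError:
    print('not in GB2312')

# ...but GBK covers it, also with two bytes per character:
print(len('镕'.encode('gbk')))  # 2
```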

ANSI

Different countries and regions developed different standards, giving rise to GB2312, GBK, Big5, Shift_JIS and other encodings. These extended encodings, which use 1 to 4 bytes to represent a character, are collectively called ANSI encodings.

On Simplified Chinese Windows systems, ANSI encoding means GBK; on Japanese Windows systems, ANSI means Shift_JIS. Different ANSI encodings are mutually incompatible, so when exchanging information internationally, text from two different languages cannot be stored in the same stretch of ANSI-encoded text.

Thus, what ANSI actually refers to depends on the operating system's language. It can be understood as Microsoft's way of adapting to different countries' encodings (according to reference 3, only Windows systems use the notion of "ANSI encoding"): each locale gets a different default encoding, unified under the single name ANSI only in appearance. Personally I think this design is a bit of a trap.

So how does ANSI control which country's encoding scheme it switches to?

Microsoft uses a value called the "Windows code page" (run the chcp command at the command line to see the current code page) to determine the system's default encoding. For example, the Simplified Chinese code page is 936 (which means GBK; before Windows 95 it meant GB2312 — see Microsoft Windows Code Page 936), and the Traditional Chinese code page is 950 (which means Big5).

The code page is configurable:

At the command prompt, we can change the active code page of the current terminal with the chcp command, for example:

(1) Run chcp 437 to set the code page to 437; the current terminal's default encoding becomes ASCII (Chinese characters will appear garbled);

(2) Run chcp 936 to set the code page to 936; the current terminal's default encoding becomes GBK (Chinese characters display normally).

These operations affect only the current terminal and do not change the system-wide default "ANSI encoding".

To change the code page globally, you must change the current system locale.

On Linux, the default encoding is UTF-8. If we write a file under Windows in an ANSI encoding (for China, effectively GBK) and open it on a Linux system, we will see garbled text. We can then change the locale, i.e. switch the encoding Linux assumes, to solve the problem; see the figure (taken from reference 3).

UTF-16LE and UTF-16BE

These are the remaining encodings in Notepad's save dialog that we have not yet covered: UTF-16 LE and UTF-16 BE.

UTF-16

UTF-16 is one implementation of the third layer of Unicode's five-layer character encoding model: the Character Encoding Form (also called the "storage format"). That is, it maps the abstract code points of the Unicode character set onto sequences of 16-bit integers (code units, each 2 bytes long) for storage or transmission. A Unicode code point requires one or two such code units, so UTF-16 is also a variable-length encoding.

UTF-16 predates UTF-8, but compared with UTF-8 it has several significant disadvantages:

  1. It does not support ASCII: by definition, its shortest unit is 16 bits, i.e. two bytes, which makes it incompatible with ASCII.
  2. It costs more space than UTF-8 for ASCII-heavy text.

UTF-16 survives in computer systems today mostly for backward compatibility and is not itself widely used (though it does have applications: in JavaScript, for example, all strings — also known as DOMString — are encoded in UTF-16), so we will not detail its encoding method here.

What are LE and BE?

LE stands for Little Endian (low byte first), BE for Big Endian (high byte first). This touches a very low-level question: if a character needs two bytes, how should those two bytes be stored — high byte first, or low byte first? This question already puzzled me when I was learning assembly language, but back then I did not know the story behind it.

The word endian (pronounced like "Indian") can be traced back to Gulliver's Travels. In the novel, Lilliput debates whether a boiled egg should be cracked from the big end (Big-End) or the little end (Little-End), and the two sides of the debate are called the "Big-Endians" and the "Little-Endians".

Which endianness to use is largely a matter of the microprocessor architect's preference: Intel, for example, uses little endian, while Motorola uses big endian.

Take the Chinese character 严 as an example. Its Unicode code point is 4E25, which needs two bytes: one byte 4E, the other byte 25. When storing, 4E first and 25 after is big-endian mode; 25 first and 4E after is little-endian mode.
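This can be checked directly in Python: the BOM-less utf-16-be and utf-16-le codecs expose the raw byte order.

```python
s = '严'  # code point U+4E25
print(s.encode('utf-16-be').hex())  # '4e25' -- high byte first
print(s.encode('utf-16-le').hex())  # '254e' -- low byte first
```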

As we can see, BE writes values the same way humans do, so why not just use BE everywhere?

Because processing the low byte first is more efficient for the computer's circuitry: calculations start from the lowest bit. Therefore computers use little-endian for internal processing.
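You can observe both the machine's native byte order and the two storage layouts from Python's standard library:

```python
import struct
import sys

# Most desktop CPUs (x86, and ARM in its usual mode) report 'little':
print(sys.byteorder)

# struct lets us choose the layout explicitly for the 16-bit value 0x4E25:
print(struct.pack('>H', 0x4E25).hex())  # '4e25' -- big endian
print(struct.pack('<H', 0x4E25).hex())  # '254e' -- little endian
```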

We may notice a question in the figure above: why does only UTF-16 have a byte-order problem, and not ANSI or UTF-8? For UTF-16 in particular, the Unicode standard did not specify a byte order when it was defined, which buried a pit for encodings with multi-byte processing units such as UTF-16 and UTF-32, leaving later generations tangled in endianness conversion. UTF-8, by contrast, uses the single byte as its basic processing unit.

If a character is represented in UTF-8, its Unicode code point must first be encoded into a byte array — note the two concepts here, "byte" and "array". When that byte array is written to memory, it only needs to follow the order of the array, one byte at a time, so no high-byte/low-byte problem exists.

As for GBK, its standard's authors cleverly spelled this out in the standard itself:

Key sentence: of the two bytes, the byte that comes first is the "first byte" and the byte that comes after it is the "second byte" — the order is fixed by the standard itself, so no ambiguity arises.

How to deal with endian issues

There is a very clever convention to solve UTF-16's endianness problem, called the BOM (Byte Order Mark). A BOM placed at the beginning of a document tells the reader what the document's byte order is: for big endian, the document starts with FE FF; for little endian, with FF FE.
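Python's codecs module exposes these marker bytes, and the generic utf-16 decoder uses them exactly as described:

```python
import codecs

print(codecs.BOM_UTF16_BE.hex())  # 'feff'
print(codecs.BOM_UTF16_LE.hex())  # 'fffe'

# The generic 'utf-16' decoder reads the BOM to pick the byte order;
# both of these decode to the same character 严 (U+4E25):
print(b'\xfe\xff\x4e\x25'.decode('utf-16'))
print(b'\xff\xfe\x25\x4e'.decode('utf-16'))
```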

"UTF-8 with BOM," what is it?

UTF-8 does not need a BOM, although the Unicode standard allows a BOM in UTF-8.
So the standard form of UTF-8 contains no BOM; putting a BOM in UTF-8 documents is mainly a Microsoft habit (by the way, Microsoft calls little-endian UTF-16 with BOM simply "Unicode", without further detail — also a Microsoft habit).

Microsoft uses a BOM in UTF-8 to clearly distinguish UTF-8 from ASCII and other encodings, but such files cause problems on operating systems other than Windows. For example, web pages should not use UTF-8 with BOM, otherwise errors often occur.


Reference

  1. http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
  2. https://www.freebuf.com/articles/others-articles/25623.html
  3. https://www.cnblogs.com/malecrab/p/5300486.html
  4. https://www.zhihu.com/question/19677619
  5. https://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97%E5%86%85%E7%A0%81%E6%89%A9%E5%B1%95%E8%A7%84%E8%8C%83
  6. https://juejin.im/post/5ace27c96fb9a028dc416195
  7. http://www.ruanyifeng.com/blog/2016/11/byte-order.html
  8. https://www.zhihu.com/question/62587928
  9. https://blog.csdn.net/a_little_a_day/article/details/78923071

Source: www.cnblogs.com/jiading/p/11516868.html