ASCII, Unicode, GBK, UTF-8, UTF-16 and other character encoding problems

This is a light read from a programmer for programmers. By "light" I mean that you can easily pick up some concepts that were never quite clear before and level up your knowledge, a bit like in an RPG. Two questions motivated me to finish this article:

Question 1:
Using Windows Notepad's "Save As", you can convert a text file between the GBK, Unicode, Unicode big endian and UTF-8 encodings. It is the same txt file either way, so how does Windows recognize which encoding it uses?

I noticed long ago that txt files saved as Unicode, Unicode big endian or UTF-8 start with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). What standard are these marks based on?

Question 2:
Recently I came across a ConvertUTF.c on the Internet that converts between the three encodings UTF-32, UTF-16 and UTF-8. I already knew about encodings such as Unicode (UCS-2), GBK and UTF-8, but this program left me a little confused: I could not remember what UTF-16 had to do with UCS-2.
After checking the relevant material I finally sorted out these questions, and along the way learned some Unicode details. I am writing this up for friends who have had similar questions. I have tried to keep it as plain as possible, but it does require the reader to know what a byte is and what hexadecimal is.

0. Big endian and little endian
Big endian and little endian are the two ways a CPU can order the bytes of a multi-byte number. For example, the Unicode code of the Chinese character "汉" is 6C49. When writing it to a file, do you write 6C first or 49 first? Writing 6C first is big endian; writing 49 first is little endian.
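The two byte orders are easy to see with a short Python sketch using the standard `struct` module:

```python
import struct

# 0x6C49 is the Unicode code of the character "汉".
code = 0x6C49

big = struct.pack('>H', code)     # big endian: high byte first
little = struct.pack('<H', code)  # little endian: low byte first

print(big.hex())     # 6c49
print(little.hex())  # 496c
```

The `>` and `<` format prefixes select big-endian and little-endian packing of the same 16-bit value; only the byte order in the output differs.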

The word "endian" comes from Gulliver's Travels: the civil war in Lilliput broke out over whether eggs should be cracked from the big end (Big-Endian) or the little end (Little-Endian). Six rebellions were fought over it, in which one emperor lost his life and another lost his throne.

In Chinese, "endian" is generally translated as "byte order", and big endian and little endian are referred to as the "big-endian" and "little-endian" byte orders.

1. Character encoding, internal code, and a brief look at Chinese character encoding

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area runs from B0 to F7 in the high byte and A1 to FE in the low byte, occupying 72*94 = 6768 code points, of which the 5 at D7FA-D7FE are vacant.

GB2312 covers too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 includes 21886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21003 characters. GB18030, published in 2000, is the official national standard that replaces GBK 1.0. It includes 27484 Chinese characters as well as major minority scripts such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030; there is no such requirement for embedded products, so mobile phones and MP3 players generally support only GB2312.

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible: the same character always has the same encoding in all of them, and each later standard supports more characters. In these encodings, English and Chinese can be handled uniformly; a byte that starts a Chinese character is recognized by its highest bit not being 0. In programmers' parlance, GB2312, GBK and GB18030 are double-byte character sets (DBCS).
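This compatibility is easy to check in any language with codec support; a quick Python sketch (assuming the interpreter ships the gb2312/gbk/gb18030 codecs, as CPython does):

```python
# The same character has the same byte sequence in GB2312, GBK and GB18030.
for codec in ('gb2312', 'gbk', 'gb18030'):
    assert '汉'.encode(codec) == b'\xba\xba'

# ASCII text passes through unchanged as well.
assert 'abc'.encode('gbk') == b'abc'

# The high bit of the lead byte marks a double-byte character.
assert b'\xba'[0] & 0x80 != 0
print('all compatible')
```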

The default internal code of Chinese Windows is still GBK, upgradable to GB18030 via the GB18030 upgrade package. However, the characters GB18030 adds over GBK are rarely useful to ordinary users, so we usually still use GBK when referring to the Chinese Windows internal code.

Here are some details:

The original GB2312 standard is defined in terms of location (qu-wei) codes; the internal code is obtained from the location code by adding A0 to both the high byte and the low byte.
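For example, "啊" sits at row 16, cell 1 of the GB2312 table. A minimal Python sketch of the conversion:

```python
def quwei_to_gb2312(qu: int, wei: int) -> bytes:
    """Convert a GB2312 location (qu-wei) code to the internal code
    by adding A0 to each of the two bytes."""
    return bytes([qu + 0xA0, wei + 0xA0])

# "啊" is at row 16, cell 1, so its internal code is B0A1.
code = quwei_to_gb2312(16, 1)
print(code.hex())          # b0a1
print(code.decode('gbk'))  # 啊
```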

In DBCS, the storage format of the GB internal code is always big endian, that is, the high byte comes first.

Both bytes of a GB2312 character have their highest bit set to 1. But only 128*128 = 16384 code points satisfy that condition, so in GBK and GB18030 the highest bit of the low byte is not necessarily 1. This does not affect parsing a DBCS character stream: while reading, whenever a byte with its high bit set is encountered, that byte and the next one together are treated as one double-byte character, regardless of what the low byte looks like.
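The parsing rule above can be sketched as a small scanner (Python; it assumes a well-formed GBK byte stream):

```python
def split_dbcs(stream: bytes):
    """Split a GBK/DBCS byte stream into per-character byte groups:
    a byte with its high bit set starts a two-byte character."""
    out, i = [], 0
    while i < len(stream):
        if stream[i] & 0x80:          # high bit set: double-byte character
            out.append(stream[i:i + 2])
            i += 2
        else:                         # plain ASCII byte
            out.append(stream[i:i + 1])
            i += 1
    return out

print(split_dbcs('A汉B'.encode('gbk')))  # [b'A', b'\xba\xba', b'B']
```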

2. Unicode, UCS and UTF

As mentioned earlier, the encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and not with the GB codes. For example, the Unicode code of "汉" is 6C49, while its GB code is BABA.
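This example is quick to verify in Python:

```python
# The Unicode code point and the GB internal code of "汉" are unrelated.
assert ord('汉') == 0x6C49
assert '汉'.encode('gbk') == b'\xba\xba'

# Unicode agrees with ASCII/ISO-8859-1 in the low range: 'A' is 0x41 in both.
assert ord('A') == 0x41
print('verified')
```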

Unicode is also a character encoding scheme, but one designed by international organizations to accommodate all of the world's languages and scripts. Its formal name is "Universal Multiple-Octet Coded Character Set", abbreviated UCS, which can also be read as short for "Unicode Character Set".

According to Wikipedia (http://zh.wikipedia.org/wiki/): historically, two organizations tried to design Unicode independently, the International Organization for Standardization (ISO) and a consortium of software vendors (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, both parties realized that the world did not need two incompatible character sets, so they began merging their work and cooperating on a single code table. Starting with Unicode 2.0, the Unicode standard has used the same character repertoire and code points as ISO 10646-1.

Both projects still exist and publish their standards independently. At the time of writing, the Unicode Consortium's latest version is Unicode 4.1.0 (2005), and the latest ISO standard is ISO 10646-3:2003.

UCS specifies how multiple bytes are used to represent characters; how those codes are transmitted is specified by the UTF (UCS Transformation Format) specifications. Common UTF specifications include UTF-8, UTF-7 and UTF-16.

The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings clearly, crisply and rigorously, in the usual RFC style. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs it maintains are the foundation of every specification on the Internet.

3. UCS-2, UCS-4, BMP

UCS comes in two forms: UCS-2 and UCS-4. As the names imply, UCS-2 encodes a character in two bytes and UCS-4 in four (of which only 31 bits are actually used; the highest bit must be 0). Let's do some simple arithmetic:

UCS-2 has 2^16=65536 code points and UCS-4 has 2^31=2147483648 code points.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes by the second-highest byte, each plane into 256 rows by the third byte, and each row contains 256 cells. Cells in the same row differ only in their last byte; the rest of the bytes are identical.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the BMP consists of the UCS-4 code points whose upper two bytes are zero.

Removing the two leading zero bytes from a BMP code of UCS-4 yields UCS-2; adding two zero bytes in front of a UCS-2 code yields the corresponding BMP code of UCS-4. In the UCS-4 specification current at the time of writing, no characters are assigned outside the BMP.
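In other words, for a BMP character the UCS-4 form is just the UCS-2 form with two leading zero bytes; a Python sketch using `struct`:

```python
import struct

code = 0x6C49  # "汉"

ucs2 = struct.pack('>H', code)  # 2-byte UCS-2, big endian
ucs4 = struct.pack('>I', code)  # 4-byte UCS-4, big endian

assert ucs4 == b'\x00\x00' + ucs2
print(ucs4.hex())  # 00006c49
```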

4. UTF encoding

UTF-8 encodes UCS in 8-bit units. The encoding from UCS-2 to UTF-8 is as follows:

UCS-2 code (hexadecimal)    UTF-8 byte stream (binary)
0000 - 007F                 0xxxxxxx
0080 - 07FF                 110xxxxx 10xxxxxx
0800 - FFFF                 1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of "汉" is 6C49. Since 6C49 lies between 0800 and FFFF, the three-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Filling these bits into the x positions of the template from left to right gives 11100110 10110001 10001001, i.e. E6 B1 89.

Readers can use Notepad to check that our encoding is correct.
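The table above translates into code directly. A minimal Python sketch for BMP code points, checked against the built-in codec:

```python
def utf8_encode_bmp(code: int) -> bytes:
    """Encode a UCS-2 code point (< 0x10000) using the UTF-8 templates."""
    if code <= 0x7F:                        # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:                       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | code >> 6,
                      0x80 | code & 0x3F])
    return bytes([0xE0 | code >> 12,        # 1110xxxx 10xxxxxx 10xxxxxx
                  0x80 | code >> 6 & 0x3F,
                  0x80 | code & 0x3F])

print(utf8_encode_bmp(0x6C49).hex())         # e6b189
assert utf8_encode_bmp(0x6C49) == '汉'.encode('utf-8')
```

(This sketch does not reject the surrogate range D800-DFFF, which a strict encoder would.)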

UTF-16 encodes UCS in 16-bit units. For UCS codes below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For codes at or above 0x10000, an algorithm is defined. Since the BMP of UCS-2, and of UCS-4 as actually used, stays below 0x10000, UTF-16 and UCS-2 can so far be considered essentially the same. But UCS-2 is only an encoding scheme, while UTF-16 is used for actual transmission, so the question of byte order has to be considered.
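For completeness, the algorithm for codes at or above 0x10000 produces a pair of 16-bit units (the surrogate pairs of RFC 2781). A sketch of it in Python, alongside the trivial BMP case described above:

```python
def utf16_units(code: int):
    """Return the 16-bit code unit(s) for a code point, per RFC 2781."""
    if code < 0x10000:                 # BMP: the code point itself
        return [code]
    v = code - 0x10000                 # a 20-bit value, split 10/10
    return [0xD800 | v >> 10,          # high (lead) surrogate
            0xDC00 | v & 0x3FF]        # low (trail) surrogate

print([hex(u) for u in utf16_units(0x6C49)])   # ['0x6c49']
print([hex(u) for u in utf16_units(0x1D11E)])  # ['0xd834', '0xdd1e']
assert '\U0001D11E'.encode('utf-16-be').hex() == 'd834dd1e'
```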

5. UTF byte order and the BOM
UTF-8 uses single bytes as its code unit, so it has no byte-order problem. UTF-16 uses two bytes as a code unit, and before interpreting UTF-16 text you must first determine the byte order of its code units. For example, the Unicode code of "奎" is 594E and that of "乙" is 4E59. If we receive the UTF-16 byte stream "594E", is it "奎" or "乙"?
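The ambiguity is easy to demonstrate: the same two bytes decode to different characters depending on which byte order is assumed (Python sketch):

```python
raw = b'\x59\x4e'

# Big endian reads the unit as 0x594E; little endian reads it as 0x4E59.
print(raw.decode('utf-16-be'))  # 奎
print(raw.decode('utf-16-le'))  # 乙
```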

The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill Of Materials" BOM, but the Byte Order Mark, and it is a rather clever idea:

UCS includes a character called "ZERO WIDTH NO-BREAK SPACE", encoded FEFF. FFFE, on the other hand, is not a valid UCS character, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream proper.

Then, if the receiver gets FEFF, the byte stream is big-endian; if it gets FFFE, the stream is little-endian. This is why the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 needs no BOM to indicate byte order, but it can use a BOM to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method introduced above), so a receiver that sees a byte stream starting with EF BB BF knows it is UTF-8 encoded.

Windows uses the BOM to mark the encoding of text files.
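A minimal sniffing routine along these lines, as a Python sketch (it ignores UTF-32 and simply falls back to the ANSI code page, here assumed to be GBK, when no BOM is present):

```python
def sniff_bom(data: bytes) -> str:
    """Guess a text file's encoding from its BOM, roughly the way
    Notepad does."""
    if data.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if data.startswith(b'\xfe\xff'):
        return 'utf-16-be'   # "Unicode big endian" in Notepad
    if data.startswith(b'\xff\xfe'):
        return 'utf-16-le'   # "Unicode" in Notepad
    return 'gbk'             # no BOM: assume the ANSI code page

assert sniff_bom('\ufeff汉'.encode('utf-8')) == 'utf-8'
assert sniff_bom('\ufeff汉'.encode('utf-16-le')) == 'utf-16-le'
assert sniff_bom(b'hello') == 'gbk'
```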

6. Further references
The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" ( http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html ).

I also found two sources that looked good, but did not read them, because I had already found the answers to my original questions:

"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)
I have written packages that convert among UTF-8, UCS-2 and GBK, in versions both with and without the Windows API. When I have time I will tidy them up and put them on my personal homepage (http://fmddlmyy.home4u.china.com).

I started writing this article once I had figured everything out, expecting it to be done quickly. Unexpectedly, weighing the wording and verifying the details took a long time: I wrote from 1:30 pm to 9:00 pm. I hope some readers benefit from it.

Author Blog: http://blog.csdn.net/fmddlmyy/