Character Encoding Notes: ASCII, Unicode and UTF-8

Two very good references:
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
http://pcedu.pconline.com.cn/empolder/gj/other/0505/616631_all.html

one byte: 8 binary bits (256 states)



=============================== Part I ===============================
At noon today I suddenly wanted to figure out the relationship between Unicode and UTF-8, so I began to look up information.
This problem turned out to be more complicated than I thought. I started after lunch and only figured it out initially by 9 o'clock in the evening.
The following are my notes, mainly used to organize my own thoughts. I have tried to write them as simply and clearly as possible, hoping they will be useful to others. After all, character encoding is a cornerstone of computer technology; to use computers proficiently you must know a little about it.
1. ASCII code
We know that inside the computer, all information is ultimately a binary value. Each binary bit (bit) has two states of 0 and 1, so eight binary bits can be combined into 256 states, which is called a byte (byte). That is to say, a byte can be used to represent a total of 256 different states, each state corresponds to a symbol, that is, 256 symbols, from 00000000 to 11111111.
In the 1960s, the United States formulated a set of character codes, which uniformly stipulated the relationship between English characters and binary bits. This is called ASCII code and is still used today.
The ASCII code specifies a total of 128 characters of encoding. For example, the space SPACE is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed out) only occupy the last 7 bits of a byte, and the first bit is uniformly specified as 0.
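As a quick check of the two examples above (a Python sketch, not part of the original notes), we can print each character's code and its 8-bit pattern and confirm that the leading bit is 0:

```python
# ASCII characters fit in the low 7 bits of a byte; the high bit is 0.
for ch in (' ', 'A'):
    code = ord(ch)
    print(repr(ch), code, format(code, '08b'))
# ' ' -> 32, 00100000
# 'A' -> 65, 01000001
```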
2. Non-ASCII encoding
Encoding 128 symbols is enough for English, but not for other languages. In French, for example, letters may carry diacritical marks that cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the coding systems used by these European countries can represent up to 256 symbols.
However, here comes a new problem. Different countries have different letters, so even though they all use a 256-symbol encoding method, they represent different letters. For example, 130 represents é in French encoding, but the letter Gimel (ג) in Hebrew encoding, and another symbol in Russian encoding. But in any case, in all of these encoding methods, the symbols represented by 0--127 are the same, and the only difference is this section of 128--255.
As for the characters of Asian countries, there are more symbols used, and there are as many as 100,000 Chinese characters. One byte can only represent 256 kinds of symbols, which is definitely not enough, and multiple bytes must be used to express one symbol. For example, the common encoding method for Simplified Chinese is GB2312, which uses two bytes to represent a Chinese character, so theoretically it can represent up to 256 x 256 = 65536 symbols.
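The two-byte claim is easy to verify: Python ships a gb2312 codec, so we can encode a Chinese character and inspect the bytes (a sketch; the character 汉, "Han", is used as the example):

```python
# GB2312 represents one Chinese character with two bytes.
data = '汉'.encode('gb2312')
print(data.hex())   # 'baba'
print(len(data))    # 2
```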
The issue of Chinese encoding needs to be discussed in a special article, which is not covered in this note. It is only pointed out here that although multiple bytes are used to represent a symbol, the Chinese character encoding of GB class has nothing to do with Unicode and UTF-8 in the following.
3. Unicode
As mentioned in the previous section, there are multiple encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, if you want to open a text file, you must know its encoding method, otherwise, if you use the wrong encoding method to interpret it, garbled characters will appear. Why do emails often appear garbled? It is because the encoding method used by the sender and the recipient is different.
It is conceivable that if there were a single encoding incorporating all the symbols in the world, with each symbol given a unique code, then the garbled-character problem would disappear. This is Unicode: as its name implies, an encoding for all symbols.
Unicode is, of course, a large collection that now scales to hold over a million symbols. The encoding of each symbol is different. For example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character Yan. For the specific symbol correspondence table, you can query unicode.org, or the special Chinese character correspondence table.
4. The problem of Unicode
It should be noted that Unicode is only a symbol set: it specifies each symbol's binary code, but does not specify how that binary code should be stored.
For example, the Unicode of Chinese character Yan is hexadecimal number 4E25, and the conversion into binary number has 15 bits (100111000100101), that is to say, the representation of this symbol requires at least 2 bytes. For other larger symbols, it may take 3 bytes or 4 bytes, or even more.
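We can confirm the code point and its bit count in Python (a sketch added for illustration):

```python
# The code point of Yan (严) and how many bits it needs.
cp = ord('严')
print(hex(cp))            # 0x4e25
print(bin(cp))            # 0b100111000100101
print(cp.bit_length())    # 15 -> at least two bytes are required
```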
There are two serious problems here. The first is: how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol, rather than three separate symbols? The second is: we already know that English letters need only one byte each. If Unicode uniformly stipulated that every symbol be represented by three or four bytes, then every English letter would have to be preceded by two or three bytes of zeros. This is a huge waste of storage; text files would become two or three times larger, which is unacceptable.
The results were: 1) multiple storage methods for Unicode appeared, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not be popularized for a long time, until the advent of the Internet.
5. UTF-8
The popularization of the Internet strongly demanded the emergence of a unified encoding method. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat: the relationship here is that UTF-8 is one of the implementations of Unicode.
One of the biggest features of UTF-8 is that it is a variable-length encoding method. It can use 1 to 4 bytes to represent a symbol, and the byte length varies according to different symbols.
The encoding rules of UTF-8 are very simple, there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0, and the last 7 bits are the Unicode code of the symbol. So for English letters, UTF-8 encoding and ASCII code are the same.
2) For the symbol of n bytes (n > 1), the first n bits of the first byte are set to 1, the n+1th bit is set to 0, and the first two bits of the following bytes are all set to 10. The remaining unmentioned binary bits are all the Unicode codes of this symbol.
The following table summarizes the encoding rules, the letter x indicates the bits of the available encoding.
Unicode symbol range      |  UTF-8 encoding
(hexadecimal)             |  (binary)
--------------------------+-------------------------------------
0000 0000 - 0000 007F     |  0xxxxxxx
0000 0080 - 0000 07FF     |  110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF     |  1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF     |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
If the first bit of a byte is 0, the byte is a character by itself; if the first bit is 1, how many 1s there are in a row indicates how many bytes the current character occupies.
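The table above translates directly into code. The following is a minimal sketch of an encoder (the function name encode_utf8 is my own; a real program would simply call the library's UTF-8 codec):

```python
def encode_utf8(cp: int) -> bytes:
    """Encode a single Unicode code point per the table above (sketch)."""
    if cp <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(encode_utf8(0x4E25).hex())   # e4b8a5, same as chr(0x4E25).encode('utf-8')
```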
Next, take the Chinese character Yan as an example to demonstrate how to implement UTF-8 encoding.
The Unicode code of Yan (严) is 4E25 (binary 100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of Yan requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of Yan's code, fill the x positions of the format from back to front, padding the remaining bits with 0. The result is that the UTF-8 encoding of Yan is 11100100 10111000 10100101, which in hexadecimal is E4B8A5.
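This worked example can be verified with any language's built-in UTF-8 support; here is a Python one-liner sketch:

```python
# Verify the hand-derived UTF-8 bytes of Yan (U+4E25).
print('严'.encode('utf-8').hex())   # e4b8a5
```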
6. Conversion between Unicode and UTF-8
Through the example in the previous section, you can see that the Unicode code of Yan is 4E25, while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
On the Windows platform there is a very simple conversion method: use the built-in Notepad applet, notepad.exe. After opening a file, choose the Save As command from the File menu; a dialog box pops up with an encoding drop-down at the bottom.

There are four options in the drop-down: ANSI, Unicode, Unicode big endian and UTF-8.
1) ANSI is the default encoding. For English files, it is ASCII encoding, and for Simplified Chinese files, it is GB2312 encoding (only for Windows Simplified Chinese version, if it is Traditional Chinese version, Big5 code will be used).
2) Unicode encoding here refers to the UCS-2 encoding method used by notepad.exe, that is, the Unicode code of the character is directly stored in two bytes. This option uses the little endian format.
3) Unicode big endian encoding corresponds to the previous option. I'll explain what little endian and big endian mean in the next section.
4) UTF-8 encoding, which is the encoding method mentioned in the previous section.
After selecting the "encoding method", click the "Save" button, and the encoding method of the file will be converted immediately.
7. Little endian and Big endian
As mentioned in the previous section, the UCS-2 format can store Unicode codes (code points not exceeding 0xFFFF). Taking Chinese character Yan as an example, the Unicode code is 4E25, which needs to be stored in two bytes, one byte is 4E and the other byte is 25. When storing, 4E is in the front and 25 is in the back, which is the Big endian method; 25 is in the front and 4E is in the back, this is the Little endian method.
These two odd names come from the British author Swift's "Gulliver's Travels". In the book, there is a civil war in Lilliput, which is caused by people arguing over whether to eat eggs with the big-endian or the little-endian. For this matter, six wars broke out before and after, one emperor lost his life, and another emperor lost his throne.
Storing the high-order byte first is the "big endian" way; storing the low-order byte first is the "little endian" way.
So naturally, a question arises: how does the computer know which way a certain file is encoded?
The Unicode specification defines that a character indicating the byte order is added at the very beginning of each file. The name of this character is "zero width no-break space", and it is represented by FEFF. This is exactly two bytes, and FF is 1 greater than FE.
If the first two bytes of a text file are FE FF, it means that the file adopts the large-end mode; if the first two bytes are FF FE, it means that the file adopts the small-end mode.
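This check is easy to write out; below is a small sketch (the helper name utf16_byte_order is my own, added for illustration):

```python
def utf16_byte_order(raw: bytes) -> str:
    # Sniff the first two bytes of a UTF-16 file for the byte order mark.
    if raw.startswith(b'\xfe\xff'):
        return 'big endian'
    if raw.startswith(b'\xff\xfe'):
        return 'little endian'
    return 'unknown'

# Yan (U+4E25) saved both ways: BOM first, then the two code bytes.
print(utf16_byte_order(b'\xfe\xff\x4e\x25'))   # big endian
print(utf16_byte_order(b'\xff\xfe\x25\x4e'))   # little endian
```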
8. Examples
Here is an example.
Open the Notepad program notepad.exe, create a new text file whose content is the single character Yan (严), and save it in each of the ANSI, Unicode, Unicode big endian and UTF-8 encodings.
Then, use the "hexadecimal function" in the text editing software UltraEdit to observe the internal encoding of the file.
1) ANSI: the file content is the two bytes D1 CF, which is the GB2312 encoding of Yan; this also implies that GB2312 is stored in big-endian order.
2) Unicode: The encoding is four bytes FF FE 25 4E, of which FF FE indicates that it is stored in a small-end manner, and the real encoding is 4E25.
3) Unicode big endian: The encoding is four bytes FE FF 4E 25, where FE FF indicates that it is stored in a big endian manner.
4) UTF-8: the encoding is six bytes EF BB BF E4 B8 A5. The first three bytes EF BB BF indicate that this is UTF-8 encoding; the last three, E4 B8 A5, are the actual encoding of Yan, and its storage order matches its encoding order.
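The same four files can be reproduced without Notepad or UltraEdit, using standard codec names (a sketch; note that Python's utf-16-le/utf-16-be codecs do not prepend the BOM that Notepad writes, while utf-8-sig does):

```python
text = '严'
for codec in ('gb2312', 'utf-16-le', 'utf-16-be', 'utf-8-sig'):
    print(codec, text.encode(codec).hex(' '))
# gb2312    -> d1 cf              (ANSI on Simplified Chinese Windows)
# utf-16-le -> 25 4e              (Notepad's "Unicode", minus the FF FE BOM)
# utf-16-be -> 4e 25              (Unicode big endian, minus the FE FF BOM)
# utf-8-sig -> ef bb bf e4 b8 a5  (BOM plus the three UTF-8 bytes)
```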
9. Extended reading
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets
(End)

=============================== Part II ===============================
This is an interesting read written by a programmer for programmers. By "interesting" I mean that you can easily come to understand some concepts that were unclear before and gain knowledge, rather like leveling up in an RPG. The motivation for writing this article was two questions:

  Question 1:

  Using the "Save As" command of Windows Notepad, you can convert between the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all equally .txt files; how does Windows recognize which encoding is used?

  I found out long ago that txt files encoded as Unicode, Unicode big endian and UTF-8 have a few extra bytes at the beginning, namely FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But on what standard are these markings based?

  Question 2:

  Recently I came across ConvertUTF.c on the Internet, which implements mutual conversion among the three encoding methods UTF-32, UTF-16 and UTF-8. I already knew about the encoding methods Unicode (UCS-2), GBK and UTF-8. But this program left me a little confused: I couldn't remember how UTF-16 relates to UCS-2.

  After checking the relevant information, I finally figured out these problems, and by the way, I also learned some Unicode details. Write an article and send it to friends who have had similar questions. This article has been written as straightforward as possible, but requires the reader to know what bytes are and what hexadecimals are.

0. Big endian and little endian

  Big endian and little endian are different ways a CPU handles multi-byte numbers. For example, the Unicode encoding of the Chinese character "Han" (汉) is 6C49. When writing it to a file, should 6C be written first, or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian.
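The two byte orders can be shown with the standard struct module (a sketch; '>H' and '<H' pack a 16-bit unsigned integer big endian and little endian respectively):

```python
import struct

cp = 0x6C49                              # Unicode code of Han (汉)
print(struct.pack('>H', cp).hex(' '))    # '6c 49': big endian, 6C first
print(struct.pack('<H', cp).hex(' '))    # '49 6c': little endian, 49 first
```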

  The word "endian" comes from Gulliver's Travels. The civil war of Lilliput stemmed from whether eggs should be cracked from the big end (Big-Endian) or the little end (Little-Endian). Six rebellions broke out over this matter, in which one emperor lost his life and another lost his throne.

  We generally render "endian" as byte order, and call the big endian and little endian arrangements the "big end" and "little end" orders respectively.

1. Character encoding, internal code, and Chinese character encodings

  Characters must be encoded before they can be processed by a computer. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII encoding. To process Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

  GB2312 (1980) contains a total of 7445 characters, including 6763 Chinese characters and 682 other symbols. The inner code range of the Chinese character area is from B0-F7 in the high byte and A1-FE in the low byte. The occupied code points are 72*94=6768. Among them, 5 vacancies are D7FA-D7FE.

  GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 included 21886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area includes 21003 characters. GB18030, issued in 2000, is the official national standard that replaces GBK 1.0. It includes 27484 Chinese characters as well as major minority scripts such as Tibetan, Mongolian and Uyghur. PC platforms are now required to support GB18030; there is no such requirement for embedded products, so mobile phones and MP3 players generally support only GB2312.

  From ASCII and GB2312 through GBK to GB18030, these encoding methods are backward compatible: the same character always has the same encoding in all of them, with later standards supporting more characters. In these encodings, English and Chinese can be handled uniformly; the way to recognize a Chinese-character code is that the highest bit of the high byte is not 0. In programmers' parlance, GB2312, GBK and GB18030 all belong to the double-byte character sets (DBCS).

  The default internal code of some Chinese Windows is still GBK, which can be upgraded to GB18030 through the GB18030 upgrade package. However, the characters added by GB18030 relative to GBK are difficult for ordinary people to use. Usually, we still use GBK to refer to the Chinese Windows internal code.

  Here are some details:

  The original form of GB2312 is the location (qu-wei) code; converting a location code to the internal code requires adding A0 to both the high byte and the low byte.
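As a concrete check (a sketch using the well-known example 啊, whose GB2312 location code is row 16, cell 1), adding A0 to each byte yields the internal code B0 A1:

```python
# Location code (row 16, cell 1) -> internal code: add 0xA0 to each byte.
row, cell = 16, 1
inner = bytes([row + 0xA0, cell + 0xA0])
print(inner.hex())              # 'b0a1'
print(inner.decode('gb2312'))   # 啊
```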

  In DBCS, the storage format of the GB internal code is always big endian, that is, the high order is first.

  The highest bits of both bytes of GB2312 are 1. But only 128*128 = 16384 code points satisfy this condition, so the highest bit of the low byte in GBK and GB18030 is not necessarily 1. This does not affect parsing a DBCS character stream: when reading such a stream, as long as the highest bit of a byte is 1, the next two bytes can be taken together as one double-byte character, regardless of what the low byte is.

2. Unicode, UCS and UTF

  As mentioned earlier, the encoding methods from ASCII and GB2312 through GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and not with the GB codes. For example, the Unicode code of "Han" (汉) is 6C49, while its GB code is BABA.

  Unicode is also a character encoding method, but it was designed by international organizations and can accommodate all the languages and scripts in the world. The scientific name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can be seen as an abbreviation of "Unicode Character Set".

  According to Wikipedia (http://zh.wikipedia.org/wiki/): historically there were two organizations that tried to design Unicode independently, namely the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

  Around 1991, both parties realized that the world didn't need two incompatible character sets. So they started merging the work of both sides and working together to create a single code table. Starting from Unicode 2.0, the Unicode project has adopted the same font library and font code as ISO 10646-1.

  Both projects still exist and publish their own standards independently. The latest version of the Unicode Consortium is now Unicode 4.1.0 in 2005. The latest ISO standard is 10646-3:2003.

  UCS specifies how to use multiple bytes to represent various characters. How to transmit these encodings is specified by the UTF (UCS Transformation Format) specification. Common UTF specifications include UTF-8, UTF-7, and UTF-16.

  IETF's RFC2781 and RFC3629 describe the encoding methods of UTF-16 and UTF-8 clearly, crisply and rigorously in the consistent style of RFC. I can't always remember that IETF is short for Internet Engineering Task Force. But the RFCs maintained by the IETF are the basis for all specifications on the Internet.

3. UCS-2, UCS-4, BMP

  UCS has two formats: UCS-2 and UCS-4. As the name implies, UCS-2 is encoded with two bytes, and UCS-4 is encoded with 4 bytes (actually only 31 bits are used, and the highest bit must be 0). Let's do some simple math games:

  UCS-2 has 2^16=65536 code points and UCS-4 has 2^31=2147483648 code points.

  UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose highest bit must be 0. Each group is divided into 256 planes according to the next byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in their last byte; the rest is identical.

  The plane 0 of group 0 is called Basic Multilingual Plane, or BMP. In other words, in UCS-4, the code bits whose upper two bytes are 0 are called BMP.

  UCS-2 is obtained by removing the preceding two zero bytes from the BMP of UCS-4. The BMP of UCS-4 is obtained by adding two zero bytes before the two bytes of UCS-2. In the current UCS-4 specification, no characters are assigned outside the BMP.

4. UTF encoding

  UTF-8 encodes UCS in 8-bit units. The encoding from UCS-2 to UTF-8 is as follows:

UCS-2 encoding (hexadecimal) UTF-8 byte stream (binary)
0000 - 007F 0xxxxxxx
0080 - 07FF 110xxxxx 10xxxxxx
0800 - FFFF 1110xxxx 10xxxxxx 10xxxxxx
  For example, the Unicode encoding of the character "Han" (汉) is 6C49. Since 6C49 falls between 0800 and FFFF, the three-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 110001 001001; substituting this bit stream into the x positions of the template in turn yields 11100110 10110001 10001001, i.e. E6 B1 89.

  Readers can use Notepad to test whether our coding is correct.
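For readers without Notepad at hand, the same check can be done in one line of Python (a sketch added for convenience):

```python
# Verify the hand-derived UTF-8 bytes of Han (U+6C49).
s = '汉'
print(hex(ord(s)))                  # 0x6c49
print(s.encode('utf-8').hex(' '))   # e6 b1 89
```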

  UTF-16 encodes UCS in 16-bit units. For UCS codes less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. However, since the actually used BMP of UCS-2 or UCS-4 stays below 0x10000, for now UTF-16 and UCS-2 can be regarded as basically the same thing. But UCS-2 is only an encoding scheme, whereas UTF-16 is used for actual transmission, so the issue of byte order has to be considered.
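The algorithm for codes at or above 0x10000 is the surrogate-pair scheme (from RFC 2781, not spelled out in the text above); the following sketch shows both cases for the big-endian byte order (the function name encode_utf16_be is my own):

```python
def encode_utf16_be(cp: int) -> bytes:
    """UTF-16 big endian: BMP code points pass through unchanged;
    others are split into a surrogate pair (sketch per RFC 2781)."""
    if cp < 0x10000:
        return cp.to_bytes(2, 'big')
    cp -= 0x10000
    hi = 0xD800 | (cp >> 10)       # high surrogate: top 10 bits
    lo = 0xDC00 | (cp & 0x3FF)     # low surrogate: bottom 10 bits
    return hi.to_bytes(2, 'big') + lo.to_bytes(2, 'big')

print(encode_utf16_be(0x6C49).hex())    # '6c49': identical to the UCS-2 value
print(encode_utf16_be(0x10400).hex())   # a four-byte surrogate pair
```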

5. UTF byte order and BOM

  UTF-8 uses the byte as its coding unit, so there is no byte-order problem. UTF-16 uses two bytes as a code unit, so before interpreting UTF-16 text, you must first figure out the byte order of each code unit. For example, the Unicode code of "Kui" (奎) is 594E, and the Unicode code of "Yi" (乙) is 4E59. If we receive the UTF-16 byte stream "594E", is it "Kui" or "Yi"?

  The method recommended by the Unicode specification for marking byte order is the BOM. This BOM is not the "Bill Of Materials" BOM, but the Byte Order Mark. The BOM is a rather clever idea:

  In the UCS encoding there is a character called "ZERO WIDTH NO-BREAK SPACE", whose encoding is FEFF. FFFE, on the other hand, is not a valid character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before the byte stream itself.

  In this way, if the receiver receives FEFF, it indicates that the byte stream is Big-Endian; if it receives FFFE, it indicates that the byte stream is Little-Endian. Therefore the character "ZERO WIDTH NO-BREAK SPACE" is also called BOM.

  UTF-8 does not need a BOM to indicate byte order, but can use BOM to indicate encoding. The UTF-8 encoding of the character "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (the reader can verify it with the encoding method we introduced earlier). So if the receiver receives a byte stream starting with EF BB BF, it knows that this is UTF-8 encoding.

  Windows uses the BOM to mark the encoding of text files.
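Putting the last three paragraphs together, a file's encoding can be sniffed from its first bytes, which is essentially what Notepad does (a sketch; the helper name sniff_bom is my own, and BOM-less files remain ambiguous):

```python
def sniff_bom(raw: bytes) -> str:
    """Guess a text file's encoding from its BOM, Notepad-style (sketch)."""
    if raw.startswith(b'\xef\xbb\xbf'):
        return 'utf-8'
    if raw.startswith(b'\xfe\xff'):
        return 'utf-16-be'
    if raw.startswith(b'\xff\xfe'):
        return 'utf-16-le'
    return 'no BOM (ANSI, or BOM-less UTF-8)'

print(sniff_bom('汉'.encode('utf-8-sig')))   # utf-8
```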

6. Further references

  The main reference for this article is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

  I also found two good sources, but I didn't read them, because I had already found the answers to my original questions:

"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)

  I have written packages that convert UTF-8, UCS-2, GBK to and from each other, both with and without the Windows API. If I have time in the future, I will organize it and put it on my personal homepage (http://fmddlmyy.home4u.china.com).

  I started writing this article after I figured out all the issues, and I thought it would be done in a while. Unexpectedly, it took a long time to consider the wording and verify the details, and it was written from 1:30 pm to 9:00 pm. Hope some readers can benefit from it.
