[Repost] Character encoding notes: ASCII, Unicode and UTF-8

At noon today, I suddenly wanted to figure out the relationship between Unicode and UTF-8, so I started looking things up.

The problem turned out to be more complicated than I had imagined: I read from after lunch until nine o'clock in the evening before I had an initial grasp of it.

The following are my notes, written mainly to organize my own thoughts. I have tried to keep them as simple and easy to understand as possible, in the hope that they will be useful to others. After all, character encoding is a cornerstone of computer technology; anyone who wants to use computers proficiently should know a little about it.

1. ASCII code

We know that inside a computer, all information is ultimately a binary value. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; this unit is called a byte. In other words, a single byte can represent 256 different states, and if each state corresponds to a symbol, that is 256 symbols, from 00000000 to 11111111.

In the 1960s, the United States formulated a character encoding standard that uniformly specified the relationship between English characters and binary values. This is the ASCII code, which is still in use today.

The ASCII code specifies encodings for a total of 128 characters. For example, the space SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters) occupy only the last 7 bits of a byte; the first bit is uniformly set to 0.
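These values are easy to check in any language that exposes character codes; here is a quick Python sketch:

    # Checking the ASCII values quoted above.
    print(ord(' '))                  # 32, the code for SPACE
    print(ord('A'))                  # 65, the code for uppercase A
    print(format(ord('A'), '08b'))   # 01000001 -- the leading bit is 0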

2. Non-ASCII encoding

For English, 128 symbols are enough, but for other languages 128 symbols are not. In French, for example, letters can carry accent marks that cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, the encoding of é in French is 130 (binary 10000010). In this way, the encoding systems used by these European countries could represent up to 256 symbols.

However, a new problem arises here. Different countries have different letters, so even though they all use a 256-symbol encoding, the letters represented differ. For example, 130 represents é in the French encoding, but the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In any case, in all of these encodings the symbols represented by 0--127 are the same; the only part that differs is the range 128--255.

As for the writing systems of Asian countries, they use even more symbols; Chinese characters alone number around 100,000. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express a single symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.
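As a quick illustration (using the Chinese character 严, which these notes return to below), Python's gb2312 codec shows the two-byte encoding at work:

    # One Chinese character becomes two bytes under GB2312.
    print('严'.encode('gb2312').hex())  # d1cf
    print(len('严'.encode('gb2312')))   # 2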

The issue of Chinese encoding deserves a dedicated article and is not covered in these notes. It is only pointed out here that although the GB-class encodings use multiple bytes to represent one symbol, they have nothing to do with the Unicode and UTF-8 discussed below.

3. Unicode

As mentioned in the previous section, there are many encoding methods in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; interpreting it with the wrong encoding produces garbled text. Why do emails so often appear garbled? Because the sender and the recipient use different encoding methods.

It is conceivable that if there were a single encoding that incorporated every symbol in the world, with each symbol given a unique code, then the garbling problem would disappear. This is Unicode: as its name implies, an encoding of all symbols.

Unicode is, of course, a large collection, one that can now hold more than a million symbols. Each symbol has a different encoding: for example, U+0639 represents the Arabic letter Ain (ع), U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严. For the specific symbol correspondence tables, you can consult unicode.org, or the dedicated table for Chinese characters.
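These code points are easy to verify in any language with Unicode-aware strings; a Python sketch:

    # The code points mentioned above.
    print(hex(ord('A')))    # 0x41   -> U+0041
    print(hex(ord('严')))   # 0x4e25 -> U+4E25
    print(chr(0x0639))      # ع      -> the Arabic letter Ain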

4. The problem of Unicode

It should be noted that Unicode is only a symbol set: it specifies only the binary code point of each symbol, not how that code point should be stored.

For example, the Unicode code point of the Chinese character 严 is the hexadecimal number 4E25, which is 15 bits long when converted to binary (100111000100101). In other words, representing this symbol requires at least 2 bytes. Other, larger symbols may require 3 or 4 bytes, or even more.
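The bit count is easy to confirm, for instance in Python:

    # 0x4E25 needs 15 bits, so at least two bytes.
    print(bin(0x4E25))            # 0b100111000100101
    print((0x4E25).bit_length())  # 15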

Two serious problems arise here. The first: how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol, rather than three separate symbols? The second: we already know that English letters need only one byte each. If Unicode uniformly stipulated that every symbol be represented by three or four bytes, then every English letter would necessarily be preceded by two or three bytes of 0. That would be a huge waste of storage, and text files would become two or three times larger, which is unacceptable.

The consequences were: 1) multiple storage methods for Unicode appeared, meaning there are many different binary formats that can be used to represent Unicode; 2) Unicode could not be popularized for a long time, until the advent of the Internet.

5. UTF-8

The popularity of the Internet created a strong demand for a unified encoding method. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat the relationship here: UTF-8 is one implementation of Unicode.

One of the biggest features of UTF-8 is that it is a variable-length encoding. It can use 1 to 4 bytes to represent a symbol, with the length varying according to the symbol.

The encoding rules of UTF-8 are very simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the following 7 bits are the Unicode code of the symbol. Thus, for English letters, the UTF-8 encoding is identical to the ASCII code.

2) For a symbol that takes n bytes (n > 1), the first n bits of the first byte are set to 1, the (n + 1)-th bit is set to 0, and the first two bits of each of the following bytes are set to 10. All the remaining, unmentioned bits are filled with the Unicode code of the symbol.

The following table summarizes the encoding rules, with the letter x marking the bits available for the code.

Unicode symbol range | UTF-8 encoding
(hexadecimal) | (binary)
----------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

According to the table above, interpreting UTF-8 is very simple. If the first bit of a byte is 0, then the byte is a character on its own; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies.

Below, the Chinese character 严 is used to demonstrate how the UTF-8 encoding is derived.

The Unicode code point of 严 is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so its UTF-8 encoding requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严 and working from back to front, fill the x positions in the format, padding the extra x positions with 0. The result: the UTF-8 encoding of 严 is 11100100 10111000 10100101, which converts to E4B8A5 in hexadecimal.
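The same derivation can be written out mechanically. Here is a minimal Python sketch of the three-byte case, splitting the code point into the 4 + 6 + 6 x-slots of the pattern 1110xxxx 10xxxxxx 10xxxxxx:

    # Manually packing U+4E25 into the three-byte UTF-8 pattern.
    cp = 0x4E25
    byte1 = 0b11100000 | (cp >> 12)           # 1110xxxx: top 4 bits
    byte2 = 0b10000000 | ((cp >> 6) & 0x3F)   # 10xxxxxx: middle 6 bits
    byte3 = 0b10000000 | (cp & 0x3F)          # 10xxxxxx: low 6 bits
    print(bytes([byte1, byte2, byte3]).hex()) # e4b8a5
    print('严'.encode('utf-8').hex())          # e4b8a5, matching the built-in codec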

6. Conversion between Unicode and UTF-8

From the example in the previous section, you can see that the Unicode code point of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. The conversion between them can be done by a program.
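In Python, for example, the conversion runs in both directions with the standard codecs:

    # Unicode code point -> UTF-8 bytes, and back again.
    utf8 = chr(0x4E25).encode('utf-8')      # b'\xe4\xb8\xa5'
    print(utf8.hex())                       # e4b8a5
    print(hex(ord(utf8.decode('utf-8'))))   # 0x4e25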

On the Windows platform, one of the simplest conversion methods is to use the built-in Notepad applet notepad.exe. After opening a file, click the Save As (另存为) command in the File (文件) menu; a dialog box pops up with an Encoding (编码) drop-down at the bottom.

[Figure: Notepad's Save As dialog, with the Encoding drop-down at the bottom]

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding method. It means ASCII encoding for English files and GB2312 encoding for Simplified Chinese files (this applies only to the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode here refers to the UCS-2 encoding used by notepad.exe, that is, the character's Unicode code is stored directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option. I explain what little endian and big endian mean in the next section.

4) UTF-8, which is the encoding method discussed in the previous section.

After selecting the "encoding method", click the "Save" button, and the encoding method of the file will be converted immediately.

7. Little endian and Big endian

As mentioned in the previous section, the UCS-2 format can store Unicode codes (code points up to 0xFFFF). Taking the Chinese character 严 as an example, its Unicode code is 4E25, which must be stored in two bytes: one byte 4E, the other byte 25. If 4E is stored first and 25 after, that is the big endian way; if 25 is stored first and 4E after, that is the little endian way.
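Python's UTF-16 codecs make the two byte orders directly visible:

    # The same code point, two byte orders.
    print('严'.encode('utf-16-be').hex())  # 4e25 -- big endian
    print('严'.encode('utf-16-le').hex())  # 254e -- little endian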

These two odd names come from Gulliver's Travels by the British author Swift. In the book, a civil war breaks out in Lilliput over whether eggs should be cracked open at the big end or at the little end. Over this matter, six wars broke out, one emperor lost his life, and another lost his throne.

Storing the first (most significant) byte first is the big endian way; storing the second byte first is the little endian way.

So naturally, a question arises: how does the computer know which way a certain file is encoded?

The Unicode specification defines that a character indicating the byte order is placed at the very beginning of each file. This character is called the "zero width no-break space" and is represented by FEFF. That is exactly two bytes, and FF is exactly 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is in big endian; if the first two bytes are FF FE, the file is in little endian.
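A byte-order check following this rule might look like the sketch below; the b'\xfe\xff' and b'\xff\xfe' literals are the FE FF / FF FE marks just described:

    # Inspect the first two bytes of UTF-16 data to decide the byte order.
    def byte_order(first_two: bytes) -> str:
        if first_two == b'\xfe\xff':
            return 'big endian'
        if first_two == b'\xff\xfe':
            return 'little endian'
        return 'unknown (no byte order mark)'

    # Python's plain 'utf-16' codec writes a BOM in the machine's native
    # order -- FF FE on a typical x86 machine.
    print(byte_order('严'.encode('utf-16')[:2]))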

8. Examples

Below is an example.

Open the Notepad program notepad.exe, create a new text file whose content is the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then, use the hexadecimal view in the text editor UltraEdit to observe the file's internal encoding.

1) ANSI: the file's encoding is the two bytes D1 CF, which is exactly the GB2312 encoding of 严. This also implies that GB2312 stores characters in the big endian way.

2) Unicode: the encoding is the four bytes FF FE 25 4E, where FF FE indicates little endian storage and the real encoding is 4E25.

3) Unicode big endian: the encoding is the four bytes FE FF 4E 25, where FE FF indicates big endian storage.

4) UTF-8: the encoding is the six bytes EF BB BF E4 B8 A5. The first three bytes EF BB BF indicate that this is UTF-8 encoding, and the last three, E4 B8 A5, are the encoding of 严 itself. Its storage order is the same as its encoding order.
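The whole experiment can also be reproduced programmatically. Here is a Python sketch, with two caveats: the utf-16 and utf-16-be codecs merely approximate Notepad's Unicode options (Python's utf-16-be codec writes no byte order mark, unlike Notepad), and utf-8-sig is the codec that writes the EF BB BF mark:

    # Hex dumps of 严 under codecs approximating Notepad's four options.
    for codec in ('gb2312', 'utf-16', 'utf-16-be', 'utf-8-sig'):
        print(codec.ljust(10), '严'.encode(codec).hex())
    # gb2312     d1cf
    # utf-16     fffe254e      (BOM + little endian on a typical x86 machine)
    # utf-16-be  4e25          (this codec omits the BOM)
    # utf-8-sig  efbbbfe4b8a5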

 

http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
