Character encoding

1. ASCII code

We know that inside a computer, all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; a group of eight bits is called a byte. In other words, a single byte can represent 256 different states, each state corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that made uniform rules for the correspondence between English characters and bit patterns. This is called ASCII, and it is still in use today.

ASCII defines 128 characters in total. For example, the space character SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control characters that cannot be printed) occupy only the low 7 bits of a byte; the highest bit is uniformly set to 0.
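These values are easy to check for yourself; here is a quick sketch in Python (any language with character ordinals would do):

```python
# Each ASCII character occupies one byte with the high bit set to 0.
assert ord(' ') == 32                         # SPACE -> binary 00100000
assert ord('A') == 65                         # 'A'   -> binary 01000001
assert format(ord('A'), '08b') == '01000001'
# All 128 ASCII code points fit in the low 7 bits of a byte:
assert all(code < 0b10000000 for code in range(128))
```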

2. Non-ASCII encodings

For English, 128 symbols are enough, but for other languages they are not. In French, for example, letters can carry diacritical marks that ASCII cannot represent. So some European countries decided to press the idle highest bit of the byte into service for new symbols: in the French encoding, for instance, é is 130 (binary 10000010). As a result, the encoding systems used by these European countries can represent up to 256 symbols.

However, a new problem appeared. Different countries use different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In every case, though, the codes for symbols 0–127 are the same; only the range 128–255 differs.

For the scripts of Asian countries, the problem is worse: they use far more symbols, with Chinese characters alone numbering around 100,000. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per character and can therefore represent at most 256 × 256 = 65,536 symbols in theory.
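Python's built-in gb2312 codec shows the two-byte structure directly (a quick illustration, not part of the original note):

```python
# GB2312 spends two bytes on each Chinese character.
raw = '严'.encode('gb2312')
assert len(raw) == 2
assert raw == b'\xd1\xcf'   # the same D1 CF bytes that Notepad's ANSI mode produces
```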

The Chinese encoding problem needs a dedicated discussion, which this note does not attempt. Note only that although a symbol here is represented by multiple bytes, the GB-class character encodings are unrelated to the Unicode and UTF-8 discussed below.

3. Unicode

As the previous section showed, the world has a variety of encodings, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; read it with the wrong encoding and you get garbled text. Why are e-mails so often garbled? Because the sender and the recipient use different encodings.

Imagine an encoding that includes every symbol in the world, with each symbol given a unique code; then the garbling problem would disappear. This is Unicode: as its name suggests, it is an encoding of all symbols.

Unicode is of course a large collection, currently able to hold more than a million symbols. Each symbol's code is different: for example, U+0639 represents the Arabic letter Ain, U+0041 the English capital letter A, and U+4E25 the Chinese character 严 ("strict"). The exact correspondences can be looked up at Unicode.org or in a dedicated character correspondence table.
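These code points can be checked in any Unicode-aware language; a quick sketch in Python:

```python
# Unicode assigns every symbol a distinct code point.
assert ord('A') == 0x0041    # U+0041, English capital letter A
assert ord('ع') == 0x0639    # U+0639, Arabic letter Ain
assert ord('严') == 0x4E25   # U+4E25, the Chinese character 严 ("strict")
assert chr(0x4E25) == '严'   # and the mapping works in reverse
```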

4. Problems with Unicode

Note, however, that Unicode is only a set of symbols: it specifies only each symbol's binary code, not how that code should be stored.

For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, a full 15 bits when converted to binary (100111000100101), so this symbol requires at least 2 bytes. Symbols with larger codes may require 3 or 4 bytes, or even more.

This raises two serious questions. First, how do we distinguish Unicode from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for English letters; if Unicode uniformly required every symbol to be represented by three or four bytes, then every English letter would necessarily be preceded by two or three bytes of zeros. That would be a huge waste of storage: text files would become two or three times larger, which is unacceptable.

The results were: 1) a variety of storage formats for Unicode emerged, that is, many different binary formats that can be used to represent Unicode; and 2) Unicode could not be widely adopted for a long time, until the emergence of the Internet.

5. UTF-8

The spread of the Internet created urgent demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are basically not used on the Internet. To repeat, the relationship here is: UTF-8 is one implementation of Unicode.

The biggest feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, with the byte length varying by symbol.
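The variable length is easy to observe; for example, in Python:

```python
# UTF-8 spends between 1 and 4 bytes per symbol, depending on its code point.
assert len('A'.encode('utf-8')) == 1    # ASCII letter: 1 byte
assert len('é'.encode('utf-8')) == 2    # accented Latin letter: 2 bytes
assert len('严'.encode('utf-8')) == 3   # CJK character: 3 bytes
assert len('𝄞'.encode('utf-8')) == 4   # U+1D11E, beyond the BMP: 4 bytes
```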

The UTF-8 encoding rules are very simple; there are only two:

1) For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. Therefore, for English letters, the UTF-8 encoding and the ASCII code are identical.

2) For a symbol of n bytes (n > 1), the first n bits of the first byte are set to 1 and the (n+1)-th bit is set to 0, while the first two bits of each of the following bytes are uniformly set to 10. All the remaining, unmentioned bits hold the symbol's Unicode code.

The following table summarizes the encoding rules, the letter x represents available encoding bits.

Unicode symbol range   |  UTF-8 encoding
(hexadecimal)          |  (binary)
-----------------------+---------------------------------------------
0000 0000 - 0000 007F  |  0xxxxxxx
0000 0080 - 0000 07FF  |  110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF  |  1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF  |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Below, again taking the Chinese character 严 as the example, is a demonstration of how the UTF-8 encoding is derived.

The Unicode code of 严 is known to be 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last bit of 严's code, fill the x positions of the format from back to front, padding any leftover positions with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which in hexadecimal is E4B8A5.
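This fill-from-the-back procedure can be written out directly. The sketch below (the helper name `utf8_encode` is my own, and it handles only code points up to U+FFFF) mirrors the first three rows of the table:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point per the UTF-8 table (BMP only)."""
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | cp >> 6,
                      0x80 | cp & 0x3F])
    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0xE0 | cp >> 12,
                  0x80 | (cp >> 6) & 0x3F,
                  0x80 | cp & 0x3F])

assert utf8_encode(0x4E25) == b'\xE4\xB8\xA5'        # 严 -> E4 B8 A5, as derived above
assert utf8_encode(0x41) == b'A'                     # ASCII passes through unchanged
assert utf8_encode(0x4E25) == '严'.encode('utf-8')   # agrees with the built-in codec
```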

6. Conversion between Unicode and UTF-8

The example above shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are not the same. Conversion between them can be done by a program.

On the Windows platform, one simple conversion method uses the built-in Notepad applet, Notepad.exe. After opening a file, choose the "Save As" command from the "File" menu; the dialog box that pops up has an "Encoding" drop-down at the bottom.

It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. English files are saved in ASCII, and Simplified Chinese files in GB2312 (on Simplified Chinese Windows only; Traditional Chinese Windows uses Big5).

2) Unicode here refers to the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option. The meanings of big endian and little endian are explained in the next section.

4) UTF-8 is the encoding method discussed above.

After choosing an encoding, click the "Save" button and the file's encoding is converted immediately.

7. Little endian and Big endian

As mentioned in the previous section, Unicode codes can be stored directly in the UCS-2 format. Taking the Chinese character 严 as the example, its Unicode code is 4E25, which needs two bytes of storage: one byte 4E and another byte 25. Storing 4E first and 25 second is big endian mode; storing 25 first and 4E second is little endian mode.
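The two byte orders can be produced directly; a quick Python check:

```python
# The code 4E25 laid out in the two byte orders:
assert (0x4E25).to_bytes(2, 'big')    == b'\x4E\x25'   # big endian: 4E first
assert (0x4E25).to_bytes(2, 'little') == b'\x25\x4E'   # little endian: 25 first
# The UTF-16 codecs with an explicit byte order agree:
assert '严'.encode('utf-16-be') == b'\x4E\x25'
assert '严'.encode('utf-16-le') == b'\x25\x4E'
```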

These two odd names come from Gulliver's Travels by the British writer Jonathan Swift. In the book, a civil war breaks out in Lilliput over whether a boiled egg should be cracked at the big end (Big-Endian) or the little end (Little-Endian). Over this matter, war broke out six times, one emperor lost his life, and another lost his throne.

Accordingly, putting the high-order byte first is "big endian" mode, and putting the low-order byte first is "little endian" mode.

A natural question follows: how does the computer know which byte order a given file uses?

The Unicode specification defines that a character expressing the byte order is added at the front of each file. This character is named ZERO WIDTH NO-BREAK SPACE and its code is FEFF. That is exactly two bytes, and FF is greater than FE by 1.

If the first two bytes of a text file are FE FF, the file is in big endian mode; if the first two bytes are FF FE, the file is in little endian mode.
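A program can apply the same test to the first two bytes. A minimal sketch (the helper name `byte_order` is my own):

```python
def byte_order(data: bytes) -> str:
    """Guess a UCS-2 file's byte order from its first two bytes (the BOM)."""
    if data[:2] == b'\xFE\xFF':
        return 'big endian'
    if data[:2] == b'\xFF\xFE':
        return 'little endian'
    return 'unknown'

assert byte_order(b'\xFE\xFF\x4E\x25') == 'big endian'     # BOM, then 严 in BE
assert byte_order(b'\xFF\xFE\x25\x4E') == 'little endian'  # BOM, then 严 in LE
```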

8. An example

Here is a practical example.

Open the "Notepad" application Notepad.exe, create a new text file whose content is the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex-editing feature of the text editor UltraEdit to observe the file's internal encoding.

1) ANSI: the file's encoding is the two bytes D1 CF, which is the GB2312 encoding of 严; this also implies that GB2312 is stored in big endian order.

2) Unicode: the encoding is the four bytes FF FE 25 4E, where FF FE indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the encoding is the four bytes FE FF 4E 25, where FE FF indicates big endian storage.

4) UTF-8: the encoding is the six bytes EF BB BF E4 B8 A5. The first three bytes EF BB BF indicate that this is a UTF-8 file, and the last three, E4 B8 A5, are the actual encoding of 严; its storage order matches its encoding order.
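All four byte sequences from this experiment can be reproduced without Notepad; for instance, in Python (the BOM is prepended by hand for the two UTF-16 cases):

```python
# Reproducing Notepad's four saves of the single character 严:
assert '严'.encode('gb2312') == b'\xD1\xCF'                            # ANSI
assert b'\xFF\xFE' + '严'.encode('utf-16-le') == b'\xFF\xFE\x25\x4E'   # Unicode (little endian)
assert b'\xFE\xFF' + '严'.encode('utf-16-be') == b'\xFE\xFF\x4E\x25'   # Unicode big endian
assert '严'.encode('utf-8-sig') == b'\xEF\xBB\xBF\xE4\xB8\xA5'         # UTF-8 with BOM
```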


Text mode

Source files written with different encodings lead to different results when the compiled program runs. How do we deal with this? Specify the character sets when compiling. From man gcc (search /charset):

-finput-charset=charset specifies the encoding of the source file; by default it is parsed as UTF-8.
-fexec-charset=charset specifies the encoding used to represent characters and strings inside the executable; the default is also UTF-8.

gcc -o a a.c    # uses the default charsets

gcc -finput-charset=GBK -fexec-charset=UTF-8 -o utf-8_2 ansi.c

 

How characters are displayed: displaying a character requires dot-matrix glyph data, i.e., a font file, and a font file necessarily includes a character encoding table.

Example: when we write an LCD driver, we can display characters on the LCD; in fact this uses the kernel's font files.


Origin www.cnblogs.com/x2i0e19linux/p/11764549.html