1. ASCII code
As we know, inside a computer all information is ultimately stored as binary values. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.
In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and binary bits. It is called ASCII and is still in use today.
ASCII defines codes for 128 characters in total. For example, the space character SPACE is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed: codes 0 to 31 and 127) occupy only the last 7 bits of a byte; the first bit is uniformly set to 0.
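As a small illustration of the mapping just described, the following Python sketch looks up the code of a character and prints it as an 8-bit binary string. The helper name ascii_bits is made up for this example and is not part of the original article.

```python
def ascii_bits(ch: str) -> str:
    """Return the ASCII code of a single character as an 8-bit binary string."""
    code = ord(ch)
    if code > 127:
        # ASCII only covers codes 0 to 127 (the first bit is always 0).
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "08b")


for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```

The leading bit of every result is 0, which is the unused first bit mentioned above.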
2. Non-ASCII encodings
128 symbols are enough to encode English, but not enough for other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. So some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French was encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries could represent up to 256 symbols.
However, this created a new problem. Different countries use different letters, so even though they all use 256-symbol encodings, the same value stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, the symbols for 0 to 127 are the same; only the range 128 to 255 differs.
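The effect can be reproduced with Python's built-in legacy codecs. The article does not name the specific encodings, so the code pages cp437, cp862 and cp866 used below are assumptions, chosen because they are common DOS-era code pages for Western European, Hebrew and Russian text.

```python
# Assumed mapping from the languages in the text to legacy code pages.
CODE_PAGES = {"French": "cp437", "Hebrew": "cp862", "Russian": "cp866"}


def interpret(byte_value: int) -> dict:
    """Decode one byte value under each assumed code page."""
    raw = bytes([byte_value])
    return {lang: raw.decode(cp) for lang, cp in CODE_PAGES.items()}


# The same byte, 130, yields a different symbol in each code page.
for lang, symbol in interpret(130).items():
    print(lang, symbol)

# A value in the shared range 0 to 127 yields the same symbol everywhere.
print(interpret(65))
```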
As for the scripts of Asian countries, they use even more symbols; there are roughly 100,000 Chinese characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used to express one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65536 symbols.
Chinese encoding is a topic that needs its own discussion and is not covered in this note. It is only pointed out here that, although these encodings also use multiple bytes per symbol, the GB family of Chinese encodings has nothing to do with the Unicode and UTF-8 discussed below.
3. Unicode
As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if it is read with the wrong encoding, the result is garbled text. Why do emails so often turn up garbled? Because the sender and the recipient use different encodings.
Imagine an encoding that includes every symbol in the world and gives each symbol a unique code; the garbled-text problem would disappear. This is Unicode: as its name suggests, it is an encoding for all symbols.
Unicode is of course a very large set; at its present size it can hold more than one million symbols. Each symbol has a different code: for example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严. For the full table of symbols, consult Unicode.org or a dedicated Chinese character table.
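The three examples can be checked with Python's standard library, which exposes code points through ord and chr and official character names through the unicodedata module. The helper code_point is a made-up name for this sketch.

```python
import unicodedata


def code_point(ch: str) -> str:
    """Format the Unicode code point of a character in U+XXXX notation."""
    return f"U+{ord(ch):04X}"


for ch in (chr(0x0639), "A", "严"):
    print(code_point(ch), unicodedata.name(ch))
```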
4. The problem with Unicode
Note that Unicode is only a symbol set: it specifies the binary code of each symbol, but not how that binary code should be stored.
For example, the Unicode code of the Chinese character 严 is the hexadecimal number 4E25, which is a full 15 bits in binary (100111000100101), so representing this symbol takes at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.
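The bit count can be verified directly. The sketch below converts the code to binary and computes the smallest whole number of bytes that can hold it; min_bytes is a hypothetical helper written for this illustration.

```python
def min_bytes(code: int) -> int:
    """Smallest number of whole bytes needed to hold the code as a plain integer."""
    return max(1, (code.bit_length() + 7) // 8)


code = 0x4E25  # Unicode code of 严
print(format(code, "b"), code.bit_length(), min_bytes(code))
```

This counts raw integer size only; an actual storage format such as UTF-8 may need more bytes because some bits are reserved for markers.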
This raises two serious problems. First, how can Unicode be told apart from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would necessarily be preceded by two or three bytes of 0, which is a huge waste of storage: text files would grow to three or four times their size, which is unacceptable.
These problems had two results: 1) multiple ways of storing Unicode appeared, meaning there are many different binary formats that can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the Internet emerged.
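To make the storage concern concrete, the following sketch compares the size of a short English string under ASCII and under a fixed four-bytes-per-symbol scheme. Python's utf-32-be codec is used here as a stand-in for such a fixed-width scheme; that choice is an assumption of this example, not something the article specifies.

```python
def encoded_size(text: str, encoding: str) -> int:
    """Number of bytes the text occupies in the given encoding."""
    return len(text.encode(encoding))


text = "Hello, world"
for encoding in ("ascii", "utf-32-be"):
    print(encoding, encoded_size(text, encoding))

# In the fixed-width form, each English letter carries three leading zero bytes.
print("A".encode("utf-32-be").hex())
```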
5. UTF-8
The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters take two or four bytes) and UTF-32 (characters take four bytes), but they are rarely used on the Internet. To repeat, the relationship here is that UTF-8 is one implementation of Unicode.
The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the byte length varies with the symbol.
The encoding rules of UTF-8 are very simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. So for English letters, the UTF-8 encoding and the ASCII code are identical.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1, bit n + 1 is set to 0, and the first two bits of each following byte are always set to 10. All remaining bits not mentioned hold the symbol's Unicode code.
The following table summarizes the encoding rules; the letter x marks the bits available for encoding.
Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
With the table above, interpreting UTF-8 is very simple. If the first bit of a byte is 0, the byte is a complete character on its own; if the first bit is 1, the number of consecutive 1s indicates how many bytes the current character occupies.
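The reading rule can be sketched as a small function that counts the leading 1 bits of a first byte. The name utf8_sequence_length is invented for this example, and the function only inspects the first byte; it does not validate the bytes that follow.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 character occupies, given its first byte."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("not a byte value")
    ones = 0
    mask = 0x80
    # Count consecutive 1 bits starting from the highest bit.
    while first_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: single-byte character
    if ones == 1 or ones > 4:
        # 10xxxxxx is a continuation byte; more than four 1s is not in the table.
        raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 character")
    return ones


for b in (0x41, 0xC3, 0xE4, 0xF0):
    print(f"0x{b:02X}", utf8_sequence_length(b))
```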
Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding is carried out.
The Unicode code of 严 is 4E25 (100111000100101). According to the table, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, fill in the x positions of the format from back to front, padding any leftover positions with 0. The resulting UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
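The fill-from-the-back procedure can be written out as a short function. This is a minimal sketch of the two rules and the table, not a replacement for a real codec: the name utf8_encode is made up here, and unlike Python's built-in encoder it does not reject surrogate code points (D800 to DFFF).

```python
def utf8_encode(code: int) -> bytes:
    """Encode one Unicode code point as UTF-8, following the table above."""
    if not 0 <= code <= 0x10FFFF:
        raise ValueError("code point out of range")
    if code <= 0x7F:
        return bytes([code])          # 0xxxxxxx
    if code <= 0x7FF:
        n, lead = 2, 0xC0             # 110xxxxx
    elif code <= 0xFFFF:
        n, lead = 3, 0xE0             # 1110xxxx
    else:
        n, lead = 4, 0xF0             # 11110xxx
    out = []
    # Fill the trailing bytes from back to front, six bits at a time.
    for _ in range(n - 1):
        out.append(0x80 | (code & 0x3F))  # 10xxxxxx
        code >>= 6
    # Whatever bits remain go into the first byte after its marker.
    out.append(lead | code)
    return bytes(reversed(out))


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper(), " ".join(format(b, "08b") for b in encoded))
```

Comparing the result with Python's own "严".encode("utf-8") is a quick way to check the sketch.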
6. Converting between Unicode and UTF-8
From the example in the previous section, we can see that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
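As one possible form of such a program, the sketch below goes in the reverse direction and recovers the Unicode code from UTF-8 bytes by stripping the marker bits. The function name utf8_decode_one is hypothetical, and the sketch skips checks a production decoder would make, such as rejecting overlong encodings.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode the first UTF-8 character in data and return its Unicode code."""
    first = data[0]
    if first < 0x80:
        return first                      # 0xxxxxxx
    if first >> 5 == 0b110:
        n, code = 2, first & 0x1F         # 110xxxxx
    elif first >> 4 == 0b1110:
        n, code = 3, first & 0x0F         # 1110xxxx
    elif first >> 3 == 0b11110:
        n, code = 4, first & 0x07         # 11110xxx
    else:
        raise ValueError("invalid first byte")
    if len(data) < n:
        raise ValueError("truncated sequence")
    for b in data[1:n]:
        if b >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        # Append the six payload bits of each 10xxxxxx byte.
        code = (code << 6) | (b & 0x3F)
    return code


code = utf8_decode_one(bytes.fromhex("E4B8A5"))
print(f"{code:04X}", chr(code))
```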
On Windows, one simple way to convert is to use the built-in Notepad program notepad.exe. After opening a file, choose the Save As command in the File menu; a dialog box pops up with an Encoding drop-down list at the bottom.
It has four options: ANSI, Unicode, Unicode big endian, and UTF-8.
1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).
2) Unicode here refers to the UCS-2 encoding used by notepad.exe, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.
3) Unicode big endian corresponds to the previous option, but in big endian format. The next section explains what big endian and little endian mean.
4) UTF-8 is the encoding discussed in the previous section.
After choosing the encoding, click the Save button and the file is converted to the new encoding immediately.
7. Little endian and Big endian
As mentioned in the previous section, the UCS-2 format can store Unicode codes (for code points no greater than 0xFFFF). Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes, one byte being 4E and the other 25. When stored with 4E first and 25 second, this is the Big endian way; with 25 first and 4E second, it is the Little endian way.
These two odd names come from Gulliver's Travels by the British writer Jonathan Swift. In the book, civil war breaks out in Lilliput over whether an egg should be cracked open at the big end (Big-endian) or at the little end (Little-endian). Over this matter six wars broke out, one emperor lost his life, and another lost his throne.
Storing the first byte first is the "big end" way (Big endian); storing the second byte first is the "little end" way (Little endian).
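The two byte orders are easy to see with Python's int.to_bytes. The helper ucs2_bytes is a made-up name for this sketch and, like UCS-2 as described above, only handles codes up to 0xFFFF.

```python
def ucs2_bytes(code: int, byteorder: str) -> bytes:
    """Store a Unicode code (up to 0xFFFF) in two bytes in the given order."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code does not fit in two bytes")
    return code.to_bytes(2, byteorder)


code = 0x4E25  # 严
print("big   ", ucs2_bytes(code, "big").hex(" ").upper())
print("little", ucs2_bytes(code, "little").hex(" ").upper())
```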
This naturally raises a question: how does a computer know which way a given file is encoded?
The Unicode specification defines that a character indicating the byte order is added at the very front of each file. This character is named "zero width no-break space" and is represented by FEFF. It is exactly two bytes, and FF is 1 greater than FE.
If the first two bytes of a text file are FE FF, the file uses the big end way; if the first two bytes are FF FE, the file uses the little end way.
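A check of this kind might look like the following sketch, which uses the byte order mark constants from Python's codecs module. The function name detect_byte_order is hypothetical, and the sketch only distinguishes the two cases described above.

```python
import codecs


def detect_byte_order(data: bytes):
    """Return 'big' or 'little' from the first two bytes, or None if neither marker is present."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None


print(detect_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_byte_order(bytes.fromhex("FFFE254E")))
```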
8. Examples
Here is an example.
Open the Notepad program notepad.exe, create a new text file containing the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.
Then use the Hex function of the text editor UltraEdit to inspect how each file is encoded internally.
1) ANSI: the file is encoded as two bytes, D1 CF, which is the GB2312 encoding of 严. This also implies that GB2312 is stored in the big end way.
2) Unicode: the encoding is four bytes, FF FE 25 4E, where FF FE indicates little end storage and the actual code is 4E25.
3) Unicode big endian: the encoding is four bytes, FE FF 4E 25, where FE FF indicates big end storage.
4) UTF-8: the encoding is six bytes, EF BB BF E4 B8 A5. The first three bytes EF BB BF indicate that this is UTF-8, and the last three, E4B8A5, are the actual encoding of 严; they are stored in the same order as the encoding itself.
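The same four byte sequences can be reproduced without Notepad. The sketch below maps each Notepad option to a Python codec; that mapping is an assumption of this example (in particular, gb2312 stands in for ANSI on a Simplified Chinese system), and notepad_bytes is a made-up helper name.

```python
import codecs


def notepad_bytes(text: str, option: str) -> bytes:
    """Approximate the bytes Notepad would write for each encoding option."""
    if option == "ANSI":
        return text.encode("gb2312")  # assumed Simplified Chinese system
    if option == "Unicode":
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if option == "Unicode big endian":
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    if option == "UTF-8":
        return text.encode("utf-8-sig")  # utf-8-sig prepends EF BB BF
    raise ValueError(f"unknown option: {option}")


for option in ("ANSI", "Unicode", "Unicode big endian", "UTF-8"):
    print(option, notepad_bytes("严", option).hex(" ").upper())
```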
9. Further reading
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (basic knowledge about character sets)
- Talking about Unicode encoding
- RFC3629: UTF-8, a transformation format of ISO 10646 (the specification of UTF-8)
(End)
Reference: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html