Character Encoding Notes: An Overview of Characters and Encoding

As society has developed, modern life has become almost inseparable from computers, mobile phones, and other electronic devices. Using them to read the news, chat, and play games is now part of our daily lives. Many people feel uneasy and uncomfortable if they go even one day without their phone.

Computers: a world of 0s and 1s

However, all the information we can recognize on the screen of an electronic device is represented inside the computer by 0s and 1s.

Take an optical disc used to store information: you know it contains data, but however long you look at it, you cannot read any of that data with the naked eye.

If you place the disc under a microscope, you may see that its surface is uneven. In simplified terms, the raised areas stand for 1 and the recessed areas stand for 0 (real discs actually encode bits in the transitions between the two).

Characters and encoding

So how do these 0s and 1s become something we can perceive? This is where characters and encoding come in.

Character

In computing and telecommunications, a character is a basic unit of information that roughly corresponds to a grapheme or symbol. For example, the letter "A" is a character, the Chinese character for "I" is a character, and the punctuation mark "/" is also a character.

Encoding

An encoding is a convention: a mapping that turns the 0s and 1s stored in the computer into particular characters. In ASCII, for example, 01000001 represents A and 01000010 represents B.
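The mapping is easy to try out. Below is a minimal Python sketch; the helper name `bits_to_char` is made up for this illustration and is not part of any library.

```python
def bits_to_char(bits: str) -> str:
    """Interpret a string of 0s and 1s as a number and return its ASCII character."""
    code = int(bits, 2)                   # "01000001" -> 65
    return bytes([code]).decode("ascii")  # look the number up in the ASCII table

print(bits_to_char("01000001"))  # A
print(bits_to_char("01000010"))  # B
```

The same bits mean nothing on their own; it is the agreed table (here ASCII) that turns them into a letter.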

The development of character encoding

Because different sets of characters needed to be represented, and because the same characters were mapped under different conventions, many different character sets and encodings arose.

From the perspective of multi-language support on computers, this history can be divided into the following four phases:

Phase I: English-speaking countries (ASCII code)

Because the computer was invented in English-speaking countries, at first it supported only English; other languages could not be displayed.

ASCII is a character encoding developed in the United States in the 1960s to map binary data in the computer to English characters.

An ASCII code uses the low 7 bits of a byte, with the most significant bit fixed at 0. It can therefore represent 2 to the power of 7, or 128, characters in total. For example, the capital letter A is 65 (binary 01000001).
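The arithmetic can be checked in a few lines of Python. This is only a sketch; the names `ASCII_BITS` and `is_ascii_byte` are invented for the example.

```python
ASCII_BITS = 7
ascii_size = 2 ** ASCII_BITS            # number of distinct 7-bit values

code_of_a = ord("A")                    # numeric code of capital A
binary_of_a = format(code_of_a, "08b")  # padded to a full 8-bit byte

def is_ascii_byte(value: int) -> bool:
    """Return True if the most significant bit of the byte is 0."""
    return (value & 0b10000000) == 0

print(ascii_size, code_of_a, binary_of_a)  # 128 65 01000001
```

Any byte value from 0 to 127 passes `is_ascii_byte`; values from 128 to 255 have the top bit set and fall outside ASCII.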

Phase II: European countries (Extended ASCII code)

128 characters are enough for English, but not for many non-English-speaking European countries. French, for example, has diacritical marks above some letters, which ASCII cannot represent.

So some European countries decided to use the idle highest bit of the byte to extend ASCII. An extended encoding can represent 256 characters in total: the first 128 are identical to ASCII, and the last 128 all have the highest bit of the byte set to 1.

Because the characters needed differ from country to country, these extended ASCII encodings are not the same. For example, 130 (binary 10000010) encodes é in the encoding used for French, but ג in the encoding used for Hebrew.
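This clash can be reproduced with two of the legacy code pages that ship with Python. The text does not say which encodings it has in mind, so the choice of `cp850` (a Western European DOS code page) and `cp862` (a Hebrew DOS code page) here is an assumption.

```python
raw = bytes([130])  # the single byte 10000010

# Assumption: cp850 and cp862 stand in for the "French" and "Hebrew"
# extended-ASCII encodings mentioned in the text.
as_western = raw.decode("cp850")
as_hebrew = raw.decode("cp862")
print(as_western, as_hebrew)

# The first 128 values stay identical to ASCII in both code pages.
same_low_half = bytes([65]).decode("cp850") == bytes([65]).decode("cp862") == "A"
print(same_low_half)
```

The byte is the same in both cases; only the table used to read it differs.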

Phase III: other countries (ANSI code)

For most European countries, 256 characters are enough to carry their information. But for other countries, especially Asian countries such as China, whose thousands of years of civilization have accumulated tens of thousands of characters, a mere 256 characters are far from enough.

This is how the so-called ANSI encodings appeared. The term refers to encodings designed to let computers support more languages; they use 2 bytes to represent one character, usually with byte values in the range 0x80 to 0xFF.

Different countries and regions developed different standards, so "ANSI" means a different encoding depending on the system environment. For example:

  • On a Simplified Chinese system, ANSI refers to GB2312;
  • On a Traditional Chinese system, ANSI refers to BIG5;
  • On a Japanese system, ANSI refers to a JIS encoding (Shift_JIS).
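To make the list above concrete, here is a small Python sketch that encodes one character, 中, with three codecs. The codec names `gb2312`, `big5`, and `shift_jis` are Python's own and only approximate what each localized system calls "ANSI"; the sample character is chosen for the illustration because it exists in all three.

```python
char = "中"

# Encode the same character under three regional double-byte encodings.
encoded = {name: char.encode(name) for name in ("gb2312", "big5", "shift_jis")}

for name, data in encoded.items():
    # Show the codec, the bytes in hex, and how many bytes were needed.
    print(name, data.hex(), len(data))
```

Each codec needs two bytes for the character, and each produces a different pair of bytes.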

The different ANSI encodings are incompatible with one another. When information is exchanged internationally, text in two different languages cannot be stored in the same piece of ANSI-encoded text.

Phase IV: All countries (Unicode)

As the previous three phases show, the world has many different encodings. Even "ANSI" stands for different characters in different system environments.

So to open a file you need to know its encoding; otherwise the text comes out garbled. The garbled text that sometimes shows up in e-mail is a common example.
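Garbling of this kind is easy to reproduce. In the sketch below, the sample text and the choice of `latin-1` as the wrong guess are assumptions made for the illustration.

```python
text = "中文"
stored = text.encode("gb2312")    # bytes as written by the sender

wrong = stored.decode("latin-1")  # reader guesses the wrong encoding
right = stored.decode("gb2312")   # reader knows the correct encoding

print(wrong)  # ÖÐÎÄ
print(right)  # 中文
```

No data is lost in the garbled case; the bytes are intact and only the table used to read them is wrong.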

If there were a single encoding that included every symbol in the world, whether English, Japanese, or Chinese, and everyone used it, the garbled-text problem would go away.

This is Unicode.

Unicode is, of course, a very large set: at its present size it can accommodate more than one million symbols. Each symbol has its own code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+6C49 is the Chinese character 汉 ("Han").
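These code points can be inspected with Python's standard `unicodedata` module. This is a small sketch; the variable names are illustrative.

```python
import unicodedata

# Build each character from its code point, then ask Unicode for its name.
code_points = (0x0639, 0x0041, 0x6C49)
chars = {cp: chr(cp) for cp in code_points}

for cp, char in chars.items():
    print(f"U+{cp:04X}", char, unicodedata.name(char))

# The Unicode code space runs from U+0000 to U+10FFFF.
total_code_points = 0x10FFFF + 1
print(total_code_points)  # 1114112
```

The last line shows where "more than one million" comes from: the code space has 1,114,112 positions, not all of which are assigned.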

Unicode unifies the encodings, but storing its codes directly is not efficient. For example, UCS-4 (one of the Unicode encoding forms) uses 4 bytes to store each symbol, so for every English letter the first three bytes are bound to be 0, which wastes a great deal of storage and bandwidth.
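The waste is visible in a quick experiment. Python has no codec named UCS-4, so this sketch assumes that `utf-32-be` (big-endian, no byte-order mark) is a fair stand-in for it.

```python
# Assumption: "utf-32-be" stands in for UCS-4 (4 bytes per symbol, big-endian).
encoded_a = "A".encode("utf-32-be")
print(encoded_a)  # b'\x00\x00\x00A'

message = "hello"
ascii_size = len(message.encode("ascii"))     # 1 byte per letter
ucs4_size = len(message.encode("utf-32-be"))  # 4 bytes per letter
print(ascii_size, ucs4_size)  # 5 20
```

For plain English text, three quarters of the stored bytes are zeros.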

So, to make Unicode more efficient to store and transmit, encodings such as UTF-8 and UTF-16 appeared.
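As a rough comparison, the sketch below counts how many bytes each encoding needs for an English letter and for the character U+6C49. The helper name `encoded_sizes` is made up for the example, and `utf-32-be` is again used as an assumed stand-in for UCS-4.

```python
def encoded_sizes(char: str) -> dict:
    """Return the number of bytes each encoding needs for one character."""
    codecs = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: len(char.encode(name)) for name in codecs}

print(encoded_sizes("A"))       # {'utf-8': 1, 'utf-16-be': 2, 'utf-32-be': 4}
print(encoded_sizes("\u6c49"))  # {'utf-8': 3, 'utf-16-be': 2, 'utf-32-be': 4}
```

UTF-8 keeps English letters at one byte each, while UTF-16 is more compact for this Chinese character; both avoid the fixed 4-byte cost.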

Summary

This article outlined the concepts of characters and encoding, how a computer's 0s and 1s are translated into the information we read every day, and the order in which the various character encodings appeared along with the reasons each one exists.

Whatever exists has its reasons: things do not appear out of thin air, and they do not disappear for no reason. Although some encodings seem very unreasonable today, at the time they were created they were the best choice.

I plan to summarize GBK and the UTF encodings next, in the hope of living up to the saying that even a good memory is no match for written notes.

Original: Large column  character encoding and notes: The character encoding and Overview


Origin www.cnblogs.com/petewell/p/11607139.html