Introduction to Character Encoding

First, computer fundamentals: interaction

A computer system is made up of hardware, an operating system, and applications. When we use a computer to carry out a task, we are really triggering a complex series of interactions among these components to produce the result we want. The hardware components communicate through binary instruction codes, yet almost every text file we see is presented in ordinary human-language characters. So how does a computer system convert human-language characters into binary code?

And when we use a text editor, how does the computer system save the contents of a file and read them back?

Second, how a text editor accesses files

1. When we open a text editor, a process is started in memory, and whatever we type is held in memory. At this point the data is lost if the power goes off.

2. When we click Save, the data is written to the hard disk for permanent storage.

3. The same is true of the code files we write: before execution they exist only as text. To the computer, they are just a sequence of characters.

Third, how the Python interpreter executes a .py file

Step one: Start the Python interpreter. This is comparable to opening a text editor.

Step two: Like a text editor opening a file, the Python interpreter reads the contents of the .py file from the hard disk into memory. (This step only loads the content; it does not check whether the content is correct.)

Step three: When the file is run, the Python interpreter interprets and executes the code. It checks the syntax as it interprets, and memory for a variable is allocated only when the statement that defines it is executed.
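The steps above can be imitated in a few lines of Python. This is only a rough sketch of the idea, not a description of how the interpreter is implemented internally: the file name demo.py and its contents are made up for illustration, and the built-in compile and exec functions stand in for the interpreter's own machinery.

```python
import os
import tempfile

# Hypothetical source file, created only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "demo.py")
with open(path, "w", encoding="utf-8") as f:
    f.write("x = 1 + 2\n")

# Step two: read the file from disk into memory. So far it is only bytes.
with open(path, "rb") as f:
    raw = f.read()

# The bytes have to be decoded into characters before they can be understood.
text = raw.decode("utf-8")

# Step three: check the syntax (compile), then execute the code.
code = compile(text, path, "exec")
namespace = {}
exec(code, namespace)

print(namespace["x"])
```

Until compile is called, the content is treated purely as text, so a syntax error in the file would go unnoticed during the read step.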

Fourth, similarities and differences between the Python interpreter and a text editor

Similarity: both read the contents of a text file from the hard disk into memory.

Difference: a text editor only operates on the contents of the file, whereas the Python interpreter must also understand them: it checks the syntax and executes the code.

Fifth, character encoding

1. What is character encoding?

Character encoding maps text characters, according to a fixed set of rules, to codes in a one-to-one correspondence.

2. Why do characters need to be encoded?

Because computers only understand 0s and 1s. For a computer to understand human intent, the characters people use, including those in high-level programming languages, must be translated into codes that the computer can recognize.

Character -> encoding rules (code table) -> binary code
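The chain above can be followed for a single character. A minimal sketch using Python's built-in ord, format, and chr; the variable names are chosen only for illustration.

```python
char = "A"

code = ord(char)             # look the character up in the code table
bits = format(code, "08b")   # write the code as 8 binary digits

print(char, "->", code, "->", bits)

# The mapping is one-to-one, so it can be reversed.
recovered = chr(int(bits, 2))
print(recovered == char)
```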

3. Classification and history of character encodings

1. ASCII

Inside a computer, all data is stored and processed as binary numbers (the computer represents 1 and 0 with high and low voltage levels). The 52 letters such as a, b, c, d (counting uppercase), the digits such as 0 and 1, and common symbols (such as *, #, @) must all be represented in binary when stored. Which binary number stands for which symbol is a matter of convention, and anyone could define their own set of rules (this is called a code table). But if people want to communicate with each other without confusion, they must use the same encoding rules. So the standards organization in the United States specified which binary number represents each English letter and common symbol. The result is ASCII, first published in 1963, substantially revised in 1967, and last updated in 1986. It defines a total of 128 characters.

For example, the space character is 32 (binary 00100000) and the capital letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed, codes 0 to 31 and 127) occupy only the low 7 bits of a byte; the leading bit is always 0.

2. Code tables of other countries

128 symbols are enough to encode English, but not necessarily enough for other languages. As the computer industry developed around the world, countries introduced their own code tables, for example China's GBK, Japan's Shift_JIS, and South Korea's EUC-KR, each mapping its own characters to binary codes. This caused problems when data was exchanged between countries: the same binary code represents different characters in two encodings, or the same character has different binary codes in two encodings. A file written with country A's encoding rules could not be opened correctly on a computer in country B (the text was garbled), which hindered communication.

As globalization advanced, the world urgently needed a unified encoding so that information could flow freely. Thus Unicode, the universal character set, came into being.
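Both problems can be reproduced with Python's built-in codecs. This is a small sketch; the sample character is an arbitrary choice, and "gbk" and "shift_jis" are Python's codec names for the two code tables.

```python
text = "\u4e2d"  # a Chinese character that also exists in Japanese

gbk_bytes = text.encode("gbk")
sjis_bytes = text.encode("shift_jis")

# The same character has a different binary code in each table.
print(gbk_bytes.hex(), sjis_bytes.hex())

# The same bytes mean something else under another country's table.
# errors="replace" keeps the sketch from raising on undecodable bytes.
misread = gbk_bytes.decode("shift_jis", errors="replace")
print(repr(misread))
```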

3. Unicode

Unicode is a character set that covers the characters of all the world's major languages, so people from different countries could finally communicate happily online. But dissatisfaction soon arose. Unicode is a very large collection that can now hold more than one million symbols, and each symbol has its own code: for example, U+0639 represents the Arabic letter Ain, U+0041 represents the English capital letter A, and U+4E25 represents the Chinese character 严. Some symbols need only one byte, while others need three or four. To keep things from getting messy, the straightforward answer was a uniform format: use four bytes for every character and fill the unused bits with 0.

This creates two problems:

(1) A document that originally needed only 1 MB of space needs 4 MB under this scheme, which is wasteful. Many users therefore refused to adopt it and resisted in various ways (especially in countries such as the United States, where one byte per character had been enough). As a result, Unicode could not be widely adopted for a long time, until the Internet arrived.

(2) Several ways of storing Unicode appeared, which means that many different binary formats can be used to represent the same Unicode character.

As the Internet became popular, people online created a strong demand for a unified encoding. After lively and passionate discussion among Internet users, UTF-8 was born (UTF-16 and UTF-32 exist as well, but they matter little for our purposes and need not be memorized). From then on, everyone could get along happily.
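Both problems are easy to observe in Python. In this sketch the codec names ("utf-32-be", "utf-16-le", and so on) are Python's names for the storage formats and are used here only as an illustration; bytes.hex with a separator needs Python 3.8 or later.

```python
char = "A"

# Fixed four bytes per character: three of the four bytes are padding zeros.
fixed_width = char.encode("utf-32-be")
print(fixed_width.hex(" "))

# One code point (U+4E25), several possible binary representations.
formats = {}
for encoding in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be"):
    formats[encoding] = "\u4e25".encode(encoding)
    print(encoding, formats[encoding].hex(" "))
```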

4. UTF-8

UTF-8 is the most widely used implementation of Unicode on the Internet. The key point to emphasize is that UTF-8 is one implementation of Unicode.

The defining feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the byte length depends on the symbol.

The UTF-8 encoding rules are simple. There are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, therefore, the UTF-8 encoding is identical to the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are all set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are always 10. All remaining bits hold the symbol's Unicode code.

Unicode code range       | UTF-8 encoding
(hexadecimal)            | (binary)
-------------------------+---------------------------------------------
0000 0000-0000 007F      | 0xxxxxxx
0000 0080-0000 07FF      | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF      | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF      | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With the table above, reading UTF-8 is straightforward. If the first bit of a byte is 0, that byte on its own is one character; if the first bit is 1, the number of consecutive leading 1s indicates how many bytes the current character occupies.
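The table can be turned into code almost line by line. The following is a hand-written sketch for illustration only: the function names utf8_encode and sequence_length are made up here, and the sketch does not reject the surrogate range (U+D800 to U+DFFF) as a real encoder would. In practice Python's built-in str.encode("utf-8") does this work.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, one branch per table row."""
    if code_point < 0x80:            # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    if code_point < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    if code_point <= 0x10FFFF:       # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    raise ValueError("code point outside the Unicode range")


def sequence_length(first_byte: int) -> int:
    """Return how many bytes a character occupies, judged by its first byte."""
    if (first_byte & 0b10000000) == 0:
        return 1
    if (first_byte & 0b11100000) == 0b11000000:
        return 2
    if (first_byte & 0b11110000) == 0b11100000:
        return 3
    if (first_byte & 0b11111000) == 0b11110000:
        return 4
    raise ValueError("not a valid first byte")


encoded = utf8_encode(0x4E25)
print(encoded.hex(" "), sequence_length(encoded[0]))
```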

5. Converting between Unicode and UTF-8

The conversion between the two can be done by a program. Some applications also let you choose among several encodings when saving a file, which performs the conversion for you.
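In Python the conversion is a pair of method calls. A minimal sketch, assuming the text is held in a Python 3 str (a sequence of Unicode characters):

```python
text = "\u4e25"                         # Unicode character U+4E25 in memory

utf8_bytes = text.encode("utf-8")       # Unicode -> UTF-8
restored = utf8_bytes.decode("utf-8")   # UTF-8 -> Unicode

print(utf8_bytes.hex(" "))
print(restored == text)
```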

6. Why not just use UTF-8 in memory?

Someone may ask: if UTF-8 is so good, why do we still use Unicode in memory? Can't we use UTF-8 directly?

The reason is that UTF-8 is only one implementation of Unicode. It is a storage encoding, and it knows nothing about decoding other encodings. When you read a file from your hard disk that was saved in another country's encoding, UTF-8 cannot help; the bytes have to be decoded to Unicode. Of course, if you insist on using UTF-8 in your computer's memory as well, that is possible, but then you can only work with your own files.

Of course, you can always wait. According to Allen's prediction, as UTF-8 continues to spread, Unicode in memory is likely to be used less and less, and perhaps in the near future we will all use UTF-8.

7. Analysis of garbled text

First, let us be clear about two concepts:

  • Writing a file from memory to the hard disk is called saving the file
  • Reading a file from the hard disk into memory is called reading the file

Garbled text arises in two cases:

  • Case one: the text is already garbled when the file is saved

Suppose a file contains text from several countries and we save it using the encoding rules of a single country. The text of the other countries has no corresponding code under those rules, so it cannot be stored correctly. If we insist on saving anyway, the editor may not report an error, but the data written from memory to the hard disk is certainly corrupted: the garbling has already happened at the save stage. When we later open the file with the same encoding rules, the text of that one country is displayed correctly while the rest is garbled.

  • Case two: the file was saved correctly but is garbled when read

If the file is saved as UTF-8, which is compatible with the characters of all countries, nothing is garbled at save time. But if the wrong decoding is chosen when the file is read, for example GBK, the text is garbled at the read stage. Garbling at the read stage can be fixed: simply choose the correct decoding.
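Both cases can be reproduced in memory with Python's codecs. This is a sketch under stated assumptions: the sample strings are arbitrary, "gbk" stands in for a single country's encoding, and errors="replace" imitates an editor that saves or opens the file without complaining.

```python
# Case one: garbled at save time.
mixed = "\u4e25\u0639"   # a Chinese character followed by an Arabic letter
saved = mixed.encode("gbk", errors="replace")   # GBK has no code for the Arabic letter
reopened = saved.decode("gbk")
print(repr(reopened))    # the Arabic letter was stored as "?" and is lost for good

# Case two: saved correctly, garbled at read time.
original = "\u4e2d\u6587"
on_disk = original.encode("utf-8")
wrong = on_disk.decode("gbk", errors="replace")   # wrong decoding chosen
right = on_disk.decode("utf-8")                   # correct decoding chosen
print(repr(wrong), repr(right))
```

In the first case the information never reached the disk, so no choice of decoding brings it back; in the second case the bytes on disk are intact and only the decoding needs to change.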

Sixth, summary and supplement

The difference between ASCII and Unicode: an ASCII code occupies one byte, whereas a Unicode code usually occupies more than one byte (commonly two).

The letter A in ASCII is 65 in decimal, 01000001 in binary;

The character '0' in ASCII is 48 in decimal, 00110000 in binary. Note that the character '0' is not the same as the integer 0;

To convert the ASCII code for A to Unicode, it is enough to pad it with leading zeros. The Unicode code for A is therefore 00000000 01000001.
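These facts can be checked directly in Python. One assumption in this sketch: the two-byte Unicode form described above is taken to correspond to Python's "utf-16-be" codec.

```python
code_a = ord("A")
code_zero = ord("0")

print(code_a, format(code_a, "08b"))
print(code_zero, format(code_zero, "08b"))

# The character '0' and the integer 0 are different things.
same = ("0" == 0)
print(same)

# The two-byte form of A is the ASCII code padded with a leading zero byte.
two_byte = "A".encode("utf-16-be")
print(" ".join(format(b, "08b") for b in two_byte))
```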

Problem: Unifying everything into Unicode solves the garbled-text problem. However, if the text you write is almost entirely English, Unicode encoding needs twice the storage space of ASCII encoding, which is wasteful for both storage and transmission.

Solution: Convert the Unicode encoding into the variable-length UTF-8 encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code: common English letters are encoded into 1 byte, Chinese characters usually into 3 bytes, and only rare characters into 4 bytes (the original design allowed up to 6 bytes, but the current standard limits UTF-8 to 4). If the text you want to transmit contains a lot of English characters, UTF-8 saves space.

UTF-8 has a further advantage: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8.

How character encoding commonly works in a computer system:

In memory, Unicode is used uniformly; when data needs to be saved to the hard disk or transmitted, it is converted to UTF-8.

When you edit a file in Notepad, the UTF-8 characters read from the file are converted to Unicode characters in memory; when editing is finished and you save, Unicode is converted back to UTF-8 and written to the file.

When you browse the Web, the server converts the dynamically generated Unicode content to UTF-8 and then transmits it to the browser.

That is why the source of many web pages contains a declaration indicating that the page uses UTF-8.
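The same save-and-read cycle can be sketched in Python, where a str holds Unicode characters in memory and the encoding argument of open controls the conversion. The file name note.txt and its contents are hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "note.txt")  # hypothetical file

text = "Hello, \u4e25"   # Unicode string in memory

# Saving: Unicode in memory -> UTF-8 bytes on disk.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# What is stored on disk is UTF-8 bytes.
with open(path, "rb") as f:
    stored = f.read()

# Reading: UTF-8 bytes on disk -> Unicode in memory.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(len(text), len(stored), loaded == text)
```

The string has 8 characters but occupies 10 bytes on disk, because the Chinese character takes 3 bytes in UTF-8 while each ASCII character takes 1.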

Reference:

Nick's blog: https://www.cnblogs.com/nickchen121/p/10718112.html

Origin www.cnblogs.com/allenchen168/p/11537105.html