Character encoding fundamentals

First, computer basics

Second, how a text editor reads and writes files

1, Opening an editor starts a process, which lives in memory, so whatever you type in the editor is also held in memory and is lost if the power fails.

2, To keep the content permanently, you click the Save button: the editor flushes the data from memory to the hard disk.

3, Writing a .py file (without executing it) is no different from writing any other document: you are just writing a sequence of characters.
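The save and open steps above can be sketched in a few lines of Python. This is only an illustration of what an editor does conceptually, not how any particular editor is implemented; the file name test.py and the temporary directory are assumptions made for the example.

```python
import os
import tempfile

# Characters typed into the "editor": at this point they exist only in memory.
text = 'name = "egon"\n'

# "Save": flush the in-memory characters to disk, encoded as UTF-8 bytes.
# The file name is hypothetical and lives in a throwaway temp directory.
path = os.path.join(tempfile.mkdtemp(), "test.py")
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# "Open": read the bytes back from disk and decode them into characters.
with open(path, "r", encoding="utf-8") as f:
    loaded = f.read()

print(loaded == text)
```

Nothing in this sketch executes the Python code inside the file; it is handled purely as text.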

Third, how the Python interpreter executes a .py file

  • The first stage: the Python interpreter starts, which is comparable to starting a text editor.

  • The second stage: the Python interpreter, acting like a text editor, opens test.py and reads its contents from the hard disk into memory (quick review: Python is an interpreted language, so the interpreter cares only about the contents of the file, not about its extension).

  • The third stage: the Python interpreter interprets and executes the code of test.py that was just loaded into memory (note: only at this stage is the code actually executed and the Python syntax recognized; when the interpreter executes name = "egon", it allocates memory for the string "egon").
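The three stages can be imitated from inside Python itself. The sketch below is a simplification, not the interpreter's real startup code: it reads a file as plain text (stage two), then uses the built-in compile() and exec() to recognize the syntax and run it (stage three). The file name test.txt is an assumption, chosen to show that the extension plays no role.

```python
import os
import tempfile

# Prepare a source file with a non-.py extension (hypothetical name).
path = os.path.join(tempfile.mkdtemp(), "test.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write('name = "egon"\n')

# Stage two: read the contents from disk into memory, exactly like an editor.
with open(path, encoding="utf-8") as f:
    source = f.read()

# Stage three: recognize the Python syntax and execute the code.
code = compile(source, path, "exec")
namespace = {}
exec(code, namespace)

# Only now does the string object "egon" exist in memory.
print(namespace["name"])
```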

Fourth, similarities and differences between the Python interpreter and a text editor

  • Similarity: the Python interpreter interprets the contents of a file, so it must be able to read .py files, just as a text editor does.

  • Difference: a text editor reads the file contents into memory in order to display or edit them and ignores Python syntax entirely. The Python interpreter reads the file contents into memory not to show you what the code says, but to recognize the Python syntax and execute the code.

Fifth, introduction to character encoding

5.1 What is character encoding

A computer must be powered to work: electricity drives it, so the characteristics of electricity determine the characteristics of the computer. Electricity has two levels, high and low (by convention, a high level is logically binary 1 and a low level is binary 0), and magnetic storage works on the same principle. Conclusion: a computer only recognizes numbers.
Yet when we use a computer, we work with characters that humans can read (a program written in a high-level language is also nothing more than a sequence of characters in a file). How do we make the computer understand human characters?
They must go through a process:

  • Character ------> translation --------> Number
    In short, character encoding is the process of translating human characters into numbers the computer can recognize. This translation must follow an agreed standard, which is simply a table mapping human characters to numbers, called a character encoding table.
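As a small illustration of the character-to-number translation, Python's built-in ord() and chr() look up a character's number in the Unicode table and back. The three sample characters are arbitrary choices for the example.

```python
# Build a tiny "character encoding table": character -> number.
table = {ch: ord(ch) for ch in "Az中"}

for ch, number in table.items():
    # chr() is the reverse lookup: number -> character.
    assert chr(number) == ch
    print(f"U+{number:04X} = decimal {number} = binary {number:b}")
```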

5.2 Two scenarios that involve character encoding

1, The contents of a Python file are a sequence of characters, so storing and reading them involves character encoding (the file is not being executed yet; the first two stages above belong to this scenario).
2, The string data type in Python is also a sequence of characters (this applies when the file is executed, i.e. the third stage).

5.3 History and classification of character encodings

Early computers were developed mainly in the United States, so the earliest character encoding was ASCII, which defines the correspondence between numbers and English letters, digits, and some special characters. An ASCII character is stored in one byte (8 bits), and a byte can take at most 2^8 = 256 values. Standard ASCII defines only 128 of them (7 bits); the remaining values were later used by 8-bit extensions.
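A quick check of these limits in Python, as a sketch: every standard ASCII character encodes to exactly one byte with a value below 128, and anything outside the table is rejected by the "ascii" codec.

```python
import string

# Letters, digits and punctuation: all part of standard ASCII.
ascii_text = string.ascii_letters + string.digits + string.punctuation
encoded = ascii_text.encode("ascii")

print(len(encoded) == len(ascii_text))  # one byte per character
print(max(encoded) < 128)               # standard ASCII stays in the 7-bit range
print(2 ** 8)                           # number of values one byte can hold

# A Chinese character has no entry in the ASCII table.
try:
    "中".encode("ascii")
    chinese_fits = True
except UnicodeEncodeError:
    chinese_fits = False
print(chinese_fits)
```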

For programming itself, English is enough and ASCII suffices. But when handling data, different countries use different languages: Chinese developers put Chinese text in their programs, Japanese developers put Japanese, and Korean developers put Korean.

But a single byte per character cannot cover Chinese (even a child knows more than two thousand Chinese characters). The only solution is to use more than 8 bits per character: the more bits, the more distinct values, and so the more different characters can be represented.

So China defined its own standard, the GB2312 encoding, which specifies the correspondence between numbers and characters, Chinese included.

Japan defined its own Shift_JIS encoding, and Korea defined its own EUC-KR encoding.

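At this point a problem appeared. Suppose Xiao Zhou, who is fluent in 18 languages, modestly writes one document in 8 of them. Whichever national standard is used for that document, it ends up garbled, because each standard only defines the mapping between numbers and the characters of its own country's script. If a single national encoding is used, the text in the other languages is garbled when it is parsed. A worldwide standard that covers all the world's languages was urgently needed, and so Unicode was born.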
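The multilingual-document problem can be reproduced with Python's built-in codecs. In this sketch the sample sentence is made up for illustration; each national codec fails on at least one of the other scripts, while UTF-8 (an encoding of Unicode) handles all of them.

```python
# One document mixing English, simplified Chinese, Japanese and Korean.
document = "English, 我们说中文, にほんご, 한국어"

results = {}
for codec in ("gb2312", "shift_jis", "euc_kr", "utf-8"):
    try:
        document.encode(codec)
        results[codec] = True    # every character has a mapping in this codec
    except UnicodeEncodeError:
        results[codec] = False   # at least one character has no mapping

for codec, ok in results.items():
    print(f"{codec}: {'ok' if ok else 'cannot represent the whole document'}")
```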

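ASCII uses 1 byte (8 bits) per character. Unicode assigns every character a number; in its common form most characters take 2 bytes (16 bits), and rare characters need 4 bytes.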

  • Example: the letter x in ASCII is decimal 120, binary 0111 1000.

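The Chinese character 中 is outside the range of ASCII; in Unicode it is decimal 20013, binary 0100 1110 0010 1101. The letter x in Unicode is binary 0000 0000 0111 1000, so Unicode is compatible with ASCII and also covers every other script: it is the world standard. Once all documents use Unicode, the garbling problem disappears, but a new problem appears: if a document is entirely in English, 2-byte Unicode takes twice as much space as ASCII, which is very inefficient for storage and transmission.
In the spirit of economy, UTF-8 (Unicode Transformation Format-8) appeared, which converts Unicode into a variable-length encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its number (the original design allowed up to 6 bytes, but the current standard limits it to 4). English letters are encoded as 1 byte, Chinese characters are usually 3 bytes, and only very rare characters are encoded as 4 bytes. If the text you transmit contains a lot of English characters, UTF-8 saves space.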
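The byte counts can be measured directly. This sketch assumes that the "2-byte Unicode" discussed here corresponds to Python's "utf-16-be" codec (UTF-16 without a byte-order mark); the rare character is an arbitrary pick from outside the 16-bit range.

```python
samples = {
    "English letter": "x",
    "Chinese character": "中",
    "rare character": "\U00020000",  # a CJK Extension B character
}

sizes = {}
for label, ch in samples.items():
    utf8_len = len(ch.encode("utf-8"))
    utf16_len = len(ch.encode("utf-16-be"))
    sizes[label] = (utf8_len, utf16_len)
    print(f"{label}: code point {ord(ch)}, UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

# An all-English text costs twice as much in 2-byte Unicode as in UTF-8.
english = "hello world"
ratio = len(english.encode("utf-16-be")) / len(english.encode("utf-8"))
print(ratio)
```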

Character  ASCII     Unicode            UTF-8
A          01000001  00000000 01000001  01000001
中         (none)    01001110 00101101  11100100 10111000 10101101

The table also shows an extra benefit of UTF-8: ASCII can in fact be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can keep working with UTF-8.
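The bit patterns for A and 中 can be regenerated with a short helper. In this sketch, bits() is a hypothetical helper written for the example, and the 2-byte Unicode form is again assumed to be "utf-16-be".

```python
def bits(data: bytes) -> str:
    """Format each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{byte:08b}" for byte in data)

print(bits("A".encode("ascii")))       # ASCII form of A
print(bits("A".encode("utf-16-be")))   # 2-byte Unicode form of A
print(bits("A".encode("utf-8")))       # UTF-8 form of A
print(bits("中".encode("utf-16-be")))  # 2-byte Unicode form of the Chinese character
print(bits("中".encode("utf-8")))      # UTF-8 form of the Chinese character

# ASCII text is byte-for-byte identical when encoded as UTF-8.
same = "A".encode("ascii") == "A".encode("utf-8")
print(same)
```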

5.4 Why doesn't memory use UTF-8?

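After all that, why does memory use Unicode rather than UTF-8 directly? Wouldn't that let us write data from memory straight to disk? The reason is that the disk still holds files in other countries' encodings, and those files must remain usable. Unicode defines a mapping to every one of those national encodings, so text in memory is kept as Unicode and can be converted to and from any of them. UTF-8 only makes the storage of Unicode more compact; it has no mapping of its own to the national encodings. In short, Unicode is the common form through which text in any national encoding on disk can be read, and UTF-8 bytes alone cannot play that role.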
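A minimal sketch of Unicode acting as the intermediate form: bytes in one national encoding are decoded into a Python str (Unicode in memory) and then encoded into another national encoding. The two-character sample is an assumption chosen because both characters exist in GBK and Shift_JIS.

```python
# Pretend these bytes were read from a legacy file saved as GBK.
gbk_bytes = "中文".encode("gbk")

# Disk -> memory: decode the national encoding into Unicode (a Python str).
text = gbk_bytes.decode("gbk")

# Memory -> disk: encode the same Unicode text into a different national encoding.
sjis_bytes = text.encode("shift_jis")

print(gbk_bytes.hex(), sjis_bytes.hex())       # different bytes on disk
print(sjis_bytes.decode("shift_jis") == text)  # same characters in memory
```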

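Memory still uses Unicode partly for historical reasons. Since most new code and data are now written in UTF-8, the legacy national encodings are gradually being phased out, and in the future text in memory may also be held as UTF-8 instead of fixed-width Unicode.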

5.5 Character encoding and text editor operations

5.6 Analysis of garbled text

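First, clarify the concepts

  • Flushing a file from memory to the hard disk is called saving the file

  • Reading a file from the hard disk into memory is called reading the file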

Two cases of garbled text

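  • Case one: the text is already garbled when the file is saved
    The file contains text from several countries, but we save it using only shiftjis. Characters from the other languages have no mapping in shiftjis, so strictly speaking they cannot be stored. If we force the save anyway, the editor does not report an error (an editor does not crash just because the encoding is wrong), but text that cannot be stored is inevitably stored incorrectly, so the garbling already happens at save time. When we reopen the file with shiftjis, the Japanese displays normally but the Chinese is garbled.
  • Case two: the text is saved correctly but garbled when the file is read
    The file is saved as UTF-8, which covers all languages, so nothing is garbled on disk. But if the wrong decoding is chosen when reading, for example gbk, the text is garbled at read time. Garbling at read time can be fixed: just choose the correct decoding.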
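Both cases can be reproduced in a few lines. This is a sketch under one stated assumption: errors="replace" stands in for an editor that saves anyway instead of failing, which real editors may handle differently. The sample strings are made up for the example.

```python
# Case one: force-save with a codec that lacks some of the characters.
text = "日本語 and 们"   # Japanese plus one simplified Chinese character
saved = text.encode("shift_jis", errors="replace")
reopened = saved.decode("shift_jis")
print(reopened == text)        # the Chinese character was lost at save time
print(reopened.endswith("?"))  # it was replaced by a question mark for good

# Case two: save correctly as UTF-8, then read with the wrong codec.
good_bytes = "你好".encode("utf-8")
garbled = good_bytes.decode("gbk", errors="replace")
recovered = good_bytes.decode("utf-8")
print(garbled == "你好")    # wrong decoder: garbled on screen
print(recovered == "你好")  # right decoder: the bytes on disk were fine all along
```

In case one the original character cannot be recovered from the saved bytes; in case two nothing on disk is damaged, so reading again with the correct codec fixes the display.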

Sixth, summary

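1, The core rule for avoiding garbled text: decode characters with the same standard they were encoded with. The standard here means the character encoding.
2, All characters written in memory are treated alike as Unicode. For example, when we open an editor and type "你", we cannot yet say that "你" is a Chinese character; at that point it is just a symbol, one that may be used in several countries, and its appearance may vary with the input method. Only when we save it to disk or send it over the network is it decided whether "你" is stored as Chinese or as Japanese text, and that is the process of converting Unicode into another encoding. In short, memory always uses Unicode; the only thing we change is the encoding used when storing to disk.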

  • Unicode -----> encode ------> gbk
  • Unicode <------- decode <------- gbk
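In Python 3 the two arrows map onto str.encode() and bytes.decode(). A minimal sketch, using a single character as the sample:

```python
text = "你"                 # str: Unicode, the form used in memory
data = text.encode("gbk")   # bytes: the form written to disk or sent over the network

print(type(text).__name__, len(text))  # one character
print(type(data).__name__, len(data))  # two bytes in gbk

# Decoding with the same standard that was used for encoding restores the text.
restored = data.decode("gbk")
print(restored == text)
```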

Original source: nikechen121's character encoding blog

Origin www.cnblogs.com/FirstReed/p/11682111.html