2019.08.07 study notes


Character Encoding

1. What is character encoding

Character encoding is the conversion of human-readable characters into numbers that a computer can recognize. The conversion must follow a fixed standard, which is simply a mapping between characters and numbers, called a character code table.

2. History of character encoding and classification

Computers were invented in the United States, so the first character encoding was ASCII, which only defines the mapping between English letters, digits, some special characters, and numbers. ASCII itself uses 7 bits and defines 128 characters. Each character is stored in one byte (8 bits), and since 2 ** 8 = 256, a single byte can represent at most 256 symbols.

[Figure: ASCII table]

For programming languages, which are written in English, ASCII is enough. But when processing data, different countries use different languages: Chinese programmers put Chinese text in their programs, Japanese programmers put Japanese, and Korean programmers put Korean.

A single byte cannot represent Chinese, however: one byte holds at most 256 values, while even a primary school student knows more than two thousand Chinese characters. The only solution is to represent each character with more than 8 bits. The more bits used, the more distinct values are available, and so the more Chinese characters can be expressed.

So China defined its own standard, the GB2312 encoding, which specifies the mapping between characters, including Chinese characters, and numbers.

Japan defined the Shift_JIS encoding, and Korea defined the EUC-KR encoding.

[Figure: summary of the history of character encoding]

At this point a problem arises. Suppose Chow, who is fluent in 18 languages, modestly writes a document in just 8 of them. Whichever country's standard is used to read it, the document will be garbled: each national standard only defines the mapping between its own country's characters and numbers, so under any one of these encodings the text in the other languages cannot be parsed correctly. A world standard that covers every language was urgently needed, and so Unicode was created.

ASCII uses one byte (8 binary bits) per character. Unicode, in the two-byte form described here, uses two bytes (16 binary bits) for a common character and 4 bytes for a rare one.

Example: the letter x is 120 in decimal in ASCII, which is 01111000 in binary.

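The Chinese character 中 is outside the range of ASCII. In Unicode it is 20013 in decimal, which is 01001110 00101101 in binary.

The letter x in Unicode is 00000000 01111000 in binary. Unicode is therefore compatible with ASCII as well as with every other language, which makes it the world standard.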
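The numbers quoted for x and 中 can be checked directly in Python 3. This is only a small sketch: it uses the built-in ord(), chr() and format() functions, and the variable names are chosen here for illustration.

```python
# Code points are just integers; ord() and chr() convert between the two.
x_code = ord("x")
zhong_code = ord("中")

# Format the code points as 16-bit binary strings, matching the two-byte form.
x_bits = format(x_code, "016b")
zhong_bits = format(zhong_code, "016b")

print(x_code, x_bits)
print(zhong_code, zhong_bits)
print(chr(20013))  # back from the number to the character
```

The low 8 bits of the two-byte value for x are the same as its ASCII value, which is what compatibility with ASCII means here.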

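With that, the garbled-text problem disappears, since every document can use Unicode. But a new problem appears: if a document is entirely in English, Unicode takes twice as much space as ASCII, which is very inefficient for storage and transmission.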

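In the spirit of saving space, UTF-8 (Unicode Transformation Format, 8-bit) appeared, which converts Unicode into a variable-length encoding. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on the size of its code point: common English letters take 1 byte, Chinese characters usually take 3 bytes, and only rare characters take 4 bytes. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current standard limits it to 4.) If the text you need to transmit contains a lot of English characters, encoding it in UTF-8 saves space: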

Character  ASCII      Unicode            UTF-8
A          01000001   00000000 01000001  01000001
中         (none)     01001110 00101101  11100100 10111000 10101101

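The table above also shows an additional benefit of UTF-8: ASCII can be regarded as a subset of UTF-8, so a large amount of legacy software that only supports ASCII can continue to work with UTF-8.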
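The byte patterns for A and 中 can be reproduced with str.encode(). In this sketch the two-byte Unicode form is assumed to correspond to Python's "utf-16-be" codec (big-endian, no byte order mark), and to_bits is a helper name made up for this example.

```python
def to_bits(data: bytes) -> str:
    """Render bytes as space-separated groups of 8 binary digits."""
    return " ".join(format(byte, "08b") for byte in data)

for char in ("A", "中"):
    two_byte = char.encode("utf-16-be")  # assumed stand-in for the two-byte form
    utf8 = char.encode("utf-8")
    print(char, to_bits(two_byte), "|", to_bits(utf8))

# ASCII-only text produces exactly the same bytes under ASCII and UTF-8.
ascii_bytes = "hello".encode("ascii")
same_as_utf8 = ascii_bytes == "hello".encode("utf-8")
print(same_as_utf8)
```

Because the bytes are identical for ASCII-only text, a file written by an ASCII-only program can be read as UTF-8 without any conversion.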

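3. Why doesn't memory use UTF-8?

Memory still uses Unicode largely for historical reasons. Because code is now mostly written in UTF-8, and the legacy national encodings are gradually being phased out, the encoding used in memory may eventually change from Unicode to UTF-8 as well.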

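4. Analyzing garbled text

First, define the terms:

  • Flushing a file from memory to disk is called saving the file
  • Reading a file from disk into memory is called reading the file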

There are two cases of garbled text:

  • Case 1: the text is already garbled when the file is saved

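Suppose a file contains text from several countries and we save it using only Shift_JIS. The characters from the other countries have no mapping in Shift_JIS, so in essence they cannot be stored. If we force the save anyway, the editor does not report an error (an encoding mistake should not crash the editor), but forcing through something that cannot be stored means it is stored incorrectly. The text is therefore already garbled at the save stage: when the file is opened with Shift_JIS, the Japanese displays normally but the Chinese is garbled.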
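A rough sketch of case 1 in Python 3 follows. Korean is used as the foreign text here, as an assumption made for the example, because many Chinese characters also exist in Shift_JIS as kanji and would encode without error. Passing errors="replace" plays the role of the editor forcing the save.

```python
text = "日本語한국어"  # Japanese followed by Korean

# Strict encoding refuses characters that Shift_JIS has no mapping for.
try:
    text.encode("shift_jis")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Forcing the save: each unmappable character is replaced with "?".
lossy = text.encode("shift_jis", errors="replace")
restored = lossy.decode("shift_jis")

print(strict_failed)
print(restored)
```

The Korean characters are gone from the stored bytes, so no choice of decoding can bring them back.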

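  • Case 2: the text is saved correctly but garbled when read

Saving with UTF-8 is compatible with every language, so nothing is garbled at the save stage. If the wrong decoding is chosen when reading, for example GBK, the text is garbled at the read stage. This kind of garbling can be fixed: just choose the correct decoding.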
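Case 2 can be sketched the same way. The bytes on disk are correct UTF-8, and only the decoding step goes wrong. errors="replace" is passed to the GBK decode as a precaution, so the example does not stop on byte sequences GBK does not define.

```python
text = "中文"
data = text.encode("utf-8")  # what gets stored on disk

# Reading with the wrong decoding produces garbled text.
garbled = data.decode("gbk", errors="replace")

# Reading the same bytes with the right decoding recovers the original.
fixed = data.decode("utf-8")

print(garbled)
print(fixed)
```

The stored bytes are untouched in both reads, which is why this case is recoverable.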

5. Summary

  • Unicode -----> encode -----> gbk
  • Unicode <----- decode <----- gbk
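In Python 3, where str holds Unicode text, the two directions in the summary correspond to str.encode() and bytes.decode(). A minimal sketch:

```python
text = "中文"

encoded = text.encode("gbk")     # Unicode (str) -> GBK bytes
decoded = encoded.decode("gbk")  # GBK bytes -> Unicode (str)

print(encoded)
print(decoded)
```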

[Figure: conversion between Unicode and other encodings such as UTF-8]

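Character encoding in Python 2 and 3

1. Character encoding applied to Python

1.1 The three stages of running a Python program

  • Stage 1: start the Python interpreter
  • Stage 2: the interpreter acts as a text editor and opens the file test.py, that is, it reads the contents of test.py from disk into memory
  • Stage 3: the interpreter reads the code that has been loaded into memory (in Unicode format) and executes it; execution may allocate new memory, for example for name="nick"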

Three ways to open a file

There are three basic modes for file operations (the default is r mode):

  • r mode: read
  • w mode: write
  • a mode: append

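There are two formats for the content that is read or written (the default is t mode):

  • t mode: text
  • b mode: bytes

Note that t and b cannot be used on their own; each must be combined with one of r/w/a.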

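File open mode: r mode

r: read. Read-only mode: the file can be read but not written, and an error is raised if the file does not exist.

After f.read() the file pointer is at the end of the file, so a second read returns an empty string.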
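The behaviour of r mode can be tried out with a throwaway file. This sketch creates the file in a temporary directory so it does not depend on anything already on disk; the file names are made up for the example.

```python
import os
import tempfile

# Create a small file to read from.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "wt", encoding="utf-8") as f:
    f.write("hello")

with open(path, "rt", encoding="utf-8") as f:
    first = f.read()   # reads everything; the pointer is now at the end
    second = f.read()  # nothing left to read
    f.seek(0)          # move the pointer back to the start
    third = f.read()

# Opening a file that does not exist in r mode raises FileNotFoundError.
missing = os.path.join(os.path.dirname(path), "no_such_file.txt")
try:
    open(missing, "r")
    raised = False
except FileNotFoundError:
    raised = True

print(repr(first), repr(second), repr(third), raised)
```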

File open mode: w mode

w: write only, cannot read. If the file exists, it is emptied first and then the new content is written. If the file does not exist, it is created and then the content is written.
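A short sketch of w mode, again using a temporary file with a made-up name. It shows that the file is created when missing, that reopening in w mode discards the old content, and that reading from a write-only file fails.

```python
import io
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "w_demo.txt")
existed_before = os.path.exists(path)

# The file does not exist yet, so w mode creates it.
with open(path, "w", encoding="utf-8") as f:
    f.write("first version, fairly long")

# Opening again in w mode empties the file before writing.
with open(path, "w", encoding="utf-8") as f:
    f.write("second")

with open(path, "r", encoding="utf-8") as f:
    content = f.read()

# A file opened in w mode cannot be read (this open also empties the file).
with open(path, "w", encoding="utf-8") as f:
    try:
        f.read()
        read_failed = False
    except io.UnsupportedOperation:
        read_failed = True

print(existed_before, repr(content), read_failed)
```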

File open mode: a mode

a: append. If the file exists, new content is written at the end of the file. If the file does not exist, it is created and then the content is written.
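The same kind of sketch for a mode, with a temporary file whose name is chosen for the example:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "a_demo.txt")

# The file does not exist yet, so a mode creates it.
with open(path, "a", encoding="utf-8") as f:
    f.write("line 1\n")

# The file now exists, so the new content goes at the end.
with open(path, "a", encoding="utf-8") as f:
    f.write("line 2\n")

with open(path, "r", encoding="utf-8") as f:
    content = f.read()

print(repr(content))
```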

Opening a file in b mode

b mode is a general-purpose mode, because all files are stored on disk in binary form. Note that when reading or writing a file in b mode you must not pass the encoding parameter, because binary data cannot be encoded any further.
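A minimal sketch of b mode, assuming a temporary file with a made-up name. Text has to be encoded to bytes before it is written, and open() rejects the encoding parameter in binary mode with a ValueError.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "b_demo.bin")

# b mode works with bytes, so the text is encoded by hand before writing.
with open(path, "wb") as f:
    f.write("中".encode("utf-8"))

with open(path, "rb") as f:
    raw = f.read()  # bytes, not str

# Passing encoding together with b mode is rejected.
try:
    open(path, "rb", encoding="utf-8")
    encoding_rejected = False
except ValueError:
    encoding_rejected = True

print(raw, raw.decode("utf-8"), encoding_rejected)
```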

Managing file operations with a with context

The with open() statement automatically releases the operating-system resources held by the file. A single with statement can also open several files at once, separated by commas, which makes copying a file quick.

with open('32.txt', 'rb') as fr, \
        open('35r.txt', 'wb') as fw:
    # Read every byte from the source file and write it to the copy.
    fw.write(fr.read())

Origin www.cnblogs.com/zhangmingyong/p/11316974.html