Python: Computer Encoding and Character Encoding

1 Encoding basics

1.1 Concepts

  • Information: in a narrow sense, the signals that people need to convey to one another in order to communicate.

  • Data: the medium that carries information. It comes in many formats: numbers, audio, video, text and so on. Simply put, the digit 1 is a tool, while the quantity it expresses is the content; so "one apple" or "one person" is the signal we want to convey in an understandable form.

  • Encoding: the process of converting information from one form (such as Chinese characters) into another. Broadly speaking, an encoding also carries data; it is a medium of information, the medium the computer works with. A computer cannot read Chinese characters directly, but through encoding it can recognize and operate on them.

  • Decoding: the process by which the receiver restores the original information from the symbols or codes it received; it is the reverse of encoding.

 In simple terms, turning plain text into encoded text is called "encoding", and interpreting encoded text back into plain text is called "decoding".

1.2 Six common text encodings

  1. ASCII: a computer stores data in units that are either on (1) or off (0); one such unit is called a bit. The Americans originally used eight bits to represent one character, which allows 256 different characters; 26 English letters plus some special symbols fit comfortably. For example, the character A -> 01000001.

  2. ANSI: extended ASCII. Early ASCII used only 128 characters, so the highest (eighth) bit was always zero. To represent more characters, that eighth bit was put to use, giving 256 characters in total, including a number of Latin letters.

  3. GB2312: as computers spread rapidly in China, these encodings arrived as well, but none of them could represent Chinese characters. So in 1980 China independently developed the GB2312 encoding. The extended Latin characters that used the high bit were dropped: byte values up to 127 kept their meaning and still represent English letters and special characters, while a byte greater than 127 combined with another byte greater than 127 represents one specific Chinese character. In short, two bytes that are both greater than 127 denote one Chinese character. It can represent more than 7,000 Chinese characters in total and can be seen as an extension of ASCII.

  4. GBK and GB18030: 7,000 characters, however, was still not enough. So the scheme was extended again: whenever the first byte is greater than 127, the byte that follows, whatever its value, is treated as part of a Chinese character. On top of GB2312 this added more than 20,000 Chinese characters (including some traditional characters and symbols).

  5. UNICODE: this encoding represents every character with two bytes: 256 * 256 = 65,536, so roughly 65,000 different characters can be expressed, enough for every meaningful character in the world (even oracle bone script). You could call it the "UNIQUE CODE", the world's one universal encoding.

  6. UTF-8: with Unicode available, why invent yet another encoding? Because a single byte was already enough to store all 26 English letters, so using two bytes for them wastes storage. To solve this, American experts created the highly flexible UTF-8, which uses 1 to 4 bytes per character. A character whose code is no greater than 127 is stored exactly as in ASCII, and the length varies with the kind of symbol: commonly used Chinese characters (20,000+) take 3 bytes, while rare, hard-to-write characters take 4 bytes.

So what do these encodings actually look like? You can experiment with an online encoding converter, or check byte lengths directly in Python, as the sketch below shows.
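As a rough illustration (a minimal Python 3 sketch added here; it is not part of the original article), the byte lengths described above can be checked directly:

# Compare how many bytes one English letter and one Chinese character need
# under different encodings (utf-16-be is used to avoid the byte-order mark).
for enc in ('ascii', 'gb2312', 'gbk', 'utf-16-be', 'utf-8'):
    try:
        print(enc, len('A'.encode(enc)), len('汉'.encode(enc)))
    except UnicodeEncodeError:
        print(enc, 'cannot represent the Chinese character')

On a typical run this prints 1 and 2 bytes for GB2312/GBK, 2 and 2 for UTF-16, 1 and 3 for UTF-8, and reports that ASCII cannot represent the Chinese character at all.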

1.3 Encoding and transcoding

  1. Unicode encoding:

    str = 'U是一只小狗'   # plain text we can read, containing both English ('U') and Chinese; it means "U are a puppy"

The following is the Unicode notation: \u serves as an identifier, followed by four hexadecimal digits.

Take the letter and one of the Chinese characters as an example:

U: 0055 
小: 5c0f

The values above are hexadecimal. Why hexadecimal? Because one hexadecimal digit can express any four-bit binary number, so two hexadecimal digits describe exactly one byte; the written length shrinks from 8 characters to 2, which reduces redundancy. Converted to binary:

U: 00000000 01010101 
小: 01011100 00001111

It is easy to see that Unicode stores both the English letter and the Chinese character in two bytes.
Sometimes the Unicode code of U is also written as &#85;. The \u prefix and the &# prefix refer to the same code point: \u introduces the hexadecimal Unicode notation, while &# introduces the decimal notation (and &#x the hexadecimal one).

A leading 0x marks a hexadecimal number, \u marks a Unicode code point, and \x marks UTF-8 encoded byte data.
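A small Python 3 sketch (added for illustration, not from the original) reproduces these numbers:

# Inspect the Unicode code points of 'U' and '小'.
print(hex(ord('U')))          # 0x55   -> written as \u0055
print(hex(ord('小')))         # 0x5c0f -> written as \u5c0f
print('\u0055', '\u5c0f')     # the \u escapes give back 'U' and '小'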

  2. UTF-8 encoding:

U: 01010101    小: 11100101 10110000 10001111
English ASCII characters are still represented by a single byte under this rule, while Chinese characters take 2-4 bytes (commonly 3). Since program text consists mostly of English characters, UTF-8 saves a great deal of storage compared with Unicode when the amount of text is large.
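To make the space argument concrete, here is a hedged Python 3 sketch (mine, with made-up sample text) comparing UTF-8 with a fixed two-byte encoding:

# Mostly English text with a little Chinese (illustrative data only).
text = 'print("hello") ' * 1000 + '你好' * 10
print(len(text.encode('utf-8')))      # close to the number of characters
print(len(text.encode('utf-16-be')))  # roughly twice as large

The UTF-8 size stays close to the character count, while the fixed two-byte encoding roughly doubles it.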

2 Python encoding

  Python is a high-level language that people use to write programs for computers. When we write plain text in it, Python converts that text internally into a default character encoding so the computer can process it. So we should understand the default encodings of Python 2 and Python 3, and how to change them.

2.1 Python 2 encoding

1> Python 2's default encoding type is ASCII:

The default encoding is ASCII and the default string type is str. But if the IDE does not specify UTF-8, ANSI or some other text encoding, the interpreter complains, because ASCII cannot represent Chinese characters.
To fix this we add:

# -*- coding: utf-8 -*-

When strings are written with the IDE's default settings, the text is encoded as UTF-8:

2> In Python 2, adding the u prefix shows the Unicode encoding:

We can see that ASCII-style strings are marked by Python 2 as type str, Unicode-style strings are marked as type unicode, and strings in other external encodings are also marked as type str.
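Since the original screenshots are not reproduced here, the following is a hedged reconstruction of what such a Python 2 session might look like (assuming a UTF-8 terminal):

>>> s = '小'                  # str: the terminal's raw UTF-8 bytes
>>> s
'\xe5\xb0\x8f'
>>> u = u'小'                 # unicode: the code point
>>> u
u'\u5c0f'
>>> type(s), type(u)
(<type 'str'>, <type 'unicode'>)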

So how do we convert to other encodings?

  • Unicode to GBK:

Encoded text is for machines; plain text is for people. print simply prints the plain text of the current encoding.
We find that text encoded as GBK displays as mojibake, because the two encodings are incompatible (the terminal is not decoding with the GBK rule).
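A sketch of the idea (my reconstruction, assuming Python 2 and a UTF-8 terminal):

>>> u = u'小'
>>> g = u.encode('gbk')       # two GBK bytes for one Chinese character
>>> len(g)
2
>>> print(g)                  # a UTF-8 terminal misreads GBK bytes -> mojibake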

  • Unicode to UTF-8:
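Again a hedged Python 2 sketch (assuming a UTF-8 terminal):

>>> u = u'小'
>>> u.encode('utf-8')
'\xe5\xb0\x8f'
>>> print(u.encode('utf-8'))  # the UTF-8 terminal decodes these bytes correctly
小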

 **3> Summary:** Python 2 has just two string types: str and unicode. Text in other encodings such as UTF-8 or GBK is also shown as the default str type, displayed as a series of ASCII-escaped bytes, although the real ASCII character set does not actually contain the plain text behind those encodings.

4> Decoding with decode: this turns encoded bytes back into plain text. Since every encoding has its own rules, the matching decoding rule must be chosen; otherwise an error is raised:
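For instance (an illustrative Python 2 sketch; the byte string is the GBK encoding of u'你好'):

>>> b = '\xc4\xe3\xba\xc3'    # GBK bytes of u'你好'
>>> b.decode('utf-8')         # wrong rule for these bytes: raises UnicodeDecodeError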

Decoding correctly:
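Continuing the sketch above, decoding with the matching rule restores the plain text:

>>> b = '\xc4\xe3\xba\xc3'    # GBK bytes of u'你好'
>>> b.decode('gbk')
u'\u4f60\u597d'
>>> print(b.decode('gbk'))
你好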

2.2 Python 3 encoding

 1> Python 3 also has two types, which I find closer to the way people actually think about it: the plain-text string and the byte representation:

str is the plain-text string, and bytes is the byte representation (written in hexadecimal).

 In short, Python 3 matches human reasoning better: encode produces encoded text that the machine can read, and decode interprets it back into plain text that people can read.
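A short Python 3 sketch of the two types (added here for illustration):

>>> s = '小'                  # str: plain text
>>> b = s.encode('utf-8')     # bytes: the byte representation
>>> b
b'\xe5\xb0\x8f'
>>> type(s), type(b)
(<class 'str'>, <class 'bytes'>)
>>> b.decode('utf-8')         # decode back to plain text
'小'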

2.3 My personal understanding of Python encoding

  In my view, when you type a string into the Python 2 interactive shell without an IDE and without the u prefix, it fails because the default encoding is ASCII, which contains no Chinese characters. With the u prefix the string is Unicode, which is decoded and displayed directly as plain text. If you enter plain text, it is decoded and shown as plain text by default, and the internal encoding type is Unicode.

 In Python 3 there are also two string types, but there str means the plain-text string and bytes means the encoded byte representation.

3 More about encoding

3.1 Default encodings in software

 In fact, when I looked into this, some people say that many programs internally choose Unicode as their default storage encoding, because Unicode is also known as the universal code: it can hold more than 65,000 symbols, covering Chinese, English, Latin, Japanese and so on, all-inclusive and with the best compatibility. Even when we set a program's display encoding to ANSI or UTF-8, the text is first converted to Unicode before it is stored.

 Whether the internal storage encoding is really Unicode or something else such as GBK, ANSI or UTF-8 does not matter much. We only need to know the following:

  •   1> When we interact with software, we type plain text that we can read, but the computer itself cannot understand it. So the software must contain an encoding that turns plain text into encoded text; programs such as the cmd window, Notepad and Word each pick a default encoding when they open. What are those defaults? Word, for instance, defaults to Unicode, because Unicode is the all-inclusive universal code with the best compatibility.

  •   2> When we save a file to a storage medium, the plain text has already been turned by the software into some recorded form according to an encoding rule; whatever the rule, the recording unit is the byte, ultimately an on/off storage cell.

  •   3> When we later find the file on the storage medium and open it, the software automatically decodes it back into plain text we can read.

 Python 2's default encoding/decoding is ASCII, and Python 3's is UTF-8. It can be queried like this:

import sys
print(sys.getdefaultencoding())

3.2 Declaring the encoding of a .py file

The declaration below says that this .py file is stored as UTF-8:

# -*- coding: utf-8 -*-

Exactly: if the Python 2 interpreter executes a UTF-8 encoded file, it decodes that UTF-8 with its default ASCII, and as soon as the program contains Chinese the decoding fails. So we declare # coding: utf8 at the top of the file, which tells the interpreter not to decode the file with its default encoding but with UTF-8. The Python 3 interpreter defaults to UTF-8, which makes things much easier.
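A minimal sketch of such a file (my illustration, assuming Python 2):

# -*- coding: utf-8 -*-
# Without the line above, Python 2 stops with
# "SyntaxError: Non-ASCII character ..." as soon as a Chinese literal appears.
s = u'小狗'
print(s)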

3.3 Common mojibake or errors when opening files

 When we save a text file and open it with Python's built-in f = open(file), we sometimes find that the content of f is mojibake, or that an error is raised. This happens because the file may be saved in ANSI while Python 3 decodes with UTF-8 by default, which produces mojibake; or because Windows defaults to GBK decoding, so opening a UTF-8 file raises an error.

 In that case we can specify the encoding explicitly: f = open(file, encoding='utf-8')
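A short sketch (mine; the file name demo.txt and its UTF-8 content are assumed):

# Open the file with the encoding it was actually saved in.
f = open('demo.txt', encoding='utf-8')
print(f.read())
f.close()
# If the real encoding is ANSI/GBK, pass encoding='gbk' instead,
# or add errors='replace' to avoid a UnicodeDecodeError.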


Origin blog.csdn.net/qq_40260867/article/details/86430731