Introduction to Python Basics, Lesson 4: Character Encoding

    This section was originally meant to go together with the variables and functions of the previous lesson, but since this is a basic introduction, packing in too much at once makes it hard to follow, so this part is explained separately. Here we mainly discuss problems related to character encoding.

1. Character encoding

   In the C language, one of the most basic and important concepts we learned was the string, a type we meet constantly and use all the time in programming. Before mastering how a type is used, it is still worthwhile to understand how it developed.

  •  ASCII code

    ASCII was first published as a formal standard in 1967 and last updated in 1986; it defines a total of 128 characters. 33 of them are non-printable (some terminals provide extensions that render these code points as graphic symbols such as smiley faces and playing-card suits); most of these 33 are now-obsolete control characters, whose purpose is mainly to control the handling of text. The other 95 characters are printable; the blank produced by pressing the space bar on the keyboard also counts as a printable character (displayed as a blank).

    USASCII code chart: (chart image omitted)


    But there were some problems: at the time, computers were effectively the preserve of Latin-script users, and nobody anticipated how computing would develop; had they anticipated it, they might have used Unicode from the start. Most experts then believed that to use a computer one had to be proficient in English. This encoding occupies 7 bits. Since a byte in the computer has 8 bits, the highest bit goes unused, and was sometimes used as a parity bit during communication. The value range of ASCII is therefore 0x00-0x7F, only 128 characters. Later, 128 proved insufficient, so the encoding was extended to eight bits, called extended ASCII, and the value range became 0x00-0xFF, 256 characters. In itself this extension meant little: 256 characters is still nowhere near enough for non-Latin scripts, even though it is more than enough for Latin ones. The real significance of the extension was to serve the ANSI encodings described below.
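To make the 7-bit range concrete, here is a quick Python check (a small sketch; the codec name is Python's standard one):

```python
# Every ASCII character fits in 7 bits, i.e. a code point in 0x00-0x7F.
for ch in "Az09 ~":
    print(f"{ch!r} -> {ord(ch):#04x}")
    assert ord(ch) <= 0x7F

# A non-Latin character falls outside ASCII and cannot be encoded:
try:
    "中".encode("ascii")
except UnicodeEncodeError:
    print("'中' has no ASCII encoding")
```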

  • ANSI

    ANSI stands for the American National Standards Institute. In practice it means that each (non-Latin-script) country formulated its own encoding rules for its own characters, and those rules were recognized by ANSI and conform to ANSI standards; such an encoding is called an ANSI encoding. In other words, "ANSI encoding" means something different in China than in Japan, because each refers to that country's own national text-encoding standard. For example, China's ANSI encoding corresponds to the GB2312 standard, Japan's to the JIS standard, Hong Kong's and Taiwan's to the BIG5 standard, and so on.

    So how many bytes does an ANSI encoding use? It depends! In GB2312, GBK, and BIG5, for example, a character takes two bytes, but for other standards or other languages two bytes may not be enough, and more may be needed!
For example, GB18030:
GB18030-2000 (GBK2K) further extends the Chinese-character repertoire on top of GBK, adding glyphs for minority scripts such as Tibetan and Mongolian. GBK2K fundamentally solves the problem of insufficient code points and glyphs. It has several characteristics: it does not fix all glyphs in advance, but only specifies the coding ranges, which are reserved for future expansion.
The encoding is variable-length; its two-byte part is compatible with GBK. The four-byte part covers the extended glyphs and code points, with the ranges: first byte 0x81-0xFE, second byte 0x30-0x39, third byte 0x81-0xFE, fourth byte 0x30-0x39. Its rollout is staged; the first requirement is to implement all glyphs that can be fully mapped to the Unicode 3.0 standard. It is a national standard and is mandatory.
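Based on the byte ranges just listed, here is a sketch of a four-byte-sequence check (the helper name is ours; Python's gb18030 codec does the real encoding):

```python
def is_gb18030_four_byte(seq: bytes) -> bool:
    # Four-byte GB18030: bytes 1 and 3 in 0x81-0xFE, bytes 2 and 4 in 0x30-0x39.
    return (len(seq) == 4
            and 0x81 <= seq[0] <= 0xFE and 0x30 <= seq[1] <= 0x39
            and 0x81 <= seq[2] <= 0xFE and 0x30 <= seq[3] <= 0x39)

# A Tibetan letter is outside GBK's two-byte range, so it needs four bytes:
tibetan = "ཀ".encode("gb18030")
print(tibetan.hex(), is_gb18030_four_byte(tibetan))
```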

After understanding what ANSI means, we can see that it has a fatal flaw: each standard is independent, and compatibility between them cannot be guaranteed. In other words, if you want to display Chinese together with Japanese or Arabic, it is entirely possible that a given code exists in both character sets, and you cannot know which one to display; this is the problem of overlapping encodings. Such a scheme is obviously no good, and that is why Unicode appeared!

  • MBCS

    Computers developed very quickly, and single-byte ASCII encoding could no longer meet demand. Computers also had to handle many other languages, each with its own encoding, and so a character-set encoding scheme called Multibyte Character Set (MBCS) support appeared.

    A multibyte character system, or character set, builds on the principle of ANSI encodings: there is no fixed answer to how many bytes a given character occupies; that can only be distinguished and interpreted from the encoded bytes themselves. So when the computer stores text, it uses a multibyte form: each character gets however many bytes it needs. For example, 'A' gets one byte, while '中' gets two. This style of character representation is MBCS.

On GBK-based Windows, a character never exceeds 2 bytes, so this Windows representation is also called DBCS (Double-Byte Character System), which is really a special case of MBCS. The C language stores strings in MBCS format by default. In principle, this is a very economical approach.
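The same variable-width idea can be seen from Python by encoding to GBK (Python here stands in for what C does natively):

```python
# 'A' occupies one byte and '中' two under GBK, so the string packs into 3 bytes.
data = "A中".encode("gbk")
print(data, len(data))  # b'A\xd6\xd0' 3
# The lead byte marks the width: lead bytes of two-byte GBK characters are >= 0x81.
assert data[0] <= 0x7F and data[1] >= 0x81
```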

  • Unicode

     Unicode is an encoding scheme: put plainly, a single code table containing all the writing of the world. Whether or not you ever use them, whether they are in use now or were used in the past, every character or symbol that exists in the world gets a unique code, so no conflict is possible at all. Whatever scripts you want to display at the same time, there is no problem.
This is the scheme under which Unicode appeared. Unicode's code range is 0-0x10FFFF, room for 1,114,112 characters, over a million. The world's characters come nowhere near using it up: as of Unicode 5.0, only 238,605 code points were assigned. So it is more than enough.
Looking at the code-point range, strictly speaking Unicode needs 3 bytes to store, but for ease of understanding and of machine processing, it is in theory described with 4 bytes.
The Chinese-character encoding Unicode adopts is the "CJK Unified Ideographs" character set. The national standard GB13000.1 is fully identical to the international standard ISO 10646.1, the "Universal Multiple-Octet Coded Character Set (UCS)". The most important and most often used part of GB13000.1 is its double-byte form, the Basic Multilingual Plane. In this space of 65,536 code points, the languages and symbols of almost every country or region are defined. The contiguous region from 0x4E00 to 0x9FA5 contains 20,902 Han characters from China (including Taiwan), Japan, and Korea, called the CJK (Chinese Japanese Korean) ideographs. CJK is a superset of character sets such as GB2312-80 and BIG5.

     CJK covers China, Japan, Korea, Vietnam, and Hong Kong, hence also CJKVH. This can be seen clearly in Unicode's charset charts. The relevant Unicode standards can be obtained from unicode.org; at the time of writing, the standard has reached version 6.0.
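A quick way to test for this range in Python (a sketch; the helper name is ours):

```python
def in_cjk_unified(ch: str) -> bool:
    # The contiguous CJK Unified Ideographs run cited above: 0x4E00-0x9FA5.
    return 0x4E00 <= ord(ch) <= 0x9FA5

print(in_cjk_unified("汉"), in_cjk_unified("A"))  # True False
```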

    Unicode implementation schemes:

Unicode itself is really just one huge code table. To realize it inside a computer, several different schemes appeared, that is, different answers to the question of how to represent Unicode code points.
(1) UTF-8 (UCS Transformation Format, 8-bit)
This scheme uses 8-bit units to represent text; note that this does not mean a character is represented in 8 bits. It is actually an MBCS scheme with variable-length bytes. How many bytes a symbol needs is decided by its Unicode code point, up to a maximum of 4 bytes.
The encoding rules are as follows:
Unicode code point (hex) ║ UTF-8 byte stream (binary)
000000 - 00007F ║ 0xxxxxxx
000080 - 0007FF ║ 110xxxxx 10xxxxxx
000800 - 00FFFF ║ 1110xxxx 10xxxxxx 10xxxxxx
010000 - 10FFFF ║ 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
UTF-8's defining feature is that it uses encodings of different lengths for characters in different ranges. For characters between 0x00 and 0x7F, UTF-8 is exactly identical to ASCII.
The maximum length of a UTF-8 encoding is 4 bytes. As the table above shows, the 4-byte template has 21 x's, i.e. room for a 21-bit binary number, and Unicode's maximum code point 0x10FFFF is exactly 21 bits.
Example 1: the Unicode code point of '汉' is 0x6C49. Since 0x6C49 lies between 0x0800 and 0xFFFF, the 3-byte template 1110xxxx 10xxxxxx 10xxxxxx is used. Writing 0x6C49 in binary gives 0110 1100 0100 1001; substituting these bits for the x's in the template yields 11100110 10110001 10001001, i.e. E6 B1 89.
Example 2: the Unicode code point 0x20C30 lies between 0x010000 and 0x10FFFF, so the 4-byte template 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx is used.
Writing 0x20C30 as a 21-bit binary number (padding with leading zeros up to 21 bits): 0 0010 0000 1100 0011 0000. Substituting these bits for the x's in the template yields 11110000 10100000 10110000 10110000, i.e. F0 A0 B0 B0.
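The two worked examples can be checked by hand-rolling the four templates in Python (a sketch; Python's own utf-8 codec is used only to cross-check):

```python
def utf8_encode(cp: int) -> bytes:
    # Apply the template matching the code-point range from the table above.
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])

print(utf8_encode(0x6C49).hex())   # e6b189, as in Example 1
print(utf8_encode(0x20C30).hex())  # f0a0b0b0, as in Example 2
assert utf8_encode(0x6C49) == "汉".encode("utf-8")
```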

(2) UTF-16
UTF-16 uses 16-bit unsigned integers as its unit. Note: 16 bits is the unit; it does not mean each character takes only 16 bits. The "Unicode encoding" on today's machines generally means UTF-16. For the vast majority of characters 2 bytes are enough, but you absolutely cannot say every character is 2 bytes: it depends on the range the character's Unicode code point falls in, and may be 2 bytes or 4. Please note this!
The following explanation of the algorithm comes from Baidu Baike.

We write the Unicode code point as U. The encoding rules are as follows:
  If U < 0x10000, U's UTF-16 encoding is simply the 16-bit unsigned integer equal to U (for brevity, a 16-bit unsigned integer is written WORD below).
  If U ≥ 0x10000, we first compute U' = U - 0x10000, then write U' in binary as yyyy yyyy yyxx xxxx xxxx; U's UTF-16 encoding (in binary) is then 110110yyyyyyyyyy 110111xxxxxxxxxx. Why can U' always be written in 20 binary digits? Unicode's maximum code point is 0x10FFFF; subtracting 0x10000 leaves a maximum U' of 0xFFFFF, which certainly fits in 20 binary digits.
For example: for the Unicode code point 0x20C30, subtracting 0x10000 gives 0x10C30, which in binary is 0001 0000 1100 0011 0000. Substituting the first 10 bits for the y's in the template and the last 10 bits for the x's gives 1101100001000011 1101110000110000, i.e. 0xD843 0xDC30.
By the rules above, the UTF-16 encoding of a Unicode code point in 0x10000-0x10FFFF consists of two WORDs: the high 6 bits of the first WORD are 110110, and the high 6 bits of the second are 110111. Hence the first WORD's value range (binary) is 11011000 00000000 to 11011011 11111111, i.e. 0xD800-0xDBFF, and the second WORD's value range (binary) is 11011100 00000000 to 11011111 11111111, i.e. 0xDC00-0xDFFF.
  To distinguish one-WORD UTF-16 encodings from two-WORD ones, the designers of Unicode reserved the range 0xD800-0xDFFF, called the surrogate area:
D800-DB7F ║ High Surrogates
DB80-DBFF ║ High Private Use Surrogates
DC00-DFFF ║ Low Surrogates
A high surrogate is a code point in this range used as the first WORD of a two-WORD UTF-16 encoding; a low surrogate is one used as the second WORD. So what does "high private use surrogate" mean? Let's answer that question, and along the way see how to derive the Unicode code point back from a UTF-16 encoding.
If the first WORD of a character's UTF-16 encoding lies between 0xDB80 and 0xDBFF, what range does its Unicode code point fall in? We know the second WORD's range is 0xDC00-0xDFFF, so the character's UTF-16 encoding ranges from 0xDB80 0xDC00 to 0xDBFF 0xDFFF. Writing this range in binary: 11011011 10000000 11011100 00000000 to 11011011 11111111 11011111 11111111. Reversing the encoding steps, take the low 10 bits of the high and low WORDs and join them, giving 1110 0000 0000 0000 0000 to 1111 1111 1111 1111 1111,
i.e. 0xE0000-0xFFFFF; continuing to reverse the encoding steps, add back 0x10000 to get 0xF0000-0x10FFFF. This is the Unicode range for characters whose first UTF-16 WORD lies between 0xDB80 and 0xDBFF: planes 15 and 16. Because the Unicode standard designates planes 15 and 16 as private use areas, the reserved code points between 0xDB80 and 0xDBFF are called high private use surrogates.
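The rules above translate directly into a few lines of Python (a sketch; WORD values are returned as plain integers):

```python
def utf16_encode(cp: int) -> list:
    # One WORD below 0x10000; otherwise a surrogate pair built from U' = U - 0x10000.
    if cp < 0x10000:
        return [cp]
    u = cp - 0x10000                  # fits in 20 bits
    return [0xD800 | (u >> 10),       # first WORD:  110110yyyyyyyyyy
            0xDC00 | (u & 0x3FF)]     # second WORD: 110111xxxxxxxxxx

print([hex(w) for w in utf16_encode(0x20C30)])  # ['0xd843', '0xdc30']
```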

(3) UTF-32
This one is simple: it corresponds essentially one-to-one with the Unicode code table, at a fixed four bytes per character.

Why not adopt UTF-32 everywhere? Because the range Unicode defines is huge, while 99% of people use characters whose encodings do not exceed 2 bytes. Uniformly using 4 bytes would certainly be simple, but the data redundancy would be far too great, which is no good, so 16 bits is the best base. Even when we meet a character that cannot be represented in 16 bits, the surrogate technique described above lets us represent it with 32 bits, so this scheme is the best. That is why the great majority of machines today implement Unicode with UTF-16, though UTF-8 schemes also exist: for example, Windows uses the UTF-16 scheme, while much of Linux uses UTF-8.
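The size trade-off is easy to see by encoding the same characters in all three schemes (the -be codec variants avoid adding a BOM to the counts):

```python
# Byte counts per character: ASCII, a common Chinese character, and a rare one.
for text in ("A", "中", chr(0x20C30)):
    sizes = {enc: len(text.encode(enc))
             for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    print(repr(text), sizes)
```

For 'A' the counts are 1/2/4, for '中' 3/2/4, and for the rare character 4/4/4, which is why UTF-16 is the most economical choice for mostly-CJK text.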

(Source: CSDN, "ASCII, ANSI, MBCS, and UNICODE character sets explained")

2. What is the difference between Unicode and UTF-8?

    For a long time Unicode could not be popularized, until the Internet appeared. To solve the problem of how to transmit Unicode over the network, many transmission-oriented UTF (UCS Transfer Format) standards emerged. As the names suggest, UTF-8 transmits data 8 bits at a time, and UTF-16 16 bits at a time. UTF-8 is the most widely used implementation of Unicode on the Internet; it is an encoding designed for transmission, and it makes encoding borderless, so the characters of every culture in the world can be displayed. UTF-8's biggest feature is that it is a variable-length encoding: it uses 1 to 4 bytes per symbol, with the length varying by symbol. When a character is in the ASCII range it takes one byte, keeping ASCII's one-byte encodings as a subset of UTF-8. Note that a Chinese character occupies 2 bytes in Unicode (UTF-16) but 3 bytes in UTF-8. Going from a Unicode code point to UTF-8 is not a direct correspondence; it passes through algorithms and rules for conversion.

Unicode code-point range | UTF-8 encoding

(hexadecimal) | (binary)

0000 0000-0000 007F | 0xxxxxxx

0000 0080-0000 07FF | 110xxxxx 10xxxxxx

0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx

0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
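The table can be read off as a byte-count rule (a sketch; the helper name is ours):

```python
def utf8_length(cp: int) -> int:
    # Pick the row of the table above that contains the code point.
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4

# A Chinese character takes 2 bytes in UTF-16 but 3 in UTF-8:
assert utf8_length(ord("中")) == 3 == len("中".encode("utf-8"))
assert len("中".encode("utf-16-be")) == 2
```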

Summary
By extending and adapting ASCII for Chinese, China developed the GB2312 encoding, which can represent more than 6,000 commonly used Chinese characters. But there are far more Chinese characters than that, including traditional forms and all kinds of rare characters, so the GBK encoding was produced; it includes everything in GB2312 and adds a great deal more. China is also a multi-ethnic country, and almost every ethnic group has its own independent script; to represent those characters, GBK was expanded further into the GB18030 encoding.

Like China, every country encoded its own language, so all sorts of encodings exist; without the corresponding encoding installed, a computer cannot interpret what the corresponding bytes are meant to express. Finally an organization, ISO, could not stand it any longer and, working together, created an encoding, UNICODE, large enough to hold every character and symbol in the world. As long as a computer has the UNICODE encoding system, a file saved in it can be interpreted normally as UNICODE by any other computer, no matter which of the world's languages it contains. For network transmission, UNICODE has two common standards, UTF-8 and UTF-16, which transmit 8 bits and 16 bits at a time respectively.

Some people then ask: since UTF-8 can hold so many characters and symbols, why do so many people in China still use GBK and other encodings? Because text in UTF-8 and similar encodings is comparatively large and occupies more space on the computer; if most of the users are Chinese, GBK and similar encodings can also be used.
(Source: "Know the difference between Unicode and UTF-8")
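The size argument at the end of the summary can be checked directly (a sketch using Python's gbk codec):

```python
# Four common Chinese characters: 2 bytes each in GBK, 3 bytes each in UTF-8.
text = "汉字编码"
print(len(text.encode("gbk")), len(text.encode("utf-8")))  # 8 12
```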

3. Encoding problems in Python 2.x

    There are some differences between the encoding behavior of Python 2 and Python 3. For detailed answers, please refer to the following two articles.

    "Detailed explanation of Python character encoding" on Cnblogs

    "Differences in string encoding handling between Python 2 and Python 3" on CSDN
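As a small taste of the difference those articles cover: in Python 3, str holds Unicode code points and bytes holds an encoded form, whereas in Python 2 str was the byte string. A minimal Python 3 illustration:

```python
s = "中文"
b = s.encode("utf-8")            # explicit encode: str -> bytes
print(type(s).__name__, len(s))  # str 2   (two code points)
print(type(b).__name__, len(b))  # bytes 6 (three UTF-8 bytes per character)
assert b.decode("utf-8") == s    # decode reverses it losslessly
```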

    That is all for this part. Preview of the next chapter: modules.
