Python basics review: ASCII, Unicode, UTF-8, and GBK

Encoding problems come from the diversity of the world's languages. Computers were first developed in the United States, and a computer can only process numbers, not text. To process text, the text must first be converted to numbers, and an encoding is the rule that defines this conversion.

ASCII code:

Computers are designed with 8 bits to a byte, so the largest integer that one byte can represent is 255 (binary 11111111 = decimal 255). Representing a larger integer takes more bytes. For example, the largest integer that two bytes can represent is 65535, and the largest integer that 4 bytes can represent is 4294967295.
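The limits above all follow the same rule: n bytes hold 8n bits, so the largest unsigned value is 2^(8n) - 1. A minimal Python check of that arithmetic (the helper name `max_unsigned` is made up for this illustration):

```python
def max_unsigned(num_bytes):
    """Largest unsigned integer that fits in num_bytes bytes of 8 bits each."""
    return 2 ** (8 * num_bytes) - 1


for n in (1, 2, 4):
    # Print the byte count, the maximum value, and its binary form.
    print(n, max_unsigned(n), bin(max_unsigned(n)))
```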


Because the computer was invented in America, the earliest encoding covered only 128 characters (numbered 0 to 127): uppercase and lowercase English letters, digits, and some symbols and control codes. This table is called ASCII. For example, the uppercase letter A is encoded as 65 and the lowercase letter z as 122. One byte is clearly not enough for Chinese; at least two bytes are needed, and the result must not conflict with ASCII, so China defined the GB2312 encoding for Chinese characters. There are hundreds of languages in the world: Japan encodes Japanese in Shift_JIS, South Korea encodes Korean in EUC-KR, and so on. Each country has its own standard, so conflicts are inevitable, and text that mixes several languages ends up displayed as garbled characters.
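The points in this paragraph can be checked from Python. The sketch below assumes Python 3, where `str` holds text and `bytes` holds encoded data; the variable names are only for illustration, and the character U+4E2D is picked as a sample Chinese character.

```python
# ASCII code numbers for the two letters mentioned above.
code_A = ord("A")
code_z = ord("z")
print(code_A, code_z)

# U+4E2D is a common Chinese character; ASCII has no code for it.
zhong = "\u4e2d"
try:
    zhong.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print(ascii_ok)

# GB2312 stores the same character in two bytes, both above 127,
# so they cannot be mistaken for ASCII bytes.
gb_bytes = zhong.encode("gb2312")
print(gb_bytes)

# Reading those bytes with another country's table gives garbled text.
# errors="replace" substitutes U+FFFD for any byte that does not decode.
garbled = gb_bytes.decode("shift_jis", errors="replace")
print(garbled != zhong)
```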

Unicode:

Hence Unicode came into being. Unicode unifies all languages into a single character set, so the conflicts between national encodings go away. The Unicode standard keeps evolving, but the most common representation uses two bytes per character (four bytes for very rare characters). Modern operating systems and most programming languages support Unicode directly.
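To make the two-byte and four-byte cases concrete, here is a small sketch, assuming Python 3. It uses `utf-16-be` as one concrete two-byte-based form of Unicode; that choice, the sample characters, and the variable names are assumptions made for illustration.

```python
zhong = "\u4e2d"       # a common Chinese character
rare = "\U00020000"    # a rarely used character beyond the first 65,536 code numbers

# Every character has exactly one Unicode code number.
print(hex(ord(zhong)), hex(ord(rare)))

# The common character fits in 2 bytes; the rare one needs 4.
zhong_bytes = zhong.encode("utf-16-be")
rare_bytes = rare.encode("utf-16-be")
print(len(zhong_bytes), len(rare_bytes))
```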

UTF-8:

Unifying everything under Unicode removes the garbled-text problem, but it creates a new one. If the text you write is almost entirely English, two-byte Unicode needs twice as much storage as ASCII, which is wasteful for both storage and transmission. To save space, the variable-length encoding UTF-8 was introduced. UTF-8 encodes a Unicode character into 1 to 4 bytes depending on its code number (the original design allowed up to 6 bytes, but the current standard limits it to 4): common English letters take 1 byte, Chinese characters usually take 3 bytes, and only very rare characters take 4 bytes. If the text you want to transfer contains a lot of English characters, encoding it in UTF-8 saves space:

Character   ASCII      Unicode             UTF-8
A           01000001   00000000 01000001   01000001
中          x          01001110 00101101   11100100 10111000 10101101

(x: cannot be represented.) Size: ASCII uses one byte (8 bits); Unicode here uses two bytes; UTF-8 uses one byte for English letters, three bytes for Chinese characters, and four bytes for rare characters.
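The UTF-8 bit patterns in the table can be reproduced with a short script. This is a sketch assuming Python 3; the helper `utf8_bits` and the `samples` dictionary are made up for this example, and U+20000 is just one sample of a rare character.

```python
def utf8_bits(ch):
    """Return the UTF-8 bytes of ch as space-separated 8-bit binary strings."""
    return " ".join(format(b, "08b") for b in ch.encode("utf-8"))


samples = {
    "A": "A",
    "U+4E2D": "\u4e2d",        # the Chinese character from the table
    "U+20000": "\U00020000",   # a rare character
}
for label, ch in samples.items():
    encoded = ch.encode("utf-8")
    # Print the label, the number of bytes, and the bit pattern.
    print(label, len(encoded), utf8_bits(ch))
```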

Note: When you browse a web page, the server converts the dynamically generated Unicode content to UTF-8 before sending it to the browser. That is why the source of many web pages contains something like <meta charset="UTF-8" />, which indicates that the page is encoded in UTF-8.

GBK is a Chinese character encoding that extends GB2312. It uses 2 bytes for each Chinese character and keeps ASCII characters as a single byte.
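A quick size comparison between GBK and UTF-8, as a sketch assuming Python 3 (the sample string and variable names are chosen only for illustration):

```python
text = "A\u4e2d"   # one English letter followed by one Chinese character

gbk_bytes = text.encode("gbk")
utf8_bytes = text.encode("utf-8")

# GBK: 1 byte for the letter plus 2 for the Chinese character.
print(len(gbk_bytes), gbk_bytes)
# UTF-8: 1 byte for the letter plus 3 for the Chinese character.
print(len(utf8_bytes), utf8_bytes)
```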

 

Python represents strings internally as unicode. Therefore, encoding conversions usually go through unicode as the intermediate form.

  • That is, first decode the string from its current encoding into unicode, and then encode it from unicode
    into the target encoding.
  • decode converts a string in some other encoding into unicode. For example, str1.decode('gb2312') converts the GB2312-encoded string str1 into unicode.
  • encode converts unicode into a string in some other encoding. For example, str2.encode('gb2312') converts the unicode string str2 into GB2312.

In summary: to convert text from another encoding to UTF-8, first decode it into unicode and then encode it as UTF-8. Unicode is the intermediary for the conversion.
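The decode-then-encode route described above can be sketched as follows. This version is written for Python 3, where encoded data is `bytes` and decoded text is `str` (the role played by `unicode` in Python 2); the function name `transcode` and the sample bytes are assumptions for illustration.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert encoded bytes from one encoding to another via Unicode text."""
    text = data.decode(source_encoding)    # bytes -> Unicode text
    return text.encode(target_encoding)    # Unicode text -> bytes


# The two Chinese characters U+4E2D U+6587, stored as GB2312 bytes.
gb_data = b"\xd6\xd0\xce\xc4"
utf8_data = transcode(gb_data, "gb2312", "utf-8")
print(utf8_data)
```

Going through Unicode means each encoding only needs a mapping to and from Unicode, not a direct mapping to every other encoding.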

s = '中文'
s = s.decode('utf-8').encode('utf-8')  # decode to unicode, then encode back to UTF-8
print s  # output as UTF-8

isinstance(s, unicode): checks whether s is a unicode object. It returns True if it is, and False otherwise.


s = '中文'
s = s.decode('utf-8')          # decode the UTF-8 byte string into unicode
print isinstance(s, unicode)   # prints True
s = s.encode('utf-8')          # encode the unicode object back to UTF-8
print isinstance(s, unicode)   # prints False

import sys
print sys.getdefaultencoding()  # the interpreter's default string encoding

s = '中文'
if isinstance(s, unicode):   # already unicode: encode directly, no decoding needed
    print s.encode('utf-8')
else:                        # byte string: decode from UTF-8 first, then encode as GB2312
    print s.decode('utf-8').encode('gb2312')
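The same check-then-convert pattern can be wrapped in a helper. The sketch below is a Python 3 adaptation, not the original author's code: `str` takes the place of Python 2's `unicode`, and the function name `to_utf8` and its default `source_encoding` are hypothetical.

```python
def to_utf8(value, source_encoding="gb2312"):
    """Return value as UTF-8 bytes.

    value may be text (str) or bytes encoded in source_encoding.
    """
    if isinstance(value, str):
        # Already Unicode text: encode directly, no decoding needed.
        return value.encode("utf-8")
    # Encoded bytes: decode to Unicode text first, then re-encode.
    return value.decode(source_encoding).encode("utf-8")


print(to_utf8("\u4e2d\u6587"))           # from Unicode text
print(to_utf8(b"\xd6\xd0\xce\xc4"))      # from GB2312 bytes
```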

If a Python 2.7 source file contains Chinese characters, put the following lines at the top of the file:

#!/usr/bin/python
# -*- coding: cp936 -*-
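The name cp936 in the declaration is the Windows code page for GBK. A small check, assuming Python 3 and its standard `codecs` module, that the two names refer to the same codec (the variable names are only for illustration):

```python
import codecs

# Look up both names in Python's codec registry.
cp936_info = codecs.lookup("cp936")
gbk_info = codecs.lookup("gbk")
print(cp936_info.name, gbk_info.name)

# Encoding the same text under either name gives the same bytes.
sample = "\u4e2d\u6587"
same_bytes = sample.encode("cp936") == sample.encode("gbk")
print(same_bytes)
```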

Reprinted from: http://blog.csdn.net/qq_34162294/article/details/53727357

 

 
