Introduction to Python character encoding

1. Notes:

In Python 2 the default encoding is ASCII; in Python 3 strings are Unicode by default.

Unicode has several encoding forms: UTF-32 (4 bytes per character), UTF-16 (2 or 4 bytes per character), and UTF-8 (1 to 4 bytes per character). UTF-16 is widely used for text held in memory, but files are usually stored as UTF-8 because it saves space.

In Python 3, encoding with encode turns a str into the bytes type, and decoding with decode turns the bytes type back into a str.
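The type change can be checked directly. This is a minimal Python 3 sketch; the variable names are illustrative and not part of the original article.

```python
# Python 3: encode() turns str into bytes, decode() turns bytes back into str.
text = "你好"                       # str: a sequence of Unicode code points
encoded = text.encode("utf-8")      # bytes: the UTF-8 representation
decoded = encoded.decode("utf-8")   # str again

print(type(text).__name__)
print(type(encoded).__name__)
print(type(decoded).__name__)
```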

In UTF-16, a common Chinese character takes 2 bytes; in UTF-8, an English character takes 1 byte. Remember: ASCII cannot store Chinese characters.

UTF-8 is a variable-length encoding optimized for Unicode: every English character is still stored exactly as in ASCII, and common Chinese characters take 3 bytes each.
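The byte counts can be measured rather than memorized. The sketch below assumes Python 3; `byte_length` is a hypothetical helper introduced only for this illustration, and the little-endian codec names are chosen so that no byte order mark is included in the count.

```python
# Bytes needed for one English letter and one common Chinese character
# in each Unicode encoding form. The "-le" codecs add no byte order mark.
def byte_length(char, encoding):
    """Return how many bytes `char` occupies in `encoding`."""
    return len(char.encode(encoding))

for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    print(encoding, byte_length("A", encoding), byte_length("你", encoding))

# ASCII has no code for Chinese characters, so encoding raises an error.
try:
    "你".encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
print("ASCII can store Chinese:", ascii_ok)
```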

Unicode contains the characters of all countries, and conversion between different character encodings has to go through Unicode.

Python 3 itself reads source files as UTF-8 by default.

2. The encoding and transcoding process in py2

[Figure: the encoding and transcoding process in py2]
Note: Because Unicode is the intermediate encoding, any conversion between character encodings must first decode the data into Unicode and then encode it into the target encoding.
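The two-step rule in the note can be written as one small function. This is a sketch in Python 3 rather than py2, and `transcode` is a hypothetical helper name, not something from the article or the standard library.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Convert bytes from one encoding to another by way of Unicode."""
    text = data.decode(source)    # source encoding -> Unicode (str)
    return text.encode(target)    # Unicode (str) -> target encoding

gbk_bytes = "你好".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")
print(gbk_bytes)
print(utf8_bytes)
```

There is no direct GBK-to-UTF-8 step here: the function always passes through a str, which is the Unicode form.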

II. Character encoding conversion

1. Conversion of py2 character encoding

The code is as follows:

#! /usr/bin/env python
# -*- coding:utf-8 -*-
# __author__ == luoahong

# Python 2: a plain string literal is a byte string (UTF-8 here, per the declaration above)
s = "我是学员"
# Decode the UTF-8 bytes into unicode
s_to_unicode = s.decode("utf-8")
print("--------s_to_unicode-----")
print(s_to_unicode)
# Then encode the unicode object as GBK
s_to_gbk = s_to_unicode.encode("gbk")
print("-----s_to_gbk------")
print(s_to_gbk)
# Decode GBK into unicode, then encode as UTF-8
gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
print("------gbk_to_utf8-----")
print(gbk_to_utf8)

#Output

--------s_to_unicode-----
我是学员
-----s_to_gbk------
�����˧
------gbk_to_utf8-----
我是学员

Note: The case above applies to strings that are not yet Unicode. But what if the string is already Unicode?

2. When the character encoding is already unicode

The code is as follows:

#! /usr/bin/env python
# -*- coding:utf-8 -*-
# __author__ == luoahong

# The u prefix means the string is already unicode
s = u'你好'
# Already unicode, so encode straight to GBK
s_to_gbk = s.encode("gbk")
print("----s_to_gbk----")
print(s_to_gbk)
# Decode back into unicode, then encode as UTF-8
gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
print("-----gbk_to_utf8---")
print(gbk_to_utf8)

#Output

----s_to_gbk----
���
-----gbk_to_utf8---
你好
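The garbled s_to_gbk line in the output above is most likely what a UTF-8 terminal shows when it receives GBK bytes. The Python 3 sketch below imitates that by decoding GBK bytes as UTF-8 with errors="replace"; treating the terminal this way is an assumption made for illustration.

```python
# GBK bytes read as if they were UTF-8: every invalid sequence is
# replaced by U+FFFD, the replacement character shown as a question-mark box.
gbk_bytes = "你好".encode("gbk")
garbled = gbk_bytes.decode("utf-8", errors="replace")  # wrong codec
restored = gbk_bytes.decode("gbk")                     # correct codec

print(garbled)
print(restored)
```

The bytes themselves are intact; only the codec used to interpret them is wrong, which is why decoding with GBK recovers the text.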

Note: In Python 2, declaring the character encoding at the top of the file tells the interpreter that the file is encoded in UTF-8, so Chinese characters contained in the file can be read and printed. If no encoding is declared, Python 2 falls back to its default encoding, ASCII, and an error is reported because ASCII cannot store Chinese characters.
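How the declaration is read can be observed with the standard tokenize module. This sketch runs on Python 3, where the fallback is UTF-8 rather than the ASCII fallback of Python 2; `declared_encoding` is a hypothetical helper name used only here.

```python
import io
import tokenize

def declared_encoding(source: bytes) -> str:
    """Return the encoding Python 3 would use to read this source code."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source).readline)
    return encoding

with_declaration = b"# -*- coding: gbk -*-\ns = 'hi'\n"
without_declaration = b"s = 'hi'\n"

print(declared_encoding(with_declaration))
print(declared_encoding(without_declaration))
```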

3. Character encoding conversion of py3

As mentioned in the notes, Python 3 strings are Unicode by default, so converting a string to another character encoding needs no decode step; just encode directly. The code is as follows:

#! /usr/bin/env python
# __author__ == luoahong
# No encoding declaration is needed; declaring one does no harm either
s = '你好'
# The string s is already Unicode, so no decode is needed; encode directly
s_to_gbk = s.encode("gbk")
print("----s_to_gbk----")
print(s_to_gbk)
# As before, GBK bytes must be decoded into Unicode first, then encoded as UTF-8
gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
print("-----gbk_to_utf8---")
print(gbk_to_utf8)
# Decode back into a Unicode string
utf8_decode = gbk_to_utf8.decode("utf-8")
print("-------utf8_decode----")
print(utf8_decode)

#Output

----s_to_gbk----
b'\xc4\xe3\xba\xc3'
-----gbk_to_utf8---
b'\xe4\xbd\xa0\xe5\xa5\xbd'
-------utf8_decode----
你好

Note: In Python 3, encode turns a str into the bytes type and decode turns the bytes type back into a str, so the output above clearly shows that the data became bytes after encode. One more point deserves special attention: declaring a character encoding at the top of a Python 3 file only states the encoding of the file itself; the strings in the file are still Unicode.
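That last point can be tried without creating a file. The sketch below builds a small source text saved as GBK with a matching declaration, then compiles and runs it from bytes; the in-memory source and the `<gbk_demo>` label are assumptions made for the demonstration.

```python
# A tiny source "file" stored as GBK bytes, with a matching declaration.
source_bytes = '# -*- coding: gbk -*-\ns = "你好"\n'.encode("gbk")

# compile() honors the coding declaration when it is given bytes.
namespace = {}
exec(compile(source_bytes, "<gbk_demo>", "exec"), namespace)

s = namespace["s"]
print(type(s).__name__)
print(s == "你好")
```

The file is GBK on disk, yet the string it defines is an ordinary Unicode str of two characters.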

III. Summary:

1. Unicode

Unicode can represent the characters of every character encoding.

2. Python 2

Conversion between character encodings must go through Unicode. When printing, you can print either the unicode object or a byte string in the matching encoding (the one declared at the top of the file). Both work because py2 draws no clear distinction between characters and bytes.

3. Python 3

Strings are recognized only as Unicode. Encoding a string into a specific encoding turns it directly into bytes in that encoding, that is, binary data. To be read as text again, the bytes must be decoded back into Unicode.

Origin blog.csdn.net/qq_25562325/article/details/111408324