Introduction to Python character encoding
1. Notes:
In Python 2 the default encoding is ASCII; in Python 3 strings are Unicode by default.
Unicode has several encoding forms: UTF-32 (4 bytes per character), UTF-16 (2 bytes for most characters, 4 bytes for the rest) and UTF-8 (1 to 4 bytes per character). UTF-16 is widely used for text held in memory, but text stored in files is usually UTF-8 because it saves space.
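A minimal sketch of these size differences, written for Python 3. The helper name encoded_sizes is made up for illustration, and the little-endian codec variants are chosen so that Python does not prepend a byte-order mark to the result.

```python
def encoded_sizes(ch):
    """Return the number of bytes ch occupies in each Unicode encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-le" codecs skip the byte-order mark that "utf-16"/"utf-32" add.
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


for ch in ["a", "你"]:
    print(repr(ch), encoded_sizes(ch))
```

For plain English text UTF-8 is the smallest of the three, which is one reason it is the usual choice for files.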
In Python 3, encoding (encode) turns a string into the bytes type, and decoding (decode) turns the bytes type back into a string.
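The round trip can be sketched as follows in Python 3. The variable names are only illustrative.

```python
text = "你好"

data = text.encode("utf-8")       # str -> bytes
restored = data.decode("utf-8")   # bytes -> str

print(type(text).__name__)        # str
print(type(data).__name__, data)  # bytes b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(restored == text)           # True
```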
In UTF-16, a common Chinese character takes 2 bytes, and so does an English character; in GBK, a Chinese character takes 2 bytes and an English character takes 1 byte. Remember: ASCII cannot store Chinese characters.
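The ASCII limitation is easy to check in Python 3. This is a small sketch, and fits_in_ascii is a hypothetical helper name, not a standard function.

```python
def fits_in_ascii(text):
    """Return True if every character of text can be encoded as ASCII."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        # Raised for any character outside the 128 ASCII code points.
        return False
    return True


print(fits_in_ascii("hello"))  # True
print(fits_in_ascii("你好"))    # False
```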
UTF-8 is a variable-length character encoding and an optimization of Unicode: all English characters are still stored in their ASCII form (1 byte each), and common Chinese characters uniformly take 3 bytes.
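A short Python 3 sketch of the variable-length behaviour, assuming a sample string that mixes English and Chinese characters:

```python
mixed = "hi你好"

# Number of UTF-8 bytes used by each character of the sample string.
per_char = [(ch, len(ch.encode("utf-8"))) for ch in mixed]
print(per_char)  # [('h', 1), ('i', 1), ('你', 3), ('好', 3)]

# For pure English text, the UTF-8 bytes are identical to the ASCII bytes.
ascii_compatible = "hi".encode("utf-8") == "hi".encode("ascii")
print(ascii_compatible)  # True
```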
Unicode contains the characters of all countries, and converting between different character encodings has to pass through Unicode.
The default encoding of Python 3 itself is UTF-8.
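On Python 3 the default can be inspected with sys.getdefaultencoding(), and it is also what encode() falls back to when no encoding is given. A quick check:

```python
import sys

default_encoding = sys.getdefaultencoding()
print(default_encoding)  # utf-8 on Python 3

# encode() without an argument uses UTF-8 as well.
same_bytes = "你好".encode() == "你好".encode("utf-8")
print(same_bytes)  # True
```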
2. The encoding and transcoding process in py2
As shown:
Note: Because Unicode is the intermediate encoding, any conversion between character encodings must first decode into Unicode and then encode into the target character encoding.
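The decode-then-encode rule can be wrapped in a small helper. The sketch below uses Python 3 syntax, where bytes plays the role of the py2 str type and str plays the role of the py2 unicode type; the function name transcode is hypothetical.

```python
def transcode(data, source_encoding, target_encoding):
    """Convert bytes from one encoding to another by going through Unicode."""
    text = data.decode(source_encoding)   # bytes -> Unicode string
    return text.encode(target_encoding)   # Unicode string -> bytes


gbk_bytes = "你好".encode("gbk")
utf8_bytes = transcode(gbk_bytes, "gbk", "utf-8")

print(gbk_bytes)   # b'\xc4\xe3\xba\xc3'
print(utf8_bytes)  # b'\xe4\xbd\xa0\xe5\xa5\xbd'
```

There is no direct GBK to UTF-8 step; the intermediate Unicode string is always created, even when it is not stored in a variable.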
Two, character encoding conversion
1. Conversion of py2 character encoding
The code is as follows:
#! /usr/bin/env python
# -*- coding:utf-8 -*-
# __author__ == luoahong
s = "我是学员"
# Decode the UTF-8 byte string into unicode
s_to_unicode = s.decode("utf-8")
print("--------s_to_unicode-----")
print(s_to_unicode)
# Then encode the unicode string into GBK
s_to_gbk = s_to_unicode.encode("gbk")
print("-----s_to_gbk------")
print(s_to_gbk)
# Decode GBK into unicode, then encode into UTF-8
gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
print("------gbk_to_utf8-----")
print(gbk_to_utf8)
#Output
--------s_to_unicode-----
我是学员
-----s_to_gbk------
�����˧
------gbk_to_utf8-----
我是学员
Note: The case above applies when the string is not yet Unicode. But what if the string is already Unicode?
2. When the character encoding is already unicode
The code is as follows:
#! /usr/bin/env python
# -*- coding:utf-8 -*-
# __author__ == luoahong
# The u prefix means the string is unicode
s = u'你好'
# Already unicode, so encode directly into GBK
s_to_gbk = s.encode("gbk")
print("----s_to_gbk----")
print(s_to_gbk)
# Decode back into unicode, then encode into UTF-8
gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
print("-----gbk_to_utf8---")
print(gbk_to_utf8)
#Output
----s_to_gbk----
���
-----gbk_to_utf8---
你好
Note: In Python 2, declaring the character encoding at the top of the file tells the interpreter that the file is encoded in UTF-8, so any Chinese characters it contains can be read and printed. If you do not declare the encoding, the default encoding is used, which is ASCII in Python 2, and an error is reported because ASCII cannot store Chinese characters.
3. Character encoding conversion of py3
As mentioned in the notes above, strings in Python 3 are Unicode by default, so converting to another character encoding needs no decode step first; just encode directly. The code is as follows:
#! /usr/bin/env python
# __author__ == luoahong
# No encoding declaration is needed, though adding one does no harm
s = '你好'
# The string s is already Unicode, so encode directly without decoding
s_to_gbk = s.encode("gbk")
print("----s_to_gbk----")
print(s_to_gbk)
# Same as before: GBK must be decoded into Unicode, then encoded into UTF-8
gbk_to_utf8 = s_to_gbk.decode("gbk").encode("utf-8")
print("-----gbk_to_utf8---")
print(gbk_to_utf8)
# Decode back into a Unicode string
utf8_decode = gbk_to_utf8.decode("utf-8")
print("-------utf8_decode----")
print(utf8_decode)
#Output
----s_to_gbk----
b'\xc4\xe3\xba\xc3'
-----gbk_to_utf8---
b'\xe4\xbd\xa0\xe5\xa5\xbd'
-------utf8_decode----
你好
Note: In Python 3, encode turns a string into the bytes type, and decode turns the bytes type back into a string, so it is easy to see that the data becomes bytes after encode. Also pay special attention: whether or not a character encoding is declared at the top of a Python 3 file, the declaration only states which encoding the file itself is saved in; the strings in the file are still Unicode, as shown in the following figure:
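The same point can be checked in code. The sketch below builds a tiny source file in memory, saved as GBK with a matching declaration, and runs it with the built-in compile and exec; the file name "<gbk_source>" and the variable names are made up for illustration.

```python
# Source code stored as GBK bytes, with a matching encoding declaration.
source = '# -*- coding: gbk -*-\ns = "你好"\n'.encode("gbk")

namespace = {}
exec(compile(source, "<gbk_source>", "exec"), namespace)

s = namespace["s"]
print(type(s).__name__)  # str
print(len(s))            # 2 characters, not 4 GBK bytes
```

The declaration only tells the interpreter how to read the bytes of the file; the resulting string object is an ordinary Unicode str.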
3. Summary:
1、Unicode can represent the characters of all character encodings.
2、In Python 2, conversion between character encodings must pass through unicode. When printing, you can print either the unicode object or a string in the corresponding character encoding (the one declared at the top of the file); both appear to work because py2 makes no clear distinction between characters and bytes.
3、Python 3 recognizes characters only as Unicode. Once a string is converted into a specific encoding, it becomes bytes, the binary data of that encoding; to be recognized as text again, it must be decoded back to Unicode.
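As a closing sketch of point 3 in Python 3: encoded bytes only become readable text again when decoded with the encoding that produced them. The variable names are illustrative.

```python
gbk_bytes = "你好".encode("gbk")

# Decoding with the wrong encoding fails for this input.
try:
    gbk_bytes.decode("utf-8")
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

print(decoded_as_utf8)          # False
print(gbk_bytes.decode("gbk"))  # 你好
```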