https://www.cnblogs.com/liaohuiqiang/p/7247393.html
1 ascii, unicode, utf8
1.1 ascii
The earliest encoding. It defines only 128 characters (codes 0 to 127): English letters, digits, punctuation marks, and some control symbols. One byte represents one character.
1.2 unicode
One byte is not enough: the characters of every language in the world need to be encoded, so unicode assigns a unique code to every character. A character is usually represented with two bytes (some uncommon characters need four bytes). So, one thing to keep in mind: the "unicode encoding" mentioned below means this double-byte encoding, two bytes per character (strictly speaking, the UCS-2/UTF-16 representation of unicode).
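The two-byte and four-byte claim above can be checked directly. The sketch below is written for Python 3, where every str is unicode. The codec name utf-16-be and the helper utf16_len are my own choices for showing the double-byte form; the article does not name them.

```python
# Python 3 sketch: code points and the size of their 2-byte (UTF-16) form.
def utf16_len(ch):
    """Number of bytes ch needs in UTF-16 (big-endian, no byte-order mark)."""
    return len(ch.encode('utf-16-be'))

samples = ['A', '中', '\U0001F600']  # ascii letter, Chinese character, emoji

for ch in samples:
    # ord() gives the unique unicode code assigned to the character
    print(hex(ord(ch)), utf16_len(ch))
```

The emoji is one of the characters that does not fit in two bytes, which is why "two bytes per character" is only the usual case.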
1.3 utf8
Characters covered by ascii need only 1 byte, yet unicode also uses 2 bytes for them. If an article is entirely in English (ascii characters), a lot of space is wasted (what fits in 1 byte takes 2 bytes), so utf8 was created. utf8 is a variable-length encoding: the number of bytes depends on the character. ascii characters are encoded into 1 byte, Chinese characters are usually encoded into 3 bytes, and some uncommon characters are encoded into 4 bytes (the original design allowed up to 6 bytes, but the current standard stops at 4).
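A minimal sketch of the variable-length behaviour described above, assuming Python 3. The helper name utf8_len and the sample characters are illustrative choices of mine.

```python
# Python 3 sketch: how many bytes utf8 needs per character.
def utf8_len(text):
    """Number of bytes text occupies once encoded as utf8."""
    return len(text.encode('utf-8'))

ascii_len = utf8_len('A')            # ascii character
chinese_len = utf8_len('学')          # common Chinese character
rare_len = utf8_len('\U00020000')    # a rare CJK Extension B character
print(ascii_len, chinese_len, rare_len)

# The wasted space for English text under a fixed 2-byte encoding:
english = 'hello'
print(utf8_len(english), len(english.encode('utf-16-be')))
```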
In computer memory, unicode is used uniformly.
In python, it is recommended to use unicode uniformly inside the program and utf8 when saving and reading files (apply the corresponding utf8 decode when reading a disk file and encode when writing one).
2 Encoding declaration
By default, python2.7 uses ascii to interpret source files.
If a non-ascii character appears in the source file and no encoding is declared at the top of the file, an error is reported.
Declaring utf8 tells the interpreter to read the source file as utf8. The source file can then contain Chinese without an error.
# encoding=utf8  (without this line, the Chinese text below raises an error)
print '解释器用相应的encoding去解释python代码'
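To see how such a declaration is picked up, here is a rough sketch using tokenize.detect_encoding from the Python 3 standard library. Note the difference from the article's python2.7 setting: Python 3 falls back to utf-8, not ascii, when nothing is declared. The helper name declared_encoding is hypothetical.

```python
# Python 3 sketch: find the encoding a source file declares.
import codecs
import io
import tokenize

def declared_encoding(source_bytes):
    """Return the normalized codec name found in the first two lines."""
    name, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return codecs.lookup(name).name  # normalize spellings such as 'utf8'

with_decl = declared_encoding(b"# encoding=utf8\nprint('hi')\n")
gbk_decl = declared_encoding(b"# -*- coding: gbk -*-\nprint('hi')\n")
no_decl = declared_encoding(b"print('hi')\n")  # Python 3 default applies
print(with_decl, gbk_decl, no_decl)
```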
3 str and unicode in python2.7
python2.7 generally has two string types, unicode and str.
str is a byte string: the characters have been converted into bytes according to some encoding, so there is no fixed one-to-one correspondence between characters and bytes.
unicode is a unicode-encoded string: each character corresponds to two bytes, one to one.
- A directly assigned string literal has type str, a byte string, encoded into bytes according to the encoding declared at the top of the file.
- A literal with a u prefix has type unicode and is encoded directly as unicode.
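s1 = '字节串'
print type(s1)  # prints <type 'str'>; encoded into bytes using the encoding declared at the top of the file
print len(s1)   # prints 9: with utf8 each Chinese character takes 3 bytes, so 3 characters take 9 bytes
s2 = u'统一码'
print type(s2)  # prints <type 'unicode'>; unicode-encoded, 2 bytes per character
print len(s2)   # prints 3: unicode counts length in characters, so in this sense unicode is the real string type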
Let's look at a realistic example. Suppose we need to find all the words in a file whose last two characters are "学习" (learning). When checking:
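s = '机器学习'
s[-2:] == '学习'
# Returns False, although when writing programs you might expect them to be equal.
# The '学习' literal here is interpreted with the encoding declared at the top of the file. I declared utf8, where a Chinese character takes 3 bytes,
# so '学习' takes 6 bytes, while s[-2:] takes only the last two bytes, so they are not equal.
s = u'机器学习'
s[-2:] == u'学习'
# Returns True, which is why unicode is the real string type: with unicode,
# '学习' takes two "double bytes", one "double byte" per character.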
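The same pitfall can be reproduced in Python 3, assuming bytes stands in for the python2.7 str type and str for unicode. This is a sketch of the comparison above, not the article's own code.

```python
# Python 3 sketch: bytes plays the role of python2.7 str, str of unicode.
text = '机器学习'
suffix = '学习'
raw = text.encode('utf-8')               # 12 bytes, 3 per character
raw_suffix = suffix.encode('utf-8')      # 6 bytes

bytes_last_two = raw[-2:] == raw_suffix  # 2 bytes compared against 6 bytes
bytes_last_six = raw[-6:] == raw_suffix  # slicing 6 bytes lines up again
text_last_two = text[-2:] == suffix      # slicing by characters
print(bytes_last_two, bytes_last_six, text_last_two)
```

Slicing the byte string only works when you count bytes yourself, which depends on the encoding. Slicing text counts characters and needs no such bookkeeping.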
For people who often deal with Chinese strings, this pitfall can be avoided by using unicode uniformly.
Some string processing functions also work on str, but that is probably because the function handles the encoding problem for you.
4 encode and decode in python2.7
Normal use of encode: encode a unicode object to get a byte string of type str. That is, unicode -> encode (with the specified encoding) -> str.
Normal use of decode: decode a str to get a unicode object. That is, str -> decode (with the specified encoding) -> unicode.
Note: both encode and decode need to be given an encoding.
A conversion involves two encodings: the one the data is in now and the one it should end up in. One side is always unicode, so only the other encoding has to be specified. This holds for encode and decode alike.
These two methods convert between unicode and str using the specified encoding.
s3 = u'统一码'.encode('utf8')
print type(s3)  # prints <type 'str'>
s4 = '字节串'.decode('utf8')
print type(s4)  # prints <type 'unicode'>
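As a small Python 3 counterpart (where str.encode returns bytes and bytes.decode returns str), the sketch below shows the round trip and what happens when the wrong encoding is named. The choice of ascii as the wrong codec is mine.

```python
# Python 3 sketch: encode and decode are inverse steps for the same codec.
text = '统一码'
raw = text.encode('utf-8')          # unicode -> encode -> byte string
round_trip = raw.decode('utf-8')    # byte string -> decode -> unicode

try:
    raw.decode('ascii')             # wrong codec for these bytes
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True
print(len(raw), round_trip == text, wrong_codec_failed)
```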
Improper use of encode: calling encode on a str. Because encode needs a unicode object, python first decodes the str into unicode using the system default encoding, and then encodes it with the encoding you gave. (Note that this system default encoding is not the encoding declared at the top of the file. See section 5 below for a concrete example.)
Improper use of decode: calling decode on a unicode object. python first encodes it into a str using the system default encoding, and then decodes it with the encoding you gave.
Therefore, if you change the system default encoding accordingly, even improper use will not raise an error. But that takes an extra detour, and I don't like it.
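The hidden step can be imitated in Python 3, which no longer performs it. The helper py2_style_encode below is hypothetical: it only mimics the decode-then-encode sequence described above, with the default encoding passed in as a parameter.

```python
# Python 3 sketch that imitates what python2.7 does for str.encode(target).
def py2_style_encode(byte_string, target, default_encoding='ascii'):
    """Hypothetical helper: implicit decode with the default, then encode."""
    return byte_string.decode(default_encoding).encode(target)

raw = '字节串'.encode('utf-8')

try:
    py2_style_encode(raw, 'utf-8')      # ascii default cannot decode Chinese
    ascii_default_ok = True
except UnicodeDecodeError:
    ascii_default_ok = False

utf8_default = py2_style_encode(raw, 'utf-8', default_encoding='utf-8')
print(ascii_default_ok, utf8_default == raw)
```

With an ascii default the call fails in the implicit decode, not in the encode you asked for, which is what makes the error confusing.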
5 Modify the system default encoding
The system default encoding is ascii and needs to be changed accordingly.
It differs from the encoding declared at the top of the file: the declaration at the top is the encoding of the source file's content.
The encoding here is the one some python methods use by default, for example the implicit decode when encode is called on a str, or the implicit encode when write is given unicode (see section 7 on file reading and writing below).
import sys
reload(sys)
sys.setdefaultencoding('utf8')
s = '字节串str'
s.encode('utf8')
# equivalent to
s.decode(sys.getdefaultencoding()).encode('utf8')
Let's look at another example of where the system default encoding comes into play.
import sys
print sys.getdefaultencoding()  # prints ascii
s = u'华南理工大学'
print s[-2:] == '大学'  # prints False, and a warning is emitted
reload(sys)
sys.setdefaultencoding('utf8')
print s[-2:] == '大学'  # prints True
According to the results: when python compares with ==, if one operand is unicode and the other is not, it automatically decodes the other operand with the system default encoding.
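Python 3 never compares bytes and text as equal, so the rule above can only be imitated there. The helper py2_style_equal is hypothetical and models just the behaviour described: decode the byte operand with the default encoding, and treat a failed decode as "not equal".

```python
# Python 3 sketch that imitates the python2.7 rule for unicode == str.
def py2_style_equal(text, byte_string, default_encoding='ascii'):
    """Hypothetical helper: decode the byte operand, then compare text."""
    try:
        return text == byte_string.decode(default_encoding)
    except UnicodeDecodeError:
        return False  # python2.7 warns and treats the operands as unequal

text = '华南理工大学'
raw_suffix = '大学'.encode('utf-8')
with_ascii = py2_style_equal(text[-2:], raw_suffix)
with_utf8 = py2_style_equal(text[-2:], raw_suffix, default_encoding='utf-8')
print(with_ascii, with_utf8)
```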
PS: Why is reload(sys) needed? First, reload reloads a previously imported module.
sys has to be reloaded here because python removes the setdefaultencoding method from sys during startup (possibly for safety reasons), so sys must be reloaded to get the method back.
6 View file encoding
import chardet
def detect_encoding(filename):
    # read the raw bytes and let chardet guess their encoding
    with open(filename, 'rb') as f:
        data = f.read()
    return chardet.detect(data)
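When chardet is not installed, a rough fallback is to try a few candidate encodings in order. This is my own heuristic sketch, not what chardet does: the helper guess_encoding and its candidate list are assumptions, and because some byte sequences are valid in more than one encoding, the result is only a guess that depends on the order of the candidates.

```python
# Python 3 sketch: try candidate encodings in order (a heuristic guess).
def guess_encoding(data, candidates=('ascii', 'utf-8', 'gbk')):
    """Hypothetical helper: first candidate that decodes data without error."""
    for name in candidates:
        try:
            data.decode(name)
            return name
        except UnicodeDecodeError:
            continue
    return None  # none of the candidates fit

print(guess_encoding(b'plain text'))
print(guess_encoding('中文'.encode('utf-8')))
print(guess_encoding('中文'.encode('gbk')))
```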
7 File reading and writing
7.1 open
The first thing to remember: read and write, the two gateways to a file, both deal in the str type, one byte at a time.
python's built-in open reads the file byte by byte as a byte string (str). After reading, you must decode with the correct encoding to get the correct unicode, so you must know the file's original encoding.
The same goes for writing. Data is written as bytes, as a str, and that str is encoded in some encoding. Take care to encode it correctly. Usually you encode to utf8 and then write.
If you write a unicode object, python first encodes it into a str using the system default encoding and then writes that to the file. What a file needs is str, so it is better to hand it a str yourself than to have python convert for you before writing.
The simple rule: write str whenever possible and avoid relying on the default encoding. Then there is no need to modify the default encoding at the start.
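The read and write rules above can be sketched in Python 3 with binary mode, which behaves like the python2.7 built-in open in that it hands back raw bytes. The temporary file path is a placeholder of mine.

```python
# Python 3 sketch: binary mode gives raw bytes, like python2.7 open().
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo.txt')  # hypothetical file

with open(path, 'wb') as f:
    f.write('机器学习'.encode('utf-8'))   # encode explicitly, then write bytes

with open(path, 'rb') as f:
    raw = f.read()                        # bytes come back
text = raw.decode('utf-8')                # decode with the file's encoding
print(len(raw), len(text))
```

Both conversions name utf8 explicitly, so nothing depends on a default encoding.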
7.2 codecs open
The open method of python's codecs module lets you specify an encoding. It guarantees that the bytes read and written are all in this specified encoding.
So when reading a file, the str that is read is decoded into unicode with the specified encoding.
When writing a file: if the data is unicode, it is encoded into a str with the specified encoding and then written. If it is a str, it is first decoded into unicode with the system default encoding, then encoded into a str with the specified encoding and written.
The simple rule: write unicode whenever possible and avoid relying on the default encoding. Then there is no need to modify the default encoding at the start.
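A short sketch of the behaviour described above, run under Python 3 where codecs.open still exists (the built-in open with an encoding argument is the more common choice there). The gbk encoding and the temporary path are illustrative assumptions.

```python
# Python 3 sketch: codecs.open encodes on write and decodes on read.
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'demo_gbk.txt')  # hypothetical file

with codecs.open(path, 'w', encoding='gbk') as f:
    f.write('统一码')                 # text in, gbk bytes on disk

with codecs.open(path, 'r', encoding='gbk') as f:
    text = f.read()                   # gbk bytes on disk, text out

with open(path, 'rb') as f:
    on_disk = f.read()                # inspect the raw bytes
print(text, len(on_disk))
```

The program only ever handles text. The bytes on disk follow the encoding passed to codecs.open.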
Note that for other ways of reading and writing files, you need to debug to see what the encoding situation is. For example, when I read excel files in python, I get unicode directly instead of str.
8 General handling points
- First change the encoding declaration of the source file and the system default encoding to utf8
- Use the unicode type uniformly during program execution
- For reading and writing files (with python's built-in open), what you get is str, so decode and encode that str accordingly.
To sum up:
- Set the relevant default encodings to utf8;
- Read file: you get the str type: str -> decode('utf8') -> unicode
- Program processing: use unicode
- Write file: unicode -> encode('utf8') -> str, and write the file with the str type
- Of course, the premise is that all files are in utf8 format, including source files and the data files that are read and written.
One more thing:
Using the unicode type uniformly while writing programs is only a suggestion, made because sticking to unicode reduces the trouble when processing strings.
If you feel that converting everything to unicode is too much trouble, you can consider using utf8-encoded str as the usual default and switching to unicode only for the problems that need it. When you run into an encoding problem, ask yourself whether it comes from not having used unicode uniformly.