[Reprinted] Character encoding problems in python2.7

https://www.cnblogs.com/liaohuiqiang/p/7247393.html

1 ascii, unicode, utf8


1.1 ascii

The earliest encoding. It defines only 128 characters (code points 0 to 127): English letters, digits, punctuation marks, and some control and other symbols. One byte represents one character.

1.2 unicode

One byte is not enough, because the characters of all the world's languages need to be encoded, so unicode assigns a unique code to every character. A character is usually represented with two bytes (some rare characters need four bytes). So, one thing to keep in mind: the unicode encoding mentioned below is treated as a double-byte encoding (two bytes per character). Strictly speaking, unicode is a character set and the byte width depends on the encoding form used to store it, but the double-byte view is enough for this article.

1.3 utf8

An ascii character needs only 1 byte, yet unicode assigns these characters 2 bytes as well. If an article is entirely in English (ascii characters), a lot of space is wasted (2 bytes are used where 1 would do), and this is why utf8 was created. utf8 is a variable-length encoding whose byte length depends on the character: ascii characters are encoded as 1 byte, Chinese characters usually as 3 bytes, and some rare characters as 4 bytes (the original design allowed sequences of up to 6 bytes, but the current standard stops at 4).

In computer memory, unicode is used uniformly.

In python, it is recommended to use unicode uniformly while the program is running, and to use utf8 when saving and reading files (apply the corresponding decode and encode when reading from and writing to disk).
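
The size differences described above can be checked directly. The following is a minimal sketch that is not part of the original article. It uses utf-16-le as a stand-in for a two-byte-per-character representation (an assumption made for illustration), the variable names are hypothetical, and it is written to run on both python2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Compare how many bytes the same characters take in different encodings.
# 'utf-16-le' stands in for a two-byte-per-character representation here;
# that choice is an assumption made for illustration.

english = u'A'
chinese = u'中'

ascii_len_english = len(english.encode('ascii'))      # 1 byte
utf8_len_english = len(english.encode('utf-8'))       # 1 byte, same as ascii
utf16_len_english = len(english.encode('utf-16-le'))  # 2 bytes

utf8_len_chinese = len(chinese.encode('utf-8'))       # 3 bytes
utf16_len_chinese = len(chinese.encode('utf-16-le'))  # 2 bytes
```

For English text utf8 halves the size compared with the two-byte form, while for Chinese text it is the two-byte form that is smaller.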

2 encoding declaration


By default, python uses ascii to interpret source files.

If a non-ascii character appears in the source file and no encoding is declared at the top, an encoding error is reported.

You can declare utf8 at the top to tell the interpreter to read the file as utf8. The source file can then contain Chinese without raising an error.

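# encoding=utf8
# Without the declaration above, an error is raised.
print '解释器用相应的encoding去解释python代码'  # "the interpreter uses the declared encoding to interpret python code"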

3 str and unicode in python2.7


python2.7 generally has two string types: unicode and str.

  • str is a byte string: the characters have already been converted into bytes according to some encoding. There is no fixed one-to-one correspondence between characters and bytes.
  • unicode is a string stored as unicode. Here one character corresponds to two bytes, one to one.
  • A string literal assigned directly has type str, a byte string, encoded into bytes according to the encoding declared at the top of the file.
  • If you put a u in front of the string literal when assigning, the type is unicode, and the string is stored directly as unicode.

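s1 = '字节串'
print type(s1) # prints <type 'str'>; encoded into bytes using the encoding declared at the top of the file.
print len(s1) # prints 9; with utf8 each Chinese character takes 3 bytes, so 3 characters take 9 bytes.

s2 = u'统一码'
print type(s2) # prints <type 'unicode'>; stored as unicode, 2 bytes per character.
print len(s2) # prints 3; unicode counts length in characters, so in this sense unicode is the true string type.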

Let's look at a realistic example. Suppose we need to find, in a file, all the words whose last two characters are "学习" ("learning"). When testing for this:

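s = '机器学习'
s[-2:] == '学习'
# Returns False, although at first glance you might expect the two to be equal.
# The '学习' literal is interpreted with the encoding declared at the top of the file; I declared utf8,
# where each Chinese character takes 3 bytes, so '学习' takes 6 bytes,
# whereas s[-2:] takes only the last two bytes, so the two are not equal.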

s = u'机器学习'
s[-2:] == u'学习'
# Returns True, which is why unicode is called the true string type. With unicode,
# '学习' is two characters and s[-2:] takes the last two characters, so they match.

For people who often work with Chinese strings, this pitfall can be avoided by using unicode uniformly.

Some string-processing functions also work with str, presumably because the function handles the encoding problem for you internally.
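
The difference between slicing bytes and slicing characters can be made explicit. This is a minimal sketch, not from the original article: the byte strings are built with an explicit encode('utf-8') so that the code behaves the same on python2.7 (str) and Python 3 (bytes), and the variable names are hypothetical.

```python
# -*- coding: utf-8 -*-
# Slicing bytes versus slicing characters.
# The byte strings are built with an explicit encode so the example behaves
# the same on python2.7 (str) and Python 3 (bytes).

text = u'机器学习'
suffix = u'学习'

raw = text.encode('utf-8')           # 12 bytes: 4 characters x 3 bytes
raw_suffix = suffix.encode('utf-8')  # 6 bytes

last_two_bytes_match = raw[-2:] == raw_suffix  # False: 2 bytes vs 6 bytes
last_six_bytes_match = raw[-6:] == raw_suffix  # True: the byte counts line up
last_two_chars_match = text[-2:] == suffix     # True: slicing by character
bytes_endswith = raw.endswith(raw_suffix)      # True: no manual byte counting
```

The endswith call is an example of a string function that still gives the expected answer on byte strings, because it compares whole byte sequences rather than a fixed number of positions.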

4 encode and decode in python2.7


Normal use of encode: encode a unicode object to get a byte string of type str. That is, unicode -> encode (with the specified encoding) -> str.

Normal use of decode: decode a str to get a unicode object. That is, str -> decode (with the specified encoding) -> unicode.

Note: both encode and decode need the encoding to be specified.

A conversion always involves two encodings: the one you start from and the one you end with. One side is always unicode, so you only need to specify the other one. The same holds for decoding.

In short, these two methods convert between unicode and str using the specified encoding.

s3 = u'统一码'.encode('utf8')
print type(s3) # prints <type 'str'>

s4 = '字节串'.decode('utf8')
print type(s4) # prints <type 'unicode'>

Improper use of encode: calling encode on a str. Because encode expects unicode, python first uses the system default encoding to decode the str into unicode, and then encodes with the encoding you supplied. (Note that the system default encoding here is not the encoding declared at the top of the file; see section 5 below for a concrete example.)

Improper use of decode: calling decode on a unicode object. python first uses the system default encoding to encode it into str, and then decodes with the encoding you supplied.

Therefore, if you change the system default encoding appropriately, even improper use does not raise an error. But it takes an extra detour, and I don't like that.
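
The hidden detour described above can be modeled in a few lines. This is only a sketch of the behavior, not python's actual implementation: implicit_encode is a hypothetical helper introduced for illustration, and the default encoding is passed in as a parameter rather than read from the interpreter. It runs on both python2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# A model of what python2.7 does when encode is called on a str.
# implicit_encode is a hypothetical helper, not a real Python API.

def implicit_encode(byte_string, target, default_encoding='ascii'):
    """Decode with the default encoding first, then encode to the target."""
    return byte_string.decode(default_encoding).encode(target)

raw = u'字节串'.encode('utf-8')  # a utf8 byte string

# With the ascii default, the hidden decode step fails on non-ASCII bytes.
try:
    implicit_encode(raw, 'utf-8')
    failed_with_ascii = False
except UnicodeDecodeError:
    failed_with_ascii = True

# With a utf8 default, the detour succeeds and returns the same bytes.
round_trip = implicit_encode(raw, 'utf-8', default_encoding='utf-8')
```

This is why calling encode on a str can raise a UnicodeDecodeError even though the call says encode: the failure comes from the implicit decode step.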

5 Modify the system default encoding


The system default encoding is ascii, and it needs to be changed accordingly.

It differs from the encoding declared at the top of the file: the declaration at the top is the encoding of the file's contents.

The system default encoding is the one that some python methods use implicitly: for example, the decode that happens first when you call encode on a str, or the encode that happens when you write unicode to a file with write (see section 7 on file reading and writing below).

import sys
reload(sys)
sys.setdefaultencoding('utf8')

s = '字节串str'

s.encode('utf8')
# is equivalent to
s.decode(sys.getdefaultencoding()).encode('utf8')

Let's look at another example of where the system default encoding comes into play.

import sys
print sys.getdefaultencoding()  # prints ascii

s = u'华南理工大学'
print s[-2:] == '大学'   # False, with a UnicodeWarning

reload(sys)
sys.setdefaultencoding('utf8')

print s[-2:] == '大学'  # True

According to the results: when python compares with ==, if one operand is unicode and the other is not (here the first is unicode and the second is str), it automatically uses the system default encoding to decode the str operand.

PS: Why is reload(sys) needed? First, reload is used to reload a module that was imported earlier.

sys has to be reloaded here because python deletes the setdefaultencoding method from sys when the module is loaded (possibly for safety reasons), so reload(sys) is needed to get it back.
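
If you would rather not change the system default encoding, the same comparison can be made to work by decoding the byte string yourself. This is a minimal sketch, not from the original article, and it assumes the byte string is known to be utf8. The byte string is built with an explicit encode so the code runs on both python2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Compare unicode with a byte string by decoding explicitly,
# instead of relying on the system default encoding.

s = u'华南理工大学'
raw_suffix = u'大学'.encode('utf-8')  # plays the role of the str '大学'

# Decode the byte string with the encoding it is known to use, then compare.
matches = s[-2:] == raw_suffix.decode('utf-8')
```

Both operands are unicode at the moment of comparison, so no implicit conversion takes place.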

6 View file encoding

import chardet

def detect_encoding(filename):
    # Read the raw bytes and let chardet guess their encoding.
    with open(filename, 'rb') as f:
        data = f.read()
    return chardet.detect(data)

7 File reading and writing


7.1 open

The first thing to remember is that read and write, the two gateways to a file, both deal in str, that is, in bytes.

python's built-in open reads the file byte by byte and returns a byte string (str). After reading, you must decode with the correct encoding to get the correct unicode, so you need to know which encoding the file was written in.

The same reasoning applies when writing a file. Data is written as bytes, in the form of str, and that str is encoded in some particular encoding, so take care to encode with the right one. Usually you encode to utf8 and then write the file.

If you write a unicode object, python encodes it to str using the system default encoding and then writes it to the file. This is because a file write needs str: if the data is str it is written as is, and if it is not, it is converted to str first and then written.

The simple rule is to write str whenever possible and avoid relying on the default encoding, so that there is no need to modify the default encoding at the start of the program.
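
The rule for the built-in open can be sketched as follows. This is an illustration, not code from the original article: the temporary file exists only to keep the example self-contained, and the file is opened in binary mode ('wb' and 'rb') so that the code deals in bytes on Python 3 as well as on python2.7.

```python
# -*- coding: utf-8 -*-
import os
import tempfile

text = u'统一码'

# A temporary file path is used only to keep the sketch self-contained.
fd, path = tempfile.mkstemp()
os.close(fd)

try:
    # Writing: encode unicode to utf8 bytes yourself, then write the bytes.
    with open(path, 'wb') as f:
        f.write(text.encode('utf-8'))

    # Reading: read the raw bytes, then decode with the file's encoding.
    with open(path, 'rb') as f:
        raw = f.read()
    restored = raw.decode('utf-8')
finally:
    os.remove(path)
```

Because every conversion names utf8 explicitly, the system default encoding never comes into play.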

7.2 codecs open

The open function in python's codecs module lets you specify an encoding. It guarantees that the bytes read and written are all encoded with that specified encoding.

Thus, when reading a file: the str that is read is decoded into unicode using the specified encoding.

When writing a file: if the data is unicode, it is encoded into str using the specified encoding and then written; if it is str, it is first decoded into unicode using the system default encoding, then encoded into str using the specified encoding and written.

The simple rule is to write unicode whenever possible and avoid relying on the default encoding, so that there is no need to modify the default encoding at the start of the program.
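
The behavior of codecs.open described above can be sketched like this. It is an illustration, not code from the original article, and the temporary file is used only to keep the example self-contained. The code runs on both python2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

text = u'统一码'

fd, path = tempfile.mkstemp()
os.close(fd)

try:
    # Writing: pass unicode; codecs encodes it to utf8 bytes.
    with codecs.open(path, 'w', encoding='utf-8') as f:
        f.write(text)

    # Reading: codecs decodes the bytes and returns unicode.
    with codecs.open(path, 'r', encoding='utf-8') as f:
        restored = f.read()

    # The bytes on disk are utf8, as a plain binary read confirms.
    with open(path, 'rb') as f:
        raw = f.read()
finally:
    os.remove(path)
```

Compared with the built-in open, the encode and decode steps are the same, but they happen inside the file object instead of in your own code.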

Note that for other ways of reading and writing files, you need to debug to see which types and encodings are involved. For example, when I read Excel files in python, the values come back as unicode directly rather than str.

8 General handling points


  1. First set both the encoding declared in the source file and the system default encoding to utf8
  2. Use the unicode type uniformly while the program is running
  3. When reading and writing files (with python's built-in open), what you get is str, so decode and encode the str accordingly.

To sum up:

  • Set the corresponding default encodings to utf8;
  • Read the file to get the str type: str -> decode('utf8') -> unicode
  • Program processing: use unicode
  • Write the file: unicode -> encode('utf8') -> str, and write the file with the str type
  • Of course, this assumes that all the files are in utf8, including the source files and the data files that are read and written.
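
The steps above can be put together in one small pipeline. This is a sketch under stated assumptions, not code from the original article: keep_lines_ending_with is a hypothetical helper, the input file is assumed to be utf8, and the sample words and temporary paths are made up. It reuses the task from section 3 of finding words that end with "学习", and it runs on both python2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import os
import tempfile

def keep_lines_ending_with(src_path, dst_path, suffix):
    """Copy the lines of a utf8 file that end with the unicode suffix."""
    with open(src_path, 'rb') as f:
        text = f.read().decode('utf-8')            # str -> unicode
    kept = [line for line in text.splitlines()     # work in unicode
            if line.endswith(suffix)]
    with open(dst_path, 'wb') as f:
        f.write(u'\n'.join(kept).encode('utf-8'))  # unicode -> str
    return kept

# Usage with temporary files (the paths and sample words are made up).
fd_in, src = tempfile.mkstemp()
fd_out, dst = tempfile.mkstemp()
os.close(fd_in)
os.close(fd_out)
try:
    with open(src, 'wb') as f:
        f.write(u'机器学习\n华南理工大学\n深度学习'.encode('utf-8'))
    result = keep_lines_ending_with(src, dst, u'学习')
    with open(dst, 'rb') as f:
        written = f.read().decode('utf-8')
finally:
    os.remove(src)
    os.remove(dst)
```

Decoding happens once at the input boundary and encoding once at the output boundary, so everything in between works on characters rather than bytes.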

One more thing:

Using the unicode type uniformly while writing programs is only a suggestion; the reason for it is that consistent use of unicode reduces trouble when processing strings.

If you feel that using unicode everywhere is too much trouble, you can use utf8-encoded str as your usual uniform type and switch to unicode only for the problems that need it. When you run into an encoding problem, consider whether it is caused by not using unicode uniformly (that is, by str and unicode being mixed).

Origin blog.csdn.net/baidu_26646129/article/details/104725887