Encoding problems when reading and writing Chinese text files in Python

Encoding (encode):

Before any characters we type can be saved to the computer's hard disk as a file (such as a .txt file), they must be converted into binary data according to some rule; only then can they be stored on disk. Such rules include GBK, UTF-8, and others.

Decoding (decode):

Likewise, for a file on the hard disk to be displayed correctly on the screen, the binary data read from disk must be decoded back into characters according to some rule. If the wrong rule is used for decoding, the text comes out garbled. For example, a file encoded as GBK must also be decoded as GBK; decoding it as UTF-8 produces garbled text or an error.
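
The two directions can be tried directly on a string, without any file involved. The following is only a minimal sketch using the built-in str.encode() and bytes.decode(); the variable names are made up for illustration.

```python
text = '中国你好'

# Encoding: the same characters become different bytes under different rules.
gbk_bytes = text.encode('gbk')
utf8_bytes = text.encode('utf-8')
print(len(gbk_bytes))   # 8  (these characters take 2 bytes each in GBK)
print(len(utf8_bytes))  # 12 (these characters take 3 bytes each in UTF-8)

# Decoding with the matching rule restores the original text.
print(gbk_bytes.decode('gbk') == text)     # True
print(utf8_bytes.decode('utf-8') == text)  # True

# Decoding with the wrong rule fails: these GBK bytes are not valid UTF-8.
try:
    gbk_bytes.decode('utf-8')
    wrong_decode_failed = False
except UnicodeDecodeError:
    wrong_decode_failed = True
print(wrong_decode_failed)  # True

# With errors='replace' a mismatch never raises; it yields garbled text instead.
garbled = utf8_bytes.decode('gbk', errors='replace')
print(garbled == text)  # False
```

So a mismatch shows up either as an exception or as silently wrong characters, depending on the bytes and on the error handling in use.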

Reading files

Python reads and writes text files through open().

Now, I have two files:

  • test1_gbk.txt
  • test2_utf-8.txt

What the two have in common: they store the same content ("中国你好").

What differs: test1_gbk.txt is stored on disk in GBK encoding, while test2_utf-8.txt is stored in UTF-8 encoding.
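
The post does not say how the two files were made. One possible way to produce files like them, as a sketch only, is to write the encoded bytes directly, so that the result does not depend on the platform's default encoding. The author's own files may contain extra bytes (a newline, for example), so their sizes need not match the ones below.

```python
from pathlib import Path

text = '中国你好'

# Write raw bytes so the on-disk encoding is exactly what we choose.
Path('test1_gbk.txt').write_bytes(text.encode('gbk'))
Path('test2_utf-8.txt').write_bytes(text.encode('utf-8'))

gbk_size = Path('test1_gbk.txt').stat().st_size
utf8_size = Path('test2_utf-8.txt').stat().st_size
print(gbk_size, utf8_size)  # 8 12
```

The same four characters take a different number of bytes in each file, which is why a reader has to know the encoding in order to decode them.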

Now let's run a test.
Test environment:
  • Windows 10
  • Python 3.7
  • PyCharm

1. Read test1_gbk.txt

f = open('test1_gbk.txt', 'r')
s = f.read()
f.close()
print(s)

Result: 中国你好

2. Read test2_utf-8.txt

f = open('test2_utf-8.txt', 'r')
s = f.read()
f.close()
print(s)

Result: an error is raised: UnicodeDecodeError: 'gbk' codec can't decode byte 0xbd in position 14: incomplete multibyte sequence

Why?

Looking at the source of open(), we find that many of its parameters have default values, encoding among them:

open(file, mode='r', buffering=None, encoding=None, errors=None, newline=None, closefd=True)
......
# encoding is explained as follows:
encoding is the name of the encoding used to decode or encode the
    file. This should only be used in text mode. The default encoding is
    platform dependent, but any encoding supported by Python can be
    passed.  See the codecs module for the list of supported encodings.
  • That is, encoding actually serves two purposes: encoding and decoding. My understanding is: when a file is opened for reading, encoding specifies how to decode it; when a file is opened for writing, encoding specifies how to encode the characters written to it.
  • Also, whether the default encoding used for decoding or encoding is gbk or utf-8 depends on the operating system; on a Chinese-language Windows system the default is gbk.
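
To see which default applies on a given machine, the locale module can be asked directly. This is a small sketch, assuming (as the Python 3.7 documentation describes) that open() falls back to locale.getpreferredencoding(False) when encoding is None; the temporary file is only there so that there is a file object to inspect.

```python
import locale
import os
import tempfile

# The platform-dependent default that open() uses when encoding=None.
default_encoding = locale.getpreferredencoding(False)
print('locale default:', default_encoding)

# A text-mode file object reports the encoding it actually chose.
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
with open(path, 'w') as f:
    file_encoding = f.encoding
os.remove(path)
print('open() chose:', file_encoding)
```

On a Chinese-language Windows system this typically prints cp936 (the code page name for GBK), while macOS and most Linux systems report UTF-8.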

  • This explains the error in the second test above. To get rid of it, just add encoding='utf-8':

f = open('test2_utf-8.txt', 'r', encoding='utf-8')
s = f.read()
f.close()
print(s)
  • Knowing this, it also follows that for test1_gbk.txt it makes no difference whether encoding='gbk' is given, since gbk is already the default on this system.
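
Since the default differs from system to system, one option is to always name the encoding when reading. The sketch below is one way to do that, not the only one: read_text is a hypothetical helper name, and the two files are recreated in a temporary directory so the example runs on its own.

```python
import os
import tempfile

text = '中国你好'

# Recreate the two test files in a scratch directory.
workdir = tempfile.mkdtemp()
gbk_path = os.path.join(workdir, 'test1_gbk.txt')
utf8_path = os.path.join(workdir, 'test2_utf-8.txt')
with open(gbk_path, 'wb') as f:
    f.write(text.encode('gbk'))
with open(utf8_path, 'wb') as f:
    f.write(text.encode('utf-8'))


def read_text(path, encoding):
    """Read a whole text file, decoding it with the given encoding."""
    # The with statement closes the file even if read() raises.
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


from_gbk = read_text(gbk_path, 'gbk')
from_utf8 = read_text(utf8_path, 'utf-8')
print(from_gbk == from_utf8 == text)  # True
```

With the encoding spelled out, the same script gives the same result on Windows, macOS, and Linux.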

Writing files

  • The above covers reading files; writing files works the same way.
  • The difference is that this time encoding specifies how to encode.
  • The file test3.txt created by the following code is saved in GBK encoding (the Windows default here):
f = open('test3.txt', 'w')
s = '中国你好'
f.write(s)
f.close()
  • If you want to send a friend a .txt file and he uses an Apple Mac laptop, the file you send must be a utf-8 encoded text file; otherwise he will see garbled text when opening it, because the Mac decodes with utf-8 by default.
  • If your laptop is also a Mac, there is nothing to worry about, because text files written on a Mac are utf-8 encoded by default.
  • But if you are a Windows user, you need to pay attention:
f = open('test4.txt', 'w', encoding='utf-8')
s = '中国你好'
f.write(s)
f.close()
  • On Windows, the code above is how to create the utf-8 encoded file test4.txt.
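
To check what actually ended up on disk, the file can be read back in binary mode, which skips decoding altogether. This is a rough sketch that writes test4.txt into a temporary directory rather than the current one.

```python
import os
import tempfile

text = '中国你好'
path = os.path.join(tempfile.mkdtemp(), 'test4.txt')

# Write with an explicit encoding, independent of the platform default.
with open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# 'rb' returns the raw bytes exactly as stored on disk.
with open(path, 'rb') as f:
    raw = f.read()

print(len(raw))                     # 12
print(raw == text.encode('utf-8'))  # True
```

If the encoding argument were left out on a gbk-default Windows system, the same check would show 8 bytes instead of 12.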

Origin www.cnblogs.com/liuxu2019/p/11184265.html