[Python] Tip: write csv files with utf-8-sig encoding to fix garbled text

The examples below compare three approaches: not specifying an encoding, specifying utf-8, and specifying utf-8-sig. Each is tried first on a csv file and then on a txt file.


First, write the csv file without specifying an encoding

import csv

with open('test.csv', 'w', newline='') as fp:  # newline='' avoids blank rows on Windows
    writer = csv.writer(fp)
    writer.writerow(['汉语', '俄语', '韩语', '日语', '英语'])
    writer.writerow(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])

On a Windows system whose default encoding is gbk, running this program raises the following error:

UnicodeEncodeError: 'gbk' codec can't encode character '\uc0ac' in position 14: illegal multibyte sequence
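
The error can be reproduced on any platform by encoding the row by hand. The following is a minimal sketch, assuming the default encoding is gbk as in the traceback above; the helper name `unencodable_chars` is made up for illustration.

```python
def unencodable_chars(text, encoding):
    """Return the characters in text that the given codec cannot encode."""
    bad = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


row = '爱你,люблю тебя,사랑해요,愛しています,love you'

# gbk has no Korean syllables, so the list includes '사' (U+C0AC)
print(unencodable_chars(row, 'gbk'))

# utf-8 can encode every Unicode character, so nothing is reported
print(unencodable_chars(row, 'utf-8'))
```

The first character reported for gbk is the same '\uc0ac' named in the error message.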

Second, specify utf-8 as the encoding and write the csv file

Next, write the content to test.csv with utf-8 encoding. The program now runs without error, but when the file is opened, everything except the English text is garbled:

import csv

with open('test.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['汉语', '俄语', '韩语', '日语', '英语'])
    writer.writerow(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])


Third, specify utf-8-sig as the encoding and write the csv file

After switching the encoding to utf-8-sig, the file displays correctly:

import csv

with open('test.csv', 'w', encoding='utf-8-sig', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['汉语', '俄语', '韩语', '日语', '英语'])
    writer.writerow(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])
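
To check the result without opening a spreadsheet program, the file can be read back in Python. The sketch below writes to a temporary directory (an assumption, made so the example does not touch test.csv) and confirms that the file begins with the three BOM bytes and that the rows survive a round trip.

```python
import csv
import os
import tempfile

rows = [
    ['汉语', '俄语', '韩语', '日语', '英语'],
    ['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'],
]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.csv')

    # newline='' lets the csv module control line endings itself
    with open(path, 'w', encoding='utf-8-sig', newline='') as fp:
        csv.writer(fp).writerows(rows)

    # Read the first three raw bytes of the file
    with open(path, 'rb') as fp:
        head = fp.read(3)

    # Reading with utf-8-sig strips the BOM again
    with open(path, 'r', encoding='utf-8-sig', newline='') as fp:
        read_back = list(csv.reader(fp))

print(head)               # the UTF-8 BOM bytes
print(read_back == rows)  # True
```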


Fourth, write a txt file without specifying an encoding

with open('test.txt','w') as fp:
    fp.write('爱你, люблю тебя, 사랑해요, 愛しています, love you')

As with the csv file, this raises the following error:

UnicodeEncodeError: 'gbk' codec can't encode character '\uc0ac' in position 16: illegal multibyte sequence

Fifth, specify utf-8 or utf-8-sig as the encoding and write the txt file

With either utf-8 or utf-8-sig encoding, the contents of test.txt display correctly:

with open('test.txt','w', encoding='utf-8') as fp:
    fp.write('爱你, люблю тебя, 사랑해요, 愛しています, love you')
with open('test.txt','w', encoding='utf-8-sig') as fp:
    fp.write('爱你, люблю тебя, 사랑해요, 愛しています, love you')
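
Both files display correctly in an editor, but they are not byte-identical, which matters when a program reads the txt file back. The sketch below (using a temporary directory as an assumption, so test.txt is not touched) shows that a file written with utf-8-sig and read with plain utf-8 keeps the BOM as a leading '\ufeff' character, while reading with utf-8-sig removes it.

```python
import os
import tempfile

text = '爱你, люблю тебя, 사랑해요, 愛しています, love you'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')

    with open(path, 'w', encoding='utf-8-sig') as fp:
        fp.write(text)

    # Plain utf-8 decodes the BOM as an ordinary character
    with open(path, encoding='utf-8') as fp:
        as_utf8 = fp.read()

    # utf-8-sig recognises the BOM and drops it
    with open(path, encoding='utf-8-sig') as fp:
        as_utf8_sig = fp.read()

print(repr(as_utf8[0]))     # '\ufeff'
print(as_utf8_sig == text)  # True
```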


What is the difference between utf-8 and utf-8-sig?

  • utf-8: the code unit is a single byte, so the byte order is the same on every system. There is no byte-order problem, and a BOM is not actually required.

  • utf-8-sig: "sig" is short for "signature", i.e. UTF-8 with a signature (UTF-8 with BOM).

  • BOM: short for Byte Order Mark. It appears at the very start of a text file, and the Unicode standard uses it to identify which encoding the file is stored in.
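
The difference is easy to see at the byte level. A short sketch using only the standard library `codecs` module:

```python
import codecs

plain = '爱你'.encode('utf-8')
signed = '爱你'.encode('utf-8-sig')

print(plain.hex())
print(signed.hex())

# utf-8-sig output is the 3-byte BOM followed by the plain utf-8 bytes
print(signed == codecs.BOM_UTF8 + plain)  # True

# When decoding, utf-8-sig treats the BOM as optional,
# so it also reads data that has no BOM
print(plain.decode('utf-8-sig') == signed.decode('utf-8-sig'))  # True
```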

Why use utf-8-sig when writing csv files?

  • When Excel reads a csv file, it identifies the encoding from the BOM at the head of the file. If there is no BOM, Excel falls back to a default encoding (on Windows this is typically the system's ANSI code page, such as gbk on a Chinese-locale system) rather than UTF-8.

  • A csv file generated with plain utf-8 encoding carries no BOM, so Excel reads it with its default encoding instead of UTF-8, and the non-ASCII text comes out garbled.
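
The effect can be imitated in Python. The sketch below is only an illustration: `guess_encoding` is a hypothetical stand-in for a reader that sniffs the BOM, not Excel's actual logic, and the gbk fallback is an assumption matching the Chinese-locale Windows system in the examples above.

```python
import codecs


def guess_encoding(data, fallback='gbk'):
    """Hypothetical BOM sniffer: use utf-8-sig if a BOM is present, else the fallback."""
    if data.startswith(codecs.BOM_UTF8):
        return 'utf-8-sig'
    return fallback


text = '爱你,사랑해요,love you'

for write_encoding in ('utf-8', 'utf-8-sig'):
    data = text.encode(write_encoding)
    read_encoding = guess_encoding(data)
    # errors='replace' keeps going where the wrong codec hits invalid bytes
    shown = data.decode(read_encoding, errors='replace')
    print(write_encoding, '->', read_encoding, '->', shown)
```

Without the BOM the bytes are decoded with the wrong codec and the text no longer matches; with the BOM the original text comes back unchanged.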

Why use utf-8 when writing txt files?

  • When a txt file is opened for writing without an encoding, Python on Windows uses the system default encoding (gbk in this case), and gbk raises an error on any character it does not support. Declaring encoding='utf-8' when opening the file avoids this error.
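
Which default applies on a given machine can be checked with `locale.getpreferredencoding`. A minimal sketch (the temporary directory is an assumption, used so test.txt is not touched) that prints the default and then writes and reads with an explicit encoding, so the result does not depend on the platform:

```python
import locale
import os
import tempfile

# The encoding open() falls back to when none is given.
# The value depends on the platform and locale settings.
print(locale.getpreferredencoding(False))

text = '爱你, люблю тебя, 사랑해요, 愛しています, love you'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'test.txt')

    # Explicit encoding on both write and read
    with open(path, 'w', encoding='utf-8') as fp:
        fp.write(text)
    with open(path, 'r', encoding='utf-8') as fp:
        round_trip = fp.read()

print(round_trip == text)  # True
```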

Origin blog.csdn.net/qq_36759224/article/details/104417871