The first example compares three ways of writing a file: without specifying an encoding, with the encoding set to utf-8, and with the encoding set to utf-8-sig. Each is tried on both a csv file and a txt file.
First, write directly to a csv file without specifying the encoding
import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('test.csv', 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['汉语', '俄语', '韩语', '日语', '英语'])
    writer.writerow(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])
Running the program at this point raises the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\uc0ac' in position 14: illegal multibyte sequence
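The error can be reproduced on any system by encoding the same row with gbk explicitly. This is a minimal sketch; the variable names `row` and `first_bad` are only illustrative, and it assumes the csv row is joined with plain commas as `csv.writer` does for these values.

```python
# Join the second row the way csv.writer would (no quoting is needed here).
row = ','.join(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])

first_bad = None
try:
    row.encode('gbk')
except UnicodeEncodeError as exc:
    # exc.start is the index of the first character gbk cannot encode.
    first_bad = (exc.start, row[exc.start])

print(first_bad)
```

The Chinese, Russian, and Japanese text all fit in gbk; the Korean text is the first part that does not, which is why the traceback points at '\uc0ac'.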
Second, specify the encoding as utf-8, then write to a csv file
Next, write the content to test.csv with utf-8 encoding. When the file is opened in Excel, everything except the English text is garbled:
import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('test.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['汉语', '俄语', '韩语', '日语', '英语'])
    writer.writerow(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])
Third, specify the encoding as utf-8-sig, then write to a csv file
After changing the encoding to utf-8-sig, the file displays normally:
import csv
# newline='' lets the csv module control line endings (avoids blank rows on Windows)
with open('test.csv', 'w', encoding='utf-8-sig', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['汉语', '俄语', '韩语', '日语', '英语'])
    writer.writerow(['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you'])
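A file written with utf-8-sig should also be read back with utf-8-sig. The sketch below writes to a temporary path instead of test.csv (the path and variable names are illustrative) and shows what happens to the first cell when the file is read as plain utf-8.

```python
import csv
import os
import tempfile

rows = [['汉语', '俄语', '韩语', '日语', '英语'],
        ['爱你', 'люблю тебя', '사랑해요', '愛しています', 'love you']]
path = os.path.join(tempfile.mkdtemp(), 'test.csv')  # illustrative temp path

with open(path, 'w', encoding='utf-8-sig', newline='') as fp:
    csv.writer(fp).writerows(rows)

# Reading with utf-8-sig strips the BOM before the csv module sees the text.
with open(path, encoding='utf-8-sig', newline='') as fp:
    read_with_sig = list(csv.reader(fp))

# Reading with plain utf-8 keeps the BOM as '\ufeff' at the start of the first cell.
with open(path, encoding='utf-8', newline='') as fp:
    read_with_utf8 = list(csv.reader(fp))

print(read_with_sig == rows)
print(repr(read_with_utf8[0][0]))
```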
Fourth, write directly to a txt file without specifying the encoding
with open('test.txt', 'w') as fp:
    fp.write('爱你, люблю тебя, 사랑해요, 愛しています, love you')
As with the csv file, this raises the following error:
UnicodeEncodeError: 'gbk' codec can't encode character '\uc0ac' in position 16: illegal multibyte sequence
Fifth, specify the encoding as utf-8 or utf-8-sig, then write to a txt file
When test.txt is written with either utf-8 or utf-8-sig encoding, the contents are completely normal:
with open('test.txt', 'w', encoding='utf-8') as fp:
    fp.write('爱你, люблю тебя, 사랑해요, 愛しています, love you')

with open('test.txt', 'w', encoding='utf-8-sig') as fp:
    fp.write('爱你, люблю тебя, 사랑해요, 愛しています, love you')
What is the difference between utf-8 and utf-8-sig?
- utf-8: uses the byte as its coding unit. The byte order is the same on all systems, so there is no byte-order problem and a BOM is not actually required.
- utf-8-sig: "sig" is short for "signature", i.e. utf-8 with a signature (UTF-8 with BOM).
- BOM: short for Byte Order Mark. It appears at the head of a text file, and the Unicode standard uses it to identify which encoding format the file uses.
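At the byte level, the difference is just the three-byte signature at the front. A minimal sketch using the standard `codecs.BOM_UTF8` constant (the variable names are illustrative):

```python
import codecs

text = '爱你'
plain = text.encode('utf-8')        # no BOM
signed = text.encode('utf-8-sig')   # BOM followed by the same bytes

print(codecs.BOM_UTF8)                     # the UTF-8 signature bytes
print(signed == codecs.BOM_UTF8 + plain)   # utf-8-sig only adds the prefix
print(len(signed) - len(plain))            # size of the prefix in bytes
```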
Why use utf-8-sig encoding when writing csv files?
- When Excel reads a csv file, it identifies the encoding from the BOM at the head of the file. If there is no BOM, it falls back to its default encoding, which on Windows is typically the system's ANSI code page (for example gbk on a Chinese-locale system), not utf-8.
- A csv file written with utf-8 encoding has no BOM, so Excel reads it with that default encoding and the non-ASCII text comes out garbled. Writing with utf-8-sig adds the BOM, so Excel recognizes the file as UTF-8.
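The garbling can be imitated in Python by decoding utf-8 bytes with the wrong codec. This sketch assumes the reader falls back to gbk, as it might on a Chinese-locale Windows system; it does not reproduce Excel's actual detection logic.

```python
original = '爱你,love you'
utf8_bytes = original.encode('utf-8')   # what a BOM-less utf-8 csv contains

# Decode the same bytes as gbk; errors='replace' guards against invalid sequences.
misread = utf8_bytes.decode('gbk', errors='replace')

print(misread)               # the Chinese part comes out as unrelated characters
print(misread == original)   # the round trip does not recover the text
print('love you' in misread) # ASCII is identical in both encodings, so it survives
```

This matches the observation in the second example that only the English text stays readable.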
Why use utf-8 encoding when writing txt files?
- When a txt file is written without an explicit encoding, Python on Windows defaults to the system's locale encoding (gbk here). Any character that gbk does not support then raises an error. Declaring the encoding as utf-8 when opening the file avoids this error.
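To see which encoding `open()` would pick by default, and to confirm that an explicit utf-8 round trip is lossless, something like the following can be used. It is a sketch only: the temporary path and variable names are illustrative, and the reported default depends on the system and Python configuration.

```python
import locale
import os
import tempfile

# The locale encoding that open() typically falls back to when encoding is omitted.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)

text = '爱你, люблю тебя, 사랑해요, 愛しています, love you'
path = os.path.join(tempfile.mkdtemp(), 'test.txt')  # illustrative temp path

# Passing encoding explicitly makes the result independent of the system locale.
with open(path, 'w', encoding='utf-8') as fp:
    fp.write(text)

with open(path, encoding='utf-8') as fp:
    round_trip = fp.read()

print(round_trip == text)
```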