Python encoding errors: cases and solutions
I have been writing web crawlers lately, and parsing or extracting content often fails because of problems with the content's encoding or format. To avoid repeating these errors in the future, here is a summary of what I ran into over the past few days:
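1. Special symbols, emoji, and similar characters
Background: while crawling a cooking tutorial website and parsing the pages with BeautifulSoup, the following error was raised (the message mentions Tk, so it is most likely triggered when the text is printed in a Tk-based shell such as IDLE):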
UnicodeEncodeError: 'UCS-2' codec can't encode character '\U0001f44d' in position 0: Non-BMP character not supported in Tk
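Solution:
    import sys
    # Map every code point outside the Basic Multilingual Plane to U+FFFD (the replacement character)
    non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)
    targetText = targetText.translate(non_bmp_map)
Here targetText is the text you need to convert. After the call, every character above U+FFFF (such as the emoji in the error message) has been replaced with U+FFFD.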
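As a self-contained sketch of the approach above, the mapping can be wrapped in a small helper. The function name strip_non_bmp and the sample string are hypothetical and chosen only for illustration; they are not part of the original crawler.

```python
import sys

# Map every code point above the Basic Multilingual Plane (U+10000 and up)
# to U+FFFD, the Unicode replacement character.
NON_BMP_MAP = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xFFFD)


def strip_non_bmp(text):
    """Return text with every non-BMP character replaced by U+FFFD.

    Hypothetical helper name, used here only for illustration.
    """
    return text.translate(NON_BMP_MAP)


# Sample input containing the emoji from the error message (U+1F44D).
sample = "Great recipe \U0001f44d"
cleaned = strip_non_bmp(sample)
```

Characters inside the BMP, including Chinese text, are left untouched, so the helper should be safe to apply to all scraped text before displaying it.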
2. Chinese text written to a CSV file is garbled
Background: the csv module is the standard way to work with CSV files, and the files are usually opened with the 'utf-8' encoding, as follows:
    import csv
    targetText = ['abc', 'efg']
    # newline='' lets the csv module control line endings itself
    csv_target = open('mycsv.csv', 'a+', newline='', encoding='utf-8')
    writer = csv.writer(csv_target)
    writer.writerow(targetText)
    csv_target.close()
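When Chinese text is written this way (i.e. targetText contains Chinese, such as targetText = ['張三', '李四']), it shows up garbled. This typically happens when the file is opened in a spreadsheet program such as Excel, which may not treat the file as UTF-8 unless it starts with a byte order mark.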
Solution: change the encoding to 'utf-8-sig'
    import csv
    targetText = ['張三', '李四']
    # 'utf-8-sig' writes a byte order mark at the start of a new file
    csv_target = open('mycsv.csv', 'a+', newline='', encoding='utf-8-sig')
    writer = csv.writer(csv_target)
    writer.writerow(targetText)
    csv_target.close()
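To see what 'utf-8-sig' actually changes, one option is to write a row and then inspect the first bytes of the file. The following is a minimal sketch, not the original crawler code: the temporary directory, the 'w' mode, and the variable names are assumptions made so the example can run without touching an existing mycsv.csv.

```python
import csv
import os
import tempfile

rows = [['張三', '李四']]

# Hypothetical output location: a fresh temporary directory,
# so no existing file is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'mycsv.csv')

# 'utf-8-sig' writes a UTF-8 byte order mark (EF BB BF) at the start of the file.
with open(path, 'w', newline='', encoding='utf-8-sig') as csv_target:
    writer = csv.writer(csv_target)
    writer.writerows(rows)

# Inspect the raw bytes: the file should begin with the BOM.
with open(path, 'rb') as raw:
    starts_with_bom = raw.read(3) == b'\xef\xbb\xbf'

# Reading back with 'utf-8-sig' strips the BOM again.
with open(path, newline='', encoding='utf-8-sig') as csv_source:
    read_back = list(csv.reader(csv_source))
```

Using a with block also closes the file automatically, even if writing fails partway through.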
Encoding in Python is full of pitfalls.
This is the first version of these notes; I will keep updating them as I run into more problems.