How to fix garbled Chinese text when scraping web pages with Python BeautifulSoup

While scraping data from chinafund.cn (中国基金网), the Chinese parts of the output came out garbled.

Original code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = r'http://data.chinafund.cn/'
urlString = urlopen(url)
soup = BeautifulSoup(urlString, 'html.parser')
nameList = soup.findAll('div', {'id': 'content'})  # print(nameList)
for name in nameList:
    nameString = name.getText(',')  # get raw data, cells joined by ','
nameString = nameString.replace('--', '0')
# '--' means NA on this website; replace it with '0' to ease later steps like float()
nameString = nameString.splitlines()  # split into lines on '\r\n' (the default behavior)
# print(nameString)
data = []  # empty list to collect rows
for line in nameString:
    lines = line.split(',')  # split the text by ','
    data += [lines[1:5]]  # keep the 2nd, 3rd, 4th, and 5th fields of each line
colnames = ['Date', 'Symbol', 'Fundname', 'NAV']  # assign column names
dataFrame = pd.DataFrame(data, columns=colnames)
dataFrame = dataFrame[4:len(dataFrame) - 2]  # drop the first 4 and last 2 rows, which are invalid
# print(dataFrame)
dataFrame['Symbol'] = dataFrame['Symbol'].astype(float)  # cast Symbol to float (removes leading zeroes for easier work)
dataFrame['NAV'] = dataFrame['NAV'].astype(float)  # cast NAVs to float
filePath = 'D:/MS/Allprice.csv'  # my file path
dataFrame.to_csv(filePath, encoding='utf8', index=False)  # encoding='utf-8'
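The replace/splitlines/split pipeline above can be checked in isolation on a small hypothetical raw string (the column names and fund row here are made up for illustration), without fetching the page:

```python
# Hypothetical raw text as getText(',') might return it: cells joined by ','
# and rows separated by '\r\n'.
raw = '净值日期,基金代码,基金简称,单位净值\r\n2018-06-25,000001,某基金,--\r\n'
raw = raw.replace('--', '0')  # '--' means NA on the site; make it numeric
rows = [line.split(',') for line in raw.splitlines() if line]
print(rows[1])  # ['2018-06-25', '000001', '某基金', '0']
```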



With the code above, the 'Fundname' column contains garbled text, even though I already passed encoding='utf8' in the last line. To solve this, I changed the encoding: the last line becomes dataFrame.to_csv(filePath, encoding='utf_8_sig', index=False). The resulting file displays correctly and can also be used for subsequent reading and computation.
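The reason 'utf_8_sig' helps is that it prepends a UTF-8 byte order mark (BOM, b'\xef\xbb\xbf') to the file; programs like Excel use the BOM to recognize the file as UTF-8 instead of decoding it with a local legacy codepage. A minimal sketch (the sample string is just for illustration):

```python
# Compare plain UTF-8 with UTF-8-with-BOM ('utf-8-sig') for a short
# Chinese sample string.
text = '基金'
plain = text.encode('utf-8')
with_bom = text.encode('utf-8-sig')
print(plain)     # b'\xe5\x9f\xba\xe9\x87\x91'
print(with_bom)  # b'\xef\xbb\xbf\xe5\x9f\xba\xe9\x87\x91'
# The only difference is the 3-byte BOM prefix.
assert with_bom == b'\xef\xbb\xbf' + plain
```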

Final result:



Reposted from blog.csdn.net/clintlong/article/details/80814790