How to fix garbled Chinese text when scraping web pages with Python BeautifulSoup

While scraping data from chinafund.cn (中国基金网), the Chinese parts of the output came out garbled.

Original code:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

url = r'http://data.chinafund.cn/'
urlString = urlopen(url)
soup = BeautifulSoup(urlString, 'html.parser')
nameList = soup.findAll('div', {'id': 'content'})  # print(nameList)
for name in nameList:
    nameString = name.getText(',')  # get raw data, cells joined by ','
nameString = nameString.replace('--', '0')
# '--' means NA on this website; replace it with '0' to ease later steps like float()
nameString = nameString.splitlines()  # split into lines on '\r\n' (the default behavior)
# print(nameString)
data = []  # empty list to collect rows
for line in nameString:
    lines = line.split(',')  # split the text by ','
    data += [lines[1:5]]  # keep the 2nd, 3rd, 4th, and 5th fields of each line
colnames = ['Date', 'Symbol', 'Fundname', 'NAV']  # assign column names
dataFrame = pd.DataFrame(data, columns=colnames)
dataFrame = dataFrame[4:len(dataFrame) - 2]  # drop the first 4 and last 2 rows, which are invalid
# print(dataFrame)
dataFrame['Symbol'] = dataFrame['Symbol'].astype(float)  # cast Symbol to float (removes leading zeroes for easier work)
dataFrame['NAV'] = dataFrame['NAV'].astype(float)  # cast NAVs to float
filePath = 'D:/MS/Allprice.csv'  # my file path
dataFrame.to_csv(filePath, encoding='utf8', index=False)  # encoding='utf-8'
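The replace/splitlines/split pipeline above can be checked in isolation on a small hypothetical raw string (the column names and fund row here are made up for illustration), without fetching the page:

```python
# Hypothetical raw text as getText(',') might return it: cells joined by ','
# and rows separated by '\r\n'.
raw = '净值日期,基金代码,基金简称,单位净值\r\n2018-06-25,000001,某基金,--\r\n'
raw = raw.replace('--', '0')  # '--' means NA on the site; make it numeric
rows = [line.split(',') for line in raw.splitlines() if line]
print(rows[1])  # ['2018-06-25', '000001', '某基金', '0']
```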



With the code above, the 'Fundname' column contains garbled text, even though I already passed encoding='utf8' in the last line. To solve this, I changed the encoding: the last line becomes dataFrame.to_csv(filePath, encoding='utf_8_sig', index=False). The resulting file displays correctly and can also be used for subsequent reading and computation.
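The reason 'utf_8_sig' helps is that it prepends a UTF-8 byte order mark (BOM, b'\xef\xbb\xbf') to the file; programs like Excel use the BOM to recognize the file as UTF-8 instead of decoding it with a local legacy codepage. A minimal sketch (the sample string is just for illustration):

```python
# Compare plain UTF-8 with UTF-8-with-BOM ('utf-8-sig') for a short
# Chinese sample string.
text = '基金'
plain = text.encode('utf-8')
with_bom = text.encode('utf-8-sig')
print(plain)     # b'\xe5\x9f\xba\xe9\x87\x91'
print(with_bom)  # b'\xef\xbb\xbf\xe5\x9f\xba\xe9\x87\x91'
# The only difference is the 3-byte BOM prefix.
assert with_bom == b'\xef\xbb\xbf' + plain
```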

Final result:



Reposted from blog.csdn.net/clintlong/article/details/80814790