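1. Convert the "datare.html" file in the folder into the corresponding Excel output
tip: soup = BeautifulSoup(open('datare.html'), 'html.parser') reads the file. If an encoding detection error occurs, pass a Chinese encoding (for example encoding='gbk') to open()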
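The encoding problem mentioned in the tip can be reproduced without the real file. The following is a minimal sketch, assuming a GBK-encoded file; the temporary file name sample_gbk.html and its contents are made up for the example and stand in for datare.html.

```python
import os
import tempfile

# Hypothetical sample: a tiny HTML fragment containing Chinese text.
html_text = "<table><tr><td>北京</td></tr></table>"

# Write the sample as GBK into a temporary directory (stand-in for datare.html).
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample_gbk.html")
with open(path, "w", encoding="gbk") as f:
    f.write(html_text)

# Reading with the wrong codec fails: these GBK bytes are not valid UTF-8.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    decoded_as_utf8 = True
except UnicodeDecodeError:
    decoded_as_utf8 = False

# Reading with the matching codec round-trips the text.
with open(path, "r", encoding="gbk") as f:
    recovered = f.read()

print(decoded_as_utf8)
print(recovered == html_text)
```

Passing the encoding explicitly keeps the result independent of the platform default, which differs between systems.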
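import pandas

# Open the file with an explicit Chinese encoding to avoid decode errors
with open('datare.html', 'r', encoding='gbk') as f:
    # read_html accepts a file-like object and returns one DataFrame per <table>
    data = pandas.read_html(f)

# Write the first table to an Excel workbook
excel_writer = pandas.ExcelWriter('datare.xlsx')
data[0].to_excel(excel_writer)
excel_writer.close()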
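Converting HTML to xlsx with the pandas library
pandas.read_html(): parses the HTML tables into a list of DataFrames, one per table
Given the following HTML table:
read_html() reads it as: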
[    城市   环比   同比   定基   增长  减少  Unnamed: 6    1
0  "北京"  101.5  120.7  121.4  121.4   NaN       121.4  NaN
1  "上海"  101.2  127.3  127.8  131.4   NaN         NaN  NaN
2  "广州"  101.3  119.4  120.0  146.4   NaN         NaN  NaN
3  "深圳"  102.0  140.9  145.5  121.9   NaN         NaN  NaN
4  "沈阳"  100.1  101.4  101.6  126.4   0.0         NaN  NaN
5     "3"    5.0      7    7.0    8.0   NaN         8.0  NaN
6     NaN    NaN    "8"    1.0    NaN   NaN         NaN  NaN
7       1    NaN      4    NaN    NaN   7.0         NaN  NaN
8     NaN    NaN      4    NaN    NaN   NaN         NaN  NaN
9     NaN    NaN      1    NaN    NaN   NaN         NaN  NaN]
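To make the idea of turning table markup into rows more concrete, here is a stdlib-only sketch that collects cell text row by row and pads short rows with None, loosely resembling the NaN cells in the output above. It is not how pandas implements read_html(); the class TableRows, the helper pad_rows and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of <td>/<th> cells, grouped by <tr> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows
        self._row = None    # row being built, or None outside <tr>
        self._cell = None   # text pieces of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def pad_rows(rows, fill=None):
    """Pad short rows so every row has as many cells as the widest one."""
    width = max((len(r) for r in rows), default=0)
    return [r + [fill] * (width - len(r)) for r in rows]


# Made-up markup: the last row is missing its third cell.
sample = """
<table>
  <tr><th>城市</th><th>环比</th><th>同比</th></tr>
  <tr><td>北京</td><td>101.5</td><td>120.7</td></tr>
  <tr><td>上海</td><td>101.2</td></tr>
</table>
"""

parser = TableRows()
parser.feed(sample)
table = pad_rows(parser.rows)
for row in table:
    print(row)
```

The sketch keeps every value as a string; pandas additionally infers numeric types and uses the header row as column labels.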
pandas.ExcelWriter('file_name'): creates an Excel writer object used to write DataFrame objects to Excel worksheets
DataFrame.to_excel: exports a DataFrame to an Excel file
Reference blog: https://blog.csdn.net/sinat_30062549/article/details/51180518
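2. Following Example 20 in the book, modify the code to output the 50 lowest-ranked universities
tip: the URL of the university ranking is: http://gaokao.xdf.cn/201911/10991728.html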
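import requests as rq
from bs4 import BeautifulSoup as Bs
import pandas as pd
import numpy as np

def get(url):
    # Fetch the page and decode it as UTF-8
    rp = rq.get(url)
    rp.encoding = 'utf-8'
    return rp.text

def lastN(html, n):
    # Collect every table row, then walk the last n of them
    tables = Bs(html, 'html.parser').find_all('tr')
    for tr in tables[-n:]:
        tds = tr.find_all('td')
        # Children of the second child node of the second cell (the school name)
        school_raw = [td.contents for td in tds][1][1].contents
        if len(school_raw) == 1:
            # Plain text: strip newlines and tabs
            print(school_raw[0].replace('\n', '').replace('\t', ''))
        else:
            # Otherwise the name sits in a nested tag; print its string
            print(school_raw[1].string)

if __name__ == '__main__':
    url = 'https://gaokao.xdf.cn/201911/10991728.html'
    lastN(get(url), 50)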
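The two plain-Python idioms that lastN relies on, negative slicing and whitespace cleanup, can be tried in isolation. This is a small sketch with made-up data; the helper names tail and clean_name are hypothetical and not part of the script above.

```python
def tail(items, n):
    """Return the last n items; returns everything if n exceeds the length."""
    if n <= 0:
        return []  # items[-0:] would return the whole list, so guard it
    return items[-n:]


def clean_name(text):
    """Drop newlines and tabs, as the replace() chain in lastN does, then trim."""
    return text.replace("\n", "").replace("\t", "").strip()


# Made-up stand-in for the <tr> rows of the ranking table.
rows = ["rank %d" % i for i in range(1, 11)]

print(tail(rows, 3))
print(tail(rows, 50) == rows)
print(clean_name("\n\t\t示例大学\n"))
```

Because a slice longer than the list simply returns the whole list, asking for more rows than the table contains would also pick up any header row that find_all('tr') returned.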
Reference blog: https://blog.csdn.net/weixin_47434673/article/details/124161861