1. Convert the "datare.html" file in the folder into the corresponding Excel file
tip: soup = BeautifulSoup(open('datare.html'), 'html.parser') reads the file. If an encoding error occurs, pass a Chinese encoding such as encoding='gbk' to the open function
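import io
import pandas
# Read the HTML as text, decoding it as GBK
with open('datare.html', 'r', encoding='gbk') as f:
    html = f.read()
# read_html returns a list with one DataFrame per <table>; recent pandas versions deprecate passing literal HTML, so wrap the string in StringIO
data = pandas.read_html(io.StringIO(html))
# Write the first table to datare.xlsx; leaving the with-block saves and closes the file
with pandas.ExcelWriter('datare.xlsx') as excel_writer:
    data[0].to_excel(excel_writer)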
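The tip mentions adding a Chinese encoding when the default one fails. A minimal sketch of that idea, assuming the file is either UTF-8 or GBK (the helper name decode_with_fallback and the candidate list are illustrative, not part of the exercise):

```python
def decode_with_fallback(raw, encodings=("utf-8", "gbk")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # this candidate failed, try the next one
    raise ValueError(f"none of {encodings} could decode the data")


# Example bytes standing in for the contents of a GBK-encoded file
sample = "<td>北京</td>".encode("gbk")
text, used = decode_with_fallback(sample)
print(used, text)
```

For a real file, the bytes could come from open('datare.html', 'rb').read(). UTF-8 is tried first because it is strict, whereas GBK may decode UTF-8 bytes without raising an error and produce garbled text.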
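The solution uses the pandas library to convert the HTML table to an xlsx file
pandas.read_html(): converts every HTML table in the input into a DataFrame and returns them as a list
For example, for an HTML table with one row per city, read_html() returns a list holding one DataFrame: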
[    城市   环比   同比   定基   增长  减少  Unnamed: 6    1
0  "北京"  101.5  120.7  121.4  121.4   NaN       121.4  NaN
1  "上海"  101.2  127.3  127.8  131.4   NaN         NaN  NaN
2  "广州"  101.3  119.4  120.0  146.4   NaN         NaN  NaN
3  "深圳"  102.0  140.9  145.5  121.9   NaN         NaN  NaN
4  "沈阳"  100.1  101.4  101.6  126.4   0.0         NaN  NaN
5     "3"    5.0      7    7.0    8.0   NaN         8.0  NaN
6     NaN    NaN    "8"    1.0    NaN   NaN         NaN  NaN
7       1    NaN      4    NaN    NaN   7.0         NaN  NaN
8     NaN    NaN      4    NaN    NaN   NaN         NaN  NaN
9     NaN    NaN      1    NaN    NaN   NaN         NaN  NaN]
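To see the list-of-DataFrames behavior on something small, here is a sketch that parses an inline table. The table contents are made up for illustration and are not taken from datare.html; it assumes pandas plus an HTML parser it supports (for example lxml) are installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical sample table, not the contents of datare.html
html = """
<table>
  <tr><th>city</th><th>index</th></tr>
  <tr><td>Beijing</td><td>101.5</td></tr>
  <tr><td>Shanghai</td><td>101.2</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per <table> element
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))       # number of tables found
print(list(df.columns))  # header cells become column names
print(df)
```

Because the result is always a list, the solution has to index it (data[0]) before calling to_excel.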
pandas.ExcelWriter('file_name'): creates a writer object through which DataFrame objects are written to the worksheets of an Excel file
DataFrame.to_excel(): exports a DataFrame to an Excel worksheet
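One reason to go through ExcelWriter rather than passing a file name straight to to_excel is that a single writer can hold several worksheets. The sketch below illustrates this with made-up data and hypothetical sheet names; it assumes an Excel engine such as openpyxl is installed and writes to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Made-up data for illustration
df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "index": [101.5, 101.2]})

path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")

# Leaving the with-block saves and closes the workbook
with pd.ExcelWriter(path) as writer:
    df.to_excel(writer, sheet_name="prices", index=False)
    df.head(1).to_excel(writer, sheet_name="first_row", index=False)

# sheet_name=None reads every sheet into a dict keyed by sheet name
sheets = pd.read_excel(path, sheet_name=None)
print(list(sheets))
```

Passing index=False leaves out the DataFrame's row index, which would otherwise appear as an extra first column in the worksheet.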
Reference blog: https://blog.csdn.net/sinat_30062549/article/details/51180518
2. Modify the code of Example 20 in the reference book so that it outputs the last 50 universities
tip: The URL of the university ranking is: http://gaokao.xdf.cn/201911/10991728.html
import requests as rq
from bs4 import BeautifulSoup as Bs

def get(url):
    # Download the page and decode it as UTF-8
    rp = rq.get(url)
    rp.encoding = 'utf-8'
    return rp.text

def lastN(html, n):
    # Collect every table row, then keep only the last n
    rows = Bs(html, 'html.parser').find_all('tr')
    for tr in rows[-n:]:
        tds = tr.find_all('td')
        # The school name is inside the second child of the second cell
        school_raw = tds[1].contents[1].contents
        if len(school_raw) == 1:
            # Plain-text name: strip newlines and tabs
            print(school_raw[0].replace('\n', '').replace('\t', ''))
        else:
            # Name wrapped in a nested tag
            print(school_raw[1].string)

if __name__ == '__main__':
    url = 'https://gaokao.xdf.cn/201911/10991728.html'
    lastN(get(url), 50)
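The slice rows[-n:] is what selects the last n rows. The offline sketch below shows the same idea on a made-up table, so it can be tried without network access. The markup is hypothetical and the real page may be structured differently. It also swaps the .contents indexing for get_text(strip=True), which returns a cell's text whether or not the name is wrapped in a nested tag.

```python
from bs4 import BeautifulSoup

# Hypothetical ranking table; the real page's markup may differ
html = """
<table>
  <tr><td>1</td><td>School A</td></tr>
  <tr><td>2</td><td><a href="#">School B</a></td></tr>
  <tr><td>3</td><td>School C</td></tr>
</table>
"""


def last_n_names(html, n):
    """Return the text of the second cell in each of the last n rows."""
    rows = BeautifulSoup(html, "html.parser").find_all("tr")
    names = []
    for tr in rows[-n:]:
        tds = tr.find_all("td")
        if len(tds) < 2:
            continue  # skip header rows or rows without a name cell
        names.append(tds[1].get_text(strip=True))
    return names


print(last_n_names(html, 2))
```

Returning a list instead of printing makes the result easy to check or to pass on, for example to a DataFrame. If n exceeds the number of rows, the slice simply returns all of them.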
Reference blog: https://blog.csdn.net/weixin_47434673/article/details/124161861