Python Exercises (8)

1. Convert the "datare.html" file in the folder into a corresponding Excel file

tip: soup = BeautifulSoup(open('datare.html'), 'html.parser') reads the file. If an encoding detection error occurs, pass a Chinese encoding (for example encoding='gbk') to open().
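The encoding tip can be illustrated with a minimal sketch that uses only the standard library. The file name demo_gbk.html and its contents are made up for illustration; the real datare.html is assumed to be GBK-encoded, as in the solution that follows.

```python
import os
import tempfile

# Hypothetical stand-in for datare.html: a tiny GBK-encoded HTML file.
path = os.path.join(tempfile.mkdtemp(), "demo_gbk.html")
with open(path, "w", encoding="gbk") as f:
    f.write("<table><tr><td>北京</td></tr></table>")

# Reading GBK bytes as UTF-8 fails with a decode error.
try:
    with open(path, "r", encoding="utf-8") as f:
        f.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Passing the matching encoding to open() reads the text correctly.
with open(path, "r", encoding="gbk") as f:
    text = f.read()

print(utf8_ok, "北京" in text)
```

The decoded text (or the open file object itself) can then be handed to BeautifulSoup or pandas.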

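import io
import pandas

# Read the GBK-encoded HTML file; read_html() returns a list of DataFrames, one per table
with open('datare.html','r',encoding='gbk') as f:
    data=pandas.read_html(io.StringIO(f.read()))

# Write the first table to an Excel workbook; the with block saves and closes the file
with pandas.ExcelWriter('datare.xlsx') as excel_writer:
    data[0].to_excel(excel_writer)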

This uses the pandas library to convert HTML to xlsx.

pandas.read_html(): converts HTML tables into DataFrames
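As a small sketch of what read_html() gives back, the snippet below parses an inline HTML string. The table contents and column names are invented for illustration, and it assumes an HTML parser backend that pandas can use (such as lxml) is installed.

```python
import io

import pandas as pd

# Hypothetical HTML table, not the one from datare.html.
html = """
<table>
  <tr><th>city</th><th>value</th></tr>
  <tr><td>Beijing</td><td>101.5</td></tr>
  <tr><td>Shanghai</td><td>101.2</td></tr>
</table>
"""

# read_html() returns a list with one DataFrame per <table> element found.
tables = pd.read_html(io.StringIO(html))
df = tables[0]
print(len(tables))
print(df)
```

Because the result is a list, the solution above indexes it with data[0] to get the first table.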

Given an HTML table, read_html() returns a list of DataFrames such as:

[     城市     环比     同比     定基     增长   减少  Unnamed: 6   1
0  "北京"  101.5  120.7  121.4  121.4  NaN       121.4 NaN
1  "上海"  101.2  127.3  127.8  131.4  NaN         NaN NaN
2  "广州"  101.3  119.4  120.0  146.4  NaN         NaN NaN
3  "深圳"  102.0  140.9  145.5  121.9  NaN         NaN NaN
4  "沈阳"  100.1  101.4  101.6  126.4  0.0         NaN NaN
5   "3"    5.0      7    7.0    8.0  NaN         8.0 NaN
6   NaN    NaN    "8"    1.0    NaN  NaN         NaN NaN
7     1    NaN      4    NaN    NaN  7.0         NaN NaN
8   NaN    NaN      4    NaN    NaN  NaN         NaN NaN
9   NaN    NaN      1    NaN    NaN  NaN         NaN NaN]

pandas.ExcelWriter('file_name'): creates an Excel writer object used to write DataFrame objects into Excel worksheets

DataFrame.to_excel: exports a DataFrame to an Excel file
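The two calls work together as in the following sketch, which writes two sheets into one workbook and reads them back. The DataFrame contents, sheet names, and the file name demo.xlsx are made up for illustration, and it assumes an xlsx engine such as openpyxl is installed.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, not taken from datare.html.
df = pd.DataFrame({"city": ["Beijing", "Shanghai"], "value": [101.5, 101.2]})
out_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")

# Using ExcelWriter as a context manager saves and closes the file on exit.
with pd.ExcelWriter(out_path) as writer:
    df.to_excel(writer, sheet_name="raw", index=False)
    df.sort_values("value").to_excel(writer, sheet_name="sorted", index=False)

# sheet_name=None reads every sheet into a dict keyed by sheet name.
sheets = pd.read_excel(out_path, sheet_name=None)
print(list(sheets))
```

When only one sheet is needed, passing a file path directly to to_excel() is enough; an ExcelWriter is mainly useful for writing several sheets to the same file.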

Reference blog: https://blog.csdn.net/sinat_30062549/article/details/51180518

2. Referring to Example 20 in the book, modify the code to output the 50 lowest-ranked universities

tip: the university ranking page is at: http://gaokao.xdf.cn/201911/10991728.html

import requests as rq
from bs4 import BeautifulSoup as Bs

def get(url):
    # Download the page and decode it as UTF-8
    rp=rq.get(url)
    rp.encoding='utf-8'
    return rp.text

def lastN(html,n):
    # Collect every table row, then keep only the last n
    rows=Bs(html,'html.parser').find_all('tr')
    for tr in rows[-n:]:
        tds=tr.find_all('td')
        # The second child of the second cell is the element holding the school name
        school_raw=[td.contents for td in tds][1][1].contents
        if len(school_raw)==1:
            # Plain text: strip newlines and tabs
            print(school_raw[0].replace('\n','').replace('\t',''))
        else:
            # The name is wrapped in another tag: print that tag's text
            print(school_raw[1].string)

if __name__=='__main__':
    url='https://gaokao.xdf.cn/201911/10991728.html'
    lastN(get(url),50)
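The core idea, slicing the list of rows with a negative index, can be tried offline with the sketch below. The inline HTML and the helper name last_n_schools are hypothetical; the real page nests the school name more deeply, which is why the solution above digs through .contents.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the ranking page: one header row and five data rows.
html = """
<table>
  <tr><th>rank</th><th>school</th></tr>
  <tr><td>1</td><td>School A</td></tr>
  <tr><td>2</td><td>School B</td></tr>
  <tr><td>3</td><td>School C</td></tr>
  <tr><td>4</td><td>School D</td></tr>
  <tr><td>5</td><td>School E</td></tr>
</table>
"""

def last_n_schools(html, n):
    """Return the school names from the last n table rows."""
    rows = BeautifulSoup(html, "html.parser").find_all("tr")
    names = []
    for tr in rows[-n:]:  # negative slice keeps only the final n rows
        tds = tr.find_all("td")
        if len(tds) < 2:  # skip header rows, which contain no <td> cells
            continue
        names.append(tds[1].get_text(strip=True))
    return names

result = last_n_schools(html, 2)
print(result)
```

If n is larger than the number of rows, the slice simply returns all rows, so the header check keeps the function from failing on a short table.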

Reference blog: https://blog.csdn.net/weixin_47434673/article/details/124161861
