Python Exercises (8)

1. Convert the "datare.html" file in the folder into a corresponding Excel file

tip: soup = BeautifulSoup(open('datare.html'), 'html.parser') reads the file. If an encoding error occurs, pass the Chinese encoding to the open function, e.g. open('datare.html', encoding='gbk')
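The encoding tip can be sketched as a small helper that tries several encodings in turn. This is only a minimal sketch: the helper name read_text_with_fallback, the candidate encoding list, and the demo file are assumptions made for illustration and are not part of the exercise.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8", "gbk")):
    """Return the file's text, trying each encoding until one decodes cleanly."""
    last_error = None
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError as err:
            last_error = err  # remember the failure and try the next encoding
    raise last_error


# Demo (hypothetical file): write a small GBK-encoded snippet, then read it back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo_gbk.html")
with open(demo_path, "w", encoding="gbk") as f:
    f.write("<td>北京</td>")

text = read_text_with_fallback(demo_path)
print(text)
```

The returned string can then be passed to BeautifulSoup(text, 'html.parser') as in the tip.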

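import pandas

# Read the GBK-encoded HTML file and parse every <table> into a DataFrame
with open('datare.html', 'r', encoding='gbk') as f:
    data = pandas.read_html(f)

# Write the first table to an Excel workbook; leaving the with block saves and closes it
with pandas.ExcelWriter('datare.xlsx') as excel_writer:
    data[0].to_excel(excel_writer)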

Use the pandas library to convert html to xlsx

pandas.read_html(): parses the tables in an HTML document and returns a list of DataFrames, one per table

Given an HTML table, read_html() returns a list of DataFrames, for example:

[     城市     环比     同比     定基     增长   减少  Unnamed: 6   1
0  "北京"  101.5  120.7  121.4  121.4  NaN       121.4 NaN
1  "上海"  101.2  127.3  127.8  131.4  NaN         NaN NaN
2  "广州"  101.3  119.4  120.0  146.4  NaN         NaN NaN
3  "深圳"  102.0  140.9  145.5  121.9  NaN         NaN NaN
4  "沈阳"  100.1  101.4  101.6  126.4  0.0         NaN NaN
5   "3"    5.0      7    7.0    8.0  NaN         8.0 NaN
6   NaN    NaN    "8"    1.0    NaN  NaN         NaN NaN
7     1    NaN      4    NaN    NaN  7.0         NaN NaN
8   NaN    NaN      4    NaN    NaN  NaN         NaN NaN
9   NaN    NaN      1    NaN    NaN  NaN         NaN NaN]
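The list-of-DataFrames behaviour can be checked on a small inline table. This is a sketch under assumptions: the table, its English column names, and the two rows are made up for illustration (the numbers are borrowed from the output above), and read_html() needs an HTML parser such as lxml to be installed.

```python
from io import StringIO

import pandas as pd

# A tiny illustrative table; the column names are hypothetical.
html = """
<table>
  <thead>
    <tr><th>city</th><th>mom</th><th>yoy</th></tr>
  </thead>
  <tbody>
    <tr><td>Beijing</td><td>101.5</td><td>120.7</td></tr>
    <tr><td>Shanghai</td><td>101.2</td><td>127.3</td></tr>
  </tbody>
</table>
"""

# read_html() returns a list with one DataFrame per <table> found.
tables = pd.read_html(StringIO(html))
first = tables[0]

print(len(tables))          # number of tables found
print(list(first.columns))  # header cells become the column labels
print(first)
```

Because the result is always a list, the exercise code indexes it with data[0] to get the first table.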

pandas.ExcelWriter('file_name'): creates a writer object used to write DataFrame objects into the worksheets of an Excel file

DataFrame.to_excel(): exports a DataFrame to an Excel file, or to a worksheet of an open ExcelWriter
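How the two calls fit together can be sketched by writing two DataFrames into separate worksheets of one workbook. The data, sheet names, and temporary output path below are assumptions for illustration only, and writing .xlsx files requires an Excel engine such as openpyxl to be installed.

```python
import os
import tempfile

import pandas as pd

# Illustrative data only; not taken from datare.html.
df_prices = pd.DataFrame({"city": ["Beijing", "Shanghai"], "index_value": [101.5, 101.2]})
df_notes = pd.DataFrame({"note": ["illustrative data only"]})

out_path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")

# The context manager saves and closes the workbook when the block exits.
with pd.ExcelWriter(out_path) as writer:
    df_prices.to_excel(writer, sheet_name="prices", index=False)
    df_notes.to_excel(writer, sheet_name="notes", index=False)

# Read the workbook back to confirm which worksheets were written.
with pd.ExcelFile(out_path) as xls:
    sheet_names = xls.sheet_names
print(sheet_names)
```

Passing index=False leaves the DataFrame's row index out of the sheet; the exercise code keeps the default, so its output includes an index column.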

Reference blog: https://blog.csdn.net/sinat_30062549/article/details/51180518

2. Modify the code of Example 20 in the reference book so that it outputs the last 50 universities

tip: The URL of the university ranking is: http://gaokao.xdf.cn/201911/10991728.html

import requests as rq
from bs4 import BeautifulSoup as Bs

def get(url):
    # Download the page and decode it as UTF-8
    rp = rq.get(url)
    rp.encoding = 'utf-8'
    return rp.text

def lastN(html, n):
    # Collect every table row, then keep only the last n
    rows = Bs(html, 'html.parser').find_all('tr')
    for tr in rows[-n:]:
        tds = tr.find_all('td')
        # The school name is inside the second child node of the second cell
        school_raw = tds[1].contents[1].contents
        if len(school_raw) == 1:
            # Plain-text name: strip newlines and tabs
            print(school_raw[0].replace('\n', '').replace('\t', ''))
        else:
            # Name wrapped in a nested tag: print that tag's text
            print(school_raw[1].string)

if __name__ == '__main__':
    url = 'https://gaokao.xdf.cn/201911/10991728.html'
    lastN(get(url), 50)
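The "last n rows" idea can be tried offline on a small inline table. This is a simplified sketch, not the solution above: the HTML, the university names, and the helper last_n_names are made up for illustration, and the real ranking page nests the name more deeply than this flat layout.

```python
from bs4 import BeautifulSoup

# Hypothetical ranking table: rank in the first cell, name in the second.
html = """
<table>
  <tr><td>1</td><td>University A</td></tr>
  <tr><td>2</td><td>University B</td></tr>
  <tr><td>3</td><td>University C</td></tr>
  <tr><td>4</td><td>University D</td></tr>
</table>
"""


def last_n_names(html, n):
    """Return the text of the second cell for the last n table rows."""
    rows = BeautifulSoup(html, "html.parser").find_all("tr")
    names = []
    for tr in rows[-n:]:  # a negative slice keeps the final n rows
        tds = tr.find_all("td")
        if len(tds) >= 2:  # skip rows without a name cell
            names.append(tds[1].get_text(strip=True))
    return names


print(last_n_names(html, 2))
```

Note that rows[-0:] is the same as rows[0:], so n = 0 would return every row rather than none.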

Reference blog:  https://blog.csdn.net/weixin_47434673/article/details/124161861

Origin blog.csdn.net/qq_53401568/article/details/128366308