Python Reptile - directional crawling "Chinese University Ranking Network"

Content of the discussion from China University of MOOC-- Beijing Institute of Technology - Artemisia days -Python web crawler to extract the relevant information and practical sections

 

We pre-crawling url below

http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html

Page excerpt

Read page source code in a browser

 

Can be found in the table data is written directly to an HTML page information, so we can take direct operations directed reptiles.

 

Our overall design ideas are as follows:

1. Obtain ranking university network content from the network

2. Extract page content information to an appropriate data structure

3. Use of the data structure shown and outputs the result

 

Careful observation can be found, the structure of HTML, each <tr> tag contains all the information a university, and a set of <tr> inside, school name university, provincial, rankings and other information from the tag <td>

All <tr> are <tbody> child node

The page configuration information few design ideas, as follows

 1 import requests
 2 from bs4 import BeautifulSoup
 3 import bs4
 4 
 5 def getHTMLText(url):  #获取url信息
 6     try:
 7         r = requests.get(url,timeout = 30)
 8         r.raise_for_status()
 9         r.encoding = r.apparent_encoding
10         return r.text
11     except:
12         return ""
13 
14 def fillUnivList(ulist,html):  #将一个html页面放入一个列表
15     soup = BeautifulSoup(html,"html.parser")
16     #每个<tr>包含一所大学的所有信息
17     #所有<tr>信息包在<tbody>中
18     for tr in soup.find('tbody').children:
19         if isinstance(tr,bs4.element.Tag):  #过滤掉非标签信息,以取出包含在<tr>标签中的bs4类型的Tag标签
20             tds = tr('td')  #等价于tr.find_all('td'),在tr标签中找td标签内容
21             # print(tds)
22             ulist.append([tds[0].string,tds[1].string,tds[3].string,tds[2].string])
23             #td[0],[1],[3],[2],分别对应每组td信息中的排名,学校名称,得分,区域。将这些信息从摘取出来
24 
25 
26 def printUnivList(ulist,num):  #将列表信息输出
27     print("{:^10}\t{:^6}\t{:^10}\t{:^10}".format("排名","大学名称","得分","省市"))
28     for i in range(num):
29         u = ulist[i]
30         print("{:^10}\t{:^6}\t{:^10}\t{:^10}".format(u[0],u[1],u[2],u[3]))
31 
32 if __name__ == '__main__':
33     uinfo = []
34     url = "http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html"
35     html = getHTMLText(url)
36     fillUnivList(uinfo,html)
37     printUnivList(uinfo,20)  #暂时只看前20所学校

输出信息如下

 

 

 其中第27行处,{:^10}为Python的格式化输出,30表示输出宽度约束为30个字符,^表示输出时右对齐,若宽度小于字符串的实际宽度,以实际宽度输出

 

为了使输出结果更整齐,我们可以对于输出部分进行优化

 

我们在输出时用到了format()方法,这样就会产生一个问题:当中文字符宽度不够时,方法会默认采用西文字符填充。而中西文字符宽度并不相同,这就是导致每行字符无法一一对齐的原因。

 

这里我们就可以设置format()填充方式为,采用中文字符的空格填充,即chr(12288)

于是我们对printUnivList()方法做如下修改

def printUnivList(ulist,num):  #将列表信息输出
    tplt = "{0:^10}\t{1:{4}^10}\t{2:^10}\t{3:^10}"  #{4}表示需要用到format方法中,从0伊始定义的第[4]个变量
    print(tplt.format("排名","大学名称","得分","省市",chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0],u[1],u[2],u[3],chr(12288)))

输出结果如下

 

 这样就对齐了

 

Guess you like

Origin www.cnblogs.com/fcbyoung/p/12297037.html