Crawling University Ranking Data with Python -- 2019

Preparation

  1. Input: the URL of the university ranking page
  2. Output: the university ranking information printed to the screen
  3. Libraries needed: requests and bs4 (see the install note below)
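If the two libraries are not installed yet, pip install requests beautifulsoup4 should set them up (beautifulsoup4 is the PyPI package that provides bs4).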

Approach

  1. Fetch the ranking page
  2. Extract the contents of the page and store them in a data structure
  3. Use the data structure to format and print the result

Program design

  1. Define the function getHTMLText() to fetch the page
  2. Define the function fillUnivList() to load the data structure
  3. Define the function printUnivList() to print the result to the screen

Overall:

  • Write the individual functions first; together they form the overall framework
  • Write the main() function that wires the pieces together
  • Finally, call main() (a stub skeleton of this structure follows this list)
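One way to picture that framework before filling in any details (stub bodies only; the real implementations follow step by step):

def getHTMLText(url):
    ...   # step 1: fetch the page and return its HTML text

def fillUnivList(ulist, html):
    ...   # step 2: parse html and append one row per university to ulist

def printUnivList(ulist, num):
    ...   # step 3: print the top-num rows of ulist

def main():
    # wire the three steps together
    ulist = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(ulist, html)
    printUnivList(ulist, 100)

main()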

Steps

View the page source

Analyze the page source to find where the content to be crawled lives.

(Screenshot 001.png: the page source, showing the ranking table markup.)

As the screenshot shows, the ranking information sits inside the <tbody> tag; each university occupies one <tr> row, and the individual fields are the strings inside its <td> tags.
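To make that structure concrete, here is a minimal sketch that parses a made-up two-row fragment the same way the real crawler will (the names and scores are placeholders, not live data):

import bs4
from bs4 import BeautifulSoup

sample = '''
<table><tbody>
<tr><td>1</td><td>清华大学</td><td>北京</td><td>94.6</td></tr>
<tr><td>2</td><td>北京大学</td><td>北京</td><td>76.5</td></tr>
</tbody></table>
'''
soup = BeautifulSoup(sample, "html.parser")
for tr in soup.find('tbody').children:
    if isinstance(tr, bs4.element.Tag):   # skip the newline strings between the rows
        tds = tr('td')                    # shorthand for tr.find_all('td')
        print([td.string for td in tds])  # e.g. ['1', '清华大学', '北京', '94.6']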

Define the function getHTMLText()

import requests

def getHTMLText(url):
    '''
    Fetch the page at url and return its text, so the page content can be scraped.
    '''
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an exception for non-200 status codes
        r.encoding = r.apparent_encoding  # guess the real encoding from the content
        return r.text
    except requests.RequestException:
        return "Fetch failed!"

Define the function fillUnivList()

import bs4
from bs4 import BeautifulSoup

def fillUnivList(ulist, html):
    '''
    Extract the data from html and append it to the ulist list.
    '''
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip children that are strings, not tags
            tds = tr('td')                   # shorthand for tr.find_all('td')
            # Debug aid: print(tds[i], tds[i].string) to inspect each column.
            # Page columns are rank, name, province, score; store them as
            # [rank, name, score, province] to match the print order below.
            ulist.append([tds[0].string, tds[1].string, tds[3].string, tds[2].string])
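Fed the two-row sample fragment from the sketch above, the function stores each row in the order the printer expects, with the score moved ahead of the province:

ulist = []
fillUnivList(ulist, sample)   # 'sample' is the fragment defined in the earlier sketch
print(ulist[0])               # ['1', '清华大学', '94.6', '北京']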

Define the function printUnivList()

def printUnivList(ulist, num):
    '''
    Print the first num entries of ulist, i.e. the top-num universities.
    '''
    print("{:^3}\t{:^10}\t{:^20}\t{:^30}".format("Rank", "University", "Score", "Province"))
    for i in range(num):
        u = ulist[i]
        print("{:^3}\t{:^10}\t{:^20}\t{:^30}".format(u[0], u[1], u[2], u[3]))

Define the main() function

def main():
    '''
    Tie the whole program together.
    '''
    ulist = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(ulist, html)
    printUnivList(ulist, 100)

Call the main function

main()
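As a small idiomatic improvement (not in the original post), the call can be wrapped in the standard guard so the crawler does not run when the file is imported as a module:

if __name__ == '__main__':
    main()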

The complete code

import requests
import bs4
from bs4 import BeautifulSoup


def getHTMLText(url):
    '''
    Fetch the page at url and return its text, so the page content can be scraped.
    '''
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an exception for non-200 status codes
        r.encoding = r.apparent_encoding  # guess the real encoding from the content
        return r.text
    except requests.RequestException:
        return "Fetch failed!"


def fillUnivList(ulist, html):
    '''
    Extract the data from html and append it to the ulist list.
    '''
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip children that are strings, not tags
            tds = tr('td')                   # shorthand for tr.find_all('td')
            # Debug aid: print(tds[i], tds[i].string) to inspect each column.
            # Page columns are rank, name, province, score; store them as
            # [rank, name, score, province] to match the print order below.
            ulist.append([tds[0].string, tds[1].string, tds[3].string, tds[2].string])


def printUnivList(ulist, num):
    '''
    Print the first num entries of ulist, i.e. the top-num universities.
    '''
    print("{:^3}\t{:^10}\t{:^20}\t{:^30}".format("Rank", "University", "Score", "Province"))
    for i in range(num):
        u = ulist[i]
        print("{:^3}\t{:^10}\t{:^20}\t{:^30}".format(u[0], u[1], u[2], u[3]))


def main():
    '''
    Tie the whole program together.
    '''
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2019.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 100)


main()

Origin www.cnblogs.com/moniter/p/12334232.html