Python crawler beginner example three: crawling the Chinese university rankings

Foreword

  I wrote this example today while following the Chinese University MOOC course taught by Song Tian of Beijing Institute of Technology. Soon after finishing it, though, I noticed something was off: the example given in the course targets the old "Best Chinese Universities" ranking site, but that site has since been rebuilt, and the original link now leads to the ShanghaiRanking (上海软科) Best Chinese Universities Ranking. The code from the course therefore needed some modification. With persistent effort (and all kinds of frustration and quite a few detours), it now basically works, so I am writing this post to record the process.

One, the page to be crawled

1. Website link

Link: https://www.shanghairanking.cn/rankings/bcur/2020 .

2. Content to crawl

  This example crawls the ranking, university name, and total score in the following figure.
[Figure: screenshot of the 2020 ranking page showing the rank, university name, and total score columns]

Two, programming approach

  Teacher Song Tian explained this part in class; here I organize his explanation and share it with you.

1. Function description

Input: URL link of university ranking

Output: screen output of university ranking information (ranking, university name, total score)

Technical route: requests–bs4

Targeted crawler: crawls only the input URL, no extended crawling

Note: the requests and bs4 libraries can only obtain static page content. How to obtain dynamically rendered page content is something I will cover in detail in a later article. (A quick way to check whether the data you want is in the static page is shown below.)
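For reference, here is a small sanity check (my own addition, not part of the course code) that fetches the page with requests and confirms the table markup actually appears in the static HTML. If it does not, the data is most likely rendered by JavaScript and this technical route will not work:

import requests

url = "https://www.shanghairanking.cn/rankings/bcur/2020"
r = requests.get(url, timeout=30)
r.encoding = r.apparent_encoding
# If 'tbody' appears in the raw HTML, the ranking table is part of the static page,
# so requests + bs4 are enough to extract it.
print("tbody" in r.text)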

2. Program structure design

Step 1: Get the content of the university ranking webpage from the Internet: define the function getHTMLText()

Step 2: Extract the information in the webpage content to the appropriate data structure: define the function fillUnivList()

Step 3: Use the data structure to display and output the result: define the function printUnivList()

Three, writing the functions

  Before writing the functions, we first need to look at the page's source code.
[Figure: the page source, showing the markup of the ranking table]
  By inspecting the source, we can see that all of the university information is wrapped in a table body whose tag is called tbody. Inside the tbody, each university's information is wrapped in a tag called tr, and each tr tag contains several td tags that hold that university's specific fields. The university name, however, is not directly inside a td: it is wrapped in an a tag, so it needs separate handling.
  So we first traverse the tbody tag to get the block containing all university information, then find the tr tags inside the tbody to get each individual university's information, and finally find the td tags (and the a tag) inside each tr and append the data we need to our ulist list.
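As a quick aside, calling a bs4 Tag object like tr('td') is shorthand for tr.find_all('td'). The following tiny, self-contained example (the HTML here is made up for illustration, not taken from the real page) shows the idea:

from bs4 import BeautifulSoup

# Made-up HTML mimicking the structure described above: tbody > tr > td, with the name inside an a tag.
html = "<tbody><tr><td>1</td><td><a>XX大学</a></td><td>100</td></tr></tbody>"
soup = BeautifulSoup(html, "html.parser")
tr = soup.find("tr")
print(tr("td")[0].string)   # the first td: the rank column -> 1
print(tr("a")[0].string)    # the university name, taken from the a tag -> XX大学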

Note:
  1. Since the university name is contained in an a tag, we define a separate list to hold the a tags (as distinct from the list of td tags).
  2. To make the output visually neater, chr(12288), the full-width Chinese space, is used as the fill character; this is purely for column alignment (see the small demonstration below).
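A tiny standalone demonstration of that fill character (my own illustration, not part of the crawler): padding with chr(12288) keeps a column of Chinese text aligned, whereas the default half-width space would not line up with full-width characters.

# chr(12288) is the full-width (Chinese) space; it is passed in as the fill character for the slot.
tplt = "{0:{1}^10}"
print(tplt.format("清华大学", chr(12288)))
print(tplt.format("哈尔滨工业大学", chr(12288)))
print(tplt.format("北京大学", chr(12288)))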

1. Function getHTMLText()

  Get the content of university ranking web pages from the Internet.

import requests

def getHTMLText(url):    # fetch the page at the given URL and return its text
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an exception for non-200 status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the page content
        return r.text
    except:
        return ""

2. Function fillUnivList()

  Extract the information in the webpage content into the appropriate data structure.

def fillUnivList(ulist, html):    # parse the html page and fill the ulist list (the core step)
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):    # skip children that are not bs4 Tag objects
            a = tr('a')      # all a tags in this row, stored as a list
            tds = tr('td')   # all td tags in this row, stored as a list
            ulist.append([tds[0].string.strip(), a[0].string.strip(), tds[4].string.strip()])

Note: pay attention to the use of the strip() method here. It removes the specified characters (by default, whitespace such as spaces and newlines) or a sequence of such characters from the beginning and end of a string; it only removes characters at the beginning and end, never in the middle. It is used here so that the crawled fields line up properly in the formatted output.
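A tiny illustration (not part of the crawler):

print("  清华大学\n".strip())    # leading spaces and the trailing newline are removed
print("北 京 大 学".strip())      # nothing to remove at the ends; spaces in the middle are kept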

3. Function printUnivList()

  Use the data structure to display and output the results.

def printUnivList(ulist1, num):    # print the first num entries of the ulist1 list
    # formatted output
    tplt = "{0:^10}\t{1:{3}^12}\t{2:^10}"
    # {0}, {1}, {2} are the output slots; {3} means: when a field is narrower than the column width,
    # pad it with format argument 3, i.e. chr(12288), the full-width Chinese space
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))    # header: rank, university name, total score
    for i in range(num):
        u = ulist1[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))
    print()
    print("共有记录" + str(num) + "条")    # "N records in total"

Four, the complete code

'''
Functional description:
Input:  the URL of the university ranking page
Output: the ranking information printed to the screen (rank, university name, total score)
Technical route: requests -- bs4 (can only obtain static page content)
Targeted crawler: crawls only the given URL, no extended crawling

Program structure:
1. Fetch the university ranking page content from the network: function getHTMLText()
2. Extract the information from the page content into a suitable data structure: function fillUnivList()
3. Use the data structure to display and output the result: function printUnivList()
'''
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):    # fetch the page at the given URL and return its text
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):    # parse the html page and fill the ulist list (the core step)
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):    # skip children that are not bs4 Tag objects
            a = tr('a')      # all a tags in this row, stored as a list
            tds = tr('td')   # all td tags in this row, stored as a list
            ulist.append([tds[0].string.strip(), a[0].string.strip(), tds[4].string.strip()])

def printUnivList(ulist1, num):    # print the first num entries of the ulist1 list
    # formatted output
    tplt = "{0:^10}\t{1:{3}^12}\t{2:^10}"
    # {0}, {1}, {2} are the output slots; {3} pads short fields with chr(12288), the full-width Chinese space
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))    # header: rank, university name, total score
    for i in range(num):
        u = ulist1[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))
    print()
    print("共有记录" + str(num) + "条")    # "N records in total"

def main():
    uinfo = []    # the list that will hold the university information
    url = "https://www.shanghairanking.cn/rankings/bcur/2020"
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 10)    # print the top 10

main()
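A small optional refinement (my own suggestion, not part of the course code): guard the entry point so the script can also be imported as a module without immediately running the crawl:

if __name__ == "__main__":
    main()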

The output is as follows:
[Figure: the program output, listing the top 10 universities with their rank, name, and total score]
  If there are any errors in this article, please point them out~

Reference

中国大学MOOC (Chinese University MOOC): Python网络爬虫与信息提取 (Python Web Crawling and Information Extraction)
https://www.icourse163.org/course/BIT-1001870001
