Directional crawler for Chinese university rankings: problems encountered in teacher Songtian's Python crawler course, and their solutions

First, here is the program from the course. It no longer runs correctly.

Replace the 2016 URL with this year's URL: http://www.shanghairanking.cn/rankings/bcur/2020

The code is as follows:

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名","学校名称","总分"))
    for i in range(num):
        u=ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0],u[1],u[2]))
    
def main():
    uinfo = []
    url = 'http://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs
main()

Output result:

An error is raised:
AttributeError: 'NoneType' object has no attribute 'children'
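
This error means that soup.find('tbody') returned None: find() returns None when no matching tag exists in the parsed document, for example when getHTMLText() has returned an empty string after a failed request. Below is a minimal sketch of a guard, run against made-up inline HTML rather than the live page. The helper name count_rows and both sample strings are assumptions for illustration only.

```python
from bs4 import BeautifulSoup
import bs4

def count_rows(html):
    """Return the number of <tr> tags inside <tbody>, or 0 if there is no <tbody>."""
    soup = BeautifulSoup(html, "html.parser")
    tbody = soup.find("tbody")
    if tbody is None:  # find() returns None when nothing matches
        return 0
    return sum(1 for tr in tbody.children if isinstance(tr, bs4.element.Tag))

# Hypothetical inputs: an empty page (what getHTMLText() returns on failure)
# and a tiny table that stands in for the real ranking page
empty_page = ""
sample_page = "<table><tbody><tr><td>1</td></tr><tr><td>2</td></tr></tbody></table>"

print(count_rows(empty_page))
print(count_rows(sample_page))
```

Checking for None before touching .children turns the crash into a case the program can handle.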

Find the cause of the problem

First, print the page content. The code is as follows:

from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.shanghairanking.cn/rankings/bcur/2020')
r.encoding = r.apparent_encoding
demo = r.text
soup = BeautifulSoup(demo,'html.parser')
print(soup.prettify())

Part of the output:

In the output, you can see that the tbody tag contains the information for all universities, each tr tag contains all the information for one university, and each td tag contains one piece of information about that university. The difference from Songtian's courseware is that the university name now sits in an a tag nested inside the td.
So the problem should be in the part that gets the university name.
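
To see why the name lookup is the weak spot, it helps to know that .string returns None when a tag has more than one child. The sketch below assumes the name cell looks roughly like the made-up fragment shown (a td holding whitespace plus an a tag); the real page may differ.

```python
from bs4 import BeautifulSoup

# Made-up fragment that mimics a name cell with a link inside it
html = "<table><tbody><tr><td>1</td><td>\n<a href='#'>清华大学</a>\n</td></tr></tbody></table>"
soup = BeautifulSoup(html, "html.parser")
tds = soup.find("tr")("td")  # calling a tag is shorthand for find_all()

rank_cell, name_cell = tds[0], tds[1]
print(rank_cell.string)            # one text child, so .string is that text
print(name_cell.string)            # newline, <a>, newline: several children, so None
print(name_cell.find("a").string)  # the text lives in the <a> tag
```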
Print the contents of ulist to check.

Change the statement:

ulist.append([tds[0].string, tds[1].string, tds[4].string])

To:

ulist.append([tds[0].string, tds[1], tds[4].string])

The function then becomes:

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            ulist.append([tds[0].string, tds[1], tds[4].string])
    print(ulist)

Output result:

The printed list contains the content we want, but also markup we don't want.
The content we want is inside the a tag, so we can use the .find() method to retrieve it.

The code is as follows:

# Placed inside the loop over tr in fillUnivList
for a in tr.find('a'):
    print(a)

The output is exactly the university name we want, so we store it in ulist in place of tds[1].
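
Note that iterating over a tag yields its children, so in the loop over tr.find('a') the loop variable is the text node inside the link, not the tag itself. The sketch below uses a made-up row to show the loop next to the more direct .string lookup. If a row has no a tag, find() returns None and the loop would raise a TypeError.

```python
from bs4 import BeautifulSoup

# Made-up row; the real page may differ
row_html = "<tr><td>1</td><td><a href='#'>清华大学</a></td></tr>"
tr = BeautifulSoup(row_html, "html.parser").find("tr")

link = tr.find("a")           # first <a> tag in the row
children = [c for c in link]  # iterating a tag walks its direct children
print(children)               # a list holding the single text node
print(link.string)            # the same text, without a loop
```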

The code is as follows:

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            # Iterating over the <a> tag yields its children, here the name text
            for a in tr.find('a'):
                ulist.append([tds[0].string, a, tds[4].string])

The output now contains exactly the content we want, but the layout is not neat. The reason is that the strings in ulist contain line breaks.
Removing the newline characters with the .replace() method fixes this.
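
Keep in mind that replace('\n', '') removes only newline characters; any spaces around the text stay. The sketch below shows the difference on a made-up string, with str.strip() as an alternative that trims all surrounding whitespace.

```python
# Made-up cell text with the kind of padding often seen in scraped HTML
raw = "\n        清华大学\n    "

no_newlines = raw.replace("\n", "")  # newlines gone, spaces kept
trimmed = raw.strip()                # leading and trailing whitespace removed

print(repr(no_newlines))
print(repr(trimmed))
```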

The code is as follows:

def printUnivList(ulist, num):
    print("{:^10}\t{:^6}\t{:^10}".format("排名","学校名称","总分"))
    for i in range(num):
        u=ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0].replace('\n',''), u[1].replace('\n',''), u[2].replace('\n','')))

The output now has the content and layout we want. Adjusting the format string passed to print() gives an even neater layout.
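
Chinese characters are wider than the ASCII space that format() pads with by default, which is one reason columns can still drift. A common workaround is to pad the name column with the full-width space chr(12288). The sketch below assumes the terminal font renders full-width characters at a uniform width; the function name format_row and the sample row are hypothetical.

```python
def format_row(rank, name, score):
    """Center each field; pad the name with a full-width space (U+3000)."""
    # {3} is substituted as the fill character for the name column
    template = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    return template.format(rank, name, score, chr(12288))

print(format_row("排名", "学校名称", "总分"))
print(format_row("1", "清华大学", "99.9"))  # made-up score, for illustration only
```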

The complete program is as follows:

import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except:
        return ""

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            tds = tr('td')
            # Iterating over the <a> tag yields its children, here the name text
            for a in tr.find('a'):
                ulist.append([tds[0].string, a, tds[4].string])
    
def printUnivList(ulist, num):
    print("     {:^10}\t{:^6}\t      {:^10}".format("排名","学校名称","总分"))
    for i in range(num):
        u=ulist[i]
        print("{:^10}\t{:^6}\t{:^10}".format(u[0].replace('\n',''), u[1].replace('\n',''), u[2].replace('\n','')))
    
def main():
    uinfo = []
    url = 'http://www.shanghairanking.cn/rankings/bcur/2020'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20) # 20 univs
main()


Origin blog.csdn.net/qq_51005828/article/details/109405304