Install a .Beautiful soup
II. Understand
III. Page traversal methods
Four encoding format .bs4
prettify function can be retracted to make the code by adding more intuitive.
V. Information mark
Format information in different languages:
Example:
re is the regular expression library
Seven examples: University Rankings crawling
1 import requests 2 from bs4 import BeautifulSoup 3 import bs4 4 5 def getHtmlText(url): 6 try: 7 r=requests.get(url,timeout=30) 8 r.raise_for_status() 9 r.encoding=r.apparent_encoding 10 return r.text 11 except: 12 return "" 13 14 def fillUnivList(ulist,html): 15 soup=BeautifulSoup(html,"html.parser") 16 for tr in soup.find('tbody').children:#通过对源代码分析,是从body标签里的tr标签提取内容, 17 if isinstance(tr,bs4.element.Tag): 18 tds=tr('td') 19 ulist.append([tds[0].string,tds[1].string,tds[4].string]) 20 21 def printUnivList(ulist,num): 22 print("{:^10}\t{:^16}\t{:^16}".format("排名","学校名称","总分")) 23 for i in range(num): 24 u=ulist[i] 25 print("{:^10}\t{:^16}\t{:^16}".format(u[0],u[1],u[2])) 26 27 def main(): 28 uinfo=[] 29 url='http://www.zuihaodaxue.cn/zuihaodaxuepaiming2018.html' 30 html=getHtmlText(url) 31 fillUnivList(uinfo,html) 32 printUnivList(uinfo,50) 33 34 main()
1 def printUnivList(ulist,num): 2 textout="{0:^10}\t{1:{3}^16}\t{2:^16}" 3 print(textout.format("排名","学校名称","总分",chr(12288))) 4 for i in range(num): 5 u=ulist[i] 6 print(textout.format(u[0],u[1],u[2],chr(12288)))
格式优化,如果不规定空格填充,默认使用英文空格,中西结合的格式会使得排列很乱
前后对比,这个中文排版问题会一直出现,之后记得就好。