This series of notes comes from the Python series of courses on Chinese University MOOC, taught by teacher Song Tian of Beijing Institute of Technology.
Reprinted from: http://www.jianshu.com/p/7b950b8a5966
4. Getting Started with Beautiful Soup Library
The Beautiful Soup library parses documents in HTML/XML format and extracts the relevant information from them.
- Installation: open a CMD window in administrator mode and enter:
pip install beautifulsoup4
Small test: pass an HTML string (or a page fetched with requests) to the `BeautifulSoup` constructor and print the parsed tree.
- Basic elements of the Beautiful Soup library
The Beautiful Soup library is a functional library for parsing, traversing, and maintaining a "tag tree". Two ways to reference it:
from bs4 import BeautifulSoup
import bs4
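A small test might look like the following, using a short inline HTML snippet as a stand-in for the course's demo page (the snippet itself is illustrative, not from the course):

```python
from bs4 import BeautifulSoup

# A short inline HTML snippet standing in for the course's demo page
demo_html = ("<html><head><title>This is a python demo page</title></head>"
             "<body><p class='title'><b>Demo</b></p></body></html>")

# Hand the HTML text to the constructor to "make the soup"
soup = BeautifulSoup(demo_html, "html.parser")
print(soup.title.string)   # text inside <title>
print(soup.p.b.string)     # text inside the nested <b> tag
```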
Four parsers supported by the Beautiful Soup library:
- `BeautifulSoup(mk, 'html.parser')`: bs4's own HTML parser (installed with bs4)
- `BeautifulSoup(mk, 'lxml')`: the lxml HTML parser (requires `pip install lxml`)
- `BeautifulSoup(mk, 'xml')`: the lxml XML parser (requires `pip install lxml`)
- `BeautifulSoup(mk, 'html5lib')`: the html5lib parser (requires `pip install html5lib`)
Basic elements of the BeautifulSoup class:
- Tag: the basic information unit. Any tag in the HTML syntax can be accessed as `soup.<tag>`; if the same tag appears more than once, the first occurrence is returned.
- Name: each tag has a name, obtained as `<tag>.name`; a string.
- Attributes: a tag's attributes, organized as a dictionary and obtained as `<tag>.attrs`.
- NavigableString: the non-attribute string inside a tag, obtained as `<tag>.string`.
- Comment: a special NavigableString subtype for comment text inside a tag.
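The basic elements above can be checked with a minimal sketch (the HTML snippet is made up for illustration):

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

html = '<b><!-- a comment --></b><a href="http://example.com" class="link">Example</a>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.a
print(tag.name)      # tag name: 'a'
print(tag.attrs)     # attribute dictionary, e.g. {'href': ..., 'class': [...]}
print(tag.string)    # NavigableString: 'Example'

# Comment text inside a tag comes back as the Comment subtype of NavigableString
print(isinstance(soup.b.string, Comment))
```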
- HTML content traversal method based on bs4 library
- Downward traversal: `.contents` (list of child nodes), `.children` (iterator over children), `.descendants` (iterator over all descendants)
- Upward traversal: `.parent` (the immediate parent), `.parents` (iterator over all ancestors)
- Parallel traversal: `.next_sibling`, `.previous_sibling`, and the iterators `.next_siblings`, `.previous_siblings` (siblings must share the same parent)
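The three traversal directions can be sketched on a tiny document (the snippet is made up for illustration):

```python
from bs4 import BeautifulSoup

html = "<html><body><p>one</p><p>two</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Downward: .children iterates over direct children
print([c.name for c in soup.body.children])

# Upward: .parent climbs one level
print(soup.p.parent.name)

# Parallel: .next_sibling moves along the same level
print(soup.p.next_sibling.string)
```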
- HTML format output based on bs4 library
The `prettify()` method adds a newline ("\n") after each tag and its content; it can be called both on the whole parsed document and on an individual tag.
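For example:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p><b>hello</b></p>", "html.parser")
pretty = soup.prettify()   # one line per tag and per piece of content
print(pretty)
print(soup.b.prettify())   # also callable on a single tag
```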
5. Information organization and extraction methods
- Three Forms of Information Markup and Comparison
XML (eXtensible Markup Language) is the earliest general-purpose information markup language; it is extensible but verbose. Tags consist of names and attributes, in the forms: `<name>...</name>`, `<name />`, and `<!-- -->` (comment).
JSON (JavaScript Object Notation) is well suited to program processing and is more concise than XML; its key-value pairs are typed, in the forms:
"key":"value"
"key":["value1","value2"]
"key":{"subkey":"subvalue"}
YAML (YAML Ain't Markup Language) has the highest proportion of text information and good readability; untyped key-value pairs in the form of:
key: value
key: # comment
 - value1
 - value2
key:
  subkey: subvalue
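The three JSON key-value forms listed above map directly onto Python objects via the standard-library json module (the keys below are made-up examples):

```python
import json

# One document combining the three JSON key-value forms
text = '{"key": "value", "key2": ["value1", "value2"], "key3": {"subkey": "subvalue"}}'
data = json.loads(text)
print(data["key"])             # plain value
print(data["key2"][1])         # element of a value list
print(data["key3"]["subkey"])  # nested key-value pair
```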
- General approach to information extraction
- Completely parse the marked-up form of the information, then extract the key information. This requires a markup parser; the advantage is accurate analysis, the disadvantage is that extraction is cumbersome and slow.
- Ignore the markup and search directly for the key information. The advantage is fast extraction; the disadvantage is that accuracy depends on the information content.
- Combine the two approaches: this requires both a markup parser and a text search function.
- HTML content search method based on bs4 library
`<>.find_all(name, attrs, recursive, string, **kwargs)` returns a list of the matching results.
- `name`: search string for tag names
- `attrs`: search string for tag attribute values; a specific attribute can be annotated
- `recursive`: whether to search all descendants; defaults to `True`
- `string`: search string for the string fields between tags
Shorthand: `<tag>(...)` is equivalent to `<tag>.find_all(...)`, and `soup(...)` to `soup.find_all(...)`.
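A few typical `find_all()` calls, on a made-up snippet:

```python
import re
from bs4 import BeautifulSoup

html = ('<a href="http://a.com">first</a>'
        '<a href="http://b.com" id="link2">second</a><p>text</p>')
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all('a')))                        # search by tag name
print(len(soup.find_all(id='link2')))                 # keyword args match attributes
print(len(soup.find_all(string=re.compile('sec'))))   # regex over string fields
print(len(soup.find_all(True)))                       # True matches every tag
```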
Seven extension methods that take the same parameters as find_all():
- `<>.find()`: search and return only the first match
- `<>.find_parents()` / `<>.find_parent()`: search ancestors; return a list / the first match
- `<>.find_next_siblings()` / `<>.find_next_sibling()`: search subsequent siblings; return a list / the first match
- `<>.find_previous_siblings()` / `<>.find_previous_sibling()`: search preceding siblings; return a list / the first match
6. Example 1: Chinese University Ranking Crawler
Step 1: get the university-ranking page content from the network: getHTMLText()
Step 2: extract the information in the page into a suitable data structure: fillUnivList()
Step 3: use the data structure to display and output the result: printUnivList()
```python
import requests
from bs4 import BeautifulSoup
import bs4

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()              # raise an exception for bad status codes
        r.encoding = r.apparent_encoding  # guess the encoding from the content
        return r.text
    except Exception:
        return "error"

def fillUnivList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):  # skip NavigableString children
            tds = tr('td')                   # shorthand for tr.find_all('td')
            ulist.append([tds[0].string, tds[1].string, tds[3].string])

def printUnivList(ulist, num):
    # chr(12288) is the full-width space, used as fill to align Chinese text
    tplt = "{0:^10}\t{1:{3}^10}\t{2:^10}"
    print(tplt.format("排名", "学校名称", "总分", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], chr(12288)))

def main():
    uinfo = []
    url = 'http://www.zuihaodaxue.cn/zuihaodaxuepaiming2016.html'
    html = getHTMLText(url)
    fillUnivList(uinfo, html)
    printUnivList(uinfo, 20)

main()
```