Infi-chu:
http://www.cnblogs.com/Infi-chu/
Beautiful Soup
Beautiful Soup is an HTML/XML parsing library for Python. It parses a page by relying on the page's structure and element attributes, so data can be extracted without writing complex regular expressions.
1. Parser
| Parser | Usage | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Python standard library | BeautifulSoup(markup, "html.parser") | Moderate speed, good fault tolerance | Poor fault tolerance in Python versions before 2.7.3 and 3.2.2 |
| lxml HTML parser | BeautifulSoup(markup, "lxml") | Very fast, good fault tolerance | Requires the lxml C library |
| lxml XML parser | BeautifulSoup(markup, "xml") | Very fast, the only parser that supports XML | Requires the lxml C library |
| html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance, parses documents the way a browser does, generates valid HTML5 | Slow; does not rely on external extensions (pure Python) |
To sum up, the lxml HTML parser is recommended:
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>Hello World</p>', 'lxml')
print(soup.p.string)  # Hello World
```
2. Basic usage:
```python
from bs4 import BeautifulSoup

# Note: the closing </body></html> tags are deliberately missing;
# prettify() outputs a corrected, indented version of the document
html = '''
<html>
<head><title>Infi-chu example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1">elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">lacie</a>,
<a href="http://example.com/tillie" class="sister" id="link3">tillie</a>,
last sentence</p>
'''

soup = BeautifulSoup(html, 'lxml')
print(soup.prettify())    # output the fixed, indented HTML
print(soup.title.string)  # output the string content of the title node
```
3. Node selector:
Selecting elements
Access an element directly as an attribute, e.g. soup.title or soup.p; this returns the first matching node.
(1) Get the name
Use soup.element.name to get the element name
(2) Get attributes
Use soup.element.attrs to get all attributes as a dictionary
Use soup.element.attrs['name'] (equivalently soup.element['name']) to get a single attribute
(3) Element content
Use soup.element.string to get the content
Nested selection
Attribute selections can be chained, e.g. soup.head.title.string returns the text of the title node inside head
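Putting the selectors above together, a minimal sketch using the example HTML from this section (the built-in html.parser is used here so the snippet runs even without lxml installed):

```python
from bs4 import BeautifulSoup

html = '''
<html>
<head><title>Infi-chu example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
</body>
</html>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.title.name)         # element name: title
print(soup.p.attrs)            # all attributes as a dict
print(soup.p.attrs['name'])    # a single attribute: dr
print(soup.p.string)           # text content: title example
print(soup.head.title.string)  # nested selection: Infi-chu example
```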
Association selection
(1) Child nodes and descendant nodes
```python
html = '''
<html>
<head><title>Infi-chu example</title></head>
<body>
<p class="title" name="dr"><b>title example</b></p>
<p class="story">link
<a href="http://example.com/elsie" class="sister" id="link1"><span>elsie</span></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><span>lacie</span></a>,
<a href="http://example.com/tillie" class="sister" id="link3"><span>tillie</span></a>,
last sentence</p>
'''
```
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

# Direct child nodes: the children property (a generator)
print(soup.p.children)
for i, child in enumerate(soup.p.children):
    print(i, child)

# All descendant nodes: the descendants property
print(soup.p.descendants)
for i, child in enumerate(soup.p.descendants):
    print(i, child)
```
(2) Parent node and ancestor node
Get the direct parent node with the parent property
Get all ancestor nodes with the parents property (a generator)
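A short sketch of parent and parents on a minimal hypothetical fragment (html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<html><body><p class="story"><a href="http://example.com/elsie" id="link1">elsie</a></p></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# parent: the direct parent node
print(soup.a.parent.name)  # p

# parents: a generator over all ancestors, nearest first
print([ancestor.name for ancestor in soup.a.parents])
```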
(3) Sibling node
next_sibling — the next sibling node
previous_sibling — the previous sibling node
next_siblings — all following siblings (a generator)
previous_siblings — all preceding siblings (a generator)
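A sketch of the four sibling properties on a minimal hypothetical fragment (html.parser; note there is no whitespace between the <a> tags, so every sibling here is a tag):

```python
from bs4 import BeautifulSoup

html = '<p><a id="link1">elsie</a><a id="link2">lacie</a><a id="link3">tillie</a></p>'
soup = BeautifulSoup(html, 'html.parser')
second = soup.find(id='link2')

print(second.next_sibling['id'])                    # link3
print(second.previous_sibling['id'])                # link1
print([s['id'] for s in second.next_siblings])      # ['link3']
print([s['id'] for s in second.previous_siblings])  # ['link1']
```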
(4) Extract information
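When iterating over siblings, the results are not always tags: text between elements comes back as text nodes, so it is safer to check the type before reading attributes or string. A hedged sketch on hypothetical markup:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = '<p>link <a id="link1">elsie</a>, <a id="link2">lacie</a></p>'
soup = BeautifulSoup(html, 'html.parser')

# next_siblings yields the text node ', ' as well as the second <a>,
# so filter on Tag before extracting attributes
ids = [s.attrs['id'] for s in soup.a.next_siblings if isinstance(s, Tag)]
print(ids)  # ['link2']
```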
4. Method selector:
find_all()
Queries all matching elements and returns them as a list.
find_all(name, attrs, recursive, text, **kwargs)
(1)name
```python
print(soup.find_all(name='ul'))

# find_all() can also be called on a result to search within it
for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='ul'))

for ul in soup.find_all(name='ul'):
    print(ul.find_all(name='li'))
    for li in ul.find_all(name='li'):
        print(li.string)
```
(2) attrs
```python
# Query by attribute
print(soup.find_all(attrs={'id': 'list1'}))
print(soup.find_all(attrs={'name': 'elements'}))

# Common attributes can also be passed as keyword arguments; because
# class is a Python keyword, the parameter is spelled class_
print(soup.find_all(id='list1'))
print(soup.find_all(class_='elements'))
```
(3)text
The text parameter can be used to match the text of the node. The incoming form can be a string or a regular expression object
```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.find_all(text=re.compile('link')))
```
find()
Works like find_all() but returns only the first matching element, or None if there is no match.
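A small sketch of the difference on hypothetical markup (html.parser is used so it runs without lxml):

```python
from bs4 import BeautifulSoup

html = '<ul id="list1"><li>first</li><li>second</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find() returns the first match only; find_all() returns every match
print(soup.find(name='li').string)    # first
print(len(soup.find_all(name='li')))  # 2
print(soup.find(name='table'))        # None when nothing matches
```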
[Note] There are also these variants, which work like find_all()/find() but search other parts of the tree:
find_parents() and find_parent()
find_next_siblings() and find_next_sibling()
find_previous_siblings() and find_previous_sibling()
find_all_next() and find_next()
find_all_previous() and find_previous()
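A brief sketch of a few of these variants on hypothetical markup (html.parser):

```python
from bs4 import BeautifulSoup

html = '<div class="story"><a id="link1">elsie</a><a id="link2">lacie</a></div>'
soup = BeautifulSoup(html, 'html.parser')
first = soup.find(id='link1')

print(first.find_parent('div')['class'])   # ['story']
print(first.find_next_sibling('a')['id'])  # link2
print(first.find_all_next('a'))            # every <a> after this node
```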
5. CSS selectors:
Use the select() method and pass in a CSS selector.
nested selection
```python
for ul in soup.select('ul'):
    print(ul.select('li'))
```
get attribute
```python
for ul in soup.select('ul'):
    print(ul['id'])  # equivalent to print(ul.attrs['id'])
```
get text
In addition to the string attribute, there is also a get_text() method to get text
```python
for li in soup.select('li'):
    # get_text() and string produce the same result here
    print(li.get_text())
    print(li.string)
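The two are only interchangeable for simple nodes like these. get_text() joins all descendant text, while string returns None when a tag has more than one child; a small sketch of the difference on hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '<p>link <b>bold</b> tail</p>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.p.get_text())  # link bold tail
print(soup.p.string)      # None (p has several children)
```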