Basic elements of the BeautifulSoup class
Basic element | Description
Tag | A tag, the basic unit for organizing information; <> marks its beginning and </> marks its end
Name | The name of a tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes | The attributes of a tag, organized as a dictionary. Format: <tag>.attrs
NavigableString | The non-attribute string inside a tag, that is, the text between <> and </>. Format: <tag>.string
Comment | The comment part of a string inside a tag; a special type, Comment
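The five elements in the table can be inspected directly on a parsed document. The following is a minimal sketch, assuming a made-up HTML snippet (the markup and the variable names are illustrative, not from the original notes) and the built-in "html.parser":

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical markup, invented for this illustration.
html = '<p class="title" id="intro">Hello</p><b><!--a comment--></b>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                      # Tag: the first <p> element
print(type(p) is Tag)
print(p.name)                   # Name: the tag's name
print(p.attrs)                  # Attributes: a dictionary
print(p.string)                 # NavigableString: the text inside the tag

comment = soup.b.string         # the only child of <b> is a comment
print(isinstance(comment, Comment))
```

Note that multi-valued attributes such as class come back as a list inside the attrs dictionary, while id stays a plain string.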
Main features and usage of the BeautifulSoup library
1. Create a BeautifulSoup object
import lxml
import requests
from bs4 import BeautifulSoup
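With the imports in place, the object itself is created by passing markup and a parser name to the BeautifulSoup constructor. A minimal sketch, assuming a hard-coded page source (the markup string is hypothetical; in practice it could be the .text of a requests response):

```python
from bs4 import BeautifulSoup

# Hypothetical page source, invented for this illustration. In real use
# this string could come from the .text attribute of a requests response.
markup = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"

# The second argument names the parser; "html.parser" needs no extra install.
soup = BeautifulSoup(markup, "html.parser")

print(soup.title.string)   # text of the <title> tag
print(soup.prettify())     # the parsed tree, re-indented
```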
2. Select a parser
Parser | Usage | Advantages | Disadvantages
Python standard library | BeautifulSoup(markup, "html.parser") | Built into Python; moderate speed; good tolerance of malformed documents | Poor fault tolerance in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser | BeautifulSoup(markup, "lxml") | Fast; good tolerance of malformed documents | Requires installing a C library
lxml XML parser | BeautifulSoup(markup, "xml") | Fast; the only parser that supports XML | Requires installing a C library
html5lib | BeautifulSoup(markup, "html5lib") | Best fault tolerance; parses documents the way a browser does; produces HTML5-format documents | Slow; pure Python, so it needs no external C library
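Because lxml and html5lib are optional installs, one possible pattern is to try the faster parser first and fall back to the built-in one. The sketch below assumes a hypothetical helper named make_soup (not part of the library); FeatureNotFound is the exception Beautiful Soup raises when the requested parser is not available:

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="lxml"):
    """Hypothetical helper: parse with `preferred`, else use html.parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; use the built-in parser.
        return BeautifulSoup(markup, "html.parser")


# Both parsers tolerate the missing closing </p> tag here.
soup = make_soup("<p>Unclosed paragraph")
print(soup.p.get_text())
```

Different parsers can build different trees from the same malformed markup, so code that depends on the exact tree shape is safer naming one parser explicitly.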
3. Traverse the document tree
.contents returns all direct child nodes of the current node, as a list
.children returns all direct child nodes of the current node, as an iterator
.descendants returns all descendant nodes of the current node, as a generator
.parent returns the parent of the current node, as a Tag
.parents returns all ancestors of the current node, as a generator
.next_sibling returns the next sibling of the current node
.previous_sibling returns the previous sibling of the current node
.next_element returns the node parsed immediately after the current node (a Tag or a string)
.previous_element returns the node parsed immediately before the current node
.next_siblings returns all siblings after the current node, as a generator
.previous_siblings returns all siblings before the current node, as a generator
.string returns the text content of the current tag
If the current Tag contains more than one child node, .string cannot determine which child's content to return, so the result is None
.strings returns all the strings inside the tag as a generator; iterate over it to get each one
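The traversal attributes listed above can be tried on a small tree. This is a minimal sketch, assuming a made-up HTML fragment with no whitespace between tags (whitespace would show up as extra string nodes); all variable names are illustrative:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for this illustration.
html = "<div><p>one</p><p>two</p><span>three</span></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div
first_p = div.p
span = div.span

children = div.contents                                # list of direct children
child_names = [c.name for c in div.children]           # same nodes, via iterator
descendant_count = len(list(div.descendants))          # tags and strings, all levels
parent_name = first_p.parent.name                      # single parent
ancestor_names = [a.name for a in first_p.parents]     # every ancestor, innermost first
next_name = first_p.next_sibling.name                  # one sibling forward
prev_text = span.previous_sibling.string               # one sibling back
later_names = [s.name for s in first_p.next_siblings]
earlier_texts = [s.string for s in span.previous_siblings]
next_elem = first_p.next_element                       # next parsed node: the string inside <p>
single = first_p.string                                # one child, so .string is unambiguous
ambiguous = div.string                                 # several children, so None
all_strings = list(div.strings)                        # iterate to collect every string

print(child_names, ancestor_names, all_strings)
```

The last ancestor yielded by .parents is the BeautifulSoup object itself, whose name is '[document]'.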
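4. Search the document tree
1 find_all(name, attrs, recursive, text, **kwargs)
name parameter
A Pass a string: matches tags with exactly that name
B Pass a regular expression: matches tags whose name matches the pattern
C Pass a list: matches tags whose name matches any item in the list
D Pass True: matches every tag
E Pass a function: matches tags for which the function returns True
keyword arguments: filter on tag attributes, for example id
text parameter: searches string content instead of tags (named string in Beautiful Soup 4.4.0 and later)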
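Each kind of filter accepted by find_all can be exercised on a small document. The following is a minimal sketch, assuming a made-up HTML fragment and an illustrative filter function named has_class; the string search uses the string keyword, which is the newer name for the text parameter in Beautiful Soup 4.4.0 and later:

```python
import re

from bs4 import BeautifulSoup

# Hypothetical fragment, invented for this illustration.
html = (
    '<body><a id="link1" class="ext" href="/a">A</a>'
    "<b>bold</b>"
    '<a id="link2" href="/b">B</a></body>'
)
soup = BeautifulSoup(html, "html.parser")


def has_class(tag):
    """Illustrative filter: keep only tags that carry a class attribute."""
    return tag.has_attr("class")


by_string = soup.find_all("a")                  # A: exact tag name
by_regex = soup.find_all(re.compile("^b"))      # B: names starting with "b"
by_list = soup.find_all(["a", "b"])             # C: any name in the list
by_true = soup.find_all(True)                   # D: every tag
by_func = soup.find_all(has_class)              # E: custom function
by_keyword = soup.find_all(id="link2")          # keyword argument on an attribute
by_class = soup.find_all("a", class_="ext")     # class_ avoids the reserved word
by_attrs = soup.find_all(attrs={"href": "/b"})  # attrs dictionary
by_text = soup.find_all(string="bold")          # string content, not tags
direct_only = soup.find_all("a", recursive=False)  # direct children of soup only

print([t.name for t in by_regex], by_text, direct_only)
```

With recursive=False only the direct children of the starting node are examined, so calling it on the top-level soup object finds no a tags here, while calling it on soup.body would find both.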
5. CSS selectors
select method
1 Find by tag name
2 Find by class name
3 Find by id
4 Find by a combination of the above
5 Find by attribute
select returns a list of the matching objects
To get all the text inside a tag, use get_text()
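The five lookup styles map onto ordinary CSS selector syntax. A minimal sketch, assuming a made-up HTML fragment (markup and variable names are illustrative only):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, invented for this illustration.
html = (
    '<div id="main"><p class="intro">Hello <b>world</b></p>'
    '<a class="ext" href="/x">link</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")                   # 1: tag name
by_class = soup.select(".intro")            # 2: class name
by_id = soup.select("#main")                # 3: id
combined = soup.select("#main p.intro b")   # 4: combination (descendant chain)
by_attr = soup.select('a[href="/x"]')       # 5: attribute value

# select always returns a list, so index into it before calling get_text().
text = by_tag[0].get_text()
print(text)
```

get_text() joins the strings of the tag and all of its descendants, which is why the text of the nested b tag is included.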