Some notes on BeautifulSoup, a parsing library for web crawlers

Basic elements of the BeautifulSoup class

 
Basic element: Explanation
Tag: The tag, the basic unit for organizing information; <> marks its beginning and </> marks its end
Name: The name of the tag; the name of <p>...</p> is 'p'. Format: <tag>.name
Attributes: The attributes of the tag, organized as a dictionary. Format: <tag>.attrs
NavigableString: The non-attribute string inside the tag, that is, the string between <> and </>. Format: <tag>.string
Comment: The comment part of the string inside a tag, a special type named Comment
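The five elements above can be seen on a tiny document. This is only a minimal sketch: the HTML snippet and the variable names are made up for illustration, and it uses the built-in "html.parser" so nothing extra needs to be installed.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

# Made-up sample markup: one <p> with attributes, one <b> holding only a comment
html = '<p class="title" id="intro">Hello</p><b><!--a note--></b>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p                   # Tag: the first <p> in the document
tag_name = p.name            # Name: 'p'
tag_attrs = p.attrs          # Attributes: a dict; class is multi-valued, so it is a list
tag_text = p.string          # NavigableString: the string between <p> and </p>
comment = soup.b.string      # Comment: a special kind of NavigableString

print(tag_name, tag_attrs, tag_text, type(comment).__name__)
```

Because Comment is a subclass of NavigableString, checking the type with isinstance is a simple way to tell comments apart from ordinary text.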
Main features and usage of the BeautifulSoup library
 
1. Create a BeautifulSoup object
 
import lxml
import requests
from bs4 import BeautifulSoup
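After the imports, the object itself is created by passing markup and a parser name to BeautifulSoup. A minimal sketch, assuming an inline HTML string instead of a downloaded page (html_doc and its contents are invented for illustration):

```python
from bs4 import BeautifulSoup

# Invented sample document; for a live page one would typically pass
# requests.get(url).text here instead, where url is whatever page is being crawled.
html_doc = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>First paragraph</p></body></html>"
)

# Build the parse tree with the parser from the standard library
soup = BeautifulSoup(html_doc, "html.parser")

title_text = soup.title.string   # text of the <title> tag
print(title_text)
print(soup.prettify())           # indented view of the parsed tree
```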
 
 
2. Select the parser
 
Python standard library
    Usage: BeautifulSoup(markup, "html.parser")
    Advantages: built into the Python standard library; moderate execution speed; good document fault tolerance
    Disadvantages: poor document fault tolerance in versions before Python 2.7.3 and Python 3.2.2
lxml HTML parser
    Usage: BeautifulSoup(markup, "lxml")
    Advantages: fast; good document fault tolerance
    Disadvantages: requires a C library to be installed
lxml XML parser
    Usage: BeautifulSoup(markup, "xml")
    Advantages: fast; the only parser that supports XML
    Disadvantages: requires a C library to be installed
html5lib
    Usage: BeautifulSoup(markup, "html5lib")
    Advantages: the best fault tolerance; parses the document the way a browser does; generates a document in HTML5 format
    Disadvantages: slow; a pure-Python package that does not depend on external C extensions
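Since lxml and html5lib are optional installs, one way to use the comparison above is to try the preferred parser first and fall back to the built-in one. This is only a sketch: make_soup and its preference order are hypothetical helpers, not part of BeautifulSoup; FeatureNotFound is the exception bs4 raises when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: return (soup, parser_name) for the first available parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one in the list
            continue
    raise RuntimeError("no usable parser found")

# Deliberately broken markup: the <p> tag is never closed
soup, used_parser = make_soup("<p>unclosed paragraph")
print(used_parser)
print(soup.p.get_text())
```

Each parser may repair broken markup differently (for example by adding html and body wrappers), so the resulting trees are not guaranteed to be identical across parsers.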
 
3. Traverse the document tree
.contents returns all child nodes of the current node; the return type is a list
.children returns all child nodes of the current node; the return type is an iterator
.descendants returns all descendant nodes of the current node; the return type is a generator
.parent returns the parent node of the current node; the return type is a Tag
.parents returns all ancestor nodes of the current node; the return type is a generator
.next_sibling returns the next sibling of the current node
.previous_sibling returns the previous sibling of the current node
.next_element returns the object parsed immediately after the current node (a Tag or a string)
.previous_element returns the object parsed immediately before the current node
.next_siblings returns all siblings after the current node; the return type is a generator
.previous_siblings returns all siblings before the current node; the return type is a generator
.string returns the text content of the current tag
If the current Tag contains more than one child node, .string cannot determine which child's content to return, so it returns None
.strings returns all the strings inside the tag; it is a generator, so it must be iterated over to get the contents
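The traversal attributes above are easier to compare side by side on one small tree. A minimal sketch, assuming a made-up div with three paragraphs and no whitespace between tags (whitespace would show up as extra string nodes among the siblings):

```python
from bs4 import BeautifulSoup

# Invented sample: a <div> with three <p> children and no whitespace between tags
html = "<div><p>one</p><p>two</p><p>three</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

first, second, third = div.contents                 # list of direct children
child_names = [c.name for c in div.children]        # iterating the children
descendant_count = len(list(div.descendants))       # tags and strings, all levels

parent_name = second.parent.name                    # the direct parent
ancestor_names = [a.name for a in second.parents]   # parent, then the document itself

next_sib = second.next_sibling                      # <p>three</p>
prev_sib = second.previous_sibling                  # <p>one</p>
next_el = second.next_element                       # the string inside <p>two</p>
prev_el = second.previous_element                   # the string inside <p>one</p>

later = [s.get_text() for s in first.next_siblings]        # siblings after the first <p>
earlier = [s.get_text() for s in third.previous_siblings]  # siblings before the last <p>, nearest first

single_string = second.string                       # one child string, so it is returned
ambiguous_string = div.string                       # several children, so None
all_strings = list(div.strings)                     # every string under <div>

print(child_names, ancestor_names, later, earlier, all_strings)
```

Note the difference between siblings and elements: .next_sibling stays on the same level of the tree, while .next_element follows parse order and therefore steps into the tag's own string first.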
 
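4. Search the document
    1 find_all(name, attrs, recursive, text, **kwargs)
 
The name parameter
            A Pass a string
            B Pass a regular expression
            C Pass a list
            D Pass True
            E Pass a function
 
Keyword parameters
            Filter on tag attributes, for example id or class_
 
The text parameter
            Search the string content of the document instead of the tags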
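Each kind of find_all filter listed above can be tried on one small document. This is a hedged sketch on invented markup; the has_id function is a hypothetical example filter. It uses string= for text searches because recent bs4 releases prefer that name over the older text= keyword, which means the same thing.

```python
import re
from bs4 import BeautifulSoup

# Invented sample markup
html = (
    '<body><a id="link1" class="sister" href="/a">Alice</a>'
    '<a id="link2" class="sister" href="/b">Bob</a>'
    "<b>Bold</b><p>Alice</p></body>"
)
soup = BeautifulSoup(html, "html.parser")

# name parameter
by_string = soup.find_all("a")              # A: exact tag name
by_regex = soup.find_all(re.compile("^b"))  # B: tag names starting with "b"
by_list = soup.find_all(["a", "b"])         # C: any tag name in the list
by_true = soup.find_all(True)               # D: every tag, but no strings

def has_id(tag):
    """Hypothetical filter: keep only tags that carry an id attribute."""
    return tag.has_attr("id")

by_func = soup.find_all(has_id)             # E: a function returning True or False

# keyword parameters and attrs
by_keyword = soup.find_all(id="link2")
by_class = soup.find_all("a", class_="sister")   # class is a Python keyword, hence class_
by_attrs = soup.find_all(attrs={"href": "/a"})

# text parameter (spelled string= in newer versions): returns strings, not tags
by_text = soup.find_all(string="Alice")

# recursive=False only looks at direct children of the object being searched
direct_links = soup.find_all("a", recursive=False)

print([t.name for t in by_regex], len(by_true), by_text, direct_links)
```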
            
 
5. CSS selectors
 
The select method
    1 Find by tag name
    2 Find by class name
    3 Find by id
    4 Combined lookup
    5 Find by attribute
select returns a list of the matching objects
To get all the text inside a tag, use get_text()
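The five lookups can be sketched with ordinary CSS selector syntax. The markup and variable names below are invented for illustration; select relies on the soupsieve package, which is normally installed together with bs4.

```python
from bs4 import BeautifulSoup

# Invented sample markup
html = (
    '<div id="main"><p class="intro">Hi <b>there</b></p>'
    '<a href="/x" class="nav">X</a></div>'
)
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")                    # 1 by tag name
by_class = soup.select(".intro")             # 2 by class name
by_id = soup.select("#main")                 # 3 by id
combined = soup.select("div#main p.intro b") # 4 combined lookup
by_attr = soup.select('a[href="/x"]')        # 5 by attribute

# select always returns a list, so index into it before calling get_text()
paragraph_text = by_tag[0].get_text()

print(paragraph_text, [t.name for t in combined], by_attr)
```

When only the first match is needed, select_one returns a single Tag (or None) instead of a list.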
 

Origin www.cnblogs.com/1328497946TS/p/11016489.html