Basic usage of the BeautifulSoup4 parsing library
A. Installation
pip install beautifulsoup4
Beautiful Soup relies on a parser when parsing. In addition to the HTML parser in the Python standard library, it supports third-party parsers such as lxml; lxml is recommended for its speed.
Install the parser: pip install lxml
B. Creating a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')
html: may be a str or a file handle (fp)
'lxml': the parser to use; requires the lxml package to be installed
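As a minimal sketch, assuming an invented sample HTML string (the standard-library 'html.parser' is used here so the snippet runs even without lxml installed; pass 'lxml' instead once it is available):

```python
# A minimal sketch: the HTML string below is an invented sample document.
from bs4 import BeautifulSoup

html = '<html><head><title>Demo</title></head><body><p>hi</p></body></html>'
# 'html.parser' is the standard-library parser; pass 'lxml' instead once
# the lxml package (recommended above) is installed.
soup = BeautifulSoup(html, 'html.parser')
print(soup.title.string)  # → Demo
```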
1. Node Selector
An element node can be selected by calling its tag name directly; selections can be nested, and the returned objects are of type bs4.element.Tag.
soup.head # get the head tag
soup.p.b # get the b node inside the p node
soup.p.string # get the text inside the p tag
When multiple identical sibling nodes exist, the node selector selects only the first one by default.
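A short sketch of node selection, using an invented HTML snippet for illustration:

```python
# The HTML snippet here is invented for illustration.
from bs4 import BeautifulSoup

html = '<div><p><b>one</b></p><p>two</p></div>'
soup = BeautifulSoup(html, 'html.parser')
print(type(soup.p))     # <class 'bs4.element.Tag'>
print(soup.p.b.string)  # nested selection → one
print(soup.p.string)    # only the FIRST p is selected → one
```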
Node attribute methods:
The name attribute gets the name of the node:
soup.div.name
The attrs attribute gets the node's attributes; a value may be a list or a string depending on the attribute (e.g. class yields a list, id a string)
soup.p.attrs # get all attributes of the p node
soup.p.attrs['class'] # get the class attribute of the p node
soup.p['class'] # get the class attribute directly, dictionary-style
soup.p.get('class')
The string attribute gets the text contained in the node:
soup.p.string # get the text content of the first p node
The contents attribute gets the node's direct children, returned as a list
soup.div.contents # direct children only; note bs4 also treats line breaks as nodes
The children attribute gets the direct children, returned as a generator
soup.div.children
The descendants attribute gets all descendant nodes, returned as a generator
soup.div.descendants
The parent attribute gets the parent node; parents gets all ancestor nodes, returned as a generator
soup.b.parent
soup.b.parents
The next_sibling attribute returns the next sibling node and previous_sibling the previous one. Note that line breaks also count as nodes, so the sibling obtained is often a whitespace string.
soup.a.next_sibling
soup.a.previous_sibling
The next_element and previous_element attributes get the next or previous object to be parsed
soup.a.next_element
soup.a.previous_element
The next_elements and previous_elements generators iterate forward or backward over the document's parsed content
soup.a.next_elements
soup.a.previous_elements
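The traversal attributes above can be sketched on an invented snippet; it is written with no whitespace between tags, so the siblings here are tags rather than newline strings:

```python
# Invented snippet with no whitespace between tags, so siblings are tags,
# not newline text nodes.
from bs4 import BeautifulSoup

html = '<ul><li>a</li><li><b>b</b></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first_li = soup.li
print(first_li.parent.name)       # ul
print(first_li.next_sibling)      # <li><b>b</b></li>
print(list(soup.ul.children))     # direct children only (two li tags)
print(list(soup.ul.descendants))  # every tag AND text node underneath
print(first_li.next_element)      # the next parsed object: the text 'a'
```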
2. Use find_all
find_all(name, attrs, recursive, text, **kwargs): finds all matching elements. Parameters:
name: the tag name to look up; it may also be a filter such as a regular expression, a list, or True
attrs: attributes passed in as a dictionary, e.g. attrs={'id': '123'}. Since class is a Python keyword, queries on class append an underscore: class_='element'. The result returned is a list of Tag objects
text: matches the text of nodes; it may be a string or a regular expression object
recursive: to search only direct children, set recursive=False
limit: restricts the number of results returned, similar to the LIMIT keyword in SQL
find_all(condition): query all matching elements
Find all elements with the tag name div
soup.find_all(name='div') # name is the tag name, not the name attribute
soup.find_all('div') # lookup by tag
Find all elements with the tag name li or a
soup.find_all(name=['li', 'a'])
Find all elements with id 'world'
soup.find_all(id='world')
Find all elements with class 'active'
soup.find_all(class_='active') # class is a Python keyword, so the underscore is needed
Find all a tags whose title attribute is 'hello'
soup.find_all('a', title='hello') # tag lookup plus attribute filtering
soup.find_all('a', title='hello', limit=2) # limit=2 returns only the first 2 matches
Find all a tags with id 'box' and class 'active'
soup.find_all('a', attrs={'id': 'box', 'class': 'active'}) # multiple attribute filters
Find all text nodes matching a regular expression
soup.find_all(text=re.compile('Tillie')) # filter text with a regex (requires import re)
Other methods:
find(name, attrs, recursive, text, **kwargs): returns a single element, i.e. the first match; the return type is still Tag
Its parameters are the same as find_all()
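The lookups above can be sketched together; the HTML snippet and attribute values are invented for illustration:

```python
# Invented snippet combining the find_all()/find() examples above.
import re
from bs4 import BeautifulSoup

html = ('<div id="box"><a class="active" title="hello">Tillie</a>'
        '<a title="hello">Lacie</a><li class="active">item</li></div>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a'))                          # lookup by tag name
print(soup.find_all(class_='active'))              # class needs the underscore
print(soup.find_all('a', title='hello', limit=1))  # attribute filter plus limit
print(soup.find_all(text=re.compile('Til')))       # text filter returns strings
print(soup.find('a')['title'])                     # find(): first match only → hello
```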
3. CSS selectors
Use CSS selector syntax to find elements
Find all a tags
soup.select('a')
Find all elements with class 'active'
soup.select('.active')
Find the element with id 'box'
soup.select('#box')
Find all li descendants of elements with class 'active'
soup.select('.active li')
Find all a elements that are direct children of li tags
soup.select('li > a')
Find elements by the presence of an attribute
soup.select('li[class]') # find all li tags that have a class attribute
Find elements by matching an attribute value
soup.select('li[class="active"]') # li tags whose class equals 'active'
soup.select('li[class^="act"]') # attribute value starts with 'act'
soup.select('li[class$="ve"]') # attribute value ends with 've'
soup.select('li[class*="tiv"]') # attribute value contains 'tiv' (substring match)
Get the text of a node
soup.select('a')[0].get_text()
soup.select('a')[0].string # works only when the node contains plain text; returns None when other tags are nested inside
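The select() calls above can be sketched on an invented snippet; the second link mixes text with a nested tag, so its .string is None while get_text() still works:

```python
# Invented snippet for the select() examples above.
from bs4 import BeautifulSoup

html = ('<ul id="box"><li class="active"><a href="#">first</a></li>'
        '<li><a href="#">no <b>second</b></a></li></ul>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('#box'))              # by id
print(soup.select('.active'))           # by class
print(soup.select('li > a'))            # direct a children of li
print(soup.select('li[class^="act"]'))  # attribute value starts with 'act'
print(soup.select('a')[0].get_text())   # → first
print(soup.select('a')[1].get_text())   # → no second
print(soup.select('a')[1].string)       # None: mixed text and a nested tag
```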