BeautifulSoup4 basic use

Basic usage of the BeautifulSoup4 parsing library

I. Installation

pip install beautifulsoup4

Beautiful Soup depends on a parser when parsing. Besides the HTML parser in the Python standard library, it also supports third-party parsers such as lxml; lxml is recommended because it is fast.

Install the parser: pip install lxml

II. Basic use

Creating a BeautifulSoup object

soup = BeautifulSoup(html, 'lxml')
# html: may be a str or a file handle fp
# 'lxml': the installed lxml parser
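A minimal sketch of both construction styles. The markup is made up for illustration, and the stdlib `html.parser` is used here so the example runs even where lxml is not installed; substitute `'lxml'` if you have it.

```python
from io import StringIO
from bs4 import BeautifulSoup

html = "<html><head><title>Demo</title></head><body><p>hi</p></body></html>"

# From a string
soup = BeautifulSoup(html, "html.parser")

# From a file handle (simulated here with StringIO; a real file object works the same)
fp = StringIO(html)
soup_from_fp = BeautifulSoup(fp, "html.parser")

print(soup.title.string)          # Demo
print(soup_from_fp.title.string)  # Demo
```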

 

1. Node Selector

You can select an element node by calling the node's name directly; such selection can be nested, and the returned objects are of type bs4.element.Tag.

soup.head      # get the head tag
soup.p.b       # get the b node inside the p node
soup.p.string  # get the text inside the p tag

When there are multiple sibling nodes with the same name, the node selector selects only the first one by default.
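A short demonstration of the node selector on some assumed sample markup, showing the Tag type, nested selection, and first-match-only behavior:

```python
from bs4 import BeautifulSoup

# Sample markup (assumed) with two sibling p nodes
html = "<div><p><b>one</b></p><p>two</p></div>"
soup = BeautifulSoup(html, "html.parser")

tag_type = type(soup.p).__name__  # 'Tag' (bs4.element.Tag)
nested = soup.p.b.string          # nested selection: the b inside the first p
first_text = soup.p.string        # only the first p is selected
print(tag_type, nested, first_text)
```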

Node attributes:

The name attribute gets the node's tag name:

soup.div.name

The attrs attribute gets a node's attributes; a returned value may be a string or a list, depending on the attribute type (multi-valued attributes such as class come back as lists).

soup.p.attrs           # get all attributes of the p node
soup.p.attrs['class']  # get the class attribute of the p node
soup.p['class']        # get the class attribute directly, dictionary-style
soup.p.get('class')
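The four access styles above, run against an assumed snippet; note how the multi-valued class attribute comes back as a list while id stays a string:

```python
from bs4 import BeautifulSoup

# Sample markup (assumed) with a multi-valued class and a plain id
html = '<p class="title story" id="first">text</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.p.attrs)           # {'class': ['title', 'story'], 'id': 'first'}
print(soup.p.attrs['class'])  # class is multi-valued -> a list
print(soup.p['id'])           # id is single-valued -> a string
print(soup.p.get('class'))    # same result as the dictionary-style access
```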

The string attribute gets the text contained in a node:

soup.p.string  # get the text content of the first p node

The contents attribute gets a node's direct children, returned as a list.

soup.div.contents  # direct children; bs4 also treats line breaks as nodes

The children attribute gets a node's direct children, returning a generator.

soup.div.children

The descendants attribute gets all descendant nodes, returning a generator.

soup.div.descendants

The parent attribute gets the parent node; parents gets all ancestor nodes, returning a generator.

soup.b.parent
soup.b.parents

The next_sibling attribute returns the next sibling node and previous_sibling the previous one. Note that line breaks in the markup also count as nodes, so the sibling obtained is often a newline string or blank text.

soup.a.next_sibling
soup.a.previous_sibling

The next_element and previous_element attributes get the next or previous object in parse order.

soup.a.next_element
soup.a.previous_element

next_elements and previous_elements are iterators that traverse the parsed content forwards or backwards.

soup.a.next_elements
soup.a.previous_elements
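The navigation attributes above, exercised on a small assumed document. The newlines in the markup become text nodes, which is why next_sibling is often '\n' and why contents/children include more items than just the tags:

```python
from bs4 import BeautifulSoup

# Sample markup (assumed); the literal newlines matter for the sibling results
html = """<div>
<a id="first">one</a>
<a id="second">two</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")

first = soup.a
print(first.parent.name)                # div
print(repr(first.next_sibling))         # '\n' -- the line break is a node
print(first.next_sibling.next_sibling)  # <a id="second">two</a>
print(len(list(soup.div.children)))     # 5: two a tags plus three newline nodes
```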

2. Using find_all

find_all(name, attrs, recursive, text, **kwargs): finds all matching elements, where the parameters are:

name matches the tag name; it can be a string, a filter function, a regular expression, a list, or True.

attrs passes attributes in dictionary form, e.g. attrs={'id': '123'}. Common attributes such as id can also be passed as keyword arguments; because class is a Python keyword, it must be written with a trailing underscore, class_='element'. The result returned is a list of Tag objects.

text matches the text of nodes; it may be passed as a string or as a regular expression object.

recursive: to search only direct children, set this parameter to False: recursive=False.

The limit parameter limits the number of results returned, similar to the LIMIT keyword in SQL.

find_all(condition): queries all elements matching the condition

Find all elements whose tag name is div:

soup.find_all(name='div')  # name is the tag name, not the name attribute
soup.find_all('div')       # lookup by tag name

Find all elements whose tag is li or a:

soup.find_all(name=['li', 'a'])

Find all elements with id 'world':

soup.find_all(id='world')

Find all elements whose class is 'active':

soup.find_all(class_='active')  # class is a Python keyword, so the underscore is needed

Find all a tags whose title attribute is 'hello':

soup.find_all('a', title='hello')           # tag plus attribute filter
soup.find_all('a', title='hello', limit=2)  # limit the output; take the first 2 matches

Find all a tags with id='box' and class='active':

soup.find_all('a', attrs={'id': 'box', 'class': 'active'})  # multiple attribute filters

Find all text nodes whose text matches a pattern:

import re
soup.find_all(text=re.compile('Tillie'))  # filter with a regular expression

Other methods:

find(name, attrs, recursive, text, **kwargs): returns a single element, i.e. the first match; the result is still of Tag type.

Its parameters are the same as those of find_all().
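The find_all and find forms above, run against an assumed snippet that contains the attributes used in the examples:

```python
import re
from bs4 import BeautifulSoup

# Sample markup (assumed) with the ids, classes, and titles used above
html = """<ul>
<li class="active"><a id="box" class="active" title="hello">Tillie</a></li>
<li><a title="hello">Lacie</a></li>
<li><a>Elsie</a></li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.find_all('li')))                         # 3 -- by tag name
print(len(soup.find_all(['li', 'a'])))                  # 6 -- by tag list
print(len(soup.find_all(class_='active')))              # 2 -- the li and the a
print(len(soup.find_all('a', title='hello', limit=1)))  # 1 -- limit caps the results
print(soup.find_all(text=re.compile('Til')))            # ['Tillie'] -- regex on text
print(soup.find('a')['id'])                             # 'box' -- find returns the first match
```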

 

3. CSS selectors

select() finds elements using CSS selector syntax.

Find all a tags:

soup.select('a')

Find all elements with class='active':

soup.select('.active')

Find the element with id='box':

soup.select('#box')

Find all li tags that are descendants of .active:

soup.select('.active li')

Find all a elements that are direct children of li tags:

soup.select('li > a')

Find elements by the presence of an attribute:

soup.select('li[class]')  # find all li tags that have a class attribute

Find elements by attribute value:

soup.select('li[class="active"]')  # find all li tags with class="active"
soup.select('li[class^="act"]')    # match values beginning with "act"
soup.select('li[class$="ve"]')     # match values ending with "ve"
soup.select('li[class*="tiv"]')    # fuzzy match: values containing "tiv"
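The selector forms above, run against an assumed snippet; all of them match the li whose class is "active":

```python
from bs4 import BeautifulSoup

# Sample markup (assumed): one .active li and one plain li inside an .active div
html = """<div class="active">
<li class="active"><a href="#">one</a></li>
<li><a href="#">two</a></li>
</div>"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select('a')))                 # 2 -- by tag
print(len(soup.select('.active')))           # 2 -- the div and the li
print(len(soup.select('li > a')))            # 2 -- direct children
print(len(soup.select('li[class]')))         # 1 -- attribute presence
print(len(soup.select('li[class^="act"]')))  # 1 -- prefix match
print(len(soup.select('li[class$="ve"]')))   # 1 -- suffix match
print(len(soup.select('li[class*="tiv"]')))  # 1 -- substring (fuzzy) match
```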

 

Getting a node's text:

soup.select('a')[0].get_text()
soup.select('a')[0].string  # works only when the node contains plain text; fails when other tags are nested
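A contrast of the two, on assumed markup: get_text() collects all nested text, while .string returns None as soon as the node contains another tag:

```python
from bs4 import BeautifulSoup

# Sample markup (assumed): a plain a tag and one with a nested b tag
html = '<a>plain</a><a>outer <b>inner</b></a>'
soup = BeautifulSoup(html, "html.parser")

print(soup.select('a')[0].get_text())  # 'plain'
print(soup.select('a')[0].string)      # 'plain' -- only text inside, so it works
print(soup.select('a')[1].get_text())  # 'outer inner' -- all nested text collected
print(soup.select('a')[1].string)      # None -- a b tag is nested, so .string fails
```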

 


Source: www.cnblogs.com/Deaseyy/p/11266742.html