Python Web Scraping: BeautifulSoup usage summary


BeautifulSoup is a third-party library for parsing HTML and XML files. An HTML or XML file can be described by the DOM model, which generally contains three kinds of nodes:

  • Element nodes: usually the HTML or XML tags
  • Text nodes: the text inside a tag
  • Attribute nodes: the attributes of each tag

The BeautifulSoup library parses an HTML or XML file, finds one or more tag elements, and gets the text and attributes of each tag.

A nice feature of BeautifulSoup is that it accepts either a str or a bytes object, automatically detects the encoding of the document, and converts it to Unicode. So you do not need to worry about garbled characters.
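
As a small illustration of the encoding handling described above, the sketch below feeds raw bytes to BeautifulSoup and reads back an ordinary str. The sample markup is made up for this example, and the document declares its own charset so that detection does not depend on any optional detection library being installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, encoded to bytes as if read from a file or socket.
raw = (
    '<html><head><meta charset="utf-8"></head>'
    "<body><p>café 你好</p></body></html>"
).encode("utf-8")

soup = BeautifulSoup(raw, "html.parser")

text = soup.p.string               # already decoded to Unicode
encoding = soup.original_encoding  # the encoding BeautifulSoup decided on

print(text)
print(encoding)
```

If the input is already a str, no detection takes place and original_encoding is None.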

Installation: pip install beautifulsoup4

Usage: from bs4 import BeautifulSoup

Recommended: pip install lxml. The lxml parser is faster than html.parser, the parser built into the Python standard library.
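
One possible way to prefer lxml while still working on machines where it is not installed is sketched below. The helper name make_soup is hypothetical and not part of bs4; the sketch relies on BeautifulSoup raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Hypothetical helper: parse with lxml if available, else html.parser."""
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed, so fall back to the standard-library parser.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
print(soup.p.string)
```

Note that different parsers can build slightly different trees from invalid markup, so it is usually best to pick one parser and use it consistently.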

 

 

Using the BeautifulSoup library:

There are three main ways to find elements:

  • Direct lookup by tag name: soup.title, soup.p. This only finds a single element (the first match).
  • The find and find_all methods: traverse the document and extract one or more elements based on tag name and attributes.
  • The select method: traverse the document and extract one or more elements based on a CSS selector.

# Direct access by tag name
soup.p  # get the p element object; only the first one is returned
soup.p.name  # get the tag name of the p element, i.e. 'p'
soup.p.string  # get the text inside the p element
soup.p['class']  # get the class attribute of the p element
soup.p.get('class')  # equivalent to the line above
soup.a['href']  # get the href attribute of the first a element
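
To make the direct-access calls above concrete, here is a minimal runnable sketch. The sample document is an assumption (a shortened version of the "three sisters" example from the Beautiful Soup documentation), not something given in this article.

```python
from bs4 import BeautifulSoup

# Assumed sample document, used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

first_p = soup.p               # the first <p> in the document
tag_name = first_p.name        # the tag name as a string
p_text = first_p.string        # the text inside the first <p>
p_classes = first_p["class"]   # class is multi-valued, so this is a list
first_href = soup.a["href"]    # href of the first <a>
missing = first_p.get("id")    # get() returns None for a missing attribute

print(tag_name, p_text, p_classes, first_href, missing)
```

The difference between the two attribute styles shows up when the attribute is absent: first_p["id"] raises KeyError, while first_p.get("id") returns None.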


# The find method is similar to find_all, but returns only the first matching element
# find_all(name, attrs, recursive, text, **kwargs)
#   name: the tag name to look for (a string, a compiled regular expression, a function, or True)
#   attrs: a dictionary of tag attributes to match
#   recursive: whether to search all descendants (True) or only direct children (False)
#   text: the text to look for (newer versions call this parameter string)
#   **kwargs: other keyword arguments, treated as attribute filters
# Because class is a Python keyword, write class_="value"; this is equivalent to attrs={"class": "value"}
import re  # needed for the regular-expression examples below

soup.find_all('p')  # returns a list of all p tags
soup.find_all('p', attrs={"class": "sister"})  # returns a list of all p tags whose class attribute is sister
soup.find_all('p', class_="sister")  # returns a list of all p tags whose class attribute is sister
soup.find_all(id='link2')  # returns all tags whose id attribute is link2
soup.find_all(re.compile("^b"))  # regular expression: find tags whose names begin with b
soup.find_all(href=re.compile("elsie"))  # regular expression: returns all tags whose href attribute contains elsie
soup.find_all(id="link1", href=re.compile('elsie'))  # tags with id link1 whose href contains elsie
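
The find and find_all calls above can be tried against a small document as follows. The sample markup is again an assumed, shortened version of the "three sisters" example from the Beautiful Soup documentation.

```python
import re

from bs4 import BeautifulSoup

# Assumed sample document, used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# class_ and attrs={"class": ...} select the same elements.
sisters = soup.find_all("a", class_="sister")
same_sisters = soup.find_all("a", attrs={"class": "sister"})

# A compiled regular expression as the name matches tag names.
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]

# A compiled regular expression as a keyword value matches that attribute.
elsie_ids = [tag["id"] for tag in soup.find_all(href=re.compile("elsie"))]

# find returns the first match, or None when nothing matches.
first_link = soup.find("a")
no_table = soup.find("table")

print(len(sisters), b_names, elsie_ids, first_link["id"], no_table)
```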


# select method - CSS selectors
# Note: select always returns a list, so add an index before getting the text
soup.select('p')  # find all p elements by tag name, same as soup.find_all('p')
soup.select('.sister')  # find tags with class="sister" by CSS class
soup.select('#link1')  # find the element with id="link1"
soup.select('p #link1')  # combined lookup: the element with id="link1" inside a p tag
soup.select("head > title")  # find title elements that are direct children of head
soup.select('a[class="sister"]')  # find all a tags whose class attribute is sister
soup.select('a[href="http://example.com/elsie"]')  # find a tags with the given href value
soup.select('p')[0].get_text()  # get the text of the first p element
soup.select('a[href*=".com/el"]')[0].attrs['href']  # get the href of the first a tag whose href contains ".com/el"
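
A short runnable sketch of the CSS selectors above, using the same assumed sample document as the Beautiful Soup documentation's "three sisters" example. It also shows the difference between 'p #link1' (a descendant of p) and 'p#link1' (a p that itself has that id).

```python
from bs4 import BeautifulSoup

# Assumed sample document, used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

by_class = soup.select(".sister")        # all elements with class "sister"
by_id = soup.select("#link1")            # still a list, even for an id
descendant = soup.select("p #link1")     # id="link1" somewhere inside a <p>
same_element = soup.select("p#link1")    # a <p> whose own id is "link1"
title_text = soup.select("head > title")[0].get_text()
partial_href = soup.select('a[href*=".com/el"]')[0].attrs["href"]

print(len(by_class), len(by_id), len(descendant), len(same_element))
print(title_text, partial_href)
```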

In addition to find() and find_all(), there are some other commonly used methods for searching parents, siblings, and the elements that come before or after a tag in the document:

find_parent()
find_parents()
find_next_sibling()
find_next_siblings()
find_previous_sibling()
find_previous_siblings()
find_next()
find_previous()
find_all_next()
find_all_previous()
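
As a rough illustration of a few of these methods, the sketch below starts from the second link of an assumed sample document (the shortened "three sisters" example from the Beautiful Soup documentation) and searches outward from it.

```python
from bs4 import BeautifulSoup

# Assumed sample document, used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")
link2 = soup.find(id="link2")

parent_classes = link2.find_parent("p")["class"]           # enclosing <p>
ancestors = [t.name for t in link2.find_parents(["p", "body"])]
prev_id = link2.find_previous_sibling("a")["id"]           # sibling before
next_id = link2.find_next_sibling("a")["id"]               # sibling after
later_ids = [a["id"] for a in link2.find_next_siblings("a")]

# find_previous is not limited to siblings: it looks at everything that
# appears earlier in the document.
earlier_b = link2.find_previous("b").string

print(parent_classes, ancestors, prev_id, next_id, later_ids, earlier_b)
```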

Note the following differences between the select method and the find methods:

  • The find method returns a single element, the find_all method returns a list of elements, and the select method always returns a list of elements. If you use select to find a single element, do not forget to add the list index [0] before calling the get_text() method to get the text.

  • The find methods also accept a function as the filter, which select does not support, for example:

    def has_class_no_id(tag):
        # True for tags that have a class attribute but no id attribute
        return tag.has_attr("class") and not tag.has_attr("id")
    soup.find_all(has_class_no_id)  # a function can be passed as the filter
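
The return-type differences and the function filter can be checked with a short sketch like the one below. The sample document is assumed (a shortened version of the "three sisters" example from the Beautiful Soup documentation). select_one is mentioned here only as a convenience that bs4 also provides; it is not discussed in the text above.

```python
from bs4 import BeautifulSoup

# Assumed sample document, used only for illustration.
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")


def has_class_no_id(tag):
    # True for tags that have a class attribute but no id attribute.
    return tag.has_attr("class") and not tag.has_attr("id")


# Only the two <p> tags have a class but no id in this document.
matched_names = [tag.name for tag in soup.find_all(has_class_no_id)]

# When nothing matches: find gives None, select gives an empty list.
find_miss = soup.find("table")
select_miss = soup.select("table")

# select_one returns the first match directly, so no [0] index is needed.
first_sister_text = soup.select_one("a.sister").get_text()

print(matched_names, find_miss, select_miss, first_sister_text)
```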

     


Origin www.cnblogs.com/ahMay/p/11995209.html