BeautifulSoup is a third-party library for parsing HTML and XML files. An HTML or XML file can be described by the DOM model, which generally contains three kinds of nodes:
- Element nodes - usually the HTML or XML tags
- Text nodes - the text inside a tag
- Attribute nodes - the attributes of each tag
The BeautifulSoup library parses an HTML or XML file, finds one or more tag elements, and retrieves the text and attributes of each tag.
A nice feature of BeautifulSoup is that it accepts either a str or a bytes object, automatically detects the encoding of the document, and converts it to Unicode. So you do not need to worry about garbled characters.
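As a minimal sketch of the encoding behavior described above: the sample below is a hypothetical GBK-encoded document (invented for illustration) that declares its encoding in a meta tag. Passing the raw bytes to BeautifulSoup should yield ordinary Unicode strings; the html.parser backend is used here only so that nothing beyond bs4 needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: a small HTML document stored as GBK-encoded bytes.
# The encoding is declared in the <meta> tag, which BeautifulSoup can use
# as a hint when decoding.
raw = (
    '<html><head><meta charset="gbk"><title>你好</title></head>'
    '<body><p>世界</p></body></html>'
).encode("gbk")

# Pass the bytes directly; no manual decode() call is needed.
soup = BeautifulSoup(raw, "html.parser")

title_text = soup.title.string      # already a Unicode str
body_text = soup.p.string
detected = soup.original_encoding   # the encoding BeautifulSoup settled on
```

When a document declares no encoding, detection relies on guessing, so checking soup.original_encoding is a reasonable sanity test if the output looks wrong.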
Installation: pip install beautifulsoup4
Usage: from bs4 import BeautifulSoup
Recommended: pip install lxml, a parser that is faster than html.parser, the parser built into Python's standard library
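One possible way to act on this recommendation is to prefer lxml when it is installed and fall back to html.parser otherwise. The helper name make_soup is hypothetical, not part of the library; the sketch relies on bs4 raising FeatureNotFound when the requested parser is unavailable.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup):
    """Build a soup with lxml if available, else with html.parser.

    make_soup is a hypothetical helper written for this illustration.
    """
    try:
        return BeautifulSoup(markup, "lxml")
    except FeatureNotFound:
        # lxml is not installed; use the parser bundled with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hello</p>")
text = soup.p.string
```

Note that different parsers can repair broken markup differently, so a fallback like this is safest on well-formed documents.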
Using the BeautifulSoup library:
There are three main ways to find elements:
- Look up directly by tag name: soup.title, soup.p. This only finds a single element (the first match).
- Use the find and find_all methods - traverse the document and extract one or more elements based on tag name and attributes
- Use the select method - traverse the document and extract one or more elements based on a CSS selector
import re  # needed for the regular-expression examples
# Assumes soup is a BeautifulSoup object, e.g. soup = BeautifulSoup(html, 'lxml')

# Direct access by tag name
soup.p               # get the p tag element object; only the first one
soup.p.name          # get the name of the p tag, i.e. 'p'
soup.p.string        # get the text inside the p tag
soup.p['class']      # get the class attribute of the p tag
soup.p.get('class')  # equivalent to the line above
soup.a['href']       # get the href attribute of the first a element

# find_all method; find is similar but extracts only the first matching element
# find_all(name, attrs, recursive, text, **kwargs)
# name: the tag name to find (string, regular expression, function, or True)
# attrs: tag attributes
# recursive: whether to search all descendants instead of only direct children
# text: text to search for (called string in newer versions)
# **kwargs: other keyword arguments, used as attribute filters
# class is a Python keyword, so write class_="value"; this is equivalent to attrs={"class": "value"}
soup.find_all('p')                                   # returns a list of all p tags
soup.find_all('p', attrs={"class": "sister"})        # returns a list of all p tags with class == sister
soup.find_all('p', class_="sister")                  # returns a list of all p tags with class == sister
soup.find_all(id='link2')                            # returns all tags with id == link2
soup.find_all(re.compile("^b"))                      # regex: tags whose name starts with b
soup.find_all(href=re.compile("elsie"))              # regex: all tags whose href contains elsie
soup.find_all(id="link1", href=re.compile('elsie'))  # tags with id == link1 whose href contains elsie

# select method - CSS selectors
# Note: select always returns a list, so add an index before getting the text
soup.select('p')                                    # find all p elements by tag name; same as soup.find_all('p')
soup.select('.sister')                              # find tags with class == sister
soup.select('#link1')                               # find the element with id == link1
soup.select('p#link1')                              # combined lookup: p elements with id == link1
soup.select("head > title")                         # find title elements that are direct children of head
soup.select('a[class="sister"]')                    # find all a tags whose class attribute is sister
soup.select('a[href="http://example.com/elsie"]')   # find a tags with exactly this href
soup.select('p')[0].get_text()                      # get the text of the first p element
soup.select('a[href*=".com/el"]')[0].attrs['href']  # get the href of the first a tag whose href contains .com/el
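The calls above can be tried end to end with a small sample document. The markup below is a hypothetical sample modeled on the "three sisters" example that the ids link1, link2 and the elsie URL suggest; it is an assumption, not taken from the original text. The sketch uses html.parser so that it runs without lxml.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample document (assumed for illustration).
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>.</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# 1. Direct access by tag name: only the first match is returned.
first_p_class = soup.p["class"]   # class is multi-valued, so this is a list

# 2. find_all: filter by tag name, attributes, or regular expressions.
sister_links = soup.find_all("a", class_="sister")
b_names = [t.name for t in soup.find_all(re.compile("^b"))]
elsie = soup.find_all(id="link1", href=re.compile("elsie"))

# 3. select: CSS selectors; the result is always list-like.
story_ids = [a["id"] for a in soup.select("p.story > a.sister")]
el_href = soup.select('a[href*="com/el"]')[0]["href"]
```

Because class can hold several values, soup.p["class"] gives a list rather than a single string, which is easy to overlook when comparing values.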
In addition to find() and find_all(), there are some commonly used methods for searching parent and sibling nodes, as well as the elements before or after a tag:
find_parent()
find_parents()
find_next_sibling()
find_next_siblings()
find_previous_sibling()
find_previous_siblings()
find_next()
find_previous()
find_all_next()
find_all_previous()
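A short sketch of a few of these navigation methods, using a hypothetical one-line document (written without whitespace between tags so that no stray text nodes get in the way):

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html_doc = (
    '<div id="box">'
    '<p id="a">one</p><p id="b">two</p><p id="c">three</p>'
    '</div>'
)
soup = BeautifulSoup(html_doc, "html.parser")

first = soup.find("p", id="a")
middle = soup.find("p", id="b")
last = soup.find("p", id="c")

# Parent lookup: the nearest enclosing div.
parent_id = middle.find_parent("div")["id"]

# Sibling lookups: singular forms return one tag, plural forms a list.
next_id = middle.find_next_sibling("p")["id"]
prev_id = middle.find_previous_sibling("p")["id"]
later_ids = [p["id"] for p in first.find_next_siblings("p")]

# Document-order lookups: not limited to siblings.
after_middle = middle.find_next("p")["id"]
before_last = [p["id"] for p in last.find_all_previous("p")]  # nearest first
```

The singular methods return None when nothing matches, so real code would usually check the result before indexing into it.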
Note the following differences between the select method and the find methods:
- The find method returns a single element, the find_all method returns a list of elements, and the select method always returns a list of elements. If you use select to find a single element, remember to add the list index [0] before calling get_text() to get the text.
- The find methods also accept a function as the filter argument, which makes them more flexible than select, as follows:
def has_class_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")

soup.find_all(has_class_no_id)  # a function can be passed as the filter
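Both differences can be checked on a small, hypothetical document (invented for illustration). The sketch also uses select_one, which returns the first match directly, as an alternative to indexing the select result with [0].

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, invented for this illustration.
html = (
    '<p class="a" id="x">both</p>'
    '<p class="b">class only</p>'
    '<p>neither</p>'
)
soup = BeautifulSoup(html, "html.parser")

# Return types: find gives one tag (or None), find_all and select give lists.
first = soup.find("p")
every = soup.find_all("p")
picked = soup.select("p")
first_text = picked[0].get_text()            # index before calling get_text()
same_text = soup.select_one("p").get_text()  # first match without indexing
missing = soup.find("table")                 # no match: None


# Function filter: called once per tag, keeps tags for which it returns True.
def has_class_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


by_function = [t.get_text() for t in soup.find_all(has_class_no_id)]

# This particular filter also has a CSS equivalent.
by_selector = [t.get_text() for t in soup.select("[class]:not([id])")]
```

This specific condition happens to be expressible as a selector too, but a function filter can run arbitrary Python logic on each tag, which a CSS selector cannot.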