Basic usage of BeautifulSoup4

  • Pretty-print the HTML:
    from bs4 import BeautifulSoup
    html = """<html>
     <head>
      <title>
       Page title
      </title>
     </head>
     <body>
      <p align="center" id="firstpara">
       This is paragraph
       <b>
        one
       </b>
      </p>
      <p align="blah" id="secondpara">
       This is paragraph
       <b>
        two
       </b>
      </p>
     </body>
    </html>"""
    soup = BeautifulSoup(html, 'lxml')
    print(soup.prettify())
  • Get the first tag with a given name: soup.<tag name>
    print(soup.head)
    Output (ignoring whitespace): <head><title>Page title</title></head>
  • Get the text content of the first matching tag: soup.title.string
  • Get all p tags:
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all('p'))
  • Find a tag by attribute:
    soup.find(id='firstpara')
  • Get all the text in the document, excluding tags:
    soup.get_text()
  • Replace the content of a tag with replace_with:
    soup = BeautifulSoup(html, 'lxml')
    tag = soup.title
    tag.string.replace_with('hello word hh')
  • Get the child nodes of a tag as a list:
    soup.head.contents
  • Get the parent node:
    soup.title.parent
  • Search with CSS selectors, where # matches an id and . matches a class:
    soup.select('#firstpara')
  • Select by attribute value:
    soup.select('p[id="secondpara"]')
  • For details, see: https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
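
The calls listed above can be put together into one short script. This is only a minimal sketch: it assumes Python 3 with the bs4 package installed, it writes the sample document in compact form, and it swaps in the built-in "html.parser" so that lxml is not needed. The variable names and the replacement title "New title" are made up for illustration.

```python
from bs4 import BeautifulSoup

# Same structure as the sample document, written without extra whitespace.
html = (
    '<html><head><title>Page title</title></head>'
    '<body>'
    '<p align="center" id="firstpara">This is paragraph <b>one</b></p>'
    '<p align="blah" id="secondpara">This is paragraph <b>two</b></p>'
    '</body></html>'
)

# Assumption: "html.parser" (standard library) instead of "lxml".
soup = BeautifulSoup(html, "html.parser")

# Navigation by tag name and text access.
head_markup = str(soup.head)             # markup of the first <head> tag
title_text = str(soup.title.string)      # text inside <title>

# Searching.
paragraphs = soup.find_all("p")          # every <p> tag, as a list
first_para = soup.find(id="firstpara")   # first tag whose id matches
all_text = soup.get_text()               # all text, tags stripped

# Moving around the tree.
head_children = soup.head.contents       # child nodes as a list
title_parent = soup.title.parent         # the enclosing <head> tag

# CSS selectors: "#" matches an id, [attr="value"] matches an attribute.
by_id = soup.select("#firstpara")
by_attr = soup.select('p[id="secondpara"]')

# Modification: replace the string inside <title>.
soup.title.string.replace_with("New title")

print(head_markup)        # <head><title>Page title</title></head>
print(title_text)         # Page title
print(len(paragraphs))    # 2
print(soup.title)         # <title>New title</title>
```

Note that head_markup and title_text are copied into plain strings before the title is changed; the Tag objects themselves stay attached to the tree, so they reflect later modifications.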

Origin blog.csdn.net/xxy_yang/article/details/92766424