BeautifulSoup Usage Explained

Note: the code in this article uses the page http://www.pythonscraping.com/pages/page3.html as its example.

1. Fetch the page's HTML and pass it to a BeautifulSoup object.

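import requests
from bs4 import BeautifulSoup

url = 'http://www.pythonscraping.com/pages/page3.html'
response = requests.get(url)
# Name the parser explicitly; leaving it out triggers a warning and the parser picked may differ between systems
soup = BeautifulSoup(response.text, 'html.parser')
# Print the parsed document with indentation
print(soup.prettify())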

<html>
 <head>
  <style>
   img{
	width:75px;
}
table{
	width:50%;
}
td{
	margin:10px;
	padding:10px;
}
.wrapper{
	width:800px;
}
.excitingNote{
	font-style:italic;
	font-weight:bold;
}
  </style>
 </head>
 <body>
  <div id="wrapper">
   <img src="../img/gifts/logo.jpg" style="float:left;"/>
   <h1>
    Totally Normal Gifts
   </h1>
   <div id="content">
    Here is a collection of totally normal, totally reasonable gifts that your friends are sure to love! Our collection is
hand-curated by well-paid, free-range Tibetan monks.
    <p>
     We haven't figured out how to make online shopping carts yet, but you can send us a check to:
     <br/>
     123 Main St.
     <br/>
     Abuja, Nigeria
We will then send your totally amazing gift, pronto! Please include an extra $5.00 for gift wrapping.
    </p>
   </div>
   <table id="giftList">
    <tr>
     <th>
      Item Title
     </th>
     <th>
      Description
     </th>
     <th>
      Cost
     </th>
     <th>
      Image
     </th>
    </tr>
    <tr class="gift" id="gift1">
     <td>
      Vegetable Basket
     </td>
     <td>
      This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
      <span class="excitingNote">
       Now with super-colorful bell peppers!
      </span>
     </td>
     <td>
      $15.00
     </td>
     <td>
      <img src="../img/gifts/img1.jpg"/>
     </td>
    </tr>
    <tr class="gift" id="gift2">
     <td>
      Russian Nesting Dolls
     </td>
     <td>
      Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"!
      <span class="excitingNote">
       8 entire dolls per set! Octuple the presents!
      </span>
     </td>
     <td>
      $10,000.52
     </td>
     <td>
      <img src="../img/gifts/img2.jpg"/>
     </td>
    </tr>
    <tr class="gift" id="gift3">
     <td>
      Fish Painting
     </td>
     <td>
      If something seems fishy about this painting, it's because it's a fish!
      <span class="excitingNote">
       Also hand-painted by trained monkeys!
      </span>
     </td>
     <td>
      $10,005.00
     </td>
     <td>
      <img src="../img/gifts/img3.jpg"/>
     </td>
    </tr>
    <tr class="gift" id="gift4">
     <td>
      Dead Parrot
     </td>
     <td>
      This is an ex-parrot!
      <span class="excitingNote">
       Or maybe he's only resting?
      </span>
     </td>
     <td>
      $0.50
     </td>
     <td>
      <img src="../img/gifts/img4.jpg"/>
     </td>
    </tr>
    <tr class="gift" id="gift5">
     <td>
      Mystery Box
     </td>
     <td>
      If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining.
      <span class="excitingNote">
       Keep your friends guessing!
      </span>
     </td>
     <td>
      $1.50
     </td>
     <td>
      <img src="../img/gifts/img6.jpg"/>
     </td>
    </tr>
   </table>
   <div id="footer">
    © Totally Normal Gifts, Inc.
    <br/>
    +234 (617) 863-0736
   </div>
  </div>
 </body>
</html>

2. Extract a specific tag, such as h1

print(soup.h1)
<h1>Totally Normal Gifts</h1>

.get_text() strips all tags and returns a string containing only the text:

print(soup.h1.get_text())
Totally Normal Gifts
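As a small illustration of get_text() on a tag that has nested tags, here is a sketch that runs on an inline snippet rather than the live page. The snippet is a shortened stand-in for one table cell of the example page, and the separator and strip arguments are optional extras that the walkthrough above does not use.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for one cell of the example page (an assumption for illustration).
html = (
    "<td>This is an ex-parrot! "
    "<span class='excitingNote'>Or maybe he's only resting?</span></td>"
)
soup = BeautifulSoup(html, 'html.parser')

td = soup.find('td')

# Default call: the text of the tag and all nested tags, concatenated.
plain = td.get_text()

# Optional arguments: strip each piece of text and join the pieces with a separator.
joined = td.get_text(separator=' | ', strip=True)

print(plain)
print(joined)
```

The second form can be handy when a cell mixes plain text and a nested span and you want to see where one ends and the other begins.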

3. find() and findAll()

BeautifulSoup defines the two as:

findAll(tag,attributes,recursive,text,limit,keywords)

find(tag,attributes,recursive,text,keywords)

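---- Return a list of all h1 heading tags (add more names to the set, such as {'h1', 'h2'}, to match several heading levels at once):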

print(soup.findAll({'h1'}))
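To show the first argument accepting more than one tag name, here is a minimal sketch on an inline snippet. The snippet is an assumed, cut-down version of the example page, and a list is used in place of a set only to keep the example easy to read.

```python
from bs4 import BeautifulSoup

# Cut-down stand-in for the example page (an assumption for illustration).
html = (
    '<h1>Totally Normal Gifts</h1>'
    '<table><tr><th>Item Title</th><th>Cost</th></tr></table>'
)
soup = BeautifulSoup(html, 'html.parser')

# Passing several names matches any of them; results come back in document order.
headings = soup.findAll(['h1', 'th'])
names = [tag.name for tag in headings]
print(names)
```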

---- attributes is a Python dictionary of attribute names and their values for a tag. For example, the following returns the tr tags whose class attribute is gift:

print(soup.findAll('tr',{'class':'gift'}))
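The attributes filter can be tried without fetching the page. The sketch below uses an assumed, shortened copy of the gift table and pulls the id and first cell out of each matching row.

```python
from bs4 import BeautifulSoup

# Shortened copy of the gift table (an assumption for illustration).
html = (
    '<table id="giftList">'
    '<tr><th>Item Title</th></tr>'
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')

# Only rows whose class attribute is "gift" match; the header row is skipped.
gift_rows = soup.findAll('tr', {'class': 'gift'})

gift_ids = [row['id'] for row in gift_rows]          # attribute access by key
titles = [row.td.get_text() for row in gift_rows]    # text of each row's first cell
print(gift_ids)
print(titles)
```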

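---- recursive is a Boolean. If True (the default), findAll searches all descendant tags; if False, it only looks at the top-level tags of the document (the direct children of the object it is called on).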
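The difference is easiest to see on a small nested snippet. This sketch assumes a simplified version of the page's wrapper div and calls findAll on that div rather than on the whole document.

```python
from bs4 import BeautifulSoup

# Simplified version of the page's wrapper div (an assumption for illustration).
html = (
    '<div id="wrapper">'
    '<h1>Totally Normal Gifts</h1>'
    '<div id="content"><p>Send us a check.</p></div>'
    '</div>'
)
soup = BeautifulSoup(html, 'html.parser')
wrapper = soup.find('div', {'id': 'wrapper'})

# Default (recursive=True): the p nested inside #content is found.
all_p = wrapper.findAll('p')

# recursive=False: only direct children of wrapper are checked, so no p matches.
direct_p = wrapper.findAll('p', recursive=False)

# The inner div is a direct child, so it still matches.
direct_div = wrapper.findAll('div', recursive=False)

print(len(all_p), len(direct_p), len(direct_div))
```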

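---- text matches against the text content of tags. To find out how many times the exact text 'Totally Normal Gifts' appears on the page as a tag's content, do this (the result is a list of the matching strings, not the tags themselves):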

name = soup.findAll(text='Totally Normal Gifts')
print(name)
print(len(name))
['Totally Normal Gifts']
1
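Text matching is exact when given a plain string, so a compiled regular expression is one way to match part of a string. The sketch below runs on an assumed, shortened set of cells from the example page. It uses string=, which Beautiful Soup 4.4.0 and later accept as the newer name for the text filter; on older versions text= plays the same role.

```python
import re

from bs4 import BeautifulSoup

# Shortened cells from the example page (an assumption for illustration).
html = (
    '<td>Hand-painted by trained monkeys, these exquisite dolls are priceless!</td>'
    '<td><span class="excitingNote">Also hand-painted by trained monkeys!</span></td>'
    '<td>This is an ex-parrot!</td>'
)
soup = BeautifulSoup(html, 'html.parser')

# A plain string must equal the whole text node.
exact = soup.findAll(string='This is an ex-parrot!')

# A compiled pattern matches any text node that contains it.
partial = soup.findAll(string=re.compile('monkeys'))

# Each result is a string; .parent gives the tag that holds it.
holders = [s.parent.name for s in partial]
print(len(exact), len(partial), holders)
```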

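---- limit=x means you only want the first x matches on the page. With limit=1, findAll runs the same search as find, but it still returns a list with one item instead of the tag itself.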
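A quick sketch of limit on an assumed, shortened gift table, including the difference in return type between findAll with limit=1 and find:

```python
from bs4 import BeautifulSoup

# Shortened gift table (an assumption for illustration).
html = (
    '<table>'
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>'
    '<tr class="gift" id="gift3"><td>Fish Painting</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')

# Stop after the first two matches.
first_two = soup.findAll('tr', {'class': 'gift'}, limit=2)
first_two_ids = [row['id'] for row in first_two]

# limit=1 still returns a list; find returns the tag itself (or None if nothing matches).
one_item_list = soup.findAll('tr', {'class': 'gift'}, limit=1)
single_tag = soup.find('tr', {'class': 'gift'})

print(first_two_ids, len(one_item_list), single_tag['id'])
```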

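---- The keywords parameter selects tags by attribute, which the attributes dictionary can already do, so it is redundant and not covered in detail here.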
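For completeness, this short sketch shows why the keyword form overlaps with the attributes dictionary. It runs on an assumed, shortened gift table. Note that class is a reserved word in Python, so Beautiful Soup spells that keyword class_.

```python
from bs4 import BeautifulSoup

# Shortened gift table (an assumption for illustration).
html = (
    '<table>'
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')

# Keyword form and dictionary form select the same tag.
by_keyword = soup.findAll(id='gift1')
by_dict = soup.findAll(attrs={'id': 'gift1'})

# class is reserved in Python, hence the trailing underscore.
by_class = soup.findAll(class_='gift')

print(by_keyword[0]['id'], by_dict[0]['id'], len(by_class))
```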

4. Working with children and other descendants (.children)

Get the child tags of the first tr tag:

name = soup.find('tr').children
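.children is an iterator over direct children only, while .descendants walks everything below the tag. The sketch below compares the two on an assumed, compact one-row table. On the live page, whitespace between tags may also show up as string children, which this compact snippet avoids.

```python
from bs4 import BeautifulSoup

# Compact one-row table with no whitespace between tags (an assumption for illustration).
html = (
    '<table id="giftList">'
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td><td>$15.00</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')
row = soup.find('tr')

# Direct children: the two td tags.
child_names = [child.name for child in row.children]

# Descendants: each td plus the text node inside it.
descendants = list(row.descendants)

print(child_names)
print(len(descendants))
```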

5. Working with siblings (.next_siblings) (.previous_siblings)
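A minimal sketch of both sibling iterators, run on an assumed, shortened gift table. The isinstance check skips any whitespace strings that sit between tags, which can appear when parsing the live page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened gift table (an assumption for illustration).
html = (
    '<table id="giftList">'
    '<tr><th>Item Title</th></tr>'
    '<tr class="gift" id="gift1"><td>Vegetable Basket</td></tr>'
    '<tr class="gift" id="gift2"><td>Russian Nesting Dolls</td></tr>'
    '</table>'
)
soup = BeautifulSoup(html, 'html.parser')

# next_siblings: everything after the header row, excluding the row itself.
header = soup.find('tr')
following_ids = [sib['id'] for sib in header.next_siblings if isinstance(sib, Tag)]

# previous_siblings: walks backwards from the last row towards the first.
last_row = soup.find('tr', {'id': 'gift2'})
preceding_ids = [sib.get('id') for sib in last_row.previous_siblings if isinstance(sib, Tag)]

print(following_ids)
print(preceding_ids)
```

The header row has no id attribute, so .get('id') yields None for it in the second list.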

6. Working with parents (.parents)
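As a rough sketch of moving upwards in the tree, the snippet below starts from an img tag and looks at .parent (the immediate parent) and .parents (every ancestor in turn). The HTML is an assumed, shortened slice of the example page.

```python
from bs4 import BeautifulSoup

# Shortened slice of the example page (an assumption for illustration).
html = (
    '<div id="wrapper"><table id="giftList">'
    '<tr class="gift" id="gift1"><td><img src="../img/gifts/img1.jpg"/></td></tr>'
    '</table></div>'
)
soup = BeautifulSoup(html, 'html.parser')
img = soup.find('img', {'src': '../img/gifts/img1.jpg'})

# .parent is the tag that directly contains img.
direct_parent = img.parent.name

# .parents walks upwards through every ancestor, ending at the BeautifulSoup object.
ancestor_names = [ancestor.name for ancestor in img.parents]

# The first ancestor carrying an id attribute of gift1 is the row this image belongs to.
row = next(a for a in img.parents if a.get('id') == 'gift1')

print(direct_parent)
print(ancestor_names[:4])
print(row.name)
```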


Reposted from blog.csdn.net/why_cant_i_change/article/details/83685618