I Want to Write Crawlers (8): The Beautiful Soup Parsing Library

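Beautiful Soup (bs) does not rely on regular expressions. It locates content through the structure of the page and the attributes of its nodes.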

html = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

from bs4 import BeautifulSoup
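soup = BeautifulSoup(html, 'lxml')  # lxml completes and repairs the malformed markup
print(soup.prettify())  # pretty-print with automatic indentation
print(soup.title)  # select a node by tag name; only the FIRST matching node is returned
print(soup.title.string)  # the text of that node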

The output is:

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title" name="dromouse">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    <!-- Elsie -->
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
<title>The Dormouse's story</title>
The Dormouse's story
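Since attribute access such as soup.title only ever returns the first matching node, a small sketch may help show the difference from fetching every match. The shortened HTML below is a made-up sample, and the built-in html.parser is used instead of lxml only so that the snippet runs without an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample (not the full document used above).
html = """
<p class="story">
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_link = soup.a              # attribute access: first <a> only
all_links = soup.find_all("a")   # every <a>, in document order
missing = soup.table             # a tag that does not exist gives None

print(first_link["id"])                 # link1
print([a["id"] for a in all_links])     # ['link1', 'link2']
print(missing)                          # None
```

Because a missing tag gives None, chaining such as soup.table.string raises AttributeError, so it is worth checking for None first.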

Besides soup.title.string, a node exposes the following:
string: the text of the node
name: the tag name
attrs: the attributes, as a dict

print(soup.p)
print(soup.p.string)
print(soup.p.attrs)
print(soup.p.attrs['name'])
print(soup.p['class'])

<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
The Dormouse's story
{'name': 'dromouse', 'class': ['title']}
dromouse
['title']
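A few edge cases of string and attrs are easy to trip over, so here is a rough sketch of them. The HTML is a hypothetical variation of the sample (the second class value "big" is invented for illustration), and html.parser is assumed in place of lxml so nothing extra needs installing.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Hypothetical variation of the sample; "big" is an invented class value.
html = """
<p class="title big" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time <a id="link1"><!-- Elsie --></a> and <a id="link2">Lacie</a></p>
"""
soup = BeautifulSoup(html, "html.parser")

title_p = soup.p                   # first <p>
story_p = soup.find_all("p")[1]    # second <p>

print(title_p.name)        # p
print(title_p["class"])    # ['title', 'big'], class is multi-valued so it is a list
print(title_p["name"])     # dromouse, other attributes are plain strings
print(title_p.get("id"))   # None, .get() avoids a KeyError for a missing attribute

# .string only works when a node has exactly one child (possibly nested).
print(story_p.string)      # None, this <p> has several children
print("Lacie" in story_p.get_text())         # True, get_text() joins all text

# A comment inside a tag is returned by .string as a Comment object.
print(isinstance(soup.a.string, Comment))    # True
```

This is why the first link in the sample document, whose body is only a comment, needs an isinstance check before its string is treated as ordinary text.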

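children: direct child nodes
descendants: all descendant nodes
parent: the direct parent node
parents: all ancestor nodes
previous_sibling: the previous sibling node
next_sibling: the next sibling node
previous_siblings: all preceding sibling nodes
next_siblings: all following sibling nodes
Of these, parent, previous_sibling and next_sibling each return a single node (there is at most one); the others return iterators or generators, so wrap them in list() to see their contents.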
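The navigation attributes listed above can be tried on a tiny tree. This is only a sketch: the HTML is a made-up one-liner with no whitespace between tags, and html.parser is assumed rather than lxml so the snippet has no extra dependency.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical compact sample: no whitespace between tags, so every
# sibling here is a tag rather than a text node.
html = ('<div id="box"><a id="link1">Elsie</a><a id="link2">Lacie</a>'
        '<a id="link3"><b>Tillie</b></a></div>')
soup = BeautifulSoup(html, "html.parser")
middle = soup.find(id="link2")

# Single-node attributes
print(middle.parent["id"])             # box
print(middle.previous_sibling["id"])   # link1
print(middle.next_sibling["id"])       # link3

# Iterator/generator attributes, materialised with list comprehensions
child_ids = [c["id"] for c in soup.div.children]
descendant_tags = [d.name for d in soup.div.descendants if isinstance(d, Tag)]
later_ids = [s["id"] for s in soup.find(id="link1").next_siblings]
ancestor_names = [p.name for p in middle.parents]

print(child_ids)         # ['link1', 'link2', 'link3']
print(descendant_tags)   # ['a', 'a', 'a', 'b']
print(later_ids)         # ['link2', 'link3']
print(ancestor_names)    # ['div', '[document]']
```

In the sample document at the top of this post the tags are separated by text and newlines, so there next_sibling of a link is a text node such as ",\n" rather than the next a tag.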

print(list(enumerate(soup.p.parents)))

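Here the generator returned by parents is wrapped in enumerate and then converted to a list for printing.
Element 0 is the direct parent, body. Element 1 is body's parent, the html tag. Element 2 is the BeautifulSoup object for the whole document, which prints the same markup as html, so the html content appears twice.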

[(0, <body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body>), (1, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>), (2, <html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1"><!-- Elsie --></a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
</body></html>)]
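Printing whole ancestors is noisy, so a shorter way to see the same chain is to print only each ancestor's name. A minimal sketch, assuming a trimmed copy of the sample HTML and html.parser in place of lxml:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample document; closing tags are left out as in the original.
html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
"""
soup = BeautifulSoup(html, "html.parser")

ancestors = list(soup.p.parents)
ancestor_names = [node.name for node in ancestors]

print(ancestor_names)          # ['body', 'html', '[document]']
print(ancestors[-1] is soup)   # True, the last ancestor is the soup object itself
```

The name '[document]' belongs to the BeautifulSoup object, which explains why the last entry in the long output above repeats the html markup.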

find_all returns every match, as a list.

To match on the class attribute, add a trailing underscore (class_), because class is a reserved keyword in Python.
print(soup.find_all(class_='sister'))
Match by tag name:
print(soup.find_all(name='p'))
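An HTML attribute that is itself called name cannot be passed as a keyword argument, because name= is taken to mean the tag name. Pass it through attrs so that it is treated as an attribute: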
print(soup.find_all(attrs={'name':'dromouse'}))
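The three find_all calls above can be run side by side, together with two close relatives (the limit argument and find). This is a sketch on a cleaned-up copy of the sample, where the link text replaces the comment and html.parser stands in for lxml; those changes are assumptions made only to keep the snippet self-contained.

```python
from bs4 import BeautifulSoup

# Cleaned-up copy of the sample document (assumed for illustration).
html = """
<p class="title" name="dromouse"><b>The Dormouse's story</b></p>
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.find_all(class_="sister")                  # match the class attribute
by_tag = soup.find_all(name="p")                           # match the tag name
by_name_attr = soup.find_all(attrs={"name": "dromouse"})   # match an attribute called "name"
first_two = soup.find_all("a", limit=2)                    # stop after two matches
first_only = soup.find("a")                                # first match only, or None

print([a["id"] for a in by_class])       # ['link1', 'link2', 'link3']
print(len(by_tag))                       # 2
print(by_name_attr[0]["class"])          # ['title']
print([a.string for a in first_two])     # ['Elsie', 'Lacie']
print(first_only["href"])                # http://example.com/elsie
```

find_all always returns a list (empty when nothing matches), while find returns a single node or None.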
