Beginner's Web Scraping Notes 5: Basic Elements of the Beautiful Soup Library

Basic elements of the Beautiful Soup library

Beautiful Soup is a library for parsing, traversing, and maintaining a tag tree.

<p class="title">..</p>: a tag (Tag)
p is the tag's name (Name)
class="title" is an attribute; attributes are made up of key-value pairs
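
To make the three terms concrete, here is a minimal sketch that parses one tag and reads its name and attributes. The markup string and the id attribute are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration
html = '<p class="title" id="intro">Hello</p>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p             # the first <p> Tag in the document
tag_name = tag.name      # the Name of the tag
tag_attrs = tag.attrs    # the attributes, as a dict of key-value pairs

print(tag_name)
print(tag_attrs)
```

Note that class is a multi-valued attribute in HTML, so Beautiful Soup stores its value as a list, while id is stored as a plain string.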

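Importing the Beautiful Soup library: from bs4 import BeautifulSoup, or import bs4
An HTML document, its tag tree, and a BeautifulSoup object correspond one-to-one, so the three can be treated as equivalent.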

from bs4 import BeautifulSoup
soup = BeautifulSoup("<html>data</html>", "html.parser")  # parse from a string
soup2 = BeautifulSoup(open("D://demo.html"), "html.parser")  # parse from an open file object
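
When parsing from a file, it is usually safer to open it in a with block so the handle is closed afterwards. The sketch below writes a small temporary HTML file first so that it runs anywhere; the file contents and the UTF-8 encoding are assumptions for illustration.

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Create a small hypothetical HTML file so the example is self-contained
with tempfile.NamedTemporaryFile(
    "w", suffix=".html", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("<html><title>demo</title></html>")
    path = tmp.name

# Pass the open file object to BeautifulSoup; the file is closed on exit
with open(path, encoding="utf-8") as f:
    soup = BeautifulSoup(f, "html.parser")

os.remove(path)  # clean up the temporary file

title_text = soup.title.string
print(title_text)
```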

Parsers

  • bs4's HTML parser: 'html.parser' (Python's built-in parser; only the bs4 library is required)
  • lxml's HTML parser: 'lxml' (pip install lxml)
  • lxml's XML parser: 'xml' (pip install lxml)
  • html5lib's parser: 'html5lib' (pip install html5lib)
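
Since lxml and html5lib are optional installs, one possible approach is to try the parsers in order and fall back to 'html.parser'. The helper make_soup below is a hypothetical name introduced for this sketch, not part of bs4; it relies on bs4 raising FeatureNotFound when a requested parser is not installed.

```python
from bs4 import BeautifulSoup, FeatureNotFound

def make_soup(markup, preferred=("lxml", "html5lib", "html.parser")):
    """Hypothetical helper: build a soup with the first installed parser."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser), parser
        except FeatureNotFound:
            # This parser is not installed; try the next one
            continue
    raise RuntimeError("no usable parser found")

soup, used = make_soup("<html><p>data</p></html>")
print(used, soup.p.string)
```

Different parsers can build slightly different trees from malformed HTML, so pinning one parser explicitly may be preferable when results need to be reproducible.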

Basic elements of the BeautifulSoup class

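  • Tag: a tag, the most basic unit of information, delimited by <> and </>
  • Name: the tag's name, e.g. p; accessed with <tag>.name
  • Attributes: the tag's attributes, e.g. class, stored as a dict; accessed with <tag>.attrs
  • NavigableString: the non-attribute string inside a tag, i.e. its content; accessed with <tag>.string
  • Comment: the comment part of a string inside a tag; a special type (a subclass of NavigableString)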
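
The five elements above can be seen side by side in a short sketch. The markup is made up for illustration; only the types and attribute names come from bs4.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag, NavigableString, Comment

# Hypothetical markup containing a normal string and a comment
html = '<p class="title"><b>Bold text</b></p><i><!--hidden note--></i>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.p                  # Tag
name = tag.name               # Name
attrs = tag.attrs             # Attributes (a dict)
text = soup.b.string          # NavigableString
note = soup.i.string          # Comment

print(type(tag), name, attrs)
print(type(text), text)
print(type(note), note)
```

Because Comment is a subclass of NavigableString, isinstance(note, NavigableString) is also true; checking for Comment first is what distinguishes the two.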

 

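from bs4 import BeautifulSoup
# demo is assumed to hold the HTML text of the page being parsed
soup = BeautifulSoup(demo, "html.parser")
soup.title
tag = soup.a  # the first <a> tag in the document
tag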

Getting tag names, attributes, and strings

from bs4 import BeautifulSoup
soup = BeautifulSoup(demo, "html.parser")
soup.a.name
soup.a.parent.name
soup.a.parent.parent.name
tag = soup.a
tag.attrs  # this is a dict
tag.attrs['class']
tag.attrs['href']
type(tag.attrs)  # dict
type(tag)  # bs4.element.Tag
# NavigableString
soup.a.string
soup.p.string
type(soup.p.string)  # bs4.element.NavigableString
# Comment: check the type to filter out comment text
newsoup = BeautifulSoup("<b><!--This is a comment--></b><p>This is not a comment</p>", "html.parser")
newsoup.b.string
type(newsoup.b.string)  # bs4.element.Comment
type(newsoup.p.string)  # bs4.element.NavigableString
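
Since .string returns a comment without its <!-- --> markers, the type check mentioned above is the only way to tell a comment from ordinary text. A minimal sketch of that filtering, reusing the same markup and walking the tree with .descendants, might look like this:

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString

html = "<b><!--This is a comment--></b><p>This is not a comment</p>"
newsoup = BeautifulSoup(html, "html.parser")

# Walk every node in the tree and separate comments from ordinary strings
texts = [
    str(node)
    for node in newsoup.descendants
    if isinstance(node, NavigableString) and not isinstance(node, Comment)
]
comments = [
    str(node) for node in newsoup.descendants if isinstance(node, Comment)
]

print(texts)
print(comments)
```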

Reposted from blog.csdn.net/paleyellow/article/details/81079346