Python Web Scraping with the BeautifulSoup Library (Part 6): Output

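1. Pretty-printed output

The prettify() method returns the BeautifulSoup document as a formatted string, with each tag and each string on its own line, indented by nesting depth.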

from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# The lxml parser wraps the fragment in <html> and <body> tags.
soup = BeautifulSoup(markup, 'lxml')
print(soup.prettify())
<html>
 <body>
  <a href="http://example.com/">
   I linked to
   <i>
    example.com
   </i>
  </a>
 </body>
</html>
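
prettify() can also be called on an individual tag, not only on the whole document. The following is a minimal sketch of that. It assumes Python's built-in 'html.parser' in place of 'lxml' (a substitution made here so the example runs without lxml installed), which means no <html> or <body> wrapper is added.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Assumption: 'html.parser' instead of 'lxml', so the fragment is not wrapped.
soup = BeautifulSoup(markup, 'html.parser')

# prettify() on a single tag formats only that tag and its children.
pretty_link = soup.a.prettify()
print(pretty_link)

# Each tag and each string ends up on its own line; strip the indentation
# to look at the line contents alone.
pretty_lines = [line.strip() for line in pretty_link.splitlines()]
print(pretty_lines)
```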

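2. Compact output

If you only need the markup as a string and do not care about formatting, call str() on the BeautifulSoup object (or on any tag).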

str(soup)
'<html><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
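
As a small illustration of compact output, the sketch below calls str() on both the whole document and a single tag, and uses encode() to get bytes instead of a string. It assumes the built-in 'html.parser' rather than 'lxml', so the fragment comes back without the <html><body> wrapper shown above.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
# Assumption: 'html.parser' instead of 'lxml', so no wrapper tags are added.
soup = BeautifulSoup(markup, 'html.parser')

compact = str(soup)               # whole document as a str, no added whitespace
inner = str(soup.i)               # a single tag as a str
as_bytes = soup.encode('utf-8')   # the same markup as UTF-8 bytes

print(compact)
print(inner)
print(as_bytes)
```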

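3. HTML special characters

HTML entities in the input, such as &ldquo; and &rdquo;, are converted to the corresponding Unicode characters when the document is parsed, and str() outputs those characters rather than the entities.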

soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.","lxml")
str(soup)
'<html><body><p>“Dammit!” he said.</p></body></html>'
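
If entity references are wanted in the output again, the formatter argument accepted by decode(), encode() and prettify() can be set to "html". The sketch below is a hedged illustration of that option, again assuming the built-in 'html.parser' (so no <html><body><p> wrapper appears). The exact entity names produced are left to the library, so the example only checks that the result is plain ASCII and parses back to the same text.

```python
from bs4 import BeautifulSoup

# Assumption: 'html.parser' instead of 'lxml', so the text is not wrapped in tags.
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.", 'html.parser')

# Default output: entities have become Unicode curly quotes.
as_unicode = str(soup)

# formatter="html" asks for named HTML entities where they exist.
as_entities = soup.decode(formatter="html")

# Parsing the entity form again should give back the same Unicode text.
round_trip = str(BeautifulSoup(as_entities, 'html.parser'))

print(as_unicode)
print(as_entities)
```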

4. Getting all the text inside a tag: get_text()

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup,'lxml')
print(soup.get_text())
print(soup.i.get_text())
I linked to example.com

example.com

Specify a separator to place between the text fragments:

soup.get_text("|")
'\nI linked to |example.com|\n'

Strip whitespace from the beginning and end of each text fragment:

soup.get_text("|", strip=True)
'I linked to|example.com'
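
The three get_text() variants can be compared side by side. The sketch below does so, assuming the built-in 'html.parser' instead of 'lxml'. It also uses the stripped_strings generator, which yields the same whitespace-stripped fragments as a sequence when you would rather join them yourself.

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
# Assumption: 'html.parser' instead of 'lxml'; the text content is the same.
soup = BeautifulSoup(markup, 'html.parser')

plain = soup.get_text()                      # fragments joined with no separator
separated = soup.get_text("|")               # fragments joined with "|"
stripped = soup.get_text("|", strip=True)    # each fragment stripped, empty ones dropped
fragments = list(soup.stripped_strings)      # the stripped fragments as a list

print(repr(plain))
print(repr(separated))
print(repr(stripped))
print(fragments)
```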

Reposted from blog.csdn.net/bqw18744018044/article/details/81036142