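BeautifulSoup is a library for parsing HTML when writing web scrapers, and it works much like the xml module. Performance is probably about the same, too.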
#!/usr/bin/env python
# encoding: utf-8
# by i3ekr

import requests  # unused in this demo
from bs4 import BeautifulSoup

html = """
<!DOCTYPE html>
<html>
<head>
<title>title test demo</title>
</head>
<body>
<h1>this is h1</h1>
<h1>this is h1 two</h1>
<h1>this is h1 three</h1>
<a href="http://baidu.com">this is a href.</a>
</body>
</html>
"""
# Parse the document with the lxml parser (requires the lxml package).
bs = BeautifulSoup(html, "lxml")
# find_all returns a list of every matching <h1> tag.
print(bs.find_all('h1'))
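find_all returns tag objects, so printing the result shows the raw <h1> elements rather than their text. As a rough sketch of the usual next step, the snippet below pulls out the heading text and the link target. It assumes the built-in "html.parser" backend instead of lxml (so nothing beyond bs4 needs to be installed), and the variable names soup, headings, and href are illustrative, not from the original script.

```python
from bs4 import BeautifulSoup

# A trimmed-down copy of the demo page, kept inline so the sketch is self-contained.
html = """
<html>
<body>
<h1>this is h1</h1>
<h1>this is h1 two</h1>
<a href="http://baidu.com">this is a href.</a>
</body>
</html>
"""

# Assumption: use Python's bundled "html.parser" so lxml is not required.
soup = BeautifulSoup(html, "html.parser")

# get_text() returns the text inside each tag, without the markup.
headings = [h1.get_text() for h1 in soup.find_all("h1")]

# find() returns the first matching tag, or None if there is no match;
# attributes are read with dictionary-style access.
link = soup.find("a")
href = link["href"] if link is not None else None

print(headings)
print(href)
```

Checking for None before indexing is worth keeping in real scrapers, since a page that lacks the expected tag would otherwise raise a TypeError.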