Parsing HTML with the BeautifulSoup Module

Problem:

UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("html.parser"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 10 of the file D:\python_work\test\test.py. To get rid of this warning, pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor.

  noStarchSoup = bs4.BeautifulSoup(res.text)

Solution: name the parser explicitly when constructing the BeautifulSoup object.

    noStarchSoup = bs4.BeautifulSoup(res.text, features='html.parser')
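
To check that the extra argument really silences the warning, a small sketch along these lines can help. The markup string and the helper name parser_warnings are made up for illustration (the string stands in for res.text), and matching on the message text is just one simple way to pick out this particular warning.

```python
import warnings

import bs4

html = "<p>hello</p>"  # hypothetical markup, stands in for res.text


def parser_warnings(**kwargs):
    """Build a soup and return the 'no parser specified' warnings it triggered."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # record every warning, even repeated ones
        bs4.BeautifulSoup(html, **kwargs)
    return [w for w in caught
            if "No parser was explicitly specified" in str(w.message)]


without_parser = parser_warnings()                     # parser left to be guessed
with_parser = parser_warnings(features="html.parser")  # parser named explicitly
print(len(without_parser), len(with_parser))
```

The first count should be nonzero and the second zero, since naming the parser leaves nothing for BeautifulSoup to guess.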

Examples of CSS selectors. The select() method returns a list of Tag objects.

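Selector passed to select()	Matches...
soup.select('div')	All elements named <div>
soup.select('#author')	The element with an id attribute of author
soup.select('.notice')	All elements that use a CSS class attribute named notice
soup.select('div span')	All <span> elements anywhere inside a <div> element
soup.select('div > span')	All <span> elements directly inside a <div> element, with no other element in between
soup.select('input[name]')	All <input> elements that have a name attribute with any value
soup.select('input[type="button"]')	All <input> elements that have a type attribute with the value button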
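
The selectors in the table can be tried out on a small piece of markup. The HTML string below is invented purely for illustration (it is not example.html), and the sketch simply counts how many elements each selector matches.

```python
import bs4

# Hypothetical markup, invented for illustration only.
html = """
<div id="main">
  <span class="notice">direct child</span>
  <p><span>nested deeper</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
<p id="author">Someone</p>
"""

soup = bs4.BeautifulSoup(html, features="html.parser")

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# Map each selector to the number of Tag objects that select() returns for it.
counts = {sel: len(soup.select(sel)) for sel in selectors}
for sel, n in counts.items():
    print(f"{sel!r}: {n}")
```

Note the difference between 'div span' and 'div > span': the first also matches the <span> nested inside the <p>, while the second matches only the <span> that is a direct child of the <div>.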

File: example.html

<!-- This is the example.html example file. -->
<html><head><title>The Website Title</title></head>
<body>
<p>Download my <strong>Python</strong> book from <a href='http://inventwithpython.com'>learn Python the easy way!</a>.</p>
<p>By <span id='author'>Al Sweigart</span></p>
</body>
</html>

# -*- coding: utf-8 -*-
import bs4

# In Firefox, press Ctrl+Shift+C to open the developer tools and inspect the page's HTML.

# Read example.html and parse it; the with block closes the file afterwards.
with open('example.html') as examFile:
    exampleSoup = bs4.BeautifulSoup(examFile.read(), features="html.parser")

# Find elements with the select() method.
elems = exampleSoup.select('#author')
print(type(elems))
print(len(elems))
print(type(elems[0]))
print(elems[0].getText())

# Get data from the element's attributes.
print(elems[0].attrs)

Output

<class 'list'>
1
<class 'bs4.element.Tag'>
Al Sweigart
{'id': 'author'}

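# -*- coding: utf-8 -*-
import bs4

# Get data from an element's attributes.
with open('example.html') as examFile:
    exampleSoup = bs4.BeautifulSoup(examFile.read(), features="html.parser")
spanElem = exampleSoup.select('span')[0]
print(str(spanElem))        # the element as an HTML string
print(spanElem.get('id'))   # value of the id attribute
print(spanElem.get('some_nonexistent_addr') is None)  # get() returns None for a missing attribute
print(spanElem.attrs)       # all attributes as a dictionary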

Output

<span id="author">Al Sweigart</span>
author
True
{'id': 'author'}
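
Since get() returns None for a missing attribute, it may be worth contrasting it with the other ways of reading attributes. The sketch below inlines the <span> line from example.html so it runs on its own; the fallback value 'n/a' is a made-up placeholder.

```python
import bs4

# The author line from example.html, inlined so the sketch is self-contained.
html = "<p>By <span id='author'>Al Sweigart</span></p>"
spanElem = bs4.BeautifulSoup(html, features="html.parser").select('span')[0]

found = spanElem.get('id')                               # attribute exists
fallback = spanElem.get('some_nonexistent_addr', 'n/a')  # 'n/a' is a hypothetical default

# Square-bracket access raises KeyError instead of returning None.
try:
    spanElem['some_nonexistent_addr']
    raised = False
except KeyError:
    raised = True

print(found, fallback, raised)
```

Using get() is the safer choice when an attribute may be absent; square brackets are fine when the attribute is known to exist.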
