[Python Web Scraping, Day 6]: Basic BeautifulSoup4 Operations and Common CSS Selectors

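The BeautifulSoup4 library:
This is an HTML/XML parser, similar to the lxml library covered earlier. It is easier to use than lxml, but because it loads and parses the whole document into a tree, it is slower.
Installation:
pip install bs4
Basic usage of BeautifulSoup4: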

from bs4 import BeautifulSoup
html="""long HTML code"""
bs=BeautifulSoup(html,'lxml')
print(bs.prettify())

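In bs=BeautifulSoup(html,'lxml'), the second argument 'lxml' names the parser that BeautifulSoup uses; lxml is one of several parsers the library supports.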
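The basic usage above can be tried on a small document. This is a minimal sketch: the sample HTML is made up for illustration, and it uses Python's built-in "html.parser" instead of "lxml" so that it runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document, not taken from the original post
html = "<html><body><p class='intro'>Hello</p></body></html>"

# "html.parser" ships with Python; "lxml" additionally needs `pip install lxml`
soup = BeautifulSoup(html, "html.parser")

# prettify() returns the parsed tree as an indented string
pretty = soup.prettify()
print(pretty)

# soup.p is shorthand for the first <p> tag in the document
first_p = soup.p
print(first_p.name, first_p["class"])
```

Note that class is a multi-valued attribute, so first_p["class"] is a list rather than a plain string.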
Notes:
1. Using find_all:

soup=BeautifulSoup(html,'lxml')
#print(bs.prettify())
#print("1")

2. The difference between find and find_all
find returns the first tag that matches; find_all returns every tag that matches.
3. Getting a tag's attributes (two ways):

aList=soup.find_all("a")
for a in aList:
    #1.
    #hers=a["href"]
    #print(hers)
    #2.
    her=a.attrs['href']
    print(her)
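Notes 2 and 3 can be checked on a small document. This is a sketch with made-up HTML; it also shows a.get("href"), which is not in the original post but is a handy third option when an attribute may be missing.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration
html = """
<ul>
  <li><a href="/job/1" id="first">Engineer</a></li>
  <li><a href="/job/2">Designer</a></li>
  <li><a>No link</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.find("a")      # a single Tag, or None if nothing matches
all_a = soup.find_all("a")    # a list-like result, possibly empty
missing = soup.find("table")  # no <table> in the document

# a["href"] and a.attrs["href"] both raise KeyError when the attribute
# is absent; a.get("href") returns None instead
hrefs = [a.get("href") for a in all_a]
print(first_a["href"], first_a.attrs["href"])
print(hrefs)
```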

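4. The differences between string, strings, stripped_strings and get_text()
string: returns the tag's text as a single string. If the tag has more than one child node, string cannot pick one and returns None (inspect .contents in that case).
strings: returns all descendant text strings of a tag, as a generator.
stripped_strings: returns all descendant text strings of a tag with surrounding whitespace removed and whitespace-only strings skipped. It is a generator, which can be converted with list().
get_text():
returns all descendant text of a tag joined into one string.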
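The four accessors can be compared side by side. This is a small sketch on made-up HTML, using the built-in "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one <p> with a single string, one with mixed content
html = "<div><p>single</p><p> one <b>two</b>\n three </p></div>"
soup = BeautifulSoup(html, "html.parser")
p1, p2 = soup.find_all("p")

single = p1.string                      # one child string, so it is returned
ambiguous = p2.string                   # several children, so this is None
raw = list(p2.strings)                  # every descendant string, whitespace kept
stripped = list(p2.stripped_strings)    # trimmed, whitespace-only strings skipped
joined = p2.get_text()                  # all descendant text as one str

print(repr(single), repr(ambiguous))
print(raw)
print(stripped)
print(repr(joined))
```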
Some example code:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
#print(bs.prettify())
#print("1")
#1. Get all li tags
lis=soup.find_all("li")
for li in lis:
    print(li)
    print("*"*30)
#2. Get the second li tag
li=soup.find_all("li",limit=2)[1]
print(li)
#3. Get all div tags with class=industry
lis=soup.find_all("div",class_="industry")  # can also be written as lis=soup.find_all("div",attrs={'class':"industry"})
for li in lis:
    print(li)
#4. Get all a tags with id=test and class=test
#alis=soup.find_all("a",id="test",class_="test") is equivalent to
alis=soup.find_all("a",attrs={"id":"test","class":"test"})  # inside attrs the key is "class", without the trailing underscore
for a in alis:
    print(a)
#5. Get the href attribute of every a tag
aList=soup.find_all("a")
for a in aList:
    #1.
    #hers=a["href"]
    #print(hers)
    #2.
    her=a.attrs['href']
    print(her)
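#6. Get the text of each li
lis=soup.find_all("li")
for li in lis:
    # search inside the current li, not the whole document
    tds=li.find_all("div")
    title=tds[0].string
    time=tds[1].string
    job=tds[2].string

    # .strings yields every descendant string, whitespace included
    infos=li.strings
    # .stripped_strings trims each string and skips whitespace-only ones
    infos=list(li.stripped_strings)
    movie={}
    movie['title']=infos[0]
    movie['time']=infos[1]
    movie['job']=infos[2]

#get_text() is a method of a single tag, so call it on each li, not on the result list
lis=soup.find_all("li")
for li in lis:
    text=li.get_text()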
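The find_all filters used above can be run end to end on a small document. This is a sketch under assumptions: the HTML is invented for illustration and "html.parser" stands in for "lxml". It also shows why the attrs key must be "class" rather than "class_".

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration
html = """
<ul>
  <li><div class="industry">Games</div></li>
  <li><div class="industry big">Finance</div></li>
  <li><a id="test" class="test" href="/a">A</a><a class="test" href="/b">B</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# limit=2 stops after two matches; index 1 is then the second li
second_li = soup.find_all("li", limit=2)[1]

# class_ avoids the Python keyword "class"; both forms match any tag
# whose class list contains "industry"
by_kwarg = soup.find_all("div", class_="industry")
by_attrs = soup.find_all("div", attrs={"class": "industry"})

# Inside attrs the key is the real attribute name, "class"
both = soup.find_all("a", attrs={"id": "test", "class": "test"})
# "class_" as an attrs key looks for an attribute literally named class_
wrong = soup.find_all("a", attrs={"id": "test", "class_": "test"})

print(second_li.div.string)
print(len(by_kwarg), len(by_attrs), len(both), len(wrong))
```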

CSS selectors commonly used in web scraping:
1. Select by tag name, for example:


p{
	background-color:red
}

2. Select by class name, with a "." in front of the class name, for example:

.linn{
	background-color:red
}

3. Select by id, with a "#" in front of the id, for example:

#line3{
	background-color:red
}

4. Combined lookup of descendants: put a space before the descendant element, for example:

.box p{
	background-color:red
}

5. Select direct children only: put ">" between parent and child, for example:

.box>p{
	background-color:red
}

6. Select by attribute value, for example:

input[name='username']{
	background-color:red
}

7. When selecting by class or id and also filtering by tag, put the tag name in front, for example:

(by id)
div#line{
	background-color:red
}
or (by class)
div.line{
	background-color:red
}

To use CSS selectors in BeautifulSoup, call soup.select('selector string').
1. Get all tr tags:

from bs4 import BeautifulSoup
soup=BeautifulSoup(html,'lxml')
trs=soup.select("tr")

2. Get the second tr tag:

soup=BeautifulSoup(html,'lxml')
trs=soup.select("tr")[1]

3. Get all tr tags whose class is even:

soup=BeautifulSoup(html,'lxml')
trs=soup.select("tr.even")
trs=soup.select("tr[class='even']")

4. Get the href of every a tag:

soup=BeautifulSoup(html,'lxml')
alist=soup.select("a")
for a in alist:
	href=a['href']
	print(href)
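The seven selector forms listed earlier can all be passed to soup.select. The sketch below uses made-up HTML whose class and id names mirror the CSS examples, and "html.parser" in place of "lxml"; select_one, which returns only the first match, is an addition not covered in the original post.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document for illustration
html = """
<div class="box" id="line3">
  <p class="linn">child</p>
  <div class="inner"><p>grandchild</p></div>
  <input name="username">
  <a href="/x">link</a>
</div>
<p>outside</p>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("p")                         # 1. tag name
by_class = soup.select(".linn")                   # 2. class name
by_id = soup.select("#line3")                     # 3. id
descendants = soup.select(".box p")               # 4. any depth below .box
children = soup.select(".box > p")                # 5. direct children only
by_attr = soup.select("input[name='username']")   # 6. attribute value
tag_and_id = soup.select("div#line3")             # 7. tag plus id
tag_and_class = soup.select("div.inner")          # 7. tag plus class

# select() always returns a list; select_one() returns the first match or None
first_link = soup.select_one("a")
print(len(by_tag), len(descendants), len(children))
print(first_link["href"])
```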

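The four common object types:
1. Tag: every tag in a BeautifulSoup tree is a Tag object, and the BeautifulSoup object itself is also a Tag. Methods such as find and find_all are defined on Tag rather than on BeautifulSoup.
2. NavigableString: inherits from Python's str and is used the same way as str.
3. BeautifulSoup: inherits from Tag and represents the whole parsed document tree. Methods such as find_all and select are really Tag methods.
4. Comment: inherits from NavigableString.
".contents" and ".children"
Both return the direct children of a tag, including strings. The difference is that .contents returns a list, while .children returns an iterator.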
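The class relationships and the .contents/.children difference can be confirmed with isinstance. This is a minimal sketch on made-up HTML, again using "html.parser" so it runs without lxml.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# Hypothetical sample: a string, a comment and a tag as direct children
html = "<div>text<!--note--><p>para</p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .contents is a plain list of the direct children
text, comment, p = div.contents

soup_is_tag = isinstance(soup, Tag)              # BeautifulSoup inherits Tag
text_is_str = isinstance(text, str)              # NavigableString inherits str
comment_is_ns = isinstance(comment, NavigableString)  # Comment inherits NavigableString

# .children walks the same nodes lazily instead of returning a list
children = div.children
same_nodes = list(children) == div.contents

print(type(soup).__name__, type(div).__name__, type(text).__name__, type(comment).__name__)
print(soup_is_tag, text_is_str, comment_is_ns, same_nodes)
```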

slow.ver
