Web Scraping: Using the BeautifulSoup Library

BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Common functions:

soup.select():

Searches for content by CSS selector path, e.g.:

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
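A minimal, self-contained sketch of select() with CSS selectors; the html_doc string below is invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical document with three <p> tags
html_doc = """
<html><body>
<p class="title">First</p>
<p class="story">Second</p>
<p class="story">Third</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.select("p:nth-of-type(3)"))  # the third <p> among its siblings
print(soup.select("p.story"))           # every <p> with class "story"
```

select() always returns a list of matching tags, which may be empty if nothing matches.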

soup.find_all():

If you want every <a> tag, or need to match more content by name than a single tag, use the methods described in Searching the tree, such as find_all():

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com/')  # any page URL; html.text is its HTML
soup = BeautifulSoup(html.text, "html.parser")
soup.find_all('a')
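find_all() also accepts a list of tag names and keyword filters such as class_ and limit. A small sketch on invented in-memory markup (in practice the HTML would come from a request, as above):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page
html_doc = '<a href="/a" class="link">A</a><a href="/b" class="link ext">B</a><b>bold</b>'
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find_all('a'))                # all <a> tags
print(soup.find_all(['a', 'b']))         # tags matching any name in the list
print(soup.find_all('a', class_='ext'))  # filter by CSS class
print(soup.find_all('a', limit=1))       # stop after the first match
```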

soup.get_text():

If you only want the text contained in a tag, use the get_text() method. It gathers all the text inside the tag, including the text of its descendant tags, and returns the result as a single Unicode string.

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")  # pass a parser explicitly to avoid a warning

soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'
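get_text() also takes a separator string and a strip flag, which is handy for cleaning up the whitespace visible in the output above:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# strip=True trims each text fragment and drops empty ones;
# the separator joins the remaining fragments
print(soup.get_text("|", strip=True))  # I linked to|example.com
```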

soup.get():

Retrieves the value of a tag attribute; for example, it can pull the src URL out of an <img> tag.
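A short sketch of get() on an invented <img> tag; unlike dictionary-style access (img['src']), get() returns None instead of raising KeyError when the attribute is missing:

```python
from bs4 import BeautifulSoup

markup = '<img src="photo.jpg" alt="a photo">'
soup = BeautifulSoup(markup, "html.parser")
img = soup.img

print(img.get('src'))    # photo.jpg
print(img.get('title'))  # None -- missing attribute, no exception
```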

Code example

from bs4 import BeautifulSoup

data = []
path = './web/new_index.html'

# Parse the local HTML file and select each field with a CSS selector
with open(path, 'r') as f:
    Soup = BeautifulSoup(f.read(), 'lxml')
    titles = Soup.select('ul > li > div.article-info > h3 > a')
    pics = Soup.select('ul > li > img')
    descs = Soup.select('ul > li > div.article-info > p.description')
    rates = Soup.select('ul > li > div.rate > span')
    cates = Soup.select('ul > li > div.article-info > p.meta-info')

# zip the parallel result lists and build one record per article
for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),
        'descs': desc.get_text(),
        'rate': rate.get_text(),
        'cate': list(cate.stripped_strings)
    }
    data.append(info)

# Print articles whose rating string is at least 3 characters long
for i in data:
    if len(i['rate']) >= 3:
        print(i['title'], i['cate'])
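The cate field in the example above uses stripped_strings, a generator that yields every text fragment under a tag with surrounding whitespace removed. A small sketch on invented markup standing in for one p.meta-info element:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a p.meta-info element from the page
markup = '<p class="meta-info">\n  <span>python</span>\n  <span>scraping</span>\n</p>'
soup = BeautifulSoup(markup, "html.parser")

# Whitespace-only fragments are skipped entirely
print(list(soup.p.stripped_strings))  # ['python', 'scraping']
```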


Reposted from blog.csdn.net/Benanan/article/details/83656550