Web Scraping: Using the BeautifulSoup Library

BeautifulSoup documentation (Chinese): https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/

Common functions:

soup.select():

Searches for content by CSS selector path, e.g.:

soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
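A minimal, self-contained sketch of select() with CSS selectors; the html_doc string below is invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical document with three <p> tags
html_doc = """
<html><body>
<p class="title">First</p>
<p class="story">Second</p>
<p class="story">Third</p>
</body></html>
"""
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.select("p:nth-of-type(3)"))  # the third <p> among its siblings
print(soup.select("p.story"))           # every <p> with class "story"
```

select() always returns a list of matching tags, which may be empty if nothing matches.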

soup.find_all():

If you want every <a> tag, or need to match more content by name than a single tag, use the methods described in Searching the tree, such as find_all():

import requests
from bs4 import BeautifulSoup

html = requests.get('http://example.com/')  # any page URL; html.text is its HTML
soup = BeautifulSoup(html.text, "html.parser")
soup.find_all('a')
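find_all() also accepts a list of tag names and keyword filters such as class_ and limit. A small sketch on invented in-memory markup (in practice the HTML would come from a request, as above):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a fetched page
html_doc = '<a href="/a" class="link">A</a><a href="/b" class="link ext">B</a><b>bold</b>'
soup = BeautifulSoup(html_doc, "html.parser")

print(soup.find_all('a'))                # all <a> tags
print(soup.find_all(['a', 'b']))         # tags matching any name in the list
print(soup.find_all('a', class_='ext'))  # filter by CSS class
print(soup.find_all('a', limit=1))       # stop after the first match
```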

soup.get_text():

If you only want the text contained in a tag, use the get_text() method. It gathers all the text inside the tag, including the text of its descendant tags, and returns the result as a single Unicode string.

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")  # pass a parser explicitly to avoid a warning

soup.get_text()
# '\nI linked to example.com\n'
soup.i.get_text()
# 'example.com'
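get_text() also takes a separator string and a strip flag, which is handy for cleaning up the whitespace visible in the output above:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup, "html.parser")

# strip=True trims each text fragment and drops empty ones;
# the separator joins the remaining fragments
print(soup.get_text("|", strip=True))  # I linked to|example.com
```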

soup.get():

Retrieves the value of a tag attribute; for example, it can pull the src URL out of an <img> tag.
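A short sketch of get() on an invented <img> tag; unlike dictionary-style access (img['src']), get() returns None instead of raising KeyError when the attribute is missing:

```python
from bs4 import BeautifulSoup

markup = '<img src="photo.jpg" alt="a photo">'
soup = BeautifulSoup(markup, "html.parser")
img = soup.img

print(img.get('src'))    # photo.jpg
print(img.get('title'))  # None -- missing attribute, no exception
```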

Code example

from bs4 import BeautifulSoup

data = []
path = './web/new_index.html'

# Parse the local HTML file and select each field with a CSS selector
with open(path, 'r') as f:
    Soup = BeautifulSoup(f.read(), 'lxml')
    titles = Soup.select('ul > li > div.article-info > h3 > a')
    pics = Soup.select('ul > li > img')
    descs = Soup.select('ul > li > div.article-info > p.description')
    rates = Soup.select('ul > li > div.rate > span')
    cates = Soup.select('ul > li > div.article-info > p.meta-info')

# zip the parallel result lists and build one record per article
for title, pic, desc, rate, cate in zip(titles, pics, descs, rates, cates):
    info = {
        'title': title.get_text(),
        'pic': pic.get('src'),
        'descs': desc.get_text(),
        'rate': rate.get_text(),
        'cate': list(cate.stripped_strings)
    }
    data.append(info)

# Print articles whose rating string is at least 3 characters long
for i in data:
    if len(i['rate']) >= 3:
        print(i['title'], i['cate'])
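The cate field in the example above uses stripped_strings, a generator that yields every text fragment under a tag with surrounding whitespace removed. A small sketch on invented markup standing in for one p.meta-info element:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a p.meta-info element from the page
markup = '<p class="meta-info">\n  <span>python</span>\n  <span>scraping</span>\n</p>'
soup = BeautifulSoup(markup, "html.parser")

# Whitespace-only fragments are skipped entirely
print(list(soup.p.stripped_strings))  # ['python', 'scraping']
```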


Reposted from blog.csdn.net/Benanan/article/details/83656550