Data Analysis: BeautifulSoup

Table of contents

1. Basic introduction

2. Installation and creation

3. Use of BeautifulSoup

3.1. HTML file content

3.2. BeautifulSoup syntax

3.2.1. find

3.2.2. find_all

3.2.3. select

3.3. Node information

3.3.1. Get node content

3.3.2. Node attributes (the node's name and attribute dictionary)

3.3.3. Get the value of a node attribute

4. Example (get Starbucks menu information)


1. Basic introduction

 1. Abbreviation of BeautifulSoup: bs4 (also the name of the Python package that is imported)

 2. What is BeautifulSoup:

        BeautifulSoup is a Python library for parsing HTML and extracting data from the parsed document. It serves the same purpose as lxml and can use lxml as its underlying parser.

 3. Advantages and disadvantages of BeautifulSoup:

        Disadvantages: slower than lxml

        Advantages: a user-friendly interface that is easy to learn and use

2. Installation and creation
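
A minimal sketch of both steps, assuming the library is installed with `pip install beautifulsoup4` (plus `pip install lxml` if the 'lxml' parser used in the rest of this article is wanted). The markup, the temporary file name, and the variable names are illustrative only, and the built-in 'html.parser' is used so the snippet runs without lxml.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Illustrative markup, not the article's sample file.
html = '<ul><li id="l1">first</li><li>second</li></ul>'

# 1. Create the object from a string.
soup = BeautifulSoup(html, 'html.parser')
first_item = soup.find('li')
print(first_item)

# 2. Create the object from a local file (a temporary file stands in here).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.html')
    with open(path, 'w', encoding='utf8') as f:
        f.write(html)
    # The file is read while the BeautifulSoup object is being built,
    # so it can be closed right afterwards.
    with open(path, encoding='utf8') as f:
        soup_from_file = BeautifulSoup(f, 'html.parser')

print(soup_from_file.find('li').get_text())
```

Passing the parser name explicitly keeps the result the same across machines; without it, bs4 picks whichever parser happens to be installed.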

3. Use of BeautifulSoup

3.1. HTML file content

The examples in section 3 parse the following local HTML file:

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>Title</title>
</head>
<body>

    <div>
        <ul>
            <li id="l1">张三</li>
            <li id="l2">李四</li>
            <li>王五</li>
            <a href="" id="" class="a1">尚硅谷</a>
        </ul>
    </div>

    <span href="" title="a2">尚硅谷</span>
    <a href="" title="a2">尚硅谷</a>

    <div id="d1">
        <span>哈哈哈</span>
    </div>

    <p id="p1" class="p1">呵呵呵</p>
</body>
</html>

3.2. BeautifulSoup syntax

3.2.1. find

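from bs4 import BeautifulSoup

# Parse a local file to demonstrate the basic bs4 syntax.
# open() uses the platform's default encoding (gbk on Chinese-locale Windows),
# so the encoding is specified explicitly.
with open('22_爬虫_解析_bs4的基本使用.html', encoding='utf8') as f:
    soup = BeautifulSoup(f, 'lxml')

# 1. find
# 1.1 Return the first element that matches
print(soup.find('a'))
# 1.2 Find the tag whose title attribute has the given value
print(soup.find('a', title='a2'))
# 1.3 Find the tag by its class attribute. Note: class is a reserved word in
#     Python, so the keyword argument is written class_ with a trailing underscore
print(soup.find('a', class_='a1'))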
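
The same calls can be tried without the local file. The sketch below is an assumption-laden stand-in: it uses an inline string with English placeholder text and the built-in 'html.parser' instead of the article's file and lxml. It also shows two details the listing above does not: `attrs={...}` as an alternative to `class_`, and the `None` that `find` returns when nothing matches.

```python
from bs4 import BeautifulSoup

# Illustrative markup modelled loosely on the sample file.
html = '''
<ul>
  <li id="l1">Zhang San</li>
  <li id="l2">Li Si</li>
  <a href="" class="a1">link one</a>
</ul>
<a href="" title="a2">link two</a>
'''
soup = BeautifulSoup(html, 'html.parser')

# Keyword arguments filter on attribute values.
by_id = soup.find('li', id='l2')
# attrs={...} is an alternative to class_ and works for any attribute name.
by_attrs = soup.find('a', attrs={'class': 'a1'})
# find() returns None when no element matches.
missing = soup.find('table')

print(by_id.get_text())
print(by_attrs.get_text())
print(missing)
```

Because `find` can return `None`, check the result before calling methods on it when the element is not guaranteed to exist.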

3.2.2. find_all

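from bs4 import BeautifulSoup

# Parse a local file to demonstrate the basic bs4 syntax.
# open() uses the platform's default encoding (gbk on Chinese-locale Windows),
# so the encoding is specified explicitly.
with open('22_爬虫_解析_bs4的基本使用.html', encoding='utf8') as f:
    soup = BeautifulSoup(f, 'lxml')

# 2. find_all
# 2.1 Return a list containing every a tag
print(soup.find_all('a'))
# 2.2 To match several tag names at once, pass them as a list
print(soup.find_all(['a', 'span']))
# 2.3 limit caps the number of results: return only the first two li tags
print(soup.find_all('li', limit=2))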
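
As a rough illustration of what these calls return, the sketch below runs them on an inline string (placeholder text, built-in 'html.parser'; both are assumptions, not the article's setup). It also turns the result list into plain text and adds `id=True`, which matches any element that has an id attribute.

```python
from bs4 import BeautifulSoup

# Illustrative markup, not the article's sample file.
html = ('<ul><li id="l1">A</li><li id="l2">B</li><li>C</li></ul>'
        '<span>D</span><a href="">E</a>')
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list-like ResultSet, so a comprehension extracts the text.
texts = [li.get_text() for li in soup.find_all('li')]

# A list of names matches any of them, in document order.
mixed_names = [tag.name for tag in soup.find_all(['a', 'span'])]

# limit stops the search after the given number of matches.
first_two = soup.find_all('li', limit=2)

# id=True keeps only the elements that carry an id attribute.
with_id = soup.find_all('li', id=True)

print(texts)
print(mixed_names)
print(len(first_two), len(with_id))
```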

3.2.3. select

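from bs4 import BeautifulSoup

# Parse a local file to demonstrate the basic bs4 syntax.
# open() uses the platform's default encoding (gbk on Chinese-locale Windows),
# so the encoding is specified explicitly.
with open('22_爬虫_解析_bs4的基本使用.html', encoding='utf8') as f:
    soup = BeautifulSoup(f, 'lxml')

# 3. select (recommended)
# 3.1 select takes a CSS selector and returns a list of every match
print(soup.select('a'))

# 3.2 A leading . stands for class; this is called a class selector
print(soup.select('.a1'))

# 3.3 A leading # stands for id; this is called an id selector
print(soup.select('#l1'))

# 3.4 Attribute selectors: find tags by their attributes
# 3.4.1 Find the li tags that have an id attribute
print(soup.select('li[id]'))
# 3.4.2 Find the li tag whose id is l2
print(soup.select('li[id="l2"]'))

# 3.5 Hierarchy selectors
# 3.5.1 Descendant selector: find every li anywhere below a div
print(soup.select('div li'))
# 3.5.2 Child selector: match only the direct children of a tag
# Note: the spaces around ">" are optional; bs4 also accepts 'div>ul>li'
# without raising an error
print(soup.select('div > ul > li'))
# 3.5.3 Find all a tags and all li tags
print(soup.select('a, li'))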
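
The difference between the descendant and child selectors is easiest to see by counting matches. The following is a small sketch on made-up markup (built-in 'html.parser' assumed); it also uses `select_one`, which returns the first match or `None` instead of a list.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the li tags are grandchildren of the first div.
html = '''
<div id="d0"><ul><li id="l1">A</li><li>B</li></ul></div>
<div id="d1"><span>C</span></div>
'''
soup = BeautifulSoup(html, 'html.parser')

descendants = soup.select('div li')          # any depth below a div
direct_children = soup.select('div > li')    # li directly inside a div: none
full_path = soup.select('div > ul > li')     # each step is a direct child
with_id = soup.select('li[id]')              # li tags that have an id

# select_one returns a single Tag (or None), which avoids indexing with [0].
first = soup.select_one('#l1')
nothing = soup.select_one('#does-not-exist')

print(len(descendants), len(direct_children), len(full_path), len(with_id))
print(first.get_text(), nothing)
```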

3.3. Node information

3.3.1. Get node content

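obj = soup.select('#d1')[0]    # select() returns a list, so take the first match (the node with id="d1")
# If the tag contains only text, both string and get_text() work
# If the tag contains other tags besides text, string returns None, while get_text() still returns the text
# In general, get_text() is recommended
print(obj.string)
print(obj.get_text())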
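
The behaviour described in the comments above can be checked on a small inline document. This is a sketch under assumptions: placeholder text, the built-in 'html.parser', and variable names chosen for the example.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the div holds a nested span, the p holds only text.
html = '''<div id="d1">
    <span>ha</span>
</div>
<p id="p1">he</p>'''
soup = BeautifulSoup(html, 'html.parser')

div = soup.select_one('#d1')
p = soup.select_one('#p1')

# A tag with a single text child: string and get_text() agree.
print(p.string, p.get_text())

# A tag with several children (whitespace, span, whitespace): string is None.
print(div.string)

# get_text() joins all nested text; strip=True trims the surrounding whitespace.
print(repr(div.get_text()))
print(div.get_text(strip=True))
```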

3.3.2. Node attributes (the node's name and attribute dictionary)

obj = soup.select('#p1')[0]    # the node with id="p1"
# name is the name of the tag, such as div or span (here: p)
print(obj.name)
# attrs returns the attributes and their values as a dictionary
print(obj.attrs)

3.3.3. Get the value of a node attribute

obj = soup.select('#p1')[0]
# Get the node's class attribute; the three forms below return the same value.
# class can hold several names, so the value is a list such as ['p1'].
print(obj.attrs.get('class'))
print(obj.get('class'))
print(obj['class'])
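
The three access forms differ only when the attribute is missing. Below is a minimal sketch on an inline string (built-in 'html.parser' assumed; the second class name `note` is added purely for the demonstration).

```python
from bs4 import BeautifulSoup

# Illustrative markup; "note" is an extra class added for this example.
html = '<p id="p1" class="p1 note">text</p>'
soup = BeautifulSoup(html, 'html.parser')
p = soup.select_one('#p1')

print(p.name)      # tag name
print(p.attrs)     # all attributes as a dict
print(p['id'])     # single-valued attribute: a plain string
print(p['class'])  # multi-valued attribute: a list of class names

# get() returns None (or a supplied default) for a missing attribute ...
title = p.get('title')
title_or_default = p.get('title', 'n/a')

# ... while square-bracket access raises KeyError.
try:
    p['title']
    raised = False
except KeyError:
    raised = True

print(title, title_or_default, raised)
```

`get` is the safer choice when scraping pages where an attribute may be absent.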

4. Example (get Starbucks menu information)

import urllib.request
from bs4 import BeautifulSoup

url = 'https://www.starbucks.com.cn/menu/'

# Download the page and decode the response bytes as UTF-8
response = urllib.request.urlopen(url)
content = response.read().decode('utf8')

soup = BeautifulSoup(content, 'lxml')

# Equivalent XPath: //ul[@class="grid padded-3 product"]//strong/text()
# Select every <strong> inside the product list and print its text
name_list = soup.select('ul[class="grid padded-3 product"] strong')
for name in name_list:
    print(name.string)
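
The selector can be tried offline. The sketch below runs on hypothetical markup that only imitates the structure the selector expects; it is not the real Starbucks page, and the item names are placeholders. It compares the exact attribute match used above with a chain of class selectors, which ignores the order of the class names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the real page may be structured differently.
html = '''
<ul class="grid padded-3 product">
  <li><strong>Item A</strong></li>
  <li><strong>Item B</strong></li>
</ul>
<ul class="product grid padded-3">
  <li><strong>Item C</strong></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# [class="..."] compares the whole attribute string, so class order matters.
exact = [s.get_text() for s in
         soup.select('ul[class="grid padded-3 product"] strong')]

# Chained class selectors match as long as every class is present.
by_class = [s.get_text() for s in
            soup.select('ul.grid.padded-3.product strong')]

print(exact)
print(by_class)
```

The chained form is more tolerant if the site reorders or adds class names, at the cost of possibly matching more elements than intended.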

Origin blog.csdn.net/weixin_44302046/article/details/126756079