Python crawler - data parsing with BeautifulSoup

1. Basic introduction

BeautifulSoup, usually called bs4 after the package it is imported from, is an HTML parser like lxml. Its main purpose is to parse documents and extract data.

Like lxml, BeautifulSoup can parse both local files and content returned in server responses.

Disadvantage: it is slower than lxml.

Advantage: the interface is user-friendly and easy to use.
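
As an illustration of the two input sources mentioned above, here is a minimal sketch that parses the same markup once from a string (standing in for a decoded server response) and once from a local file. The markup, the file name sample.html and the choice of the built-in html.parser are assumptions made for this example, not part of the original article.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical markup standing in for the body of a server response.
html = "<html><body><p id='greeting'>Hello, bs4</p></body></html>"

# 1) Parse a string, e.g. the decoded body of an HTTP response.
soup_from_string = BeautifulSoup(html, "html.parser")

# 2) Parse a local file by passing an open file object.
#    A temporary directory keeps the example self-contained.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.html")
    with open(path, "w", encoding="utf-8") as f:
        f.write(html)
    with open(path, encoding="utf-8") as f:
        soup_from_file = BeautifulSoup(f, "html.parser")

# Both soups expose the same navigation API.
print(soup_from_string.p.get_text())
print(soup_from_file.p.get_text())
```

The file is read in full while the BeautifulSoup object is constructed, so the resulting soup stays usable after the file has been closed.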

2. Installation

pip install bs4

The examples below pass 'lxml' as the parser, so lxml must be installed as well: pip install lxml

3. Basic syntax

1. Node positioning
1. Find a node by tag name
        soup.a [Note] only the first a is found
                soup.a.name
                soup.a.attrs
2. Functions
        (1) .find (returns a single object)
                find('a'): finds only the first a tag
                find('a', title='name')
                find('a', class_='name')
        (2) .find_all (returns a list)
                find_all('a'): finds all a tags
                find_all(['a', 'span']): returns all a and span tags
                find_all('a', limit=2): finds only the first two a tags
        (3) .select (returns a list of node objects matched by a CSS selector) [recommended]
                1. element
                        eg: p
                2. .class
                        eg: .firstname
                3. #id
                        eg: #firstname
                4. Attribute selector
                        [attribute]
                                eg: li = soup.select('li[class]')
                        [attribute=value]
                                eg: li = soup.select('li[class="hengheng1"]')
                5. Hierarchical selector
                        element element
                                div p
                        element>element
                                div>p
                        element,element
                                div,p
                                eg: result = soup.select('a,span')
2. Node information
(1) Get node content (get_text() also works when other tags are nested inside the tag)
        obj.string
        obj.get_text() [recommended]
(2) Node properties
        tag.name gets the tag name
                eg: tag = soup.find('li')
                    print(tag.name)
        tag.attrs returns the attributes and their values as a dictionary
(3) Get node attributes
        obj.attrs.get('title') [commonly used]
        obj.get('title')
        obj['title']
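
The node positioning calls in the outline above can be tried without a local file. The following is a minimal sketch that assumes a small made-up HTML snippet (the ids, classes, titles and text are hypothetical) and uses the built-in html.parser so that only bs4 itself is required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; ids, classes and titles are made up.
html = """
<div id="d1">
  <ul>
    <li id="l1">Zhang San</li>
    <li id="l2">Li Si</li>
    <li>Wang Wu</li>
  </ul>
  <a href="/first" class="a1" title="t1">first link</a>
  <a href="/second" title="a2">second link</a>
  <span>a span</span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag-name access and find() both return a single Tag (the first match).
first_a = soup.a
by_title = soup.find("a", title="a2")
by_class = soup.find("a", class_="a1")

# find_all() returns a list; a list of names matches any of them.
two_li = soup.find_all("li", limit=2)
a_and_span = soup.find_all(["a", "span"])

# select() takes a CSS selector and also returns a list.
li_with_id = soup.select("li[id]")
direct_li = soup.select("div > ul > li")
by_id = soup.select("#l2")

print(first_a.name, first_a.attrs)
print(by_title.get_text(), "|", by_class.get_text())
print(len(two_li), len(a_and_span), len(li_with_id), len(direct_li))
print(by_id[0].get_text())
```

Note that select() always returns a list, even for an id selector, which is why the last line indexes with [0].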

Python code:

from bs4 import BeautifulSoup
 
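# Parse a local file to get familiar with the basic bs4 syntax
# open() uses the platform default encoding (gbk on Chinese-locale Windows), so specify the encoding explicitly
soup = BeautifulSoup(open('6.html', encoding='utf-8'), 'lxml')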
 
# Find a node by tag name
# Note: this returns only the first matching tag
print(soup.a)
# attrs returns the tag's attributes and their values
print(soup.a.attrs)
 
# bs4 functions
# (1) find      returns the first match
print(soup.find('a'))
# Find the tag object by its title value
print(soup.find('a', title="a2"))
# Find the tag object by its class value; note the trailing underscore in class_
print(soup.find('a', class_="a1"))
 
# (2) find_all      returns a list; here, all a tags
print(soup.find_all('a'))
# To match several tag names, pass them to find_all as a list
print(soup.find_all(['a', 'span']))
# limit caps the number of results returned
print(soup.find_all('li', limit=2))
 
# (3) select (recommended)      returns a list of all matches
print(soup.select('a'))
# A leading . stands for class; this is called a class selector
print(soup.select('.a1'))
# A leading # stands for id
print(soup.select('#l1'))
 
# Attribute selectors: find tags by their attributes
# li tags that have an id attribute
print(soup.select('li[id]'))
# li tags whose id is l2
print(soup.select('li[id="l2"]'))
 
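# Hierarchical selectors
#   Descendant selector
# li tags anywhere under a div
print(soup.select('div li'))
 
#   Child selector
# Direct (first-level) children of a tag
# Note: the spaces around > are optional; bs4 also accepts 'div>ul>li' without raising an error
print(soup.select('div > ul > li'))
# All a tags and all li tags
print(soup.select('a,li'))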
 
 
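# Node information
#   Get node content
obj = soup.select('#d1')[0]
# If the tag contains only text, both string and get_text() work
# If the tag contains other tags as well as text, string returns None, while get_text() still returns the text
# get_text() is generally recommended
print(obj.string)
print(obj.get_text())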
 
# Node properties
obj = soup.select('#p1')[0]
# name is the tag name
print(obj.name)
# attrs returns the attributes and their values as a dictionary
print(obj.attrs)
 
# Get node attributes
obj = soup.select('#p1')[0]
print(obj.attrs.get('class'))
print(obj.get('class'))
print(obj['class'])
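
Since 6.html is not included with the article, the difference between string and get_text(), and the three ways of reading an attribute, can be checked with the following minimal sketch. The markup is hypothetical and the built-in html.parser is used instead of lxml; both are assumptions made for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a div with nested content and a p with only text.
html = """
<div id="d1"><span>hello</span> world</div>
<p id="p1" class="c1 c2" title="demo">plain text</p>
"""
soup = BeautifulSoup(html, "html.parser")

div = soup.select("#d1")[0]
p = soup.select("#p1")[0]

# The div has two children (a span and a text node), so .string is None,
# while get_text() concatenates all text inside the tag.
print(div.string)
print(div.get_text())

# The p holds a single text node, so both approaches return it.
print(p.string)
print(p.get_text())

# class is a multi-valued attribute, so bs4 returns it as a list.
print(p["class"])
print(p.attrs.get("title"), p.get("title"))

# get() returns None for a missing attribute, whereas p["missing"]
# would raise KeyError.
print(p.get("missing"))
```

This is one reason to prefer obj.get('title') or obj.attrs.get('title') over obj['title'] when an attribute may be absent.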

4. Case: Starbucks

 Requirement: crawl the product names from the Starbucks official website.

        First find the URL (interface) on the Starbucks website that serves the product list; the code below requests the menu page.

Python code:

import urllib.request
from bs4 import BeautifulSoup
 
url = 'https://www.starbucks.com.cn/menu/'
 
response = urllib.request.urlopen(url)
 
content = response.read().decode('utf-8')
 
soup = BeautifulSoup(content,'lxml')
 
# Equivalent XPath: //ul[@class="grid padded-3 product"]//strong/text()
# select() returns every strong tag under the product list
name_list = soup.select('ul[class="grid padded-3 product"] strong')
 
for name in name_list:
    print(name.get_text())
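
The live page may change or be unreachable, so the extraction step can be separated from the download and checked offline. The following is a sketch under stated assumptions: the helper names fetch_html and extract_names, the User-Agent header, the timeout and the sample markup are all hypothetical, and the sample only imitates the structure implied by the selector used above. Whether the real site requires a browser-like User-Agent is not stated in the source.

```python
import urllib.request

from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return it as text (hypothetical helper).

    The User-Agent header is an assumption; some sites reject
    requests that do not look like they come from a browser.
    """
    request = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def extract_names(html):
    """Return the text of every strong tag inside the product list."""
    soup = BeautifulSoup(html, "html.parser")
    strong_tags = soup.select('ul[class="grid padded-3 product"] strong')
    # strip=True removes surrounding whitespace from each name.
    return [tag.get_text(strip=True) for tag in strong_tags]


# Made-up markup that mimics the structure the selector expects.
SAMPLE_HTML = """
<ul class="grid padded-3 product">
  <li><a href="#"><strong>Sample Drink A</strong></a></li>
  <li><a href="#"><strong> Sample Drink B </strong></a></li>
</ul>
<ul class="other"><li><strong>Not a product</strong></li></ul>
"""

names = extract_names(SAMPLE_HTML)
print(names)
```

Against the real site, the two helpers would be combined as extract_names(fetch_html('https://www.starbucks.com.cn/menu/')). The download is deliberately not run here so that the example works without network access.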

 


Origin blog.csdn.net/qq_62594984/article/details/132627960