120 Crawler Cases: learn the Python beautifulsoup4 module in one 7000-word post, plus a crawler for the 9th Workshop site

"Offer arrives, dig friends to pick up! I am participating in the 2022 Spring Recruitment Check-in Event, click to view the details of the event ."

Today brings a new installment in the "120 Crawler Cases" column series; the next three articles will focus on learning BeautifulSoup4.

BeautifulSoup4 Basics

BeautifulSoup4 is a Python parsing library, used mainly for parsing HTML and XML; in crawler work, HTML parsing is the more common case. The library's install command is as follows:

pip install beautifulsoup4

When parsing data, BeautifulSoup relies on third-party parsers. Common parsers and their strengths are as follows:

  • Python standard library html.parser: built into Python, with strong fault tolerance;
  • lxml parser: fast and strongly fault tolerant;
  • html5lib: the most fault tolerant of all; it parses pages the same way a browser does.
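The parser name is passed as the second argument when the object is constructed. A minimal sketch (the lxml and html5lib lines assume those packages have been installed separately via pip):

from bs4 import BeautifulSoup

html = "<p>test</p>"

soup = BeautifulSoup(html, "html.parser")  # built-in parser, no extra install
# soup = BeautifulSoup(html, "lxml")       # requires: pip install lxml
# soup = BeautifulSoup(html, "html5lib")   # requires: pip install html5lib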

Next, a custom snippet of HTML is used to demonstrate basic usage of the beautifulsoup4 library. The test code is as follows:

<html>
  <head>
    <title>测试bs4模块脚本</title>
  </head>
  <body>
    <h1>橡皮擦的爬虫课</h1>
    <p>用一段自定义的 HTML 代码来演示</p>
  </body>
</html>

Use BeautifulSoup to perform simple operations on it, including instantiating a BS object and printing page tags.

from bs4 import BeautifulSoup

text_str = """<html>
	<head>
		<title>测试bs4模块脚本</title>
	</head>
	<body>
		<h1>橡皮擦的爬虫课</h1>
		<p>用1段自定义的 HTML 代码来演示</p>
		<p>用2段自定义的 HTML 代码来演示</p>
	</body>
</html>
"""

# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above builds the object from a string; you can also build one from a file
# soup = BeautifulSoup(open('test.html'))

print(soup)
# Print the page title tag
print(soup.title)
# Print the page head tag
print(soup.head)

# Test printing a paragraph tag p
print(soup.p) # only the first match is returned

Page tags can be accessed directly as attributes of the BeautifulSoup object, but there is a catch: accessing a tag this way only ever returns the first occurrence. In the code above, for example, only one p tag is obtained. To get more, read on.

To do that, you first need to understand the four built-in object types in BeautifulSoup.

  • BeautifulSoup: the basic object representing the entire HTML document; it can generally be treated as a Tag object;
  • Tag: a tag object; a tag is any node in the page, such as title, head, or p;
  • NavigableString: the string inside a tag;
  • Comment: a comment object; it does not come up often in crawler work.

The following code shows where these objects appear; pay attention to the comments in the code.

from bs4 import BeautifulSoup

text_str = """<html>
	<head>
		<title>测试bs4模块脚本</title>
	</head>
	<body>
		<h1>橡皮擦的爬虫课</h1>
		<p>用1段自定义的 HTML 代码来演示</p>
		<p>用2段自定义的 HTML 代码来演示</p>
	</body>
</html>
"""

# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")
# The above builds the object from a string; you can also build one from a file
# soup = BeautifulSoup(open('test.html'))

print(soup)
print(type(soup))  # <class 'bs4.BeautifulSoup'>
# Print the page title tag
print(soup.title)
print(type(soup.title)) # <class 'bs4.element.Tag'>
print(type(soup.title.string)) # <class 'bs4.element.NavigableString'>
# Print the page head tag
print(soup.head)
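The Comment object does not appear in this page, but it is easy to produce: when a tag contains only an HTML comment, the tag's string is a Comment instance. A tiny sketch:

comment_soup = BeautifulSoup("<p><!-- a comment --></p>", "html.parser")
print(comment_soup.p.string)        # the comment text, without the <!-- --> markers
print(type(comment_soup.p.string))  # <class 'bs4.element.Comment'>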

A Tag object has two important properties: name and attrs.

from bs4 import BeautifulSoup

text_str = """<html>
	<head>
		<title>测试bs4模块脚本</title>
	</head>
	<body>
		<h1>橡皮擦的爬虫课</h1>
		<p>用1段自定义的 HTML 代码来演示</p>
		<p>用2段自定义的 HTML 代码来演示</p>
		<a href="http://www.csdn.net">CSDN 网站</a>
	</body>
</html>
"""

# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")


print(soup.name) # [document]
print(soup.title.name) # get the tag name: title

print(soup.html.body.a) # walk down through nested tags level by level
print(soup.body.a) # html is a special root tag and can be omitted
print(soup.p.a) # returns None -- there is no a tag inside p

print(soup.a.attrs) # get the attributes as a dict

The code above demonstrates the name and attrs properties. attrs yields a dictionary, so a value can be looked up by its key.

To get a tag's attribute value, BeautifulSoup also offers the following:

print(soup.a["href"])
print(soup.a.get("href"))
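The two forms differ when an attribute is absent: the subscript form raises a KeyError, while get() returns None (or a default you supply), which is often the safer choice on messy pages:

print(soup.a.get("class"))              # None -- attribute absent, no exception raised
print(soup.a.get("class", "no-class"))  # supply a fallback default
# print(soup.a["class"])                # this line would raise KeyError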

Getting a NavigableString object: once you have a tag, the next step is getting the text inside it, which the following code does.

print(soup.a.string)

Beyond that, the text property and the get_text() method also return a tag's content.

print(soup.a.string)
print(soup.a.text)
print(soup.a.get_text())
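One practical difference worth knowing: string only returns a value when the tag has a single text child. On a tag with several children, such as body here, string yields None, while text and get_text() concatenate all descendant text:

print(soup.body.string)      # None -- body has several child nodes
print(soup.body.get_text())  # all the text inside body, concatenated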

All of the text inside a tag can also be retrieved with strings and stripped_strings.

print(list(soup.body.strings)) # includes whitespace and newlines
print(list(soup.body.stripped_strings)) # whitespace and newlines stripped

Extended Tag/Node Selectors: Traversing the Document Tree

Direct children

A Tag object's direct children can be retrieved through the contents and children properties.

from bs4 import BeautifulSoup

text_str = """<html>
	<head>
		<title>测试bs4模块脚本</title>
	</head>
	<body>
		<div id="content">
			<h1>橡皮擦的爬虫课<span>最棒</span></h1>
            <p>用1段自定义的 HTML 代码来演示</p>
            <p>用2段自定义的 HTML 代码来演示</p>
            <a href="http://www.csdn.net">CSDN 网站</a>
		</div>
        <ul class="nav">
            <li>首页</li>
            <li>博客</li>
            <li>专栏课程</li>
        </ul>

	</body>
</html>
"""

# Instantiate a BeautifulSoup object
soup = BeautifulSoup(text_str, "html.parser")

# contents returns the node's direct children as a list
print(soup.div.contents) # returns a list
# children also returns the direct children, but as an iterator
print(soup.div.children) # returns <list_iterator object at 0x00000111EE9B6340>
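Because children returns an iterator rather than a list, it is usually consumed in a loop (or wrapped in list()); note that the whitespace between tags also shows up as text nodes:

for child in soup.div.children:
    print(repr(child))  # tags and whitespace text nodes alternate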

Note that both properties return only direct children; descendant tags such as the span inside h1 are not returned as separate items.

To get every tag, use the descendants property instead. It returns a generator in which all tags, including the text inside them, appear as separate items.

print(list(soup.div.descendants))
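Since descendants yields tags and the text inside them as separate items, a common pattern is filtering by type, e.g. keeping only Tag nodes. A minimal sketch:

from bs4.element import Tag

for node in soup.div.descendants:
    if isinstance(node, Tag):
        print(node.name)  # tag names at every depth under div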

Getting other nodes (just know these exist; look them up when needed)

  • parent and parents: the direct parent node and all ancestor nodes;
  • next_sibling, next_siblings, previous_sibling, previous_siblings: the next sibling, all following siblings, the previous sibling, and all preceding siblings; since a newline is itself a node, watch out for newline text nodes when using these;
  • next_element, next_elements, previous_element, previous_elements: the next or previous node in document order; note that these ignore hierarchy and walk all nodes, e.g. in the code above the node after div (ignoring whitespace) is h1, whereas div's sibling node is ul (see the quick check below).
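Note how the newline text node forces two next_sibling hops to reach ul:

print(repr(soup.div.next_sibling))              # a whitespace text node, not ul
print(soup.div.next_sibling.next_sibling.name)  # ul -- the actual sibling tag
print(soup.h1.parent.name)                      # div -- direct parent
print(repr(soup.div.next_element))              # whitespace first; h1 follows in document order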

Document Tree Search Functions

The first function to learn is find_all(), whose prototype is shown below:

find_all(name, attrs, recursive, text, limit=None, **kwargs)
  • name: the tag name, e.g. find_all('p') finds all p tags; it accepts a tag-name string, a regular expression, or a list;
  • attrs: the attributes to match, passed as a dictionary, e.g. attrs={'class': 'nav'}; the result is a list of Tag objects;

Example usage of these two parameters:

import re

print(soup.find_all('li')) # get all li tags
print(soup.find_all(attrs={'class': 'nav'})) # match on attrs
print(soup.find_all(re.compile("p"))) # pass a regex; results in practice are hit-and-miss
print(soup.find_all(['a','p'])) # pass a list
  • recursive: when find_all() is called, BeautifulSoup searches all descendants of the current tag; to search only the tag's direct children, pass recursive=False. Test code:
print(soup.body.div.find_all(['a','p'], recursive=False)) # pass a list
  • text: matches text strings in the document; like the name parameter, it accepts a tag-name string, a regular expression, or a list;
print(soup.find_all(text='首页')) # ['首页']
print(soup.find_all(text=re.compile("^首"))) # ['首页']
print(soup.find_all(text=["首页", re.compile('课')])) # ['橡皮擦的爬虫课', '首页', '专栏课程']
  • limit: caps the number of results returned (demonstrated below, together with data-* attributes);
  • kwargs: a keyword argument whose name is not one of the built-in parameter names is treated as a tag attribute to search on. To search by the class attribute, write class_, since class is a Python reserved word. When matching with class_, a single CSS class name is enough to match; to require several class names, list them in the same order as in the tag.
print(soup.find_all(class_='nav'))
print(soup.find_all(class_='nav li'))

Also note that some attributes in page nodes cannot be used as kwargs in a search, e.g. the HTML5 data-* attributes; those must be matched through the attrs parameter.
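For instance, a data-* name cannot be written as a Python keyword argument, so it goes through attrs; the same call can also demonstrate limit. A minimal sketch (the data-id markup below is made up for illustration and is not part of the demo page):

html = '<div data-id="1">a</div><div data-id="2">b</div><div data-id="3">c</div>'
data_soup = BeautifulSoup(html, "html.parser")

# data_soup.find_all(data-id="1") would be a SyntaxError
print(data_soup.find_all(attrs={"data-id": "1"}))  # match data-* via attrs instead
print(data_soup.find_all("div", limit=2))          # limit caps the result count at 2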

Other methods whose usage is essentially the same as find_all() are listed below; a few are demonstrated right after the list.

  • find(): prototype find(name, attrs, recursive, text, **kwargs); returns a single matching element;
  • find_parents(), find_parent(): prototype find_parent(self, name=None, attrs={}, **kwargs); return the ancestor nodes / the parent node of the current node;
  • find_next_siblings(), find_next_sibling(): prototype find_next_sibling(self, name=None, attrs={}, text=None, **kwargs); return the following sibling nodes / the next sibling node;
  • find_previous_siblings(), find_previous_sibling(): as above, but for the preceding siblings;
  • find_all_next(), find_next(), find_all_previous(), find_previous(): prototype find_all_next(self, name=None, attrs={}, text=None, limit=None, **kwargs); search the nodes that come after (or before) the current node.
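Against the demo page, a few of these in practice; the singular find_xxx variants return a single node, while the plural variants return a list:

print(soup.find("li"))                           # the first li only, a single Tag
print(soup.find("li").find_next_sibling("li"))   # the li right after it
print(soup.find("h1").find_parent("div")["id"])  # content -- the enclosing div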

CSS Selectors

This subsection overlaps somewhat with pyquery; the core is the select() method, which returns a list of matching nodes.

  • by tag name: soup.select("title");
  • by class name: soup.select(".nav");
  • by id: soup.select("#content");
  • by combination: soup.select("div#content");
  • by attribute: soup.select("div[id='content']") or soup.select("a[href]").
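All of these go through the same select() call; select_one() is a handy companion that returns only the first match. Against the demo page:

print(soup.select(".nav li")[0].text)        # 首页 -- descendant combinator
print(soup.select_one("#content > p").text)  # first p that is a direct child of the content div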

When selecting by attribute, a few extra tricks are available, for example:

  • ^=: matches nodes whose attribute value starts with the given string:
print(soup.select('ul[class^="na"]'))
  • *=: matches nodes whose attribute value contains the given string (no ul in this demo has a class containing "li", so this particular call returns an empty list):
print(soup.select('ul[class*="li"]'))

The 9th Workshop Crawler

With the BeautifulSoup basics in hand, writing the crawler case is very simple. This time the target is www.9thws.com/#p2, a site with a large number of artistic QR codes that designers can use for reference.

The code below applies BeautifulSoup's tag search and attribute search; the full code is as follows:

from bs4 import BeautifulSoup
import requests
import logging
import os

logging.basicConfig(level=logging.NOTSET)


def get_html(url, headers) -> None:
    res = None  # initialize so a failed request does not leave res unbound
    try:
        res = requests.get(url=url, headers=headers, timeout=3)
    except Exception as e:
        logging.debug("Request failed: %s", e)

    if res is not None:
        html_str = res.text
        soup = BeautifulSoup(html_str, "html.parser")
        imgs = soup.find_all(attrs={'class': 'lazy'})
        print("Number of items fetched:", len(imgs))
        datas = []
        for item in imgs:
            name = item.get('alt')
            src = item["src"]
            logging.info(f"{name},{src}")
            # collect (name, src) pairs
            datas.append((name, src))
        save(datas, headers)


def save(datas, headers) -> None:
    if datas is not None:
        os.makedirs("./imgs", exist_ok=True)  # make sure the output directory exists
        for item in datas:
            res = None
            try:
                # fetch the image itself
                res = requests.get(url=item[1], headers=headers, timeout=5)
            except Exception as e:
                logging.debug("Image download failed: %s", e)

            if res is not None:
                img_data = res.content
                with open("./imgs/{}.jpg".format(item[0]), "wb+") as f:
                    f.write(img_data)
    else:
        return None


if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36"
    }
    url_format = "http://www.9thws.com/#p{}"
    urls = [url_format.format(i) for i in range(1, 2)]
    get_html(urls[0], headers)

The output of this code goes through the logging module. The test collects only one page of data; to widen the collection range, only the page-numbering rules in the __main__ block need to change. Note: while writing the code it turned out that the site actually requests its data via POST and returns JSON, so this case serves only as an introductory BeautifulSoup example.

Code repository address: codechina.csdn.net/hihell/pyth… — a follow or a Star would be appreciated.

Written at the end

The bs4 module learning journey has officially begun; let's keep at it together.

Today is day 238/365 of continuous writing. Follows, likes, comments, and favorites are all appreciated.


Origin juejin.im/post/7079402951239794702