Basic use of Python crawler bs4
Disclaimer: This article is for reference only and may not be reproduced or copied. Any reader who uses its contents in violation of national laws and regulations bears all resulting consequences; the blogger accepts no responsibility for such use, nor for any disputes arising from reproduction, copying, or similar actions by readers.
import requests
from bs4 import BeautifulSoup
1. bs4 basic syntax
1.1 Get html page
Get local html page
# Read the local file
fp = open("./data/base/taobao.html", "r", encoding="UTF-8")
# Load the data into a BeautifulSoup object (local HTML file)
html = BeautifulSoup(fp, "lxml")
print(html)
Fetch a website to get the html page
# Fetch the page
response_text = requests.get(url="https://s.taobao.com/").text
# Load the data into a BeautifulSoup object (HTML from the network)
html = BeautifulSoup(response_text, "lxml")
print(html)
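Both loading paths differ only in where the markup comes from. A minimal offline sketch, using an inline HTML string (a made-up stand-in for the local file or downloaded page) and the stdlib `html.parser` backend so that neither `lxml` nor network access is required; `bs4` itself is a third-party package (`pip install beautifulsoup4`):

```python
from bs4 import BeautifulSoup

# Inline markup stands in for a local file or a downloaded page.
doc = "<html><body><p id='msg'>hello</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")
print(soup.p)  # the parsed <p> tag
```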
1.2 Get tags
soup.<tagName>
Returns the first matching tag by default, or None if no such tag exists
print(html.a)
print(html.img)
print(html.input)
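The two behaviors above (first match only, None on a miss) can be seen with a small made-up document:

```python
from bs4 import BeautifulSoup

doc = "<div><a href='/1'>one</a><a href='/2'>two</a></div>"
soup = BeautifulSoup(doc, "html.parser")
print(soup.a)    # only the first <a> is returned
print(soup.img)  # no <img> in the document, so this is None
```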
soup.find(<tagName>)
Equivalent to soup.&lt;tagName&gt;: returns the first matching tag by default
print(html.find("a"))
print(html.find("img"))
print(html.find("input"))
soup.find(<tagName>, <tagName.class>)
Locate a tag by attribute: passing a class value matches only tags carrying that class
print(html.find("div", class_="site-nav"))
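Attribute positioning distinguishes between tags of the same name; a sketch with invented class names:

```python
from bs4 import BeautifulSoup

doc = ("<div class='site-nav'>nav</div>"
       "<div class='footer'>foot</div>")
soup = BeautifulSoup(doc, "html.parser")
# Only the <div> with class="footer" is matched, not the first <div>.
print(soup.find("div", class_="footer"))
```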
soup.find_all(<tagName>)
Returns all matching tags as a list
print(html.find_all("a"))
print(html.find_all("input"))
soup.select(<select>)
Find tags with CSS selectors (class, id, descendant combinators, etc.); returns all matches as a list
print(html.select(".bang"))
print(html.select("#J_SearchForm .search-button"))
print(html.select(".copyright"))
1.3 Get the content in the label
text / get_text()
Get all text content, including text nested inside child tags
string
Get only the tag's direct text content; returns None if the tag holds more than one child
print(html.find("div", class_="search-button").text)
print(html.find("div", class_="search-button").string)
print(html.find("div", class_="search-button").get_text())
1.4 Get the attributes in the label
[<attribute>]
print(html.a["href"])
print(html.find("div")["class"])
print(html.find_all("div")[5]["class"])
2. Examples
Scraping the chapter text of Romance of the Three Kingdoms
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup
if __name__ == '__main__':
    # URL and User-Agent header
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0"
    }
    # Fetch the index page
    html = requests.get(url=url, headers=headers, timeout=5).text
    # Load the data into a BeautifulSoup object (HTML from the network)
    soup = BeautifulSoup(html, "lxml")
    # Select the tags that hold the chapter links
    content = soup.select("div.book-mulu > ul > li > a")
    # Output file
    fp = open("./data/sgyy/sgyy.txt", "w", encoding="utf-8")
    fp.write("Chapter\tLink\tContent\n")
    for c in content:
        # Fetch the detail page of this chapter
        href_text = requests.get(url="https://www.shicimingju.com" + c["href"], headers=headers, timeout=5).text
        # Parse the chapter's body text
        href_soup = BeautifulSoup(href_text, "lxml")
        href_text = href_soup.find("div", class_="chapter_content").text
        # Write the chapter name, link and content
        fp.write(f'{c.text}\t{"https://www.shicimingju.com" + c["href"]}\t{href_text}\n')
        print(c.text + " done")
    fp.close()
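The core of the example is the index-page selector; it can be exercised offline with a made-up snippet that imitates the structure of the shicimingju chapter list (the class name book-mulu and the URL prefix are taken from the example above, the chapter titles and paths are invented):

```python
from bs4 import BeautifulSoup

# Stand-in for the chapter index page (the real page is fetched with requests).
index_html = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>
"""
soup = BeautifulSoup(index_html, "html.parser")
# Same selector as in the full example: anchors inside the chapter list.
chapters = [(a.text, a["href"]) for a in soup.select("div.book-mulu > ul > li > a")]
for name, href in chapters:
    print(name, "https://www.shicimingju.com" + href)
```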