Python crawler basics: data extraction with bs4

Disclaimer: This article is for reference only and may not be reproduced or copied. Any reader who violates national laws and regulations in connection with this article bears all consequences personally; the author accepts no responsibility. Likewise, any disputes or consequences arising from a reader reprinting or copying this article in violation of national laws and regulations are borne entirely by that reader and have nothing to do with the author.

import requests
from bs4 import BeautifulSoup

1. bs4 basic syntax

1.1 Get html page

Load a local HTML page

# Read the local file
fp = open("./data/base/taobao.html", "r", encoding="UTF-8")
# Load the data into a BeautifulSoup object (local HTML file)
html = BeautifulSoup(fp, "lxml")
print(html)

Fetch an HTML page from a website

# Crawl the page
response_text = requests.get(url="https://s.taobao.com/").text
# Load the data into a BeautifulSoup object (remote HTML file)
html = BeautifulSoup(response_text, "lxml")
print(html)
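Both snippets above hand BeautifulSoup the third-party "lxml" parser. If lxml is not installed, Python's built-in "html.parser" works as a drop-in replacement. A minimal self-contained sketch (the literal markup here is invented for illustration, so it runs without taobao.html):

```python
from bs4 import BeautifulSoup

# Invented markup so the example runs without taobao.html
doc = "<html><body><a href='https://example.com'>link</a></body></html>"

# "lxml" requires the third-party lxml package; "html.parser" ships with Python
soup = BeautifulSoup(doc, "html.parser")
print(soup.a["href"])  # https://example.com
```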

1.2 Get tags

soup.<tagName>
Returns the first matching tag by default, or None if no such tag exists.

print(html.a)
print(html.img)
print(html.input)


soup.find(<tagName>)
Equivalent to soup.<tagName>: returns the first match by default.

print(html.find("a"))
print(html.find("img"))
print(html.find("input"))


soup.find(<tagName>, class_=<className>)
Locates tags by attribute: only tags whose class attribute matches <className> are returned.

print(html.find("div", class_="site-nav"))


soup.find_all(<tagName>)
Returns all matching tags as a list.

print(html.find_all("a"))
print(html.find_all("input"))


soup.select(<selector>)
Finds tags with a CSS selector (class, id, descendant, etc.); returns all matches as a list.

print(html.select(".bang"))
print(html.select("#J_SearchForm .search-button"))
print(html.select(".copyright"))
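Since the Taobao page queried above changes over time, here is a self-contained sketch of `select` on invented markup that mirrors the ids and classes used in the calls above:

```python
from bs4 import BeautifulSoup

# Invented markup standing in for the Taobao page queried above
doc = """
<form id="J_SearchForm">
  <div class="search-button"><button>Search</button></div>
</form>
<div class="copyright">© demo</div>
"""
soup = BeautifulSoup(doc, "html.parser")

# A class selector returns every matching tag as a list
print(soup.select(".copyright"))
# Descendant selector: tags with class "search-button" inside the element with id "J_SearchForm"
print(soup.select("#J_SearchForm .search-button"))
```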


1.3 Get the content of a tag

text / get_text(): gets all text content, including that of descendant tags
string: gets only the direct text content

print(html.find("div", class_="search-button").text)
print(html.find("div", class_="search-button").string)
print(html.find("div", class_="search-button").get_text())
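The difference between text and string shows up with nested tags; a minimal sketch (the markup is invented for illustration):

```python
from bs4 import BeautifulSoup

# Nested markup: the <div> has both direct text and a child tag
doc = "<div>outer <span>inner</span></div>"
soup = BeautifulSoup(doc, "html.parser")

print(soup.div.text)     # "outer inner": all descendant text, concatenated
print(soup.div.string)   # None: the div has more than one child node
print(soup.span.string)  # "inner": a single direct text child
```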


1.4 Get the attributes of a tag

tag[<attribute>]

print(html.a["href"])
print(html.find("div")["class"])
print(html.find_all("div")[5]["class"])
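Indexing with tag[<attribute>] raises KeyError when the attribute is missing; tag.get() is the safe variant. A sketch on invented markup:

```python
from bs4 import BeautifulSoup

# Invented markup with a multi-valued class attribute
doc = '<a href="https://example.com" class="nav main">home</a>'
tag = BeautifulSoup(doc, "html.parser").a

print(tag["href"])       # indexing raises KeyError if the attribute is missing
print(tag.get("title"))  # .get() returns None instead of raising
print(tag["class"])      # multi-valued attributes come back as a list: ['nav', 'main']
print(tag.attrs)         # the full attribute dict
```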


2. Examples

Example: crawl the chapter content of Romance of the Three Kingdoms

#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import requests
from bs4 import BeautifulSoup


if __name__ == '__main__':

    # URL, User-Agent, parameters
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0"
    }
    # Crawl the page
    html = requests.get(url=url, headers=headers, timeout=5).text
    # Load the data into a BeautifulSoup object (remote HTML file)
    soup = BeautifulSoup(html, "lxml")
    # Get the tags we want (the ones containing the chapter links)
    content = soup.select("div.book-mulu > ul > li > a")

    # Output file
    fp = open("./data/sgyy/sgyy.txt", "w", encoding="utf-8")
    fp.write("章节\t链接\t内容\n")
    for c in content:
        # Crawl the detailed content of the chapter
        href_text = requests.get(url="https://www.shicimingju.com" + c["href"], headers=headers, timeout=5).text
        # Parse the chapter body text
        href_soup = BeautifulSoup(href_text, "lxml")
        href_text = href_soup.find("div", class_="chapter_content").text
        # Write the chapter title, link, and content
        fp.write(f'{c.text}\t{"https://www.shicimingju.com" + c["href"]}\t{href_text}\n')
        print(c.text + " 添加完成")
    fp.close()
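The script above aborts if any single chapter page fails to fetch or parse. A hedged refactor of the same loop (the names `parse_chapters`, `crawl`, and the injected `fetch` parameter are my own, not from the original; "html.parser" is used so the sketch runs without lxml): fetching is passed in as a callable, each chapter is wrapped in try/except, and a delay keeps the crawl polite.

```python
import time
from bs4 import BeautifulSoup

BASE = "https://www.shicimingju.com"  # same site as the script above

def parse_chapters(index_html):
    """Extract (title, absolute URL) pairs from the book index page."""
    soup = BeautifulSoup(index_html, "html.parser")
    return [(a.text, BASE + a["href"])
            for a in soup.select("div.book-mulu > ul > li > a")]

def crawl(fetch, out_path, delay=1.0):
    """fetch(url) -> html text; injected so the sketch can be tested without the network."""
    with open(out_path, "w", encoding="utf-8") as fp:
        fp.write("章节\t链接\t内容\n")
        for title, url in parse_chapters(fetch(BASE + "/book/sanguoyanyi.html")):
            try:
                page = BeautifulSoup(fetch(url), "html.parser")
                body = page.find("div", class_="chapter_content").text
            except Exception as exc:
                # Skip one bad chapter instead of aborting the whole run
                print(f"{title} failed: {exc}")
                continue
            fp.write(f"{title}\t{url}\t{body}\n")
            time.sleep(delay)  # be polite to the server
```

With requests installed, `crawl(lambda u: requests.get(u, headers=headers, timeout=5).text, "./data/sgyy/sgyy.txt")` reproduces the behaviour of the original loop.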


Origin blog.csdn.net/YKenan/article/details/111942335