Basic use of Python crawler bs4
Disclaimer: This article is for reference only and may not be reproduced or copied. Any reader who uses its contents in violation of national laws and regulations bears all resulting consequences; the blogger accepts no responsibility for such use, nor for any disputes arising from reproduction, copying, or similar actions by readers.
import requests
from bs4 import BeautifulSoup
1. bs4 basic syntax
1.1 Get html page
Get local html page
# Read the local file
fp = open("./data/base/taobao.html", "r", encoding="UTF-8")
# Load the data into a BeautifulSoup object (local HTML file)
html = BeautifulSoup(fp, "lxml")
print(html)
Fetch a website to get the html page
# Fetch the page
response_text = requests.get(url="https://s.taobao.com/").text
# Load the data into a BeautifulSoup object (HTML from the network)
html = BeautifulSoup(response_text, "lxml")
print(html)
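Both loading paths differ only in where the markup comes from. A minimal offline sketch, using an inline HTML string (a made-up stand-in for the local file or downloaded page) and the stdlib `html.parser` backend so that neither `lxml` nor network access is required; `bs4` itself is a third-party package (`pip install beautifulsoup4`):

```python
from bs4 import BeautifulSoup

# Inline markup stands in for a local file or a downloaded page.
doc = "<html><body><p id='msg'>hello</p></body></html>"
soup = BeautifulSoup(doc, "html.parser")
print(soup.p)  # the parsed <p> tag
```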
1.2 Get tags
soup.<tagName>
Returns the first matching tag by default, or None if no such tag exists
print(html.a)
print(html.img)
print(html.input)
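The two behaviors above (first match only, None on a miss) can be seen with a small made-up document:

```python
from bs4 import BeautifulSoup

doc = "<div><a href='/1'>one</a><a href='/2'>two</a></div>"
soup = BeautifulSoup(doc, "html.parser")
print(soup.a)    # only the first <a> is returned
print(soup.img)  # no <img> in the document, so this is None
```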
soup.find(<tagName>)
Equivalent to soup.&lt;tagName&gt;: returns the first matching tag by default
print(html.find("a"))
print(html.find("img"))
print(html.find("input"))
soup.find(<tagName>, <tagName.class>)
Locate a tag by attribute: passing a class value matches only tags carrying that class
print(html.find("div", class_="site-nav"))
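Attribute positioning distinguishes between tags of the same name; a sketch with invented class names:

```python
from bs4 import BeautifulSoup

doc = ("<div class='site-nav'>nav</div>"
       "<div class='footer'>foot</div>")
soup = BeautifulSoup(doc, "html.parser")
# Only the <div> with class="footer" is matched, not the first <div>.
print(soup.find("div", class_="footer"))
```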
soup.find_all(<tagName>)
Returns all matching tags as a list
print(html.find_all("a"))
print(html.find_all("input"))
soup.select(<select>)
Find tags with CSS selectors (class, id, descendant combinators, etc.); returns all matches as a list
print(html.select(".bang"))
print(html.select("#J_SearchForm .search-button"))
print(html.select(".copyright"))
1.3 Get the content in the label
text / get_text()
Get all text content, including text nested inside child tags
string
Get only the tag's direct text content; returns None if the tag holds more than one child
print(html.find("div", class_="search-button").text)
print(html.find("div", class_="search-button").string)
print(html.find("div", class_="search-button").get_text())
1.4 Get the attributes in the label
[<attribute>]
print(html.a["href"])
print(html.find("div")["class"])
print(html.find_all("div")[5]["class"])
2. Examples
Scraping the chapter text of Romance of the Three Kingdoms
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import requests
from bs4 import BeautifulSoup
if __name__ == '__main__':
    # URL and User-Agent header
    url = "https://www.shicimingju.com/book/sanguoyanyi.html"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:84.0) Gecko/20100101 Firefox/84.0"
    }
    # Fetch the index page
    html = requests.get(url=url, headers=headers, timeout=5).text
    # Load the data into a BeautifulSoup object (HTML from the network)
    soup = BeautifulSoup(html, "lxml")
    # Select the tags that hold the chapter links
    content = soup.select("div.book-mulu > ul > li > a")
    # Output file
    fp = open("./data/sgyy/sgyy.txt", "w", encoding="utf-8")
    fp.write("Chapter\tLink\tContent\n")
    for c in content:
        # Fetch the detail page of this chapter
        href_text = requests.get(url="https://www.shicimingju.com" + c["href"], headers=headers, timeout=5).text
        # Parse the chapter's body text
        href_soup = BeautifulSoup(href_text, "lxml")
        href_text = href_soup.find("div", class_="chapter_content").text
        # Write the chapter name, link and content
        fp.write(f'{c.text}\t{"https://www.shicimingju.com" + c["href"]}\t{href_text}\n')
        print(c.text + " done")
    fp.close()
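The core of the example is the index-page selector; it can be exercised offline with a made-up snippet that imitates the structure of the shicimingju chapter list (the class name book-mulu and the URL prefix are taken from the example above, the chapter titles and paths are invented):

```python
from bs4 import BeautifulSoup

# Stand-in for the chapter index page (the real page is fetched with requests).
index_html = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>
"""
soup = BeautifulSoup(index_html, "html.parser")
# Same selector as in the full example: anchors inside the chapter list.
chapters = [(a.text, a["href"]) for a in soup.select("div.book-mulu > ul > li > a")]
for name, href in chapters:
    print(name, "https://www.shicimingju.com" + href)
```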