[python] Crawler Notes (4): Data Parsing with bs4

Focused crawler

Crawls specified content from a page.
Coding process:
specify the URL - send a request - obtain the response data - parse the data - persist the results
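The steps above can be sketched as a minimal script (the URL and output file name here are placeholders, not part of the original notes):

```python
# Minimal sketch of the crawl workflow; URL and file name are placeholders.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"                  # 1. specify the URL
response = requests.get(url)                 # 2. send the request, 3. obtain the response
soup = BeautifulSoup(response.text, "lxml")  # 4. parse the data
with open("title.txt", "w", encoding="utf-8") as f:
    f.write(soup.title.text)                 # 5. persist the extracted result
```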

Data analysis classification

  • Regular expressions
  • bs4
  • xpath

Principles of Data Analysis

The principle of bs4 data parsing:

  • Instantiate a BeautifulSoup object and load the page source into it
  • Locate tags and extract data by calling the object's related attributes and methods

First, install the dependencies:

  • pip install bs4
  • pip install lxml

How to instantiate a BeautifulSoup object
soup = BeautifulSoup(file, 'lxml')

  • from bs4 import BeautifulSoup

  • Object instantiation:

    • Load a local html document into the object
    • Load page source obtained from the network into the object
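Both instantiation styles can be sketched as follows (the file name test.html and its contents are made-up examples):

```python
from bs4 import BeautifulSoup

# Create a small local html document to load (test.html is a made-up name)
with open("test.html", "w", encoding="utf-8") as f:
    f.write("<html><body><p>hello</p></body></html>")

# 1) Load a local html document into the object
with open("test.html", encoding="utf-8") as fp:
    soup_local = BeautifulSoup(fp, "lxml")

# 2) Load page source obtained from the network (stands in for response.text)
page_text = "<html><body><p>world</p></body></html>"
soup_net = BeautifulSoup(page_text, "lxml")

print(soup_local.p.text)  # hello
print(soup_net.p.text)    # world
```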
  • Provide properties and methods for parsing:

    • soup.tagName returns the first tagName tag that appears in the html
    • soup.find('tagName')
      • equivalent to soup.tagName
      • attribute-based positioning, e.g. soup.find('div', class_='song') # note the trailing underscore in class_, because class is a Python keyword
    • soup.find_all('tagName') returns all matching tags as a list
    • soup.select('.tang') CSS selectors
      • select('some selector (id, class, or tag)') returns a list
      • soup.select('.tang > ul > li > a')[0] — '>' indicates one level down; a space indicates any number of levels: soup.select('.tang ul a')[0]
    • Get the text between tags
      • soup.a.text / soup.a.string / soup.a.get_text()
      • text and get_text() return all text content inside a tag; string only returns text that sits directly under the tag itself
    • Get a tag's attribute value
      • soup.select('.tang > ul > li > a')[0]['href']
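The positioning and extraction methods above can be exercised on a small handmade snippet (the html below is made up to mirror the '.tang' / 'song' examples in these notes):

```python
from bs4 import BeautifulSoup

# Made-up html mirroring the '.tang' / 'song' examples
html = """
<div class="song"><p>first div</p></div>
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">Li Bai</a></li>
    <li><a href="http://example.com/2">Poet <b>Du Fu</b></a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "lxml")

print(soup.find("div", class_="song").p.text)      # first div
print(len(soup.find_all("a")))                     # 2
print(soup.select(".tang > ul > li > a")[0].text)  # Li Bai ('>' = one level)
print(soup.select(".tang ul a")[1] is soup.find_all("a")[1])  # True (space = any depth)

a = soup.select(".tang > ul > li > a")[1]
print(a.get_text())  # Poet Du Fu  (all nested text)
print(a.string)      # None (text is not directly and solely under <a>)
print(a["href"])     # http://example.com/2
```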
from bs4 import BeautifulSoup
import requests
import os

# Crawl all chapter titles and contents of the novel "Romance of the Three Kingdoms"
if __name__ == "__main__":

    if not os.path.exists('三国演义'):
        os.mkdir('三国演义')

    ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    headers = {
        "User-Agent": ua,
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    chapters = soup.select('.book-mulu > ul > li > a')

    for chapter in chapters:
        file_name = './三国演义/' + chapter.text + ".txt"
        content_url = 'https://www.shicimingju.com' + chapter['href']  # link to this chapter's content

        content_page = requests.get(url=content_url, headers=headers)
        content_soup = BeautifulSoup(content_page.text, 'lxml')
        with open(file_name, 'w', encoding='utf-8') as f:
            for content_p in content_soup.select('.chapter_content p'):
                f.write("\n  ")
                f.write(content_p.text)


Origin blog.csdn.net/Sgmple/article/details/112059432