[Web crawler] Data parsing

  • Focused crawler: crawls specified portions of a page's content

        - Coding process

                - Specify the URL

                - Make the request

                - Get the response data

                - Parse the data

                - Persist the data to storage
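The five steps above can be condensed into a short sketch (the URL and the `<title>` pattern are placeholders for illustration, not part of the original notes):

```python
import re
import requests

def crawl(url):
    # 2. Make the request (a browser-like User-Agent avoids simple bot checks)
    headers = {'User-Agent': 'Mozilla/5.0'}
    # 3. Get the response data as text
    page_text = requests.get(url=url, headers=headers).text
    # 4. Parse the data: trivially grab the <title> text here
    titles = re.findall(r'<title>(.*?)</title>', page_text, re.S)
    # 5. Persist the result to storage
    with open('./result.txt', 'w', encoding='utf-8') as fp:
        fp.write(titles[0] if titles else '')
    return titles

# 1. Specify the URL, then run the pipeline:
# crawl('https://example.com')
```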

  • Data parsing methods:

                - regular expressions

                - bs4

                - xpath (***)

  • Overview of data parsing principles

                - The desired local text content is stored either between tags or in the attributes of those tags.

                - 1. Locate the specified tag

                - 2. Extract (parse) the data stored inside the tag or in the tag's attributes


 Regular expressions in action

        - Requirement: crawl the images from the Qiushibaike ("Embarrassing Encyclopedia") picture page

                - re.S: "single-line" (DOTALL) mode; makes '.' also match newlines

                - re.M: multi-line mode; makes '^' and '$' match at every line boundary

                - Binary format: response.content holds the response body as raw bytes

                - format: a powerful tool for building strings; it inserts variables, expressions, and other values into a format string at specified positions
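A toy illustration of the difference between re.S and re.M, and of format, using made-up sample strings:

```python
import re

html = 'page: <img\nsrc="a.jpg">'

# Without re.S the pattern fails, because '.' cannot cross the newline
print(re.findall(r'<img.*?src="(.*?)"', html))        # []
# With re.S ('single-line' / DOTALL) '.' also matches '\n'
print(re.findall(r'<img.*?src="(.*?)"', html, re.S))  # ['a.jpg']

# re.M makes '^' match at the start of every line, not just the string
config = 'a=1\nb=2'
print(re.findall(r'^\w+', config, re.M))              # ['a', 'b']

# format inserts values into a string at the '{}' placeholders
print('saving {} to {}'.format('a.jpg', './qiutuLibs'))  # saving a.jpg to ./qiutuLibs
```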

 Regex-based parsing: inspect the page source to find where the image link lives, then write the regular expression from the image address. Note that the image URLs in the source omit the protocol prefix.

import re
import requests
import os

# Create the output folder
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')

url = 'https://www.qiushibaike.com/pic/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}
# Use a general-purpose crawl to fetch the whole page
page_text = requests.get(url=url, headers=headers).text

# Use focused parsing to extract the images from the page
# Regular expression (note: the pattern comes first, then the string to search)
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
img_src_list = re.findall(ex, page_text, re.S)  # re.S is essential when the HTML spans multiple lines
print(img_src_list)  # list of image addresses
for src in img_src_list:
    # Build a complete image URL (the source omits the protocol)
    src = 'http:' + src
    # Fetch the image's binary data (.content stores the raw bytes)
    img_data = requests.get(url=src, headers=headers).content
    # Derive the image file name from the URL
    img_name = src.split('/')[-1]
    # Final storage path for the image
    img_path = './qiutuLibs/' + img_name
    with open(img_path, 'wb') as fp:
        fp.write(img_data)
        print(img_name, 'downloaded successfully!')

bs4 for data parsing

        - Principles of data parsing

                - 1. Locate the tag

                - 2. Extract the data stored in the tag and in its attributes

         - How bs4 applies these principles

                 - 1. Instantiate a BeautifulSoup object and load the page source code into it

                 - 2. Call the relevant properties or methods on the BeautifulSoup object to locate tags and extract data

         - Environment installation

                  - pip install bs4

                  - pip install lxml

        - How to instantiate a BeautifulSoup object

                - from bs4 import BeautifulSoup

                - Object instantiation

                        - 1. Load a local HTML document into the object

                                fp = open('./test.html', 'r', encoding='utf-8')

                                soup = BeautifulSoup(fp, 'lxml')

                        - 2. Load page source code fetched from the Internet into the object

                                page_text = response.text

                                soup = BeautifulSoup(page_text, 'lxml')

                - Methods and properties provided for data parsing

                        - soup.tagName: returns the first tag named tagName in the document

                        - soup.find():

                                - find('tagName'): equivalent to soup.tagName

                                - Locating by attribute:

                                        - soup.find('div', class_/id/attr='song')

                        - soup.find_all('tagName'): returns all matching tags as a list

                - select:

                        - select('some CSS selector (id, class, tag, ...)'): returns a list

                        - Level (hierarchy) selectors:

                                - soup.select('.tang > ul > li > a'): '>' represents one level

                                - soup.select('.tang > ul > li a'): a space represents multiple levels

                - Get text data between tags:

                        - soup.a.text/string/get_text() two attributes, one method

                        - text/get_text(): You can get all the text content in a certain tag

                        - string: Only the text content directly below the tag can be obtained

                 - Get the attribute value in the tag

                         - soup.a['href']
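These properties and methods can be exercised on a small inline snippet (the markup and the `.tang` class are made up to match the selector examples above):

```python
from bs4 import BeautifulSoup

html = '''
<div class="tang">
  <ul>
    <li><a href="http://example.com/1">first <b>poem</b></a></li>
    <li><a href="http://example.com/2">second</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.a)                                    # first <a> tag in the document
print(soup.find('a') == soup.a)                  # True: find('a') is equivalent
print(soup.find('div', class_='tang')['class'])  # attribute-based lookup
print(len(soup.find_all('a')))                   # 2: all matching tags, as a list
print(len(soup.select('.tang > ul > li > a')))   # 2: CSS level selector

a = soup.select('.tang > ul li a')[0]  # a space spans multiple levels
print(a.text)      # 'first poem': all nested text
print(a.string)    # None: the tag has a child tag, so no direct string
print(a['href'])   # 'http://example.com/1': attribute value
```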

                                      

 BeautifulSoup in action: crawling the chapters of Romance of the Three Kingdoms

import requests
from bs4 import BeautifulSoup

url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/114.0.0.0 Safari/537.36'
}

# Make the request; re-encode to work around the server's mis-declared charset
page_text = requests.get(url=url, headers=headers).text.encode('ISO-8859-1')

soup = BeautifulSoup(page_text, 'lxml')  # instantiate the object

li_list = soup.select('.book-mulu > ul > li')  # all the chapter <li> tags
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for li in li_list:
    title = li.a.string  # text of the <a> tag: the chapter title
    detail_url = 'https://www.shicimingju.com/' + li.a['href']  # URL of the chapter page
    detail_page_text = requests.get(url=detail_url, headers=headers).text.encode('ISO-8859-1')
    detail_soup = BeautifulSoup(detail_page_text, 'lxml')
    div_tag = detail_soup.find('div', class_='chapter_content')
    # Chapter content parsed out
    content = div_tag.get_text()
    # Persist to storage
    fp.write(title + ':' + content + '\n')
    print(title, 'crawled successfully!')
fp.close()


Origin blog.csdn.net/weixin_73865721/article/details/131830352