The simplest Python crawler: downloading pictures from a Tieba (post bar)


This is just a personal note for the record; feel free to skip it.

1 Import packages

import re  # the re module provides regular expression support
import urllib.request

2 Crawler settings

2.1 Set headers

The headers make the request look like it comes from a regular browser, so the server treats the crawler as a human visitor.

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75'}

2.2 Set URL

url="https://tieba.baidu.com/f?ie=utf-8&kw=%E7%8C%AB%E5%9B%BE&fr=search"

2.3 Build the request: the "fake user" headers plus the URL, stored in the variable page

page = urllib.request.Request(url, headers=headers)

2.4 Fetch the data at that address

html = urllib.request.urlopen(page).read().decode("utf-8")

The variable html now holds the entire page source returned by the URL.
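As an optional sanity check (not part of the original post), you can print the length and the first few hundred characters of html to confirm the page was fetched correctly:

print(len(html))    # size of the downloaded source
print(html[:300])   # the beginning of the page source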

3 Picture extraction

3.1 Analyze how the picture markup appears in the page source and express it as the regular expression reg

(Figure: reference screenshot of the picture markup in the page source.)

reg = r'bpic="(.+?\.jpg)" class'
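To see what this pattern captures, here is a quick test against a made-up snippet; the attribute value is hypothetical and only mimics the shape of the real markup:

sample = 'bpic="https://imgsa.baidu.com/forum/pic/item/abc.jpg" class="threadlist_pic"'
print(re.findall(r'bpic="(.+?\.jpg)" class', sample))
# prints: ['https://imgsa.baidu.com/forum/pic/item/abc.jpg']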

3.2 Find every reg match in the page source

imgre = re.compile(reg)            # compile the pattern first
imglist = re.findall(imgre, html)

imglist now holds every captured picture address, i.e. every part of the source that matches the reg pattern.
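Another optional check (not in the original post) is to print how many addresses were found and look at the first few entries, which should all be absolute .jpg URLs:

print(len(imglist))   # number of picture addresses matched
print(imglist[:3])    # the first few captured URLs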

3.3 Save the pictures

Use urllib.request.urlretrieve to download each address to a local file:

x = 0
for imgurl in imglist:
    urllib.request.urlretrieve(imgurl, r'E:\python\cat\%s.jpg' % x)
    x += 1

Each matched address (imgurl) in imglist is downloaded and saved locally, with the files named 0.jpg, 1.jpg, ... by the counter x.
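A slightly more defensive variant (my own sketch, not from the original post) creates the target folder first and skips any address that fails to download; it assumes imglist and urllib.request from the steps above:

import os

save_dir = r'E:\python\cat'
os.makedirs(save_dir, exist_ok=True)   # urlretrieve does not create directories

for x, imgurl in enumerate(imglist):
    try:
        urllib.request.urlretrieve(imgurl, os.path.join(save_dir, '%s.jpg' % x))
    except Exception as e:
        print('failed to download', imgurl, e)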

Full source code

import re  # the re module provides regular expression support
import urllib.request  # urllib.request provides the interface for fetching web page data


# Fetch the page source for a given URL
def getHtml(url):
    print('start-gethtml')
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75'}
    page = urllib.request.Request(url, headers=headers)          # request carrying browser-like headers
    html = urllib.request.urlopen(page).read().decode("utf-8")   # open the URL and read the page source
    return html


# Download the pictures
def getImg(html):
    reg = r'bpic="(.+?\.jpg)" class'   # regular expression that captures the picture address
    imgre = re.compile(reg)            # re.compile() turns the pattern into a regex object
    imglist = re.findall(imgre, html)  # re.findall() collects every imgre match in html
    # Loop over the captured addresses and save them locally.
    # urllib.request.urlretrieve() downloads remote data straight to a local file;
    # the files are named by the incrementing counter x.
    x = 0
    for imgurl in imglist:
        urllib.request.urlretrieve(imgurl, r'E:\python\cat\%s.jpg' % x)
        x += 1


if __name__ == '__main__':
    html = getHtml("https://tieba.baidu.com/f?ie=utf-8&kw=%E7%8C%AB%E5%9B%BE&fr=search")
    getImg(html)
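Note that urlretrieve does not create directories, so the folder E:\python\cat must already exist before the script is run (or be created with os.makedirs, as in the variant above).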

Results

(Figure: the downloaded pictures saved in the target folder.)

Source: blog.csdn.net/weixin_41529093/article/details/113111169