Simple crawler: crawl 100 4K animation pictures

The main steps:

  1. Send a request to the target URL
  2. Get the response data
  3. Parse the data
  4. Store the data
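Those four steps map onto only a handful of calls with requests and lxml; a minimal sketch of the overall flow (the target URL and the parsing and saving details are filled in step by step below):

import requests
from lxml import etree

# 1. Send a request to the target URL
response = requests.get(url='http://pic.netbian.com/4kdongman/')
# 2. Get the response data
page_text = response.text
# 3. Parse the data (the XPath rules are worked out below)
tree = etree.HTML(page_text)
# 4. Store the data: download each image and write it to disk (shown below)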

4K picture address
1. Analyze the URL of each page and get the response data

The first page URL: http://pic.netbian.com/4kdongman/
The second page URL: http://pic.netbian.com/4kdongman/index_2.html
The third page URL: http://pic.netbian.com/4kdongman/index_3.html

Except for the first page, the URLs differ only in the index_ number. We want to crawl the first five pages of pictures.
Send a request to each page URL and get the response data.

url = 'http://pic.netbian.com/4kdongman/index_%d.html'
# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
}
for page in range(1, 6):    # crawl the first five pages
    if page == 1:
        new_url = 'http://pic.netbian.com/4kdongman/'
    else:
        new_url = url % page
    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'gbk'    # set the encoding of the response data
    page_text = response.text
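Hard-coding 'gbk' works because this particular site serves GBK-encoded pages. If you are unsure of a page's charset, requests can also guess it from the response body; a small alternative for that one line:

# Let requests detect the encoding from the content instead of hard-coding 'gbk'
response.encoding = response.apparent_encoding
page_text = response.text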

2. Analyze the webpage and find the image names and URLs.
After opening the page source, it is clear that the image address and image name live in the img tag, and each picture's information sits in its own li tag. The image address is relative, so the site's domain must be prepended. We use XPath to get li_list first, and then loop over it to get the information of each picture.
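For reference, the markup the XPath expressions below target looks roughly like this simplified, illustrative snippet (the real page contains more attributes and nesting):

from lxml import etree

# Simplified stand-in for the page structure (for illustration only)
sample_html = '''
<ul class="clearfix">
    <li><a href="/tupian/1.html"><img src="/uploads/a.jpg" alt="Picture A"></a></li>
    <li><a href="/tupian/2.html"><img src="/uploads/b.jpg" alt="Picture B"></a></li>
</ul>
'''
tree = etree.HTML(sample_html)
for li in tree.xpath('//ul[@class="clearfix"]/li'):
    print(li.xpath('./a/img/@src')[0], li.xpath('./a/img/@alt')[0])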

    # Parse the data
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@class="clearfix"]/li')    # list of li tags
    img_list = []
    img_name_list = []
    for li in li_list:
        img_list.append(li.xpath('./a/img/@src')[0])    # image address
        img_name_list.append(li.xpath('./a/img/@alt')[0])    # image name

    # Build the full image URLs
    img_url_list = []
    for img_url in img_list:
        img_url_list.append('http://pic.netbian.com' + img_url)
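Plain string concatenation works here because every src starts with a slash; urljoin from the standard library is a slightly safer way to build the absolute URLs and could replace the loop above:

from urllib.parse import urljoin

# Optional alternative: build absolute URLs with urljoin
img_url_list = [urljoin('http://pic.netbian.com', img_url) for img_url in img_list]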

3. Extract the picture data and save it to a local folder.
Using the image addresses obtained above, request each picture's binary data and write it to a file. First create a folder to hold the pictures.

# Create the folder in the current directory (done once, before the page loop)
isExists = os.path.exists('./4ktupian')
if not isExists:
    os.makedirs('./4ktupian')

    # Extract the picture data (inside the page loop)
    for i in range(len(img_url_list)):
        img_data = requests.get(url=img_url_list[i], headers=headers).content
        filePath = './4ktupian/' + img_name_list[i] + '.jpg'
        with open(filePath, 'wb') as fp:
            fp.write(img_data)
        print('%s downloaded successfully' % img_name_list[i])
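For larger wallpaper files, a streaming download writes each image in chunks instead of loading the whole file into memory at once. A minimal variant of the extraction loop, reusing the same img_url_list, img_name_list and headers as above:

# Streaming variant of the extraction loop (optional)
for img_url, img_name in zip(img_url_list, img_name_list):
    with requests.get(url=img_url, headers=headers, stream=True) as resp:
        with open('./4ktupian/' + img_name + '.jpg', 'wb') as fp:
            for chunk in resp.iter_content(chunk_size=8192):
                fp.write(chunk)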

Complete code

import requests
from lxml import etree
import os

# Create the folder to save the pictures
isExists = os.path.exists('./4ktupian')
if not isExists:
    os.makedirs('./4ktupian')

url = 'http://pic.netbian.com/4kdongman/index_%d.html'
# Set the request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'
}
for page in range(1, 6):    # crawl the first five pages
    if page == 1:
        new_url = 'http://pic.netbian.com/4kdongman/'
    else:
        new_url = url % page
    response = requests.get(url=new_url, headers=headers)
    # Set the encoding of the response data
    response.encoding = 'gbk'
    page_text = response.text

    # Parse the data
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@class="clearfix"]/li')
    img_list = []
    img_name_list = []
    for li in li_list:
        img_list.append(li.xpath('./a/img/@src')[0])
        img_name_list.append(li.xpath('./a/img/@alt')[0])

    # Build the full image URLs
    img_url_list = []
    for img_url in img_list:
        img_url_list.append('http://pic.netbian.com' + img_url)

    # Extract the picture data
    for i in range(len(img_url_list)):
        img_data = requests.get(url=img_url_list[i], headers=headers).content
        filePath = './4ktupian/' + img_name_list[i] + '.jpg'
        with open(filePath, 'wb') as fp:
            fp.write(img_data)
        print('%s downloaded successfully' % img_name_list[i])
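One optional tweak, not part of the original script: pausing briefly between page requests is gentler on the server and makes it less likely the crawler gets blocked.

import time

# Optional: add at the end of the page loop body
time.sleep(1)    # wait one second before requesting the next page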

Result display: (screenshot of the downloaded pictures omitted)

Origin: blog.csdn.net/qq_43965708/article/details/108963190