Crawler - downloading images from a website

1. Use the Python requests library to crawl website images

Target website: http://www.pes-stars.co.ua/?page1/

  1. After opening the site, press F12 to open the developer tools, select [Elements], and hover over the image you want to download to inspect its markup.
  2. Find the User-Agent and Referer values: switch to the [Network] tab, click a request, and [User-Agent] appears at the bottom of the request headers pane; copy it into the headers dict.
  3. Code:
import requests
from lxml import etree

url = 'http://www.pes-stars.co.ua/?page1/'

headers = {
    "Referer": "                    ",  # paste the Referer copied from DevTools here
}

resq = requests.get(url, headers=headers)
print(resq)  # <Response [200]> if the request succeeded

html = etree.HTML(resq.text)
srcs = html.xpath(".//img/@src")

for i in srcs:
    imgname = i.split('/')[-1]          # use the last path segment as the filename
    img = requests.get(i, headers=headers)
    with open('imgs1/' + imgname, 'wb') as file:
        file.write(img.content)
    print(i, imgname)
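One detail the script above relies on: it writes into an imgs1/ folder but never creates it, so open() raises FileNotFoundError on a fresh machine. A small sketch that creates the folder first (the folder name imgs1 is taken from the code above):

```python
import os

# Create the output folder if it does not exist yet;
# exist_ok=True makes repeated runs harmless.
os.makedirs('imgs1', exist_ok=True)
```

Run this once before the download loop.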

2. Error troubleshooting

After the first few banner images downloaded, an error was raised as soon as the script reached the face images I actually wanted. Comparing the [src] of each image shows the cause: the images that downloaded successfully have complete URLs, while the failing ones are site-relative [src] values with the http://host part missing. Prepending the site address manually fixes the links; the code becomes:

import requests
from lxml import etree

url = 'http://www.pes-stars.co.ua/?page1/'

headers = {
    "Referer": "http://www.pes-stars.co.ua/?page1/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70",
}

resq = requests.get(url, headers=headers)
print(resq)

html = etree.HTML(resq.text)
srcs = html.xpath(".//img/@src")

for i in srcs:
    print('1---i:    ', i)
    imgname = i.split('/')[-1]
    j = 'http://www.pes-stars.co.ua' + i   # prepend the missing scheme and host
    print('2---j:    ', j)
    try:
        img = requests.get(j, headers=headers)
        with open('imgs1/' + imgname, 'wb') as file:
            file.write(img.content)
        print(i, imgname)
    except requests.exceptions.ConnectionError:
        print('Connection refused:', j)   # skip this image and keep going

The images now download successfully.
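Instead of prepending the host by hand, the standard library's urllib.parse.urljoin resolves relative, protocol-relative, and already-absolute src values uniformly against the page URL. A sketch, using the page URL from above as the base (the image path is made up for illustration):

```python
from urllib.parse import urljoin

base = 'http://www.pes-stars.co.ua/?page1/'

# A site-relative src gains the scheme and host from the base page URL.
full = urljoin(base, '/images/player.jpg')
print(full)  # http://www.pes-stars.co.ua/images/player.jpg

# An already-absolute src passes through unchanged.
print(urljoin(base, 'http://example.com/a.jpg'))  # http://example.com/a.jpg
```

In the loop this would replace the manual string concatenation: `j = urljoin(url, i)`.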

3. Crawl multiple pages of images

Looking at the URL of each page, you can see that only the trailing number differs:

http://www.pes-stars.co.ua/?page1
http://www.pes-stars.co.ua/?page2
http://www.pes-stars.co.ua/?page3
...

Wrap the URL in a for loop to choose how many pages to download:
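Since only the trailing number changes, the page URLs can also be generated up front, for example with a list comprehension (300 pages assumed here, matching the loop range used below):

```python
# Build the URL for every page; only the trailing number changes.
page_urls = [f'http://www.pes-stars.co.ua/?page{n}' for n in range(1, 301)]
print(page_urls[0])   # http://www.pes-stars.co.ua/?page1
print(page_urls[-1])  # http://www.pes-stars.co.ua/?page300
```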


# -*- coding:utf-8 -*-
# Author: 自学小白菜
'''
Since you've chosen to move forward, take every step seriously.
'''

import requests
from lxml import etree


for m in range(1, 301):     # crawl pages 1-300
    url = 'http://www.pes-stars.co.ua/?page' + str(m)
    print('url:  ', url)

    headers = {
        "Referer": "http://www.pes-stars.co.ua/?page" + str(m),
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.5112.102 Safari/537.36 Edg/104.0.1293.70",
    }

    resq = requests.get(url, headers=headers)
    print('resq:  ', resq)

    html = etree.HTML(resq.text)
    srcs = html.xpath(".//img/@src")

    for i in srcs:
        if str(i).endswith('.jpg'):     # only download the jpg images
            imgname = i.split('/')[-1]
            j = 'http://www.pes-stars.co.ua' + i   # prepend the missing scheme and host
            try:
                img = requests.get(j, headers=headers)
                with open('imgs1/' + imgname, 'wb') as file:
                    file.write(img.content)
                print(i, imgname)
            except requests.exceptions.ConnectionError:
                print('Connection refused:', j)   # skip this image and keep going
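When crawling hundreds of pages it is polite, and less likely to get the client blocked, to pause between requests. A minimal sketch of such a helper (the half-second default delay is an arbitrary choice, not something the site documents):

```python
import time

def polite_get(session, url, delay=0.5, **kwargs):
    """Fetch a URL via the given session, then pause so requests are spaced out."""
    resp = session.get(url, **kwargs)
    time.sleep(delay)  # arbitrary delay; tune to the site's tolerance
    return resp
```

In the loop above this would be used as `resq = polite_get(session, url, headers=headers)`, with a single `requests.Session()` created outside the loop so the TCP connection is reused across pages.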


Origin blog.csdn.net/everyxing1007/article/details/126670463