Python crawler: downloading pictures from Jandan (jandan.net)

Today, let's try crawling pictures from the Jandan site.

Packages used:

urllib.request, os

A few functions handle the work: controlling how many pages of images to download, fetching the pages, extracting the image URLs, and saving the images locally. The flow is simple and clear.

Straight to the source code:

import urllib.request
import os


def url_open(url):
    req = urllib.request.Request(url)
    req.add_header('user-agent', 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36')
    response = urllib.request.urlopen(req)  # pass the Request object so the header is actually sent
    html = response.read()

    return html

def get_page(url):
    html = url_open(url).decode('utf-8')

    a = html.find('current-comment-page') + 23  # skip 'current-comment-page">[' (20 + 3 chars)
    b = html.find(']',a)

    return html[a:b]


def find_imgs(url):
    html = url_open(url).decode('utf-8')
    img_addrs = []

    a = html.find('img src=')

    while a != -1:
        b = html.find('.jpg',a ,a+255)
        if b != -1:
            img_addrs.append('https:'+html[a+9:b+4]) # a+9 skips 'img src="'; b+4 keeps the '.jpg'
        else:
            b = a+9
        a = html.find('img src=', b)

    return img_addrs


def save_imgs(folder, img_addrs):
    for each in img_addrs:
        filename = each.split('/')[-1]
        with open(filename, 'wb') as f:
            img = url_open(each)
            f.write(img)
        print(img_addrs)  # debug output, added while chasing the 404 errors

def download_mm(folder='xxoo', pages=5):
    os.makedirs(folder, exist_ok=True)  # don't crash if the folder already exists
    os.chdir(folder)

    url = 'http://jandan.net/ooxx/'
    page_num = int(get_page(url))

    for i in range(pages):
        page_url = url + 'page-' + str(page_num - i) + '#comments'  # count back one page per iteration
        img_addrs = find_imgs(page_url)
        save_imgs(folder, img_addrs)



if __name__ == '__main__':
    download_mm()
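As a quick sanity check of the slicing in get_page(): 'current-comment-page' is 20 characters and is assumed (per Jandan's page source) to be followed by '">[', so the +23 offset lands on the first digit of the page number. A minimal sketch with a made-up fragment:

```python
def extract_page_num(html):
    # Skip 'current-comment-page' (20 chars) plus '">[' (3 chars) to reach
    # the first digit; the closing ']' marks the end of the number.
    a = html.find('current-comment-page') + 23
    b = html.find(']', a)
    return html[a:b]

sample = '<span class="current-comment-page">[123]</span>'
print(extract_page_num(sample))  # 123
```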

 

In the main function download_mm(), pages is set to 5.

It was originally 10, but a 404 ERROR occurred while the program ran,

which means some image URL was wrong. I added a debug line to save_imgs() to inspect it: print(img_addrs).
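Besides printing the URLs, it also helps to make the downloader tolerant of bad URLs instead of crashing. A minimal sketch (url_open_safe is a hypothetical helper, not part of the original script) that skips a 404 instead of aborting the whole run:

```python
import urllib.request
import urllib.error

def url_open_safe(url):
    """Fetch a URL with a browser User-Agent; return None on HTTP errors (e.g. 404)."""
    req = urllib.request.Request(url)
    req.add_header('user-agent', 'Mozilla/5.0')
    try:
        with urllib.request.urlopen(req) as response:
            return response.read()
    except urllib.error.HTTPError as e:
        print('skipping', url, '->', e.code)  # e.g. 404 for a malformed image URL
        return None
```

In save_imgs(), a None result would then simply be skipped rather than written to disk.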


I suspected that on the later pages the img_url format had changed, causing the 404, so I changed pages to 5.

Running again, everything was fine and the pictures downloaded:


Looking carefully, it is exactly the pictures after the 5th page that hit the problem (404) and cannot be downloaded. So on Jandan, jump ahead to the 6th page and inspect the image URLs there.

 

(Screenshots: the image URL after page 5 vs. the image URL before page 5.)

In the source code, the image URLs are located with find(), scanning for <img src='...'> plus a '.jpg' suffix. After page 5 the markup uses 'a href' instead, so the pattern no longer matches, which results in the 404 ERROR. To download the later pages, the URL search needs to be re-targeted.

 

That is, change 'img src' to 'a href' in the find() calls, and adjust the offsets to match.
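For example, the scan could be re-keyed like this (a minimal sketch; the exact 'a href="' markup and offsets are assumptions about the later pages' HTML, which I haven't verified in full):

```python
def find_imgs_href(html):
    # Same find()-based scan as find_imgs(), but keyed on 'a href="'.
    # 'a href="' is 8 characters, so the URL itself starts at a + 8.
    img_addrs = []
    a = html.find('a href="')
    while a != -1:
        b = html.find('.jpg', a, a + 255)
        if b != -1:
            img_addrs.append('https:' + html[a+8:b+4])  # b+4 keeps the '.jpg'
        else:
            b = a + 8
        a = html.find('a href="', b)
    return img_addrs
```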

 

Summary:

Locating page tags with find() is too low-level. I'd like to use regular expressions and the BeautifulSoup package in my crawlers to be more efficient, but I'm not particularly familiar with either yet, so more practice is needed.
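As a first step in that direction, here is a sketch of the regex approach (the src="//...jpg" pattern is an assumption about how the images are embedded; BeautifulSoup's find_all('img') would do a similar job):

```python
import re

def find_imgs_re(html):
    # Capture protocol-relative .jpg URLs from src="..." attributes,
    # regardless of the surrounding markup.
    pattern = re.compile(r'src="(//[^"]+?\.jpg)"')
    return ['https:' + addr for addr in pattern.findall(html)]
```

This replaces the manual offset arithmetic entirely, so a change from img src to a href only requires editing the pattern.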

 


Origin www.cnblogs.com/lesliechan/p/11494811.html