Python: manually adding a urlretrieve-style file download method to the requests module

The urllib module is the standard-library predecessor of requests. For passing in headers, cookies, POST data and so on, requests is certainly the more convenient of the two, but it has no equivalent of urllib.request.urlretrieve:

urlretrieve(url, filename=None, reporthook=None, params=None)

Pass in a URL and a file path and the file gets downloaded. With requests you have to hand-write the download loop every single time, which I find too cumbersome, and urlretrieve also supports a progress callback, so I tried porting this urlretrieve method over to the requests module.

Key points:

1. How do you find the Python module you want? Type `path` at the cmd prompt, pick out the Python entry, open that folder and search it with Ctrl+F.
2. Downloading a file boils down to: open the URL with contextlib.closing ---> open the target file with `with open` ---> write the chunks.
3. The reporthook callback works by passing out three arguments each time a block is written (the number of bytes per write, the number of writes so far, and the total size taken from the headers) and letting the callback do whatever it likes with them.
4. Modern packages define their functions in separate .py files and then re-export them through __init__.py.
5. How to use r.iter_content().
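For point 1, a quicker alternative to searching the install directory by hand is to ask Python itself: every imported module exposes its location through the `__file__` attribute. A minimal sketch (shown with the standard-library urllib.request since it is always present; the same trick works for a third-party package such as requests):

```python
import os
import urllib.request

# A module records where it was loaded from:
print(urllib.request.__file__)                    # path to request.py
print(os.path.dirname(urllib.request.__file__))   # the containing package folder
```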


Go into the urllib folder and find the urlretrieve method in request.py. It looks like this:


def urlretrieve(url, filename=None, reporthook=None, data=None):
    """
    Retrieve a URL into a temporary location on disk.

    Requires a URL argument. If a filename is passed, it is used as
    the temporary file location. The reporthook argument should be
    a callable that accepts a block number, a read size, and the
    total file size of the URL target. The data argument should be
    valid URL encoded data.

    If a filename is passed and the URL points to a local resource,
    the result is a copy from local file to new file.

    Returns a tuple containing the path to the newly created
    data file as well as the resulting HTTPMessage object.
    """
    url_type, path = splittype(url)       # parse out the URL scheme; not relevant here

    with contextlib.closing(urlopen(url, data)) as fp: # open the URL
        headers = fp.info()              # response headers

        # Just return the local path and the "headers" for file://
        # URLs. No sense in performing a copy unless requested.
        if url_type == "file" and not filename:
            return os.path.normpath(path), headers # not relevant here

        # Handle temporary file setup.
        if filename:
            tfp = open(filename, 'wb')                # open the output file
        else:
            tfp = tempfile.NamedTemporaryFile(delete=False) # not relevant here
            filename = tfp.name
            _url_tempfiles.append(filename)

        with tfp:
            result = filename, headers
            bs = 1024*8                              # bytes written per block
            size = -1
            read = 0
            blocknum = 0                             # blocks written so far; blocknum * bs is the amount written
            if "content-length" in headers:
                size = int(headers["Content-Length"]) # size is the total file size

            if reporthook:
                reporthook(blocknum, bs, size)       # call the hook once before writing

            while True:
                block = fp.read(bs)
                if not block:
                    break
                read += len(block)
                tfp.write(block)                     # write the block
                blocknum += 1
                if reporthook:
                    reporthook(blocknum, bs, size)   # call the hook after each block


    if size >= 0 and read < size:
        raise ContentTooShortError(
            "retrieval incomplete: got only %i out of %i bytes"
            % (read, size), result)

    return result
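To make the reporthook contract from point 3 concrete: the hook receives (blocks written, block size, total size) and computes progress from them. A minimal sketch, with `format_progress` as a hypothetical helper name:

```python
def format_progress(blocknum, bs, size):
    # blocknum: blocks written so far; bs: block size in bytes;
    # size: total size from the headers, or -1 when unknown
    read = blocknum * bs
    if size > 0:
        per = min(100.0, read * 100.0 / size)   # clamp: the last block may overshoot
        return "%.2f%% (%d of %d bytes)" % (per, read, size)
    return "%d bytes read" % read

# Simulate the calls for a 20480-byte file downloaded in 8192-byte blocks:
for n in range(4):
    print(format_progress(n, 8192, 20480))
```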


By contrast, the usual way to download a file with requests looks like this:

from contextlib import closing
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
with closing(requests.get(url=target, stream=True, headers=headers)) as r:
    with open('%d.jpg' % filename, 'wb') as f:  # filename is an int counter here; 'wb' overwrites, 'ab' appends
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)
                f.flush()

Now let's wrap that up as a function:

import contextlib
import requests

def urlretrieve(url, filename=None, reporthook=None, params=None):
    '''Stream-download url to filename with closing + iter_content, urllib-style.'''
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36'}
    with contextlib.closing(requests.get(url, stream=True, headers=headers, params=params)) as fp:  # open the URL
        header = fp.headers                                         # response headers
        with open(filename, 'wb') as tfp:   # 'w' overwrites, 'a' appends  # open the output file
            bs = 1024
            size = -1
            blocknum = 0
            if "content-length" in header:
                size = int(header["Content-Length"])                # nominal total size of the file
            if reporthook:
                reporthook(blocknum, bs, size)                      # call the hook once before writing

            for chunk in fp.iter_content(chunk_size=bs):
                if chunk:
                    tfp.write(chunk)                                # write the chunk
                    tfp.flush()
                    blocknum += 1
                    if reporthook:
                        reporthook(blocknum, bs, size)              # call the hook after each chunk


Testing it:

import sys

def Schedule(a, b, c):
    # a = number of blocks written, b = block size in bytes, c = total file size
    per = 100.0 * a * b / c
    if per > 100:
        per = 100
    sys.stdout.write("  " + "%.2f%% downloaded: %d total size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url='https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
urlretrieve(url=url,filename='111.jpg',reporthook=Schedule)

OK, it works.



Now let's put this method into the requests module itself.

First, in the requests package folder, append the function we just wrote to the end of api.py,

and add `import contextlib` at the top of that file.

Then, in __init__.py, append urlretrieve to the names on the `from .api import ...` line.
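Editing files under site-packages works, but the change is wiped out whenever requests is upgraded or reinstalled. A less invasive alternative is to monkey-patch at runtime from your own code. A sketch, compressing the function above and assuming no custom User-Agent is needed:

```python
import contextlib
import requests

def urlretrieve(url, filename=None, reporthook=None, params=None):
    # Streaming download with an optional urllib-style progress callback.
    with contextlib.closing(requests.get(url, stream=True, params=params)) as r:
        size = int(r.headers.get("Content-Length", -1))
        bs = 1024
        if reporthook:
            reporthook(0, bs, size)
        with open(filename, 'wb') as f:
            for blocknum, chunk in enumerate(r.iter_content(chunk_size=bs), start=1):
                if chunk:
                    f.write(chunk)
                    if reporthook:
                        reporthook(blocknum, bs, size)

# Attach the function to the imported module object -- no package files modified:
requests.urlretrieve = urlretrieve
```

Any script that runs this patch first can then call requests.urlretrieve(...) without touching api.py or __init__.py.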



OK, now it can be called directly:

import requests, sys

def Schedule(a, b, c):
    per = 100.0 * a * b / c  # a = number of blocks written, b = block size in bytes, c = total file size
    if per > 100:
        per = 100
    sys.stdout.write("  " + "%.2f%% downloaded: %d total size: %d" % (per, a * b, c) + '\r')
    sys.stdout.flush()

url='https://images.unsplash.com/photo-1503025768915-494859bd53b2?ixlib=rb-0.3.5&q=85&fm=jpg&crop=entropy&cs=srgb&dl=tommy-344440-unsplash.jpg&s=1382cd0338e13f6460ed68182d35cac9'
requests.urlretrieve(url=url,filename='111.jpg',reporthook=Schedule)








Reposted from blog.csdn.net/qq_38282706/article/details/80253447