How a Python crawler downloads large files, and how to pause, resume, and cancel a crawl in progress!

Recently I have been working on my graduation project, which involves crawling large files. While writing it I ran into a series of problems and technical difficulties. I thought I could just ask a certain well-known search engine, but found the spirit of open source there a bit lacking!



What you need to know about crawling large files

The first thing to realize is that, because of memory constraints, crawling large files requires a particular technique. Which one? Resumable downloading (断点续传, literally "breakpoint resume").

We usually don't need it for small files, because the computer's memory is more than enough to squander; large files are another story.

Resumable downloading means saving a large file to disk in chunks as it arrives. Normally, when we download a file, no matter how large it is, the computer waits until the entire file has been received and then writes it to the hard disk in a single pass.

If the file is too large, memory cannot bear it, so chunked storage is essential. Chunked storage means defining a chunk size in advance; every time the data received reaches that size, the chunk is written to the hard disk. A large file is thus written in many small passes, which greatly reduces the pressure on memory.

Below is my own Python implementation of a resumable download:

import requests

url = 'http://www.example.com/bigfile.mp4'  # the file to crawl (placeholder)
file_name = 'bigfile.mp4'                   # local save path (placeholder)
start, end = 0, 1024 * 1024 - 1             # the byte range to request
mode = 'wb'                                 # 'wb' for a new file, 'ab' to append to a partial one

header = {
    # 1. 'Range' tells the server which slice of the file you want to crawl:
    #    start is the first byte, end is the last byte (inclusive)
    'Range': 'bytes=%d-%d' % (start, end),
    'Accept': 'application/json, text/javascript, */*; q=0.01',
    'Accept-Encoding': 'gzip, deflate',
    'content-type': 'application/json',
    'x-requested-with': 'XMLHttpRequest',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"}
# 2. stream=True is required: requests then fetches the body piece by
#    piece as you iterate, instead of loading it all into memory at once
response = requests.get(url=url, headers=header, stream=True)
# a server that honours the Range header replies 206 Partial Content;
# a plain 200 means it ignored the range and sent the whole file
if response.status_code in (200, 206):
    with open(file_name, mode) as f:
        # 3. chunk_size=1024 writes to disk every time 1 KB of data has
        #    been received; tune it to your own situation
        for data in response.iter_content(chunk_size=1024):
            f.write(data)
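
In practice, to resume an interrupted download you do not hard-code start and end: you take the size of the partial file already on disk as the starting offset and append to it. Here is a minimal sketch of that idea (the function name and arguments are mine for illustration, not part of the project's code):

import os
import requests

def resume_download(url, file_name, chunk_size=1024):
    # how many bytes we already have on disk decides where to resume from
    start = os.path.getsize(file_name) if os.path.exists(file_name) else 0
    # an open-ended range ('bytes=1000-') asks for everything from that
    # offset to the end of the file, which is exactly what a resume needs
    headers = {'Range': 'bytes=%d-' % start}
    response = requests.get(url, headers=headers, stream=True)
    response.raise_for_status()
    if response.status_code == 206:   # server honoured the range: append
        mode = 'ab'
    else:                             # server ignored it: start over
        mode = 'wb'
    with open(file_name, mode) as f:
        for data in response.iter_content(chunk_size=chunk_size):
            f.write(data)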

Pausing, resuming, and cancelling during a crawl

If we are crawling a file and want to add some actions to the crawl in progress, such as pause, resume, and cancel, how can these functions be implemented?

The following is code that implements these functions, arrived at through continual trial and error (it may not be the best code, so please feel free to offer advice!)

from dlpackage import model_download as dm
from dlpackage import requests_header
from dlpackage import setting
from dlpackage import share
import threading
import time
import os





# Global variables
pause_flag = False            # pause flag
cancel_flag = False           # cancel-download flag
length = None                 # total size of the file being downloaded (saved on pause, reused on resume)
mode = None                   # file write mode ('wb' for a new file, 'ab' to append)
file_status = False           # whether the file handle has been closed
file_path_name = None         # path where the file is being saved


# Pause / resume the download
def pause_or_continue():
    global pause_flag
    if share.m3.m.get() == '暂停下载':    # button text reads "Pause download"
        share.m3.m.set('继续下载')        # change it to "Continue download"
        pause_flag = True
    else:
        share.m3.m.set('暂停下载')        # change it back to "Pause download"
        pause_flag = False
        m3u8_href = share.m3.button_url.get().rstrip()
        video_name = share.m3.button_video_name.get().rstrip()
        t = threading.Thread(
            target=download_file, args=(
                m3u8_href, video_name,))
        # daemon thread: the process need not wait for it on exit
        t.daemon = True
        t.start()

# Thread that performs the cancel
def cancel_thread():
    t = threading.Thread(
        target=cancel)
    # daemon thread: the process need not wait for it on exit
    t.daemon = True
    t.start()


# Cancel the download
def cancel():
    global pause_flag
    global cancel_flag
    global file_status
    global file_path_name
    if not pause_flag:
        cancel_flag = True
        # wait until download_file() has closed its file handle and set
        # file_path_name; if it hasn't yet, give it a little more time
        while not file_status or file_path_name is None:
            time.sleep(1)
        # remove the partially downloaded file
        os.remove(file_path_name)
        # reset the progress bar to zero
        share.set_progress(0)
        share.m3.show_info("Cancelled!")
        cancel_flag = False
        file_status = False
        file_path_name = None

    else:
        # already paused, so the file handle is closed; just clean up
        os.remove(file_path_name)
        share.set_progress(0)
        share.m3.m.set('暂停下载')    # reset the button text to "Pause download"
        pause_flag = False
        share.m3.show_info("Cancelled!")


# Download the file
def download_file(m3u8_href, video_name):
    global length
    global cancel_flag
    global mode
    global file_path_name
    global file_status
    share.m3.clear_alert()
    video_name = share.check_video_name(video_name)
    video_name = setting.path + "/" + video_name  # join the save directory and the file name
    video_name = video_name + share.m3.button_url.get()[share.m3.button_url.get(
    ).rfind('.'):share.m3.button_url.get().rfind('.') + 4]  # append the file's extension to the full path
    chunk_size = 512
    # this file has been (partially) downloaded before, so resume it
    if os.path.exists(
            video_name):
        response = dm.easy_download(
            url=m3u8_href,
            stream=True,
            # a 'Range': 'bytes=...' header in the request resumes the download
            header=requests_header.get_user_agent1(
                os.path.getsize(video_name)))
        mode = 'ab'
        size = os.path.getsize(video_name)
        content_size = length
    else:
        response = dm.easy_download(
            url=m3u8_href,
            stream=True,
            header=requests_header.get_user_agent())
        mode = 'wb'
        size = 0
        content_size = int(response.headers['content-length'])
    share.m3.alert('[File size]: %0.2f MB' % (content_size / 1024 / 1024))
    with open(video_name, mode) as f:
        for data in response.iter_content(chunk_size=chunk_size):
            try:
                # neither paused nor cancelled: keep downloading normally
                if not pause_flag and not cancel_flag:
                    f.write(data)
                    size = size + len(data)
                    p = (size / content_size) * 100
                    share.set_progress(p)
                    share.m3.str.set('%.2f%%' % p)
                # the download was cancelled
                elif cancel_flag:
                    f.close()
                    file_status = f.closed
                    file_path_name = video_name
                    break
                # the download was paused
                elif pause_flag:
                    f.close()
                    length = content_size  # remember the total size for the resume
                    file_path_name = video_name
                    break
            except BaseException:
                share.m3.show_info("Download error!")
    if size == content_size:
        share.m3.alert('Download finished!')
        share.m3.show_info("Download finished!")
        share.set_progress(0)
        share.m3.str.set('')

The code above is excerpted from my graduation project; some of the imported packages (the dlpackage modules) are my own, but that does not get in the way of reading the pause, resume, and cancel logic.
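
Because it depends on those packages, the excerpt will not run on its own. For readers who want something self-contained to experiment with, the same pause/resume/cancel idea can be sketched with nothing but requests and threading.Event; this simplified stand-in (the names and URL are placeholders) is not the project's actual code:

import os
import threading
import requests

pause_event = threading.Event()    # set -> downloading may proceed; cleared -> paused
cancel_event = threading.Event()   # set -> abort and delete the partial file
pause_event.set()

def download(url, file_name, chunk_size=1024):
    # resume from however many bytes are already on disk
    start = os.path.getsize(file_name) if os.path.exists(file_name) else 0
    response = requests.get(url, headers={'Range': 'bytes=%d-' % start}, stream=True)
    mode = 'ab' if response.status_code == 206 else 'wb'   # 206 = server honoured the range
    with open(file_name, mode) as f:
        for data in response.iter_content(chunk_size=chunk_size):
            if cancel_event.is_set():   # cancelled: stop writing immediately
                break
            pause_event.wait()          # paused: block here until resumed
            f.write(data)
    if cancel_event.is_set():           # cancelled: remove the partial file
        os.remove(file_name)

# usage from a GUI or another thread:
# t = threading.Thread(target=download, args=(url, file_name), daemon=True)
# t.start()
# pause_event.clear()   # pause
# pause_event.set()     # resume
# cancel_event.set()    # cancel

Note one trade-off: pausing with Event.wait() keeps the HTTP connection open, so a long pause may hit a server or socket timeout. Breaking out of the loop and re-requesting with a fresh Range header on resume, as the project code above does, avoids that.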

Animated demo

(animated GIF demonstrating pause, resume, and cancel; image not reproduced here)



Well, that is the end of this share and summary. If you have any questions, leave a message in the comment area. If you like my articles, remember to like and follow!

Origin blog.csdn.net/qq_46329012/article/details/113763531