Batch-downloading Juchao annual reports with the Xunlei API

Description

First, use Octopus to crawl the list of announcement links from the Juchao page. Each link points to a detail page with a download button rather than to the file itself, and the download link cannot be extracted directly from the detail page. The download link can, however, be derived from the detail-page link by rewriting parts of the URL.
For saving each file under its own path and name, see: How to use python to download in batches - use Python to call Xunlei to realize batch download in the background. Having to confirm each save manually is still a bit inconvenient.
Xunlei sometimes still gets stuck at a download speed of 0, so use a plain Python script to download the remaining files. It has no multi-threading, so it is a bit slow: Python downloads Juchao PDF annual report in batches_Invincible predecessor's blog-CSDN blog
Both Xunlei and ordinary scripts can run into the anti-crawler mechanism and have their requests rejected: Python batch downloads annual reports (anti-crawler response version)
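The fallback described above (fetching with plain Python whatever Xunlei left unfinished) might look roughly like the sketch below. It is not the script from the linked posts. The helper names `missing_files` and `fetch`, the `User-Agent` header, and the retry settings are all assumptions made for illustration, and a browser-like header alone may not satisfy the site's anti-crawler checks.

```python
import os
import time
import urllib.request

# Assumed browser-like header; which headers the site actually checks is not known.
HEADERS = {"User-Agent": "Mozilla/5.0"}


def missing_files(filenames, folder):
    """Return the names from `filenames` that are absent or empty in `folder`."""
    missing = []
    for name in filenames:
        path = os.path.join(folder, name)
        if not os.path.isfile(path) or os.path.getsize(path) == 0:
            missing.append(name)
    return missing


def fetch(url, dest, retries=3, delay=2.0, opener=urllib.request.urlopen):
    """Download `url` to `dest`, retrying when the request fails or is rejected."""
    request = urllib.request.Request(url, headers=HEADERS)
    for attempt in range(1, retries + 1):
        try:
            with opener(request, timeout=30) as response:
                data = response.read()
            with open(dest, "wb") as handle:
                handle.write(data)
            return True
        except OSError:
            # urllib's URLError and HTTPError are subclasses of OSError.
            if attempt < retries:
                time.sleep(delay * attempt)  # wait longer before each retry
    return False
```

Calling `missing_files` with the expected file names and the Xunlei save folder gives the list to pass to `fetch` one by one. The `opener` parameter exists only so the function can be exercised without network access.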

Annual Report Screening Process

1. By abbreviation: filter with "Text Contains" on ST
2. By title, filter on:
(1) Abstract, Canceled
(2) English version
(3) About, Announcement, H Share
3. Sort by code in ascending order, then by time in descending order:
(1) First, move updated reports for the previous year into the previous year's file (start from the latest year)
(2) Then deduplicate, keeping only the latest annual report (this must come after the previous step)
(3) Exclude B-share annual reports by code
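The screening steps above are done by hand in a spreadsheet, but they can also be sketched in Python. The sketch below rests on several assumptions: the filters are treated as exclusions, the row fields `code`, `name`, `title` and `time` are hypothetical, the Chinese keywords are guesses at the wording behind the translated filter terms, and B-share codes are taken to start with 200 or 900.

```python
import re

# Assumed Chinese title keywords behind "Abstract", "Canceled", "English version",
# "About", "Announcement" and "H Share"; adjust them to the actual titles.
TITLE_KEYWORDS = ("摘要", "已取消", "英文", "关于", "公告", "H股")
# Assumption: B-share codes start with 200 (Shenzhen) or 900 (Shanghai).
B_SHARE_PREFIXES = ("200", "900")


def report_year(title):
    """Return the four-digit year named in the title, or None if there is none."""
    match = re.search(r"(\d{4})年", title)
    return int(match.group(1)) if match else None


def screen(rows):
    """Filter, sort and deduplicate announcement rows given as dicts."""
    kept = []
    for row in rows:
        if "ST" in row["name"]:
            continue  # step 1: abbreviation contains ST
        if any(word in row["title"] for word in TITLE_KEYWORDS):
            continue  # step 2: unwanted title keywords
        if row["code"].startswith(B_SHARE_PREFIXES):
            continue  # step 3(3): B shares, applied early since order does not matter
        kept.append(row)
    # Step 3: code ascending, time descending. Two stable sorts achieve this,
    # and ISO dates such as "2016-03-10" sort correctly as strings.
    kept.sort(key=lambda r: r["time"], reverse=True)
    kept.sort(key=lambda r: r["code"])
    # Steps 3(1) and 3(2): key each report by the year in its title, so an updated
    # report published later still belongs to the year it covers, then keep the
    # first (newest) row for every (code, year) pair.
    latest = {}
    for row in kept:
        latest.setdefault((row["code"], report_year(row["title"])), row)
    return list(latest.values())
```

Keying on the year in the title replaces the manual step of moving updated reports into the previous year's file. This only works if the titles name the reporting year.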

Code

# Requires pywin32, which provides win32com: pip install pywin32
import os
import re
import time

import openpyxl
from win32com.client import Dispatch
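

def xunlei(url, downpath, filename):
    # Before running, tick "one-click download" in Xunlei's settings center;
    # otherwise a dialog asks you to confirm every new download task.
    # filename = url.split('/')[-1]
    thunder = Dispatch('ThunderAgent.Agent64.1')
    # thunder = Dispatch("ThunderAgent.Agent.1")
    # AddTask(download URL, save-as file name, save directory, task comment,
    #         referrer URL, start mode, download only from the original URL,
    #         number of threads for the original URL)
    thunder.AddTask(url, filename, downpath, "", "", -1, 0, 5)
    # thunder.AddTask(url)
    thunder.CommitTasks()
    time.sleep(0.05)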


def code_revise(code_cell):
    # Read the cell through .value; a .text attribute cannot be used here.
    # Excel stores stock codes as numbers, so restore the leading zeros
    # to get the six-digit code.
    return str(code_cell.value).zfill(6)


def url_revise(url):
    # Turn a detail-page link into a direct download link.
    # str.replace(old, new[, max]) replaces a literal substring at most max times;
    # re.sub(pattern, repl, string, count=0, flags=0) takes a regex pattern,
    # the replacement, the source text and a count (0 replaces every match).
    # print("url1:" + url)
    old1 = re.compile(r'disclosure/detail\?stockCode=\d+&announcementId')
    old2 = re.compile(r'orgId=\w+\d+&announcementTime')
    new1 = 'announcement/download?bulletinId'
    new2 = 'announceTime'
    url = re.sub(old1, new1, url)
    url = re.sub(old2, new2, url)
    # print("url2:" + url)
    return url


input_dir = r'E:\huang\Documents'  # folder that holds the download list
os.chdir(input_dir)
downpath = r'E:\Alark\Users\Desktop\年报'  # folder Xunlei saves into
downlist = '2015-2016年其他行业.xlsx'
wb = openpyxl.load_workbook(downlist)
ws = wb.active
for row in ws.rows:
    # Stop at the first row whose code cell is empty.
    if row[0].value is None:
        break
    # Column 1 holds the stock code, column 3 the text used in the file name,
    # and column 5 the detail-page link.
    filename = code_revise(row[0]) + '_' + row[2].value + '.pdf'
    url = url_revise(row[4].value)
    xunlei(url, downpath, filename)
wb.save("cache.xlsx")
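
To see what the two substitutions and the zero-padding do, here is a small standalone check. The sample link is made up to match the shape the patterns expect and is not taken from the original spreadsheet. The name `to_download_url` is used only for this illustration.

```python
import re

DETAIL = re.compile(r"disclosure/detail\?stockCode=\d+&announcementId")
ORG = re.compile(r"orgId=\w+\d+&announcementTime")


def to_download_url(url):
    """Apply the same two substitutions as url_revise."""
    url = DETAIL.sub("announcement/download?bulletinId", url)
    return ORG.sub("announceTime", url)


# Hypothetical detail-page link; the host, path prefix and IDs are illustrative only.
sample = ("http://www.cninfo.com.cn/new/disclosure/detail"
          "?stockCode=000001&announcementId=1234567890"
          "&orgId=gssz0000001&announcementTime=2016-03-10")

print(to_download_url(sample))
# http://www.cninfo.com.cn/new/announcement/download?bulletinId=1234567890&announceTime=2016-03-10

# A code read from Excel as the number 1 becomes the six-digit string.
print(str(1).zfill(6))
# 000001
```

The first pattern swaps the detail-page path and `stockCode` parameter for the download path, and the second drops the `orgId` parameter while renaming `announcementTime` to `announceTime`.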


Origin blog.csdn.net/qq_37639139/article/details/124168844