Batch-downloading PDFs with a Python crawler

Today I ran into a task: I was given an Excel file containing download links for more than 500 PDF files, and I needed to download every one of them. I knew a Python crawler could handle a batch download like this, but I had never written one before. After digging through material this afternoon I finally got it working, which spared me from downloading the files by hand. The following references were very helpful:
1. Liao Xuefeng's Python tutorial
2. Batch-downloading PDF documents with a Python crawler: http://blog.csdn.net/u012705410/article/details/47708031
3. Crawling Tieba images with a Python crawler: http://blog.csdn.net/u012705410/article/details/47685417
4. Python crawler tutorial series: http://cuiqingcai.com/1052.html

My Python installation is version 3.5, whereas the code I studied from reference 2 above is written for Python 2.7, so some of its syntax no longer works. I fixed the affected parts; the corrected code is below:


# -*- coding: utf-8 -*-
# Crawl Li Dongfeng's PDF documents, page: http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/index.htm

import urllib.request
import re
import os

# open the url and read
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and collect all the PDF links we need
def getUrl(html):
    reg = r'(?:href|HREF)="?((?:http://)?.+?\.pdf)'
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('gb2312'))  # the page is gb2312-encoded
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')

    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        f.write(buffer)
    f.close()
    print ("Sucessful to download" + " " + file_name)


root_url = 'http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/'

raw_url = 'http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/index.htm'

html = getHtml(raw_url)
url_lst = getUrl(html)

os.mkdir('ldf_download')
os.chdir(os.path.join(os.getcwd(), 'ldf_download'))

for url in url_lst[:]:
    url = root_url + url
    getFile(url)
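
For reference, the Python 2.7 to 3.5 changes needed are small; the comparison below is a general note on the porting (the 2.7 column shows typical 2.7 idioms, not lines copied verbatim from reference 2):

# Python 2.7 (typical)                     ->  Python 3.5 (used above)
# import urllib2                           ->  import urllib.request
# page = urllib2.urlopen(url)              ->  page = urllib.request.urlopen(url)
# print "Successfully downloaded " + name  ->  print("Successfully downloaded " + name)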
    
    

The example above is a good template, but it does not directly fit my case. What I did was first write the addresses into an HTML file, and then adjust the regular-expression part. The addresses I need all look like this: http://pm.zjsti.gov.cn/tempublicfiles/G176200001/G176200001.pdf. The improved code is below:


# -*- coding: utf-8 -*-
# Crawl the PDF documents linked from my own hand-written HTML file: file:///E:/ZjuTH/Documents/pythonCode/pythontest.html

import urllib.request
import re
import os

# open the url and read
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and collect all the document IDs we need
def getUrl(html):
    reg = r'([A-Z]\d+)'  # matches document IDs such as G176200001
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('UTF-8'))  # returns the list of matched IDs
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')

    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        f.write(buffer)
    f.close()
    print ("Sucessful to download" + " " + file_name)


root_url = 'http://pm.zjsti.gov.cn/tempublicfiles/'  # the part shared by all download URLs

raw_url = 'file:///E:/ZjuTH/Documents/pythonCode/pythontest.html'

html = getHtml(raw_url)
url_lst = getUrl(html)

os.mkdir('pdf_download')
os.chdir(os.path.join(os.getcwd(), 'pdf_download'))

for url in url_lst[:]:
    url = root_url + url + '/' + url + '.pdf'  # build the full download URL
    getFile(url)
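
With 500-plus files, an individual download will occasionally time out or fail. The variant below is my own hardening of the getFile helper above (it is not part of the referenced posts, and the name getFileSafe is mine): it adds a timeout, skips files that are already on disk so the script can simply be re-run, and reports errors without aborting the whole batch.

import os
import urllib.request

def getFileSafe(url):
    # Hardened variant of getFile: skip existing files, time out slow
    # connections, and keep going when a single download fails.
    file_name = url.split('/')[-1]
    if os.path.exists(file_name):
        print("Skipped (already downloaded): " + file_name)
        return
    try:
        u = urllib.request.urlopen(url, timeout=30)
        with open(file_name, 'wb') as f:
            while True:
                buffer = u.read(8192)
                if not buffer:
                    break
                f.write(buffer)
        print("Successfully downloaded " + file_name)
    except Exception as e:
        print("Failed to download " + file_name + ": " + str(e))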
    
    

And that took care of it with very little effort.
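
Incidentally, since the task started from an Excel file holding the 500-plus links, the URLs could also be read straight from the spreadsheet instead of being pasted into an HTML file first. Below is a minimal sketch assuming the third-party openpyxl library is installed; the file name links.xlsx and the assumption that the links sit in the first column of the first sheet are hypothetical and would need to match the real spreadsheet.

from openpyxl import load_workbook

wb = load_workbook('links.xlsx', read_only=True)  # hypothetical file name
ws = wb.active                                    # assume the links are on the first sheet
url_lst = []
for row in ws.iter_rows(min_col=1, max_col=1, values_only=True):
    cell = row[0]
    if cell and str(cell).lower().endswith('.pdf'):  # keep only the PDF links
        url_lst.append(str(cell))

for url in url_lst:
    getFile(url)  # reuse the download helper defined earlier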

Copyright notice: this is an original post by the blogger and may not be reposted without permission. https://blog.csdn.net/baidu_28479651/article/details/76158051