Batch-downloading PDFs with a Python crawler

Today I ran into a task: I was given an Excel file containing download links for more than 500 PDF files, and I needed to download every one of them. I knew a Python crawler could handle a batch download like this, but I had never written one before. After digging through material this afternoon I finally got it working, which spared me from downloading the files by hand. The following references were very helpful:
1. Liao Xuefeng's Python tutorial
2. Batch-downloading PDF documents with a Python crawler: http://blog.csdn.net/u012705410/article/details/47708031
3. Crawling Tieba images with a Python crawler: http://blog.csdn.net/u012705410/article/details/47685417
4. Python crawler tutorial series: http://cuiqingcai.com/1052.html

My Python installation is version 3.5, whereas the code I studied from reference 2 above is written for Python 2.7, so some of its syntax no longer works. I fixed the affected parts; the corrected code is below:


# -*- coding: utf-8 -*-
# Crawl Li Dongfeng's PDF documents, page: http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/index.htm

import urllib.request
import re
import os

# open the url and read
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and collect all the PDF links we need
def getUrl(html):
    reg = r'(?:href|HREF)="?((?:http://)?.+?\.pdf)'
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('gb2312'))  # the page is gb2312-encoded
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')

    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        f.write(buffer)
    f.close()
    print ("Sucessful to download" + " " + file_name)


root_url = 'http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/'

raw_url = 'http://www.math.pku.edu.cn/teachers/lidf/docs/textrick/index.htm'

html = getHtml(raw_url)
url_lst = getUrl(html)

os.mkdir('ldf_download')
os.chdir(os.path.join(os.getcwd(), 'ldf_download'))

for url in url_lst[:]:
    url = root_url + url
    getFile(url)
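
For reference, the Python 2.7 to 3.5 changes needed are small; the comparison below is a general note on the porting (the 2.7 column shows typical 2.7 idioms, not lines copied verbatim from reference 2):

# Python 2.7 (typical)                     ->  Python 3.5 (used above)
# import urllib2                           ->  import urllib.request
# page = urllib2.urlopen(url)              ->  page = urllib.request.urlopen(url)
# print "Successfully downloaded " + name  ->  print("Successfully downloaded " + name)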
    
    

The example above is a good template, but it does not directly fit my case. What I did was first write the addresses into an HTML file, and then adjust the regular-expression part. The addresses I need all look like this: http://pm.zjsti.gov.cn/tempublicfiles/G176200001/G176200001.pdf. The improved code is below:


# -*- coding: utf-8 -*-
# Crawl the PDF documents linked from my own hand-written HTML file: file:///E:/ZjuTH/Documents/pythonCode/pythontest.html

import urllib.request
import re
import os

# open the url and read
def getHtml(url):
    page = urllib.request.urlopen(url)
    html = page.read()
    page.close()
    return html

# compile the regular expression and collect all the document IDs we need
def getUrl(html):
    reg = r'([A-Z]\d+)'  # matches document IDs such as G176200001
    url_re = re.compile(reg)
    url_lst = url_re.findall(html.decode('UTF-8'))  # returns the list of matched IDs
    return url_lst

def getFile(url):
    file_name = url.split('/')[-1]
    u = urllib.request.urlopen(url)
    f = open(file_name, 'wb')

    block_sz = 8192
    while True:
        buffer = u.read(block_sz)
        if not buffer:
            break

        f.write(buffer)
    f.close()
    print ("Sucessful to download" + " " + file_name)


root_url = 'http://pm.zjsti.gov.cn/tempublicfiles/'  # the part shared by all download URLs

raw_url = 'file:///E:/ZjuTH/Documents/pythonCode/pythontest.html'

html = getHtml(raw_url)
url_lst = getUrl(html)

os.mkdir('pdf_download')
os.chdir(os.path.join(os.getcwd(), 'pdf_download'))

for url in url_lst[:]:
    url = root_url + url + '/' + url + '.pdf'  # build the full download URL
    getFile(url)
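
With 500-plus files, an individual download will occasionally time out or fail. The variant below is my own hardening of the getFile helper above (it is not part of the referenced posts, and the name getFileSafe is mine): it adds a timeout, skips files that are already on disk so the script can simply be re-run, and reports errors without aborting the whole batch.

import os
import urllib.request

def getFileSafe(url):
    # Hardened variant of getFile: skip existing files, time out slow
    # connections, and keep going when a single download fails.
    file_name = url.split('/')[-1]
    if os.path.exists(file_name):
        print("Skipped (already downloaded): " + file_name)
        return
    try:
        u = urllib.request.urlopen(url, timeout=30)
        with open(file_name, 'wb') as f:
            while True:
                buffer = u.read(8192)
                if not buffer:
                    break
                f.write(buffer)
        print("Successfully downloaded " + file_name)
    except Exception as e:
        print("Failed to download " + file_name + ": " + str(e))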
    
    

And that took care of it with very little effort.
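
Incidentally, since the task started from an Excel file holding the 500-plus links, the URLs could also be read straight from the spreadsheet instead of being pasted into an HTML file first. Below is a minimal sketch assuming the third-party openpyxl library is installed; the file name links.xlsx and the assumption that the links sit in the first column of the first sheet are hypothetical and would need to match the real spreadsheet.

from openpyxl import load_workbook

wb = load_workbook('links.xlsx', read_only=True)  # hypothetical file name
ws = wb.active                                    # assume the links are on the first sheet
url_lst = []
for row in ws.iter_rows(min_col=1, max_col=1, values_only=True):
    cell = row[0]
    if cell and str(cell).lower().endswith('.pdf'):  # keep only the PDF links
        url_lst.append(str(cell))

for url in url_lst:
    getFile(url)  # reuse the download helper defined earlier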

Copyright notice: this is an original post by the blogger and may not be reposted without permission. https://blog.csdn.net/baidu_28479651/article/details/76158051