Scraping resources from the school website

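Some courses have far too many PDFs and no one-click download, and fetching them by hand is tedious (writing a script costs even more time, of course, but I learned something that may come in handy later), so I wrote a scraper script.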

Without further ado, see the screenshot.

For example, this course has a lot of PDFs that I want to download. Here is the code:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Session cookies copied from the browser while logged in
cookie = {"sepuser":"bWlkPTU4MTZjNTBlLWI5ZTItNGE5Ni05ZDgzLTllNjYzZTFlZGM3NQ==", "JSESSIONID":"f59d88c1-f80c-4ccd-8fc6-38f2e142bb16.localhost.localdomain","pasystem_timezone_ok":"true"}
# Page that lists the PDFs to download
url = 'https://course.ucas.ac.cn/portal/site/195754/tool/b198988e-9d1b-4e15-885d-6351c72fee01'
# Settings for another course, kept for reference:
# nlp_course_cookie = {"sepuser":"bWlkPTU4MTZjNTBlLWI5ZTItNGE5Ni05ZDgzLTllNjYzZTFlZGM3NQ==", "JSESSIONID":"f59d88c1-f80c-4ccd-8fc6-38f2e142bb16.localhost.localdomain","pasystem_timezone_ok":"true"}
# nlp_url = 'https://course.ucas.ac.cn/portal/site/195599/tool/4d491e2c-2428-4aa6-8e4e-c0d62b8bf61a'

# Fetch the page content
r = requests.get(url, cookies=cookie)

# Parse the HTML (named `page` so the BeautifulSoup import is not shadowed)
page = BeautifulSoup(r.text, 'html.parser')

count = 0
# Loop over every <a> element on the page
for link in page.find_all('a'):
    href = link.get('href')
    # Only handle links whose href exists and contains '.pdf'
    if href is not None and '.pdf' in href:
        print(href)
        count = count + 1
        if count != 1:
            # Resolve relative links against the page URL, then download
            response = requests.get(urljoin(url, href), cookies=cookie)
            # Save as pdf1.pdf, pdf2.pdf, ... in the current folder
            with open("pdf" + str(count - 1) + ".pdf", 'wb') as pdf:
                pdf.write(response.content)
        else:
            # The first matching link is skipped and not downloaded
            print("count=1!!!!")

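A quick word on the parameters: url is the page that holds the resources you want to download, and cookie is the cookie of that page in your logged-in browser. How do you find it? Open the browser's developer tools, click Application, select Cookies, and copy the value of each field named in the code.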
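Copying the fields one by one gets tedious. As a minimal sketch, assuming you instead copy the whole Cookie request header as a single string (for example from the Network tab), a small helper can turn it into the dict that requests expects. The name cookie_header_to_dict and the sample values are made up for illustration and are not part of the script above.

```python
def cookie_header_to_dict(header):
    """Turn a 'name=value; name2=value2' Cookie header string into a dict."""
    cookies = {}
    for part in header.split(";"):
        part = part.strip()
        # Skip empty pieces and anything that is not a name=value pair
        if not part or "=" not in part:
            continue
        # Split on the first '=' only, since values may contain '=' (base64 padding)
        name, value = part.split("=", 1)
        cookies[name.strip()] = value.strip()
    return cookies


# Hypothetical header string with placeholder values
example = "sepuser=abc==; JSESSIONID=xyz.localhost; pasystem_timezone_ok=true"
cookie = cookie_header_to_dict(example)
print(cookie)
```

The resulting dict can be passed as cookies=cookie exactly like the hand-written one.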

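The downloads are saved in the current folder as pdf1.pdf, pdf2.pdf, pdf3.pdf, ..., pdfn.pdf. Since the cookie sometimes changes dynamically, when I get the chance I would like to try fetching the page's cookie dynamically.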
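One possible first step in that direction, not something I have tested against the course site, is requests.Session: it keeps a cookie jar, sends it with every request, and updates it whenever a response sets a new cookie. The sketch below only seeds the jar with values copied from the browser. It does not log in for you, and the helper name make_session and the placeholder value are assumptions for illustration.

```python
import requests


def make_session(initial_cookies):
    """Return a Session whose cookie jar starts with the given cookies."""
    session = requests.Session()
    # The jar behaves like a dict, so a plain dict can be merged into it
    session.cookies.update(initial_cookies)
    return session


# Placeholder value; use the real one copied from the browser
session = make_session({"JSESSIONID": "value-copied-from-browser"})
print(session.cookies.get("JSESSIONID"))

# Later requests would reuse and refresh the same jar, for example:
# r = session.get(url)
```

With this, requests.get(url, cookies=cookie) in the script becomes session.get(url), and cookies refreshed by the server during the run are picked up automatically. Getting the initial cookie without the browser would still require the site's login flow, which is not covered here.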


Reposted from: blog.csdn.net/weixin_43332715/article/details/121350149