Some courses bury you in PDFs and offer no one-click download, so grabbing them by hand gets tedious (sure, writing a script costs more time up front, but it's practice that may come in handy later), which is why I wrote a little crawler script.
Without further ado: take a course like this one, with a pile of PDF slides that need downloading. Here's the code:
import requests
from bs4 import BeautifulSoup

# Cookies copied from the browser (developer tools -> Application -> Cookies)
cookie = {
    "sepuser": "bWlkPTU4MTZjNTBlLWI5ZTItNGE5Ni05ZDgzLTllNjYzZTFlZGM3NQ==",
    "JSESSIONID": "f59d88c1-f80c-4ccd-8fc6-38f2e142bb16.localhost.localdomain",
    "pasystem_timezone_ok": "true",
}

# The course page whose PDFs we want to download
url = 'https://course.ucas.ac.cn/portal/site/195754/tool/b198988e-9d1b-4e15-885d-6351c72fee01'

# Fetch the page and parse it into a soup object
r = requests.get(url, cookies=cookie)
soup = BeautifulSoup(r.text, 'html.parser')

count = 0
# Walk every <a> tag on the page and download each link that points at a PDF
for link in soup.find_all('a'):
    href = link.get('href')
    # Only follow hyperlinks that exist and contain '.pdf'
    if href is not None and '.pdf' in href:
        print(href)
        count += 1
        # The first .pdf link on this page is a metadata file
        # rather than course slides, so skip it
        if count != 1:
            response = requests.get(href, cookies=cookie)
            with open("pdf" + str(count - 1) + ".pdf", 'wb') as pdf:
                pdf.write(response.content)
        else:
            print("count=1, skipping the first link")
A quick note on the parameters: url is the page that holds the resources you want to download, and cookie holds the cookies for that page. How do you find them? Open the browser's developer tools, click the Application tab, select Cookies, and copy the fields matching the ones in the code.
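If copying the fields one by one feels clumsy, another option is to copy the whole Cookie request header from the Network tab and split it into a dict in one go. A minimal sketch (the header value below is a placeholder, not a real session):

# Placeholder for the raw "Cookie:" request-header value from the Network tab
raw = "sepuser=bWlkPTU=; JSESSIONID=f59d88c1; pasystem_timezone_ok=true"
# split("=", 1) keeps '=' characters inside values intact
# (the base64 sepuser value can end in '=' padding)
cookie = dict(pair.split("=", 1) for pair in raw.split("; "))
print(cookie)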
The downloads land in the current folder as pdf1.pdf, pdf2.pdf, ..., pdfn.pdf (with the first .pdf link skipped). Since cookies sometimes change dynamically, next time we might try fetching the page's cookies programmatically.
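As a starting point for that, here's a minimal sketch of letting requests.Session collect cookies from a login request instead of hand-copying them. To be clear, the login URL and form field names below are pure guesses for illustration; the real SEP single sign-on flow may require extra tokens or steps:

import requests

session = requests.Session()

# HYPOTHETICAL endpoint and form fields -- inspect the actual login request
# in the Network tab and mirror what the browser sends.
login_url = 'https://sep.ucas.ac.cn/slogin'
session.post(login_url, data={'userName': 'your_username', 'pwd': 'your_password'})

# On success the session object carries the cookies automatically,
# so later requests need no hand-copied cookie dict.
r = session.get('https://course.ucas.ac.cn/portal/site/195754/tool/b198988e-9d1b-4e15-885d-6351c72fee01')
print(session.cookies.get_dict())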