Preface
The text and images in this article come from the internet and are for learning and exchange only, with no commercial use intended; copyright belongs to the original author. If there is any problem, please contact us promptly so we can handle it.
Author: imBobby
It's the weekend, so let's write something simple and fun. Back from the gym this afternoon, I felt like watching a movie, so I headed to a familiar site:
btbtt.me
I find this site has fairly complete Chinese-language resources, while The Pirate Bay is stronger for English ones. Today, let's build a movie-resource crawler. Open the btbtt.me homepage:
The heavy knock-off styling is oddly charming. First, some reconnaissance. My plan: go into the HD movie section, visit each movie link on the page one by one, and save the torrent from each movie's detail page to local disk. So let's start by inspecting the listing page:
It turns out the detail-page URLs are stored in anchor tags whose class is subject_link thread-new or subject_link thread-old. Next, click into a movie detail page:
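That selection can be sketched with BeautifulSoup. The class names come from the inspection above, but the HTML snippet itself is made up for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the listing-page markup
html = '''
<a class="subject_link thread-new" href="thread-index-fid-1183-tid-1.htm" title="Movie A">Movie A</a>
<a class="subject_link thread-old" href="thread-index-fid-1183-tid-2.htm" title="Movie B">Movie B</a>
'''
soup = BeautifulSoup(html, "html.parser")
# Searching class_ with the full multi-class string matches the exact class attribute
links = soup.find_all("a", class_="subject_link thread-new") + \
        soup.find_all("a", class_="subject_link thread-old")
for a in links:
    print(a.get("title"), a.get("href"))
```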
The download link lives in an anchor tag whose rel attribute is nofollow. Click the download link and see what happens:
There is yet another layer, which is a bit of a pain: filtering out this final download link by tags alone would be awkward. But one thing stands out:
The download link is simply the same URL with attach swapped for download, which saves a lot of work!
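That swap is a one-line string replacement. A minimal sketch, using a made-up attachment URL that follows the pattern above (the real fid/aid values will differ):

```python
# Hypothetical attachment URL (pattern assumed from the page inspection)
attach_url = "http://btbtt.me/attach-fid-1183-aid-123456.htm"
# Swap "attach" for "download" to get the direct download link
download_url = attach_url.replace("attach", "download")
print(download_url)  # -> http://btbtt.me/download-fid-1183-aid-123456.htm
```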
With the plan roughed out, it's time to write the code:
import requests
import bs4
import os
import time

# Set up a proxy; this site also needs one to be reachable
proxies = {
    "http": "http://127.0.0.1:41091",
    "https": "http://127.0.0.1:41091",
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36 Edg/84.0.522.50"
}
def init_movie_list(nums):
    """
    Build the list of listing pages to crawl; each page holds a few dozen movies.
    :param nums: number of listing pages to crawl
    :return: list of page URLs
    """
    movie_list = []
    if nums < 1:
        return movie_list
    for num in range(1, nums + 1):
        url = "http://btbtt.me/forum-index-fid-1183-page-" + str(num) + ".htm"
        movie_list.append(url)
    return movie_list
def get_movie_detail_url(url):
    """
    Fetch a listing page and collect the detail-page link of every movie on it.
    :param url: listing page URL
    :return: list of (movie title, detail page URL) tuples
    """
    context = requests.get(url=url, headers=headers, proxies=proxies).content
    time.sleep(1)
    bs4_result = bs4.BeautifulSoup(context, "html.parser")
    new_read_details = bs4_result.find_all("a", class_="subject_link thread-new")
    all_details = bs4_result.find_all("a", class_="subject_link thread-old") + new_read_details
    if not all_details:
        return []
    url_list = []
    for item in all_details:
        url_list.append((item.get("title"), "http://btbtt.me/" + item.get("href")))
    return url_list
def get_movie_download_url(url_tuple):
    """
    Resolve the torrent download link on a movie's detail page.
    :param url_tuple: (movie title, detail page URL) tuple
    :return: (folder name, file name, download URL) tuple
    """
    folder_name = replace_folder_name(url_tuple[0])
    url = url_tuple[1]
    resp = requests.get(url=url, headers=headers, proxies=proxies)
    time.sleep(1)
    bs4_result = bs4.BeautifulSoup(resp.content, "html.parser")
    # Keep only plain download anchors (no ajaxdialog attribute)
    result = bs4_result.find_all("a", rel="nofollow", target="_blank", ajaxdialog=False)
    if not result:
        return ('', '', '')
    file_name = replace_folder_name(result[-1].text)
    download_url = "http://btbtt.me/" + result[-1].get("href").replace("dialog", "download")
    return (folder_name, file_name, download_url)
def replace_folder_name(folder_name):
    """
    Normalize a name so it is a legal Windows file/folder name.
    :param folder_name: raw name
    :return: sanitized name
    """
    illegal_str = ["?", ",", "/", "\\", "*", "<", ">", "|", " ", "\n", ":", '"']
    for item in illegal_str:
        folder_name = folder_name.replace(item, "")
    return folder_name
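To see what the sanitizer does, here is the same character-stripping loop as a self-contained snippet (the sample title is made up):

```python
# Strip characters that are illegal in Windows file/folder names
illegal_str = ["?", ",", "/", "\\", "*", "<", ">", "|", " ", "\n", ":"]
name = "Interstellar: 2014 / 1080p?"
for ch in illegal_str:
    name = name.replace(ch, "")
print(name)  # -> Interstellar20141080p
```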
def download_file(input_tuple):
    """
    Download the torrent and save it under its own folder.
    :param input_tuple: (folder name, file name, download URL) tuple
    :return: None
    """
    folder_name = input_tuple[0]
    if not folder_name:
        folder_name = str(int(time.time()))
    file_name = input_tuple[1]
    if not file_name:
        file_name = str(int(time.time())) + ".zip"
    download_url = input_tuple[2]
    if not download_url:
        return
    resp = requests.get(url=download_url, headers=headers, proxies=proxies)
    time.sleep(1)
    # D:/torrent is my save path; change it as needed.
    # makedirs also creates D:/torrent itself if it does not exist yet.
    if not os.path.exists('D:/torrent/' + folder_name):
        os.makedirs('D:/torrent/' + folder_name)
    with open('D:/torrent/' + folder_name + "/" + file_name, 'wb') as f:
        f.write(resp.content)
if __name__ == '__main__':
    url = init_movie_list(5)
    url_list = []
    for item in url:
        url_list = get_movie_detail_url(item) + url_list
    for i in url_list:
        download_tuple = get_movie_download_url(i)
        download_file(download_tuple)