Using a Multithreaded Python Crawler to Download Albums and Tracks from Ximalaya
Preface:
By the end of this post you will be able to search Ximalaya for an album and download every one of its tracks as MP3 files, using a small multithreaded Python crawler.
1. Required Python modules
Completing this project requires the following Python modules: urllib, os, threading, sys, bs4, random, and json.
Here is a brief note on what each module is used for:
urllib: fetches the web pages; its parse submodule also handles URL encoding (the urllib.parse.urlencode() method);
os: creates the folder that all downloaded albums and tracks are saved into;
threading: provides the worker threads; I will not go into much detail on it here;
sys: exits the whole program via sys.exit();
bs4: parses the scraped HTML;
random: not strictly needed for this project; I only use it to vary the download-progress output a little;
json: parses the scraped JSON data via the json.loads() method.
2. How to implement it
We first need to go to the Ximalaya website: https://www.ximalaya.com
Type the book or program you want to listen to into the search box. I typed: 百家讲坛, since I count myself a bit of a history buff. That brings up the results page.
Notice that the results URL is: https://www.ximalaya.com/search/%E7%99%BE%E5%AE%B6%E8%AE%B2%E5%9D%9B
The part after search/ is simply the URL-encoded (percent-encoded) form of 百家讲坛, not an encrypted string.
The following method builds that URL:
def get_url(self):
    name = parse.urlencode({'keyword': self.keyword})
    name = name[name.find('=') + 1:]  # keep only the encoded value after '='
    url = 'https://www.ximalaya.com/search/{}'.format(name)  # build the search URL
    return url
That gives us the search URL.
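As a quick sanity check, the encoding step can be reproduced with the standard library alone; this minimal standalone sketch shows that urllib.parse produces exactly the percent-encoded string seen in the address bar:

```python
from urllib import parse

# urlencode() builds "keyword=<encoded>"; get_url() slices off the value after '='
encoded = parse.urlencode({'keyword': '百家讲坛'})
value = encoded[encoded.find('=') + 1:]
print(value)  # %E7%99%BE%E5%AE%B6%E8%AE%B2%E5%9D%9B

# parse.quote() yields the same percent-encoding directly, without the key
assert value == parse.quote('百家讲坛')
url = 'https://www.ximalaya.com/search/{}'.format(value)
print(url)
```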
Press F12 to open the browser's developer tools. Inspecting the results page shows that the album links and titles all sit under the same tag, so we only need to scrape that information. However, without request headers the site returns nothing useful. The code is as follows:
def get_info(self):
    headers = {
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
        "accept-language": "zh-CN,zh;q=0.9",
        "cache-control": "max-age=0",
        "cookie": "_xmLog=xm_k9b2hll719h15c; device_id=xm_1587543632882_k9b2hmuqogh2yd; s&e=57b3d99db29b032792574ca15a641c84; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1587543634,1588561607; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1588562033; s&a=OP%09T%0AZB%07%1C]V%05^TMXC]ZXZ%04%1ARO%0AUW_%04CWVQCXZOJRKG^VYYWXN]",
        "upgrade-insecure-requests": "1",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400"
    }
    request_1 = request.Request(url=self.get_url(), headers=headers)
    response = request.urlopen(request_1)
    info = response.read().decode('utf-8')
    soup = BeautifulSoup(info, 'lxml')
    # album links found by the search (a list of tags)
    list_1 = soup.select('div.all-page-search-album._Tt>div.xm-album>div.xm-album-cover__wrapper>a')
    # album titles found by the search
    list_2 = soup.select('div.all-page-search-album._Tt>div.xm-album>a')
    for i in range(len(list_1)):
        url = 'https://www.ximalaya.com' + list_1[i]['href']
        list_1[i] = url
        list_2[i] = list_2[i]['title']
    return list_1, list_2
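To see what those select() calls are doing without hitting the live site, here is a self-contained sketch run against hand-written HTML. The markup below is purely illustrative (the real page uses the class names shown above), and html.parser is used so the example does not require lxml:

```python
from bs4 import BeautifulSoup

# Illustrative markup only, mimicking "album link + title attribute" structure
html = '''
<div class="album"><a href="/album/123" title="Sample Album">Sample Album</a></div>
<div class="album"><a href="/album/456" title="Another Album">Another Album</a></div>
'''
soup = BeautifulSoup(html, 'html.parser')
links = soup.select('div.album>a')  # CSS child selector, same idea as the code above
urls = ['https://www.ximalaya.com' + a['href'] for a in links]
titles = [a['title'] for a in links]
print(urls)    # ['https://www.ximalaya.com/album/123', 'https://www.ximalaya.com/album/456']
print(titles)  # ['Sample Album', 'Another Album']
```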
Next, click into one of those album links; underneath it you will find the chapter (track) information for that album.
Exactly as before, we only need to scrape the corresponding links and titles. The code is as follows:
def get_info_list(self):
    headers = {
        "cookie": "_xmLog=xm_k9b2hll719h15c; device_id=xm_1587543632882_k9b2hmuqogh2yd; s&e=57b3d99db29b032792574ca15a641c84; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1587543634,1588561607; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1588565042; s&a=OP%09T%0AZB%07%1C]V%05^TMXC]ZXZ%04%1ARO%0AUW_%04CWV[CRBRMOK^XYBXOU",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
    }
    request_1 = request.Request(url=self.get_id(), headers=headers)
    response = request.urlopen(request_1)
    soup = BeautifulSoup(response.read().decode('utf-8'), 'lxml')
    info_1 = soup.select('li._Vc>div.text._Vc>a')
    for i in range(len(info_1)):
        url = info_1[i]['href']
        track_id = url[url.rfind('/') + 1:]  # keep only the numeric id after the last '/'
        _url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(track_id)
        name = info_1[i]['title']
        info_1[i] = [_url, name]
    return info_1
In fact, we do not need the whole track link, only the trailing id, which gets slotted into this API URL:
https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1
(this URL can be spotted in the browser's network panel).
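The id extraction is just rfind plus a slice; here is a standalone check, using a track path of the same shape as the ones on the album page:

```python
def play_api_url(href):
    # keep everything after the last '/', i.e. the numeric track id
    track_id = href[href.rfind('/') + 1:]
    return 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(track_id)

result = play_api_url('/lishi/13505116/73924292')
print(result)  # https://www.ximalaya.com/revision/play/v1/audio?id=73924292&ptype=1
```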
Opening that URL shows that it returns JSON, and the track's real download link sits inside that JSON.
All that remains is to pull the download link out of it.
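Extracting the link is a plain json.loads() plus two dictionary lookups. The payload below is a stand-in with the same data/src nesting the code relies on, not a real API response:

```python
import json

# stand-in payload; only the data.src nesting matters here
payload = '{"ret": 200, "data": {"src": "https://audio.example.com/sound/123.m4a"}}'
json_1 = json.loads(payload)
src = json_1['data']['src']
print(src)  # https://audio.example.com/sound/123.m4a
```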
3. Final code and results
from urllib import request
from urllib import parse
from bs4 import BeautifulSoup
import threading
import json
import os
import sys
import random


def download(list1: list):
    # Worker: keep taking (url, filename) pairs off the shared list until it is empty.
    while True:
        try:
            list2 = list1.pop()  # pop() is atomic; IndexError means the list just ran out
        except IndexError:
            break
        try:
            request.urlretrieve(url=list2[0], filename='{}.mp3'.format(list2[1]))
        except Exception:
            print('An error occurred!')
        print('{}->thread {} is downloading!'.format('-' * int(random.random() * 10),
                                                     threading.current_thread().getName()))


class Video(object):
    def __init__(self, keyword):
        self.keyword = keyword

    def create_dir(self):
        path = './{}'.format(self.keyword)
        try:
            os.mkdir(path=path)
        except FileExistsError:
            sys.exit()  # exit the program if the folder already exists
        return path

    def get_url(self):
        name = parse.urlencode({'keyword': self.keyword})
        name = name[name.find('=') + 1:]  # keep only the encoded value after '='
        url = 'https://www.ximalaya.com/search/{}'.format(name)  # build the search URL
        return url

    def get_info(self):
        headers = {
            "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
            "accept-language": "zh-CN,zh;q=0.9",
            "cache-control": "max-age=0",
            "cookie": "_xmLog=xm_k9b2hll719h15c; device_id=xm_1587543632882_k9b2hmuqogh2yd; s&e=57b3d99db29b032792574ca15a641c84; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1587543634,1588561607; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1588562033; s&a=OP%09T%0AZB%07%1C]V%05^TMXC]ZXZ%04%1ARO%0AUW_%04CWVQCXZOJRKG^VYYWXN]",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400"
        }
        request_1 = request.Request(url=self.get_url(), headers=headers)
        response = request.urlopen(request_1)
        info = response.read().decode('utf-8')
        soup = BeautifulSoup(info, 'lxml')
        # album links found by the search (a list of tags)
        list_1 = soup.select('div.all-page-search-album._Tt>div.xm-album>div.xm-album-cover__wrapper>a')
        # album titles found by the search
        list_2 = soup.select('div.all-page-search-album._Tt>div.xm-album>a')
        for i in range(len(list_1)):
            url = 'https://www.ximalaya.com' + list_1[i]['href']
            list_1[i] = url
            list_2[i] = list_2[i]['title']
        return list_1, list_2

    def get_id(self):
        list1 = self.get_info()
        list_2 = list1[1]
        for i in range(len(list_2)):
            print('【{}】->{}'.format(i + 1, list_2[i]))
        choice = int(input('Enter the number of the album you want to listen to: '))
        return list1[0][choice - 1]

    def get_info_list(self):
        headers = {
            "cookie": "_xmLog=xm_k9b2hll719h15c; device_id=xm_1587543632882_k9b2hmuqogh2yd; s&e=57b3d99db29b032792574ca15a641c84; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1587543634,1588561607; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1588565042; s&a=OP%09T%0AZB%07%1C]V%05^TMXC]ZXZ%04%1ARO%0AUW_%04CWV[CRBRMOK^XYBXOU",
            "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
        }
        request_1 = request.Request(url=self.get_id(), headers=headers)
        response = request.urlopen(request_1)
        soup = BeautifulSoup(response.read().decode('utf-8'), 'lxml')
        info_1 = soup.select('li._Vc>div.text._Vc>a')
        for i in range(len(info_1)):
            url = info_1[i]['href']
            track_id = url[url.rfind('/') + 1:]  # keep only the numeric id after the last '/'
            _url = 'https://www.ximalaya.com/revision/play/v1/audio?id={}&ptype=1'.format(track_id)
            name = info_1[i]['title']
            info_1[i] = [_url, name]
        return info_1

    def Downloads(self):
        list1 = self.get_info_list()
        path = self.create_dir()
        for i in range(len(list1)):
            url = list1[i][0]
            headers = {
                "cookie": "_xmLog=xm_k9b2hll719h15c; device_id=xm_1587543632882_k9b2hmuqogh2yd; s&e=57b3d99db29b032792574ca15a641c84; x_xmly_traffic=utm_source%253A%2526utm_medium%253A%2526utm_campaign%253A%2526utm_content%253A%2526utm_term%253A%2526utm_from%253A; Hm_lvt_4a7d8ec50cfd6af753c4f8aee3425070=1587543634,1588561607; Hm_lpvt_4a7d8ec50cfd6af753c4f8aee3425070=1588562015; s&a=OP%09T%0AZB%07%1C]V%05^TMXC]ZXZ%04%1ARO%0AUW_%04CWVQCXYOCUVZZTTO@WL",
                "referer": "https://www.ximalaya.com/lishi/13505116/73924292",
                "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3756.400 QQBrowser/10.5.4039.400",
            }
            request_1 = request.Request(url=url, headers=headers)
            response = request.urlopen(request_1)
            json_1 = json.loads(response.read().decode('utf-8'))
            list1[i][0] = json_1['data']['src']  # the real audio download link
            list1[i][1] = path + '/{}'.format(list1[i][1])
        threading_list = []
        for i in range(5):  # create five worker threads
            threading_1 = threading.Thread(target=download, args=(list1,))
            threading_1.start()
            threading_list.append(threading_1)
        for i in threading_list:
            i.join()
        print('Download complete! Current thread: {}'.format(threading.current_thread().getName()))


if __name__ == '__main__':
    video = Video(input('Enter the name of the album you want to listen to: '))
    video.Downloads()
After the download finishes, a new folder appears alongside the script, and the downloaded tracks are inside it.
4. Summary
This project still has plenty of rough edges. For example, it does not use an IP proxy pool, so scraping requests can fail fairly easily. I have also only run it a handful of times, so I do not know what other errors might surface; if you hit one, please leave a comment below.
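One more improvement worth mentioning: the workers hand-roll a producer/consumer pattern by popping from a shared list, but queue.Queue is the idiomatic, thread-safe tool for this. The sketch below simulates the download step so it runs offline; in the real script, the results.append line would be replaced by the request.urlretrieve call:

```python
import queue
import threading

def worker(q, results, lock):
    # drain the queue; queue.Empty from get_nowait() ends the thread cleanly
    while True:
        try:
            url, name = q.get_nowait()
        except queue.Empty:
            break
        # real code would call request.urlretrieve(url, '{}.mp3'.format(name)) here
        with lock:
            results.append(name)
        q.task_done()

q = queue.Queue()
for i in range(20):
    q.put(('https://audio.example.com/{}.m4a'.format(i), 'track{:02d}'.format(i)))

results, lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(q, results, lock)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 20
```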
If you found this article worthwhile, please give it a like. Thanks!