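Recently I noticed that my kid likes listening to the Three Character Classic (《三字经》) before bed. I didn't want to hand him a phone, so I decided to download the audio from Ximalaya FM (喜马拉雅FM).
I found an article online that describes how to scrape it: https://blog.csdn.net/majiexiong/article/details/81949388
When I tried it, though, the scrape failed. Looking at the source of https://www.ximalaya.com/ertong/15161417/ showed that the page markup had changed: the ul whose class was dOi2 in the original article now has the class rC5T, so this line from the original no longer matches anything: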
href_list = html_ele.xpath('//ul[@class="dOi2"]/li/div[2]/a/@href')
Just change the class name used in the original article, and running the program yields the audio files in m4a format.
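Class names such as dOi2 and rC5T look auto-generated, so they may well change again. One less brittle option is to ignore the class entirely and pick track links by the shape of their href. The following is only a sketch under an assumption not confirmed here: that track links on the album page look like /ertong/<album id>/<numeric track id>. The helper name track_ids and the regular expression are hypothetical, not part of the original script.

```python
import re

# Assumption: a track link has the form /ertong/<album id>/<numeric track id>.
# The album page itself and pagination links (p2/) do not match this pattern.
TRACK_HREF = re.compile(r'href="/ertong/(\d+)/(\d+)/?"')


def track_ids(html_str, album_id):
    """Return the track ids linked from the page, in page order, without duplicates."""
    found = []
    for album, track in TRACK_HREF.findall(html_str):
        if album == album_id and track not in found:
            found.append(track)
    return found
```

Calling track_ids(response.text, '15161417') would then stand in for the XPath query plus the href.split('/')[-1] step. The script itself keeps the class-based XPath, with the class name updated: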
import requests
from lxml import etree
from urllib import request
import os
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 SE 2.X MetaSr 1.0'
}
url = 'https://www.ximalaya.com/ertong/15161417/'  # The album is small; to scrape the second page, append p2/ to this URL
response = requests.get(url,headers=headers)
html_str = response.text
html_ele = etree.HTML(html_str)
href_list = html_ele.xpath('//ul[@class="rC5T"]/li/div[2]/a/@href')
# Create the output folder on first run
if not os.path.exists('mjx'):
    os.mkdir('mjx')
for href in href_list:
    # The last path segment of each link is the track id
    next_href = href.split('/')[-1]
    xiangqing_url = 'https://www.ximalaya.com/revision/play/tracks?trackIds=' + str(next_href)
    print(xiangqing_url)
    # The detail endpoint returns JSON holding the audio URL and the track title
    response = requests.get(xiangqing_url, headers=headers)
    json_dict = response.json()
    src_str = json_dict['data']['tracksForAudioPlay'][0]['src']
    trackName = json_dict['data']['tracksForAudioPlay'][0]['trackName']
    # Save the audio under the track title
    request.urlretrieve(src_str, 'mjx/' + trackName + '.m4a')
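Two small things can trip the script up: the second page has to be fetched by hand, and a track title containing a character such as / or ? is not a valid file name. Here is a minimal sketch of helpers for both. The names page_url and safe_filename are hypothetical, and the p2/ pattern is taken from the comment on the url line; whether later pages follow the same pattern is an assumption.

```python
import re


def page_url(album_url, page):
    """Album URL for a 1-based page number; page 1 is the bare album URL."""
    base = album_url.rstrip('/') + '/'
    # Assumption: page N (N > 1) lives at <album url>pN/, as with p2/ for page 2.
    return base if page <= 1 else base + 'p' + str(page) + '/'


def safe_filename(track_name, ext='.m4a'):
    """Replace characters that are not allowed in file names on Windows or POSIX."""
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', '_', track_name).strip(' .')
    # Fall back to a fixed name if nothing usable is left.
    return (cleaned or 'untitled') + ext
```

With these, the download line could become request.urlretrieve(src_str, os.path.join('mjx', safe_filename(trackName))), and the whole loop could be run once per page_url(url, n).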
Then run the files through any format converter to get mp3 audio. Done!
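If ffmpeg happens to be installed, the conversion step can also be scripted instead of using a separate converter. This is only a sketch: it assumes ffmpeg is on the PATH and that the files sit in the mjx folder created above. The function names ffmpeg_command and convert_folder are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def ffmpeg_command(src, dst):
    """Build the ffmpeg argument list that re-encodes src into dst."""
    # -n: refuse to overwrite an existing output file.
    # The output format is inferred from the .mp3 extension of dst.
    return ['ffmpeg', '-n', '-i', str(src), str(dst)]


def convert_folder(folder='mjx'):
    """Convert every .m4a in folder to an .mp3 beside it and return the output paths."""
    if shutil.which('ffmpeg') is None:
        raise RuntimeError('ffmpeg not found on PATH')
    outputs = []
    for src in sorted(Path(folder).glob('*.m4a')):
        dst = src.with_suffix('.mp3')
        if not dst.exists():  # skip files converted on an earlier run
            subprocess.run(ffmpeg_command(src, dst), check=True)
        outputs.append(dst)
    return outputs
```

Running convert_folder('mjx') after the download loop would leave an mp3 next to each m4a.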