Web Scraping Project 12: Scraping Kugou Music

Scraping and downloading songs from Kugou Music

Goal

Scrape Kugou Music and download songs through the Kugou Music API.

Project setup

Software: PyCharm
Third-party libraries: requests, fake_useragent, selenium (plus re from the standard library)
Website: https://www.kugou.com/

Project analysis

API endpoint: http://mobilecdn.kugou.com/api/v3/search/song?format=json&keyword=<song or artist name>&page=1&pagesize=20&showtype=1
Note: source: https://blog.csdn.net/qq_32551929/article/details/87256150
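
To make the query parameters explicit, here is a minimal sketch that builds the search URL with the standard library. The helper name build_search_url is hypothetical; the parameter names are taken from the endpoint above, and percent-encoding the keyword is an assumption about what the server accepts (a browser does the same when the URL is pasted into the address bar).

```python
from urllib.parse import urlencode

API_BASE = "http://mobilecdn.kugou.com/api/v3/search/song"


def build_search_url(keyword, page=1, pagesize=20):
    """Return the search URL for one page of results (hypothetical helper)."""
    params = {
        "format": "json",
        "keyword": keyword,  # song or artist name; urlencode percent-encodes it
        "page": page,
        "pagesize": pagesize,
        "showtype": 1,
    }
    return API_BASE + "?" + urlencode(params)
```

Calling build_search_url("付辛博", page=2) would give the URL for the second page of results for that artist.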

Here the artist name entered is 付辛博.
Open the URL and take a look.
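The response lists many of this artist's song titles, but nothing that looks like an mp3 link. So the question is how to use the API.
Open a browser, type 付辛博 into the Kugou Music search box, and open any song.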

Let's open the first one.

Here is the key point: how this song's URL is constructed.

https://www.kugou.com/song/#hash=C789A79D43105772F2783807A0D33B19&album_id=1740630

The values after hash= and album_id= both appear in the API response we just looked at.
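This gives us the page URL of every song; we only need to concatenate the two values.
However, the page URL is not enough. We also need the address of each song's audio file.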
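
The URL construction described above can be sketched as follows. The field names (data, info, songname, singername, hash, album_id) are the ones this article's code reads from the API response; the sample dictionary is hand-written for illustration and is not real API output, apart from the hash and album_id quoted earlier. The helper name song_links is hypothetical.

```python
SONG_PAGE = "https://www.kugou.com/song/#hash={hash}&album_id={album_id}"


def song_links(payload):
    """Yield (title, page_url) pairs from a parsed search response."""
    for item in payload["data"]["info"]:
        title = "{} {}".format(item["songname"], item["singername"])
        yield title, SONG_PAGE.format(hash=item["hash"], album_id=item["album_id"])


# Hand-written sample in the assumed response shape.
sample = {
    "data": {
        "info": [
            {
                "songname": "Example Song",
                "singername": "Example Artist",
                "hash": "C789A79D43105772F2783807A0D33B19",
                "album_id": "1740630",
            }
        ]
    }
}
links = list(song_links(sample))
```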

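The audio address can be found in the rendered page, but because it is loaded dynamically by JavaScript it does not appear in the raw page source at all, a problem I have run into before. Kugou Music also uses various kinds of encryption, and with my limited knowledge I tried for a long time without figuring it out. The only option left was to use selenium to obtain the song's source address.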
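
Once selenium has rendered the page, pulling the audio address out of the page source is a plain regex job. The sketch below assumes the rendered page contains an audio tag with a src attribute, as the regex later in this article implies; the markup and the example.com URL are made up for illustration, and extract_audio_urls is a hypothetical helper name.

```python
import html
import re

# Looser than matching the whole tag literally: attribute order may vary.
AUDIO_SRC = re.compile(r'<audio\b[^>]*\bsrc="([^"]+)"')


def extract_audio_urls(page_source):
    """Return the src of every <audio> tag that has a non-empty src."""
    return [html.unescape(src) for src in AUDIO_SRC.findall(page_source)]


# Made-up rendered markup; example.com is a placeholder host.
rendered = '<audio class="music" id="myAudio" src="https://example.com/a.mp3?x=1&amp;y=2"></audio>'
urls = extract_audio_urls(rendered)
```

If the page source is read before the script has filled in src, the list simply comes back empty, which is easier to detect than a half-matched string.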

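Anti-scraping analysis

Repeated requests from the same IP address risk getting that IP banned. Here fake_useragent is used to generate a random User-Agent request header. Note that this only varies the header; the IP address itself stays the same.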
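
As a rough illustration of the idea, the sketch below picks a User-Agent at random each time a headers dict is built. It swaps in a small hard-coded pool and random.choice as a stand-in for fake_useragent's ua.random, so it runs without the third-party package; the pool contents and the helper name random_headers are assumptions for the example.

```python
import random

# A tiny hand-picked pool standing in for fake_useragent's ua.random.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(rng=random):
    """Build a fresh headers dict with a randomly chosen User-Agent."""
    return {"User-Agent": rng.choice(USER_AGENTS)}


headers = random_headers(random.Random(0))
```

Calling random_headers() once per request, rather than once per run, is what actually makes consecutive requests carry different User-Agent values.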

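Implementation

1. Import the required libraries, define a class that inherits from object, and give it an __init__ method and a main method (both take self as their first argument).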

import re

import requests
from fake_useragent import UserAgent
from selenium import webdriver


class kugou(object):
    def __init__(self):
        # Search API; the keyword is filled in by main().
        self.url = 'http://mobilecdn.kugou.com/api/v3/search/song?format=json&keyword={}'
        ua = UserAgent(verify_ssl=False)
        # One random User-Agent, reused for every request in this run.
        self.headers = {'User-Agent': ua.random}
        self.driver = webdriver.Chrome()

    def main(self):
        pass


if __name__ == '__main__':
    spider = kugou()
    spider.main()

2. Send the request and get the response.

    def get_html(self, url):
        # The search API returns JSON, so decode it directly.
        response = requests.get(url, headers=self.headers)
        html = response.json()
        return html
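
The request above is made once, with no retry if it fails. A minimal sketch of one way to add retries is shown below; it is not part of the original spider. The helper with_retries is hypothetical, and fetch stands for any zero-argument callable, for example lambda: requests.get(url, headers=headers, timeout=10).json().

```python
import time


def with_retries(fetch, attempts=3, delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except Exception as exc:  # in real code, catch requests.RequestException
            last_error = exc
            if attempt < attempts:
                sleep(delay * attempt)  # wait a little longer each time
    raise last_error
```

The sleep function is a parameter only so the helper can be exercised without real waiting.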

3. Use the API to get the page URL of each song.

    def get_link(self, html):
        # Each entry in data.info describes one song.
        content_list = html['data']['info']
        for data in content_list:
            singername = data['singername']
            songname = data['songname']
            album_id = data['album_id']
            song_hash = data['hash']  # renamed to avoid shadowing the built-in hash()
            filename = songname + ' ' + singername
            print(filename)
            link = 'https://www.kugou.com/song/#hash=%s&album_id=%s' % (song_hash, album_id)
            self.parse_html(link, filename)

4. Download the songs to local disk in bulk.

    def parse_html(self, link, filename):
        self.driver.get(link)
        # The rendered page source contains the .mp3 address.
        new_html = self.driver.page_source
        musics = re.compile('<audio class="music" id="myAudio" src="(.*?)"').findall(new_html)
        for music_url in musics:
            r = requests.get(music_url, headers=self.headers)
            # The target directory must already exist.
            with open("F:/pycharm文件/music/" + filename + '.mp3', 'wb') as f:
                f.write(r.content)
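
Because the file name is built directly from the song and artist names, a title containing characters such as / or ? would make open() fail on Windows. The sketch below shows one possible way to clean the name first; safe_filename is a hypothetical helper that is not part of the original code.

```python
import re

# Characters that Windows does not allow in file names, plus control chars.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\x00-\x1f]')


def safe_filename(name, replacement="_"):
    """Replace characters that cannot appear in a Windows file name."""
    cleaned = _ILLEGAL.sub(replacement, name).strip(" .")
    return cleaned or "untitled"
```

In parse_html it could be applied as safe_filename(filename) + '.mp3' before the path is joined.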

5. The main function and the calls it makes.

    def main(self):
        name = str(input("Enter a song or artist name: "))
        host = self.url.format(name) + '&page={}&pagesize=20&showtype=1'
        end_page = int(input("How many pages to scrape: "))
        for page in range(1, end_page + 1):
            url = host.format(page)
            print("Page %s ..." % page)
            try:
                html = self.get_html(url)
                # get_link downloads every song itself via parse_html.
                self.get_link(html)
            except Exception as e:
                print("Skipping page %s: %s" % (page, e))

Results

Open the output folder.

Play any song to check.

It works.
The complete code is as follows:

import re

import requests
from fake_useragent import UserAgent
from selenium import webdriver


class kugou(object):
    def __init__(self):
        # Search API; the keyword is filled in by main().
        self.url = 'http://mobilecdn.kugou.com/api/v3/search/song?format=json&keyword={}'
        ua = UserAgent(verify_ssl=False)
        # One random User-Agent, reused for every request in this run.
        self.headers = {'User-Agent': ua.random}
        self.driver = webdriver.Chrome()

    def __del__(self):
        self.driver.quit()  # shut the browser down when the spider is done

    def get_html(self, url):
        # The search API returns JSON, so decode it directly.
        response = requests.get(url, headers=self.headers)
        html = response.json()
        return html

    def get_link(self, html):
        # Each entry in data.info describes one song.
        content_list = html['data']['info']
        for data in content_list:
            singername = data['singername']
            songname = data['songname']
            album_id = data['album_id']
            song_hash = data['hash']  # renamed to avoid shadowing the built-in hash()
            filename = songname + ' ' + singername
            print(filename)
            link = 'https://www.kugou.com/song/#hash=%s&album_id=%s' % (song_hash, album_id)
            self.parse_html(link, filename)

    def parse_html(self, link, filename):
        self.driver.get(link)
        # The rendered page source contains the .mp3 address.
        new_html = self.driver.page_source
        musics = re.compile('<audio class="music" id="myAudio" src="(.*?)"').findall(new_html)
        for music_url in musics:
            r = requests.get(music_url, headers=self.headers)
            # The target directory must already exist.
            with open("F:/pycharm文件/music/" + filename + '.mp3', 'wb') as f:
                f.write(r.content)

    def main(self):
        name = str(input("Enter a song or artist name: "))
        host = self.url.format(name) + '&page={}&pagesize=20&showtype=1'
        end_page = int(input("How many pages to scrape: "))
        for page in range(1, end_page + 1):
            url = host.format(page)
            print("Page %s ..." % page)
            try:
                html = self.get_html(url)
                # get_link downloads every song itself via parse_html.
                self.get_link(html)
            except Exception as e:
                print("Skipping page %s: %s" % (page, e))


if __name__ == '__main__':
    spider = kugou()
    spider.main()

Disclaimer: for my own study and reference only.