Web crawler hands-on project 12: crawling Kugou Music

Goal

Crawl Kugou Music and use the Kugou Music API to download songs.

Project preparation

Software: PyCharm
Third-party libraries: requests, fake_useragent, selenium (re is also used, but it is part of the standard library)
Website address: https://www.kugou.com/

Project Analysis

API interface: http://mobilecdn.kugou.com/api/v3/search/song?format=json&keyword=song name or artist name&page=1&pagesize=20&showtype=1
Note: source: https://blog.csdn.net/qq_32551929/article/details/87256150
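
As a minimal sketch, the search URL above can be assembled from its query parameters with the standard library, so that spaces and Chinese keywords are percent-encoded correctly. The parameter names come from the API URL above; the helper name build_search_url is hypothetical.

```python
from urllib.parse import urlencode

SEARCH_API = "http://mobilecdn.kugou.com/api/v3/search/song"


def build_search_url(keyword, page=1, pagesize=20):
    """Return the search URL for one page of results (hypothetical helper)."""
    params = {
        "format": "json",
        "keyword": keyword,
        "page": page,
        "pagesize": pagesize,
        "showtype": 1,
    }
    # urlencode escapes spaces and non-ASCII characters in the keyword.
    return SEARCH_API + "?" + urlencode(params)


print(build_search_url("Fu Xinbo"))
```

Changing the page argument is then enough to walk through the result pages.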

Use the singer name Fu Xinbo as the keyword and open the resulting URL.
The response lists many of the singer's songs, but it contains no direct .mp3 links, so the API alone is not enough.
Open the browser, enter Fu Xinbo in the Kugou Music search box, and open any song; the first one will do.

The key point is how the URL of this song's page is formed:

https://www.kugou.com/song/#hash=C789A79D43105772F2783807A0D33B19&album_id=1740630

The hash= and album_id= values at the end of this URL both appear in the API response shown earlier.
So the page link of every song can be obtained simply by splicing these two values into the URL.
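
The splicing step can be sketched as a small function. The URL pattern and the example values are the ones shown above; the function name build_song_link is hypothetical.

```python
def build_song_link(song_hash, album_id):
    """Splice a hash and an album_id into a Kugou song page URL."""
    return "https://www.kugou.com/song/#hash={}&album_id={}".format(song_hash, album_id)


print(build_song_link("C789A79D43105772F2783807A0D33B19", 1740630))
```
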
However, the page link is not enough; the address of the actual audio file is still needed.
The address does exist in the page, but it is loaded dynamically by JavaScript, so it does not appear in the raw page source. I have run into this before, and Kugou Music also uses various encryption schemes. Selenium is therefore used to render the page and read the source address of the song.
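
Once the browser has rendered the page, the audio address can be pulled out of the page source with a regular expression. The sketch below uses the same pattern as the code later in this article; the sample HTML string and its example.com URL are made up for illustration, and the real page markup may differ or change over time.

```python
import re

# Pattern taken from this article: the <audio> tag that holds the song address.
AUDIO_PATTERN = re.compile('<audio class="music" id="myAudio" src="(.*?)"')


def extract_audio_urls(page_source):
    """Return every audio src found in the rendered page source."""
    return AUDIO_PATTERN.findall(page_source)


# Made-up sample of what a rendered page might contain.
sample = '<div><audio class="music" id="myAudio" src="https://example.com/song.mp3"></audio></div>'
print(extract_audio_urls(sample))
```

If the list comes back empty, the page has probably not finished loading, or the markup no longer matches the pattern.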

Anti-crawling analysis

Repeated requests from the same IP address risk being blocked. As a basic countermeasure, fake_useragent is used here to generate a random User-Agent request header.
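
To illustrate the idea without the third-party dependency, the sketch below picks a User-Agent at random from a small hand-written pool. This is a stand-in for fake_useragent, not its implementation: the pool and the helper name random_headers are assumptions, and fake_useragent draws from a much larger data set.

```python
import random

# Hypothetical, hand-picked pool of browser User-Agent strings.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:120.0) Gecko/20100101 Firefox/120.0",
]


def random_headers(pool=USER_AGENTS):
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(pool)}


print(random_headers())
```

Calling random_headers() before each request gives every request its own header, rather than one header for the whole run.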

Code

1. Import the required libraries, define a class that inherits from object, and give it an __init__ method and a main method, each taking self.

import re

import requests
from fake_useragent import UserAgent
from selenium import webdriver


class kugou(object):
    def __init__(self):
        self.url = 'http://mobilecdn.kugou.com/api/v3/search/song?format=json&keyword={}'
        # verify_ssl is not passed, since newer fake_useragent releases may reject it.
        ua = UserAgent()
        # One random User-Agent is used for all requests of this run.
        self.headers = {'User-Agent': ua.random}
        self.driver = webdriver.Chrome()

    def main(self):
        pass


if __name__ == '__main__':
    spider = kugou()
    spider.main()

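2. Send the request and decode the JSON response.

    def get_html(self, url):
        response = requests.get(url, headers=self.headers, timeout=10)
        # Fail early on HTTP errors instead of parsing an error page.
        response.raise_for_status()
        # The API returns JSON, so decode it straight into a dict.
        html = response.json()
        return html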

3. Use the API response to build the page URL of each song.

    def get_link(self, html):
        # The song entries are the list stored under data -> info.
        content_list = html['data']['info']
        for data in content_list:
            singername = data['singername']
            songname = data['songname']
            album_id = data['album_id']
            song_hash = data['hash']
            filename = songname + ' ' + singername
            print(filename)
            link = 'https://www.kugou.com/song/#hash=%s&album_id=%s' % (song_hash, album_id)
            self.parse_html(link, filename)
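
The parsing part of this step can also be written as a pure function that is easy to check without touching the network. The field names singername, songname, album_id and hash are the ones used above; the sample payload is made up (only the hash and album_id are taken from the example URL earlier), and the function name extract_songs is hypothetical.

```python
def extract_songs(payload):
    """Return a list of (filename, link) pairs from a decoded search response."""
    results = []
    # Missing keys fall back to an empty list instead of raising KeyError.
    for item in payload.get("data", {}).get("info", []):
        filename = "{} {}".format(item["songname"], item["singername"])
        link = "https://www.kugou.com/song/#hash={}&album_id={}".format(
            item["hash"], item["album_id"]
        )
        results.append((filename, link))
    return results


# Made-up response with a single entry, for illustration only.
sample_payload = {
    "data": {
        "info": [
            {
                "songname": "Song A",
                "singername": "Singer A",
                "hash": "C789A79D43105772F2783807A0D33B19",
                "album_id": 1740630,
            }
        ]
    }
}
print(extract_songs(sample_payload))
```
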

4. Download each song to the local disk.

    def parse_html(self, link, filename):
        self.driver.get(link)
        # The rendered page source contains the .mp3 address.
        new_html = self.driver.page_source
        musics = re.compile('<audio class="music" id="myAudio" src="(.*?)"').findall(new_html)
        for music_url in musics:
            r = requests.get(music_url, headers=self.headers)
            # The target directory must already exist.
            with open("F:/pycharm文件/music/" + filename + '.mp3', 'wb') as f:
                f.write(r.content)
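
Because the file name is built from the song and singer names, it may contain characters that are not allowed in Windows file names, such as / or ?. One possible safeguard, sketched below, is to replace those characters before opening the file. The helper name safe_filename and the "untitled" fallback are assumptions and are not part of the original code.

```python
import re

# Characters that Windows does not allow in file names.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_filename(name):
    """Replace characters that are invalid in Windows file names with '_'."""
    cleaned = INVALID_CHARS.sub("_", name).strip()
    # Fall back to a fixed name if nothing usable is left.
    return cleaned or "untitled"


print(safe_filename("AC/DC: Live?"))
```
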

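5. Main function and function calls.

    def main(self):
        name = input("Enter a song or singer name: ")
        host = self.url.format(name) + '&page={}&pagesize=20&showtype=1'
        end_page = int(input("Number of pages to crawl: "))
        for page in range(1, end_page + 1):
            url = host.format(page)
            print("Page %s ..." % page)
            try:
                html = self.get_html(url)
                # get_link calls parse_html for every song on the page.
                self.get_link(html)
            except Exception as e:
                # Report the failure and continue with the next page.
                print("Page %s failed: %s" % (page, e))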

Result

Open the download folder and play one of the songs to check. It works.
The complete code is as follows:

import re

import requests
from fake_useragent import UserAgent
from selenium import webdriver


class kugou(object):
    def __init__(self):
        self.url = 'http://mobilecdn.kugou.com/api/v3/search/song?format=json&keyword={}'
        # verify_ssl is not passed, since newer fake_useragent releases may reject it.
        ua = UserAgent()
        # One random User-Agent is used for all requests of this run.
        self.headers = {'User-Agent': ua.random}
        self.driver = webdriver.Chrome()

    def __del__(self):
        # Shut the browser down when the spider is finished.
        self.driver.quit()

    def get_html(self, url):
        response = requests.get(url, headers=self.headers, timeout=10)
        # Fail early on HTTP errors instead of parsing an error page.
        response.raise_for_status()
        # The API returns JSON, so decode it straight into a dict.
        html = response.json()
        return html

    def get_link(self, html):
        # The song entries are the list stored under data -> info.
        content_list = html['data']['info']
        for data in content_list:
            singername = data['singername']
            songname = data['songname']
            album_id = data['album_id']
            song_hash = data['hash']
            filename = songname + ' ' + singername
            print(filename)
            link = 'https://www.kugou.com/song/#hash=%s&album_id=%s' % (song_hash, album_id)
            self.parse_html(link, filename)

    def parse_html(self, link, filename):
        self.driver.get(link)
        # The rendered page source contains the .mp3 address.
        new_html = self.driver.page_source
        musics = re.compile('<audio class="music" id="myAudio" src="(.*?)"').findall(new_html)
        for music_url in musics:
            r = requests.get(music_url, headers=self.headers)
            # The target directory must already exist.
            with open("F:/pycharm文件/music/" + filename + '.mp3', 'wb') as f:
                f.write(r.content)

    def main(self):
        name = input("Enter a song or singer name: ")
        host = self.url.format(name) + '&page={}&pagesize=20&showtype=1'
        end_page = int(input("Number of pages to crawl: "))
        for page in range(1, end_page + 1):
            url = host.format(page)
            print("Page %s ..." % page)
            try:
                html = self.get_html(url)
                # get_link calls parse_html for every song on the page.
                self.get_link(html)
            except Exception as e:
                # Report the failure and continue with the next page.
                print("Page %s failed: %s" % (page, e))


if __name__ == '__main__':
    spider = kugou()
    spider.main()

Disclaimer: this article is intended only as a reference for self-study.

Origin blog.csdn.net/qq_44862120/article/details/107993521