Python crawler practical case - music crawler, paid songs are still available

Because many music platforms now charge for downloading songs, I have no car music to listen to. So I taught myself how to crawl and made this simple music crawler. It's not a music crawler from those big platforms, it's a crawler from an unknown small music website. Let’s get down to business:

First of all, I was looking for a music website that was not one of those big Internet companies. After my unremitting efforts, I finally found a pheasant music website with a relatively complete range of songs (please allow me to say this). Although it is a pheasant, it has all the new songs and popular songs from popular singers. Although a sparrow is small, it has all the internal organs.

Next, we need to capture the packets of the website and look for patterns in web links. Of course, the patterns of web links in this small website are also very simple. As shown below:

ba08bbb7d7bb4757b8c4a66c63db6fa4.png

348ac4b698a44c229f25d1b38c634ef0.png

It is not difficult to find the pattern of its web page URL. Just take the unknown URL in front, add the name of the singer you want to search for, and then add .html. Ever since, the problem of url in the code has been solved. code show as below:

import requests
name = input()
url = 'http://www.2t58.com/so/{}/1.html'.format(name)
response = requests.get(url = url)

The above code is completely feasible at present. Although I have not added heards and other related anti-crawling parameters, the website has not been anti-crawled so far.

Next, let’s click on a few songs:

2835c96841824584b1443cf60e36901c.png

 

d3fad3955f774d7f8d1e2a633e88deed.png

 7bc5fa96810246ee94e8917607302844.png

 

Because there are so many friends who like Jay Chou, I searched for Jay Chou and clicked on a few of his songs. We can find that the web address in the web page column has become irregular, and it is a string of English words that we cannot understand. letters plus .html. And if we want to download songs, we must enter this page, so we must find the pattern of its web address, or we must find which data packet the irregular string of English letters can be obtained from. Next, I captured the data and analyzed it. 

24b277c81c0a4e558ddebfa7535e4e3d.png

It is easy to find that the last field in the href of each a tag is the irregular string of English letters we want, so what we have to do is to check its url, and then send a request through the requests module. The webpage was requested. After my attempts, the website was reverse-crawled here. If heards are not added, the data will not be requested. Please pay attention. Then, we use regular expressions to match and retain the English letters we need. The code is as follows:

ex = '<div class="name"><a href="/song/(.*?).html" target="_mp3">.*?</a></div>'
musicIndex = re.findall(ex, response.text, re.S)

Next, just go back to the download link of the song and save it in binary form. Just click on a song to capture the song details page.

4a875de1a077428185965ed20289b405.png

There are a lot of data packets. We can selectively view them according to the type of data packets. Audio and video are obviously media type data packets. Obviously this is the music link we want. We search it in a separate web page bar to get the complete music that can be downloaded directly without restrictions. But what we want is batch downloading, and the downloading efficiency of each song is too slow. Therefore, we continue to look for a data packet whose response contains the link to this song. What I did was to search for the corresponding keyword in the packet search bar for the last .mp3 name. Fortunately, I found it.

 

 4528c6e317474c928552ca85bb444c2c.png

862c5fdc8c7b4b4eba76acad7825d417.png

 The next step is very simple. We have found the pattern of the details page of each singer to be searched (fixed URL + singer name + .html), and we have obtained the details page of each song (that paragraph has no pattern. English letters), and finally found the data package containing the song link on the song details page, so what we have to do next is to save the song link in binary. Students who have learned crawlers should have noticed that the above data package The response is json data, which returns data in the form of a dictionary. We can retrieve the download link of the song we need based on the key-value pair. The first step is to request this data packet according to the corresponding URL. Pay attention to check the parameters of the corresponding URL in the request header. The code is as follows:

data = {'id': i,'type': 'music'}

url2 = 'http://www.2t58.com/js/play.php'

response2 = requests.post(url = url2, headers = headers, data = data)

json_data = response2.json()

musicList = json_data['url']

We save the download link of the song in the list musicList, and finally save the song. The code is as follows:

musicResponse = requests.get(url = musicList)

filename = json_data['title'] + '.mp3'

with open('E:/music/' + filename, 'wb') as f:
    f.write(musicResponse.content)
    print(filename + '下载成功!')

If you change the path, it will run normally.

 

Complete code:

import requests
import re

# 张学友:aHNj
# 陈奕迅:eG4
# 林忆莲:d2t3eA



name = input()
url = 'http://www.2t58.com/so/{}/1.html'.format(name)
response = requests.get(url = url)
ex = '<div class="name"><a href="/song/(.*?).html" target="_mp3">.*?</a></div>'
musicIndex = re.findall(ex, response.text, re.S)
smallmusicList = []
for j in range(0, 5):
    smallmusicList.append(musicIndex[j])
print(smallmusicList)



headers ={
'Accept':'application/json, text/javascript, */*; q=0.01',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'zh-CN,zh;q=0.9',
'Connection':'keep-alive',
'Content-Length':'26',
'Content-Type':'application/x-www-form-urlencoded; charset=UTF-8',
'Cookie':'Hm_lvt_b8f2e33447143b75e7e4463e224d6b7f=1690974946; Hm_lpvt_b8f2e33447143b75e7e4463e224d6b7f=1690976158',
'Host':'www.2t58.com',
'Origin':'http://www.2t58.com',
'Referer':'http://www.2t58.com/song/bWhzc3hud25u.html',
'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36',
'X-Requested-With':'XMLHttpRequest'
}


for i in musicIndex:
    data = {'id': i,'type': 'music'}

    url2 = 'http://www.2t58.com/js/play.php'

    response2 = requests.post(url = url2, headers = headers, data = data)

    json_data = response2.json()

    musicList = json_data['url']

    musicResponse = requests.get(url = musicList)

    filename = json_data['title'] + '.mp3'

    with open('E:/music/' + filename, 'wb') as f:
        f.write(musicResponse.content)
        print(filename + '下载成功!')


Maybe because the website is not perfect enough, some singers' detailed webpages are not the singer's name + .html, but some irregular English letters. What I have found temporarily has been placed above. If you want to download the singer's songs, enter the corresponding English alphabet.

 

For learning purposes only, please do not use it for illegal purposes.

 

 

 

 

 

Guess you like

Origin blog.csdn.net/qq_64241302/article/details/132167076