Example of a crawler that downloads mp3 files

Obtaining a data set for model training is often a headache. In the CV field you can carry a camera, set up a tripod, and capture the data yourself (I have done this before), but what about NLP? Especially after large models such as ChatGPT were released, the demand for text and other data has grown even greater. If no ready-made data set exists, it is usually difficult to create the data yourself, so crawling is one of the means to obtain data (but a reminder: obtain data legally).
So let's take the simple task of batch-downloading mp3 files as an example.

Suppose we want to get all the music files on the NetEase Cloud Music Soaring Chart.
The address is: https://music.163.com/#/discover/toplist?id=19723756
First, open the developer tools with F12.
Select the Network tab, paste a song name into the search box, and click the Clear button to remove all existing request information.
Then refresh the page, and you can see that many new requests appear. We restart the capture before triggering the request because the requests recorded a moment ago may have been captured mid-load and be incomplete; the freshly captured list is more comprehensive.
Click the search result on the left to jump to the matched location: each song title sits inside an a tag nested in an li tag. Next, we inspect the request information, issue a GET request for the page, and print the result.

Select the Headers tab and copy two pieces of information: the request URL, and the user-agent under Request Headers. With these two pieces copied, we can start the code below:

import requests
import re   # regular-expression library, used later to extract song ids and titles

# Use the plain page URL (without the "#/" fragment) so the server returns the full HTML
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
    # Send a browser-like user-agent so the request is not rejected
    "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
print(response.text)

After running, the fetched HTML is printed out, and we can start extracting the content we want. Use Ctrl+F to locate a song title and you can see it is wrapped in an li tag. Since what we want to download are mp3 files, and the mp3 download address contains an id that corresponds one-to-one with the song title, we use a for loop to obtain each title and id and download the corresponding mp3 file.
Take the song "双星" ("Double Star") as an example. Its markup looks like this: <li><a href="/song?id=2068206782">双星</a></li>, so we can write a regular expression that matches the tag structure of every song title: <li><a href="/song\?id=(\d+)">(.*?)</a>. The code is as follows:

html_data = re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text)
# print(html_data)
for num_id, title in html_data:
    music_url = f"http://music.163.com/song/media/outer/url?id={num_id}.mp3"  # mp3 file address
    music_content = requests.get(url=music_url, headers=headers).content
    with open("/home/alpha/桌面/results/" + title + ".mp3", mode="wb") as f:   # save each mp3 file
        f.write(music_content)
    print(num_id, title)
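Before running the pattern against the live page, it can be sanity-checked offline against the sample snippet quoted earlier. This is a small standalone test I am adding for illustration; it is not part of the original script:

```python
import re

# The tag structure quoted above for the song "双星"
sample = '<li><a href="/song?id=2068206782">双星</a></li>'

# Two capture groups: the numeric song id and the song title.
# With multiple groups, re.findall returns a list of tuples.
pattern = r'<li><a href="/song\?id=(\d+)">(.*?)</a>'

print(re.findall(pattern, sample))  # → [('2068206782', '双星')]
```

Seeing the expected (id, title) tuple here confirms the pattern will capture both fields correctly when applied to the full page.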

Run result: each song's id and title are printed as its mp3 file is saved.
In this way, all mp3 files on the current page are crawled.
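One practical caveat: the script uses the song title directly as a file name, and titles can contain characters such as "/" or "?" that are illegal in file names, which would make `open()` fail. A minimal helper to sanitize titles before saving might look like this (`safe_filename` is a hypothetical name I am introducing, not part of the original post):

```python
import re

def safe_filename(title: str) -> str:
    # Replace characters that are illegal in file names on common
    # filesystems (covers both Linux and Windows restrictions)
    # with an underscore, and trim surrounding whitespace.
    return re.sub(r'[\\/:*?"<>|]', "_", title).strip()

print(safe_filename('AC/DC: "Back in Black"?'))  # → AC_DC_ _Back in Black__
print(safe_filename('双星'))                      # unchanged: 双星
```

In the download loop, the open call would then become `open("/home/alpha/桌面/results/" + safe_filename(title) + ".mp3", mode="wb")`.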

Origin blog.csdn.net/weixin_45354497/article/details/132734146