Obtaining a data set for model training can be a real headache. People in the CV field can carry a camera, set up a tripod, and capture data themselves (I have done this before), but what about NLP? Especially since large models such as ChatGPT were released, the demand for text and similar data has only grown. If there is no ready-made data set, creating the data yourself is basically impractical, so crawling is considered one of the means of obtaining data (but a reminder: obtain the data legally).
So let's take batch downloading of mp3 files as a simple example.
Suppose we want to get all the music files on the NetEase Cloud Music Soaring Chart.
The address is: https://music.163.com/#/discover/toplist?id=19723756
First open the developer tools with F12, select the Network tab, and click the Clear button to clear all existing request information.
Then refresh the page, and you can see that a lot of new request information appears. Capture again after the refresh, because the requests recorded before it may have been delayed or incomplete; the re-captured list is more comprehensive. Type a song name into the search box to filter the requests.
Click the search result on the left to see where it is located: the song title sits in an a tag inside an li tag. Next, we send a GET request for the page and print the result.
Select Headers to obtain two pieces of information: one is the request URL, and the other is the user-agent under Request Headers. Note that the URL used in the code drops the # from the address-bar URL; the fragment is never sent to the server, so the real request URL is https://music.163.com/discover/toplist?id=19723756. Copy these two pieces of information into the code below:
import requests
import re  # regular-expression library
url = "https://music.163.com/discover/toplist?id=19723756"
headers = {
"user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
print(response.text)
After running, the obtained page source is printed out. Press Ctrl+F to search for a song title, and you can see that it is wrapped in an li tag. Since what we want to download are files in mp3 format, note that the mp3 download address contains an id, and each id corresponds one-to-one to a song title. So we use a for loop to obtain each song title and id and download the corresponding mp3 file.
Take the song "双星" ("Double Star") as an example. Its markup looks like this: <li><a href="/song?id=2068206782">双星</a></li>, so we can write a regular expression that matches the tag structure of every song title: <li><a href="/song\?id=(\d+)">(.*?)</a>. The code is as follows:
html_data = re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', response.text)
# print(html_data)
for num_id, title in html_data:
    music_url = f"http://music.163.com/song/media/outer/url?id={num_id}.mp3"  # mp3 file address
    music_content = requests.get(url=music_url, headers=headers).content
    with open("/home/alpha/桌面/results/" + title + ".mp3", mode="wb") as f:  # download each mp3 file
        f.write(music_content)
    print(num_id, title)
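The regular-expression step can be verified on its own against the sample tag shown earlier, without making any network request:

```python
import re

# Sample list item copied from the page source (the "双星" example above)
sample = '<li><a href="/song?id=2068206782">双星</a></li>'

# Same pattern as in the script: \? matches the literal question mark,
# (\d+) captures the numeric id, (.*?) captures the song title
matches = re.findall(r'<li><a href="/song\?id=(\d+)">(.*?)</a>', sample)
print(matches)  # [('2068206782', '双星')]
```

Each match is an (id, title) tuple, which is exactly what the for loop in the script unpacks.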
Operation result:
In this way, all mp3 files on the current chart page are crawled.
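One caveat: the script uses the song title directly as a file name, and a title containing characters such as / or ? would make open fail or write to an unexpected path. A minimal sketch of a sanitizing helper (the name safe_filename and the choice of replacement character are my own assumptions, not part of the original script):

```python
import re

def safe_filename(title: str) -> str:
    # Hypothetical helper: replace characters that are unsafe in file names
    # on common filesystems with underscores, then trim surrounding spaces.
    return re.sub(r'[\\/:*?"<>|]', '_', title).strip()

# A title containing a slash would otherwise be treated as a directory:
print(safe_filename('A/B: C'))  # A_B_ C
```

In the download loop, the open call would then use safe_filename(title) instead of title when building the output path.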