[Crawler] Crawling NetEase Cloud Music with Python to search for and download songs!

1. Preparations

When I tried NetEase Cloud Music, I found that it is a dynamic page whose content is generated by JavaScript, so it is hard to crawl directly. This time we will let a third-party website "help" us do the crawling.
I found a third-party site that can be used to look up song IDs. We crawl its page source and extract the IDs from it (a bit roundabout, admittedly).

2. "On-site" observation

Opening the site, we find that it can search five download sources:
(Screenshot: download sources)
Today our goal is to download songs from NetEase Cloud Music. Interested readers can try crawling songs from the other sources; the principle is the same. Let's search for any song and look at the URL.
(Screenshot: URL bar)
Notice that in the URL, "kw=" is followed by the song name, while "lx=" is followed by the download source.
Now let's look at the page source:
(Screenshot: page source)
We can see that an <a> tag contains exactly what we want: the download link and the song name.
(Screenshot: the <a> tag)
With the song name and download link in hand, the rest is easy. Next comes the code!
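The URL pattern observed above can be sketched as a small helper. This is only an illustration: the function name build_search_url is hypothetical, and the source code "wy" is taken from the search URL used later in this article. Using urlencode means song names with spaces or non-ASCII characters are escaped correctly.

```python
from urllib.parse import urlencode

# Base address of the third-party search site used in this article
BASE_URL = "https://music.hwkxk.cn/"


def build_search_url(keyword, source="wy"):
    """Return the search URL: "kw" carries the song name, "lx" the download source."""
    return BASE_URL + "?" + urlencode({"kw": keyword, "lx": source})


print(build_search_url("Summer"))
```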

3. Start coding!

Here I built a user interface (UI), as well as another way to download: downloading by link. Readers who are not interested in downloading by link can skip this chapter and go to Chapter 4: Search and Download.
Download by link
First of all, we need to know one URL: http://music.163.com/song/media/outer/url?id=.mp3
What is it? It is a download link: fill in a song's ID after "id=" and the song can be downloaded.
(Screenshot: the ID in the URL)
Open any song and you will see that its URL has exactly the form "id=???". We just use a regular expression to extract the ID and fill it into the URL above. The code:

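import re
import urllib.request
import tkinter.messagebox as box

# Download a song from the URL of its NetEase Cloud Music page
def urldownload():
    url = lefturl.get()  # lefturl is the entry box of my UI; without a UI, use input() instead
    try:
        # Extract the numeric song ID
        urlid = re.findall(r'id=(\d+)', url)[0]
        # Build the download URL
        durl = 'http://music.163.com/song/media/outer/url?id=%s.mp3' % urlid
        # Download the song (replace the placeholder with your own absolute path and file name)
        urllib.request.urlretrieve(durl, r'absolute_path\name.mp3')
        # Report that the download has finished
        box.showinfo(title='Info', message='The music has been downloaded!\nSaved to the download folder!')
    except Exception:
        box.showerror(title='Error', message='Invalid download link!')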
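The ID extraction can also be tried on its own, without the UI or any network access. The following is a minimal sketch: the helper name build_download_url and the sample page URL are assumptions made for illustration, and only the "id=" pattern and the download URL template come from the text above.

```python
import re

# Download URL template from this article; %s is replaced by the song ID
DOWNLOAD_TEMPLATE = "http://music.163.com/song/media/outer/url?id=%s.mp3"


def build_download_url(page_url):
    """Return the download URL for a song page URL, or None if no ID is found."""
    # Require "?" or "&" before "id=" so that parameters such as "userid=" do not match
    match = re.search(r"[?&]id=(\d+)", page_url)
    if match is None:
        return None
    return DOWNLOAD_TEMPLATE % match.group(1)


# Hypothetical song page URL, used only to exercise the helper
print(build_download_url("https://music.163.com/#/song?id=443242"))
```

Returning None instead of raising lets the caller decide how to report a bad link, for example with the error box used above.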

4. Search and Download

To get the download link and the name, we first have to fetch the source of the page:

import requests
# Search function
def searchdownload(name):
    # URL taken from the site's request headers
    url = 'https://music.hwkxk.cn/?kw=%s&lx=wy' % name
    html = requests.get(url=url).text
    print(html)

But when we run it, the output is garbled. What is going on?
Most likely the response does not declare its charset, so requests decodes the UTF-8 page as ISO-8859-1. We can encode the text back to that single-byte encoding and then decode it as UTF-8. The modified code:

import requests
# Search function
def searchdownload(name):
    # URL taken from the site's request headers
    url = 'https://music.hwkxk.cn/?kw=%s&lx=wy' % name
    html = requests.get(url=url).text
    # Undo the wrong ISO-8859-1 decoding, then decode the bytes as UTF-8
    html = html.encode('ISO-8859-1')
    html = html.decode('UTF-8')
    print(html)

Now the text is no longer garbled.
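Why the encode/decode round trip works can be reproduced offline. The sketch below does not contact the site; it simulates the situation under the assumption that UTF-8 bytes were decoded as ISO-8859-1, using a song name from the search results as sample text.

```python
original = "久石譲 - Summer"

# What the server sends: UTF-8 bytes
raw = original.encode("utf-8")

# What a client sees if it wrongly decodes those bytes as ISO-8859-1
garbled = raw.decode("iso-8859-1")

# The fix: turn the text back into the same bytes, then decode them correctly
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(repr(garbled))
print(repaired)
print(repaired == original)
```

Because ISO-8859-1 maps every byte to exactly one character, the wrong decoding loses no information and can always be reversed. With requests, setting the response's encoding attribute to 'utf-8' before reading .text should achieve the same result without the round trip.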
Next, we extract the song names and download links:
(Screenshot: the <a> tag)
We see that the class of a song's <a> tag is "btn btn-xs btn-success", but that is only the class of one song. We need to find the class shared by all the songs.
Looking at "Styles" on the right, we find that this class is shared by all of the <a> tags.
(Screenshot: all <a> tags)
Now for the code:

import bs4
import requests
# Search function
def searchdownload(name):
    # URL taken from the site's request headers
    url = 'https://music.hwkxk.cn/?kw=%s&lx=wy' % name
    html = requests.get(url=url).text
    # Undo the wrong ISO-8859-1 decoding, then decode the bytes as UTF-8
    html = html.encode('ISO-8859-1')
    html = html.decode('UTF-8')
    # Parse the page
    soup = bs4.BeautifulSoup(html, "lxml")
    # Find every tag with the btn-success class
    link_0 = soup.select('.btn-success')
    print(link_0)

After running the function, Python prints a list:

[<a class="btn btn-xs btn-success" download="久石譲 - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1417064063" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="久石譲 - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=443242" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="keshi - summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1378192821" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Calvin Harris - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=28306554" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Mazza - Summer Klaas Remix.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=28729445" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="久石譲 - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=444292" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="keshi - summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1361455890" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="徐梦圆 - summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=34779102" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="LJY - Summer (夏).flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=485263993" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Calvin Harris - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=29460066" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="戈冧 - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1377103256" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="David Garrett - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=17241229" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="cozy kev - summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1410153419" 
target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="KMS - summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1418582038" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Yogee New Waves - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=29979351" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Marshmello - SuMmeR.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=39324020" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Kesha - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1419676441" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Calvin Harris - Summer R3hab  Ummet Ozcan Remix.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=28696074" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="BROCKHAMPTON - SUMMER.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=502242134" target="_blank">无损</a>, <a class="btn btn-xs btn-success" download="Dan Martinez - 夏日狂欢.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1320098269" target="_blank">无损</a>]

This is what we want. First, let's print the first item of the list:

print(link_0[0])

Output:

<a class="btn btn-xs btn-success" download="久石譲 - Summer.flac" href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1417064063" target="_blank">无损</a>
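To see which attributes of such a tag carry the data without contacting the site, the tag can be parsed offline. Note that this sketch deliberately uses html.parser from the standard library instead of the BeautifulSoup used in this article, so that it runs with no third-party packages; the class name SongLinkParser is hypothetical.

```python
from html.parser import HTMLParser


class SongLinkParser(HTMLParser):
    """Collect (name, link) pairs from <a> tags that carry the btn-success class."""

    def __init__(self):
        super().__init__()
        self.songs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        classes = (attributes.get("class") or "").split()
        if tag == "a" and "btn-success" in classes:
            # "download" holds the file name, "href" holds the download link
            self.songs.append((attributes.get("download"), attributes.get("href")))


# One tag copied from the output above
sample = ('<a class="btn btn-xs btn-success" download="久石譲 - Summer.flac" '
          'href="https://music.hwkxk.cn/api/?source=WYSQ&amp;id=1417064063" '
          'target="_blank">无损</a>')

parser = SongLinkParser()
parser.feed(sample)
parser.close()
print(parser.songs)
```

The parser unescapes attribute values, so the "&amp;" in the page source comes back as a plain "&" in the link.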

Looking closer, we find that "download" holds the song name and "href" holds the download link. We just append ".get('href')" to "link_0[0]", and do the same for the name with ".get('download')". If a result does not exist, we use None. (Do not add "[0]" after ".get('href')": that would return only the first character of the link.)

    # Find the targets; fall back to None when a result is missing
    try:
        link_0 = soup.select('.btn-success')[0].get('href')
        name_0 = soup.select('.btn-success')[0].get('download')
    except IndexError:
        link_0 = None
        name_0 = None
    try:
        link_1 = soup.select('.btn-success')[1].get('href')
        name_1 = soup.select('.btn-success')[1].get('download')
    except IndexError:
        link_1 = None
        name_1 = None
    try:
        link_2 = soup.select('.btn-success')[2].get('href')
        name_2 = soup.select('.btn-success')[2].get('download')
    except IndexError:
        link_2 = None
        name_2 = None
    try:
        link_3 = soup.select('.btn-success')[3].get('href')
        name_3 = soup.select('.btn-success')[3].get('download')
    except IndexError:
        link_3 = None
        name_3 = None
    try:
        link_4 = soup.select('.btn-success')[4].get('href')
        name_4 = soup.select('.btn-success')[4].get('download')
    except IndexError:
        link_4 = None
        name_4 = None

Finally, store the results in a dictionary and return it:

    # Keys are "<index>_0" for the link and "<index>_1" for the name
    link_data = {
        "0_0": link_0,
        "0_1": name_0,
        "1_0": link_1,
        "1_1": name_1,
        "2_0": link_2,
        "2_1": name_2,
        "3_0": link_3,
        "3_1": name_3,
        "4_0": link_4,
        "4_1": name_4
    }
    return link_data
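The five repeated blocks and the dictionary could also be written as one loop. The following is a sketch rather than the code this article uses: the helper name build_link_data is hypothetical, and it assumes that tags is the list returned by soup.select('.btn-success'). Plain dictionaries stand in for the tags here, since both offer .get(), so the example runs without the network.

```python
def build_link_data(tags, limit=5):
    """Return the same "<index>_0"/"<index>_1" dictionary for the first `limit` tags."""
    link_data = {}
    for i in range(limit):
        # Missing results are stored as None, as in the article
        tag = tags[i] if i < len(tags) else None
        link_data["%d_0" % i] = tag.get("href") if tag is not None else None
        link_data["%d_1" % i] = tag.get("download") if tag is not None else None
    return link_data


# Stand-in for one search result, with values taken from the output above
sample_tags = [{
    "href": "https://music.hwkxk.cn/api/?source=WYSQ&id=1417064063",
    "download": "久石譲 - Summer.flac",
}]

link_data = build_link_data(sample_tags)
print(link_data["0_1"])
print(link_data["4_0"])
```

Selecting the tags once and indexing into the result also avoids running the same CSS query ten times.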

With the download link and the name, you can now download the song: just call urllib.request.urlretrieve().
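That last step might look like the sketch below. It assumes the dictionary layout returned above ("0_0" for the first link, "0_1" for the first name, and so on); the helper names safe_filename and download_all are hypothetical, and the default folder name "download" follows the message shown in Chapter 3. Whether the links still serve files depends on the third-party site.

```python
import os
import re
import urllib.request


def safe_filename(name):
    """Replace characters that Windows does not allow in file names."""
    return re.sub(r'[\\/:*?"<>|]', "_", name)


def download_all(link_data, folder="download", count=5):
    """Download every entry that has both a link and a name; return the saved paths."""
    os.makedirs(folder, exist_ok=True)
    saved = []
    for i in range(count):
        link = link_data.get("%d_0" % i)
        name = link_data.get("%d_1" % i)
        if not link or not name:
            continue  # skip results that were stored as None
        path = os.path.join(folder, safe_filename(name))
        urllib.request.urlretrieve(link, path)
        saved.append(path)
    return saved
```

A call such as download_all(searchdownload("Summer")) would then save up to five files, provided searchdownload returns the dictionary built above.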

Conclusion

After today's lesson, you should have learned a lot! I believe you are now one step further along the way of learning Python!

by taoxichen

Origin blog.csdn.net/leotao9527/article/details/104878591