Python crawler tutorial: scraping Kugou Music

Among the common music sites, Kugou is arguably the easiest to crawl: nothing is obfuscated and nothing is encrypted, which makes it the most suitable target for a complete beginner's first crawler.

This article is aimed at zero-experience beginners, so every step comes with a screenshot and a detailed explanation. Honestly it reads a bit long-winded even to me; in the end it boils down to just two requests, so experienced readers, please bear with me or skip ahead.

1. Open the Kugou official site and you will see a search box. The data we want to crawl is what Kugou returns after you search for a song: the list of matching songs and the information for each one (lyrics, author, URL, etc.).

2. Press F12 to open developer mode and select Network - All (this lists every request the Kugou front end exchanges with the server).

3. Search for something in the search box, and a long list of requests will appear on the right. The search data is in fact inside one of them; I have marked it with a red box (you can find it by the name song_search, or just click them open one by one to see which contains the content you are looking for).

4. Open that request and switch to the Preview tab above; you will find the JSON data with the search results, where lists is the list of songs.

5. Expand one song and you can see its name, author, AlbumID, FileHash and other song information.

6. Then switch to the Headers tab above; you can see the Request URL, and just below it (where the arrow points) that this is a GET request.

7. Scroll down and you can see the Request Headers (the backend may verify these; in general a request should at least carry a user-agent, and some sites do more unusual checks that you have to handle case by case. Kugou does no such verification, so the request works even without headers) and the query string (the request parameters: the search keyword, the number of results requested, and so on).
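To illustrate (my own addition; Kugou itself does not check headers, so this is optional here), here is a minimal sketch of how you would attach a User-Agent header to a requests call. The header string and the URL are just example values:

#coding=utf-8
import requests

# Example only: many sites reject requests without a browser-like User-Agent,
# so it is a good habit to send one even though Kugou does not require it.
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
}
res = requests.get('https://www.kugou.com/', headers=headers)
print(res.status_code)  # 200 means the request went through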

8. Without further ado, we use Python's requests library (just look up how to install it) to build the request. My environment is Python 2.7; Python 3 users should watch for version differences (print syntax, raw_input, etc.).

 

#coding=utf-8
import requests

search = '喜欢你'  # search keyword
pagesize = '10'  # number of results to request
url = 'https://songsearch.kugou.com/song_search_v2?callback=jQuery11240251602301830425_1548735800928&keyword=%s&page=1&pagesize=%s&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1548735800930' % (search, pagesize)
res = requests.get(url)  # send the GET request with requests
print res.text  # print the response body
9. The output looks like this: all the returned JSON is printed, and it is exactly the same information we saw in the browser developer tools.

10. Now that we have the list, go back to the browser. To get the detailed information for each song in the list, select the first one on the left and click into its detail page.

11. You can see it jumps to the playback page; refresh the page so the requests are loaded again.

12. The request circled in red on the right is the one containing the song information (you may ask how I know which one it is: by observation, of course. You get a feel for it after writing a few crawlers; if all else fails, click them open one by one).

13. The fields I marked with arrows are the useful information you usually want to crawl: the author, song name, lyrics, album image, id and play_url are all there. If you don't believe it, copy the play_url into the address bar and press Enter; what plays is definitely this song. With this URL we can download the song directly.
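As a small aside (my own addition, not a step in the original walkthrough): once you have a play_url, saving the song to disk is just another GET request. A minimal sketch, where the play_url value is a placeholder you would replace with the real link from this step:

#coding=utf-8
import requests

play_url = 'http://example.com/some_song.mp3'  # placeholder: use the play_url from the song detail response
res = requests.get(play_url)
with open('song.mp3', 'wb') as f:
    f.write(res.content)  # res.content holds the raw bytes of the audio file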

14. Next, switch from Preview to Headers at the top; you can see it is much like the song list request, again a GET request.

15. The query string here is again the GET request's parameters. Among them, hash and album_id identify a single song; to request a different song we only need to change these two parameters (each row of song data in the search list from the first request already contains them).

16. Based on the Request URL we just saw in developer mode, build a GET request, substituting each song's own id and hash value when requesting that song.

#coding=utf-8
import requests

# For this step-by-step demo we use the id and hash of the first search result
# seen in developer mode in step 1; the complete, end-to-end code is at the end of the article.
id = '557512'  # single-track id
hash = '41C2E4AB5660EAE04021C5893E055F50'  # single-track hash value
url = 'https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery19107465224671418371_1555932632517&hash=%s&album_id=%s&_=1555932632518' % (hash, id)
res = requests.get(url)
print res.text
17. You can see the console prints the single-track information. Because the JSON data has not been parsed and is printed as-is, it looks a bit messy for now.

18. Note that the data Kugou returns is not pure JSON: there are some useless characters at both ends that must be stripped with a regular expression, keeping only the content inside (and including) the curly braces {}. The code in step 19 shows this.
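To make that concrete, here is a small sketch of the regex at work on a made-up JSONP-style response (the callback name and payload are invented for illustration; the real stripping happens in the step 19 code):

#coding=utf-8
import re
import json

# Invented example: the JSON payload is wrapped in a jQuery callback, like Kugou's responses
raw = 'jQuery12345_67890({"status": 1, "data": {"lists": []}});'
# Keep only the outermost braces and everything between them, then parse into a dict
data = json.loads(re.match(".*?({.*}).*", raw, re.S).group(1))
print(data['status'])  # prints 1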

 

 

19. Now that we are familiar with the two steps above, let's put everything together into a complete Python crawler: type in a song to search, get the search list, and fetch each single's details.

 

# coding=utf-8
import requests
import json
import re

# Request the search list data
search = raw_input('音乐名:')  # read the search keyword from the console
pagesize = "10"  # number of results to request
url = 'https://songsearch.kugou.com/song_search_v2?callback=jQuery11240251602301830425_1548735800928&keyword=%s&page=1&pagesize=%s&userid=-1&clientver=&platform=WebFilter&tag=em&filter=2&iscorrection=1&privilege_filter=0&_=1548735800930' % (search, pagesize)
res = requests.get(url)  # send the GET request

# Note: the returned data is not real JSON; the extra characters before and after
# must be removed with a regex, keeping only what the braces {} enclose.
# json.loads converts the JSON string into a Python dict.
res = json.loads(re.match(".*?({.*}).*", res.text, re.S).group(1))
list = res['data']['lists']  # this is the song list

# Build a list to hold the per-song information; outputting this list lets
# other programs use it directly
musicList = []

# Loop over the list to get each single's details
for item in list:
    # Join item['FileHash'] and item['AlbumID'] into the detail request url2
    url2 = 'https://wwwapi.kugou.com/yy/index.php?r=play/getdata&callback=jQuery191010559973368921649_1548736071852&hash=%s&album_id=%s&_=1548736071853' % (
        item['FileHash'], item['AlbumID'])
    res2 = requests.get(url2)
    # Again strip the wrapper with a regex to get valid JSON, then convert to a dict
    res2 = json.loads(re.match(".*?({.*}).*", res2.text, re.S).group(1))['data']

    # Print it
    print res2['song_name'] + ' - ' + res2['author_name']
    print res2['play_url']
    print ''

    # Store the single's information in a dict
    dict = {
        'author': res2['author_name'],
        'title': res2['song_name'],
        'id': str(res2['album_id']),
        'type': 'kugou',
        'pic': res2['img'],
        'url': res2['play_url'],
        'lrc': res2['lyrics']
    }
    # Append the dict to the song list
    musicList.append(dict)
20. Finally, the console prints the results.
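If you want another program to consume musicList, one simple option (my addition, not part of the original tutorial) is to dump it to a JSON file. A minimal sketch, assuming musicList was built by the step 19 script; the placeholder entry only stands in for the real data:

#coding=utf-8
import json

musicList = [{'title': 'demo song', 'author': 'demo', 'url': 'http://example.com/a.mp3'}]  # placeholder
with open('musiclist.json', 'w') as f:
    json.dump(musicList, f)  # non-ASCII text is escaped as \uXXXX by default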

     

