How does the Python crawler grab the singer data of qq music?

Since I learned to crawl, has there been an urge to crawl everything? Today, Xiaoqian will teach you how to grab the singer data of qq music. More practice on the project can improve yourself faster.

The goal of today's project is to obtain the song titles, album names, and playback links of the songs of the specified number of pages of the QQ Music designated singer's single ranking, from the shallower to the deeper, layer by layer, which is very suitable for students who are just getting started. The main libraries involved are: requests, json, openpyxl.

Project steps

1. Understand the robots protocol of QQ Music website (safe)

2.png

Only prohibit playlists, can operate

2. Enter the homepage of QQ Music

3. Enter any singer, such as Zhang Jie

3.jpg

4. Open the review element (shortcut key Ctrl+Shift+I)

4.jpg

5. Analyze the webpage source code Elements and find that there is no song information and BeautifulSoup cannot be used. As shown in the figure below, the result is empty.

5.jpg

6. Click Network to see if the data is in XHR (without refreshing the update page). My experience is to first look at the largest Size, then analyze the Name and view the Preview, and it really is inside!

6.jpg

6.2.jpg

Format the response and display the content as above

7. Click Headers to get the relevant parameters. As shown in the figure below, carefully observe the relationship between url and Query String Parameters, and find that w in url represents the name of the singer, and p represents the number of pages

Where: p=1&n=10&w=%E5%BC%A0%E6%9D%B0

7.jpg

8. Realize through json code. First, try a little trick, crawl the data of the first page, and copy the url directly. success!

8.jpg

Among them: Analyze json to obtain a list of songs, and obtain a playback link

9.jpg

9. Introduce the params parameter to realize the query of the specified singer and the specified number of pages.

Note that the code url is the part before the "?" in the url in the previous step, and the parameters on both sides of params need to be added ``, requests.get add params, parameters (you can also add headers parameters by the way)

10.jpg

10. Add storage function and save to local (Excel). It can also be saved in csv format or stored in the database, the operation is similar.

11.jpg

12.jpg

to sum up

Crawling qq music is slightly more difficult than crawling websites such as Douban, the required information is not in the source code, you need to check xhr.

Crawling data through xhr generally uses json, the format is:

res = requests.get(url),json = res.json(),list = json[’’][’’]

It is for reference only. It is not recommended to crawl too much data and increase the load on the server.

The original text comes from Qianfeng Education .


Guess you like

Origin blog.51cto.com/15128702/2665341