Python crawler practice | Crawling NetEase Cloud music reviews

Crawl NetEase Cloud Music Reviews

01 Web page analysis

This case presents a simple way to crawl the comments on NetEase Cloud Music. NetEase Cloud Music exposes an API that returns the requested content as a JSON object, and the URL for a song's comments has the form "http://music.163.com/api/v1/resource/comments/R_SO_4_" + song ID. By default, the JSON object contains only a limited number of comments. To obtain the complete set, add two parameters to the request: the number of comments to load per JSON response (limit) and the starting position (offset), and then send a GET request, as shown in Figure 17-1. The returned JSON contains the comment content and the total number of comments. With these parameters, requests can be sent at intervals, advancing the offset each time, until all comments have been retrieved. This is the basis for the crawler.

■ Figure 17-1 Web page analysis JSON object content

02 Write a crawler
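
The approach from the page analysis can be sketched roughly as follows. This is a minimal sketch rather than the original code: the function names, the browser-like User-Agent header, the one-second delay, and the JSON field names "total", "comments", and "content" are assumptions for illustration and should be checked against the actual response shown in Figure 17-1.

```python
import json
import time
import urllib.parse
import urllib.request

# URL prefix taken from the page analysis; the song ID is appended to it.
API_BASE = "http://music.163.com/api/v1/resource/comments/R_SO_4_"


def build_comment_url(song_id, limit=20, offset=0):
    """Return the comment API URL for one page of a song's comments."""
    query = urllib.parse.urlencode({"limit": limit, "offset": offset})
    return f"{API_BASE}{song_id}?{query}"


def page_offsets(total, limit):
    """Offsets needed to cover `total` comments at `limit` per request."""
    return list(range(0, total, limit))


def fetch_page(song_id, limit=20, offset=0, timeout=10):
    """Send one GET request and return the decoded JSON object."""
    request = urllib.request.Request(
        build_comment_url(song_id, limit, offset),
        # Assumption: the server may expect a browser-like User-Agent.
        headers={"User-Agent": "Mozilla/5.0"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def crawl_comments(song_id, limit=20, delay=1.0, max_pages=None):
    """Collect comment texts page by page, pausing between requests."""
    first = fetch_page(song_id, limit, 0)
    total = first.get("total", 0)  # assumed field: total number of comments
    texts = [c.get("content", "") for c in first.get("comments", [])]
    offsets = page_offsets(total, limit)[1:]  # the first page is already fetched
    if max_pages is not None:
        offsets = offsets[: max(max_pages - 1, 0)]
    for offset in offsets:
        time.sleep(delay)  # send requests at intervals
        page = fetch_page(song_id, limit, offset)
        texts.extend(c.get("content", "") for c in page.get("comments", []))
    return texts
```

For a quick trial, something like crawl_comments(436514312, limit=100, max_pages=3) keeps the number of requests small before a full crawl is attempted.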


Before analysis, the crawled text is processed with regular expressions. To improve accuracy and avoid program bugs, unnecessary characters such as punctuation marks and special non-text characters (for example, emoticons) should be removed in advance, because they interfere with text analysis. This is usually done with re.sub(pat, "", Str), where pat is a precompiled regular expression and every matched character is replaced with an empty string. Some regular expression ideas are provided below.

(1) re.compile(r'\t|\n|\.|-|:|;|\)|\(|\?|（|）|\||"|\u3000') removes punctuation marks and whitespace, including the full-width parentheses and the ideographic space \u3000.

(2) In a regular expression, [^...] matches any character that is not in the given character set, so the required characters can be selected by negation. Besides the basic [a-zA-Z0-9] matching, Unicode ranges can be used: Chinese characters are \u4e00-\u9fa5, digits are \u0030-\u0039, uppercase letters are \u0041-\u005a, lowercase letters are \u0061-\u007a, Korean is \uAC00-\uD7AF, and Japanese is \u3040-\u31FF. Keep whichever characters the text analysis requires.
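
Both ideas can be illustrated with a short sketch. The pattern names and helper functions below are hypothetical, and the character sets are only one possible choice: the first pattern lists a few ASCII punctuation marks plus \u3000, and the second keeps only Chinese characters, ASCII letters, and digits.

```python
import re

# Idea (1): list the characters to delete explicitly.
PUNCTUATION = re.compile(r'\t|\n|\.|-|:|;|\)|\(|\?|"|\u3000')

# Idea (2): negated set, matching everything that is NOT a Chinese
# character, an ASCII letter, or a digit.
NOT_KEPT = re.compile(r'[^\u4e00-\u9fa5a-zA-Z0-9]')


def strip_punctuation(text):
    """Delete the listed punctuation and whitespace characters."""
    return PUNCTUATION.sub("", text)


def keep_chinese_alnum(text):
    """Keep only Chinese characters, ASCII letters, and digits."""
    return NOT_KEPT.sub("", text)


print(strip_punctuation("Hello, (world)?\n"))
print(keep_chinese_alnum("成都, 2016! Chengdu~"))
```

The negated form is usually easier to maintain for comments, since unexpected symbols are dropped without having to be listed one by one. To keep Korean or Japanese text as well, the corresponding ranges can be added inside the brackets.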

03 Results

[Example 17-1] Analyze "Chengdu" (song ID: 436514312), a representative single by the well-known folk singer Zhao Lei, which has more than 400,000 comments. The keyword cloud is shown in Figure 17-2.

■ Figure 17-2 Word cloud analysis results of the single "Chengdu"

[Example 17-2] Analyze "Be the One" (song ID: 530986958), the theme song of the well-known Japanese TV series "Kamen Rider Build", which has about 20,000 comments. A custom mask is used for the word cloud, and the keyword cloud is shown in Figure 17-3.

■ Figure 17-3 Word cloud analysis results of the single "Be the One"


Origin blog.csdn.net/Z987421/article/details/133313748