[Python] Crawl the Kugou Music Top500 ranking list [with source code]

 1. Import the necessary modules:

    This blog will introduce how to use Python to write a crawler program that scrapes the Kugou Music Top500 ranking list. We will use the requests module to send HTTP requests and receive responses, the BeautifulSoup module (from bs4) to parse the returned HTML, and the time module to control the speed of the crawler.
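    For reference, the three modules used in this post can be imported as follows (the same imports appear at the top of the full source code in section 3):

import requests  # send HTTP requests and receive responses
from bs4 import BeautifulSoup  # parse the returned HTML
import time  # throttle the crawler between requests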

        If a "module not found" error occurs, open the console and install the missing package. Using a domestic mirror source is recommended for faster downloads:

pip install requests -i https://mirrors.aliyun.com/pypi/simple

         Here is a rough list of domestic mirror sources:

Tsinghua University
https://pypi.tuna.tsinghua.edu.cn/simple

Alibaba Cloud (Aliyun)
https://mirrors.aliyun.com/pypi/simple/

Douban
https://pypi.douban.com/simple/

Baidu Cloud
https://mirror.baidu.com/pypi/simple/

USTC (University of Science and Technology of China)
https://pypi.mirrors.ustc.edu.cn/simple/

Huawei Cloud
https://mirrors.huaweicloud.com/repository/pypi/simple/

Tencent Cloud
https://mirrors.cloud.tencent.com/pypi/simple/
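        To make one of these mirrors the default for every future install (instead of passing -i each time), pip can store a global index URL, for example:

pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple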


2. Send a GET request to obtain response data:

        A User-Agent request header is set to simulate a browser's request; the helper function below returns the response body parsed as JSON (for pages that return HTML, use response.text instead).

import requests


def get_html(url):
    # simulate a browser request so the server does not reject it
    header = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36'
    }
    response = requests.get(url=url, headers=header)
    html = response.json()  # parse the response body as JSON
    return html
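        A minimal usage sketch (the URL below is a hypothetical JSON API, not a Kugou address; Kugou's ranking pages return HTML, which section 3 parses from response.text):

data = get_html('https://example.com/api/ranking')  # hypothetical endpoint that returns JSON
print(data)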

        How to get the request header:

        Firefox browser:
  1. Open the target page and right-click on an empty area of the page.
  2. Select the "Inspect Element" option, or press the shortcut Ctrl + Shift + C (Windows).
  3. In the developer tools window, switch to the Network tab.
  4. Refresh the page to capture all network requests.
  5. Select the request you are interested in from the list of requests.
  6. The request header information is in the "Request Headers" section on the right.

     Copy the User-Agent value from the request headers into your code.

3. Crawl the Kugou TOP500 ranking list

        Extract each song's rank, title, artist, and duration from the Kugou Music Top500 ranking pages.

        The specific steps are as follows:

  1. Import the required modules: requests is used to send HTTP requests, BeautifulSoup is used to parse HTML, and time is used to control the speed of the crawler.

  2. Set the request header information: set the User-Agent through the headers dictionary to simulate a browser request and avoid being blocked by the website.

  3. Define the function get_info(url): this function receives a URL parameter and crawls the information from the specified web page.

  4. Send a network request and parse the HTML: use requests.get() to send a GET request for the page's HTML content, and use the BeautifulSoup module to parse it.

  5. Locate the required information with CSS selectors: use the select() method to locate the song rank, song title, duration and other elements.

  6. Loop through each record and store it in a dictionary: use the zip() function to pack the rank, title, and duration elements into an iterator, then loop through them, storing each record in the data dictionary.

  7. Print the obtained information: use print() to output the data dictionary.

  8. Main program entry: use if __name__ == '__main__': to check whether the current file is being executed directly; if so, run the following code.

  9. Construct the list of page addresses to crawl: use a list comprehension to build the list of page URLs.

  10. Call the function to obtain page information: use a for loop to iterate over the URL list, calling get_info() for each page.

  11. Control the crawler speed: use time.sleep() to slow the crawler down and keep the IP from being blocked.

Source code:

import requests  # send network requests and fetch the HTML
from bs4 import BeautifulSoup  # parse the HTML and extract the needed information
import time  # control the crawler speed so the IP is not blocked


headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36"
    # browser header information, used to simulate a real browser request
}

def get_info(url):
    # url: address of the web page to crawl
    web_data = requests.get(url, headers=headers)  # send the request and fetch the HTML
    soup = BeautifulSoup(web_data.text, 'lxml')  # parse the HTML

    # locate the needed information via CSS selectors
    ranks = soup.select('span.pc_temp_num')
    titles = soup.select('div.pc_temp_songlist > ul > li > a')
    times = soup.select('span.pc_temp_tips_r > span')

    # loop over each record and store it in a dictionary
    # (the loop variable is named 'duration' so it does not shadow the time module)
    for rank, title, duration in zip(ranks, titles, times):
        full_title = title.get_text().replace("\n", "").replace("\t", "")
        data = {
            "rank": rank.get_text().strip(),  # song rank
            "singer": full_title.split('-')[0].strip(),  # artist name (titles have the form "artist - song")
            "song": full_title.split('-')[1].strip(),  # song name
            "time": duration.get_text().strip()  # song duration
        }
        print(data)  # print the scraped record

if __name__ == '__main__':
    urls = ["https://www.kugou.com/yy/rank/home/{}-8888.html".format(str(i)) for i in range(1, 24)]
    # build the list of page URLs to crawl
    for url in urls:
        get_info(url)  # call the function to fetch each page's information
        time.sleep(1)  # control the crawler speed so the IP is not blocked
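If you want to keep the results rather than only printing them, here is a minimal sketch of writing the records to a CSV file with Python's standard csv module; it assumes get_info() is modified to collect the data dictionaries into a list and return it (the save_to_csv helper is an illustration, not part of the original code):

import csv

def save_to_csv(records, path='kugou_top500.csv'):
    # records: list of data dictionaries as built in get_info()
    with open(path, 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.DictWriter(f, fieldnames=['rank', 'singer', 'song', 'time'])
        writer.writeheader()  # write the column names
        writer.writerows(records)  # write one row per song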

Sample output: the script prints one dictionary per song, with the keys rank, singer, song, and time (screenshot omitted).


Origin blog.csdn.net/m0_73367097/article/details/134256713