Web crawler: crawling the IDs and link addresses of popular songs on NetEase Cloud Music

Goals:

Primary goal: given a list of popular singers, capture the IDs and link addresses of each singer's top 50 popular songs on NetEase Cloud Music.
Ultimate goal:
1. Capture song IDs through popular singers.
2. Use the song IDs to capture the IDs of commenting users.
3. Send targeted push messages to those user IDs.
To store the captured results we use MySQL, so that each step stands on its own and the data is passed between steps through MySQL.


Learning content:

Master writing crawler programs, and master connecting to and operating MySQL databases from Python with PyMySQL. Specifically:
1. Build a MySQL database
2. Master the basic Python syntax for connecting to MySQL
3. Master using a crawler to obtain specified information
4. Master assembling the specified URLs


Step 1: build the MySQL table

First set up the MySQL environment and create a database named python, then create a table in it to store the captured song ID, song name, and corresponding web page address.
Table-building statements:

create database python;
ALTER DATABASE python CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;

DROP TABLE IF EXISTS `songinf`;
CREATE TABLE `songinf`  (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `song_id` varchar(30) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `song_name` varchar(1000) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `song_url` varchar(150) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `clbz` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci ,
  `height` float(3, 2) NULL DEFAULT 0.00,  
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `song_id`(`song_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;


Step 2: operate MySQL through PyMySQL

After creating the table, we use PyMySQL to operate on the songinf table we just created and test whether we can insert a row into it. If the insert succeeds, the connection works.
Create a Python file named wangyiyunSpiderSQL.py; it will be imported in Step 3.

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql
# logical SQL clause order: from, where, group by, having, select, order by, limit
class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 so emoji and other 4-byte characters can be stored
                                    charset='utf8mb4'
                                            )
        self.cursor = self.conn.cursor()

    def modify_sql(self,sql,data):
        self.cursor.execute(sql,data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def insert_songinf(song_id,song_name,song_url,clbz):
    helper = Mysql_pq()
    print('Connected to database python; ready to insert song info')
    # insert the data
    insert_sql = 'insert into songinf(song_id,song_name,song_url,clbz) values (%s,%s,%s,%s)'
    data = (song_id,song_name,song_url,clbz)
    helper.modify_sql(insert_sql, data)

if __name__ == '__main__':
    song_id='519250015'
    song_name= '请记住我'
    song_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    clbz = 'N'
    insert_songinf(song_id, song_name, song_url,clbz)
    print('test over')

Run this program, and then use DbVisualizer to check whether the table songinf has data.
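If DbVisualizer is not at hand, the same check can be done from Python. Below is a minimal sketch (the helper name count_songinf_rows is my own); it assumes the same connection parameters as above and a running MySQL server:

```python
def count_songinf_rows(host='127.0.0.1', port=3306, user='root',
                       passwd='root', db='python'):
    """Return the number of rows in songinf (requires a running MySQL server)."""
    import pymysql  # imported here so the sketch can be read without the driver installed
    conn = pymysql.Connect(host=host, port=port, user=user,
                           passwd=passwd, db=db, charset='utf8mb4')
    try:
        with conn.cursor() as cursor:
            cursor.execute('select count(*) from songinf')
            (n,) = cursor.fetchone()
            return n
    finally:
        conn.close()

# usage (with the server running): print('rows in songinf:', count_songinf_rows())
```

After the test insert from Step 2, the count should be at least 1.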

Step 3: get song information through the crawler

Core operation: visit the artist's homepage via the artist ID, read the top 50 popular songs displayed there, extract each song's name and ID, and assemble each song's page URL.

Every singer's homepage has the same format; only the ID differs:

url = 'https://music.163.com/artist?id=' + artist_id

Therefore, as long as we have a singer's ID, we can get the address of their homepage.
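As a tiny illustration of this URL scheme (the helper name artist_homepage is my own):

```python
def artist_homepage(artist_id):
    # Every artist homepage differs only in the id query parameter.
    return 'https://music.163.com/artist?id=' + artist_id

print(artist_homepage('12138269'))  # 毛不易's homepage
# → https://music.163.com/artist?id=12138269
```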
Singer IDs could themselves be crawled from a popular-singers list, but the NetEase Cloud Music interface is built with iframes, which makes it hard to scrape; debugging that cost two days over Chinese New Year. Rather than waste more time here, we collect the desired singers' IDs by hand: visit each singer's homepage and read the ID from the address bar. For example:

徐秉龙 1197168, 周笔畅 10558, 买辣椒也用券 12085562, 华晨宇 861777, 林宥嘉 3685,
李荣浩 4292, 杨宗纬 6066, 薛之谦 5781, 蔡健雅 7214, 金玟岐 893259, 林俊杰 3684
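These hand-collected name/ID pairs can be turned into a Python list with a small helper. A sketch (the function name parse_artist_ids is my own; it simply pulls out every run of digits):

```python
import re

def parse_artist_ids(text):
    # Artist IDs are runs of digits; the surrounding names are non-digit characters.
    return re.findall(r'\d+', text)

pasted = '徐秉龙 1197168 周笔畅10558 买辣椒也用券 12085562 华晨宇 861777 林宥嘉 3685'
print(parse_artist_ids(pasted))
# → ['1197168', '10558', '12085562', '861777', '3685']
```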

Therefore, the new Python program is as follows:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

# NetEase Cloud Music: given an artist ID, collect the artist's popular songs and song page URLs
import requests
from lxml import etree
from wangyiyunSpiderSQL import *

headers = {
    'Referer': 'http://music.163.com',
    'Host': 'music.163.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Chrome/10'
}


# get the IDs and names of the top 50 popular songs on the given artist's page
def get_songs(artist_id):
    page_url = 'https://music.163.com/artist?id=' + artist_id
    # fetch the page HTML
    res = requests.request('GET', page_url, headers=headers)
    # parse the top 50 popular songs with XPath
    html = etree.HTML(res.text)
    href_xpath = "//*[@id='hotsong-list']//a/@href"
    name_xpath = "//*[@id='hotsong-list']//a/text()"
    hrefs = html.xpath(href_xpath)
    names = html.xpath(name_xpath)
    # collect the song IDs and names
    song_ids = []
    song_names = []
    for href, name in zip(hrefs, names):
        song_ids.append(href[9:])  # strip the '/song?id=' prefix (9 characters)
        song_names.append(name)
        print(href, '  ', name)
    return song_ids, song_names


# Artist IDs (毛不易 is 12138269):
# 徐秉龙 1197168, 周笔畅 10558, 买辣椒也用券 12085562, 华晨宇 861777, 林宥嘉 3685,
# 李荣浩 4292, 杨宗纬 6066, 薛之谦 5781, 蔡健雅 7214, 金玟岐 893259, 林俊杰 3684,
# 邓紫棋 7763, 孙燕姿 9272, 梁静茹 8325, 张惠妹 10559, 林忆莲 8336, 莫文蔚 8926,
# 赵雷 6731, 宋冬野 5073, 马頔 4592, 朴树 4721, 逃跑计划 12977, 黄霄雲 14077324,
# 陈奕迅 2116, 艾辰 12174057, 封茗囧菌 12172529, 阮豆 12172496, 黑猫 12383659,
# Fine乐团 1160085, 郭顶 2843, 周兴哲 980025, 田馥甄 9548, 五月天 13193,
# 苏打绿 12707, 王力宏 5346, 陶喆 5196, 周杰伦 6452, 周华健 6456

artist_id_list = ['12138269','1197168','10558','12085562','861777','3685','4292',
             '6066','5781','7214','893259','3684','7763','9272','8325','10559',
             '8336','8926','6731','5073','4592','4721','12977','14077324',
             '2116','12174057','12172529','12172496','12383659','1160085',
             '2843','980025','9548','13193','12707','5346','5196','6452',
             '6456','6453','6454','6455','6457','6458','6459','6460','6461',
             '6462','6463','6464','6465','6466','6467','6468','6469','6470']

for artist_id in artist_id_list:

    [song_ids, song_names] = get_songs(artist_id)
    print('len(song_ids) = ',len(song_ids))

    for (song_id, song_name) in zip(song_ids, song_names):
        # song page URL
        song_url = 'https://music.163.com/#/song?id=' + song_id + '&lv=-1&kv=-1&tv=-1'
        print('song_url = ',song_url)
        print('song_id = ', song_id)
        print('song_name =', song_name)
        clbz = 'N'
        insert_songinf(song_id, song_name, song_url, clbz)
    print('finished inserting song IDs for this artist')
print('all song IDs inserted successfully; done')

Running the program captured the IDs and link addresses of 2613 songs.
In the next article, we will use the captured song IDs to obtain the IDs of the commenting users.

Origin blog.csdn.net/weixin_43290383/article/details/113844388