Web Crawler 2: Crawling the User IDs and Homepage Addresses of NetEase Cloud Music Commenters

The goal of this article:

In the previous article, we obtained the song IDs and URLs for popular artists. This article goes a step further and captures the comment users' IDs and homepage addresses.

Final goal:

1. Grab song IDs via popular artists.
2. Use the song IDs to capture comment user IDs.
3. Use the comment user IDs to send targeted push messages.

The previous article completed step 1; this article completes step 2.
A digression: the page-less request method used in the previous article to fetch song IDs is faster, but after roughly 2,000 fetches the server recognizes it as a crawler and bans it. You can connect through a mobile phone hotspot and toggle airplane mode off and on before reconnecting, which gets you a fresh IP and another 2,000 or so.
In the last article we used MySQL to store the crawl results, and we do the same here. This article also supports redoing after errors: each time a record is fully processed, its processing flag is set to 'Y', much like in a production system.
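Before diving into the table definitions, the flag logic can be sketched in plain Python. The record dicts and the `process_pending` helper below are hypothetical, standing in for rows of the MySQL tables used later:

```python
# Minimal sketch of the resume-on-error pattern: only rows flagged 'N'
# are processed, and each row is flagged 'Y' after it succeeds.
def process_pending(records, handler):
    """Process only records flagged 'N'; flag each 'Y' after success."""
    done = 0
    for rec in records:
        if rec['clbz'] != 'N':      # already processed -- skip on restart
            continue
        try:
            handler(rec)
            rec['clbz'] = 'Y'       # mark processed so a rerun skips it
            done += 1
        except Exception:
            # leave the flag at 'N'; the next run will retry this record
            pass
    return done

if __name__ == '__main__':
    rows = [{'song_url': 'u1', 'clbz': 'N'},
            {'song_url': 'u2', 'clbz': 'Y'},   # already done in a prior run
            {'song_url': 'u3', 'clbz': 'N'}]
    n = process_pending(rows, lambda r: None)
    print(n)   # 2 -- only the two 'N' records were processed
```

Restarting simply means calling `process_pending` again on the same data: the 'Y' rows are skipped and only the leftovers are retried.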


Step 1: Build a MySQL table

Here we create a table called userinf to store each user's ID, name, comment time, and homepage address.
The DDL statement is as follows:

DROP TABLE IF EXISTS `userinf`;
CREATE TABLE `userinf`  (
  `id` int(12) NOT NULL AUTO_INCREMENT,
  `user_id` varchar(30) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `user_name` varchar(200) CHARACTER SET utf8 COLLATE utf8_general_ci ,
  `user_time` varchar(100) CHARACTER SET utf8 COLLATE utf8_general_ci ,
  `user_url` varchar(400) CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  `clbz` varchar(1) CHARACTER SET utf8 COLLATE utf8_general_ci ,
  `bysz` float(3, 0) NULL DEFAULT 0,
  PRIMARY KEY (`id`) USING BTREE,
  INDEX `user_id`(`user_id`) USING BTREE
) ENGINE = InnoDB CHARACTER SET = utf8 COLLATE = utf8_general_ci ROW_FORMAT = Dynamic;

After creating the table, we write a Python program that inserts into it.
The program is named useridSpiderSQL.py, and the code is:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql
# from ,where, group by, select, having, order by, limit
class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 when creating the database, so emoji and other non-BMP characters can be stored
                                    charset='utf8mb4'
                                            )
        self.cursor = self.conn.cursor()

    def modify_sql(self,sql,data):
        self.cursor.execute(sql,data)
        self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def insert_userinf(user_id,user_name,user_time,user_url,clbz):
    helper = Mysql_pq()
    print('Connected to database python, ready to insert user info')
    # 插入数据
    insert_sql = 'insert into userinf(user_id,user_name,user_time,user_url,clbz) values (%s,%s,%s,%s,%s)'
    data = (user_id,user_name,user_time,user_url,clbz)
    helper.modify_sql(insert_sql, data)



if __name__ == '__main__':

    # helper = Mysql_pq()
    # print('test db')
    # # test
    # insert_sql = 'insert into weibo_paqu(werbo) value (%s)'
    # data = ('222222xxxxxx2222 ',)
    # helper.modify_sql(insert_sql,data)
    user_id='519250015'
    user_name= '请记住我'
    user_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    user_time = '2021年2月18日'
    clbz = 'N'
    insert_userinf(user_id,user_name,user_time,user_url,clbz)
    print('test over')


Supporting error redo: going back to update the songinf table

To support redoing after an error, we update a song's processing flag to 'Y' once it has been processed. When the program restarts after an error, it skips records whose flag is 'Y' and only processes records whose flag is 'N', so the crawl can pick up where it left off.
To make this work, after crawling the comment users of a song we go back and update its row in songinf.
The Python program for this is named updateSongURLSQL.py, and the code is:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql
# from ,where, group by, select, having, order by, limit
class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 when creating the database, so emoji and other non-BMP characters can be stored
                                    charset='utf8mb4'
                                            )
        self.cursor = self.conn.cursor()

  

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def updater_songurl(url):
    helper = Mysql_pq()
    print('Connected to database python, ready to update song info')


    # a parameterized query avoids quoting problems (and SQL injection)
    sql = "UPDATE songinf SET clbz = 'Y' WHERE song_url = %s"
    print('sql is :', sql)
    helper.cursor.execute(sql, (url,))
    helper.conn.commit()



if __name__ == '__main__':

    
    url = 'https://music.163.com/#/song?id=569213220&lv=-1&kv=-1&tv=-1'
    updater_songurl(url)
    print('urllist = ',url )
    print('update over')

Crawling the comment users:

To avoid being banned by the server, this time we use the selenium automation module to drive a real browser, so the server cannot distinguish the crawler from ordinary user traffic. The drawback is speed: the current rate is about 1,000 user records per hour.
After running overnight I had collected 100,000+ user IDs. This step needs the song URLs obtained in the previous article, so we write a Python program to read them from that table.
The program is named getSongURLSQL.py, and the code is:

#!/usr/bin/env python
# -*- coding:utf-8 -*-
__author__ = 'luoji'

import pymysql
# from ,where, group by, select, having, order by, limit
class Mysql_pq(object):
    def __init__(self):
        self.conn = pymysql.Connect(host='127.0.0.1',
                                    port=3306,
                                    user='root',
                                    passwd='root',
                                    db='python',
                                    # use utf8mb4 when creating the database, so emoji and other non-BMP characters can be stored
                                    charset='utf8mb4'
                                            )
        self.cursor = self.conn.cursor()

    # def modify_sql(self,sql,data):
    #     self.cursor.execute(sql,data)
    #     self.conn.commit()

    def __del__(self):
        self.cursor.close()
        self.conn.close()

def select_songurl():
    helper = Mysql_pq()
    print('Connected to database python, ready to select song URLs')

    urllist = []
    sql = "SELECT * FROM songinf WHERE clbz = 'N'"
    helper.cursor.execute(sql)
    results = helper.cursor.fetchall()
    for row in results:
        id = row[0]
        song_id = row[1]
        song_name = row[2]
        song_url = row[3]
        clbz = row[4]
        # print the result
        print('id =', id)
        print('song_url =',song_url)
        urllist.append(song_url)
    return urllist


if __name__ == '__main__':

    # helper = Mysql_pq()
    # print('test db')
    # # test
    # insert_sql = 'insert into weibo_paqu(werbo) value (%s)'
    # data = ('222222xxxxxx2222 ',)
    # helper.modify_sql(insert_sql,data)
    # song_id='519250015'
    # song_name= '请记住我'
    # song_url = 'https://music.163.com/#/song?id=1313052960&lv=-1&kv=-1&tv=-1'
    # clbz = 'N'
    urllist = select_songurl()
    print('urllist = ',urllist )
    print('test over')

As you can see, MySQL does much of the bookkeeping here. With the three helper modules in place, the crawler itself is:

import re
import time
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.remote.webelement import WebElement
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver import ChromeOptions
from getSongURLSQL import *
from useridSpiderSQL import *
from updateSongURLSQL import *

def is_number(s):
    try:
        float(s)
        return True
    except ValueError:
        pass

    try:
        import unicodedata
        unicodedata.numeric(s)
        return True
    except (TypeError, ValueError):
        pass

    return False

def geturl(urllist):
    # if the geckodriver binary is not on PATH, its location must be given explicitly
    # verified on 2021-02-19
    driver = webdriver.Firefox()
    #driver = webdriver.Chrome()
    driver.maximize_window()
    driver.set_page_load_timeout(30)
    driver.set_window_size(1124, 850)

    for url in urllist:
        print('now the url is :',url)
        driver.get(url)
        time.sleep(3)
        print('start processing the page')
        driver.switch_to.frame('g_iframe')  # NetEase puts the music elements inside an iframe -- switch to it first

        href_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'cnt f-brk')]//a[contains(@class,'s-fc7')]"
        songid = driver.find_elements(By.XPATH, href_xpath)  # find_elements_by_xpath was removed in Selenium 4
        useridlist = []
        usernamelist = []
        for i in songid:
            userurl = i.get_attribute('href')
            userid = userurl[35:]   # the numeric user id: everything after 'id=' in the homepage URL
            print('userid = ', userid)
            username = i.text
            print('username = ', username)
            try:
                if is_number(userid):   # purely numeric, so it is a valid user id
                    print('user id is numeric, keeping it')
                    useridlist.append(userid)
                    usernamelist.append(username)
                else:
                    continue
            except (TypeError, ValueError):
                print('user id is not numeric, discarding it')
                continue

        # fetch the comment timestamps
        commenttimelist = []
        time_xpath = "//div[contains(@class,'cntwrap')]//div[contains(@class,'rp')]//div[contains(@class,'time s-fc4')]"
        songtime = driver.find_elements(By.XPATH, time_xpath)
        for itime in songtime:
            commenttime = itime.text
            print('commenttime = ', commenttime)
            commenttimelist.append(commenttime)
        # pad with a default date if fewer timestamps than user ids were found
        if len(commenttimelist) < len(useridlist):
            for i in np.arange(0, len(useridlist) - len(commenttimelist), 1):
                commenttimelist.append('2021年2月18日')
        print('len(useridlist) is = ', len(useridlist))
        for i in np.arange(0, len(useridlist), 1):
            userid_i = useridlist[i]
            username_i = usernamelist[i]
            commenttime_i = commenttimelist[i]
            # insert into the database
            print('userid_i =', userid_i)
            print('username_i =', username_i)
            print('commenttime_i =', commenttime_i)
            userurl_i = 'https://music.163.com/#/user/home?id=' + str.strip(userid_i)
            print('userurl_i =', userurl_i)
            clbz = 'N'
            try:
                insert_userinf(userid_i, username_i, commenttime_i, userurl_i, clbz)
            except Exception:
                print('error while inserting into the database')

        time.sleep(5)
        updater_songurl(url)



def is_login(source):
    rs = re.search(r"CONFIG\['islogin'\]='(\d)'", source)
    if rs:
        return int(rs.group(1)) == 1
    else:
        return False

if __name__ == '__main__':
    #url = 'https://music.163.com/#/discover/toplist?id=2884035'

    urllist = select_songurl()
    # urllist =['https://music.163.com/#/song?id=569200214&lv=-1&kv=-1&tv=-1','https://music.163.com/#/song?id=569200213&lv=-1&kv=-1&tv=-1']

    geturl(urllist)
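One fragile spot above is `userurl[35:]`, which depends on the exact length of the homepage URL prefix. A more robust way to pull the numeric id out of the `href` is to parse the query string. This is a standalone sketch, not code from the original crawler:

```python
from urllib.parse import urlparse, parse_qs

def extract_user_id(userurl):
    """Return the numeric id from a NetEase homepage URL, or None."""
    # the id lives in the query string, e.g. .../user/home?id=519250015
    query = parse_qs(urlparse(userurl).query)
    ids = query.get('id', [])
    return ids[0] if ids and ids[0].isdigit() else None

print(extract_user_id('https://music.163.com/user/home?id=519250015'))  # 519250015
print(extract_user_id('https://music.163.com/user/home?id=abc'))        # None
```

Unlike a fixed slice, this keeps working if the URL prefix changes length or extra query parameters are appended.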

The results of the capture are as follows (screenshot omitted).
A few notes:
1. I did not implement paging through the latest comments. To do that, you would locate the page-turn button, click it, and re-extract the user IDs.
2. The text of the comments themselves is not stored for now.
3. The scraped comment dates come in very irregular formats and need follow-up processing.
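Note 3 can be tackled with a small normalizer. NetEase renders comment times in several shapes (e.g. `2021年2月18日`, `2月18日 14:05`, or just `14:05` for today's comments); the patterns below are assumptions based on typical output and the hypothetical `normalize_comment_date` helper would need adjusting against real scraped data:

```python
import re
from datetime import date

def normalize_comment_date(raw, today=None, default_year=2021):
    """Best-effort conversion of a scraped time string to YYYY-MM-DD."""
    today = today or date.today()
    m = re.match(r'(\d{4})年(\d{1,2})月(\d{1,2})日', raw)
    if m:                                   # full date, e.g. 2021年2月18日
        y, mo, d = map(int, m.groups())
        return '%04d-%02d-%02d' % (y, mo, d)
    m = re.match(r'(\d{1,2})月(\d{1,2})日', raw)
    if m:                                   # month/day only: assume a default year
        mo, d = map(int, m.groups())
        return '%04d-%02d-%02d' % (default_year, mo, d)
    if re.match(r'\d{1,2}:\d{2}$', raw):    # time only: the comment was made today
        return today.isoformat()
    return None                             # unrecognized -- leave for later handling

print(normalize_comment_date('2021年2月18日'))                    # 2021-02-18
print(normalize_comment_date('2月18日 14:05'))                    # 2021-02-18
print(normalize_comment_date('14:05', today=date(2021, 2, 19)))  # 2021-02-19
```

Strings the normalizer does not recognize (e.g. relative dates like "昨天") come back as `None`, which keeps them visible for the follow-up processing mentioned above.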

In the next article we will complete step 3, which gives us the ability to push songs to 100,000+ users.

Origin blog.csdn.net/weixin_43290383/article/details/113868779