Crawling Zhao Lei's lyrics to make a word cloud


Today we crawl the lyrics of NetEase Cloud Music singer Zhao Lei's songs and turn them into a word cloud. In this article you will learn what a word cloud is, the basic workflow of a crawler, and some simple visualization.


1 What is a word cloud

There are many kinds of visualization, and good data visualization makes the results of data analysis much easier to absorb. A word cloud is one such visualization: it draws each keyword at a size proportional to how often it occurs, so the main themes of a text can be grasped at a glance.

2 Steps to make a word cloud

1 Collect the content to visualize

The content can come from an article or, of course, from a crawler. Here we first use a short passage about NBA star Kobe Bryant as the material.

2 Install the word cloud library

wordcloud is installed as follows (assuming the machine already has a Python environment and the basic libraries; installing Anaconda is recommended, since it saves a lot of dependency trouble):

  • pip install wordcloud
  • pip install jieba — Chinese word segmentation library, needed because we want to display Chinese content
  • pip install Pillow — image processing library (imported as PIL; the old `pip install PIL` no longer works on modern Python)
  • pip install matplotlib — plotting and display library
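To confirm the installation worked, here is a minimal sanity check (the `__version__` attributes are standard for these packages):

```python
# Verify that the four libraries import correctly and report their versions
import wordcloud
import jieba
import matplotlib
import PIL

print(wordcloud.__version__, jieba.__version__, matplotlib.__version__, PIL.__version__)
```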

3 The jieba Chinese library and the wordcloud parameters used below

(1) About jieba

  • The jieba library is a Python Chinese word segmentation component with three segmentation modes:
  • Precise mode: tries to cut the sentence into the most accurate words; best suited for text analysis
  • Full mode: scans out every fragment of the sentence that could be a word; very fast, but cannot resolve ambiguity
  • Search engine mode: on top of precise mode, re-segments long words; suitable for search engine indexing
  • Supports segmentation of traditional Chinese text
  • Supports custom dictionaries
  • MIT license

(2) jieba API

jieba.cut accepts three parameters: the string to segment (which may contain Chinese), plus two keyword arguments:

  • cut_all: a boolean controlling whether full mode is used
  • HMM: a boolean controlling whether the HMM (Hidden Markov Model) is used for discovering new words

jieba.cut_for_search accepts two parameters: the string to segment (again, it may contain Chinese) and HMM, controlling whether the HMM model is used. The biggest difference from jieba.cut is that its segmentation is finer: it outputs every plausible split of long words, so there is no cut_all parameter.
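A quick demo of the three modes, a minimal sketch using the example sentence from jieba's own documentation:

```python
import jieba

sentence = "我来到北京清华大学"

# Precise mode (cut_all=False): the most accurate segmentation
print("/".join(jieba.cut(sentence, cut_all=False)))
# Full mode (cut_all=True): every fragment that could be a word
print("/".join(jieba.cut(sentence, cut_all=True)))
# Search engine mode: re-segments long words from the precise result
print("/".join(jieba.cut_for_search(sentence)))
```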

4 Test code: a Kobe Bryant word cloud

```python
# -*- coding: utf-8 -*-
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import jieba

# Sample text: a short summary of Kobe Bryant's career (in Chinese)
f = '科比的职业生涯随湖人队5夺NBA总冠军(2000年-2002年、2009年-2010年);\
荣膺1次常规赛MVP(2007-08赛季),2次总决赛MVP(2009年-2010年),\
4次全明星赛MVP(2002年、2007年、2009年与2011年),\
与鲍勃·佩蒂特并列NBA历史第一;共18次入选NBA全明星阵容,15次入选NBA最佳阵容,12次入选NBA最佳防守阵容'

# Generate the word cloud
def create_word_cloud(f):
    print('Computing the word cloud from word frequencies')
    # Segment the Chinese text and join the words with spaces
    text = " ".join(jieba.cut(f, cut_all=False, HMM=True))

    wc = WordCloud(
        font_path="./SimHei.ttf",  # a Chinese font is required, otherwise Chinese characters render as boxes
        max_words=100,             # maximum number of words to display
        width=2000,                # canvas width
        height=1200,               # canvas height
    )
    wordcloud = wc.generate(text)
    # Save the word cloud image
    wordcloud.to_file("wordcloud.jpg")
    # Display the word cloud
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

create_word_cloud(f)
```

The result: a word cloud image generated from the Kobe text and saved as wordcloud.jpg.

3 Case: a word cloud from NetEase Cloud Music singer Zhao Lei's lyrics

1 Overall flow chart

(Flow chart: request the singer's page → parse the top 50 hot songs' IDs and names with XPath → splice each song's lyrics API URL and fetch the lyrics → remove stop words → segment with jieba → generate and display the word cloud.)

2 Crawling and word cloud production

(1) Let's first look at what the NetEase Cloud Music lyrics API requires. It turns out it needs a song ID, so the first step is to open the site, navigate to a singer's page, and work out the pattern for extracting song IDs.

  • Open the singers page on music.163.com.

  • Click through to Zhao Lei's song page. If you click a few different singers, you will notice that each singer's page URL ends in a different id parameter.

  • On a singer's page, the top 50 hot songs live inside the element whose id attribute is hotsong-list.

  • Since we need the lyrics of every song, we also need every song's link, which sits in an <a> tag, so we use XPath to parse out all the <a> tags under that list:

```python
# Get the top 50 hot songs (IDs and names) from a singer's page
def get_songs(artist_id):
    page_url = 'https://music.163.com/artist?id=' + artist_id
    # Fetch the page HTML
    res = requests.request('GET', page_url, headers=headers)
    # Parse the top 50 hot songs with XPath
    html = etree.HTML(res.text)
    href_xpath = "//*[@id='hotsong-list']//a/@href"
    name_xpath = "//*[@id='hotsong-list']//a/text()"
    hrefs = html.xpath(href_xpath)
    names = html.xpath(name_xpath)
    # Collect the hot songs' IDs and names
    song_ids = []
    song_names = []
    for href, name in zip(hrefs, names):
        # each href looks like '/song?id=xxxxx', so the ID starts at index 9
        song_ids.append(href[9:])
        song_names.append(name)
        print(href, '  ', name)
    return song_ids, song_names
```

(2) Now splice together the URL for crawling each song's lyrics: 'http://music.163.com/api/song/lyric?os=pc&id=' + song_id + '&lv=-1&kv=-1&tv=-1'.

```python
# Fetch the lyrics of every song
for song_id, song_name in zip(song_ids, song_names):
    # Lyrics API URL
    lyric_url = 'http://music.163.com/api/song/lyric?os=pc&id=' + song_id + '&lv=-1&kv=-1&tv=-1'
    lyric = get_song_lyric(headers, lyric_url)
    all_word = all_word + ' ' + lyric
    print(song_name)
```
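The API returns JSON with the lyrics under res.json()['lrc']['lyric'], in LRC format: each line is prefixed with a [mm:ss.xx] timestamp. The regular expression in get_song_lyric (defined in the full code below) strips those timestamps; here is a minimal sketch with a made-up lyric snippet:

```python
import re

# A hypothetical LRC-style lyric snippet, as returned by the lyrics API
lyric = "[00:12.34]这是一句歌词\n[00:15.67]这是另一句歌词"

# The same substitution used in get_song_lyric:
# drop digits, colons, dots and square brackets, leaving only the words
new_lyric = re.sub(r'[\d:.[\]]', '', lyric)
print(new_lyric)  # the timestamps are gone; only the lyric lines remain
```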

(3) Remove stop words: production credits such as 作词 (lyricist), 编曲 (arranger) and similar terms that appear in almost every lyric file but carry no meaning for the cloud.

```python
# Remove stop words (production credits that appear in every lyric file)
def remove_stop_words(f):
    stop_words = ['作词', '作曲', '编曲', '人声', 'Vocal', '弦乐', 'Keyboard', '键盘',
                  '编辑', '助理', 'Assistants', 'Mixing', 'Editing', 'Recording',
                  '音乐', '制作', 'Producer', '发行', 'produced', 'and', 'distributed']
    for stop_word in stop_words:
        f = f.replace(stop_word, '')
    return f
```

(4) Overall code

```python
import re
import requests
import jieba
import numpy as np
import matplotlib.pyplot as plt
import PIL.Image as image
from wordcloud import WordCloud
from lxml import etree

headers = {
    'Referer': 'http://music.163.com',
    'Host': 'music.163.com',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Chrome/10'
}

# Get the lyrics of one song
def get_song_lyric(headers, lyric_url):
    res = requests.request('GET', lyric_url, headers=headers)
    if 'lrc' in res.json():
        lyric = res.json()['lrc']['lyric']
        # Strip the [mm:ss.xx] timestamp info
        new_lyric = re.sub(r'[\d:.[\]]', '', lyric)
        return new_lyric
    else:
        print(res.json())
        return ''

# Remove stop words (production credits that appear in every lyric file)
def remove_stop_words(f):
    stop_words = ['作词', '作曲', '编曲', '人声', 'Vocal', '弦乐', 'Keyboard', '键盘',
                  '编辑', '助理', 'Assistants', 'Mixing', 'Editing', 'Recording',
                  '音乐', '制作', 'Producer', '发行', 'produced', 'and', 'distributed']
    for stop_word in stop_words:
        f = f.replace(stop_word, '')
    return f

# Generate the word cloud
def create_word_cloud(f):
    print('Generating the word cloud from word frequencies!')
    f = remove_stop_words(f)
    cut_text = " ".join(jieba.cut(f, cut_all=False, HMM=True))
    # Use an image as the mask that shapes the word cloud
    mask = np.array(image.open(r"C:\Users\lj\Desktop\1.jpg"))
    wc = WordCloud(
        mask=mask,
        font_path="./SimHei.ttf",  # Chinese font, required for Chinese text
        max_words=100,
        width=2000,
        height=1200,
    )
    wordcloud = wc.generate(cut_text)
    # Save the word cloud image
    wordcloud.to_file("wordcloud.jpg")
    # Display the word cloud
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.show()

# Get the top 50 hot songs (IDs and names) from a singer's page
def get_songs(artist_id):
    page_url = 'https://music.163.com/artist?id=' + artist_id
    # Fetch the page HTML
    res = requests.request('GET', page_url, headers=headers)
    # Parse the top 50 hot songs with XPath
    html = etree.HTML(res.text)
    href_xpath = "//*[@id='hotsong-list']//a/@href"
    name_xpath = "//*[@id='hotsong-list']//a/text()"
    hrefs = html.xpath(href_xpath)
    names = html.xpath(name_xpath)
    # Collect the hot songs' IDs and names
    song_ids = []
    song_names = []
    for href, name in zip(hrefs, names):
        # each href looks like '/song?id=xxxxx', so the ID starts at index 9
        song_ids.append(href[9:])
        song_names.append(name)
        print(href, '  ', name)
    return song_ids, song_names

if __name__ == '__main__':
    # Set the singer ID; Zhao Lei is 6731
    artist_id = '6731'
    song_ids, song_names = get_songs(artist_id)
    # All lyrics concatenated
    all_word = ''
    # Fetch the lyrics of every song
    for song_id, song_name in zip(song_ids, song_names):
        # Lyrics API URL
        lyric_url = 'http://music.163.com/api/song/lyric?os=pc&id=' + song_id + '&lv=-1&kv=-1&tv=-1'
        lyric = get_song_lyric(headers, lyric_url)
        all_word = all_word + ' ' + lyric
        print(song_name)
    # Generate the word cloud from the collected lyrics
    create_word_cloud(all_word)
```

The result: a word cloud generated from Zhao Lei's lyrics, shaped by the mask image and saved as wordcloud.jpg.

4 Summary

To wrap up today's content: the topics covered include visualization, word cloud generation, the basic crawling workflow, and the jieba Chinese word segmentation library.

Origin: juejin.im/post/7080532022673817637