Crawling NetEase Cloud Music song comments with Python and making a word cloud

Foreword

Emmm, nothing much to say; everything I want to say is in the code.

Environment

  • Python 3.8 Interpreter 3.10
  • Pycharm 2021.2 Professional Edition
  • selenium 3.141.0

This time we use the selenium module, so remember to download the browser driver in advance and set up the environment (the driver has to match your browser version and be reachable on PATH).
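A quick way to confirm the driver is set up before running anything is to look for it on PATH. This is only a minimal sketch: it assumes Chrome is the browser and that the driver executable is named chromedriver, and find_driver is an illustrative helper name, not part of selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver was not found on PATH; download it and add its folder to PATH")
    else:
        print("Found driver at", path)
```

If this prints that the driver was not found, webdriver.Chrome() will most likely fail as well until the driver folder is added to PATH.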

Code

First install and import the required modules

from selenium import webdriver  # browser automation
import re   # regular expressions, built in
import time   # time module, used to add delays

1. Create a browser object

driver = webdriver.Chrome()

2. Execute automation

driver.get('https://music.163.com/#/song?id=488249475')
# selenium cannot directly read data inside a nested page (iframe)
driver.switch_to.frame(0)  # switch_to.frame() switches into the nested page
driver.implicitly_wait(10)  # wait up to 10 seconds for elements to appear while the page renders
# scroll straight to the bottom of the page
js = 'document.documentElement.scrollTop = document.documentElement.scrollHeight'
driver.execute_script(js)

3. Parse the data

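divs = driver.find_elements_by_css_selector('.itm')  # all comment divs; CSS selectors locate HTML elements (xpath or regex would also work)

for div in divs:
    cnt = div.find_element_by_css_selector('.cnt.f-brk').text

    cnt = re.findall('[::](.*)', cnt)[0]  # drop the "nickname:" prefix; the Chinese full-width colon and the ASCII colon are different characters, so match either
    print(cnt)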
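The regex step can be tried out on plain strings without opening a browser. The sketch below assumes each comment's text has the form nickname, colon, comment, as the pattern above implies; which colon the page actually uses is an assumption, so both are accepted. extract_comment is an illustrative name.

```python
import re

# Find the first colon, full-width or ASCII, and capture everything after it.
# re.S lets "." cross line breaks so multi-line comments are kept whole.
_PREFIX = re.compile(r"[::](.*)", re.S)


def extract_comment(raw):
    """Strip the leading 'nickname:' part and return only the comment text."""
    match = _PREFIX.search(raw)
    if match is None:
        return raw.strip()  # no colon at all: fall back to the whole string
    return match.group(1).strip()


if __name__ == "__main__":
    print(extract_comment("小明:这首歌真好听"))
    print(extract_comment("user: nice song: really"))
```

Unlike indexing findall(...)[0], this does not raise an IndexError when a string has no colon. A nickname that itself contains a colon would still be split too early.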

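Save data

Turn pages (the parsing step above goes inside this loop, so each page is parsed before moving to the next)

for page in range(10):  # number of page turns; do not go too fast
    # to turn the page, find the "next page" button and click it
    driver.find_element_by_css_selector('.znxt').click()
    time.sleep(1)
# with selenium, more haste means less speed

Save as a txt file (this goes inside the "for div in divs" loop, so every comment is appended, not only the last one)

with open('contend.txt', mode='a', encoding='utf-8') as f:
    f.write(cnt + '\n')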
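The fragments above are shown separately, so here is one way they might fit together. This is a sketch, not necessarily how the original script was laid out. crawl_comments and its parameters are illustrative names, and driver is assumed to be a selenium 3 WebDriver that has already opened the song page and switched into the comment frame. The find_element_by_css_selector style is the selenium 3 API used in this post; Selenium 4.3 and later removed it in favor of find_element(By.CSS_SELECTOR, ...).

```python
import re
import time


def crawl_comments(driver, pages=10, out_path="contend.txt", delay=1.0):
    """Parse the comments on each page, append them to out_path, then turn the page.

    Returns the number of comments written.
    """
    saved = 0
    with open(out_path, mode="a", encoding="utf-8") as f:
        for page in range(pages):
            # Re-locate the comment blocks on every page
            for div in driver.find_elements_by_css_selector(".itm"):
                raw = div.find_element_by_css_selector(".cnt.f-brk").text
                found = re.findall("[::](.*)", raw)
                if not found:
                    continue  # skip entries without the "nickname:" prefix
                f.write(found[0] + "\n")
                saved += 1
            if page < pages - 1:
                # Click "next page" and give the new comments time to load
                driver.find_element_by_css_selector(".znxt").click()
                time.sleep(delay)
    return saved
```

Called as crawl_comments(driver) after the scrolling step, this parses ten pages and clicks "next page" nine times, skipping the click after the last page.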

Run the code to get the results

Now for the word cloud

Import the related modules

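import jieba  # Chinese word segmentation library: pip install jieba
import wordcloud  # module for drawing word clouds: pip install wordcloud
import imageio  # for reading a mask image; only needed if the mask option below is used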

Read the file data

with open('contend.txt', mode='r', encoding='utf-8') as f:
    txt = f.read()
print(txt)

Segment the text into words (Chinese word segmentation); the word cloud is built from the result

txt_list = jieba.lcut(txt)
print('Segmentation result:', txt_list)

Merge

string_ = ' '.join(txt_list)  # basic string operation: join the words with spaces
print('Merged words:', string_)
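Since the word cloud sizes each word by how often it occurs, a quick frequency count shows in advance what will dominate the picture. The sketch below uses only the standard library. top_words and min_len are illustrative names, and the sample list is hand-written to stand in for txt_list, not real jieba output.

```python
from collections import Counter


def top_words(tokens, n=10, min_len=2):
    """Count tokens, ignoring whitespace and tokens shorter than min_len characters."""
    cleaned = [t.strip() for t in tokens]
    counts = Counter(t for t in cleaned if len(t) >= min_len)
    return counts.most_common(n)


if __name__ == "__main__":
    sample = ["这", "首歌", "真", "好听", ",", "好听", "到", "哭", "\n", "首歌"]
    print(top_words(sample, n=2))
```

Dropping single-character tokens is a rough filter for punctuation and particles; the stop word list used later does the same job more carefully.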

Create a word cloud

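wc = wordcloud.WordCloud(
    width=1000,  # canvas width
    height=800,  # canvas height
    background_color='white',  # background color
    font_path='msyh.ttc',  # Microsoft YaHei, a font with Chinese glyphs
    scale=15,  # scale factor, not a font size: the saved image is width*scale by height*scale pixels; lower it if drawing is slow
    # mask=img,  # image that sets the shape of the word cloud

    # stop words: modal particles, auxiliary words, ...
    stopwords=set([line.strip() for line in open('cn_stopwords.txt', mode='r', encoding='utf-8').readlines()] )
)
print('Drawing the word cloud...')
wc.generate(string_)  # build the word cloud
wc.to_file('out.png')  # save the word cloud
print('Word cloud finished...')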
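The stopwords argument above opens cn_stopwords.txt inline and never closes it explicitly. A small helper with a with block is one alternative. load_stopwords is an illustrative name, and the one-word-per-line layout of the file is assumed from the code above.

```python
def load_stopwords(path, encoding="utf-8"):
    """Read one stop word per line into a set, skipping blank lines."""
    with open(path, mode="r", encoding=encoding) as f:
        return {line.strip() for line in f if line.strip()}
```

The result can be passed straight to WordCloud as stopwords=load_stopwords('cn_stopwords.txt').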

Final result

Origin blog.csdn.net/m0_48405781/article/details/124990752