foreword
Emmmm, not much to say here; everything I want to say is in the code.
environment used
- Python 3.8 / 3.10 interpreter
- PyCharm 2021.2 Professional Edition
- selenium 3.141.0
This time we use the selenium module, so remember to download the matching browser driver (e.g. ChromeDriver for Chrome) in advance and add it to your PATH.
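Selenium 3 looks the driver binary up on the PATH. Before launching the browser, a quick stdlib check (nothing selenium-specific; `shutil.which` is the standard-library PATH lookup) can confirm the driver is discoverable:

```python
import shutil

# Look chromedriver up on PATH, the same place selenium 3 will search.
driver_path = shutil.which("chromedriver")
if driver_path:
    print(f"chromedriver found at: {driver_path}")
else:
    print("chromedriver not on PATH; download it and add its folder to PATH")
```

If this prints the "not on PATH" line, `webdriver.Chrome()` below will raise an error about the driver executable.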
Code
First, install and import the required modules
from selenium import webdriver  # drives the browser
import re  # regular expressions, built-in
import time  # time module, used to delay the program, built-in
1. Create a browser object
driver = webdriver.Chrome()
2. Execute automation
driver.get('https://music.163.com/#/song?id=488249475')
# selenium cannot directly read data inside a nested (iframe) page
driver.switch_to.frame(0)  # switch_to.frame() switches into the nested page
driver.implicitly_wait(10)  # wait for the page to render while the browser loads
Scroll the page straight down to the bottom
js = 'document.documentElement.scrollTop = document.documentElement.scrollHeight'
driver.execute_script(js)
3. Parse the data
divs = driver.find_elements_by_css_selector('.itm')  # all comment divs; CSS selector syntax (XPath or regex would also work)
for div in divs:
    cnt = div.find_element_by_css_selector('.cnt.f-brk').text
    cnt = re.findall(':(.*)', cnt)[0]  # careful: full-width (Chinese) and half-width (English) colons are different characters
    print(cnt)
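The `re.findall(':(.*)', cnt)[0]` line strips the `username:` prefix from each comment. A standalone check with a made-up comment string shows what it keeps, and why the colon warning matters:

```python
import re

# A made-up comment in the "username: text" shape the page produces.
sample = "some_user:this song is great"
print(re.findall(':(.*)', sample)[0])  # -> this song is great

# A full-width (Chinese) colon is a different character, so the pattern
# above will not match it; a character class covers both.
sample_cn = "some_user\uff1athis song is great"
print(re.findall('[:\uff1a](.*)', sample_cn)[0])  # -> this song is great
```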
save the data
turn pages
for page in range(10):  # controls page turning; don't go too fast
    # turn the page: find the "next page" button and click it
    driver.find_element_by_css_selector('.znxt').click()
    time.sleep(1)
    # with selenium, haste makes waste
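`time.sleep(1)` is a fixed pause; selenium's own `WebDriverWait` with `expected_conditions` (e.g. waiting for the next-page button to be clickable) is usually sturdier. The idea behind it is just a polling loop, sketched here in plain Python with a hypothetical `condition` callable standing in for a selenium expected condition:

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` expires.

    This mirrors what selenium's WebDriverWait does internally:
    try, sleep briefly, try again.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(poll)
    raise TimeoutError("condition not met within timeout")

# Demo with a counter standing in for "is the next-page button clickable?".
state = {"tries": 0}

def fake_condition():
    state["tries"] += 1
    return state["tries"] >= 3  # becomes truthy on the third poll

print(wait_until(fake_condition, timeout=5.0, poll=0.01))  # -> True
```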
save as a txt file
# note: this block belongs inside the comment loop above, so every cnt is appended
with open('contend.txt', mode='a', encoding='utf-8') as f:
    f.write(cnt + '\n')
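Mode `'a'` appends, so each pass through the loop (and each re-run of the script) keeps adding lines instead of overwriting the file. A small stdlib round-trip (using a temp file rather than the real `contend.txt`) shows the accumulation:

```python
import os
import tempfile

# Write two "comments" in append mode, the way the scraping loop does.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
for cnt in ["first comment", "second comment"]:
    with open(path, mode='a', encoding='utf-8') as f:
        f.write(cnt + '\n')

with open(path, mode='r', encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)  # -> ['first comment', 'second comment']
```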
run the code and get the result
now turn it into a word cloud
import the related modules
import jieba  # Chinese word segmentation library: pip install jieba
import wordcloud  # word cloud module: pip install wordcloud
import imageio  # only needed if you load an image to use as a mask: pip install imageio
read the file data
with open('contend.txt', mode='r', encoding='utf-8') as f:
    txt = f.read()
print(txt)
Segment the scraped text into Chinese words for the word cloud
txt_list = jieba.lcut(txt)
print('segmentation result:', txt_list)
merge the tokens
string_ = ' '.join(txt_list)  # join the tokens with spaces: basic string syntax
print('joined tokens:', string_)
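wordcloud expects one space-separated string, which is why the jieba tokens get joined. With a hand-made token list standing in for jieba's output (so this runs without jieba installed):

```python
# Tokens as jieba.lcut would return them (hand-made stand-ins here).
txt_list = ["netease", "cloud", "music", "comments"]
string_ = ' '.join(txt_list)
print(string_)  # -> netease cloud music comments
```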
Create a word cloud
wc = wordcloud.WordCloud(
    width=1000,  # image width
    height=800,  # image height
    background_color='white',  # background color
    font_path='msyh.ttc',  # Microsoft YaHei, needed to render Chinese characters
    scale=15,  # rendering scale: larger means sharper output but slower drawing
    # mask=img,  # optional image that defines the word cloud's shape
    # stop words: modal particles, auxiliary words, ...
    stopwords=set(line.strip() for line in open('cn_stopwords.txt', mode='r', encoding='utf-8'))
)
print('drawing the word cloud...')
wc.generate(string_)  # build the word cloud
wc.to_file('out.png')  # save the word cloud image
print('word cloud finished...')
final result