Word Cloud Analysis of Scraped CSDN Blog Topics

1. Fetching the page over HTTP with requests


import requests

# Blog home page to scrape
url = 'https://blog.csdn.net/weixin_39920026'
# Send a browser-like User-Agent header with the request
head = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36'}
rqq = requests.get(url=url, headers=head)
rqq.text  # decoded HTML of the page
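A slightly more defensive version of the fetch step might look like the sketch below. The helper name `fetch_html`, the `DEFAULT_HEADERS` constant, and the 10-second timeout are assumptions made for illustration and are not part of the original post. The function takes the session as an argument, so it works with a `requests.Session` or with any object that offers a compatible `.get()`.

```python
# Hypothetical helper (not from the original post): fetch a page with a
# timeout and fail loudly on HTTP errors instead of parsing an error page.
DEFAULT_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36"
    )
}


def fetch_html(session, url, headers=DEFAULT_HEADERS, timeout=10.0):
    """Return the decoded HTML of `url`.

    `session` is expected to be a requests.Session; the timeout value is
    an assumed default, adjust it to your network.
    """
    response = session.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises on 4xx/5xx status codes
    return response.text
```

With the real library this would be called as `fetch_html(requests.Session(), url)`. Without a timeout, a stalled connection can block the script indefinitely, which is the main reason to pass one.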

2. Parsing the titles with BeautifulSoup


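from bs4 import BeautifulSoup

a = []
soup = BeautifulSoup(rqq.content, 'lxml')
b = soup.find_all('a')  # every <a> tag on the page
len(b)
# Link positions 13 to 74 are taken to be the article titles on this page
for link in b[13:75]:
    a.append(link.string)  # .string is None when a tag has more than one child
a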
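The hard-coded positions above depend on the exact page layout at the time of scraping. As a rough alternative, the titles could be selected by their link target instead. The sketch below assumes that article links contain `/article/details/` in their `href` (the pattern seen in this post's own URL); the function name `extract_article_titles` is hypothetical, and the real page markup may need a different filter.

```python
from bs4 import BeautifulSoup


def extract_article_titles(html):
    """Return the text of every <a> whose href looks like an article URL.

    Assumption: article links contain '/article/details/'. This is inferred
    from one URL and is not a documented CSDN rule.
    """
    # 'html.parser' ships with Python, so lxml is not required here
    soup = BeautifulSoup(html, "html.parser")
    titles = []
    for link in soup.find_all("a", href=True):
        if "/article/details/" not in link["href"]:
            continue
        # get_text() also works for links with nested tags, where .string is None
        text = link.get_text(strip=True)
        if text:
            titles.append(text)
    return titles
```

Because `get_text()` never returns `None`, this variant also avoids the literal `'None'` strings that otherwise have to be filtered out later as stop words.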

3. Segmenting with jieba and removing stop words


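import jieba
import pandas as pd

a = " ".join('%s' % i for i in a)  # join the title list into one string
a = jieba.lcut(a)  # segment the string into a list of words with jieba
# Read the stop word dictionary; sep='hahaha' is presumably a separator that never occurs, so each line loads as one column
stopWords = pd.read_csv('stopword.txt', encoding='gbk', sep='hahaha', engine='python', header=None)
# Combine the stop word dictionary with custom stop words
stopwords = list(stopWords.iloc[:, 0]) + ['None', ' ', '\n', ',', '1', '2', '3', '.', '(', ')', '—', ',', '。', '“', '”']
b = []
for i in a:
    if i not in stopwords:
        b.append(i)
b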
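The same filtering can be sketched without pandas by reading the stop word file line by line into a set, which makes each membership check fast. The helper names `load_stopwords` and `remove_stopwords` are hypothetical, and the sketch assumes the file holds one stop word per line in GBK encoding, as the `read_csv` call above suggests.

```python
def load_stopwords(path, encoding="gbk", extra=()):
    """Read one stop word per line into a set and add any extra entries."""
    with open(path, encoding=encoding) as f:
        words = {line.strip() for line in f if line.strip()}
    words.update(extra)
    return words


def remove_stopwords(tokens, stopwords):
    """Keep tokens that are not blank and not in the stop word set."""
    return [t for t in tokens if t.strip() and t not in stopwords]
```

A call such as `remove_stopwords(jieba.lcut(text), load_stopwords('stopword.txt', extra=['None']))` would then replace the explicit loop. Dropping blank tokens inside the filter removes the need to list `' '` and `'\n'` as custom stop words.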

4. Counting word frequencies and drawing the word cloud


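from collections import Counter
word_fre = Counter(b)  # word -> number of occurrences

import matplotlib.pyplot as plt
from wordcloud import WordCloud
mask = plt.imread('123.jpg')  # shape image; pure white areas are left empty
# font_path must point to a font with Chinese glyphs, otherwise Chinese words render as boxes
ciyun = WordCloud(mask=mask, background_color='white', font_path=r'C:\Windows\Fonts\simhei.ttf')
ciyun.fit_words(word_fre)
plt.imshow(ciyun)
plt.axis('off')  # hide the axes around the image
plt.show()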
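Before drawing, it can help to look at the most frequent words directly, since leftover punctuation or filler words are easier to spot in a short list than in the image. The sketch below uses only `collections.Counter`; the function name `top_words` and the sample tokens in the usage note are made up for illustration.

```python
from collections import Counter


def top_words(tokens, n=10):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)
```

For example, `top_words(['python', '爬虫', 'python', '词云', 'python', '爬虫'], n=2)` gives `[('python', 3), ('爬虫', 2)]`. Any unwanted entries that show up here can be added to the custom stop word list before the cloud is regenerated.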

5. The resulting word cloud


[Image: the generated word cloud]

6. Summary


The word cloud shows that its most prominent topic words are fairly close to the topics of my recent blog posts.


Reposted from blog.csdn.net/weixin_39920026/article/details/104321121