Python application learning (4): generating word clouds with wordcloud


Preface

A friend of mine recently posted some good book and drama recommendations on their WeChat public account, and I thought I could help out by making a word cloud from the book reviews; maybe it would make the posts more effective. So I just did it: gathered information from around the Internet and finally put something together to fit my own needs. Here it is!

1. The goal

  Crawl book reviews or film reviews, extract the review text, and make a word cloud from it, as shown in the figure below.
[Figure: an example of the generated word cloud]
2. Content preview

This article touches quite a few knowledge points, and the code is commented throughout. A brief list:
(1) string processing in Python (deleting what you don't want) with the re library (see the small sketch after this list)
(2) reading and writing files in Python
(3) converting between Python data types, including strings, lists, and dictionaries
(4) crawler-related content
(5) practical use of the jieba ("stutter", hahaha) word-segmentation library
(6) use of the wordcloud word-cloud generation library
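
Point (1) boils down to a single re.sub call with a character-class pattern. A minimal standalone sketch (the sample string is my own; the pattern is the same one used in the full code later in the article):

# Sketch of point (1): deleting what you don't want with re.sub
import re

text = 'Hello! 这是一条书评, posted 2021.'
only_chinese = re.sub(r'[^\u4e00-\u9fa5]', '', text)  # delete everything that is not a Chinese character
print(only_chinese)  # -> 这是一条书评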


1. Preparation

1. A Python environment

2. The related Python libraries, which need to be installed with pip install <package name>:

pip install jieba
pip install wordcloud
(Any other library used in this article should be installed the same way if it is missing.)

2. Import the libraries

import jieba
from wordcloud import WordCloud

Note: I ran into a dumbfounding error when importing wordcloud. The message was ImportError: cannot import name 'WordCloud'. I found it very strange at the time: why was my computer different from everyone else's???
It turned out I had named my own Python file "wordcloud.py", so the import resolved to my script instead of the library, and of course there were problems. The fix is simply to rename the file...
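
If you hit the same error, one quick check (a small diagnostic sketch of my own, not from the original troubleshooting) is to print which file Python actually imported:

# Diagnostic sketch: see which file the name "wordcloud" resolves to.
# If this prints the path of your own script instead of site-packages, rename your script.
import wordcloud
print(wordcloud.__file__)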

3. Basic function implementation

Create a word cloud from a given piece of text:

# Make a simple word cloud from a given text (put the text into wordcloudtext.txt)
import re
import jieba
from wordcloud import WordCloud
import numpy
from PIL import Image

# create the word cloud
def create_wordcloud(content, savename):
    mask = numpy.array(Image.open("ball.jpg"))  # mask parameter: the cloud takes the shape of this image
    contents = ''.join(content)  # join the given content: a list gets its items concatenated, a dict joins its keys
    content_cut = jieba.cut(contents, cut_all=False)  # jieba.cut segments the text; cut_all switches between full mode and accurate mode
    content_space_split = ' '.join(content_cut)  # join the segmented words with spaces
    result = WordCloud('simhei.ttf',  # font path; a Chinese-capable font is required for Chinese text
                   mask=mask,
                   background_color='white',  # background color
                   width=1000,
                   height=600,).generate(content_space_split)  # build the word cloud
    result.to_file('%s.png' % savename)  # save the word cloud as an image

# remove the non-Chinese parts of the text
def find_chinese(file):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    chinese = re.sub(pattern, '', file)
    #print(chinese)
    return chinese

if __name__ == '__main__':
    # read the .txt file (put the text you want a word cloud of into it);
    # a raw string avoids backslash-escape trouble in the Windows path
    with open(r'D:\ryc\python_learning\other\3_wordcloud\wordcloudtext.txt', 'r', encoding='utf-8') as f:
        content = f.read()
        content = find_chinese(content)
        content = re.sub('[我你他的了但是就还要不会那在有都才看也又太像可中却很到对时候能这而当没]', '', content)  # strip unwanted high-frequency words such as 我/你/他 ("I/you/he")
        print(content)
    create_wordcloud(content, '词云评论')
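
A side note on the cut_all parameter mentioned in the comments above: here is a tiny standalone sketch of the two modes, using the example sentence from jieba's own documentation:

# Sketch: jieba's full mode vs. accurate mode
import jieba

sentence = '我来到北京清华大学'
print('/'.join(jieba.cut(sentence, cut_all=True)))   # full mode: 我/来到/北京/清华/清华大学/华大/大学
print('/'.join(jieba.cut(sentence, cut_all=False)))  # accurate mode (the default): 我/来到/北京/清华大学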

4. Crawling book reviews and making word clouds

import requests
import re
import jieba
from wordcloud import WordCloud
import numpy
from PIL import Image

# remove the non-Chinese parts of the text
def find_chinese(file):
    pattern = re.compile(r'[^\u4e00-\u9fa5]')
    chinese = re.sub(pattern, '', file)
    return chinese

# crawl the short reviews of The Little Prince (小王子)
def spider_xiaowangzi():
    commentres = ''
    with open(r'D:\ryc\python_learning\other\3_wordcloud\spider_wordcloud.txt', 'w', encoding='utf-8') as f:
        url = 'https://book.douban.com/subject/1084336/comments/?percent_type=h&limit=20&status=P&sort=new_score'  # crawl target
        header = {
            'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.163 Safari/537.36',
            'Cookie' : 'll="118100"; bid=gr9hyjlFAIs; __utmz=30149280.1586961843.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _vwo_uuid_v2=DAACAB21E936827CFA01C7ADE5CAF4293|cc739421584c029e0df955dae135ef07; __gads=ID=2846de2c51666e22:T=1587043991:S=ALNI_MaF-A9RTL8744UwEClUMK5nqOC8nw; _ga=GA1.2.507836951.1586961843; gr_user_id=3aa1d1e6-1a34-4b05-82ef-8066755b0ca9; __utmz=81379588.1587383782.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); __yadk_uid=Ot4d328rVsOjtT0KtAdZE0rJH9jhIhlq; viewed="1007305"; ap_v=0,6.0; __utmc=30149280; __utma=30149280.507836951.1586961843.1588474302.1588476961.13; __utmt_douban=1; __utmb=30149280.1.10.1588476961; __utma=81379588.507836951.1586961843.1587445404.1588476961.5; __utmc=81379588; __utmt=1; __utmb=81379588.1.10.1588476961; _pk_id.100001.3ac3=bc45a2b0c4ddf6ec.1587383783.5.1588476961.1587445427.; _pk_ses.100001.3ac3=*'
        }  # send request headers so the crawl does not get blocked
        try:
            data = requests.get(url, headers=header).text
        except:
            print('爬取失败')  # "crawl failed"
            exit()
        # parse the short-review text out of the crawled data (the result is a list), e.g.
        # <span class="short">十几岁的时候渴慕着小王子,一天之间可以看四十四次日落。是在多久之后才明白,看四十四次日落的小王子,他有多么难过。</span>
        comment = re.findall('<span class="short">(.*?)</span>', data)

        for i in range(0, len(comment)):
            commentres = commentres + comment[i]  # concatenate the list into one complete string
        commentres = find_chinese(commentres)     # remove the non-Chinese parts
        commentres = re.sub('[我你他的了但是就还要不会那在有都才看也又太像可中却很说到对]', '', commentres)  # strip unwanted high-frequency words such as 我/你/他

        f.write("{duanpin}\n".format(duanpin=commentres))  # write the result into the .txt file
        #print(commentres)
        return commentres


def create_wordcloud(content, savename):
    mask = numpy.array(Image.open("ball.jpg"))  # mask parameter: the cloud takes the shape of this image
    contents = ''.join(content)  # join the given content: a list gets its items concatenated, a dict joins its keys
    content_cut = jieba.cut(contents, cut_all=False)  # jieba.cut segments the text; cut_all switches between full mode (True) and accurate mode (False)
    content_space_split = ' '.join(content_cut)  # join the segmented words with spaces
    result = WordCloud('simhei.ttf',  # font path; a Chinese-capable font is required
                   mask=mask,
                   background_color='white',  # background color
                   width=1000,
                   height=600,).generate(content_space_split)  # build the word cloud
    result.to_file('%s.png' % savename)  # save the word cloud as an image

if __name__ == "__main__":
    comment = spider_xiaowangzi()
    create_wordcloud(comment, '小王子词云评论')
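
The heart of the crawler is the single re.findall call that pulls the review text out of the page source. A minimal standalone sketch of how it behaves (the HTML fragment here is made up):

# Sketch: extracting all short reviews from an HTML string with re.findall
import re

html = '<span class="short">review one</span> ... <span class="short">review two</span>'
print(re.findall('<span class="short">(.*?)</span>', html))  # -> ['review one', 'review two']; (.*?) is a non-greedy capture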

Finally

I will keep updating with similar fun Python applications. Interested friends can follow me to get new content as soon as it's posted!
(If you've read all the way here, leave a like before you go. Creating content isn't easy!)

For other Python application examples, see: https://blog.csdn.net/weixin_45386875/article/details/113766276

Origin: blog.csdn.net/weixin_45386875/article/details/113803501