Crawl a whole website's text with Python, plus word-cloud analysis (one click for the entire workflow!)

Introduction

Python is very hot right now — genuinely hot, and it seems it has been for a while, hahaha. If you think so too, then read this article and see whether Python's popularity can rub off on it.

So what can Python, the rising star among programming languages, actually do? The internet has "hyped" it plenty: one-click office automation; learn Python well and double your salary; earn a bit more; make your boss look at you with admiration and find your confidence again! That is neither flattery nor exaggeration. From cloud computing and big data to artificial intelligence, Python is everywhere, and big companies such as Baidu, Alibaba, and Tencent use it for all kinds of tasks, which keeps making it ever more practical. Its strengths need no further introduction from me — this is a technical article, after all, so enough talk, let's get to it!

If you do scientific research, please read to the end of the article — there is a surprise waiting!

Click here to download the source code and run it directly

Project Description

I recently received a private message from a fan on CSDN about an earlier article, Python crawls website novels and visualized analysis: the site covered there is quite good, and the fan wanted all the books on it to study. Out of care for my readers — and because I personally enjoy literary works, and reading a book in one's idle time to cultivate the mind is hardly a bad idea, hahaha — I started on the architecture right after receiving the request. By observing the structure of the web pages I found their pattern, then added my own design ideas plus a word-cloud analysis feature, tested it many times, and finally achieved true one-click operation!!!

Project idea and features

1. The user enters the link of any book's chapter page on the site plus a storage path and presses Enter; the crawler then runs in the background, followed by the intelligent word segmentation, and finally the powerful pyecharts library renders the word-cloud chart.

2. There are more than enough books there to keep you reading. And if you don't want to read a whole book but still want to know what it is mainly about and which high-frequency words appear, the word cloud will help you grasp its main content.

3. The project relies on data-analysis libraries and the Python standard library to perform text segmentation, smart splitting, a smart word-cloud algorithm, and a smart crawler algorithm, written with anti-scraping countermeasures and data-analysis highlights.

Project realization

1. First, install the required libraries

If you are missing any of them, see my earlier installation article for a detailed walkthrough — if anything is still unclear, I'll help you get them installed. They can definitely be installed~
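All of the dependencies live on PyPI, so one command covers them (package names inferred from the import statements in the code below — note that the PyPI package for the `fake_useragent` module is `fake-useragent`):

```shell
pip install requests lxml jieba fake-useragent pyecharts
```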

2. Implement crawler algorithm

Define the global variables up front

from pyecharts import options as opts
from pyecharts.charts import WordCloud
from pyecharts.globals import SymbolType
import jieba  # jieba handles Chinese word segmentation; its built-in dictionary is extremely powerful
from fake_useragent import UserAgent
import requests
from lxml import etree
import time
ll = []
lg = []
lk = []
lj = []
lp = []
li = []
d = {}  # storage containers for the segmentation results

def get_data(title, page, url, num):  # title: output file path; page: number of chapters to crawl; url: URL prefix; num: starting page number
    with open(r"{}.txt".format(title), "w", encoding="utf-8") as file:
        ua = UserAgent()  # spares us from maintaining fake headers ourselves: it hands out a random, valid browser User-Agent

        def get_page(url):
            headers = {"User-Agent": ua.random}
            res = requests.get(url=url, headers=headers)
            res.encoding = 'GBK'  # the site serves GBK-encoded pages
            html = res.text
            html_ = etree.HTML(html)
            text = html_.xpath('//div[@class="panel-body content-body content-ext"]//text()')
            for s in range(len(text)):
                file.write(text[s] + '\n')

        for i in range(page):
            # time.sleep(2)  # uncomment to throttle requests
            file.write("Chapter {}".format(i + 1))  # write the chapter heading
            get_page(url + "{}.html".format(num + i))  # step through the chapter pages
            print("Crawling chapter {}!".format(i + 1))
        print("Crawling finished!!!!")

3. Implement the smart word segmentation

I wrote the smart word-cloud algorithm myself, including a number of small helper features. The design took real effort, so I won't simply give it away for free — if you need it, private-message me or download it yourself!!!
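Since that module stays private, here is a minimal, stdlib-only sketch of the counting half of the idea: tally word frequencies and filter by word length, matching the 1-/2-/3-/4-character menu options in the main function. The name `top_words` and the sample `tokens` are purely illustrative; in the full program the tokens come from jieba's segmentation:

```python
from collections import Counter

def top_words(tokens, word_len=None, n=10):
    """Return the n most frequent (word, count) pairs,
    optionally keeping only words of a given character length."""
    if word_len is not None:
        tokens = [t for t in tokens if len(t) == word_len]
    return Counter(tokens).most_common(n)

# tokens as jieba.lcut(chapter_text) would produce them (pre-segmented here for illustration)
tokens = ["宝玉", "笑道", "笑道", "林黛玉", "笑道", "一", "个", "宝玉"]
print(top_words(tokens, word_len=2, n=3))  # [('笑道', 3), ('宝玉', 2)]
```

`most_common` returns (word, count) pairs, which is exactly the data format that pyecharts' `WordCloud.add` expects below.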

4. Main function

def main():
    try:
        print("\t\tThis little program only works for <https://www.cz2che.com/>, which hosts a huge number of classic novels, Chinese and foreign!\n\n")
        print("C:\\Users\\48125\\Desktop\\")
        title = input("Enter the storage path and file name for the text, e.g. C:\\Users\\48125\\Desktop\\novel — no .txt needed!\n")
        urll  = str(input("Enter the chapter URL you want to start crawling from: "))
        url   = str(urll[:urll.rindex('/') + 1])  # URL prefix shared by every chapter page
        num   = int(urll[urll.rindex('/') + 1:len(urll) - 5])  # starting page number, with ".html" stripped
        print(url, num)
        page  = int(input("Enter the number of chapters to crawl:\n"))
        get_data(title, page, url, num)
        Open(title)  # Open() and sort() belong to my word-segmentation module (available on request)
        print("\nSegmentation finished!")
        print('''\n\n\t\tOne-click word-cloud generator
        \t0 -- quit
        \t1 -- word cloud of 1-character words
        \t2 -- word cloud of 2-character words
        \t3 -- word cloud of 3-character words
        \t4 -- word cloud of 4-character words
        \t5 -- word cloud of words longer than 1 character (common in research)
        \t6 -- word cloud of all words (every word length included)
        ''')
        choice = int(input("Pick one of the options above: "))
        num = int(input("Enter how many words to display (preferably no more than 100): "))
        data = sort()[:num]
        Str = input("Enter a title for the word cloud: ")
        print("The word cloud has been generated, please have a look!")
        print("Thank you for using this program — welcome back any time!!")
        c = (
            WordCloud()
                .add(
                "",
                data,  # the (word, frequency) data set
                word_size_range=[20, 100],  # font-size range for the words
                shape=SymbolType.DIAMOND)  # outline of the cloud; this pyecharts version seems to accept only:
                # circle, cardioid, diamond, triangle-forward, triangle, star, pentagon
                .set_global_opts(title_opts=opts.TitleOpts(title="{}".format(Str)),
                                 toolbox_opts=opts.ToolboxOpts())  # toolbox options
                .render("{} word cloud, {}-character words.html".format(title, choice))
        )
        return c
    except Exception:
        print("Nothing found — please check your input!")
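A quick, self-contained check of the URL-splitting logic above: the chapter URL is cut at its last "/" into a shared prefix and a numeric page name, and the crawler then steps the number forward one chapter at a time. The URL below is a made-up example of the pattern the site uses:

```python
urll = "https://www.cz2che.com/0/175/5390.html"  # hypothetical chapter URL

url = urll[:urll.rindex('/') + 1]                     # prefix shared by every chapter page
num = int(urll[urll.rindex('/') + 1:len(urll) - 5])   # strip ".html" (5 chars) to get the page number
print(url)  # https://www.cz2che.com/0/175/
print(num)  # 5390

# the crawler builds the next chapter's URL like this:
next_chapter = url + "{}.html".format(num + 1)
print(next_chapter)  # https://www.cz2che.com/0/175/5391.html
```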

Project demonstration

1. Enter the URL, the storage path, and the number of chapters to crawl

2. The smart crawler starts up and runs


3. The smart algorithm kicks in

4. The final result

An HTML file appears on the desktop automatically; open it in a browser and the word cloud is displayed, ready to be downloaded — that is a feature of the pyecharts library.

Looks pretty good — I think the effect is fine, mainly because this one-click workflow makes things so easy for me. In the future it could help researchers with their studies, or analyze product reviews on e-commerce sites for the boss. One-click operation saves us wasted time, and of course the boss will like that too.

Private-message me for the source code!!!! The design was real work!!!

Project Development

I have also designed another one-click word-cloud analyzer, this one for the National Social Science Fund database.

It is a favorite of research students — message me directly if you need it. Getting a firm grip on your research direction is the most correct choice.

This program also involves a web-page decoding and transcoding step.

The input categories can be customized, and every input box can define its own filter conditions!!!!

If you do scientific research, it would be a pity to pass this up, hahahaha!!!!!!

A word for every post

The unpredictable future is full of expectation


Origin blog.csdn.net/weixin_47723732/article/details/111937273