Word frequency statistics and analysis in Python with one line of code: word frequency analysis made simple

1 Introduction

(Solemn declaration: the copyright of this blog post belongs to Sweeping Monk-smile, and reprinting it is prohibited!)

Sweeping Monk-smile aims to write step-by-step tutorial posts that go from raising the question to a complete solution, so that reading this one article is enough. This post offers the following:

  • Complete knowledge about the problem

  • Logical problem solving

  • All demo code is available: no garbled characters, clear comments, reproducible. All of it is self-written and was uploaded only after testing.

Third-party modules required
# Install into the global environment
pip install pandas jieba openpyxl -i https://mirror.baidu.com/pypi/simple/

# Install into a virtual environment (created by PyCharm)
cd project-root-directory
.\venv\Scripts\activate
pip install pandas jieba openpyxl -i https://mirror.baidu.com/pypi/simple/

2 A first look at the result

  • The result of the word frequency analysis is shown in the figure:

[Figure: output of the word frequency analysis]

  • The result above takes just one line of code:
if __name__ == '__main__':
    MsgLoad("./wechat.csv").words_column_values("content").to_excel()
  • You read that right: it really takes only this one line of code. Run it and the job is done.
  • Let's walk through this line. First, MsgLoad("./wechat.csv") creates an instance and reads the contents of wechat.csv. Next, the words_column_values method of MsgLoad reads the values of the "content" field in wechat.csv and builds an instance of the Words class. Finally, the to_excel method of Words generates the Excel sheet automatically, which completes the word frequency statistics.
  • Input is not limited to CSV; Excel files work as well. Output is not limited to Excel either; a list, set, or DataFrame is also available.
  • The process is simple, but what exactly are the MsgLoad and Words classes, and what do they do?
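
To make the steps behind that one line concrete, here is a minimal standard-library sketch of the same flow: read one column, split it into words, then derive the list, set, and sorted table forms. It is not the author's implementation. The sample data, the function names, and the regex tokenizer (a stand-in for jieba) are all assumptions made for illustration.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical sample data standing in for wechat.csv.
SAMPLE_CSV = "sender,content\nalice,good morning everyone\nbob,good night everyone\n"


def column_values(csv_text, column):
    """Return every value of one column as a list of strings."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader]


def tokenize(text):
    """Stand-in tokenizer (not jieba): runs of word characters, length >= 2."""
    return [w for w in re.findall(r"\w+", text) if len(w) >= 2]


def word_frequencies(messages):
    """Return [word, times] pairs sorted by descending count."""
    counter = Counter()
    for message in messages:
        counter.update(tokenize(message))
    return [[word, times] for word, times in counter.most_common()]


messages = column_values(SAMPLE_CSV, "content")
word_list = [w for m in messages for w in tokenize(m)]  # duplicates kept
word_set = set(word_list)                                # duplicates removed
word_result = word_frequencies(messages)                 # sorted table
print(word_result)
```

The three variables at the end correspond to the list, set, and table outputs mentioned above; a DataFrame or Excel sheet would be built from the table form.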

3 Source code

3.1 The true face of Mount Lushan (source code)

# -*- coding:utf-8 -*-
# Author : 扫地僧-smile
# Date : 2022/7/26 15:01

import pandas as pd
from concurrent.futures import ThreadPoolExecutor
import jieba
import os
import random


class Words:

    def __init__(self, data):
        """
        :param data: an iterable whose items are all strings
        """
        self.__jieba = jieba  # keep a reference to the jieba module; its dictionary loads lazily on first use
        self.__data = data
        self.__word_list = list()
        self.__word_set = set()
        self.__result_list = list()
        self.__core()
        self.__result()

    def __str__(self):
        return str(self.__data)

    def __split(self, data):
        _temp_list = self.__jieba.cut(data)
        for word in _temp_list:
            if len(word) >= 2:
                self.__word_list.append(word)

    def __core(self):
        self.__pool = ThreadPoolExecutor(100)
        for i in self.__data:
            self.__pool.submit(self.__split, i)
        self.__pool.shutdown(True)
        del self.__pool

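    def __count(self, data):
        # count the occurrences of one word; the word list itself is left intact
        # so that word_list() and word_set() still return every token afterwards
        times = self.__word_list.count(data)
        self.__result_list.append([data, times])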

    def __result(self):
        self.__pool = ThreadPoolExecutor(100)
        for word in self.word_set():
            self.__pool.submit(self.__count, word)
        self.__pool.shutdown(True)
        del self.__pool

    def word_list(self) -> list:
        """
        :return: list of all words, duplicates included
        """
        return self.__word_list

    def word_set(self) -> set:
        """
        :return: set of all words, duplicates removed
        """
        self.__word_set = set(self.__word_list)
        return self.__word_set

    def word_result(self) -> list:
        """
        :return: list of every word with its number of occurrences, e.g. [['姑娘', 1], ['亲爱', 2], ['自己', 38], ['smile', 1], ['我爱你', 1]]
        """
        return self.__result_list

    def to_excel(self):
        """
        Write the sorted result to ./Words/<10 random hex characters>.xlsx
        """
        os.makedirs("./Words", exist_ok=True)
        _name = "".join(random.sample('0123456789abcdef', 10))
        _name = "./Words/{}.xlsx".format(_name)
        return self.to_dataframe().to_excel(_name)

    def to_dataframe(self):
        """
        :return: the result as a DataFrame
        """
        result = pd.DataFrame(self.__result_list, columns=["words", "times"])
        result = result.sort_values(by="times", ascending=False, ignore_index=True)
        return result


class MsgLoad:

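    def __init__(self, filepath, sheet=0, header=0, skiprows=0):
        """
        :param filepath: path to the file
        :param sheet: sheet name, or a position such as 0, 1, 2, 3 ... (used for .xlsx/.xls)
        :param header: row that holds the field names, counted from 0 (used for .xlsx/.xls)
        :param skiprows: row from which to start reading data, counted from 0 (used for .xlsx/.xls)
        """
        _ex_name = os.path.splitext(filepath)[1]
        # `_ex_name == (".xlsx" or ".xls")` would only ever match ".xlsx"
        if _ex_name in (".xlsx", ".xls"):
            # note: reading .xls additionally requires the xlrd package
            self.__pd = pd.read_excel(filepath, sheet_name=sheet, header=header, skiprows=skiprows)
        elif _ex_name == ".csv":
            self.__pd = pd.read_csv(filepath)
        else:
            # fail early instead of leaving self.__pd undefined
            raise ValueError("unsupported file type: {}".format(_ex_name))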

    def __str__(self):
        return str(self.__pd)

    def get_column_values(self, arg) -> list:
        """
        :param arg: column name
        :return: a list holding the values of the given column
        """
        msg_content = [content for content in self.__pd.loc[:, arg].values]
        return msg_content

    def get_row_values(self, arg) -> list:
        """
        :param arg: row index
        :return: a list holding the values of the given row
        """
        return list(self.__pd.loc[arg, :].values)

    def words_column_values(self, arg) -> Words:
        """
        :param arg: column name
        :return: a Words object built from the list of values in the given column
        """
        msg_content = [content for content in self.__pd.loc[:, arg].values]
        return Words(msg_content)

    def words_row_values(self, arg) -> Words:
        """
        :param arg: row index
        :return: a Words object built from the list of values in the given row
        """
        msg_content = list(self.__pd.loc[arg, :].values)
        return Words(msg_content)


if __name__ == '__main__':
    MsgLoad("./wechat.csv").words_column_values("content").to_excel()

  • Now we have seen the true face of Mount Lushan: word frequency statistics rely on two classes, MsgLoad and Words. How are they used? The next sections explain them in detail.
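
In the source, the worker threads append tokens to a shared list, and list.count is then called once per distinct word. As an alternative sketch (a swapped-in technique, not the author's method), each task can return its own tokens and the caller can tally them with collections.Counter. The `tokenize` function below is a hypothetical whitespace-based stand-in for jieba.cut, and `count_words` and `max_workers=4` are illustrative choices.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tokenize(text):
    """Hypothetical stand-in for jieba.cut: whitespace split, length >= 2."""
    return [w for w in text.split() if len(w) >= 2]


def count_words(messages, max_workers=4):
    """Tokenize messages in a thread pool and count the words in the caller."""
    counter = Counter()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in input order; each task returns its own list,
        # so the worker threads never modify shared state.
        for tokens in pool.map(tokenize, messages):
            counter.update(tokens)
    return counter


counts = count_words(["to be or not to be", "be kind"])
print(counts.most_common(2))
```

Because the counting happens in a single pass over the tokens, the word list is never modified and no per-word scan of the full list is needed.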

3.2 Introduction to the MsgLoad class (not the source code)

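class MsgLoad:
    """
    Reads a CSV or Excel file and selects certain fields from it, to make later data processing easier.
    """

    def __init__(self, filepath, sheet=0, header=0, skiprows=0):
        """
        Loads the file on initialization and builds a DataFrame.
        """

    def __str__(self):
        return str(self.__pd)

    def get_column_values(self, arg) -> list:
        """
        Pass a field name; returns all values of that field as a list.
        """

    def get_row_values(self, arg) -> list:
        """
        Pass a DataFrame index; returns all values of that row as a list.
        """

    def words_column_values(self, arg) -> Words:
        """
        Pass a field name; returns all values of that field as a Words object.
        """

    def words_row_values(self, arg) -> Words:
        """
        Pass a DataFrame index; returns all values of that row as a Words object.
        """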
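
MsgLoad picks its reader from the file extension. A pitfall worth knowing when writing this kind of dispatch is that `ext == (".xlsx" or ".xls")` compares against ".xlsx" only, because `or` returns its first truthy operand. The sketch below contrasts that with a membership test. The `file_kind` helper and the file name "report.xls" are hypothetical and exist only for illustration.

```python
import os


def file_kind(filepath):
    """Classify a path as 'excel' or 'csv' by its extension."""
    ext = os.path.splitext(filepath)[1].lower()
    if ext in (".xlsx", ".xls"):   # membership test covers both extensions
        return "excel"
    if ext == ".csv":
        return "csv"
    raise ValueError("unsupported file type: {!r}".format(ext))


ext = os.path.splitext("report.xls")[1]

# `".xlsx" or ".xls"` evaluates to ".xlsx", so ".xls" never matches here.
broken_match = ext == (".xlsx" or ".xls")

# The membership test matches as intended.
fixed_match = ext in (".xlsx", ".xls")

print(broken_match, fixed_match, file_kind("report.xls"))
```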

3.3 Introduction to the Words class (not the source code)

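class Words:
    """
    Splits the input into words, counts word frequencies, sorts the result, and converts it to several output types.
    """

    def __init__(self, data):
        """
        data is an iterable; every item it yields should be a string.
        """

    def __str__(self):
        return str(self.__data)

    def __split(self, data):
        # internal step, can be ignored

    def __core(self):
        # internal step, can be ignored

    def __count(self, data):
        # internal step, can be ignored

    def __result(self):
        # internal step, can be ignored

    def word_list(self) -> list:
        """
        Output: list of all segmented words, duplicates not removed.
        """

    def word_set(self) -> set:
        """
        Output: set of all segmented words; no duplicates, unordered.
        """

    def word_result(self) -> list:
        """
        Output: two-dimensional table of all segmented words; no duplicates, unordered.
        Each entry holds a word and its number of occurrences.
        For example:
        [
        ["爱你", 16],
        ["吃饭", 28]
        ]
        """

    def to_excel(self):
        """
        Output: writes an Excel sheet with the words sorted by count in descending order.
        The file name is generated automatically and the file is saved under ./Words/.
        """


    def to_dataframe(self):
        """
        Output: returns a DataFrame with the same contents as to_excel.
        """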
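
The automatically generated file name can be produced in several ways. The following is one hedged sketch using the standard library, not the author's code: `new_output_path` is a hypothetical helper that creates the target directory if needed and builds a name from 10 hexadecimal characters of a random UUID.

```python
import os
import tempfile
import uuid


def new_output_path(directory="./Words", suffix=".xlsx"):
    """Create `directory` if needed and return a fresh file path inside it."""
    os.makedirs(directory, exist_ok=True)  # no error if it already exists
    name = uuid.uuid4().hex[:10]           # 10 random hexadecimal characters
    return os.path.join(directory, name + suffix)


# Demonstrated in a temporary directory so nothing is written to the project.
demo_dir = os.path.join(tempfile.mkdtemp(), "Words")
first = new_output_path(demo_dir)
second = new_output_path(demo_dir)
print(first != second)
```

A path built this way could be passed to DataFrame.to_excel in place of a hand-assembled name.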

  • That covers every method. Everyone is welcome to use the code. If you have questions, post them in the comment area; I will keep answering and updating this post.

Origin blog.csdn.net/z132533/article/details/125998750