Text classification (Naive Bayes algorithm)

1. Introduction to Bayes' theorem

1. Naive Bayes:

  The word "naive" in Naive Bayes refers to the assumption that the features are independent of one another. This assumption keeps the naive Bayes algorithm simple, but it sometimes sacrifices some classification accuracy.
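  As a minimal illustration of what the independence assumption buys us: for two features f1, f2 and a category c, the joint likelihood factorizes as

p(f1, f2 | c) = p(f1 | c) · p(f2 | c)

so each feature's probability can be estimated separately instead of estimating the joint distribution directly.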

2. Bayesian formula:

P(B | A) = P(A | B) · P(B) / P(A)

3. Rewriting it for the classification task

p(category | feature) = p(feature | category) · p(category) / p(feature)

  Once we can compute p(category | feature), our task is essentially complete: the sample is assigned to the category with the highest posterior probability.
  The denominator is computed with the law of total probability (in fact, it is just the sum of the numerators over all categories):

p(feature) = Σ_i p(feature | category_i) · p(category_i)
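  To make the formula concrete, here is a minimal sketch with made-up numbers (two categories, one feature); the category names, priors, and likelihoods are invented for illustration, not taken from the data set below:

priors = {"sports": 0.6, "finance": 0.4}       # p(category), hypothetical
likelihoods = {"sports": 0.2, "finance": 0.7}  # p(feature | category), hypothetical

# law of total probability: p(feature) = sum of the numerators
evidence = sum(priors[c] * likelihoods[c] for c in priors)

for c in priors:
    posterior = priors[c] * likelihoods[c] / evidence
    print(c, posterior)
# sports:  0.12 / 0.40 = 0.30
# finance: 0.28 / 0.40 = 0.70  -> classify as "finance"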

2. Introduction to the data set

1. The original data set

[Figure: the original data set]

2. Test data set

  To reduce the workload, the test data set was obtained by shuffling and sampling the original data set.
[Figure: the test data set]

3. Text classification steps

[Figure: text classification pipeline]

  As implemented in the code below: segment the raw texts with jieba, persist them as Bunch objects, build a TF-IDF vector space for the training set, map the test set into the same vector space using the training vocabulary, train a multinomial naive Bayes classifier, and evaluate it on the test set.

4. TF-IDF (term frequency–inverse document frequency)

Concept

  TF-IDF (term frequency–inverse document frequency) is an important statistical method in information processing and data mining. Its most common use is finding the keywords of an article.
  It evaluates how important a word is to a document in a corpus. The importance of a word increases with the number of times it appears in the document, but decreases with how often it appears in the other documents of the corpus. In other words, if a word appears frequently in one document and rarely in the others, it is a strong signal for classifying that document; if it also appears in many other documents, it has little discriminative power, and IDF is used to lower its weight.
The formulas are as follows.

TF_ij = n_ij / Σ_k n_kj

  TF (term frequency) is the frequency of a word within the article; the higher the frequency, the more likely the word is a keyword. As the formula above shows, it is the number of times the word appears in the article divided by the total number of words in the article, where i indexes the word, j indexes the article, and k ranges over all words in the article. For example, a word that appears 5 times in a 100-word article has TF = 5/100 = 0.05.

IDF_i = log( |D| / (df_i + 1) )

  IDF (inverse document frequency) measures how rarely the word appears in the other articles. In the formula above, the numerator |D| is the total number of articles, and df_i is the number of articles containing word i. Because that count could be 0 (which would make the denominator 0), 1 is commonly added to the denominator. A word that appears in most articles, such as "of" (的), therefore gets a small IDF value.

TF-IDF_ij = TF_ij × IDF_i

  The product of TF and IDF measures how important the word is to this article.

Mathematical idea:
TF-IDF is proportional to the number of times a word appears in a document, and inversely related to the number of documents in the whole corpus that contain the word.
TF-IDF = TF (term frequency) × IDF (inverse document frequency)
Term frequency: TF = the number of times the word appears in the document / the total number of words in the document
Inverse document frequency: IDF = log(the total number of documents in the corpus / (the number of documents containing the word + 1))
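
  As a minimal sketch of these two formulas (all numbers are invented for illustration; note that scikit-learn's TfidfVectorizer, used in the code below, applies its own smoothing and normalization, so its values will differ):

import math

total_docs = 1000        # hypothetical size of the corpus
docs_with_word = 9       # hypothetical number of documents containing the word
word_count = 5           # occurrences of the word in this article
article_length = 100     # total number of words in this article

tf = word_count / article_length                   # 5 / 100 = 0.05
idf = math.log(total_docs / (docs_with_word + 1))  # ln(1000 / 10) ≈ 4.61
tf_idf = tf * idf                                  # ≈ 0.23
print(tf, idf, tf_idf)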

5. Code implementation

# -*- coding: utf-8 -*-
# @File  : TextClassification.py
# @Author: Junhui Yu
# @Date  : 2020/8/28

import jieba
import pickle  # persistence
import os
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.utils import Bunch  # sklearn.datasets.base has been removed in newer versions


def readFile(path):
    with open(path, 'r', errors='ignore') as file:  # the with-block closes the file automatically
        return file.read()


def saveFile(path, result):
    with open(path, 'w', errors='ignore') as file:
        file.write(result)


def segText(inputPath, resultPath):
    fatherLists = os.listdir(inputPath)  # top-level directory
    for eachDir in fatherLists:  # iterate over the category folders in the top-level directory
        eachPath = inputPath + eachDir + "/"  # path of each category folder, for walking its files
        each_resultPath = resultPath + eachDir + "/"  # directory where the segmented files are written
        if not os.path.exists(each_resultPath):
            os.makedirs(each_resultPath)
        childLists = os.listdir(eachPath)  # files inside the category folder
        for eachFile in childLists:  # iterate over the files in the folder
            eachPathFile = eachPath + eachFile  # full path of the file
            content = readFile(eachPathFile)  # read the raw text
            result = str(content).replace("\r\n", "").strip()  # drop blank lines and extra spaces

            cutResult = jieba.cut(result)  # default jieba segmentation; tokens joined by spaces below
            saveFile(each_resultPath + eachFile, " ".join(cutResult))  # write the segmented text


def bunchSave(inputFile, outputFile):
    catelist = os.listdir(inputFile)
    bunch = Bunch(target_name=[], label=[], filenames=[], contents=[])
    bunch.target_name.extend(catelist)  # category names
    for eachDir in catelist:
        eachPath = inputFile + eachDir + "/"
        fileList = os.listdir(eachPath)
        for eachFile in fileList:  # each file in the category folder
            fullName = eachPath + eachFile  # full path of the file
            bunch.label.append(eachDir)  # label of the current category
            bunch.filenames.append(fullName)  # path of the current file
            bunch.contents.append(readFile(fullName).strip())  # segmented text of the file
    with open(outputFile, 'wb') as file_obj:  # pickling requires binary mode
        pickle.dump(bunch, file_obj)


def readBunch(path):
    with open(path, 'rb') as file:
        bunch = pickle.load(file)  # deserialize the Bunch object from the file
    return bunch


def writeBunch(path, bunchFile):
    with open(path, 'wb') as file:
        pickle.dump(bunchFile, file)


def getStopWord(inputFile):
    stopWordList = readFile(inputFile).splitlines()
    return stopWordList


def getTFIDFMat(inputPath, stopWordList, outputPath,
                tfidfspace_path, tfidfspace_arr_path, tfidfspace_vocabulary_path):  # build the TF-IDF vector space
    bunch = readBunch(inputPath)
    tfidfspace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames,
                       tdm=[], vocabulary={})
    # save a readable dump of tfidfspace
    saveFile(tfidfspace_path, str(tfidfspace))
    # initialize the vector space model
    vectorizer = TfidfVectorizer(stop_words=stopWordList, sublinear_tf=True, max_df=0.5)
    # turn the texts into a TF-IDF weighted term-document matrix; save the vocabulary separately
    tfidfspace.tdm = vectorizer.fit_transform(bunch.contents)
    saveFile(tfidfspace_arr_path, str(tfidfspace.tdm))
    tfidfspace.vocabulary = vectorizer.vocabulary_  # the fitted vocabulary
    saveFile(tfidfspace_vocabulary_path, str(vectorizer.vocabulary_))
    writeBunch(outputPath, tfidfspace)


def getTestSpace(testSetPath, trainSpacePath, stopWordList, testSpacePath,
                 testSpace_path, testSpace_arr_path, trainbunch_vocabulary_path):
    bunch = readBunch(testSetPath)
    # build the TF-IDF vector space of the test set
    testSpace = Bunch(target_name=bunch.target_name, label=bunch.label, filenames=bunch.filenames,
                      tdm=[], vocabulary={})
    # save a readable dump of testSpace
    saveFile(testSpace_path, str(testSpace))
    # load the training set's vocabulary
    trainbunch = readBunch(trainSpacePath)
    # initialize the vector space model with the training vocabulary so that
    # test vectors live in the same feature space as the training vectors
    vectorizer = TfidfVectorizer(stop_words=stopWordList, sublinear_tf=True, max_df=0.5,
                                 vocabulary=trainbunch.vocabulary)
    testSpace.tdm = vectorizer.fit_transform(bunch.contents)
    testSpace.vocabulary = trainbunch.vocabulary
    saveFile(testSpace_arr_path, str(testSpace.tdm))
    saveFile(trainbunch_vocabulary_path, str(trainbunch.vocabulary))
    # persist the test space
    writeBunch(testSpacePath, testSpace)


def bayesAlgorithm(trainPath, testPath, tfidfspace_out_arr_path,
                   tfidfspace_out_word_path, testspace_out_arr_path,
                   testspace_out_word_path):
    trainSet = readBunch(trainPath)
    testSet = readBunch(testPath)
    # train a multinomial naive Bayes classifier on the training TF-IDF matrix
    clf = MultinomialNB(alpha=0.001).fit(trainSet.tdm, trainSet.label)

    # save readable dumps of the training and test spaces
    saveFile(tfidfspace_out_arr_path, str(trainSet.tdm))
    saveFile(tfidfspace_out_word_path, str(trainSet))

    saveFile(testspace_out_arr_path, str(testSet))
    saveFile(testspace_out_word_path, str(testSet.label))

    # predict the test set and report the error rate
    predicted = clf.predict(testSet.tdm)
    total = len(predicted)
    errors = 0
    for flabel, fileName, expct_cate in zip(testSet.label, testSet.filenames, predicted):
        if flabel != expct_cate:
            errors += 1
            print(fileName, ": actual category:", flabel, "--> predicted category:", expct_cate)
    print("Error rate on the whole test set:", float(errors) * 100 / float(total), "%")

if __name__ == '__main__':
    # input paths
    datapath = "./data/"  # raw data directory
    stopWord_path = "./stop/stopword.txt"  # stop-word list
    test_path = "./test/"  # test set directory

    test_split_dat_path = "./test_set.dat"  # pickled test set after segmentation
    testspace_dat_path = "./testspace.dat"  # pickled test-set vector space
    train_dat_path = "./train_set.dat"  # pickled training set after segmentation
    tfidfspace_dat_path = "./tfidfspace.dat"  # pickled TF-IDF vector space of the training set
    # the four .dat paths above persist intermediate results

    test_split_path = './split/test_split/'  # segmented test files
    split_datapath = "./split/split_data/"  # segmented training files

    # readable text dumps of the intermediate objects
    tfidfspace_path = "./tfidfspace.txt"
    tfidfspace_arr_path = "./tfidfspace_arr.txt"
    tfidfspace_vocabulary_path = "./tfidfspace_vocabulary.txt"
    testSpace_path = "./testSpace.txt"
    testSpace_arr_path = "./testSpace_arr.txt"
    trainbunch_vocabulary_path = "./trainbunch_vocabulary.txt"
    tfidfspace_out_arr_path = "./tfidfspace_out_arr.txt"
    tfidfspace_out_word_path = "./tfidfspace_out_word.txt"
    testspace_out_arr_path = "./testspace_out_arr.txt"
    testspace_out_word_path = "./testspace_out_word.txt"

    # process the training set
    segText(datapath, split_datapath)
    bunchSave(split_datapath, train_dat_path)
    stopWordList = getStopWord(stopWord_path)
    getTFIDFMat(train_dat_path,
                stopWordList,
                tfidfspace_dat_path,
                tfidfspace_path,
                tfidfspace_arr_path,
                tfidfspace_vocabulary_path)

    # process the test set
    segText(test_path, test_split_path)
    bunchSave(test_split_path, test_split_dat_path)
    getTestSpace(test_split_dat_path,
                 tfidfspace_dat_path,
                 stopWordList,
                 testspace_dat_path,
                 testSpace_path,
                 testSpace_arr_path,
                 trainbunch_vocabulary_path)
    bayesAlgorithm(tfidfspace_dat_path,
                   testspace_dat_path,
                   tfidfspace_out_arr_path,
                   tfidfspace_out_word_path,
                   testspace_out_arr_path,
                   testspace_out_word_path)
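
  For reference, segText and bunchSave assume a two-level directory layout in which each category is a folder of plain-text files; a hypothetical example (the category names are placeholders, not the actual data set):

    data/
        sports/
            001.txt
            002.txt
        finance/
            003.txt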

6. Prediction results

[Figure: prediction results — the misclassified files and the overall error rate printed by the program]


Origin blog.csdn.net/yjh_SE007/article/details/108283145