[Machine Learning] News Classification Based on Naive Bayes

Experiment introduction

1. Experimental content

This lab shows how to use Bayesian algorithms to solve a practical problem: news classification.

2. Experimental objectives

Through this experiment, you will further master the principles of the Bayesian algorithm and learn how to apply it to solve a practical, real-world problem.

3. Experimental knowledge points

  • Naive Bayes algorithm
  • Data preprocessing

4. Experimental environment

  • python 3.6.5
  • jieba

5. Preliminary knowledge

  • Probability Theory and Mathematical Statistics
  • Basic operation of Linux commands
  • Basics of Python programming

Preparation

Click the "Download experiment data" module at the top right of the screen and download bayes_news.tgz to the specified directory. Then click File -> Open -> Upload at the top, upload the data set archive you just downloaded, and decompress it with the following command:

!tar -zxvf bayes_news.tgz
bayes_news/
bayes_news/Sample/
bayes_news/Sample/C000008/
bayes_news/Sample/C000008/10.txt
bayes_news/Sample/C000008/11.txt
bayes_news/Sample/C000008/12.txt
bayes_news/Sample/C000008/13.txt
bayes_news/Sample/C000008/14.txt
bayes_news/Sample/C000008/15.txt
bayes_news/Sample/C000008/16.txt
bayes_news/Sample/C000008/17.txt
bayes_news/Sample/C000008/18.txt
bayes_news/Sample/C000008/19.txt


Framework

This experiment uses Python3 programming to implement a simple news classification algorithm.

1. Naive Bayes Theory

Naive Bayes builds on Bayesian decision theory, so it is worth a quick review of Bayes' rule before discussing Naive Bayes itself.
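For reference, the math behind this is compact (standard textbook form, stated here for convenience rather than taken from the lab): for a feature vector x = (x_1, ..., x_n) and a class c, Bayes' rule plus the "naive" conditional-independence assumption give

$$P(c \mid x_1, \dots, x_n) = \frac{P(c)\,P(x_1, \dots, x_n \mid c)}{P(x_1, \dots, x_n)} \;\propto\; P(c)\prod_{i=1}^{n} P(x_i \mid c)$$

and the classifier simply predicts the class c that maximizes the right-hand side; the denominator can be dropped because it is the same for every class.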

2. News classification

Next, we will use the Naive Bayes algorithm to classify news articles.

Chinese sentence segmentation

Consider a question: English text can be split on non-letter, non-digit characters, but how do we segment a Chinese sentence, which has no spaces between words? For example, how would we segment the sentence I just typed? Do we have to write the rules ourselves? Fortunately, we do not need to do this work ourselves; we can directly use a third-party word segmentation component, jieba (literally "stutter"). A minimal example of jieba in action is shown below.
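The snippet below is a tiny, self-contained sketch of jieba's two cutting modes on a made-up sentence (the sentence is only an illustration and is not part of the data set):

# -*- coding: UTF-8 -*-
import jieba

sentence = "我爱自然语言处理"                              # illustrative sentence: "I love natural language processing"

# precise mode (cut_all=False) - the mode used later in this lab
print("/".join(jieba.cut(sentence, cut_all=False)))        # e.g. 我/爱/自然语言/处理

# full mode (cut_all=True) lists every word it can find, with overlaps
print("/".join(jieba.cut(sentence, cut_all=True)))         # e.g. 我/爱/自然/自然语言/语言/处理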

The news classification data set can be found in the experiment directory. It has already been classified and saved into one folder per category:

(Figure: directory listing of bayes_news/Sample, with one sub-folder per news category, e.g. C000008.)


With the data set in place, let's get straight to the point: segment the Chinese text by writing the following code:

# -*- coding: UTF-8 -*-
import os
import jieba

def TextProcessing(folder_path):
    folder_list = os.listdir(folder_path)                        # list the sub-folders under folder_path
    data_list = []                                               # training data
    class_list = []

    # iterate over every category sub-folder
    for folder in folder_list:
        new_folder_path = os.path.join(folder_path, folder)      # build the path of the sub-folder
        files = os.listdir(new_folder_path)                      # list of txt files inside the sub-folder

        j = 1
        # iterate over every txt file
        for file in files:
            if j > 100:                                          # keep at most 100 samples per category
                break
            with open(os.path.join(new_folder_path, file), 'r', encoding = 'utf-8') as f:    # open the txt file
                raw = f.read()

            word_cut = jieba.cut(raw, cut_all = False)           # precise mode, returns an iterable generator
            word_list = list(word_cut)                           # convert the generator to a list

            data_list.append(word_list)
            class_list.append(folder)
            j += 1

    print(data_list)
    print(class_list)

if __name__ == '__main__':
    # text preprocessing
    folder_path = 'bayes_news/Sample'                # location of the training set
    TextProcessing(folder_path)

Result:

(Output: each news article is printed as a list of segmented words, followed by the list of category labels.)

It can be seen that every text has been successfully segmented and tagged with its category.

Word frequency statistics

We split all texts into a training set and a test set, count the frequency of every word in the training set, and sort the words in descending order of frequency, i.e. the most frequent words come first and the least frequent last. The code is as follows:

# -*- coding: UTF-8 -*-
import os
import random
import jieba

"""
Function: Chinese text preprocessing

Parameters:
    folder_path - path where the texts are stored
    test_size - proportion of the test set, 20% of the whole data set by default
Returns:
    all_words_list - training-set vocabulary sorted by word frequency in descending order
    train_data_list - training data list
    test_data_list - test data list
    train_class_list - training label list
    test_class_list - test label list
"""
def TextProcessing(folder_path, test_size = 0.2):
    folder_list = os.listdir(folder_path)                        # list the sub-folders under folder_path
    data_list = []                                               # data of the data set
    class_list = []                                              # labels of the data set

    # iterate over every category sub-folder
    for folder in folder_list:
        new_folder_path = os.path.join(folder_path, folder)      # build the path of the sub-folder
        files = os.listdir(new_folder_path)                      # list of txt files inside the sub-folder

        j = 1
        # iterate over every txt file
        for file in files:
            if j > 100:                                          # keep at most 100 samples per category
                break
            with open(os.path.join(new_folder_path, file), 'r', encoding = 'utf-8') as f:    # open the txt file
                raw = f.read()

            word_cut = jieba.cut(raw, cut_all = False)           # precise mode, returns an iterable generator
            word_list = list(word_cut)                           # convert the generator to a list

            data_list.append(word_list)                          # append the sample
            class_list.append(folder)                            # append the label
            j += 1

    data_class_list = list(zip(data_list, class_list))           # zip the samples together with their labels
    random.shuffle(data_class_list)                              # shuffle data_class_list
    index = int(len(data_class_list) * test_size) + 1            # index at which to split training and test sets
    train_list = data_class_list[index:]                         # training set
    test_list = data_class_list[:index]                          # test set
    train_data_list, train_class_list = zip(*train_list)         # unzip the training set
    test_data_list, test_class_list = zip(*test_list)            # unzip the test set

    all_words_dict = {}                                          # word-frequency statistics over the training set
    for word_list in train_data_list:
        for word in word_list:
            if word in all_words_dict.keys():
                all_words_dict[word] += 1
            else:
                all_words_dict[word] = 1

    # sort by value (word frequency) in descending order
    all_words_tuple_list = sorted(all_words_dict.items(), key = lambda f:f[1], reverse = True)
    all_words_list, all_words_nums = zip(*all_words_tuple_list)  # unzip
    all_words_list = list(all_words_list)                        # convert to a list
    return all_words_list, train_data_list, test_data_list, train_class_list, test_class_list

if __name__ == '__main__':
    # text preprocessing
    folder_path = 'bayes_news/Sample'                # location of the data set
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(folder_path, test_size=0.2)
    print(all_words_list)
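As a side note, the frequency counting and sorting done inside TextProcessing could also be written with collections.Counter; the short sketch below is not used in the rest of the lab and only shows the equivalence:

from collections import Counter

def count_words(train_data_list):
    # flatten all segmented documents into one Counter and return the
    # vocabulary sorted by frequency, most frequent word first
    counter = Counter(word for word_list in train_data_list for word in word_list)
    return [word for word, freq in counter.most_common()]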


all_words_list is the vocabulary of the training set, sorted in descending order of word frequency. Looking at the printed result, it is easy to see that it contains a lot of punctuation marks. Obviously, punctuation cannot serve as a feature for news classification, so to reduce the impact of these high-frequency symbols on the classification result, they need to be removed. Besides punctuation, there are function words such as "in" (在) and the particle "le" (了) that do not help with news classification, as well as numbers, which obviously cannot be used as classification features either. To eliminate their influence on the classification result, we can formulate a simple rule (a compact code sketch follows the list):

  • First, remove the high-frequency words. How many to remove can be decided by observing the relationship between the number of removed high-frequency words and the final classification accuracy.
  • Then, remove the numbers; digits are not used as classification features.
  • Finally, remove specific words such as "of", "one", "in", "not", "of course" and "how": prepositions, pronouns and conjunctions that contribute nothing to news classification.
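The compact sketch mentioned above: the last two rules, written as a single predicate over a candidate word. It assumes a stop-word set named stopwords_set (built in the next step) and mirrors the filter inside the words_dict function implemented below, which additionally keeps only words of 2 to 4 characters; removing the deleteN most frequent words is then just a slice of all_words_list.

def is_feature_word(word, stopwords_set):
    # keep a word only if it is not a pure number, not a stop word,
    # and its length is between 2 and 4 characters
    return (not word.isdigit()) and (word not in stopwords_set) and (1 < len(word) < 5)

# rule 1 (drop the deleteN most frequent words) is simply: all_words_list[deleteN:]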

Data cleaning

We can use the prepared stop-word list stopwords_cn.txt to remove these words; stopwords_cn.txt is located in the data set directory of this experiment.

You can use the head command to view the first 40 lines of the file content:

!head -n 40 bayes_news/stopwords_cn.txt 

(Output: the first 40 lines of stopwords_cn.txt, one stop word per line.)

According to this file, these words can be removed so that they are not used as classification features. Let us first remove the 100 most frequent words; the code is as follows:

"""
Function: read the contents of a file and de-duplicate them

Parameters:
    words_file - file path
Returns:
    words_set - set of the contents read from the file
"""
def MakeWordsSet(words_file):
    words_set = set()                                            # create the set
    with open(words_file, 'r', encoding = 'utf-8') as f:         # open the file
        for line in f.readlines():                               # read line by line
            word = line.strip()                                  # strip the trailing newline
            if len(word) > 0:                                    # if there is text, add it to words_set
                words_set.add(word)
    return words_set                                             # return the result

"""
Function: text feature selection

Parameters:
    all_words_list - list of all words in the training set
    deleteN - delete the deleteN words with the highest frequency
    stopwords_set - the specified stop words
Returns:
    feature_words - feature set
"""
def words_dict(all_words_list, deleteN, stopwords_set = set()):
    feature_words = []                            # feature list
    n = 1
    for t in range(deleteN, len(all_words_list), 1):
        if n > 1000:                              # the dimension of feature_words is at most 1000
            break
        # a word becomes a feature word if it is not a number, is not a stop word,
        # and its length is greater than 1 and less than 5
        if not all_words_list[t].isdigit() and all_words_list[t] not in stopwords_set and 1 < len(all_words_list[t]) < 5:
            feature_words.append(all_words_list[t])
        n += 1
    return feature_words

if __name__ == '__main__':
    # text preprocessing
    folder_path = 'bayes_news/Sample'                # location of the data set
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(folder_path, test_size=0.2)

    # build stopwords_set
    stopwords_file = 'bayes_news/stopwords_cn.txt'
    stopwords_set = MakeWordsSet(stopwords_file)

    feature_words = words_dict(all_words_list, 100, stopwords_set)
    print(feature_words)

(Output: the selected feature_words list, with punctuation, numbers and stop words filtered out.)

It can be seen that the useless tokens have been filtered out; this feature_words list is the final feature set we use for news classification. Next, we can vectorize each text according to feature_words and then use the vectors to train a Naive Bayes classifier, as illustrated below.
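To make the vectorization step concrete, here is a toy illustration of the 0/1 bag-of-words encoding used in the exercise below (both the feature words and the document are made up for the example):

feature_words = ['比赛', '电脑', '股票', '旅游']            # a tiny, made-up feature set
text = ['今天', '股票', '大涨', '股票']                     # one segmented document

text_words = set(text)
vector = [1 if word in text_words else 0 for word in feature_words]
print(vector)                                              # [0, 0, 1, 0] - only '股票' appears in the document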

Sklearn interface description

The data has been processed; next we can use sklearn to build a Naive Bayes classifier.
Official English document address: sklearn.naive_bayes.MultinomialNB — scikit-learn 1.2.dev0 documentation
Naive Bayes is a relatively simple algorithm, and the Naive Bayes classes in scikit-learn are correspondingly easy to use. Compared with algorithms such as decision trees and KNN, Naive Bayes has fewer parameters to pay attention to, which makes it easier to master. scikit-learn provides three Naive Bayes classifier classes: GaussianNB, MultinomialNB and BernoulliNB. GaussianNB assumes the features follow a Gaussian distribution, MultinomialNB assumes a multinomial distribution, and BernoulliNB assumes a Bernoulli distribution. The model explained earlier, whose feature probabilities follow a multinomial distribution, corresponds to MultinomialNB.


News classification is a multi-class problem, so we can use MultinomialNB() here. The use of the other two classes will not be covered for now; you can study them on your own. MultinomialNB assumes that the conditional probability of each feature given the class follows a multinomial distribution, with the smoothed estimate shown below:

$$P(X_j = x_{jl} \mid Y = C_k) = \frac{m_{kjl} + \lambda}{m_k + n_j \lambda}$$

where $m_k$ is the number of training samples of class $C_k$, $m_{kjl}$ is the number of those samples whose $j$-th feature takes the value $x_{jl}$, $n_j$ is the number of possible values of the $j$-th feature, and $\lambda > 0$ is the smoothing parameter ($\lambda = 1$ gives Laplace smoothing).

The parameters are described as follows (a short usage sketch follows the list):

  • alpha: optional float, default 1.0. This is the Laplace smoothing parameter, i.e. λ in the formula above; setting it to 0 adds no smoothing.
  • fit_prior: optional Boolean, default True. It indicates whether the class prior probabilities should be learned. If False, all classes share the same prior probability. Otherwise, you can either supply the priors through the third parameter class_prior, or leave class_prior unset and let MultinomialNB estimate the priors from the training samples, in which case the prior is P(Y=Ck) = mk / m, where m is the total number of training samples and mk is the number of training samples belonging to the k-th class.
  • class_prior: optional array of class prior probabilities, default None.
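A short sketch of these parameters (the class priors below are made-up values for a hypothetical 3-class problem, only to show the expected shape of class_prior):

from sklearn.naive_bayes import MultinomialNB

clf_default = MultinomialNB()                              # alpha=1.0 (Laplace smoothing), priors learned from the data
clf_uniform = MultinomialNB(fit_prior=False)               # every class gets the same prior probability
clf_manual = MultinomialNB(alpha=0.5, fit_prior=True,
                           class_prior=[0.2, 0.3, 0.5])    # user-supplied priors, must sum to 1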

The interaction of fit_prior and class_prior is summarized below:

fit_prior | class_prior | Prior probability used
False     | (ignored)   | uniform, P(Y=Ck) = 1/k for k classes
True      | not given   | estimated from the data, P(Y=Ck) = mk/m
True      | given       | P(Y=Ck) = class_prior

In addition, MultinomialNB provides several methods for us to use.

An important one is partial_fit. This method is typically used when the training set is too large to load into memory at once: the training set can be split into several parts, and partial_fit can be called repeatedly to learn from them step by step, which is very convenient. GaussianNB and BernoulliNB offer the same method. After fitting the data with fit or partial_fit, predictions can be made.
Three prediction methods are available: predict, predict_log_proba and predict_proba. predict is the most commonly used; it directly returns the predicted class for each test sample. predict_proba instead returns, for each test sample, the predicted probability of every class; the class with the highest probability is exactly the class returned by predict. predict_log_proba is similar to predict_proba but returns the logarithm of those probabilities; again, the class with the highest log-probability is the one returned by predict. For details, please refer to the official documentation. A small sketch of these methods follows.
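The sketch below exercises these methods on a tiny made-up data set (four samples, two count features, two classes); it is only meant to show the call signatures:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

X = np.array([[2, 0], [1, 1], [0, 3], [0, 2]])             # made-up count features
y = np.array([0, 0, 1, 1])                                 # made-up labels

clf = MultinomialNB()
# incremental training: the full list of classes must be given on the first partial_fit call
clf.partial_fit(X[:2], y[:2], classes=np.array([0, 1]))
clf.partial_fit(X[2:], y[2:])

x_new = np.array([[1, 0]])
print(clf.predict(x_new))                                  # predicted class
print(clf.predict_proba(x_new))                            # probability of each class
print(clf.predict_log_proba(x_new))                        # logarithm of those probabilities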

[Exercise] Classification based on Sklearn

With this in mind, we can write code to determine the value of deleteN by observing the relationship between the number of removed high-frequency words (deleteN) and the final classification accuracy:

# -*- coding: UTF-8 -*-
from sklearn.naive_bayes import MultinomialNB
import matplotlib.pyplot as plt

"""
Function: vectorize the texts according to feature_words

Parameters:
    train_data_list - training set
    test_data_list - test set
    feature_words - feature set
Returns:
    train_feature_list - vectorized training set
    test_feature_list - vectorized test set
"""
def TextFeatures(train_data_list, test_data_list, feature_words):
    def text_features(text, feature_words):                     # set to 1 if the word appears in the feature set
        text_words = set(text)
        features = [1 if word in text_words else 0 for word in feature_words]
        return features
    train_feature_list = [text_features(text, feature_words) for text in train_data_list]
    test_feature_list = [text_features(text, feature_words) for text in test_data_list]
    return train_feature_list, test_feature_list                # return the results

"""
Function: news classifier

Parameters:
    train_feature_list - vectorized feature texts of the training set
    test_feature_list - vectorized feature texts of the test set
    train_class_list - training labels
    test_class_list - test labels
Returns:
    test_accuracy - classifier accuracy
"""
def TextClassifier(train_feature_list, test_feature_list, train_class_list, test_class_list):

    ### Start Code Here ###
    # create the Naive Bayes object and train the model with the fit function
    classifier = MultinomialNB().fit(train_feature_list, train_class_list)
    # call the score function to compute the model's accuracy on the test set as test_accuracy
    test_accuracy = classifier.score(test_feature_list, test_class_list)
    ### End Code Here ###

    return test_accuracy

if __name__ == '__main__':
    # text preprocessing
    folder_path = 'bayes_news/Sample'                # location of the data set
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(folder_path, test_size=0.2)

    # build stopwords_set
    stopwords_file = 'bayes_news/stopwords_cn.txt'
    stopwords_set = MakeWordsSet(stopwords_file)


    test_accuracy_list = []
    deleteNs = range(0, 1000, 20)                #0 20 40 60 ... 980
    for deleteN in deleteNs:
        feature_words = words_dict(all_words_list, deleteN, stopwords_set)
        train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)
        test_accuracy = TextClassifier(train_feature_list, test_feature_list, train_class_list, test_class_list)
        test_accuracy_list.append(test_accuracy)

    plt.figure()
    plt.plot(deleteNs, test_accuracy_list)
    plt.title('Relationship of deleteNs and test_accuracy')
    plt.xlabel('deleteNs')
    plt.ylabel('test_accuracy')
    plt.show()

We plot the relationship between deleteNs and test_accuracy so that we can roughly decide how many high-frequency words to remove. The curve may look different every time the program is run, because the training/test split is random; by running it several times we can settle on a value of deleteN, and with that parameter fixed we have a working Naive Bayes news classifier.
It can be seen that deleteN = 450 is a reasonable choice, where even the worst classification accuracy stays above 50%.

Modify the code under if __name__ == '__main__': as follows:

if __name__ == '__main__':
    # text preprocessing
    folder_path = 'bayes_news/Sample'                # location of the data set
    all_words_list, train_data_list, test_data_list, train_class_list, test_class_list = TextProcessing(folder_path, test_size=0.2)

    # build stopwords_set
    stopwords_file = 'bayes_news/stopwords_cn.txt'
    stopwords_set = MakeWordsSet(stopwords_file)


    test_accuracy_list = []
    feature_words = words_dict(all_words_list, 450, stopwords_set)
    train_feature_list, test_feature_list = TextFeatures(train_data_list, test_data_list, feature_words)
    test_accuracy = TextClassifier(train_feature_list, test_feature_list, train_class_list, test_class_list)
    test_accuracy_list.append(test_accuracy)
    ave = lambda c: sum(c) / len(c)

    print(ave(test_accuracy_list))

 

0.7894736842105263

Experiment summary

With this lab, you should be able to achieve the following two goals:

    1. Master the principle of the Naive Bayes algorithm.
    2. Become familiar with a first practical application of the Naive Bayes algorithm.

References and Further Reading

References:

  • Peter Harrington (translated by Li Rui). Machine Learning in Action [M]. People's Posts and Telecommunications Press, 2013.
  • Zhou Zhihua. Machine Learning [M]. Tsinghua University Press, 2016.

Further reading:

  • Li Hang. Statistical Learning Methods [M]. Tsinghua University Press, 2012.

 
