Project Development Tutorial for a Spam Classification System Based on Naive Bayes

Project resource download

  1. Source code of the spam classification system based on Naive Bayes

Project Description

  This project applies the naive Bayes algorithm to the problem of spam classification and validates the result with a confusion matrix; the accuracy and recall (96% and 97%) are very good. In addition, a visual spam classification interface was developed, using PyQt for the interface design.


Project structure

  • data: dataset
    • trec06c: Chinese mail dataset
      • data: the raw email files
      • delay: mail text index and labels
      • full: mail text index and labels
  • model: trained model
  • cut_word_lists.npy: saved word segmentation results
  • main.py: all code for spam statistics and classification
  • stopwords.txt: stop words
  • VisualizationInterface.py: all code for the visual spam classification interface

Project development software environment

  • Windows 11
  • Python 3.7
  • PyCharm 2022.1

Project development hardware environment

  • CPU: Intel® Core™ i7-8750H @ 2.20 GHz
  • RAM: 24 GB
  • GPU: NVIDIA GeForce GTX 1060


Foreword

  This blog is a detailed summary of the development process of a spam classification system based on Naive Bayes, recording every step from principle to implementation. After reading it, readers should have a deeper understanding of the topic, and by following along step by step they can implement a Naive Bayes spam classification system themselves!


0. Project demo

  1. Results of training and testing the Naive Bayes classifier on the spam dataset:
    (screenshot)

  2. Confusion matrix of the Bayes classifier:
    (screenshot)

  3. Classifying a non-spam email:
    (screenshot)

  4. Classifying a spam email:
    (screenshot)

1. Mail dataset

  The dataset used in this project comes from the public spam corpus provided by the Text REtrieval Conference (TREC); you can download it from the official website. It is a widely used email dataset for testing spam filtering algorithms. It contains about 10,000 labeled development emails and 50,000 unlabeled test emails, more than 60,000 messages in total; about 40% of the messages are marked as spam and the rest as non-spam. This project uses the trec06c Chinese mail dataset; if you want to classify English spam, you can use the trec06p dataset instead.

  This dataset is widely used in research on spam filtering, email classification, and feature extraction, and it has been used to compare the performance of algorithms such as Support Vector Machines, Naive Bayes, and Decision Trees. Before using it, however, we need to perform some preprocessing on the emails, such as removing HTML tags, filtering stop words, and stemming. Common evaluation metrics include precision, recall, and F1 score; according to previous research, classification performance on this dataset is usually around 90%.

  You can download the dataset directly from the link provided above. Below I will describe in detail how to preprocess and use it. The following is an example of a Chinese email from this dataset:

  (screenshot)

2. Algorithm principle

2.1 Conditional probability formula

  Conditional probability is the probability of an event occurring given that another event has occurred. It is usually written $P(A|B)$: the probability that event $A$ occurs on the premise that event $B$ has occurred. The conditional probability formula is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}$$

  Here $P(A \cap B)$ is the probability that events $A$ and $B$ occur simultaneously, and $P(B)$ is the probability that event $B$ occurs. The formula can therefore be read as: the probability that $A$ and $B$ occur together, divided by the probability that $B$ occurs, is the probability that $A$ occurs on the premise that $B$ occurred.

  For example, suppose a class has 30 students, 20 of them boys and 10 girls. If a student is randomly selected from this class, what is the probability that the selected student is a boy? By the conditional probability formula:

$$P(\text{boy} \mid \text{selected}) = \frac{P(\text{boy} \cap \text{selected})}{P(\text{selected})}$$

  Here $P(\text{boy} \cap \text{selected})$ is the probability that the selected student is a boy, which is $\frac{20}{30} = \frac{2}{3}$, and $P(\text{selected}) = 1$, since we are certain to select some student. Therefore:

$$P(\text{boy} \mid \text{selected}) = \frac{2/3}{1} = \frac{2}{3}$$

  So if a student is randomly selected from this class, the probability that the student is a boy is $\frac{2}{3}$.
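
  We can verify this small calculation in Python:

    # Conditional probability on the classroom example above
    total_students, boys = 30, 20
    p_boy_and_selected = boys / total_students  # P(boy ∩ selected) = 20/30
    p_selected = 1.0                            # some student is always selected
    print(p_boy_and_selected / p_selected)      # 0.666... = 2/3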

2.2 Total probability formula

  The total probability formula is an important formula in probability theory for computing the probability of an event. It is useful for complex probability problems, especially when the probability of an event cannot be computed directly. The total probability formula is:

$$P(A) = \sum_{i=1}^{n} P(A|B_i) \cdot P(B_i)$$

  Here $P(A)$ is the probability that event $A$ occurs, $P(A|B_i)$ is the probability that $A$ occurs given that event $B_i$ occurs, and $P(B_i)$ is the probability that $B_i$ occurs. The sum $\sum_{i=1}^{n}$ runs over all possible events $B_1, B_2, \ldots, B_n$.

  The basic idea of the formula is to decompose the probability of event $A$ into a sum of probabilities under different conditions. In other words, $A$ may occur under different conditions; we consider all possible conditions, compute the probability of $A$ under each, and take a weighted sum, where the weights are the probabilities of the conditions, i.e. $P(B_i)$.

  For example, assume that in a class $\frac{1}{3}$ of the students like mathematics, $\frac{1}{3}$ like Chinese, and $\frac{1}{3}$ like English. We also know that $\frac{1}{2}$ of the students who like mathematics like computers, $\frac{1}{3}$ of the students who like Chinese like computers, and $\frac{1}{4}$ of the students who like English like computers.

  Now the question: if a student is chosen at random, what is the probability that he or she likes computers? We can solve this with the total probability formula:

$$P(\text{computer}) = P(\text{computer}|\text{math}) \cdot P(\text{math}) + P(\text{computer}|\text{Chinese}) \cdot P(\text{Chinese}) + P(\text{computer}|\text{English}) \cdot P(\text{English})$$

  Here $P(\text{computer}|\text{math})$ is the probability of liking computers among students who like mathematics, i.e. $\frac{1}{2}$, and $P(\text{math})$ is the proportion of students who like mathematics, i.e. $\frac{1}{3}$. Similarly, $P(\text{computer}|\text{Chinese}) = \frac{1}{3}$ with $P(\text{Chinese}) = \frac{1}{3}$, and $P(\text{computer}|\text{English}) = \frac{1}{4}$ with $P(\text{English}) = \frac{1}{3}$.

  Plugging these values into the formula, we get:

$$P(\text{computer}) = \frac{1}{2} \cdot \frac{1}{3} + \frac{1}{3} \cdot \frac{1}{3} + \frac{1}{4} \cdot \frac{1}{3} = \frac{13}{36}$$

  So a randomly chosen student likes computers with probability $\frac{13}{36}$.
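
  The same computation in Python, for readers following along:

    # Total probability formula with the proportions given above
    p_subject = {"math": 1 / 3, "chinese": 1 / 3, "english": 1 / 3}
    p_computer_given = {"math": 1 / 2, "chinese": 1 / 3, "english": 1 / 4}

    p_computer = sum(p_computer_given[s] * p_subject[s] for s in p_subject)
    print(p_computer)  # 0.3611... = 13/36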

2.3 Naive Bayes formula

2.3.1 What is the Naive Bayes formula

  Now that we have mastered the conditional probability formula and the total probability formula, let us go one step further and learn the Naive Bayes formula. Naive Bayes is a simple probabilistic classifier based on Bayes' theorem. It assumes that the features are conditionally independent given the class. Although this assumption rarely holds in the real world, Naive Bayes shows surprisingly good performance on many practical problems. The formula (Bayes' theorem) is:

$$P(A|B) = \frac{P(B|A) \times P(A)}{P(B)}$$

  Here $P(A|B)$ is the posterior probability: the probability of $A$ given $B$. $P(B|A)$ is the conditional probability of $B$ given $A$; $P(A)$ is the prior probability of $A$; and $P(B)$ is the prior probability of $B$.

  Let us illustrate Naive Bayes with a class of students. Suppose a class has 100 students, of which 60 are boys and 40 are girls. We observe that 30 of the boys wear glasses and 20 of the girls wear glasses. Question: given a student who wears glasses, what is the probability that he or she is a boy?

  We solve it with Bayes' theorem:

$$P(\text{boy}|\text{glasses}) = \frac{P(\text{glasses}|\text{boy}) \times P(\text{boy})}{P(\text{glasses})}$$

  In this example:

  • $P(\text{boy}) = \frac{60}{100} = 0.6$
  • $P(\text{girl}) = \frac{40}{100} = 0.4$
  • $P(\text{glasses}|\text{boy}) = \frac{30}{60} = 0.5$
  • $P(\text{glasses}|\text{girl}) = \frac{20}{40} = 0.5$

  We first compute $P(\text{glasses})$ using the total probability formula:

$$P(\text{glasses}) = P(\text{glasses}|\text{boy}) \times P(\text{boy}) + P(\text{glasses}|\text{girl}) \times P(\text{girl}) = 0.5 \times 0.6 + 0.5 \times 0.4 = 0.5$$

  Then we compute the posterior probability with the Naive Bayes formula:

$$P(\text{boy}|\text{glasses}) = \frac{P(\text{glasses}|\text{boy}) \times P(\text{boy})}{P(\text{glasses})} = \frac{0.5 \times 0.6}{0.5} = 0.6$$

  So, given a student who wears glasses, the probability that he or she is a boy is $60\%$.
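
  Again as a quick check in Python:

    # Bayes' theorem on the glasses example, with the numbers given above
    p_boy, p_girl = 0.6, 0.4
    p_glasses_given_boy, p_glasses_given_girl = 0.5, 0.5

    # evidence term P(glasses) via the total probability formula
    p_glasses = p_glasses_given_boy * p_boy + p_glasses_given_girl * p_girl
    print(p_glasses_given_boy * p_boy / p_glasses)  # 0.6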

2.3.2 Using the Naive Bayes formula for spam classification

  Having mastered the knowledge above, we can use the Naive Bayes formula to solve the spam classification problem. Before formally solving it, let us think about how we as humans classify things. To identify grass, flowers, or trees among plants, we analyze features such as color and shape; to identify cats, dogs, or chickens among animals, we analyze features such as sound and appearance. Extending this idea to email classification, people can likewise classify an email by its content: if an email is full of sentences like "Today everything is 50% off, we look forward to your visit, welcome to buy our company's products", it is obviously spam. So how can we make a computer recognize spam the way a human does?

  The idea is actually very simple: the basis for classifying an email is its keywords. If an email contains many words related to daily greetings, such as "weather, mood, life", we can treat it as an ordinary email for daily communication rather than spam; if it contains many promotional words such as "discount, room rate, invoice", we can treat it as spam. So can a computer classify spam according to the keywords in the mail? The answer is yes. We now define the following:

  • $spam$ denotes a spam email
  • $ham$ denotes a non-spam email
  • $x$ denotes the text content of an email
  • $y$ denotes the classification result

  If we now have an email to classify, we do not yet know whether it belongs to $spam$ or to $ham$, so we need to compute its posterior probability for each category:

  • When the input email text is $x$, the probability that the email is $spam$ is:
    $$P(Y=spam|X=x)=\frac{P(X=x|Y=spam) \times P(Y=spam)}{P(X=x|Y=spam) \times P(Y=spam) + P(X=x|Y=ham) \times P(Y=ham)}$$

  • When the input email text is $x$, the probability that the email is $ham$ is:
    $$P(Y=ham|X=x)=\frac{P(X=x|Y=ham) \times P(Y=ham)}{P(X=x|Y=spam) \times P(Y=spam) + P(X=x|Y=ham) \times P(Y=ham)}$$

  Finally, if $P(Y=spam|X=x)$ is the larger of the two, $x$ is spam; otherwise $x$ is non-spam. The formulas above look complicated to compute, but their denominators are equal and only the numerators differ, so we can simplify to:

$$y = \arg\max_{y} P(X=x|Y=y) \times P(Y=y), \quad y \in \{spam, ham\}$$
  It is also important to note that although $x$ is the input email text, it has been processed into the feature vector corresponding to that text, and the Naive Bayes formula assumes that the features are conditionally independent; that is, the TF-IDF values of the individual words in the feature vector do not affect one another. So the formula above can be simplified further to:

$$y = \arg\max_{y} \prod_{i} P(X=x^{(i)}|Y=y) \times P(Y=y), \quad y \in \{spam, ham\}$$

  We can then train on the data and use maximum likelihood estimation to compute the conditional probability $P(X=x^{(i)}|Y=y)$ of each keyword, while the prior probability $P(Y=y)$ is easy to compute. When a new email text $x$ is entered, we apply the same data processing to it, substitute into the formula above to get the classification probabilities, and take the class with the largest probability as the final result, which tells us whether the input $x$ is spam or not.
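
  To make this decision rule concrete, here is a minimal sketch with made-up word likelihoods (toy values, not estimates from trec06c), using sums of log probabilities so the product does not underflow:

    import math

    # Toy priors and per-word likelihoods; the numbers are illustrative only
    priors = {"spam": 0.4, "ham": 0.6}
    likelihood = {
        "spam": {"discount": 0.05, "invoice": 0.04, "weather": 0.001},
        "ham": {"discount": 0.002, "invoice": 0.003, "weather": 0.03},
    }

    def classify(words):
        scores = {}
        for label in priors:
            # log P(Y=y) + sum of log P(X=x_i | Y=y) over the email's words
            score = math.log(priors[label])
            for w in words:
                if w in likelihood[label]:
                    score += math.log(likelihood[label][w])
            scores[label] = score
        return max(scores, key=scores.get)

    print(classify(["discount", "invoice"]))  # -> spam
    print(classify(["weather"]))              # -> ham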

3. Code implementation

3.1 Spam statistics and classification code

  1. First import the libraries needed for the project to run:

    import numpy as np
    import matplotlib.pyplot as plt
    import re
    import jieba
    import itertools
    import time
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix, recall_score
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    
  2. Then get the path and label of each data file from the index file. Labels are marked 0 and 1: 0 means the email is not spam, 1 means it is spam. The result is a list of mail file paths and the corresponding list of labels:

    def get_path_label():
        """
        Get the data file paths and labels from the index file
        :return: data file path list, data file label list
        """
        with open("data/trec06c/full/index", "r", encoding="gb2312", errors="ignore") as index_file:
            label_path = index_file.readlines()
        label_path_split = [data.split() for data in label_path if len(data.split()) == 2]
        label_list = [1 if data[0] == "spam" else 0 for data in label_path_split]
        path_list = [data[1].replace("..", "trec06c") for data in label_path_split]
        return path_list, label_list
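
    For reference, each line of the full/index file pairs a label with a relative path to one mail file, roughly like this (the paths shown are illustrative):

    spam ../data/000/000
    ham ../data/000/001

    which is why the code keeps only lines that split into exactly two fields and rewrites the leading .. so the paths point into the trec06c directory.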
    
  3. Then obtain the text content of each email from the file path list built in the previous step and apply simple cleanup, such as removing special header fields and line breaks, producing a list of email bodies. Note that this function only needs to be called on the first run: the email bodies are only used for word segmentation, and since the segmentation results are saved locally later, they can be loaded directly afterwards, which saves time:

    def get_data(path_list):
        """
        Open the data file at the given path and extract the email body
        :param path_list: path of one email file
        :return: the extracted email body
        """
        with open(path_list, "r", encoding="gb2312", errors="ignore") as mail:
            mail_text = mail.readlines()
        # Header lines start with ASCII characters; the body begins after the last of them
        mail_head_index = [mail_text.index(i) for i in mail_text if re.match("[a-zA-Z0-9]", i)]
        text = ''.join(mail_text[max(mail_head_index) + 1:])
        text = re.sub(r'\s+', '', re.sub("\u3000", "", re.sub("\n", "", text)))
        return text
    
  4. Then load the stop word list. Stop words are unimportant words such as "ah", "they", "no"; they do not help the final classification and only consume storage space and reduce search efficiency, so we should ignore them. The purpose of loading the stop word list is to remove the stop words from the email bodies just obtained. The stop word list is included in my project and can be used directly after downloading. Also note that only the first run needs to load it: the stop word list is only used during word segmentation, and the segmentation results have been saved locally, so later runs can call the saved results directly and skip this step, which saves time:

    def upload_stopword():
        """
        Load the stop words
        :return: the loaded stop word list
        """
        with open("stopwords.txt", encoding="utf-8") as file:
            data = file.read()
            return data.split("\n")
    
  5. Then use the jieba word segmentation tool to segment the email texts. Because we want to measure the influence of each word in each email on its final classification, we use the stop word list just loaded to remove the unnecessary words. Note again that this function is only called on the first use, since segmentation is very time-consuming; for all later tests the saved segmentation results, which I have placed in the project's main directory, can be called directly, which saves time:

    def participle(mail_list, stopword_list):
        """
        Segment the email texts with jieba
        :param mail_list: email texts
        :param stopword_list: stop word list
        :return: segmentation results of the email texts
        """
        cur_word_list = []
        stopword_set = set(stopword_list)  # build the set once instead of once per email
        startTime = time.time()
        for mail in mail_list:
            cut_word = [data for data in jieba.lcut(mail) if data not in stopword_set]
            cur_word_list.append(cut_word)
        print("jieba分词用时%0.2f秒" % (time.time() - startTime))
        return cur_word_list
    
  6. Then compute the term frequency (TF) of each word in the email text. TF measures the importance of a word: the higher the TF value, the more often the word appears in the email and the more important it is, and vice versa. The TF formula is:

$$TF_{\text{word}} = \frac{\text{number of occurrences of the word in the email}}{\text{total number of words in the email}}$$

    This looks like a lot of computation, but it is very simple in Python: the CountVectorizer() function does the term counting for us, and we return the resulting count matrix:

    def get_TF(cur_word_list):
        """
        Compute the TF (term count) matrix
        :param cur_word_list: list of segmentation results
        :return: TF matrix
        """
        text = [' '.join(data) for data in cur_word_list]
        cv = CountVectorizer(max_features=5000, max_df=0.6, min_df=5)
        count_list = cv.fit_transform(text)
        return count_list
    

    A brief introduction to the parameter values in CountVectorizer():

    • max_features=5000: sort terms by frequency in descending order and keep the top 5000 words as the keyword set
    • max_df=0.6: remove words that appear in more than 60% of the emails, because they occur too often to be discriminative
    • min_df=5: remove words that appear in fewer than 5 emails, because they occur too rarely to be discriminative
  7. Then compute the inverse document frequency (IDF) of the words in the email text. IDF measures how discriminative a word is for a given email: if a word appears many times in one email but hardly at all in other emails, the word can be considered important for that email, meaning the email can be distinguished from the others by that word, and vice versa. The IDF formula is:

$$IDF_{\text{word}} = \log\left(\frac{\text{total number of emails in the dataset}}{\text{number of emails containing the word} + 1}\right)$$

    We have just computed the TF values; multiplying the TF value by the IDF value gives the TF-IDF value of a word:

$$TF\text{-}IDF_{\text{word}} = TF_{\text{word}} \times IDF_{\text{word}}$$

    A word with a high TF-IDF value can serve as a feature attribute of the email, because the TF-IDF value combines the weight of a word's importance with the weight of its discriminative power: it lowers the importance of words with low feature values and raises that of words with high feature values, and those words can then be used to classify emails. The steps above may look complicated, but in Python only two lines (or even one) are needed to compute TF-IDF:

    def get_TFIDF(count_list):
        """
        Compute TF-IDF
        :param count_list: the computed TF matrix
        :return: TF-IDF matrix
        """
        TF_IDF = TfidfTransformer()
        TF_IDF_matrix = TF_IDF.fit_transform(count_list)
        return TF_IDF_matrix
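
    To make the two steps concrete, here is a minimal sketch on a made-up three-document corpus (this toy CountVectorizer uses default parameters rather than the ones above):

    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # Toy corpus: three already-segmented "emails" (space-separated words)
    texts = ["打折 发票 优惠", "天气 心情 生活", "打折 优惠 回馈"]
    cv = CountVectorizer()                            # step 1: term counts
    counts = cv.fit_transform(texts)
    tfidf = TfidfTransformer().fit_transform(counts)  # step 2: TF-IDF weighting
    print(sorted(cv.vocabulary_))                     # the learned vocabulary
    print(tfidf.toarray().round(2))                   # one L2-normalized row per email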
    
  8. To visualize the classification test results, we can compute the confusion matrix. In the plot below, each row of the confusion matrix represents the true values and each column the predicted values (you can also swap what the rows and columns represent); from the confusion matrix we can roughly see the classification performance of the trained model:

    def plt_confusion_matrix(confusion_matrix, classes, title, cmap=plt.cm.Blues):
        """
        Plot the confusion matrix
        :param confusion_matrix: confusion matrix values
        :param classes: classification categories
        :param title: title of the plot
        :param cmap: plot style
        """
        plt.rcParams['font.sans-serif'] = ['SimHei']
        plt.rcParams['axes.unicode_minus'] = False
        plt.imshow(confusion_matrix, interpolation="nearest", cmap=cmap)
        plt.title(title)
        plt.colorbar()
        axis_marks = np.arange(len(classes))
        plt.xticks(axis_marks, classes, rotation=0)
        plt.yticks(axis_marks, classes, rotation=0)
        axis_line = confusion_matrix.max() / 2.
        for i, j in itertools.product(range(confusion_matrix.shape[0]), range(confusion_matrix.shape[1])):
            plt.text(j, i, confusion_matrix[i, j], horizontalalignment="center",
                     color="white" if confusion_matrix[i, j] > axis_line else "black")
        plt.tight_layout()
        plt.xlabel("预测结果")
        plt.ylabel("真实结果")
        plt.show()
    
  9. Then use the Bayes classifier to classify spam, training and testing on the TF-IDF matrix and the label list just computed:

    def train_test_Bayes(TF_IDF_matrix_result, label_list):
        """
        Train and test the Naive Bayes classifier on the spam dataset
        :param TF_IDF_matrix_result: TF-IDF value matrix
        :param label_list: label list
        :return: None
        """
        print(">>>>>>>>>>>>>朴素贝叶斯分类器垃圾邮件分类<<<<<<<<<<<<<")
        train_x, test_x, train_y, test_y = train_test_split(TF_IDF_matrix_result, label_list, test_size=0.2, random_state=0)
        classifier = MultinomialNB()
        startTime = time.time()
        classifier.fit(train_x, train_y)
        print("贝叶斯分类器训练耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startTime)
        score = classifier.score(test_x, test_y)
        print("贝叶斯分类器的分类结果准确率>>>>>>>>>>>>>>>>>>>>>>>>>>>", score)
        predict_y = classifier.predict(test_x)
        print("贝叶斯分类器的召回率>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", recall_score(test_y, predict_y))
        plt_confusion_matrix(confusion_matrix(test_y, predict_y), [0, 1], title="贝叶斯分类器混淆矩阵")
    
    • First use train_test_split() to divide the data into training and test sets; in this project the split is 80% training and 20% test.
    • Then train the Naive Bayes model with MultinomialNB(). The naive_bayes module in the sklearn library actually implements five Naive Bayes algorithms; we choose MultinomialNB() because it suits discrete features that follow a multinomial distribution, i.e. the word TF-IDF values. In addition, MultinomialNB() defaults to alpha=1, i.e. Laplacian smoothing: counts are incremented by one so that words never seen in a class still get a nonzero probability, avoiding probabilities of 0 caused by limited training samples; the resulting estimates are closer to the true probabilities (see the note after this list).
    • Then compute the accuracy of the model
    • Then compute the recall of the model
    • Finally, draw the confusion matrix with the plt_confusion_matrix() method introduced above
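
    For completeness, the smoothed estimate MultinomialNB uses for the conditional probability of feature $i$ under class $y$ is

$$\hat{P}(x^{(i)} \mid y) = \frac{N_{yi} + \alpha}{N_{y} + \alpha n}$$

    where $N_{yi}$ is the total (here TF-IDF-weighted) count of feature $i$ over the training samples of class $y$, $N_{y}$ is the sum of those counts over all features, and $n$ is the number of features (5000 in this project); with $\alpha = 1$ no word ever gets probability exactly zero.
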
  10. After completing each sub-module, we can integrate them to finish the spam classification task. The sub-modules are called in the main function in the same order in which they were introduced above; I only additionally timed each sub-step. Note that the three functions get_data(), upload_stopword(), and participle() are only used on the first run and can be commented out afterwards, which saves a lot of time, because our training and test data are so large that repeating these steps would take far too long:

    if __name__ == '__main__':
        print(">>>>>>>>>>>>>>>>>>>>>>>>垃圾邮件分类系统开始运行<<<<<<<<<<<<<<<<<<<<<<<<")
        path_lists, label_lists = get_path_label()
        '''
            Only needed on the first run: the segmentation results are saved locally afterwards, so the email texts do not have to be reloaded each time, and the stop word list has already been applied during segmentation, so it does not need reloading either
        '''
        # mail_texts = [get_data(path) for path in path_lists]
        # stopword_lists = upload_stopword()
        '''
            Segment the filtered dataset and save the segmentation results. Uncomment only on the first run, since segmentation is very time-consuming; for later tests the saved segmentation results can be loaded directly
        '''
        # cut_word_lists = participle(mail_texts, stopword_lists)
        # cut_word_lists = np.array(cut_word_lists)
        # np.save("cut_word_lists.npy", cut_word_lists)
        startOne = time.time()
        cut_word_lists = np.load('cut_word_lists.npy', allow_pickle=True)  # allow_pickle=True: the array holds Python lists
        print("加载分词文件耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startOne)
        startTwo = time.time()
        cut_word_lists = cut_word_lists.tolist()
        print("转换分词列表耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startTwo)
        startThree = time.time()
        count_lists = get_TF(cut_word_lists)
        print("获取分词TF值耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startThree)
        startFour = time.time()
        TF_IDF_matrix_results = get_TFIDF(count_lists)
        print("获取分词TF-IDF值耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startFour)
        startFive = time.time()
        train_test_Bayes(TF_IDF_matrix_results, label_lists)
        print("贝叶斯分类器训练和预测耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startFive)
    
  11. Finally, we can run the main function to see the training results of the model:

    • The whole project runs quite fast; with the three functions just mentioned uncommented, it would be much slower. Speed is not the key point, though: the model reaches an accuracy of 96% and a recall of 97%, so the training result is good and the experiment is a success:
      (screenshot)

    • Looking at the generated confusion matrix, almost all emails are classified correctly, which also confirms visually that the classification accuracy of our trained model is very high:
      (screenshot)

3.2 Visual spam classification code

  1. First import the libraries needed for the project to run:

    import sys
    import re
    import jieba
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from PyQt5 import QtCore, QtGui, QtWidgets
    from scipy.sparse import csr_matrix
    import joblib
    
  2. Then load the stop word list to remove the stop words from the test text, as described above:

    def upload_stopword():
        """
        Load the stop words
        :return: the loaded stop word list
        """
        with open("stopwords.txt", encoding="utf-8") as file:
            data = file.read()
            return data.split("\n")
    
  3. Then preprocess the test text and compute its TF-IDF values. This part mirrors the process described above, so I won't repeat the details:

    def handle_mail(mail, stopword_list):
        """
        Preprocess the email text, compute its TF-IDF values, and return the TF-IDF matrix
        :param mail: email text to be tested
        :param stopword_list: stop word list
        :return: TF-IDF value matrix
        """
        '''
            Strip the mail header as well as spaces, line breaks, and so on
        '''
        mail_head_index = [mail.index(i) for i in mail if re.match("[a-zA-Z0-9]", i)]
        mail_res = ''.join(mail[max(mail_head_index) + 1:])
        mail_res = re.sub(r'\s+', '', re.sub("\u3000", "", re.sub("\n", "", mail_res)))
        '''
            Segment the email text
        '''
        cut_word = [data for data in jieba.lcut(mail_res) if data not in set(stopword_list)]
        '''
            Compute the TF-IDF values of the email text
        '''
        text = [' '.join(cut_word)]
        cv = CountVectorizer()
        TF_IDF = TfidfTransformer()
        TF_IDF_matrix = TF_IDF.fit_transform(cv.fit_transform(text))
        # Pad the sparse row out to the 5000 feature columns the model expects
        TF_IDF_matrix_res = csr_matrix((TF_IDF_matrix.data, TF_IDF_matrix.indices, TF_IDF_matrix.indptr), shape=(1, 5000))
        return TF_IDF_matrix_res
    
  4. Then directly call the trained model for testing. I have saved the trained model in the project's model directory; readers can of course also test with their own trained models. How to do that is easy to look up and not the focus of this blog, so I won't repeat it here:

    def predict_res(mail):
        """
        Classify the given test email with the trained model
        :param mail: email to be tested
        :return: classification prediction result
        """
        classifier = joblib.load("model/classifier.pkl")
        res = classifier.predict(mail)
        return res
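
    One thing the training code in Section 3.1 does not show is how model/classifier.pkl was produced. A minimal sketch, assuming you add it at the end of train_test_Bayes() after fitting:

    import joblib

    # Persist the fitted classifier so VisualizationInterface.py can load it later
    joblib.dump(classifier, "model/classifier.pkl")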
    
  5. Then draw the test window. This part is not the focus of this blog, so I won't go through it in detail. The only function that needs introducing is my_func(), which runs the modules above on the text entered in the input box and displays the classification result in the window:

    class Ui_Dialog(object):
        """
        Visual display of spam classification
        """
    
        def __init__(self):
            """
            Initialize the widget attributes
            """
            self.pushButton_2 = None
            self.pushButton = None
            self.horizontalLayout = None
            self.plainTextEdit_2 = None
            self.plainTextEdit = None
            self.horizontalLayout_2 = None
            self.verticalLayout = None
            self.verticalLayout_2 = None
    
        def setupUi(self, Dialog):
            """
            Draw the UI of the spam classifier
            :param Dialog: the current dialog window
            """
            Dialog.setObjectName("基于朴素贝叶斯的垃圾邮件过滤系统")
            Dialog.resize(525, 386)
            self.verticalLayout_2 = QtWidgets.QVBoxLayout(Dialog)
            self.verticalLayout_2.setObjectName("verticalLayout_2")
            self.verticalLayout = QtWidgets.QVBoxLayout()
            self.verticalLayout.setObjectName("verticalLayout")
            self.horizontalLayout_2 = QtWidgets.QHBoxLayout()
            self.horizontalLayout_2.setObjectName("horizontalLayout_2")
            self.plainTextEdit = QtWidgets.QPlainTextEdit(Dialog)
            font = QtGui.QFont()
            font.setFamily("SimHei")
            font.setPointSize(12)
            self.plainTextEdit.setFont(font)
            self.plainTextEdit.setObjectName("plainTextEdit")
            self.horizontalLayout_2.addWidget(self.plainTextEdit)
            self.plainTextEdit_2 = QtWidgets.QPlainTextEdit(Dialog)
            font = QtGui.QFont()
            font.setFamily("SimHei")
            font.setPointSize(12)
            self.plainTextEdit_2.setFont(font)
            self.plainTextEdit_2.setObjectName("plainTextEdit_2")
            self.horizontalLayout_2.addWidget(self.plainTextEdit_2)
            self.verticalLayout.addLayout(self.horizontalLayout_2)
            self.horizontalLayout = QtWidgets.QHBoxLayout()
            self.horizontalLayout.setObjectName("horizontalLayout")
            self.pushButton = QtWidgets.QPushButton(Dialog)
            font = QtGui.QFont()
            font.setFamily("SimHei")
            font.setPointSize(16)
            self.pushButton.setFont(font)
            self.pushButton.setObjectName("pushButton")
            self.horizontalLayout.addWidget(self.pushButton)
            self.pushButton_2 = QtWidgets.QPushButton(Dialog)
            font = QtGui.QFont()
            font.setFamily("SimHei")
            font.setPointSize(16)
            self.pushButton_2.setFont(font)
            self.pushButton_2.setObjectName("pushButton_2")
            self.horizontalLayout.addWidget(self.pushButton_2)
            self.verticalLayout.addLayout(self.horizontalLayout)
            self.verticalLayout_2.addLayout(self.verticalLayout)
            self.retranslateUi(Dialog)
            QtCore.QMetaObject.connectSlotsByName(Dialog)
            self.pushButton.clicked.connect(self.my_func)
    
        def retranslateUi(self, Dialog):
            """
            Draw the buttons of the current window
            :param Dialog: the current dialog window
            """
            _translate = QtCore.QCoreApplication.translate
            Dialog.setWindowTitle(_translate("Dialog", "Dialog"))
            self.pushButton.setText(_translate("Dialog", "确定"))  # "OK"
            self.pushButton_2.setText(_translate("Dialog", "清空"))  # "Clear"
            self.pushButton_2.clicked.connect(self.plainTextEdit.clear)
            self.pushButton_2.clicked.connect(self.plainTextEdit_2.clear)
    
        def my_func(self):
            """
            Call each functional module
            """
            stopword_lists = upload_stopword()
            mail = self.plainTextEdit.toPlainText()
            mail_handle = handle_mail(mail, stopword_lists)
            res = predict_res(mail_handle)
            if res[0] == 1:
                show_res = "这是垃圾邮件"  # "This is spam"
            else:
                show_res = "这不是垃圾邮件"  # "This is not spam"
            self.plainTextEdit_2.setPlainText(show_res)
    
  6. Then the main function directly calls the window module introduced above (the prediction module is already embedded in the window module):

    if __name__ == "__main__":
        app = QtWidgets.QApplication(sys.argv)
        Dialog = QtWidgets.QDialog()
        ui = Ui_Dialog()
        ui.setupUi(Dialog)
        Dialog.show()
        sys.exit(app.exec_())
    
  7. Finally, we can run the main function to see the running effect:

    • When we enter a non-spam email, the window outputs "这不是垃圾邮件" ("This is not spam"), indicating a successful prediction:
      (screenshot)

    • When we enter a spam email, the window outputs "这是垃圾邮件" ("This is spam"), indicating a successful prediction:
      (screenshot)

4. An open problem

  So far the model training results look good, but a problem remains. For the visual email classification we take an input test email, segment it, and compute its TF-IDF values. But when training and testing the model earlier, because the total amount of data was very large, we fixed the dimensionality at 5000, i.e. we kept the top 5000 words as the keyword set; a single input test email can never fill 5000 dimensions on its own, which makes prediction impossible without further handling.

  The workaround I could think of is to use the csr_matrix() function with its shape argument to pad the computed TF-IDF matrix of the test email out to 5000 dimensions:

TF_IDF_matrix_res = csr_matrix((TF_IDF_matrix.data, TF_IDF_matrix.indices, TF_IDF_matrix.indptr), shape=(1, 5000))

  This way a predicted class can be output successfully, but the quality of the resulting classification is consistently poor, and I have not yet thought of a better way to solve the problem. If readers have a good approach, please feel free to contact me!
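
  For what it's worth, the standard remedy is not to pad the matrix but to reuse the vectorizer fitted on the training corpus: persist the fitted CountVectorizer and TfidfTransformer next to the model, and at prediction time call transform() (not fit_transform()) so the test email is projected onto the same 5000-word vocabulary with the training-time IDF weights. A minimal sketch, with hypothetical file names and with train_texts / new_mail_text standing in for the segmented training corpus and the new email:

    import joblib
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

    # --- training time: fit on the full corpus and save the fitted objects ---
    cv = CountVectorizer(max_features=5000, max_df=0.6, min_df=5)
    counts = cv.fit_transform(train_texts)
    tfidf = TfidfTransformer().fit(counts)
    joblib.dump(cv, "model/count_vectorizer.pkl")
    joblib.dump(tfidf, "model/tfidf_transformer.pkl")

    # --- prediction time: transform the new mail into the same feature space ---
    cv = joblib.load("model/count_vectorizer.pkl")
    tfidf = joblib.load("model/tfidf_transformer.pkl")
    features = tfidf.transform(cv.transform([new_mail_text]))  # shape (1, 5000)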

5. Full code

5.1 Full code for spam statistics and classification

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author:IronmanJay
# email:[email protected]

# Import the libraries required to run the program
import numpy as np
import matplotlib.pyplot as plt
import re
import jieba
import itertools
import time
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer


def get_path_label():
    """
    Get the data file paths and labels from the index file
    :return: data file path list, data file label list
    """
    with open("data/trec06c/full/index", "r", encoding="gb2312", errors="ignore") as index_file:
        label_path = index_file.readlines()
    label_path_split = [data.split() for data in label_path if len(data.split()) == 2]
    label_list = [1 if data[0] == "spam" else 0 for data in label_path_split]
    path_list = [data[1].replace("..", "trec06c") for data in label_path_split]
    return path_list, label_list


def get_data(path_list):
    """
    Open the data file at the given path and extract the email body
    :param path_list: path of one email file
    :return: the extracted email body
    """
    with open(path_list, "r", encoding="gb2312", errors="ignore") as mail:
        mail_text = mail.readlines()
    # Header lines start with ASCII characters; the body begins after the last of them
    mail_head_index = [mail_text.index(i) for i in mail_text if re.match("[a-zA-Z0-9]", i)]
    text = ''.join(mail_text[max(mail_head_index) + 1:])
    text = re.sub(r'\s+', '', re.sub("\u3000", "", re.sub("\n", "", text)))
    return text


def upload_stopword():
    """
    Load the stop words
    :return: the loaded stop word list
    """
    with open("stopwords.txt", encoding="utf-8") as file:
        data = file.read()
        return data.split("\n")


def participle(mail_list, stopword_list):
    """
    Segment the email texts with jieba
    :param mail_list: email texts
    :param stopword_list: stop word list
    :return: segmentation results of the email texts
    """
    cur_word_list = []
    stopword_set = set(stopword_list)  # build the set once instead of once per email
    startTime = time.time()
    for mail in mail_list:
        cut_word = [data for data in jieba.lcut(mail) if data not in stopword_set]
        cur_word_list.append(cut_word)
    print("jieba分词用时%0.2f秒" % (time.time() - startTime))
    return cur_word_list


def get_TF(cur_word_list):
    """
    Compute the TF (term count) matrix
    :param cur_word_list: list of segmentation results
    :return: TF matrix
    """
    text = [' '.join(data) for data in cur_word_list]
    cv = CountVectorizer(max_features=5000, max_df=0.6, min_df=5)
    count_list = cv.fit_transform(text)
    return count_list


def get_TFIDF(count_list):
    """
    Compute TF-IDF
    :param count_list: the computed TF matrix
    :return: TF-IDF matrix
    """
    TF_IDF = TfidfTransformer()
    TF_IDF_matrix = TF_IDF.fit_transform(count_list)
    return TF_IDF_matrix


def plt_confusion_matrix(confusion_matrix, classes, title, cmap=plt.cm.Blues):
    """
    Plot the confusion matrix
    :param confusion_matrix: confusion matrix values
    :param classes: classification categories
    :param title: title of the plot
    :param cmap: plot style
    """
    plt.rcParams['font.sans-serif'] = ['SimHei']
    plt.rcParams['axes.unicode_minus'] = False
    plt.imshow(confusion_matrix, interpolation="nearest", cmap=cmap)
    plt.title(title)
    plt.colorbar()
    axis_marks = np.arange(len(classes))
    plt.xticks(axis_marks, classes, rotation=0)
    plt.yticks(axis_marks, classes, rotation=0)
    axis_line = confusion_matrix.max() / 2.
    for i, j in itertools.product(range(confusion_matrix.shape[0]), range(confusion_matrix.shape[1])):
        plt.text(j, i, confusion_matrix[i, j], horizontalalignment="center",
                 color="white" if confusion_matrix[i, j] > axis_line else "black")
    plt.tight_layout()
    plt.xlabel("预测结果")
    plt.ylabel("真实结果")
    plt.show()


def train_test_Bayes(TF_IDF_matrix_result, label_list):
    """
    Train and test the Naive Bayes classifier on the spam dataset
    :param TF_IDF_matrix_result: TF-IDF value matrix
    :param label_list: label list
    :return: None
    """
    print(">>>>>>>>>>>>>朴素贝叶斯分类器垃圾邮件分类<<<<<<<<<<<<<")
    train_x, test_x, train_y, test_y = train_test_split(TF_IDF_matrix_result, label_list, test_size=0.2, random_state=0)
    classifier = MultinomialNB()
    startTime = time.time()
    classifier.fit(train_x, train_y)
    print("贝叶斯分类器训练耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startTime)
    score = classifier.score(test_x, test_y)
    print("贝叶斯分类器的分类结果准确率>>>>>>>>>>>>>>>>>>>>>>>>>>>", score)
    predict_y = classifier.predict(test_x)
    print("贝叶斯分类器的召回率>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", recall_score(test_y, predict_y))
    plt_confusion_matrix(confusion_matrix(test_y, predict_y), [0, 1], title="贝叶斯分类器混淆矩阵")


if __name__ == '__main__':
    print(">>>>>>>>>>>>>>>>>>>>>>>>垃圾邮件分类系统开始运行<<<<<<<<<<<<<<<<<<<<<<<<")
    path_lists, label_lists = get_path_label()
    '''
        Only needed on the first run: the segmentation results are saved locally afterwards, so the email texts do not have to be reloaded each time, and the stop word list has already been applied during segmentation, so it does not need reloading either
    '''
    # mail_texts = [get_data(path) for path in path_lists]
    # stopword_lists = upload_stopword()
    '''
        Segment the filtered dataset and save the segmentation results. Uncomment only on the first run, since segmentation is very time-consuming; for later tests the saved segmentation results can be loaded directly
    '''
    # cut_word_lists = participle(mail_texts, stopword_lists)
    # cut_word_lists = np.array(cut_word_lists)
    # np.save("cut_word_lists.npy", cut_word_lists)
    startOne = time.time()
    cut_word_lists = np.load('cut_word_lists.npy', allow_pickle=True)  # allow_pickle=True: the array holds Python lists
    print("加载分词文件耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startOne)
    startTwo = time.time()
    cut_word_lists = cut_word_lists.tolist()
    print("转换分词列表耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startTwo)
    startThree = time.time()
    count_lists = get_TF(cut_word_lists)
    print("获取分词TF值耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startThree)
    startFour = time.time()
    TF_IDF_matrix_results = get_TFIDF(count_lists)
    print("获取分词TF-IDF值耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startFour)
    startFive = time.time()
    train_test_Bayes(TF_IDF_matrix_results, label_lists)
    print("贝叶斯分类器训练和预测耗时>>>>>>>>>>>>>>>>>>>>>>>>>>>>", time.time() - startFive)

5.2 Full code for the visual spam classifier

#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Author:IronmanJay
# email:[email protected]

# Import the libraries required to run the program
import sys
import re
import jieba
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from PyQt5 import QtCore, QtGui, QtWidgets
from scipy.sparse import csr_matrix
import joblib


def upload_stopword():
    """
    Load the stop words
    :return: the loaded stop word list
    """
    with open("stopwords.txt", encoding="utf-8") as file:
        data = file.read()
        return data.split("\n")


def handle_mail(mail, stopword_list):
    """
    Preprocess the email text, compute its TF-IDF values, and return the TF-IDF matrix
    :param mail: email text to be tested
    :param stopword_list: stop word list
    :return: TF-IDF value matrix
    """
    '''
        Strip the mail header as well as spaces, line breaks, and so on
    '''
    mail_head_index = [mail.index(i) for i in mail if re.match("[a-zA-Z0-9]", i)]
    mail_res = ''.join(mail[max(mail_head_index) + 1:])
    mail_res = re.sub(r'\s+', '', re.sub("\u3000", "", re.sub("\n", "", mail_res)))
    '''
        Segment the email text
    '''
    cut_word = [data for data in jieba.lcut(mail_res) if data not in set(stopword_list)]
    '''
        Compute the TF-IDF values of the email text
    '''
    text = [' '.join(cut_word)]
    cv = CountVectorizer()
    TF_IDF = TfidfTransformer()
    TF_IDF_matrix = TF_IDF.fit_transform(cv.fit_transform(text))
    # Pad the sparse row out to the 5000 feature columns the model expects
    TF_IDF_matrix_res = csr_matrix((TF_IDF_matrix.data, TF_IDF_matrix.indices, TF_IDF_matrix.indptr), shape=(1, 5000))
    return TF_IDF_matrix_res


def predict_res(mail):
    """
    Classify the given test email with the trained model
    :param mail: email to be tested
    :return: classification prediction result
    """
    classifier = joblib.load("model/classifier.pkl")
    res = classifier.predict(mail)
    return res


class Ui_Dialog(object):
    """
    Visual display of spam classification
    """

    def __init__(self):
        """
        Initialize the widget attributes
        """
        self.pushButton_2 = None
        self.pushButton = None
        self.horizontalLayout = None
        self.plainTextEdit_2 = None
        self.plainTextEdit = None
        self.horizontalLayout_2 = None
        self.verticalLayout = None
        self.verticalLayout_2 = None

    def setupUi(self, Dialog):
        """
        Draw the UI of the spam classifier
        :param Dialog: the current dialog window
        """
        Dialog.setObjectName("基于朴素贝叶斯的垃圾邮件过滤系统")
        Dialog.resize(525, 386)
        self.verticalLayout_2 = QtWidgets.QVBoxLayout(Dialog)
        self.verticalLayout_2.setObjectName("verticalLayout_2")
        self.verticalLayout = QtWidgets.QVBoxLayout()
        self.verticalLayout.setObjectName("verticalLayout")
        self.horizontalLayout_2 = QtWidgets.QHBoxLayout()
        self.horizontalLayout_2.setObjectName("horizontalLayout_2")
        self.plainTextEdit = QtWidgets.QPlainTextEdit(Dialog)
        font = QtGui.QFont()
        font.setFamily("SimHei")
        font.setPointSize(12)
        self.plainTextEdit.setFont(font)
        self.plainTextEdit.setObjectName("plainTextEdit")
        self.horizontalLayout_2.addWidget(self.plainTextEdit)
        self.plainTextEdit_2 = QtWidgets.QPlainTextEdit(Dialog)
        font = QtGui.QFont()
        font.setFamily("SimHei")
        font.setPointSize(12)
        self.plainTextEdit_2.setFont(font)
        self.plainTextEdit_2.setObjectName("plainTextEdit_2")
        self.horizontalLayout_2.addWidget(self.plainTextEdit_2)
        self.verticalLayout.addLayout(self.horizontalLayout_2)
        self.horizontalLayout = QtWidgets.QHBoxLayout()
        self.horizontalLayout.setObjectName("horizontalLayout")
        self.pushButton = QtWidgets.QPushButton(Dialog)
        font = QtGui.QFont()
        font.setFamily("SimHei")
        font.setPointSize(16)
        self.pushButton.setFont(font)
        self.pushButton.setObjectName("pushButton")
        self.horizontalLayout.addWidget(self.pushButton)
        self.pushButton_2 = QtWidgets.QPushButton(Dialog)
        font = QtGui.QFont()
        font.setFamily("SimHei")
        font.setPointSize(16)
        self.pushButton_2.setFont(font)
        self.pushButton_2.setObjectName("pushButton_2")
        self.horizontalLayout.addWidget(self.pushButton_2)
        self.verticalLayout.addLayout(self.horizontalLayout)
        self.verticalLayout_2.addLayout(self.verticalLayout)
        self.retranslateUi(Dialog)
        QtCore.QMetaObject.connectSlotsByName(Dialog)
        self.pushButton.clicked.connect(self.my_func)

    def retranslateUi(self, Dialog):
        """
        Draw the buttons of the current window
        :param Dialog: the current dialog window
        """
        _translate = QtCore.QCoreApplication.translate
        Dialog.setWindowTitle(_translate("Dialog", "Dialog"))
        self.pushButton.setText(_translate("Dialog", "确定"))  # "OK"
        self.pushButton_2.setText(_translate("Dialog", "清空"))  # "Clear"
        self.pushButton_2.clicked.connect(self.plainTextEdit.clear)
        self.pushButton_2.clicked.connect(self.plainTextEdit_2.clear)

    def my_func(self):
        """
        Call each functional module
        """
        stopword_lists = upload_stopword()
        mail = self.plainTextEdit.toPlainText()
        mail_handle = handle_mail(mail, stopword_lists)
        res = predict_res(mail_handle)
        if res[0] == 1:
            show_res = "这是垃圾邮件"  # "This is spam"
        else:
            show_res = "这不是垃圾邮件"  # "This is not spam"
        self.plainTextEdit_2.setPlainText(show_res)


if __name__ == "__main__":
    app = QtWidgets.QApplication(sys.argv)
    Dialog = QtWidgets.QDialog()
    ui = Ui_Dialog()
    ui.setupUi(Dialog)
    Dialog.show()
    sys.exit(app.exec_())

Summary

  The above is the complete development tutorial for the spam classification system project based on Naive Bayes. If readers have any questions, please feel free to private message or leave a comment, and if you have good ideas for the open problem above, do contact me. This blog ends here; see you in the next one!

Origin: blog.csdn.net/IronmanJay/article/details/130378675