Python machine learning (1)

  I have started a new column explaining some examples of Python machine learning. This time I will cover Chinese mail classification with the Naive Bayes algorithm.

Chinese Mail Classification Based on Naive Bayes Algorithm

1. Principle of Naive Bayes Algorithm

  Bayes' theorem: calculates the probability of one event based on the known probability of another event that has already occurred.

  Naivety: throughout the whole process, only the most primitive and simple assumptions are made; for example, it is assumed that the features are mutually independent and equally important.

  Simple logic: when classifying an unknown sample, this algorithm calculates the probability that the sample belongs to each known category and then selects the category with the maximum probability as the classification result.

  Description: The Naive Bayes classifier originated in classical mathematical theory; it has a solid mathematical foundation and stable classification efficiency. The Bayesian method builds on Bayes' theorem and uses probability and statistics to classify a sample data set, with a very low false-positive rate. Its characteristic is to combine the prior probability with the posterior probability, which avoids both the subjective bias of using the prior probability alone and the overfitting that comes from using sample information alone. On large data sets it shows high accuracy.

  The Naive Bayes method simplifies the Bayesian algorithm by assuming that, given the target value, the attributes are mutually independent and contribute comparably. This greatly reduces the complexity of the Bayesian method, at some cost in classification performance.
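The decision rule described above can be illustrated with a small sketch. The word counts and class priors below are invented purely for illustration; with add-one (Laplace) smoothing, the score of each class is its prior times the product of the per-word conditional probabilities, and the class with the larger score wins:

```python
# Toy illustration of the naive Bayes decision rule.
# All counts and priors below are made up for illustration only.

# Word frequencies observed in a tiny hypothetical training set
spam_word_counts = {'discount': 30, 'invoice': 10, 'meeting': 2}
ham_word_counts  = {'discount': 3,  'invoice': 12, 'meeting': 25}
spam_total, ham_total = 42, 40
p_spam, p_ham = 0.5, 0.5  # prior probabilities of each class

def score(words, counts, total, prior):
    # P(class) * product of P(word | class), with add-one smoothing
    vocab = 3  # vocabulary size used for smoothing
    p = prior
    for w in words:
        p *= (counts.get(w, 0) + 1) / (total + vocab)
    return p

mail = ['discount', 'invoice']
s = score(mail, spam_word_counts, spam_total, p_spam)
h = score(mail, ham_word_counts, ham_total, p_ham)
print('spam' if s > h else 'ham')  # prints 'spam'
```

Because the word probabilities are multiplied independently, each extra word simply contributes one more factor; this is exactly the simplification the independence assumption buys.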

2. Project Introduction

  This project uses the Naive Bayes algorithm to classify Chinese mail into spam and non-spam. The most frequent effective words are collected, and the number of times each of these words occurs in a given mail is counted to construct a feature vector; these vectors serve as the training set used to train the classifier and classify emails.

3. Project steps

(1) Collect enough spam and non-spam content from the mailbox as a training set.

(2) Read the whole training set and delete interfering characters such as 【】, *, digits, and punctuation marks; then segment each text into words and delete words of length 1, since such single characters contribute nothing to text classification. The remaining words are considered valid words.
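As a small sketch of this preprocessing step: the regular expression below is the same pattern used in the project code. In the real project the cleaned text is segmented by jieba; here a plausible segmentation is hard-coded so the example stays self-contained:

```python
from re import sub

line = '【特价】机票,8折!'  # a sample line with interfering characters
# Remove interfering/invalid characters (same pattern as in the project)
cleaned = sub(r'[.【】0-9、—。,!~\*]', '', line)
print(cleaned)  # 特价机票折

# In the real project jieba.cut segments the cleaned text; here we
# hard-code a plausible segmentation to keep the sketch self-contained.
tokens = ['特价', '机票', '折']
# Drop single-character words, which carry little classification signal
valid = [w for w in tokens if len(w) > 1]
print(valid)  # ['特价', '机票']
```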

(3) Count the number of occurrences of each valid word across the whole training set and keep the top N most frequent words (N can be adjusted according to actual conditions).

(4) For each preprocessed spam and non-spam mail from step (2), generate a feature vector by counting how often each of the N words obtained in step (3) occurs in the mail. Each email corresponds to one feature vector of length N, where each component gives the number of times the corresponding word appears in that email. For example, the feature vector [3, 0, 0, 5] means that the first word appears 3 times in this email, the second and third words do not appear, and the fourth word appears 5 times.
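The mapping from a mail's words to its count vector can be sketched as follows; the top-word list and the mail's word list here are made up to reproduce the [3, 0, 0, 5] example:

```python
# Hypothetical top-N word list (N = 4) and the valid words of one mail
top_words = ['发票', '优惠', '会议', '免费']
mail_words = ['发票'] * 3 + ['免费'] * 5

# One component per top word: how often it occurs in this mail
vector = [mail_words.count(w) for w in top_words]
print(vector)  # [3, 0, 0, 5]
```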

(5) Create and train a Naive Bayes model based on the feature vector obtained in step 4 and the classification of known emails.

(6) Read the test email and refer to step 2 to preprocess the email text and extract feature vectors.

(7) Use the model trained in step 5 to classify emails based on the feature vector extracted in step 6.

4. Code

  Import various libraries:

from re import sub
from os import listdir
from collections import Counter
from itertools import chain
from numpy import array
from jieba import cut
from sklearn.naive_bayes import MultinomialNB
import joblib  # used later to save and load the trained model

  Get all the valid words from a single email:

def getWordsFromFile(txtFile):
    # Get all the words in one email
    words = []
    # All the text files storing mail contents use UTF-8 encoding
    with open(txtFile, encoding='utf8') as fp:
        for line in fp:
            # Iterate over each line, stripping surrounding whitespace
            line = line.strip()
            # Filter out interfering or invalid characters
            line = sub(r'[.【】0-9、—。,!~\*]', '', line)
            # Word segmentation
            line = cut(line)
            # Drop words of length 1
            line = filter(lambda word: len(word)>1, line)
            # Append this line's preprocessed words to the words list
            words.extend(line)
    # Return the list of all valid words in this email's text
    return words

  Train the model and save the results:

# Words from all files; each element is a sub-list holding
# all the words from one file
allWords = []
def getTopNWords(topN):
    # Process all text files in the current folder in file-number order
    # The training set holds 151 mails: 0.txt to 126.txt are spam,
    # 127.txt to 150.txt are normal mail
    txtFiles = [str(i)+'.txt' for i in range(151)]
    # Collect all words from every mail in the training set
    for txtFile in txtFiles:
        allWords.append(getWordsFromFile(txtFile))
    # Return the topN most frequent words
    freq = Counter(chain(*allWords))
    return [w[0] for w in freq.most_common(topN)]

# The 600 most frequent words in the whole training set
topWords = getTopNWords(600)

# Build feature vectors: the number of times each of the top 600
# words occurs in each mail
vectors = []
for words in allWords:
    temp = list(map(lambda x: words.count(x), topWords))
    vectors.append(temp)
vectors = array(vectors)
# Label of each mail in the training set: 1 = spam, 0 = normal mail
labels = array([1]*127 + [0]*24)

# Create the model and train it on the known training set
model = MultinomialNB()
model.fit(vectors, labels)

# Save the model and the top words
joblib.dump(model, "垃圾邮件分类器.pkl")
print('Model and training results saved.')
with open('topWords.txt', 'w', encoding='utf8') as fp:
    fp.write(','.join(topWords))
print('topWords saved.')

  Load and use the training results:

model = joblib.load("垃圾邮件分类器.pkl")
print('Model and training results loaded.')
with open('topWords.txt', encoding='utf8') as fp:
    topWords = fp.read().split(',')

def predict(txtFile):
    # Read the given mail file and return its classification
    words = getWordsFromFile(txtFile)
    currentVector = array(tuple(map(lambda x: words.count(x),
                                    topWords)))
    result = model.predict(currentVector.reshape(1, -1))[0]
    return 'spam' if result==1 else 'normal mail'

# 151.txt to 155.txt are test mails
for mail in ('%d.txt'%i for i in range(151, 156)):
    print(mail, predict(mail), sep=':')

5. Results


Origin www.cnblogs.com/ITXiaoAng/p/12732796.html