Naive Bayes - Spam Filtering

Document Classification Using Naive Bayes

1. Get the data set

Download the dataset, which contains a number of email documents. The Ham folder holds normal (friendly) mail, and the Spam folder holds junk mail.
In the dataset, the phrases in each email are separated by spaces, which makes it easy to split them apart later and build a vocabulary.

2. Segment text

Each email contains many phrases separated by spaces; these phrases need to be split out and collected into a word list.

def cut_sentences(content):   # split the text string `content` into phrases
    end_flag = ['?', '!', '.', '?', '!', '。', ' ']   # end-of-phrase symbols, both English and Chinese
    content_len = len(content)
    sentences = []   # list that stores each phrase
    tmp_char = ''
    for idx, char in enumerate(content):
        tmp_char += char   # accumulate characters
        if (idx + 1) == content_len:   # reached the last character
            sentences.append(tmp_char.strip().replace('\ufeff', ''))
            break
        if char in end_flag:   # is this character an end symbol?
            # if the next character is not also an end symbol, cut the phrase here
            next_idx = idx + 1
            if not content[next_idx] in end_flag:
                sentences.append(tmp_char.strip().replace('\ufeff', ''))
                tmp_char = ''
    sentences = list(set(sentences))
    return [tok for tok in sentences if len(tok) > 1]

words = cut_sentences(open('TrainingSet/Ham/{}.txt'.format(2)).read())
print(words)

For example, splitting one email in this way produces a word list of its phrases.
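
As a rough illustration (a made-up sample string, not from the original post), cut_sentences splits on the end symbols and spaces, removes duplicates via set(), and drops single-character tokens:

# Hypothetical input; the order of the output varies because set() is unordered.
sample = '发票 代开 优惠 ! 发票 联系 我们 。'
print(cut_sentences(sample))
# possible output: ['代开', '优惠 !', '联系', '发票', '我们 。']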

3. Build the vocabulary and class labels

In the following function, posting_list stores the word list of each email and class_vec stores the class of the word list at the same index: 1 stands for friendly (ham) emails and 0 for unfriendly (spam) emails. The function returns the list of word lists and the list of class labels.

# get the word lists and their class labels
def load_data_set():
    # word lists and corresponding classes
    posting_list = []
    class_vec = []
    for i in range(2, 10):
        # word list of a friendly (ham) email
        words = cut_sentences(open('TrainingSet/Ham/{}.txt'.format(i)).read())
        posting_list.append(words)
        class_vec.append(1)
        # word list of an unfriendly (spam) email
        words = cut_sentences(open('TrainingSet/Spam/{}.txt'.format(i)).read())
        posting_list.append(words)
        class_vec.append(0)
    return posting_list, class_vec
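
A quick check of what load_data_set returns (a small sketch, assuming the TrainingSet/Ham and TrainingSet/Spam folders described above):

# Inspect the structure returned by load_data_set (assumes the folder layout above).
posting_list, class_vec = load_data_set()
print(len(posting_list))    # 16 word lists: 8 ham emails and 8 spam emails
print(class_vec)            # [1, 0, 1, 0, ...] alternating ham/spam labels
print(posting_list[0][:5])  # first few phrases of the first ham email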

The create_vocab_list function merges the word lists returned by load_data_set into one big vocabulary. Passing each word list through set() and taking the union yields a vocabulary without duplicates.

# merge all word lists into one vocabulary
def create_vocab_list(data_set):
    vocab_set = set()
    for item in data_set:
        # | takes the union of two sets
        vocab_set = vocab_set | set(item)
    return list(vocab_set)

posting_list, class_vec = load_data_set()
print(create_vocab_list(posting_list))

The create_vocab_list function above returns the big vocabulary that contains every word. The set_of_words2vec function takes this big vocabulary and the word list of a single email: it creates a vector as long as the vocabulary with all elements set to 0, walks through the words of the email, and sets the corresponding position of the output vector to 1 whenever a word from the email appears in the vocabulary.

def set_of_words2vec(vocab_list, input_set):
    # create a vector as long as the vocabulary, with all elements set to 0
    result = [0] * len(vocab_list)
    # walk through the words of the document; if a word appears in the
    # vocabulary, set the corresponding position of the output vector to 1
    for word in input_set:
        if word in vocab_list:
            result[vocab_list.index(word)] = 1
    return result
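
A tiny illustration with a made-up three-word vocabulary (not part of the original post):

# Hypothetical vocabulary and documents, only to show the 0/1 encoding.
vocab = ['发票', '会议', '报销']
print(set_of_words2vec(vocab, ['发票', '报销']))   # [1, 0, 1]
print(set_of_words2vec(vocab, ['你好']))           # [0, 0, 0]  (word not in the vocabulary)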


4. Build a classifier

Using the formula:

P(c_{i} \mid \mathbf{w}) = \dfrac{P(\mathbf{w} \mid c_{i}) \, P(c_{i})}{P(\mathbf{w})}
First, P(ci) can be computed by dividing the number of emails of class i (friendly or unfriendly) by the total number of emails. Next P(w|ci) has to be computed, and this is where the naive Bayes assumption comes in: if w is expanded into individual features, the probability becomes P(w0,w1,w2...wn|ci). Assuming that all words are independent of each other, known as the conditional independence assumption, this probability can be computed as P(w0|ci)P(w1|ci)P(w2|ci)...P(wn|ci).
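
Written out (my own restatement of the step above), the factorization and the resulting log-space score are:

P(\mathbf{w} \mid c_i) = \prod_{j=0}^{n} P(w_j \mid c_i),
\qquad
\log P(c_i \mid \mathbf{w}) \;\propto\; \log P(c_i) + \sum_{j=0}^{n} \log P(w_j \mid c_i)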

The pseudocode of this function is as follows:

compute the number of documents in each class
for every training document:
        for each class:
                if a token appears in the document → increment the count for that token
                increment the count of all tokens
        for each class:
                for each token:
                        divide the token count by the total token count to get the conditional probability
return the conditional probability for each class

To compute P(wi|c0) and P(wi|c1), the numerator and denominator variables in the program have to be initialized first. Whenever a word appears in a document of a given class (friendly or unfriendly), the count for that word (p0num or p1num) is increased by 1, and the total word count for that class (p0num_all or p1num_all) is increased by the number of words in the document.


import numpy as np

def train_naive_bayes(train_mat, train_category):
    train_doc_num = len(train_mat)
    words_num = len(train_mat[0])
    # documents of class 1 are labeled 1, so summing the labels gives their count;
    # dividing by the total number of documents gives the prior probability of class 1
    pos_abusive = np.sum(train_category) / train_doc_num
    # word counts, initialized to 1, and totals initialized to 2
    # so that unseen words do not produce zero probabilities
    p0num = np.ones(words_num)
    p1num = np.ones(words_num)
    p0num_all = 2.0
    p1num_all = 2.0

    for i in range(train_doc_num):
        # for every document, accumulate the word counts of its class
        if train_category[i] == 1:
            p1num += train_mat[i]
            p1num_all += np.sum(train_mat[i])
        else:
            p0num += train_mat[i]
            p0num_all += np.sum(train_mat[i])
    # take logs of the conditional probabilities to avoid underflow later
    p1vec = np.log(p1num / p1num_all)
    p0vec = np.log(p0num / p0num_all)
    return p0vec, p1vec, pos_abusive
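
The initialization with np.ones and 2.0 is a simplified form of Laplace (add-one) smoothing, so that a word that never occurs in one class does not force the whole product to zero; in formula form the code estimates

P(w_j \mid c) = \frac{\mathrm{count}(w_j, c) + 1}{\mathrm{total\ words\ in\ } c + 2}

(full Laplace smoothing would add the vocabulary size to the denominator instead of 2).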

classify_naive_bayes takes vec2classify, the 0/1 vector of the document to classify [0,1,1,1,1…], p0vec and p1vec, the log-probability vectors of class 0 and class 1 returned by training, and p_class1, the prior probability of class 1. Instead of multiplying probabilities, P(C|F1F2...Fn) = P(F1F2...Fn|C)P(C)/P(F1F2...Fn) with P(F1|C)*P(F2|C)*...*P(Fn|C)*P(C), the function adds their logarithms: log(P(F1|C)) + log(P(F2|C)) + ... + log(P(Fn|C)) + log(P(C)). The NumPy expression vec2classify * p1vec multiplies the two vectors element by element, so only the log-probabilities of the words that actually appear in the document contribute; summing them and adding the log prior gives the log score of the document for that class, and the class with the larger score is returned.

def classify_naive_bayes(vec2classify, p0vec, p1vec, p_class1):
    # log-space score of each class: sum of log word probabilities plus log prior
    p1 = np.sum(vec2classify * p1vec) + np.log(p_class1)
    p0 = np.sum(vec2classify * p0vec) + np.log(1 - p_class1)
    if p1 > p0:
        return 1
    else:
        return 0
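
A small made-up example of the decision rule (all numbers are arbitrary, only to show the mechanics):

# Hypothetical log-probability vectors over a 3-word vocabulary; the document
# contains the 1st and 3rd word, and both classes are assumed equally likely.
p0v_demo = np.log(np.array([0.2, 0.4, 0.4]))
p1v_demo = np.log(np.array([0.5, 0.1, 0.4]))
doc_demo = np.array([1, 0, 1])
print(classify_naive_bayes(doc_demo, p0v_demo, p1v_demo, 0.5))  # prints 1: class 1 scores higher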

5. Test algorithm

Feed the word list of a test file to the classifier to judge whether the file is friendly or unfriendly. For example, passing in part of the word lists of two files, the algorithm can decide whether each email is friendly or unfriendly. The word list ['一番', '卑劣', '某人', '冷静', '以为', '真的', '不能', '有着', '这件', '清华', '心情 。', '一起 。', '解释',] is taken from a friendly file and ['邮箱', '地税', '查证', '树立', '较多现', '广州', '我司', '大小', '如贵司', '额度', '核心思想', '电脑', '真票 !', '每月', '洽商'] from an unfriendly file, and the classes predicted by the algorithm are both correct.

def testing_naive_bayes():
    # 1. load the data set
    list_post, list_classes = load_data_set()
    # 2. create the vocabulary
    vocab_list = create_vocab_list(list_post)

    # 3. build the training matrix of 0/1 word-occurrence vectors
    train_mat = []
    for post_in in list_post:
        train_mat.append(
            # each row has length len(vocab_list) and records 0/1 occurrence
            set_of_words2vec(vocab_list, post_in)
        )
    # 4. train the classifier
    p0v, p1v, p_abusive = train_naive_bayes(np.array(train_mat), np.array(list_classes))
    # 5. test it
    test_one = ['一番', '卑劣', '某人', '冷静', '以为', '真的', '不能', '有着', '这件', '清华', '心情 。', '一起 。', '解释',]
    print(test_one)
    test_one_doc = np.array(set_of_words2vec(vocab_list, test_one))
    print('The class of this email is: {}'.format(classify_naive_bayes(test_one_doc, p0v, p1v, p_abusive)))
    test_two = ['邮箱', '地税', '查证', '树立', '较多现', '广州', '我司', '大小', '如贵司', '额度', '核心思想', '电脑', '真票 !', '每月', '洽商']
    print(test_two)
    test_two_doc = np.array(set_of_words2vec(vocab_list, test_two))
    print('The class of this email is: {}'.format(classify_naive_bayes(test_two_doc, p0v, p1v, p_abusive)))
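
Running the test (the printed classes should be 1 for the first word list and 0 for the second, as described above):

# Run the document-classification test end to end.
testing_naive_bayes()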


Spam Filtering Using Naive Bayes

1. Import data set

The data set contains about 8,500 friendly and unfriendly emails.

2. Spam prediction

The spam prediction mainly reuses the Bayesian classifier from the document-classification example above: the class returned by the classifier is compared with the actual class of each email, and from that the error ratio can be computed. There are 1,000 friendly and unfriendly files here; after turning these emails into word vectors and probabilities, 100 are randomly selected from all the files for prediction, and finally the predicted results are printed.

import random

def spam_test():
    """
    Automated test of the naive Bayes spam classifier.
    :return: nothing
    """
    doc_list = []
    class_list = []
    full_text = []
    for i in range(1, 51):
        # add a friendly (ham) email
        try:
            words = cut_sentences(open('TrainingSet/Ham/{}.txt'.format(i)).read())
        except:
            words = cut_sentences(open('TrainingSet/Ham/{}.txt'.format(i), encoding='Windows 1252').read())
        doc_list.append(words)
        full_text.extend(words)
        class_list.append(1)
        # add an unfriendly (spam) email
        try:
            words = cut_sentences(open('TrainingSet/Spam/{}.txt'.format(i)).read())
        except:
            words = cut_sentences(open('TrainingSet/Spam/{}.txt'.format(i), encoding='Windows 1252').read())
        doc_list.append(words)
        full_text.extend(words)
        class_list.append(0)
    # create the vocabulary
    vocab_list = create_vocab_list(doc_list)

    # randomly split the 100 documents into 50 test and 50 training documents
    test_set = [int(num) for num in random.sample(range(100), 50)]
    training_set = list(set(range(100)) - set(test_set))

    training_mat = []
    training_class = []
    for doc_index in training_set:
        training_mat.append(set_of_words2vec(vocab_list, doc_list[doc_index]))
        training_class.append(class_list[doc_index])
    p0v, p1v, p_spam = train_naive_bayes(
        np.array(training_mat),
        np.array(training_class)
    )

    # run the test
    error_count = 0
    for doc_index in test_set:
        word_vec = set_of_words2vec(vocab_list, doc_list[doc_index])
        if classify_naive_bayes(
            np.array(word_vec),
            p0v,
            p1v,
            p_spam
        ) != class_list[doc_index]:
            error_count += 1
    print('Accuracy: {}'.format(
        1 - (error_count / len(test_set))
    ))
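
Calling the test (a sketch; the accuracy changes from run to run because random.sample draws a different test set each time):

# Repeat the spam test a few times to get a feel for how stable the accuracy is.
for _ in range(3):
    spam_test()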

Summary

The algorithm did not handle large data sets very well here. At first I used tens of thousands of emails as the data set, and the final result was particularly poor, with an error rate of 0.9. In the end I had to shrink the data set, using 500 friendly and 500 unfriendly files and drawing 100 of them in total for testing; the final accuracy was only 0.18.
Shrinking the data set further, to 50 friendly and 50 unfriendly files with 20 of them drawn in total for testing, the accuracy was still only 0.28. The likely reason is that each email in my data set contains too much text, and the Chinese phrases have little connection with one another.

Advantages of the Naive Bayes algorithm:

  • Naive Bayes models have stable classification efficiency.
  • They perform well on small-scale data, can handle multi-class tasks, and are suitable for incremental training: when the amount of data exceeds memory, the model can be trained batch by batch (see the sketch after this list).
  • They are not very sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification.
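
As a rough illustration of batch-by-batch training (not from the original post, and using scikit-learn's MultinomialNB rather than the hand-written classifier above):

# Minimal sketch of incremental naive Bayes training with scikit-learn.
# The data here is random and only stands in for batches read from disk.
import numpy as np
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
classes = np.array([0, 1])
for batch in range(3):
    X = np.random.randint(0, 2, size=(100, 50))   # 100 documents over a 50-word vocabulary
    y = np.random.randint(0, 2, size=100)         # 0 = unfriendly, 1 = friendly, as in the code above
    # the full set of classes must be given on the first partial_fit call
    clf.partial_fit(X, y, classes=classes)
print(clf.predict(np.random.randint(0, 2, size=(1, 50))))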

Disadvantages:

  • In theory, the Naive Bayes model has the smallest error rate compared with other classification methods, but this is not always true in practice, because the model assumes that the attributes are independent of each other given the output class. This assumption often does not hold in real applications; when the number of attributes is large or the correlation between attributes is strong, classification performance suffers. Naive Bayes performs best when the attribute correlation is small. For this reason, algorithms such as semi-naive Bayes improve on it moderately by taking some of the dependencies between attributes into account.
  • The prior probabilities must be known, and they often depend on modeling assumptions; since there can be many candidate prior models, prediction can sometimes be poor because of a badly chosen prior.
  • It is sensitive to the representation of the input data.

Code address:
Link: https://pan.baidu.com/s/1iXUTGRxISzWisOk_Wt6xXA?pwd=zrqx
Extraction code: zrqx
