Spam Email Classification with jieba + Naive Bayes

Introduction

I haven't dug very deeply into text-related topics such as natural language processing. This is an email classifier I wrote a while back; the method behind it is a very simple algorithm, and this style of processing is arguably the most common text-handling technique.

Writing this up is partly for my own records, and of course, if you are just getting started with machine learning or NLP, all the better if it gives you some help.

Dataset

  1. Spam emails

[screenshot: spam email samples]

  2. Ham (normal) emails

[screenshot: ham email samples]

There is also a test set:
[screenshot: test set samples]
These are partial screenshots of the dataset I used.

Code

Import the libraries

import jieba
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

Read the data

The stop-word file:

stopWordFilePath = 'stopword.txt'
stopList = [word.strip() for word in open(stopWordFilePath, encoding='utf-8').readlines()]
stopList[:4]
ham = []
spam = []
with open('ham_100.utf8', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        ham.append(line)
with open('spam_100.utf8', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        spam.append(line)

ham[:2], spam[:2]
## read the test set
test_data = []
with open('test.utf8', 'r', encoding='utf-8') as f:
    for line in f.readlines():
        test_data.append(line)
cut_test_data = [jieba.cut(item) for item in test_data]
test_data_fin = [' '.join(word) for word in cut_test_data]
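Since CountVectorizer splits tokens on whitespace and Chinese text has no natural word boundaries, each text is segmented with jieba and the tokens are re-joined with spaces. A minimal sketch of what that looks like (the sample sentence below is made up, not from the dataset):

# jieba.cut returns a generator of tokens; space-join them for sklearn
sample = '我们明天开会讨论项目进度'
print(' '.join(jieba.cut(sample)))
# e.g. 我们 明天 开会 讨论 项目 进度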

Tokenize and remove the stop words

cut_ham = [jieba.cut(sentence=str0) for str0 in ham]
train_ham = [' '.join(item) for item in cut_ham]
train_ham[:3]
cut_spam = [jieba.cut(str0) for str0 in spam]
train_spam = [' '.join(item) for item in cut_spam]
train_spam[:4]

Stop-word removal is delegated to the bag-of-words vectorizer via its stop_words argument. Note that the vectorizer is fitted on the training and test texts together, so words that appear only in the test set still get a column in the vocabulary.

count = CountVectorizer(stop_words = stopList)
train_X_tmp = train_spam + train_ham
count.fit(train_spam + train_ham + test_data_fin)
train_X = count.transform(train_X_tmp).toarray()

train_X.shape
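For intuition, here is a toy sketch (made-up English strings, not from the dataset) of what the vectorizer does: fit learns a vocabulary, and transform maps each document to a vector of word counts over that vocabulary.

# toy example, independent of the email data
toy_docs = ['free money now', 'meeting at noon', 'free meeting']
toy = CountVectorizer()
toy.fit(toy_docs)
print(toy.vocabulary_)                    # e.g. {'free': 1, 'money': 3, ...}
print(toy.transform(toy_docs).toarray())  # one row per document, one column per word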

Add labels: 1 = spam, 0 = not spam (ham).

train_y_spam = [1 for i in range(len(train_spam))]
train_y_ham = [0 for i in range(len(train_ham))]

Merge the datasets

# labels line up with train_X_tmp above: spam first, then ham
train_y = train_y_spam + train_y_ham
# vectorize the test set with the same fitted vocabulary
test_x = count.transform(test_data_fin).toarray()
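A quick sanity check of the shapes (exact numbers depend on your files, but with 100 ham and 100 spam emails you would expect 200 rows, and the column count must match between train and test):

print(train_X.shape, len(train_y), test_x.shape)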

Build the model

mnb = MultinomialNB()   # multinomial naive Bayes fits word-count features
mnb.fit(train_X, train_y)
y_pred = mnb.predict(test_x)
y_pred
# 1 = spam, 0 = ham; show the first 20 characters of each test email
for lab, (index, i) in zip(y_pred, enumerate(test_data)):
    print(lab, f"email #{index + 1}:", i[:20])

[screenshot: prediction output]
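The test file here is unlabelled, so the post stops at eyeballing the predictions. To get an actual accuracy number, one common approach is to hold out part of the labelled data for validation; a minimal sketch reusing the matrices built above (the 80/20 split and random seed are arbitrary choices of mine):

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X_tr, X_val, y_tr, y_val = train_test_split(
    train_X, train_y, test_size=0.2, random_state=42)
clf = MultinomialNB()
clf.fit(X_tr, y_tr)
print(accuracy_score(y_val, clf.predict(X_val)))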

Consider this a set of working notes; hope it helps us all keep improving~~

Reposted from blog.csdn.net/qq_40742298/article/details/105868511