Chinese NLP: Classifying Chinese Short Texts with Naive Bayes

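A brief introduction to naive Bayes classification:
Naive Bayes is a very simple classification algorithm, and its core idea is equally simple: given an item to classify, compute the probability of each class conditioned on that item, and assign the item to the class with the highest probability. (Strictly speaking, the "naive" in the name refers to the assumption that the features are conditionally independent given the class.) Put informally: you see a Black man on the street and I ask you to guess where he is from. Nine times out of ten you would guess Africa. Why? Because, among Black people, those from Africa make up the largest share. He could of course be from the Americas or Asia, but with no other information available we pick the class with the highest conditional probability. That is the idea underlying naive Bayes.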
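The decision rule above can be sketched in a few lines of Python. The counts are made-up illustrative numbers, and `posterior` and `classify` are hypothetical helper names rather than library functions: the first estimates P(class | observation) from raw co-occurrence counts, the second picks the class with the largest value.

```python
from collections import Counter

# Hypothetical toy data: (observed feature, class) pairs. The numbers are
# invented purely to illustrate the decision rule.
observations = (
    [("rainy", "stay_home")] * 8
    + [("rainy", "go_out")] * 2
    + [("sunny", "stay_home")] * 3
    + [("sunny", "go_out")] * 7
)

def posterior(feature, data):
    """Estimate P(class | feature) from raw counts."""
    counts = Counter(label for feat, label in data if feat == feature)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

def classify(feature, data):
    """Pick the class with the highest conditional probability."""
    probs = posterior(feature, data)
    return max(probs, key=probs.get)

print(posterior("rainy", observations))
print(classify("rainy", observations))
```

With these counts, 8 of the 10 "rainy" observations belong to `stay_home`, so that class is chosen even though `go_out` remains possible.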
Concrete steps:
[Figure: steps of naive Bayes classification; image not available]
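The usual training and prediction steps of a naive Bayes text classifier can be sketched as follows, assuming the common multinomial formulation with add-one (Laplace) smoothing: estimate the class priors P(c) and the per-class word probabilities P(w | c) from the training data, then score a new document by log P(c) plus the sum of log P(w | c) over its words, and return the best-scoring class. The class `TinyNB`, its methods, and the toy English documents are illustrative assumptions, not the code used later in this post.

```python
import math
from collections import Counter, defaultdict

class TinyNB:
    """Minimal multinomial naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)        # number of documents per class
        self.word_counts = defaultdict(Counter)    # word counts per class
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(doc.split())
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def log_scores(self, doc):
        n_docs = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            total = sum(self.word_counts[label].values())
            score = math.log(n / n_docs)           # log prior P(c)
            for word in doc.split():
                if word not in self.vocab:
                    continue                       # skip words never seen in training
                count = self.word_counts[label][word]
                # Smoothed log likelihood P(w | c); the words are treated as
                # independent given the class, which is the "naive" assumption.
                score += math.log((count + 1) / (total + len(self.vocab)))
            scores[label] = score
        return scores

    def predict(self, doc):
        scores = self.log_scores(doc)
        return max(scores, key=scores.get)

train_docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda attached",
    "lunch meeting tomorrow",
]
train_labels = ["spam", "spam", "ham", "ham"]
model = TinyNB().fit(train_docs, train_labels)
print(model.predict("buy cheap pills"))
print(model.predict("agenda for the meeting"))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps a word that never appeared in one class from forcing that class's probability to zero.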

Here we use naive Bayes to classify Chinese short texts:

import random
import jieba
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC

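# The pipeline has these steps: load the corpus, segment the text, remove stopwords,
# extract bag-of-words count features, build and train each model,
# evaluate it on the test set (accuracy), and compare the models

# Load the stopword list
stopwords = pd.read_csv('./NB_SVM/stopwords.txt', index_col=False, quoting=3, sep="\t", names=['stopword'], encoding='utf-8')
print("stopwords:\n", stopwords)
# Use a set so the membership test in preprocess_text is fast
stopwords = set(stopwords['stopword'].values)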

# Load the corpus: four CSV files, each already assigned to one category
laogong_df = pd.read_csv('./NB_SVM/beilaogongda.csv', encoding='utf-8', sep=',', index_col=[0])
laopo_df = pd.read_csv('./NB_SVM/beilaopoda.csv', encoding='utf-8', sep=',', index_col=[0])
erzi_df = pd.read_csv('./NB_SVM/beierzida.csv', encoding='utf-8', sep=',', index_col=[0])
nver_df = pd.read_csv('./NB_SVM/beinverda.csv', encoding='utf-8', sep=',', index_col=[0])
# Drop rows that contain NaN
laogong_df.dropna(inplace=True)
laopo_df.dropna(inplace=True)
erzi_df.dropna(inplace=True)
nver_df.dropna(inplace=True)
print("laogong_df:\n", laogong_df)
print("laopo_df:\n", laopo_df)
print("erzi_df:\n", erzi_df)
print("nver_df:\n", nver_df)
# Convert the segment column of each DataFrame to a list
laogong = laogong_df.segment.values.tolist()
laopo = laopo_df.segment.values.tolist()
erzi = erzi_df.segment.values.tolist()
nver = nver_df.segment.values.tolist()

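# Define preprocess_text, which segments each line and attaches a label
# content_lines: one of the lists produced above
# sentences: a list (initially empty) that collects the labeled samples
# category: the class label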
jieba.add_word("报警人")
jieba.add_word("防护装备")
jieba.add_word("防护设备")
jieba.suggest_freq(("人", "称"), tune=True)
def preprocess_text(content_lines, sentences, category):
    for line in content_lines:
        try:
            segs = jieba.lcut(line)
            segs = [v for v in segs if not str(v).isdigit()]  # drop pure-digit tokens
            segs = list(filter(lambda x: x.strip(), segs))  # drop empty or whitespace-only tokens
            segs = list(filter(lambda x: len(x) > 1, segs))  # drop single-character tokens
            segs = list(filter(lambda x: x not in stopwords, segs))  # drop stopwords
            sentences.append((" ".join(segs), category))  # attach the label
        except Exception:
            print(line)
            continue

# Call the function to build the labeled dataset
sentences = []
preprocess_text(laogong, sentences, 0)
preprocess_text(laopo, sentences, 1)
preprocess_text(erzi, sentences, 2)
preprocess_text(nver, sentences, 3)
# Shuffle the dataset so the classes are mixed instead of grouped by source file
random.shuffle(sentences)
# Print the first 10 samples to the console
for sentence in sentences[:10]:
    print(sentence[0], sentence[1])  # index 0 is the space-joined tokens, index 1 is the label

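# Extract bag-of-words features
# CountVectorizer is a common text feature extractor. For each document it only considers how often each term occurs in that document.
# It converts the texts into a term-count matrix; fit_transform learns the vocabulary and counts the occurrences of each term.
# analyzer: usually left at the default; accepts the strings 'word', 'char' or 'char_wb', or a callable such as a function
# max_features: defaults to None; if set to an int, terms are ranked by corpus-wide term frequency in descending order and only the top max_features form the vocabulary
vec = CountVectorizer(
    analyzer='word',  # tokenise by word (the texts are already segmented and space-joined)
    max_features=4000,  # keep the 4000 most frequent terms
)
# # Optionally add higher-order n-gram counts (2-grams, 3-grams, ...) to enlarge the vocabulary
# vec = CountVectorizer(
#     analyzer='word',  # tokenise by word
#     ngram_range=(1, 4),  # use word n-grams of size 1 to 4
#     max_features=20000,  # keep the 20000 most frequent n-grams
# )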

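# Split the samples into texts (x) and labels (y)
x, y = zip(*sentences)
# random_state is the seed of the random number generator.
# With a fixed integer seed (for example 1 or 123, and 0 as well) and all other arguments unchanged,
# the split is identical on every run, which makes experiments reproducible.
# Leaving it unset (None) gives a different split on each run.
# stratify preserves the class distribution that existed before the split; it is typically used when the classes are imbalanced.
# It takes an array of class labels: with stratify=y, the training and test sets keep the class proportions found in y.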
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=123, stratify=y)
print(len(x_train))
print(len(x_test))
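# Fit the bag-of-words vocabulary on the training data
vec.fit(x_train)
print(vec.get_feature_names_out())  # needs scikit-learn >= 1.0; get_feature_names() was removed in 1.2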
print(vec.vocabulary_)
print(vec.transform(x_train).toarray())

# Build and train the model
classifier = MultinomialNB()  # naive Bayes model
classifier.fit(vec.transform(x_train), y_train)
# Evaluate: score() returns the mean accuracy on the test set (not AUC)
print(classifier.score(vec.transform(x_test), y_test))
# Predict the labels of the test set
pre = classifier.predict(vec.transform(x_test))
print(pre)

# Swap in a different model:
# train an SVM
svm = SVC(kernel='linear')
svm.fit(vec.transform(x_train), y_train)
print(svm.score(vec.transform(x_test), y_test))

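# To improve accuracy further, there are two main directions to try:
#   Feature construction: besides the bag-of-words model, consider word2vec, doc2vec, etc.;
#   Models: other supervised classifiers, ensemble learning, neural networks, etc.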
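
The script above reports accuracy through `score()`. If an AUC figure is also wanted for the four-class problem, one common choice is the macro-averaged one-vs-rest AUC. The sketch below computes it in pure Python on made-up probabilities; `binary_auc`, `macro_ovr_auc` and `accuracy` are hypothetical helper names, and the toy labels and scores are invented. In scikit-learn, `roc_auc_score(y_true, y_proba, multi_class='ovr')` fed with the output of `classifier.predict_proba` is the usual way to obtain this kind of score.

```python
def binary_auc(labels, scores):
    """AUC as the probability that a random positive outranks a random negative.

    A tie between a positive and a negative score counts as half a win.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def macro_ovr_auc(y_true, y_proba, classes):
    """Average the one-vs-rest AUC over all classes."""
    aucs = []
    for k, cls in enumerate(classes):
        labels = [1 if y == cls else 0 for y in y_true]
        scores = [row[k] for row in y_proba]   # predicted probability of class k
        aucs.append(binary_auc(labels, scores))
    return sum(aucs) / len(aucs)

def accuracy(y_true, y_proba, classes):
    """Fraction of samples whose most probable class is the true class."""
    preds = [classes[max(range(len(row)), key=row.__getitem__)] for row in y_proba]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)

# Invented example: six samples, four classes, one probability row per sample.
classes = [0, 1, 2, 3]
y_true = [0, 1, 2, 3, 0, 1]
y_proba = [
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.2, 0.6, 0.1],
    [0.1, 0.1, 0.2, 0.6],
    [0.3, 0.4, 0.2, 0.1],  # true class 0, but class 1 has the highest probability
    [0.1, 0.6, 0.2, 0.1],
]
print("accuracy:", accuracy(y_true, y_proba, classes))
print("macro OvR AUC:", macro_ovr_auc(y_true, y_proba, classes))
```

On this toy data one sample is misclassified, so accuracy is 5/6, while the AUC is 1.0 because every class still ranks its own samples above all others. The two metrics answer different questions, which is why they should not be reported under the same name.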

The original article includes the corpus:
https://soyoger.blog.csdn.net/article/details/108729408
