Sesame HTTP: Notes on a pitfall in scikit-learn naive Bayes text classification

The basic steps:

1. Prepare the training corpus by category.

I followed the directory structure used in the official examples: one subdirectory per category under the data directory, and inside each, one txt file per article.

Note that the categories should contain roughly comparable numbers of documents (adjust as needed based on training results). If the proportions are too lopsided, the classifier becomes biased toward the majority class: put simply, it will assign most articles to whichever category has the most training material.
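The layout described above, plus a quick balance check before training, can be sketched like this (the category names here are hypothetical, and a throwaway corpus is built in a temp directory so the snippet is self-contained):

```python
import os
import tempfile
from collections import Counter

# Create a throwaway corpus in the layout load_files expects:
# <root>/<category_name>/<article>.txt, one txt file per article.
root = tempfile.mkdtemp()
for category, n_articles in {'positive': 3, 'negative': 3, 'neutral': 2}.items():
    os.makedirs(os.path.join(root, category))
    for i in range(n_articles):
        path = os.path.join(root, category, f'{i}.txt')
        with open(path, 'w', encoding='utf-8') as f:
            f.write('示例文本')

# Count articles per category to spot an unbalanced corpus before training.
counts = Counter({c: len(os.listdir(os.path.join(root, c)))
                  for c in os.listdir(root)})
print(sorted(counts.items()))
# -> [('negative', 3), ('neutral', 2), ('positive', 3)]
```

If one category dominates the counts, trim it or collect more material for the others before training.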

Enough preamble; on to the code (the test code is rough, bear with it).

One small helper package is needed: pip install chinese-tokenizer

Here is the trainer:

import json
from chinese_tokenizer.tokenizer import Tokenizer
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
import joblib  # in older scikit-learn: from sklearn.externals import joblib

jie_ba_tokenizer = Tokenizer().jie_ba_tokenizer

# Load the dataset: one subdirectory per category under ./data
training_data = load_files('./data', encoding='utf-8')
# x_train holds the txt contents, y_train the categories (positive/negative/neutral)
x_train, _, y_train, _ = train_test_split(training_data.data, training_data.target)
print('Building the model...')
# Save the category names so the prediction script can map indices back to labels
with open('training_data.target', 'w', encoding='utf-8') as f:
    f.write(json.dumps(training_data.target_names))
# The tokenizer parameter is the function used to segment the text
# (the jieba-based tokenizer defined above)
count_vect = CountVectorizer(tokenizer=jie_ba_tokenizer)

tfidf_transformer = TfidfTransformer()
X_train_counts = count_vect.fit_transform(x_train)

X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
print('Training the classifier...')
# Train a multinomial naive Bayes classifier
clf = MultinomialNB().fit(X_train_tfidf, y_train)
# Save the classifier (so it can be used from other programs)
joblib.dump(clf, 'model.pkl')
# Save the vectorizer (here is the pit!! Prediction must use the SAME
# vectorizer the model was trained with, otherwise you get
# "ValueError: dimension mismatch")
joblib.dump(count_vect, 'count_vect')
print('Classifier details:')
print(clf)
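Incidentally, saving two separate artifacts can be avoided entirely: scikit-learn's Pipeline bundles the vectorizer, the tf-idf step, and the classifier into one object, so a single joblib.dump keeps everything in sync. A minimal sketch with a toy English corpus (for Chinese text you would pass tokenizer=jie_ba_tokenizer to CountVectorizer, as above):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus: class 0 = positive-ish words, class 1 = negative-ish words
texts = ['good great fine', 'bad awful poor', 'good nice', 'bad sad']
labels = [0, 1, 0, 1]

# One object holds vectorizer, tf-idf, and classifier together,
# so they can never fall out of sync when saved and reloaded.
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])
pipeline.fit(texts, labels)
print(pipeline.predict(['good fine', 'awful sad']))  # -> [0 1]
```

joblib.dump(pipeline, 'pipeline.pkl') then saves the whole chain, and the prediction script only has to load and call predict on raw strings.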

Here is an example of classifying articles with the trained classifier.

The articles to classify go in the predict_data directory, again one article per txt file:


# -*- coding: utf-8 -*-
# @Time    : 2017/8/23 18:02
# @Author  : 哎哟卧槽
# @Site    : 
# @File    : 贝叶斯分类器.py
# @Software: PyCharm

import json
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfTransformer
import joblib  # in older scikit-learn: from sklearn.externals import joblib


# Load the trained classifier
clf = joblib.load('model.pkl')
# Load the SAME vectorizer that was used for training; this is what
# avoids "ValueError: dimension mismatch"
count_vect = joblib.load('count_vect')
testing_data = load_files('./predict_data', encoding='utf-8')
target_names = json.loads(open('training_data.target', 'r', encoding='utf-8').read())

# NOTE: strictly speaking, the TfidfTransformer fitted during training
# should also be saved and reloaded just like count_vect; fitting a
# fresh one here recomputes the idf weights on the prediction set
tfidf_transformer = TfidfTransformer()

X_new_counts = count_vect.transform(testing_data.data)
X_new_tfidf = tfidf_transformer.fit_transform(X_new_counts)
# Run the prediction
predicted = clf.predict(X_new_tfidf)
for title, category in zip(testing_data.filenames, predicted):
    print('%r => %s' % (title, target_names[category]))

With the vectorizer saved and reloaded this way, the trained classifier can be used from a new program without raising ValueError: dimension mismatch.
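The error message itself is worth unpacking. A stdlib-only sketch of what goes wrong when the vectorizer is rebuilt at prediction time:

```python
# A bag-of-words vector has one column per vocabulary term, so a
# vectorizer refit on the prediction data produces vectors of a
# different width than the trained model expects.
def build_vocab(docs):
    return sorted({word for doc in docs for word in doc.split()})

def vectorize(doc, vocab):
    words = doc.split()
    return [words.count(term) for term in vocab]

train_docs = ['好 文章 推荐', '差 文章 吐槽']
new_doc = '新 文章 好'

train_vocab = build_vocab(train_docs)   # the vocabulary the model was fit on
new_vocab = build_vocab([new_doc])      # refit on new data -- the mistake

print(len(vectorize(new_doc, train_vocab)))  # -> 5 columns: matches the model
print(len(vectorize(new_doc, new_vocab)))    # -> 3 columns: dimension mismatch
```

The saved count_vect plays the role of train_vocab here: reusing it guarantees every new document is mapped into exactly the feature space the classifier was trained on.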
