Naive Bayes text classification in practice: importing data from files to reach an 84% accuracy rate

Process:

Step 1: Import the files

Step 2: Word segmentation

Step 3: Remove stop words

Step 4: tf-idf screening

Step 5: Chi-square screening

Step 6: Training and prediction

Steps 1, 2, 3

Importing files is very simple; just note that on Windows, if the path contains Chinese characters, you need to convert the path name with unicode(path, 'utf8').
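As a minimal Python 2 sketch of that conversion (reusing the directory name from the code below as an example):

# -*- coding: utf-8 -*-
import os
# On Windows, decode the byte-string path to unicode before
# handing it to the os functions:
rootpath = unicode('../转换后的文件', 'utf8')
print os.listdir(rootpath)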

The files contain many runs of consecutive spaces and line breaks, so a regular expression is used to replace each run with a single space.

Digits (useless here) and Chinese/English punctuation carry no information, so they are filtered out.

They could also be added to the stop-word list and then filtered out all together.

When segmenting with jieba, spaces can come out as tokens, so all spaces are filtered out as well.

# -*- coding: utf-8 -*-
import jieba
import os
import re
import time
import string
rootpath = "../转换后的文件"
os.chdir(rootpath)
# stop words and global containers
words_list = []                               # per-file space-joined token strings
filename_list = []                            # corresponding file names
category_list = []                            # corresponding category names
all_words = {}                                # full vocabulary {'key': value}
stopwords = {}.fromkeys([line.rstrip() for line in open('../stopwords.txt')])
category = os.listdir(rootpath)               # list of category directories
delEStr = string.punctuation + ' ' + string.digits   # punctuation, spaces, digits to delete
identify = string.maketrans('', '')           # identity translation table (Python 2)
#####################################
#  Segment words, build vocabulary  #
#####################################
def fileWordProcess(contents):
    wordsList = []
    contents = re.sub(r'\s+', ' ', contents)  # collapse runs of whitespace into one space
    contents = re.sub(r'\n', ' ', contents)   # newline -> space
    contents = re.sub(r'\t', ' ', contents)   # tab -> space
    contents = contents.translate(identify, delEStr)  # delete punctuation and digits (Python 2)
    for seg in jieba.cut(contents):
        seg = seg.encode('utf8')
        if seg not in stopwords:              # remove stop words
            if seg != ' ':                    # remove spaces
                wordsList.append(seg)         # build the file's token list
    file_string = ' '.join(wordsList)
    return file_string
 
for categoryName in category:              # loop over category directories; on OS X the first entry is a system file
    if categoryName == '.DS_Store': continue
    categoryPath = os.path.join(rootpath, categoryName)  # path of this category
    filesList = os.listdir(categoryPath)   # all files in this category
    # segment each file in turn
    for filename in filesList:
        if filename == '.DS_Store': continue
        starttime = time.clock()
        contents = open(os.path.join(categoryPath, filename)).read()
        wordProcessed = fileWordProcess(contents)  # segment the file contents
        # not done for now: filenameWordProcessed = fileWordProcess(filename)  # segment the file name as a separate feature
        # words_list.append((wordProcessed, categoryName, filename))  # training-set format: [(tokens of this file, category, file name)]
        words_list.append(wordProcessed)
        filename_list.append(filename)
        category_list.append(categoryName)
        endtime = time.clock()
        print 'Category: %s >>>> File: %s >>>> Import time: %.3f' % (categoryName, filename, endtime - starttime)


Three lists store the contents of the files: words_list stores each file's segmented tokens (joined into one string), filename_list stores the corresponding file names, and category_list stores the corresponding file categories (here the three classes 'confidential, secret, internal').
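To make the layout concrete, here is a hypothetical peek at the three parallel lists (the token string and file name below are made up for illustration):

# the three lists are index-aligned: entry i describes the same file
# words_list[0]    -> 'token1 token2 token3 ...'   (space-joined tokens)
# filename_list[0] -> 'doc001.txt'
# category_list[0] -> 'internal'
assert len(words_list) == len(filename_list) == len(category_list)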


Step 4

sklearn offers two very powerful functions, CountVectorizer and TfidfTransformer. The first generates the term-frequency matrix, which (after clipping counts greater than 1 to 1) becomes the word-vector matrix; the second computes the tf-idf matrix, which we use to filter out feature words with small tf-idf values.

# build the word-vector matrix and the tf-idf matrix

from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
freWord = CountVectorizer(stop_words='english')
transformer = TfidfTransformer()
fre_matrix = freWord.fit_transform(words_list)   # term-frequency matrix
tfidf = transformer.fit_transform(fre_matrix)    # tf-idf matrix

import pandas as pd
feature_names = freWord.get_feature_names()           # feature (term) names
freWordVector_df = pd.DataFrame(fre_matrix.toarray()) # full-vocabulary term-frequency matrix
tfidf_df = pd.DataFrame(tfidf.toarray())              # tf-idf value matrix
# print freWordVector_df
tfidf_df.shape
# tf-idf screening: keep the 10000 terms with the largest tf-idf column sums
tfidf_sx_featuresindex = tfidf_df.sum(axis=0).sort_values(ascending=False)[:10000].index
print len(tfidf_sx_featuresindex)
freWord_tfsx_df = freWordVector_df.ix[:, tfidf_sx_featuresindex]  # word-vector matrix after tf-idf screening
df_columns = pd.Series(feature_names)[tfidf_sx_featuresindex]
print df_columns.shape
def guiyi(x):          # binarize: clip counts greater than 1 to 1
    x[x > 1] = 1
    return x
import numpy as np
tfidf_df_1 = freWord_tfsx_df.apply(guiyi)
tfidf_df_1.columns = df_columns
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
tfidf_df_1['label'] = le.fit_transform(category_list)  # encode category names as integers
tfidf_df_1.index = filename_list
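As a side note, sklearn's TfidfVectorizer bundles both steps into one; a minimal equivalent sketch (assuming the same words_list) would be:

from sklearn.feature_extraction.text import TfidfVectorizer
# TfidfVectorizer == CountVectorizer followed by TfidfTransformer,
# so this should match the shape of the tfidf matrix above
vec = TfidfVectorizer(stop_words='english')
tfidf_alt = vec.fit_transform(words_list)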

Step 5

The chi-square screening is even simpler. At first I could not find the indices of the features it selected, but finally found get_support(): with indices=False it returns a Boolean list covering all input features (True for the selected ones), and with indices=True it returns the indices of the selected features.

# chi-square test: keep the 7000 best features
from sklearn.feature_selection import SelectKBest, chi2
ch2 = SelectKBest(chi2, k=7000)
nolabel_feature = [x for x in tfidf_df_1.columns if x not in ['label']]
ch2_sx_np = ch2.fit_transform(tfidf_df_1[nolabel_feature], tfidf_df_1['label'])
label_np = np.array(tfidf_df_1['label'])
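To see exactly which terms the chi-square test kept, here is a small sketch using get_support() as described above (selected_names is a name introduced here for illustration):

mask = ch2.get_support()                    # Boolean mask, one entry per input feature
selected = ch2.get_support(indices=True)    # integer indices of the selected features
selected_names = [nolabel_feature[i] for i in selected]
print len(selected_names)                   # 7000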


Step 6

Here I first chose the Naive Bayes algorithm for training. Before training, I partitioned the samples with stratified 10-fold cross-validation, then iterated over the 10 folds, training and predicting on each in turn.

Finally, comparing the predicted values against the true values gives an accuracy rate of up to 84%.

# Naive Bayes
from sklearn.naive_bayes import MultinomialNB
from sklearn.cross_validation import train_test_split
from sklearn.cross_validation import StratifiedKFold
from sklearn.cross_validation import KFold
from sklearn.metrics import precision_recall_curve
from sklearn.metrics import classification_report
# nolabel_feature = [x for x in tfidf_df_1.columns if x not in ['label']]
# x_train, x_test, y_train, y_test = train_test_split(ch2_sx_np, tfidf_df_1['label'], test_size=0.2)

X = ch2_sx_np
y = label_np
skf = StratifiedKFold(y, n_folds=10)   # stratified 10-fold split (old sklearn.cross_validation API)
y_pre = y.copy()                       # will hold the out-of-fold predictions
for train_index, test_index in skf:
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = MultinomialNB().fit(X_train, y_train)
    y_pre[test_index] = clf.predict(X_test)

print 'Accuracy: %.6f' % (np.mean(y_pre == y))
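A more compact alternative, assuming the same old sklearn.cross_validation API, is cross_val_score, which uses a stratified split for classifiers by default:

from sklearn.cross_validation import cross_val_score
scores = cross_val_score(MultinomialNB(), X, y, cv=10)   # 10-fold cross-validated accuracy
print 'Mean accuracy: %.6f' % scores.mean()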

Step 7

Validating the precision, recall, F1 score, and confusion matrix values.

# precision, recall, F1-score
from sklearn.metrics import confusion_matrix, classification_report
print 'precision, recall, F1-score: >>>>>>>>'
print classification_report(y, y_pre)

# confusion matrix
import matplotlib.pyplot as plt
%matplotlib inline
def plot_confusion_matrix(cm, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(category[1:]))   # skip the '.DS_Store' entry
    category_english = ['neibu', 'jimi', 'mimi']
    plt.xticks(tick_marks, category_english, rotation=45)
    plt.yticks(tick_marks, category_english)
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    for x in range(len(cm)):
        for y in range(len(cm)):
            # cm[y, x]: row = true label (plot y axis), column = predicted label (plot x axis)
            plt.annotate(cm[y, x], xy=(x, y), horizontalalignment='center', verticalalignment='center')
print 'Confusion matrix: >>>>>>'
cm = confusion_matrix(y, y_pre)
plt.figure()
plot_confusion_matrix(cm)

plt.show()


 


Origin: blog.csdn.net/weixin_44995023/article/details/91804297