7.建立一个多类分类系统

7.建立一个多类分类系统

从规模化到特征提取、建模和评估,已经完成了简历分类系统的全部必要的步骤。现在将所有的东西组装在一起,应用到真实数据上以建立一个分类文本分类系统。对于此工作,将使用 scikit-learn 下载的 20 个新闻组数据集。这 20 个新闻组数据集包括分散在 20 个不同类别或主题的 18000 个新闻组帖子,这就构建了 20 类分类问题!请记住类的数量越多,尝试建立正确分类器就越复杂或者越困难。为防止模型因为文件头或者邮件地址而过拟合或泛化能力不强,体检的做法是从文档中去除文件头、文件尾和引用,因此需要确保考虑到了这一点。对于去除上述三项内容后的空文档或没用内容的文档,也将给予剔除,因为尝试从空文档中提取特征是毫无意义的。

开始下载所需的数据集以及为建立训练和测试数据集所用的函数:

from  sklearn.datasets  import  fetch_20newsgroups
## 文档中使用的模块在高版本中会被剔除,根据提示替换模块解决问题
# from sklearn.cross_validation import train_test_split
from  sklearn.model_selection  import  train_test_split
 
def  get_data():
     data  =  fetch_20newsgroups(subset = 'all' ,
                               shuffle = True ,
                               remove = ( 'headers' 'footers' 'quotes' ))
     return  data
     
def  prepare_datasets(corpus, labels, test_data_proportion = 0.3 ):
     train_X, test_X, train_Y, test_Y  =  train_test_split(corpus, labels,
                                                         test_size = 0.33 , random_state = 42 )
     return  train_X, test_X, train_Y, test_Y
 
def  remove_empty_docs(corpus, labels):
     filtered_corpus  =  []
     filtered_labels  =  []
     for  doc, label  in  zip (corpus, labels):
         if  doc.strip():
             filtered_corpus.append(doc)
             filtered_labels.append(label)
     return  filtered_corpus, filtered_labels

现在已经获得了数据,查看了数据集中分类的数量,使用下面的代码将数据集分为测试数据集和训练数据集。(下面代码执行下载数据集:)

In [ 20 ]: dataset  =  get_data()
     ...:  print (dataset.target_names)
     ...:
Downloading  20news  dataset. This may take a few minutes.
Downloading dataset  from  https: / / ndownloader.figshare.com / files / 5975967  ( 14  MB)
[ 'alt.atheism' 'comp.graphics' 'comp.os.ms-windows.misc' 'comp.sys.ibm.pc.hardware' 'comp.sys.mac.hardware' 'comp.windows.x' 'misc.forsale' 'rec.autos' 'rec.motorcycles' 'rec.sport.baseball' 'rec.sport.hockey' 'sci.crypt' 'sci.electronics' 'sci.med' 'sci.space' 'soc.religion.christian' 'talk.politics.guns' 'talk.politics.mideast' 'talk.politics.misc' 'talk.religion.misc' ]
In [ 21 ]: corpus, labels  =  dataset.data, dataset.target
     ...: corpus, labels  =  remove_empty_docs(corpus, labels)
     ...:
     ...:  print ( 'Sample document:' , corpus[ 10 ])
     ...:  print ( 'Class label:' ,labels[ 10 ])
     ...:  print ( 'Actual class label:' , dataset.target_names[labels[ 10 ]])
     ...:
     ...:
Sample document: the blood of the lamb.
 
This will be a hard task, because most cultures used most animals
for  blood sacrifices. It has to be something related to our current
post - modernism state. Hmm, what about used computers?
 
Cheers,
Kent
Class label:  19
Actual  class  label: talk.religion.misc
train_corpus, test_corpus, train_labels, test_labels  =  prepare_datasets(corpus,
                                                                         labels,
                                                                         test_data_proportion = 0.3 )

从上面的代码可以看到文档和标签的情况。每个文档拥有自己的标签类,这些标签是需要进行分类的 20 个主题之一。这些标签是数字形式的,如果需要,可以使用上面的代码容易地将它们映射回原来的类别名字。已经把数据分为训练数据集和测试数据集,测试数据集占总数据的 30%。将使用训练数据集建立模型,使用测试数据集测试模型的性能。下面的代码将使用前面建立的规范化模块对数据集进行规范化处理:

normalization.py  折叠源码
# -*- coding: utf-8 -*-
"""
Created on Fri Aug 26 20:45:10 2016
@author: DIP
"""
 
from  contractions  import  CONTRACTION_MAP
import  re
import  nltk
import  string
from  nltk.stem  import  WordNetLemmatizer
 
stopword_list  =  nltk.corpus.stopwords.words( 'english' )
wnl  =  WordNetLemmatizer()
 
def  tokenize_text(text):
     tokens  =  nltk.word_tokenize(text)
     tokens  =  [token.strip()  for  token  in  tokens]
     return  tokens
 
def  expand_contractions(text, contraction_mapping):
     
     contractions_pattern  =  re. compile ( '({})' . format ( '|' .join(contraction_mapping.keys())),
                                       flags = re.IGNORECASE|re.DOTALL)
     def  expand_match(contraction):
         match  =  contraction.group( 0 )
         first_char  =  match[ 0 ]
         expanded_contraction  =  contraction_mapping.get(match)\
                                 if  contraction_mapping.get(match)\
                                 else  contraction_mapping.get(match.lower())                      
         expanded_contraction  =  first_char + expanded_contraction[ 1 :]
         return  expanded_contraction
         
     expanded_text  =  contractions_pattern.sub(expand_match, text)
     expanded_text  =  re.sub( "'" , "", expanded_text)
     return  expanded_text
     
     
from  pattern.en  import  tag
from  nltk.corpus  import  wordnet as wn
 
# Annotate text tokens with POS tags
def  pos_tag_text(text):
     
     def  penn_to_wn_tags(pos_tag):
         if  pos_tag.startswith( 'J' ):
             return  wn.ADJ
         elif  pos_tag.startswith( 'V' ):
             return  wn.VERB
         elif  pos_tag.startswith( 'N' ):
             return  wn.NOUN
         elif  pos_tag.startswith( 'R' ):
             return  wn.ADV
         else :
             return  None
     
     tagged_text  =  tag(text)
     tagged_lower_text  =  [(word.lower(), penn_to_wn_tags(pos_tag))
                          for  word, pos_tag  in
                          tagged_text]
     return  tagged_lower_text
     
# lemmatize text based on POS tags   
def  lemmatize_text(text):
     
     pos_tagged_text  =  pos_tag_text(text)
     lemmatized_tokens  =  [wnl.lemmatize(word, pos_tag)  if  pos_tag
                          else  word                    
                          for  word, pos_tag  in  pos_tagged_text]
     lemmatized_text  =  ' ' .join(lemmatized_tokens)
     return  lemmatized_text
     
 
def  remove_special_characters(text):
     tokens  =  tokenize_text(text)
     pattern  =  re. compile ( '[{}]' . format (re.escape(string.punctuation)))
     filtered_tokens  =  filter ( None , [pattern.sub('', token)  for  token  in  tokens])
     filtered_text  =  ' ' .join(filtered_tokens)
     return  filtered_text
     
     
def  remove_stopwords(text):
     tokens  =  tokenize_text(text)
     filtered_tokens  =  [token  for  token  in  tokens  if  token  not  in  stopword_list]
     filtered_text  =  ' ' .join(filtered_tokens)   
     return  filtered_text
 
     
 
def  normalize_corpus(corpus, tokenize = False ):
     
     normalized_corpus  =  []   
     for  text  in  corpus:
         text  =  expand_contractions(text, CONTRACTION_MAP)
         text  =  lemmatize_text(text)
         text  =  remove_special_characters(text)
         text  =  remove_stopwords(text)
         normalized_corpus.append(text)
         if  tokenize:
             text  =  tokenize_text(text)
             normalized_corpus.append(text)
             
     return  normalized_corpus
from  normalization  import  normalize_corpus
 
norm_train_corpus  =  normalize_corpus(train_corpus)
norm_test_corpus  =  normalize_corpus(test_corpus)

执行语句可能会耗费一段时间才能完成。

如果出现类似错误:

...
RuntimeError: generator raised StopIteration

请切换至 Python3.6 或更高版本

记住,语料库中每个文档进行规范化处理需要很多步骤,所以这将会耗费一些时间才能完成。完成文档规范化处理后,将使用前面建立的特征提取模块从文档中提取特征。将分别建立词袋模型、TF-IDF 模型、平均词向量模型和 TF-IDF 加权平均词向量模型,并比较它们的性能。

下面的代码基于不同技术提取必要的特征:

from  feature_extractors  import  bow_extractor, tfidf_extractor
from  feature_extractors  import  averaged_word_vectorizer
from  feature_extractors  import  tfidf_weighted_averaged_word_vectorizer
import  nltk
import  gensim
 
# bag of words features
bow_vectorizer, bow_train_features  =  bow_extractor(norm_train_corpus) 
bow_test_features  =  bow_vectorizer.transform(norm_test_corpus)
 
# tfidf features
tfidf_vectorizer, tfidf_train_features  =  tfidf_extractor(norm_train_corpus) 
tfidf_test_features  =  tfidf_vectorizer.transform(norm_test_corpus)   
 
 
# tokenize documents
tokenized_train  =  [nltk.word_tokenize(text)
                    for  text  in  norm_train_corpus]
tokenized_test  =  [nltk.word_tokenize(text)
                    for  text  in  norm_test_corpus] 
# build word2vec model                  
model  =  gensim.models.Word2Vec(tokenized_train,
                                size = 500 ,
                                window = 100 ,
                                min_count = 30 ,
                                sample = 1e - 3 )                 
                    
# averaged word vector features
avg_wv_train_features  =  averaged_word_vectorizer(corpus = tokenized_train,
                                                  model = model,
                                                  num_features = 500 )                  
avg_wv_test_features  =  averaged_word_vectorizer(corpus = tokenized_test,
                                                 model = model,
                                                 num_features = 500 )                                               
# tfidf weighted averaged word vector features
vocab  =  tfidf_vectorizer.vocabulary_
tfidf_wv_train_features  =  tfidf_weighted_averaged_word_vectorizer(corpus = tokenized_train,
                                                                   tfidf_vectors = tfidf_train_features,
                                                                   tfidf_vocabulary = vocab,
                                                                   model = model,
                                                                   num_features = 500 )
tfidf_wv_test_features  =  tfidf_weighted_averaged_word_vectorizer(corpus = tokenized_test,
                                                                  tfidf_vectors = tfidf_test_features,
                                                                  tfidf_vocabulary = vocab,
                                                                  model = model,
                                                                  num_features = 500 )

使用上面的特征提取器从文本文档中提取了全部必要的特征之后,基于前面讨论的四个指标,定义一个函数用来苹果分类模型,函数如下面代码段所示:

from  sklearn  import  metrics
import  numpy as np
 
def  get_metrics(true_labels, predicted_labels):
     
     print ( 'Accuracy:' , np. round (
                         metrics.accuracy_score(true_labels,
                                                predicted_labels),
                         2 ))
     print ( 'Precision:' , np. round (
                         metrics.precision_score(true_labels,
                                                predicted_labels,
                                                average = 'weighted' ),
                         2 ))
     print ( 'Recall:' , np. round (
                         metrics.recall_score(true_labels,
                                                predicted_labels,
                                                average = 'weighted' ),
                         2 ))
     print ( 'F1 Score:' , np. round (
                         metrics.f1_score(true_labels,
                                                predicted_labels,
                                                average = 'weighted' ),
                         2 ))

现在定义一个函数使用机器学习算法和训练数据来训练模型,使用训练的模型在测试数据上执行预测,接着使用上面的函数苹果模型预测性能:

def  train_predict_evaluate_model(classifier,
                                  train_features, train_labels,
                                  test_features, test_labels):
     # build model   
     classifier.fit(train_features, train_labels)
     # predict using model
     predictions  =  classifier.predict(test_features)
     # evaluate model prediction performance  
     get_metrics(true_labels = test_labels,
                 predicted_labels = predictions)
     return  predictions

现在进入了 2 个机器学习算法,开始使用已经提取的特征建立模型。将使用前面提到的 scikit-learn 引入必要的分类算法,以节省花费在重写代码的时间和精力上:

from  sklearn.naive_bayes  import  MultinomialNB
from  sklearn.linear_model  import  SGDClassifier
 
mnb  =  MultinomialNB()
svm  =  SGDClassifier(loss = 'hinge' , n_iter = 100 )

现在下面的代码将使用多项式朴素贝叶斯和支持向量机以及全部不同类型的特征进行模型训练、预测和评估:

# Multinomial Naive Bayes with bag of words features
mnb_bow_predictions  =  train_predict_evaluate_model(classifier = mnb,
                                            train_features = bow_train_features,
                                            train_labels = train_labels,
                                            test_features = bow_test_features,
                                            test_labels = test_labels)
Accuracy:  0.67
Precision:  0.72
Recall:  0.67
F1 Score:  0.65
# Support Vector Machine with bag of words features
svm_bow_predictions  =  train_predict_evaluate_model(classifier = svm,
                                            train_features = bow_train_features,
                                            train_labels = train_labels,
                                            test_features = bow_test_features,
                                            test_labels = test_labels)
Accuracy:  0.61
Precision:  0.67
Recall:  0.61
F1 Score:  0.62
# Multinomial Naive Bayes with tfidf features                                          
mnb_tfidf_predictions  =  train_predict_evaluate_model(classifier = mnb,
                                            train_features = tfidf_train_features,
                                            train_labels = train_labels,
                                            test_features = tfidf_test_features,
                                            test_labels = test_labels)
Accuracy:  0.72
Precision:  0.78
Recall:  0.72
F1 Score:  0.7
# Support Vector Machine with tfidf features
svm_tfidf_predictions  =  train_predict_evaluate_model(classifier = svm,
                                            train_features = tfidf_train_features,
                                            train_labels = train_labels,
                                            test_features = tfidf_test_features,
                                            test_labels = test_labels)
Accuracy:  0.77
Precision:  0.77
Recall:  0.77
F1 Score:  0.77
# Support Vector Machine with averaged word vector features
svm_avgwv_predictions  =  train_predict_evaluate_model(classifier = svm,
                                            train_features = avg_wv_train_features,
                                            train_labels = train_labels,
                                            test_features = avg_wv_test_features,
                                            test_labels = test_labels)
Accuracy:  0.56
Precision:  0.58
Recall:  0.56
F1 Score:  0.56
# Support Vector Machine with tfidf weighted averaged word vector features
svm_tfidfwv_predictions  =  train_predict_evaluate_model(classifier = svm,
                                            train_features = tfidf_wv_train_features,
                                            train_labels = train_labels,
                                            test_features = tfidf_wv_test_features,
                                            test_labels = test_labels)
Accuracy:  0.53
Precision:  0.58
Recall:  0.53
F1 Score:  0.52

使用不同类型的特征建立了 6 个模型,使用测试数据评估了模型的性能。从上面的结果可以看到使用 TF-IDF 特征的 SVM 模型获得了最好的结果,准确率、精确率、召回率和 F1 score 均为 77%。可以建立 SVM TF-IDF 模型的混淆矩阵,以便了解模型性能不好的具体分类的情况:

import  pandas as pd
cm  =  metrics.confusion_matrix(test_labels, svm_tfidf_predictions)
pd.DataFrame(cm, index = range ( 0 , 20 ), columns = range ( 0 , 20 ))
Out[ 47 ]:
      0     1     2     3     4     5     6     7     8     9     10    11    12    13    14    15    16    17    18   19
0    156     3     0     1     1     0     2     3     4     1     4     4     2     4     5    34     3     7     7   22
1      1   224     9     7     8    14     8     0     2     1     0     2     5     4     4     1     4     0     3    0
2      1    20   221    18     9    18     8     1     0     0     0     3     5     2     2     2     1     1     2    0
3      1    11    25   223     9     4     9     2     1     1     1     2     6     3     1     0     1     0     0    0
4      0     4     7    15   228     6     5     2     3     1     0     3     9     3     3     1     1     0     1    0
5      0    21    18     1     2   272     0     1     1     0     0     0     4     3     1     0     0     1     0    0
6      0     2     7    11    12     1   270    10     3     2     1     1    10     1     4     0     2     1     1    0
7      1     5     2     2     2     3     4   246    19     1     3     2    10     3     2     0     4     3     3    1
8      3     1     0     4     2     2     5    27   252     3     4     2     1     4     1     3     2     2     4    0
9      1     1     1     0     2     3     5     3     6   278    12     2     1     1     2     4     2     0     1    0
10     0     0     0     0     0     0     1     3     2     4   282     1     2     1     4     1     0     1     1    0
11     3     5     3     3     1     2     2     2     2     3     0   259     6     2     0     1     5     2     5    0
12     1     6     6    15     7     2    13    10     8     4     4     2   212     3     5     1     1     1     0    1
13     2     4     0     1     3     4     3     0     2     0     1     1     7   267     4     2     3     0     4    0
14     0     5     3     0     2     4     2     5     4     1     2     0     8     3   264     2     4     1     3    1
15    11     1     0     0     1     1     0     0     4     1     3     2     1     7     5   292     4     4     2    4
16     4     1     0     0     0     4     2     1     7     2     2    11     3     2     4     2   227     3    13    3
17     6     0     1     0     1     3     0     2     3     2     4     6     1     3     1     6     5   259    10    2
18     9     1     2     1     0     1     2     1     5     3     3     7     0     9     6     4    33     7   165    3
19    21     5     0     1     0     2     3     3     7     2     1     1     0    11     3    57    21     7     3   65

从上表混淆矩阵上,可以看到很多类标签为 0 的文档被错误地分类到类标签 15 里面,同样对于类标签 18 的很多文档被错误地分类到类标签 16 里面。很多类标签 19 的文档被错误地分类到类型标签 15 里面。打印类型名字,可以看到如下输出:

In [ 48 ]: class_names  =  dataset.target_names
     ...:  print (class_names[ 0 ],  '->' , class_names[ 15 ])
     ...:  print (class_names[ 18 ],  '->' , class_names[ 16 ])
     ...:  print (class_names[ 19 ],  '->' , class_names[ 15 ])
     ...:
     ...:
alt.atheism  - > soc.religion.christian
talk.politics.misc  - > talk.politics.guns
talk.religion.misc  - > soc.religion.christian

从前面的输出可以看到错误分类与实际分类并没有显著的不同。Christian、religion 和 atheism 都是与商都和宗教存在有关的概念,可能会有相似的特征。杂项问题和强制都与政治有关,必然有相似的特征。可以使用下面的代码,进一步详细查看和分析被错误分类的问题:

import  re
 
num  =  0
for  document, label, predicted_label  in  zip (test_corpus, test_labels, svm_tfidf_predictions):
     if  label  = =  0  and  predicted_label  = =  15 :
         print ( 'Actual Label:' , class_names[label])
         print ( 'Predicted Label:' , class_names[predicted_label])
         print ( 'Document:-' )
         print (re.sub( '\n' ' ' , document))
         print ("")
         num  + =  1
         if  num  = =  4 :
             break

打印结果:

Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document: -
I would like a  list  of Bible contadictions  from  those of you who dispite being free  from  Christianity are well versed  in  the Bible.
 
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document: -
   They spent quite a bit of time on the wording of the Constitution.  They picked words whose meanings implied the intent.  We have already looked  in  the dictionary to define the word.  Isn 't this sufficient?   But we were discussing it in relation to the death penalty.  And, the Constitution need not define each of the words within.  Anyone who doesn' t know what cruel  is  can look  in  the dictionary ( and  we did).
 
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document: -
Our Lord  and  Savior David Keresh has risen!     He has been seen alive!     Spread the word!      - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
 
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document: -
   "This is your god"  ( from  John Carpenter's  "They Live,"  natch)
num  =  0
for  document, label, predicted_label  in  zip (test_corpus, test_labels, svm_tfidf_predictions):
     if  label  = =  18  and  predicted_label  = =  16 :
         print ( 'Actual Label:' , class_names[label])
         print ( 'Predicted Label:' , class_names[predicted_label])
         print ( 'Document:-' )
         print (re.sub( '\n' ' ' , document))
         print ()
         num  + =  1
         if  num  = =  4 :
             break

打印结果:

Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document: -
After the initial gun battle was over, they had  50  days to come out peacefully. They had their high priced lawyer,  and  judging by the posts here they had some public support. Can anyone come up with a rational explanation why the didn 't come out (even after they negotiated coming out after the radio sermon) that doesn' t include the Davidians wanting to commit suicide / murder / general mayhem?
 
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document: -
Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire.  Today 's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies.  At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today' s paper quotes the government as saying, no, they didn 't have a license.  Today' s paper reports that a number of the bodies were found with shoulder weapons  next  to them, as  if  they had been using them  while  dying  - -  which doesn 't sound like the sort of action I would expect from a suicide.  Our government lies, as it tries to cover over its incompetence and negligence.  Why should I believe the FBI' s claims about anything  else , when we can see that they are LYING?  This system of government  is  beyond reform.
 
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document: -
   Well,  for  one thing most,  if  not  all  the Dividians (depending on whether they could show they acted  in  self - defense  and  there were no illegal weapons), could have gone on with their life as they were living it. No one was forcing them to give up their religion  or  even their legal weapons. The Dividians had survived a change  in  leadership before so even  if  Koresch himself would have been convicted  and  sent to jail, they still could have carried on.   I don 't think the Dividians were insane, but I don' t see a reason  for  mass suicide ( if  the fire was intentional  set  by some of the Dividians.) We also don 't know that, if the fire was intentionally set from inside, was it a generally know plan or was this something only an inner circle knew about, or was it something two or three felt they had to do with or without Koresch' s knowledge / blessing, etc.? I don't know much about Masada. Were some people throwing others over? Did mothers jump over with their babies  in  their arms?
 
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document: -
[email protected] (Russ Anderson) writes...      The fact  is  that Koresh  and  his followers involved themselves    in  a gun battle to control the Mt Carmel  complex . That  is  not    in  dispute. From what I remember of the trial, the authories    couldn 't reasonably establish who fired first, the big reason   behind the aquittal.                    _____  _____                   \\\\\\/ ___/___________________   Mitchell S Todd  \\\\/ /                 _____/__________________________ ________________    \\/ / mst4298@zeus._____/.' . '.' . '.' . '.' . '.' . '.' . '_' _ '_/ \_____        \__    / / tamu.edu  _____/.' . '.' . '.' . '.' . '.' . '.' . '.' _ '_/     \__________\__  / /        _____/_' _ '_' _ '_' _ '_' _ '_' _ '_' _ '_' _'_ /                  \_  /  / __________ /                   \ / ____ / \\\\\\              \\\\\\

可以看到是如何分析和查看错误分类的文档的,然后回到前面步骤,调整优化特征提取方法,通过删除特征的单词或调整单词权重来减少或增加影响程度。

猜你喜欢

转载自www.cnblogs.com/dalton/p/11353954.html