7.建立一个多类分类系统

从规模化到特征提取、建模和评估，已经完成了简历分类系统的全部必要的步骤。现在将所有的东西组装在一起，应用到真实数据上以建立一个分类文本分类系统。对于此工作，将使用 scikit-learn 下载的 20 个新闻组数据集。这 20 个新闻组数据集包括分散在 20 个不同类别或主题的 18000 个新闻组帖子，这就构建了 20 类分类问题！请记住类的数量越多，尝试建立正确分类器就越复杂或者越困难。为防止模型因为文件头或者邮件地址而过拟合或泛化能力不强，体检的做法是从文档中去除文件头、文件尾和引用，因此需要确保考虑到了这一点。对于去除上述三项内容后的空文档或没用内容的文档，也将给予剔除，因为尝试从空文档中提取特征是毫无意义的。

开始下载所需的数据集以及为建立训练和测试数据集所用的函数：

 
          from  
          sklearn.datasets  
          import  
          fetch_20newsgroups 
         
          ## 文档中使用的模块在高版本中会被剔除，根据提示替换模块解决问题 
         
          # from sklearn.cross_validation import train_test_split 
         
          from  
          sklearn.model_selection  
          import  
          train_test_split 
         
          def  
          get_data(): 
         
          data  
          =  
          fetch_20newsgroups(subset 
          = 
          'all' 
          , 
         
          shuffle 
          = 
          True 
          , 
         
          remove 
          = 
          ( 
          'headers' 
          ,  
          'footers' 
          ,  
          'quotes' 
          )) 
         
          return  
          data 
         
          def  
          prepare_datasets(corpus, labels, test_data_proportion 
          = 
          0.3 
          ): 
         
          train_X, test_X, train_Y, test_Y  
          =  
          train_test_split(corpus, labels, 
         
          test_size 
          = 
          0.33 
          , random_state 
          = 
          42 
          ) 
         
          return  
          train_X, test_X, train_Y, test_Y 
         
          def  
          remove_empty_docs(corpus, labels): 
         
          filtered_corpus  
          =  
          [] 
         
          filtered_labels  
          =  
          [] 
         
          for  
          doc, label  
          in  
          zip 
          (corpus, labels): 
         
          if  
          doc.strip(): 
         
          filtered_corpus.append(doc) 
         
          filtered_labels.append(label) 
         
          return  
          filtered_corpus, filtered_labels

现在已经获得了数据，查看了数据集中分类的数量，使用下面的代码将数据集分为测试数据集和训练数据集。（下面代码执行下载数据集：）

 
     
      
        
          In [ 
          20 
          ]: dataset  
          =  
          get_data() 
         
 
               
          ...:  
          print 
          (dataset.target_names) 
         
 
               
          ...: 
         
 
          Downloading  
          20news  
          dataset. This may take a few minutes. 
         
 
          Downloading dataset  
          from  
          https: 
          / 
          / 
          ndownloader.figshare.com 
          / 
          files 
          / 
          5975967  
          ( 
          14  
          MB) 
         
 
          [ 
          'alt.atheism' 
          ,  
          'comp.graphics' 
          ,  
          'comp.os.ms-windows.misc' 
          ,  
          'comp.sys.ibm.pc.hardware' 
          ,  
          'comp.sys.mac.hardware' 
          ,  
          'comp.windows.x' 
          ,  
          'misc.forsale' 
          ,  
          'rec.autos' 
          ,  
          'rec.motorcycles' 
          ,  
          'rec.sport.baseball' 
          ,  
          'rec.sport.hockey' 
          ,  
          'sci.crypt' 
          ,  
          'sci.electronics' 
          ,  
          'sci.med' 
          ,  
          'sci.space' 
          ,  
          'soc.religion.christian' 
          ,  
          'talk.politics.guns' 
          ,  
          'talk.politics.mideast' 
          ,  
          'talk.politics.misc' 
          ,  
          'talk.religion.misc' 
          ] 
         
 
      
 
     
   

 
          In [ 
          21 
          ]: corpus, labels  
          =  
          dataset.data, dataset.target 
         
          ...: corpus, labels  
          =  
          remove_empty_docs(corpus, labels) 
         
          ...: 
         
          ...:  
          print 
          ( 
          'Sample document:' 
          , corpus[ 
          10 
          ]) 
         
          ...:  
          print 
          ( 
          'Class label:' 
          ,labels[ 
          10 
          ]) 
         
          ...:  
          print 
          ( 
          'Actual class label:' 
          , dataset.target_names[labels[ 
          10 
          ]]) 
         
          ...: 
         
          ...: 
         
          Sample document: the blood of the lamb. 
         
          This will be a hard task, because most cultures used most animals 
         
          for  
          blood sacrifices. It has to be something related to our current 
         
          post 
          - 
          modernism state. Hmm, what about used computers? 
         
          Cheers, 
         
          Kent 
         
          Class label:  
          19 
         
          Actual  
          class  
          label: talk.religion.misc

 
          train_corpus, test_corpus, train_labels, test_labels  
          =  
          prepare_datasets(corpus, 
         
          labels, 
         
          test_data_proportion 
          = 
          0.3 
          )

从上面的代码可以看到文档和标签的情况。每个文档拥有自己的标签类，这些标签是需要进行分类的 20 个主题之一。这些标签是数字形式的，如果需要，可以使用上面的代码容易地将它们映射回原来的类别名字。已经把数据分为训练数据集和测试数据集，测试数据集占总数据的 30%。将使用训练数据集建立模型，使用测试数据集测试模型的性能。下面的代码将使用前面建立的规范化模块对数据集进行规范化处理：

normalization.py 折叠源码

 
          # -*- coding: utf-8 -*- 
         
          """ 
         
          Created on Fri Aug 26 20:45:10 2016 
         
          @author: DIP 
         
          """ 
         
          from  
          contractions  
          import  
          CONTRACTION_MAP 
         
          import  
          re 
         
          import  
          nltk 
         
          import  
          string 
         
          from  
          nltk.stem  
          import  
          WordNetLemmatizer 
         
          stopword_list  
          =  
          nltk.corpus.stopwords.words( 
          'english' 
          ) 
         
          wnl  
          =  
          WordNetLemmatizer() 
         
          def  
          tokenize_text(text): 
         
          tokens  
          =  
          nltk.word_tokenize(text) 
         
          tokens  
          =  
          [token.strip()  
          for  
          token  
          in  
          tokens] 
         
          return  
          tokens 
         
          def  
          expand_contractions(text, contraction_mapping): 
         
          contractions_pattern  
          =  
          re. 
          compile 
          ( 
          '({})' 
          . 
          format 
          ( 
          '|' 
          .join(contraction_mapping.keys())), 
         
          flags 
          = 
          re.IGNORECASE|re.DOTALL) 
         
          def  
          expand_match(contraction): 
         
          match  
          =  
          contraction.group( 
          0 
          ) 
         
          first_char  
          =  
          match[ 
          0 
          ] 
         
          expanded_contraction  
          =  
          contraction_mapping.get(match)\ 
         
          if  
          contraction_mapping.get(match)\ 
         
          else  
          contraction_mapping.get(match.lower())                       
         
          expanded_contraction  
          =  
          first_char 
          + 
          expanded_contraction[ 
          1 
          :] 
         
          return  
          expanded_contraction 
         
          expanded_text  
          =  
          contractions_pattern.sub(expand_match, text) 
         
          expanded_text  
          =  
          re.sub( 
          "'" 
          , "", expanded_text) 
         
          return  
          expanded_text 
         
          from  
          pattern.en  
          import  
          tag 
         
          from  
          nltk.corpus  
          import  
          wordnet as wn 
         
          # Annotate text tokens with POS tags 
         
          def  
          pos_tag_text(text): 
         
          def  
          penn_to_wn_tags(pos_tag): 
         
          if  
          pos_tag.startswith( 
          'J' 
          ): 
         
          return  
          wn.ADJ 
         
          elif  
          pos_tag.startswith( 
          'V' 
          ): 
         
          return  
          wn.VERB 
         
          elif  
          pos_tag.startswith( 
          'N' 
          ): 
         
          return  
          wn.NOUN 
         
          elif  
          pos_tag.startswith( 
          'R' 
          ): 
         
          return  
          wn.ADV 
         
          else 
          : 
         
          return  
          None 
         
          tagged_text  
          =  
          tag(text) 
         
          tagged_lower_text  
          =  
          [(word.lower(), penn_to_wn_tags(pos_tag)) 
         
          for  
          word, pos_tag  
          in 
         
          tagged_text] 
         
          return  
          tagged_lower_text 
         
          # lemmatize text based on POS tags    
         
          def  
          lemmatize_text(text): 
         
          pos_tagged_text  
          =  
          pos_tag_text(text) 
         
          lemmatized_tokens  
          =  
          [wnl.lemmatize(word, pos_tag)  
          if  
          pos_tag 
         
          else  
          word                     
         
          for  
          word, pos_tag  
          in  
          pos_tagged_text] 
         
          lemmatized_text  
          =  
          ' ' 
          .join(lemmatized_tokens) 
         
          return  
          lemmatized_text 
         
          def  
          remove_special_characters(text): 
         
          tokens  
          =  
          tokenize_text(text) 
         
          pattern  
          =  
          re. 
          compile 
          ( 
          '[{}]' 
          . 
          format 
          (re.escape(string.punctuation))) 
         
          filtered_tokens  
          =  
          filter 
          ( 
          None 
          , [pattern.sub('', token)  
          for  
          token  
          in  
          tokens]) 
         
          filtered_text  
          =  
          ' ' 
          .join(filtered_tokens) 
         
          return  
          filtered_text 
         
          def  
          remove_stopwords(text): 
         
          tokens  
          =  
          tokenize_text(text) 
         
          filtered_tokens  
          =  
          [token  
          for  
          token  
          in  
          tokens  
          if  
          token  
          not  
          in  
          stopword_list] 
         
          filtered_text  
          =  
          ' ' 
          .join(filtered_tokens)    
         
          return  
          filtered_text 
         
          def  
          normalize_corpus(corpus, tokenize 
          = 
          False 
          ): 
         
          normalized_corpus  
          =  
          []    
         
          for  
          text  
          in  
          corpus: 
         
          text  
          =  
          expand_contractions(text, CONTRACTION_MAP) 
         
          text  
          =  
          lemmatize_text(text) 
         
          text  
          =  
          remove_special_characters(text) 
         
          text  
          =  
          remove_stopwords(text) 
         
          normalized_corpus.append(text) 
         
          if  
          tokenize: 
         
          text  
          =  
          tokenize_text(text) 
         
          normalized_corpus.append(text) 
         
          return  
          normalized_corpus

 
          from  
          normalization  
          import  
          normalize_corpus 
         
          norm_train_corpus  
          =  
          normalize_corpus(train_corpus) 
         
          norm_test_corpus  
          =  
          normalize_corpus(test_corpus)

执行语句可能会耗费一段时间才能完成。

如果出现类似错误：

 
            ... 
           
            RuntimeError: generator raised StopIteration

请切换至 Python3.6 或更高版本

记住，语料库中每个文档进行规范化处理需要很多步骤，所以这将会耗费一些时间才能完成。完成文档规范化处理后，将使用前面建立的特征提取模块从文档中提取特征。将分别建立词袋模型、TF-IDF 模型、平均词向量模型和 TF-IDF 加权平均词向量模型，并比较它们的性能。

下面的代码基于不同技术提取必要的特征：

 
          from  
          feature_extractors  
          import  
          bow_extractor, tfidf_extractor 
         
          from  
          feature_extractors  
          import  
          averaged_word_vectorizer 
         
          from  
          feature_extractors  
          import  
          tfidf_weighted_averaged_word_vectorizer 
         
          import  
          nltk 
         
          import  
          gensim 
         
          # bag of words features 
         
          bow_vectorizer, bow_train_features  
          =  
          bow_extractor(norm_train_corpus)  
         
          bow_test_features  
          =  
          bow_vectorizer.transform(norm_test_corpus) 
         
          # tfidf features 
         
          tfidf_vectorizer, tfidf_train_features  
          =  
          tfidf_extractor(norm_train_corpus)  
         
          tfidf_test_features  
          =  
          tfidf_vectorizer.transform(norm_test_corpus)    
         
          # tokenize documents 
         
          tokenized_train  
          =  
          [nltk.word_tokenize(text) 
         
          for  
          text  
          in  
          norm_train_corpus] 
         
          tokenized_test  
          =  
          [nltk.word_tokenize(text) 
         
          for  
          text  
          in  
          norm_test_corpus]  
         
          # build word2vec model                   
         
          model  
          =  
          gensim.models.Word2Vec(tokenized_train, 
         
          size 
          = 
          500 
          , 
         
          window 
          = 
          100 
          , 
         
          min_count 
          = 
          30 
          , 
         
          sample 
          = 
          1e 
          - 
          3 
          )                  
         
          # averaged word vector features 
         
          avg_wv_train_features  
          =  
          averaged_word_vectorizer(corpus 
          = 
          tokenized_train, 
         
          model 
          = 
          model, 
         
          num_features 
          = 
          500 
          )                   
         
          avg_wv_test_features  
          =  
          averaged_word_vectorizer(corpus 
          = 
          tokenized_test, 
         
          model 
          = 
          model, 
         
          num_features 
          = 
          500 
          )                                                
         
          # tfidf weighted averaged word vector features 
         
          vocab  
          =  
          tfidf_vectorizer.vocabulary_ 
         
          tfidf_wv_train_features  
          =  
          tfidf_weighted_averaged_word_vectorizer(corpus 
          = 
          tokenized_train, 
         
          tfidf_vectors 
          = 
          tfidf_train_features, 
         
          tfidf_vocabulary 
          = 
          vocab, 
         
          model 
          = 
          model, 
         
          num_features 
          = 
          500 
          ) 
         
          tfidf_wv_test_features  
          =  
          tfidf_weighted_averaged_word_vectorizer(corpus 
          = 
          tokenized_test, 
         
          tfidf_vectors 
          = 
          tfidf_test_features, 
         
          tfidf_vocabulary 
          = 
          vocab, 
         
          model 
          = 
          model, 
         
          num_features 
          = 
          500 
          )

使用上面的特征提取器从文本文档中提取了全部必要的特征之后，基于前面讨论的四个指标，定义一个函数用来苹果分类模型，函数如下面代码段所示：

 
          from  
          sklearn  
          import  
          metrics 
         
          import  
          numpy as np 
         
          def  
          get_metrics(true_labels, predicted_labels): 
         
          print 
          ( 
          'Accuracy:' 
          , np. 
          round 
          ( 
         
          metrics.accuracy_score(true_labels, 
         
          predicted_labels), 
         
          2 
          )) 
         
          print 
          ( 
          'Precision:' 
          , np. 
          round 
          ( 
         
          metrics.precision_score(true_labels, 
         
          predicted_labels, 
         
          average 
          = 
          'weighted' 
          ), 
         
          2 
          )) 
         
          print 
          ( 
          'Recall:' 
          , np. 
          round 
          ( 
         
          metrics.recall_score(true_labels, 
         
          predicted_labels, 
         
          average 
          = 
          'weighted' 
          ), 
         
          2 
          )) 
         
          print 
          ( 
          'F1 Score:' 
          , np. 
          round 
          ( 
         
          metrics.f1_score(true_labels, 
         
          predicted_labels, 
         
          average 
          = 
          'weighted' 
          ), 
         
          2 
          ))

现在定义一个函数使用机器学习算法和训练数据来训练模型，使用训练的模型在测试数据上执行预测，接着使用上面的函数苹果模型预测性能：

 
          def  
          train_predict_evaluate_model(classifier, 
         
          train_features, train_labels, 
         
          test_features, test_labels): 
         
          # build model    
         
          classifier.fit(train_features, train_labels) 
         
          # predict using model 
         
          predictions  
          =  
          classifier.predict(test_features) 
         
          # evaluate model prediction performance   
         
          get_metrics(true_labels 
          = 
          test_labels, 
         
          predicted_labels 
          = 
          predictions) 
         
          return  
          predictions

现在进入了 2 个机器学习算法，开始使用已经提取的特征建立模型。将使用前面提到的 scikit-learn 引入必要的分类算法，以节省花费在重写代码的时间和精力上：

 
          from  
          sklearn.naive_bayes  
          import  
          MultinomialNB 
         
          from  
          sklearn.linear_model  
          import  
          SGDClassifier 
         
          mnb  
          =  
          MultinomialNB() 
         
          svm  
          =  
          SGDClassifier(loss 
          = 
          'hinge' 
          , n_iter 
          = 
          100 
          )

现在下面的代码将使用多项式朴素贝叶斯和支持向量机以及全部不同类型的特征进行模型训练、预测和评估：

 
          # Multinomial Naive Bayes with bag of words features 
         
          mnb_bow_predictions  
          =  
          train_predict_evaluate_model(classifier 
          = 
          mnb, 
         
          train_features 
          = 
          bow_train_features, 
         
          train_labels 
          = 
          train_labels, 
         
          test_features 
          = 
          bow_test_features, 
         
          test_labels 
          = 
          test_labels)

 
          Accuracy:  
          0.67 
         
          Precision:  
          0.72 
         
          Recall:  
          0.67 
         
          F1 Score:  
          0.65

 
          # Support Vector Machine with bag of words features 
         
          svm_bow_predictions  
          =  
          train_predict_evaluate_model(classifier 
          = 
          svm, 
         
          train_features 
          = 
          bow_train_features, 
         
          train_labels 
          = 
          train_labels, 
         
          test_features 
          = 
          bow_test_features, 
         
          test_labels 
          = 
          test_labels)

 
          Accuracy:  
          0.61 
         
          Precision:  
          0.67 
         
          Recall:  
          0.61 
         
          F1 Score:  
          0.62

 
          # Multinomial Naive Bayes with tfidf features                                           
         
          mnb_tfidf_predictions  
          =  
          train_predict_evaluate_model(classifier 
          = 
          mnb, 
         
          train_features 
          = 
          tfidf_train_features, 
         
          train_labels 
          = 
          train_labels, 
         
          test_features 
          = 
          tfidf_test_features, 
         
          test_labels 
          = 
          test_labels)

 
          Accuracy:  
          0.72 
         
          Precision:  
          0.78 
         
          Recall:  
          0.72 
         
          F1 Score:  
          0.7

 
          # Support Vector Machine with tfidf features 
         
          svm_tfidf_predictions  
          =  
          train_predict_evaluate_model(classifier 
          = 
          svm, 
         
          train_features 
          = 
          tfidf_train_features, 
         
          train_labels 
          = 
          train_labels, 
         
          test_features 
          = 
          tfidf_test_features, 
         
          test_labels 
          = 
          test_labels)

 
          Accuracy:  
          0.77 
         
          Precision:  
          0.77 
         
          Recall:  
          0.77 
         
          F1 Score:  
          0.77

 
          # Support Vector Machine with averaged word vector features 
         
          svm_avgwv_predictions  
          =  
          train_predict_evaluate_model(classifier 
          = 
          svm, 
         
          train_features 
          = 
          avg_wv_train_features, 
         
          train_labels 
          = 
          train_labels, 
         
          test_features 
          = 
          avg_wv_test_features, 
         
          test_labels 
          = 
          test_labels)

 
          Accuracy:  
          0.56 
         
          Precision:  
          0.58 
         
          Recall:  
          0.56 
         
          F1 Score:  
          0.56

 
          # Support Vector Machine with tfidf weighted averaged word vector features 
         
          svm_tfidfwv_predictions  
          =  
          train_predict_evaluate_model(classifier 
          = 
          svm, 
         
          train_features 
          = 
          tfidf_wv_train_features, 
         
          train_labels 
          = 
          train_labels, 
         
          test_features 
          = 
          tfidf_wv_test_features, 
         
          test_labels 
          = 
          test_labels)

 
          Accuracy:  
          0.53 
         
          Precision:  
          0.58 
         
          Recall:  
          0.53 
         
          F1 Score:  
          0.52

使用不同类型的特征建立了 6 个模型，使用测试数据评估了模型的性能。从上面的结果可以看到使用 TF-IDF 特征的 SVM 模型获得了最好的结果，准确率、精确率、召回率和 F1 score 均为 77%。可以建立 SVM TF-IDF 模型的混淆矩阵，以便了解模型性能不好的具体分类的情况：

 
          import  
          pandas as pd 
         
          cm  
          =  
          metrics.confusion_matrix(test_labels, svm_tfidf_predictions) 
         
          pd.DataFrame(cm, index 
          = 
          range 
          ( 
          0 
          , 
          20 
          ), columns 
          = 
          range 
          ( 
          0 
          , 
          20 
          ))

从上表混淆矩阵上,可以看到很多类标签为 0 的文档被错误地分类到类标签 15 里面，同样对于类标签 18 的很多文档被错误地分类到类标签 16 里面。很多类标签 19 的文档被错误地分类到类型标签 15 里面。打印类型名字，可以看到如下输出：

 
          In [ 
          48 
          ]: class_names  
          =  
          dataset.target_names 
         
          ...:  
          print 
          (class_names[ 
          0 
          ],  
          '->' 
          , class_names[ 
          15 
          ]) 
         
          ...:  
          print 
          (class_names[ 
          18 
          ],  
          '->' 
          , class_names[ 
          16 
          ]) 
         
          ...:  
          print 
          (class_names[ 
          19 
          ],  
          '->' 
          , class_names[ 
          15 
          ]) 
         
          ...: 
         
          ...: 
         
          alt.atheism  
          - 
          > soc.religion.christian 
         
          talk.politics.misc  
          - 
          > talk.politics.guns 
         
          talk.religion.misc  
          - 
          > soc.religion.christian

从前面的输出可以看到错误分类与实际分类并没有显著的不同。Christian、religion 和 atheism 都是与商都和宗教存在有关的概念，可能会有相似的特征。杂项问题和强制都与政治有关，必然有相似的特征。可以使用下面的代码，进一步详细查看和分析被错误分类的问题：

 
          import  
          re 
         
          num  
          =  
          0 
         
          for  
          document, label, predicted_label  
          in  
          zip 
          (test_corpus, test_labels, svm_tfidf_predictions): 
         
          if  
          label  
          = 
          =  
          0  
          and  
          predicted_label  
          = 
          =  
          15 
          : 
         
          print 
          ( 
          'Actual Label:' 
          , class_names[label]) 
         
          print 
          ( 
          'Predicted Label:' 
          , class_names[predicted_label]) 
         
          print 
          ( 
          'Document:-' 
          ) 
         
          print 
          (re.sub( 
          '\n' 
          ,  
          ' ' 
          , document)) 
         
          print 
          ("") 
         
          num  
          + 
          =  
          1 
         
          if  
          num  
          = 
          =  
          4 
          : 
         
          break

打印结果：

 
     
      
        
          Actual Label: alt.atheism 
         
 
          Predicted Label: soc.religion.christian 
         
 
          Document: 
          - 
         
 
          I would like a  
          list  
          of Bible contadictions  
          from  
          those of you who dispite being free  
          from  
          Christianity are well versed  
          in  
          the Bible. 
         

             
         
 
          Actual Label: alt.atheism 
         
 
          Predicted Label: soc.religion.christian 
         
 
          Document: 
          - 
         
 
             
          They spent quite a bit of time on the wording of the Constitution.  They picked words whose meanings implied the intent.  We have already looked  
          in  
          the dictionary to define the word.  Isn 
          't this sufficient?   But we were discussing it in relation to the death penalty.  And, the Constitution need not define each of the words within.  Anyone who doesn' 
          t know what cruel  
          is  
          can look  
          in  
          the dictionary ( 
          and  
          we did). 
         

             
         
 
          Actual Label: alt.atheism 
         
 
          Predicted Label: soc.religion.christian 
         
 
          Document: 
          - 
         
 
          Our Lord  
          and  
          Savior David Keresh has risen!     He has been seen alive!     Spread the word!      
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
          - 
         

             
         
 
          Actual Label: alt.atheism 
         
 
          Predicted Label: soc.religion.christian 
         
 
          Document: 
          - 
         
 
             
          "This is your god"  
          ( 
          from  
          John Carpenter's  
          "They Live,"  
          natch) 
         
 
      
 
     
   

 
          num  
          =  
          0 
         
          for  
          document, label, predicted_label  
          in  
          zip 
          (test_corpus, test_labels, svm_tfidf_predictions): 
         
          if  
          label  
          = 
          =  
          18  
          and  
          predicted_label  
          = 
          =  
          16 
          : 
         
          print 
          ( 
          'Actual Label:' 
          , class_names[label]) 
         
          print 
          ( 
          'Predicted Label:' 
          , class_names[predicted_label]) 
         
          print 
          ( 
          'Document:-' 
          ) 
         
          print 
          (re.sub( 
          '\n' 
          ,  
          ' ' 
          , document)) 
         
          print 
          () 
         
          num  
          + 
          =  
          1 
         
          if  
          num  
          = 
          =  
          4 
          : 
         
          break

打印结果：

 
          Actual Label: talk.politics.misc 
         
 
          Predicted Label: talk.politics.guns 
         
 
          Document: 
          - 
         
 
          After the initial gun battle was over, they had  
          50  
          days to come out peacefully. They had their high priced lawyer,  
          and  
          judging by the posts here they had some public support. Can anyone come up with a rational explanation why the didn 
          't come out (even after they negotiated coming out after the radio sermon) that doesn' 
          t include the Davidians wanting to commit suicide 
          / 
          murder 
          / 
          general mayhem? 
         

             
         
 
          Actual Label: talk.politics.misc 
         
 
          Predicted Label: talk.politics.guns 
         
 
          Document: 
          - 
         
 
          Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire.  Today 
          's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies.  At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today' 
          s paper quotes the government as saying, no, they didn 
          't have a license.  Today' 
          s paper reports that a number of the bodies were found with shoulder weapons  
          next  
          to them, as  
          if  
          they had been using them  
          while  
          dying  
          - 
          -  
          which doesn 
          't sound like the sort of action I would expect from a suicide.  Our government lies, as it tries to cover over its incompetence and negligence.  Why should I believe the FBI' 
          s claims about anything  
          else 
          , when we can see that they are LYING?  This system of government  
          is  
          beyond reform. 
         

             
         
 
          Actual Label: talk.politics.misc 
         
 
          Predicted Label: talk.politics.guns 
         
 
          Document: 
          - 
         
 
             
          Well,  
          for  
          one thing most,  
          if  
          not  
          all  
          the Dividians (depending on whether they could show they acted  
          in  
          self 
          - 
          defense  
          and  
          there were no illegal weapons), could have gone on with their life as they were living it. No one was forcing them to give up their religion  
          or  
          even their legal weapons. The Dividians had survived a change  
          in  
          leadership before so even  
          if  
          Koresch himself would have been convicted  
          and  
          sent to jail, they still could have carried on.   I don 
          't think the Dividians were insane, but I don' 
          t see a reason  
          for  
          mass suicide ( 
          if  
          the fire was intentional  
          set  
          by some of the Dividians.) We also don 
          't know that, if the fire was intentionally set from inside, was it a generally know plan or was this something only an inner circle knew about, or was it something two or three felt they had to do with or without Koresch' 
          s knowledge 
          / 
          blessing, etc.? I don't know much about Masada. Were some people throwing others over? Did mothers jump over with their babies  
          in  
          their arms? 
         

             
         
 
          Actual Label: talk.politics.misc 
         
 
          Predicted Label: talk.politics.guns 
         
 
          Document: 
          - 
         
 
          [email protected] (Russ Anderson) writes...      The fact  
          is  
          that Koresh  
          and  
          his followers involved themselves    
          in  
          a gun battle to control the Mt Carmel  
          complex 
          . That  
          is  
          not    
          in  
          dispute. From what I remember of the trial, the authories    couldn 
          't reasonably establish who fired first, the big reason   behind the aquittal.                    _____  _____                   \\\\\\/ ___/___________________   Mitchell S Todd  \\\\/ /                 _____/__________________________ ________________    \\/ / mst4298@zeus._____/.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '_' 
          _ 
          '_/ \_____        \__    / / tamu.edu  _____/.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          . 
          '.' 
          _ 
          '_/     \__________\__  / /        _____/_' 
          _ 
          '_' 
          _ 
          '_' 
          _ 
          '_' 
          _ 
          '_' 
          _ 
          '_' 
          _ 
          '_' 
          _'_ 
          /                  
          \_  
          /  
          / 
          __________ 
          /                   
          \ 
          / 
          ____ 
          / 
          \\\\\\              \\\\\\ 
         

可以看到是如何分析和查看错误分类的文档的，然后回到前面步骤，调整优化特征提取方法，通过删除特征的单词或调整单词权重来减少或增加影响程度。

7.建立一个多类分类系统

7.建立一个多类分类系统

猜你喜欢