Text Document Classification in Practice (Feature Extraction with Hashing / Weight (TF-IDF) Encoding + Chi-Square Filtering + Building a Neural Network Classifier)



If there are any mistakes in the content below, please point them out.


I have recently been studying feature dimensionality reduction and feature extraction in sklearn; see my previous posts for details. As mentioned there, a text can be encoded as a term-frequency vector, a weight (TF-IDF) vector, or a hashed vector. The last of these, hash encoding, can be understood as feature extraction combined with dimensionality reduction.

To be clear, I have not yet worked on any NLP tasks, nor have I used sequential networks such as RNNs or LSTMs. But now that a text (which is also just a sample) can be encoded as a vector, we can combine it with the machine learning classifiers and deep learning models covered earlier, train a classifier on that feature vector, and thereby carry out the task of text classification.

So not knowing how to build an RNN or LSTM does not matter: as long as the text features can be extracted by other means, that is enough, and those features can be used for the next step. This is like C3D in video understanding, which encodes a video into a 4096-dimensional feature vector that is then used for downstream tasks. In the same way, once a text is encoded as a hashed vector, a classifier can be run on it to perform the NLP task of text classification.

With the theory out of the way, let's verify it in practice. For the hash-encoded feature vectors, I tested classification with an SVM, a random forest, and a hand-built neural network. I also tested the weight (TF-IDF) encoding with an SVM and looked at which features influence the classification result. Finally, an official example is attached at the end for reference.

The dataset used here is one bundled with sklearn: a news classification dataset with 20 classes and 18,846 samples, of which 11,314 are training samples. The class labels are:

['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Four of these categories are selected for the classification task: "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space".


1. Text Vector Extraction

1.1 Loading the Dataset

from sklearn.datasets import fetch_20newsgroups

categories = [ "alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space",]
remove = ("headers", "footers", "quotes")

data_train = fetch_20newsgroups(
    data_home='./dataset/', subset='train', categories=categories, remove=remove, random_state=42
)
data_test = fetch_20newsgroups(
    data_home='./dataset/', subset='test', categories=categories, remove=remove, random_state=42
)

data_train.target_names, data_test.target_names
# Output:
# (['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'],
# ['alt.atheism', 'comp.graphics', 'sci.space', 'talk.religion.misc'])

y_train, y_test = data_train.target, data_test.target

PS: the remove argument here strips each file's headers, signature blocks, and quoted blocks, which makes the data more realistic. If they are not removed, classifiers end up overfitting on a lot of incidental content:

  • Almost every group can be distinguished by how frequently headers such as NNTP-Posting-Host: and Distribution: appear.
  • Another significant feature is whether the sender is affiliated with a university, as indicated by the headers or the signature.
  • The word "article" is a significant feature, based on how often people quote previous posts in the form: "In article [article ID], [name] <[e-mail address]> wrote:"
  • Other features match the names and e-mail addresses of particular people who were posting at the time.

With such an abundance of clues that distinguish the newsgroups, the classifiers barely need to identify the topic from the text at all, and they all perform at the same high level. That is why ('headers', 'footers', 'quotes') are removed here.
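
As a quick comparison (my own sketch, not in the original notes), printing the same post with and without the metadata removed makes the effect of remove visible; the category chosen here is just an example:

from sklearn.datasets import fetch_20newsgroups

raw = fetch_20newsgroups(data_home='./dataset/', subset='train',
                         categories=['sci.space'], remove=())
clean = fetch_20newsgroups(data_home='./dataset/', subset='train',
                           categories=['sci.space'],
                           remove=('headers', 'footers', 'quotes'))

# With the default shuffling, both loads are in the same order, so index 0
# refers to the same post; the raw version typically starts with "From:" /
# "Subject:" header lines, while the cleaned version keeps only the body text.
print(raw.data[0][:300])
print('-' * 40)
print(clean.data[0][:300])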

1.2 Hash Encoding

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = HashingVectorizer(stop_words="english", alternate_sign=False, n_features=2 ** 10)
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)
print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))

# Output:
# X_train.shape:(2034, 1024), X_test.shape:(1353, 1024)

1.3 Chi-Square Filtering

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2

select_kbest = SelectKBest(chi2, k=200)
X_train = select_kbest.fit_transform(X_train, y_train)
X_test = select_kbest.transform(X_test)
print("X_train.shape:{}, X_test.shape:{}".format(X_train.shape, X_test.shape))

# Output:
# X_train.shape:(2034, 200), X_test.shape:(1353, 200)
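
As a quick sanity check (my own addition), SelectKBest can report which of the 1024 hash buckets survived the chi-square filter and how they scored:

# Indices of the 200 columns kept by the chi-square filter
kept = select_kbest.get_support(indices=True)
print(len(kept), kept[:10])

# chi2 scores of the kept columns; higher means more class-dependent
print(select_kbest.scores_[kept][:10])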

For details on hash encoding and chi-square filtering, see my two earlier notes:

1. Summary of sklearn feature extraction methods (dictionary, text, and image feature extraction)

2. Summary of sklearn dimensionality reduction methods (variance filtering, chi-square, F-test, mutual information, embedded methods)


2. Training Machine Learning Models

2.1 Testing an SVM Model

from time import time
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

t0 = time()
clf = LinearSVC(dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)
    
t0 = time()    
pred = clf.predict(X_test)
test_time = time() - t0
print("test time:  %0.3fs" % test_time)

test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy:   %0.3f" % test_score, 
      "train accuracy:   %0.3f" % train_score)

Output:

train time: 0.020s
test time:  0.001s
test accuracy:   0.662 train accuracy:   0.775

2.2 Testing a Random Forest Model

from time import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

t0 = time()
clf = RandomForestClassifier(verbose=1, random_state=42)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()    
pred = clf.predict(X_test)
test_time = time() - t0
print("test time:  %0.3fs" % test_time)

test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy:   %0.3f" % test_score, 
      "train accuracy:   %0.3f" % train_score)

Output:

train time: 0.466s
test time:  0.032s
test accuracy:   0.617 train accuracy:   0.964

3. Building and Testing a Neural Network

In essence, text classification is mainly a matter of encoding the text into features, and that step has already been completed through hash encoding and chi-square filtering. Each text is now encoded as a 200-dimensional feature vector, so a neural network can be built and trained directly on these features.

So below I simply build a multilayer perceptron for training: just three fully connected layers, nothing specially designed (honestly, I wouldn't know how to design anything fancier anyway...).

3.1 Building the Network

import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, embedding=200, n_class=4):
        super().__init__()
        
        self.model = nn.Sequential(
            nn.Linear(embedding, 256),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(256, 512),
            nn.Dropout(p=0.6),
            nn.ReLU(),
            nn.Linear(512, n_class),
        )
    
    def forward(self, x):
        return self.model(x)

3.2 Training the Network

  • Data format conversion

Before training, the data needs to be converted to the right format. The matrix produced by the chi-square filter is a sparse matrix, so it has to be converted to a NumPy array with .toarray(), e.g. X_train = X_train.toarray().

As for exactly which formats are needed afterwards, just follow the error messages where they occur, or simply use the conversions below, which should work as-is.

import torch

X_train = torch.tensor(X_train.toarray()).float()
X_test = torch.tensor(X_test.toarray()).float()
y_train = torch.tensor(y_train).long()
y_test = torch.tensor(y_test).long()

  • Training the network

After the format conversion, the network can be trained with the reference code below:

import torch
from torch import optim
from time import time
from sklearn.metrics import accuracy_score

# Set hyperparameters
epochsize = 500
learning_rate = 1e-3
best_acc = 0

model = MLP()
criteon = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

# Training
t0 = time()
for epoch in range(epochsize):
    model.train()
    
    # Forward pass and loss
    pred = model(X_train)
    loss = criteon(pred, y_train)
    
    # Backpropagation
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    # Track test accuracy during training
    # print(epoch, 'loss:', loss.item())
    
    model.eval()
    with torch.no_grad():
        category = model(X_test)
        pred = category.argmax(dim=1)
        score = accuracy_score(y_test, pred)
    
    if best_acc < score:
        best_acc = score
        
train_time = time() - t0

print("_" * 80)
print("best acc:{}".format(best_acc))
print("train time: %0.3fs" % train_time)

Output:

________________________________________________________________________________
best acc:0.6688839615668883
train time: 2.777s

In the end, the neural network result looks about the same as the SVM: whether with the ensemble method, the support vector machine, or the neural network, the final accuracy is around 66%.


4. Weight (TF-IDF) Vector Encoding

For text classification, if you want to know which word features matter most when classifying a text, you can combine weight (TF-IDF) vector encoding with a classifier that assigns per-feature weights (such as a linear SVM).

4.1 Getting the Indices of the N Largest Values in a List

This short section introduces a trick, namely how to use a library for this; see reference 3 for the source.

  • For lists without duplicate values
import heapq

n = 3                          # number of top values to take (example value; not defined in the original)
lis = [2, 4, 5, 1, 7]
re1 = map(lis.index, heapq.nlargest(n, lis))  # indices of the n largest values (nsmallest for smallest, nlargest for largest)
re2 = heapq.nlargest(n, lis)                  # the n largest elements themselves
print(list(re1))  # re1 is a map object, not a list, so wrap it in list() before printing
print(re2) 
  • For lists with duplicate values
import heapq

lis = [2, 4, 4, 1, 0]
n = 3                          # number of top values to take (example value; not defined in the original)
max_number = heapq.nlargest(n, lis)
max_index = []
for t in max_number:
    index = lis.index(t)
    max_index.append(index)
    lis[index] = 0             # zero out the matched value so duplicates map to distinct indices
    
print(max_number)
print(max_index)

4.2 Viewing the Names of the N Most Informative Features

Note that the vectorizer and clf used below are the TfidfVectorizer and the LinearSVC trained on it in section 4.3 (which is why clf.coef_ has shape (4, 26576)); the hash-encoded features have no feature names to recover.

  • np.argsort implementation
import numpy as np

# Show the 10 most informative feature words per class
def show_top10(classifier, vectorizer, categories):
    # Recover the feature names from the vectorizer
    # vectorizer.get_feature_names_out(): returns them as an array
    # vectorizer.vocabulary_: returns them as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # classifier.coef_.shape: (4, 26576), i.e. one weight per class per word
        # np.argsort returns the indices that would sort the weights, so sort first, then take the last 10
        top10 = np.argsort(classifier.coef_[i])[-10:]
        print("%s::  %s" % (category, " ".join(feature_names[top10])))
        
show_top10(clf, vectorizer, data_train.target_names)

Output:

alt.atheism::  nanci islamic deletion motto islam atheist bobby atheists religion atheism
comp.graphics::  card images 42 looking hi computer 3d file image graphics
sci.space::  flight mars solar moon shuttle spacecraft launch nasa orbit space
talk.religion.misc::  commandment koresh blood jesus children rosicrucian christ fbi christians christian

The code above comes from reference 4.

  • heapq implementation
import numpy as np
import heapq

# Show the top-k most informative feature words per class
def show_top10(classifier, vectorizer, categories, top_k=10):
    # Recover the feature names from the vectorizer
    # vectorizer.get_feature_names_out(): returns them as an array
    # vectorizer.vocabulary_: returns them as a dict
    feature_names = vectorizer.get_feature_names_out()
    for i, category in enumerate(categories):
        # Use heapq to get the indices of the top-k largest weights
        top = map(list(classifier.coef_[i]).index, heapq.nlargest(top_k, classifier.coef_[i]))
        top = list(top)
        print("%s::  %s" % (category, " ".join(feature_names[top])))
        
show_top10(clf, vectorizer, data_train.target_names)

Output:

alt.atheism::  atheism religion atheists bobby atheist islam motto deletion islamic nanci
comp.graphics::  graphics image file 3d computer hi looking 42 images card
sci.space::  space orbit nasa launch spacecraft shuttle moon solar mars flight
talk.religion.misc::  christian christians fbi christ rosicrucian children jesus blood koresh commandment

As you can see, the two methods give consistent results; the only difference is that the second one lists the words in descending order of weight (a small demo of the two orderings follows below). As for the output itself, the larger a word's weight, the more strongly it is associated with that class.
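
As a toy illustration (my own addition, with a made-up weight vector), the two index-selection approaches pick the same elements and differ only in ordering:

import heapq
import numpy as np

w = np.array([0.2, 0.9, 0.1, 0.7, 0.5])   # toy weight vector

top3_argsort = np.argsort(w)[-3:]          # ascending by weight: [4, 3, 1]
top3_heapq = [list(w).index(v) for v in heapq.nlargest(3, w)]  # descending: [1, 3, 4]

print(top3_argsort, top3_argsort[::-1], top3_heapq)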

4.3 Text Classification with Weight (TF-IDF) Encoding

The reference code is as follows:

from time import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Weight (TF-IDF) vector encoding
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
X_train = vectorizer.fit_transform(data_train.data)
X_test = vectorizer.transform(data_test.data)

# X_train.shape, X_test.shape: ((2034, 26576), (1353, 26576))
y_train, y_test = data_train.target, data_test.target

# Build and train the classifier
t0 = time()
clf = LinearSVC(penalty='l2', dual=False, tol=1e-3)
clf.fit(X_train, y_train)
train_time = time() - t0
print("train time: %0.3fs" % train_time)

t0 = time()
pred = clf.predict(X_test)
test_time = time() - t0
print("test time: %0.3fs" % test_time)

# Check the training results
test_score = accuracy_score(y_test, pred)
train_score = accuracy_score(y_train, clf.predict(X_train))
print("test accuracy:   %0.3f" % test_score, 
      "train accuracy:   %0.3f" % train_score)

Output:

train time: 0.247s
test time: 0.002s
test accuracy:   0.780 train accuracy:   0.978

Analysis: the weight (TF-IDF) encoding performs noticeably better than the hash encoding. Hash encoding is essentially a dimensionality reduction method: it shortens training time, but it also loses some information (distinct words can collide into the same hash bucket, as the short sketch below illustrates), so its classification accuracy is a bit lower while its training time is a bit shorter.
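
A minimal collision sketch (my own addition, with toy documents): with norm=None and alternate_sign=False, each column of the hashed matrix holds the total count of the tokens that hashed into that bucket, so a tiny hash space almost certainly merges distinct words, while the 2**10 space used above usually keeps them apart.

from sklearn.feature_extraction.text import HashingVectorizer

docs = ["space shuttle orbit launch", "graphics image file computer"]

# Tiny hash space: entries greater than 1 mean several distinct words
# landed in the same bucket (a collision), which is very likely here.
tiny = HashingVectorizer(n_features=4, alternate_sign=False, norm=None)
print(tiny.transform(docs).toarray())

# Larger hash space, as used in section 1.2: collisions become rare,
# so each document usually keeps one non-zero bucket per distinct word.
large = HashingVectorizer(n_features=2 ** 10, alternate_sign=False, norm=None)
print(large.transform(docs).nnz)   # typically 8, one per word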


5. Classifying Text Documents Using Sparse Features

Here is the official example code, kept for my own study; see reference 5 for details. Its accompanying explanations are included below.

This is an example showing how to use scikit-learn to classify documents by topic with a bag-of-words approach. The example uses scipy.sparse matrices to store the features and demonstrates various classifiers that can handle sparse matrices efficiently.

The dataset used in this example is the 20 newsgroups dataset. It will be downloaded automatically and then cached.

5.1 Parameter Setup

# Author: Peter Prettenhofer <[email protected]>
#         Olivier Grisel <[email protected]>
#         Mathieu Blondel <[email protected]>
#         Lars Buitinck
# License: BSD 3 clause

import logging
import numpy as np
from optparse import OptionParser
import sys
from time import time
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import RidgeClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import Perceptron
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.naive_bayes import BernoulliNB, ComplementNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neighbors import NearestCentroid
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.extmath import density
from sklearn import metrics


# Display progress logs on stdout
logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")

op = OptionParser()
op.add_option(
    "--report",
    action="store_true",
    dest="print_report",
    help="Print a detailed classification report.",
)
op.add_option(
    "--chi2_select",
    action="store",
    type="int",
    dest="select_chi2",
    help="Select some number of features using a chi-squared test",
)
op.add_option(
    "--confusion_matrix",
    action="store_true",
    dest="print_cm",
    help="Print the confusion matrix.",
)
op.add_option(
    "--top10",
    action="store_true",
    dest="print_top10",
    help="Print ten most discriminative terms per class for every classifier.",
)
op.add_option(
    "--all_categories",
    action="store_true",
    dest="all_categories",
    help="Whether to use all categories or not.",
)
op.add_option("--use_hashing", action="store_true", help="Use a hashing vectorizer.")
op.add_option(
    "--n_features",
    action="store",
    type=int,
    default=2 ** 16,
    help="n_features when using the hashing vectorizer.",
)
op.add_option(
    "--filtered",
    action="store_true",
    help=(
        "Remove newsgroup information that is easily overfit: "
        "headers, signatures, and quoting."
    ),
)


def is_interactive():
    return not hasattr(sys.modules["__main__"], "__file__")


# work-around for Jupyter notebook and IPython console
argv = [] if is_interactive() else sys.argv[1:]
(opts, args) = op.parse_args(argv)
if len(args) > 0:
    op.error("this script takes no arguments.")
    sys.exit(1)

print(__doc__)
op.print_help()
print()

Out:

Usage: plot_document_classification_20newsgroups.py [options]

Options:
  -h, --help            show this help message and exit
  --report              Print a detailed classification report.
  --chi2_select=SELECT_CHI2
                        Select some number of features using a chi-squared
                        test
  --confusion_matrix    Print the confusion matrix.
  --top10               Print ten most discriminative terms per class for
                        every classifier.
  --all_categories      Whether to use all categories or not.
  --use_hashing         Use a hashing vectorizer.
  --n_features=N_FEATURES
                        n_features when using the hashing vectorizer.
  --filtered            Remove newsgroup information that is easily overfit:
                        headers, signatures, and quoting.

5.2 Loading Data from the Training Set

Let's load data from the newsgroups dataset, which comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and one for testing (or performance evaluation).

if opts.all_categories:
    categories = None
else:
    categories = [
        "alt.atheism",
        "talk.religion.misc",
        "comp.graphics",
        "sci.space",
    ]

if opts.filtered:
    remove = ("headers", "footers", "quotes")
else:
    remove = ()

print("Loading 20 newsgroups dataset for categories:")
print(categories if categories else "all")

data_train = fetch_20newsgroups(
    subset="train", categories=categories, shuffle=True, random_state=42, remove=remove
)

data_test = fetch_20newsgroups(
    subset="test", categories=categories, shuffle=True, random_state=42, remove=remove
)
print("data loaded")

# order of labels in `target_names` can be different from `categories`
target_names = data_train.target_names


def size_mb(docs):
    return sum(len(s.encode("utf-8")) for s in docs) / 1e6


data_train_size_mb = size_mb(data_train.data)
data_test_size_mb = size_mb(data_test.data)

print(
    "%d documents - %0.3fMB (training set)" % (len(data_train.data), data_train_size_mb)
)
print("%d documents - %0.3fMB (test set)" % (len(data_test.data), data_test_size_mb))
print("%d categories" % len(target_names))
print()

# split a training set and a test set
y_train, y_test = data_train.target, data_test.target

print("Extracting features from the training data using a sparse vectorizer")
t0 = time()
if opts.use_hashing:
    vectorizer = HashingVectorizer(
        stop_words="english", alternate_sign=False, n_features=opts.n_features
    )
    X_train = vectorizer.transform(data_train.data)
else:
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words="english")
    X_train = vectorizer.fit_transform(data_train.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_train_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_train.shape)
print()

print("Extracting features from the test data using the same vectorizer")
t0 = time()
X_test = vectorizer.transform(data_test.data)
duration = time() - t0
print("done in %fs at %0.3fMB/s" % (duration, data_test_size_mb / duration))
print("n_samples: %d, n_features: %d" % X_test.shape)
print()

# mapping from integer feature name to original token string
if opts.use_hashing:
    feature_names = None
else:
    feature_names = vectorizer.get_feature_names_out()

if opts.select_chi2:
    print("Extracting %d best features by a chi-squared test" % opts.select_chi2)
    t0 = time()
    ch2 = SelectKBest(chi2, k=opts.select_chi2)
    X_train = ch2.fit_transform(X_train, y_train)
    X_test = ch2.transform(X_test)
    if feature_names is not None:
        # keep selected feature names
        feature_names = feature_names[ch2.get_support()]
    print("done in %fs" % (time() - t0))
    print()


def trim(s):
    """Trim string to fit on terminal (assuming 80-column display)"""
    return s if len(s) <= 80 else s[:77] + "..."

Out:

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc', 'comp.graphics', 'sci.space']
data loaded
2034 documents - 3.980MB (training set)
1353 documents - 2.867MB (test set)
4 categories

Extracting features from the training data using a sparse vectorizer
done in 0.383082s at 10.388MB/s
n_samples: 2034, n_features: 33809

Extracting features from the test data using the same vectorizer
done in 0.236998s at 12.099MB/s
n_samples: 1353, n_features: 33809

5.3 Building the Classifiers

Train and test the dataset with 15 different classification models and obtain performance results for each.

def benchmark(clf):
    print("_" * 80)
    print("Training: ")
    print(clf)
    t0 = time()
    clf.fit(X_train, y_train)
    train_time = time() - t0
    print("train time: %0.3fs" % train_time)

    t0 = time()
    pred = clf.predict(X_test)
    test_time = time() - t0
    print("test time:  %0.3fs" % test_time)

    score = metrics.accuracy_score(y_test, pred)
    print("accuracy:   %0.3f" % score)

    if hasattr(clf, "coef_"):
        print("dimensionality: %d" % clf.coef_.shape[1])
        print("density: %f" % density(clf.coef_))

        if opts.print_top10 and feature_names is not None:
            print("top 10 keywords per class:")
            for i, label in enumerate(target_names):
                top10 = np.argsort(clf.coef_[i])[-10:]
                print(trim("%s: %s" % (label, " ".join(feature_names[top10]))))
        print()

    if opts.print_report:
        print("classification report:")
        print(metrics.classification_report(y_test, pred, target_names=target_names))

    if opts.print_cm:
        print("confusion matrix:")
        print(metrics.confusion_matrix(y_test, pred))

    print()
    clf_descr = str(clf).split("(")[0]
    return clf_descr, score, train_time, test_time


results = []
for clf, name in (
    (RidgeClassifier(tol=1e-2, solver="sag"), "Ridge Classifier"),
    (Perceptron(max_iter=50), "Perceptron"),
    (PassiveAggressiveClassifier(max_iter=50), "Passive-Aggressive"),
    (KNeighborsClassifier(n_neighbors=10), "kNN"),
    (RandomForestClassifier(), "Random forest"),
):
    print("=" * 80)
    print(name)
    results.append(benchmark(clf))

for penalty in ["l2", "l1"]:
    print("=" * 80)
    print("%s penalty" % penalty.upper())
    # Train Liblinear model
    results.append(benchmark(LinearSVC(penalty=penalty, dual=False, tol=1e-3)))

    # Train SGD model
    results.append(benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty=penalty)))

# Train SGD with Elastic Net penalty
print("=" * 80)
print("Elastic-Net penalty")
results.append(
    benchmark(SGDClassifier(alpha=0.0001, max_iter=50, penalty="elasticnet"))
)

# Train NearestCentroid without threshold
print("=" * 80)
print("NearestCentroid (aka Rocchio classifier)")
results.append(benchmark(NearestCentroid()))

# Train sparse Naive Bayes classifiers
print("=" * 80)
print("Naive Bayes")
results.append(benchmark(MultinomialNB(alpha=0.01)))
results.append(benchmark(BernoulliNB(alpha=0.01)))
results.append(benchmark(ComplementNB(alpha=0.1)))

print("=" * 80)
print("LinearSVC with L1-based feature selection")
# The smaller C, the stronger the regularization.
# The more regularization, the more sparsity.
results.append(
    benchmark(
        Pipeline(
            [
                (
                    "feature_selection",
                    SelectFromModel(LinearSVC(penalty="l1", dual=False, tol=1e-3)),
                ),
                ("classification", LinearSVC(penalty="l2")),
            ]
        )
    )
)

Out:

================================================================================
Ridge Classifier
________________________________________________________________________________
Training:
RidgeClassifier(solver='sag', tol=0.01)
/home/circleci/project/sklearn/linear_model/_ridge.py:729: UserWarning: "sag" solver requires many iterations to fit an intercept with sparse inputs. Either set the solver to "auto" or "sparse_cg", or set a low "tol" and a high "max_iter" (especially if inputs are not standardized).
  warnings.warn(
train time: 0.167s
test time:  0.001s
accuracy:   0.898
dimensionality: 33809
density: 1.000000


================================================================================
Perceptron
________________________________________________________________________________
Training:
Perceptron(max_iter=50)
train time: 0.015s
test time:  0.001s
accuracy:   0.888
dimensionality: 33809
density: 0.255302


================================================================================
Passive-Aggressive
________________________________________________________________________________
Training:
PassiveAggressiveClassifier(max_iter=50)
train time: 0.027s
test time:  0.001s
accuracy:   0.902
dimensionality: 33809
density: 0.711867


================================================================================
kNN
________________________________________________________________________________
Training:
KNeighborsClassifier(n_neighbors=10)
train time: 0.001s
test time:  0.148s
accuracy:   0.858

================================================================================
Random forest
________________________________________________________________________________
Training:
RandomForestClassifier()
train time: 1.258s
test time:  0.079s
accuracy:   0.826

================================================================================
L2 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, tol=0.001)
train time: 0.072s
test time:  0.001s
accuracy:   0.900
dimensionality: 33809
density: 1.000000


________________________________________________________________________________
Training:
SGDClassifier(max_iter=50)
train time: 0.024s
test time:  0.001s
accuracy:   0.903
dimensionality: 33809
density: 0.579424


================================================================================
L1 penalty
________________________________________________________________________________
Training:
LinearSVC(dual=False, penalty='l1', tol=0.001)
train time: 0.176s
test time:  0.001s
accuracy:   0.873
dimensionality: 33809
density: 0.005553


________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='l1')
train time: 0.092s
test time:  0.002s
accuracy:   0.880
dimensionality: 33809
density: 0.022509


================================================================================
Elastic-Net penalty
________________________________________________________________________________
Training:
SGDClassifier(max_iter=50, penalty='elasticnet')
train time: 0.134s
test time:  0.001s
accuracy:   0.901
dimensionality: 33809
density: 0.184685


================================================================================
NearestCentroid (aka Rocchio classifier)
________________________________________________________________________________
Training:
NearestCentroid()
train time: 0.004s
test time:  0.002s
accuracy:   0.855

================================================================================
Naive Bayes
________________________________________________________________________________
Training:
MultinomialNB(alpha=0.01)
train time: 0.003s
test time:  0.001s
accuracy:   0.899
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000


________________________________________________________________________________
Training:
BernoulliNB(alpha=0.01)
train time: 0.005s
test time:  0.004s
accuracy:   0.884
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000


________________________________________________________________________________
Training:
ComplementNB(alpha=0.1)
train time: 0.003s
test time:  0.001s
accuracy:   0.911
/home/circleci/project/sklearn/utils/deprecation.py:103: FutureWarning: Attribute `coef_` was deprecated in version 0.24 and will be removed in 1.1 (renaming of 0.26).
  warnings.warn(msg, category=FutureWarning)
dimensionality: 33809
density: 1.000000


================================================================================
LinearSVC with L1-based feature selection
________________________________________________________________________________
Training:
Pipeline(steps=[('feature_selection',
                 SelectFromModel(estimator=LinearSVC(dual=False, penalty='l1',
                                                     tol=0.001))),
                ('classification', LinearSVC())])
train time: 0.192s
test time:  0.002s
accuracy:   0.879

5.4 Visualization

Bar charts show each classifier's accuracy, training time (normalized), and test time (normalized).

indices = np.arange(len(results))

results = [[x[i] for x in results] for i in range(4)]

clf_names, score, training_time, test_time = results
training_time = np.array(training_time) / np.max(training_time)
test_time = np.array(test_time) / np.max(test_time)

plt.figure(figsize=(12, 8))
plt.title("Score")
plt.barh(indices, score, 0.2, label="score", color="navy")
plt.barh(indices + 0.3, training_time, 0.2, label="training time", color="c")
plt.barh(indices + 0.6, test_time, 0.2, label="test time", color="darkorange")
plt.yticks(())
plt.legend(loc="best")
plt.subplots_adjust(left=0.25)
plt.subplots_adjust(top=0.95)
plt.subplots_adjust(bottom=0.05)

for i, c in zip(indices, clf_names):
    plt.text(-0.3, i, c)

plt.show()

Out: (bar chart comparing each classifier's accuracy, normalized training time, and normalized test time)


References:

1. Summary of sklearn feature extraction methods (dictionary, text, and image feature extraction)

2. Summary of sklearn dimensionality reduction methods (variance filtering, chi-square, F-test, mutual information, embedded methods)

3. Getting the indices of the N largest values in a list in Python

4. The 20 newsgroups text dataset

5. Classification of text documents using sparse features

Reposted from juejin.im/post/7095005618074812424