7.建立一个多类分类系统
从规模化到特征提取、建模和评估,已经完成了简历分类系统的全部必要的步骤。现在将所有的东西组装在一起,应用到真实数据上以建立一个分类文本分类系统。对于此工作,将使用 scikit-learn 下载的 20 个新闻组数据集。这 20 个新闻组数据集包括分散在 20 个不同类别或主题的 18000 个新闻组帖子,这就构建了 20 类分类问题!请记住类的数量越多,尝试建立正确分类器就越复杂或者越困难。为防止模型因为文件头或者邮件地址而过拟合或泛化能力不强,体检的做法是从文档中去除文件头、文件尾和引用,因此需要确保考虑到了这一点。对于去除上述三项内容后的空文档或没用内容的文档,也将给予剔除,因为尝试从空文档中提取特征是毫无意义的。
开始下载所需的数据集以及为建立训练和测试数据集所用的函数:
from
sklearn.datasets
import
fetch_20newsgroups
## 文档中使用的模块在高版本中会被剔除,根据提示替换模块解决问题
# from sklearn.cross_validation import train_test_split
from
sklearn.model_selection
import
train_test_split
def
get_data():
data
=
fetch_20newsgroups(subset
=
'all'
,
shuffle
=
True
,
remove
=
(
'headers'
,
'footers'
,
'quotes'
))
return
data
def
prepare_datasets(corpus, labels, test_data_proportion
=
0.3
):
train_X, test_X, train_Y, test_Y
=
train_test_split(corpus, labels,
test_size
=
0.33
, random_state
=
42
)
return
train_X, test_X, train_Y, test_Y
def
remove_empty_docs(corpus, labels):
filtered_corpus
=
[]
filtered_labels
=
[]
for
doc, label
in
zip
(corpus, labels):
if
doc.strip():
filtered_corpus.append(doc)
filtered_labels.append(label)
return
filtered_corpus, filtered_labels
|
现在已经获得了数据,查看了数据集中分类的数量,使用下面的代码将数据集分为测试数据集和训练数据集。(下面代码执行下载数据集:)
In [
20
]: dataset
=
get_data()
...:
print
(dataset.target_names)
...:
Downloading
20news
dataset. This may take a few minutes.
Downloading dataset
from
https:
/
/
ndownloader.figshare.com
/
files
/
5975967
(
14
MB)
[
'alt.atheism'
,
'comp.graphics'
,
'comp.os.ms-windows.misc'
,
'comp.sys.ibm.pc.hardware'
,
'comp.sys.mac.hardware'
,
'comp.windows.x'
,
'misc.forsale'
,
'rec.autos'
,
'rec.motorcycles'
,
'rec.sport.baseball'
,
'rec.sport.hockey'
,
'sci.crypt'
,
'sci.electronics'
,
'sci.med'
,
'sci.space'
,
'soc.religion.christian'
,
'talk.politics.guns'
,
'talk.politics.mideast'
,
'talk.politics.misc'
,
'talk.religion.misc'
]
|
In [
21
]: corpus, labels
=
dataset.data, dataset.target
...: corpus, labels
=
remove_empty_docs(corpus, labels)
...:
...:
print
(
'Sample document:'
, corpus[
10
])
...:
print
(
'Class label:'
,labels[
10
])
...:
print
(
'Actual class label:'
, dataset.target_names[labels[
10
]])
...:
...:
Sample document: the blood of the lamb.
This will be a hard task, because most cultures used most animals
for
blood sacrifices. It has to be something related to our current
post
-
modernism state. Hmm, what about used computers?
Cheers,
Kent
Class label:
19
Actual
class
label: talk.religion.misc
|
train_corpus, test_corpus, train_labels, test_labels
=
prepare_datasets(corpus,
labels,
test_data_proportion
=
0.3
)
|
从上面的代码可以看到文档和标签的情况。每个文档拥有自己的标签类,这些标签是需要进行分类的 20 个主题之一。这些标签是数字形式的,如果需要,可以使用上面的代码容易地将它们映射回原来的类别名字。已经把数据分为训练数据集和测试数据集,测试数据集占总数据的 30%。将使用训练数据集建立模型,使用测试数据集测试模型的性能。下面的代码将使用前面建立的规范化模块对数据集进行规范化处理:
from
normalization
import
normalize_corpus
norm_train_corpus
=
normalize_corpus(train_corpus)
norm_test_corpus
=
normalize_corpus(test_corpus)
|
执行语句可能会耗费一段时间才能完成。
如果出现类似错误:
...
RuntimeError: generator raised StopIteration
|
请切换至 Python3.6 或更高版本
记住,语料库中每个文档进行规范化处理需要很多步骤,所以这将会耗费一些时间才能完成。完成文档规范化处理后,将使用前面建立的特征提取模块从文档中提取特征。将分别建立词袋模型、TF-IDF 模型、平均词向量模型和 TF-IDF 加权平均词向量模型,并比较它们的性能。
下面的代码基于不同技术提取必要的特征:
from
feature_extractors
import
bow_extractor, tfidf_extractor
from
feature_extractors
import
averaged_word_vectorizer
from
feature_extractors
import
tfidf_weighted_averaged_word_vectorizer
import
nltk
import
gensim
# bag of words features
bow_vectorizer, bow_train_features
=
bow_extractor(norm_train_corpus)
bow_test_features
=
bow_vectorizer.transform(norm_test_corpus)
# tfidf features
tfidf_vectorizer, tfidf_train_features
=
tfidf_extractor(norm_train_corpus)
tfidf_test_features
=
tfidf_vectorizer.transform(norm_test_corpus)
# tokenize documents
tokenized_train
=
[nltk.word_tokenize(text)
for
text
in
norm_train_corpus]
tokenized_test
=
[nltk.word_tokenize(text)
for
text
in
norm_test_corpus]
# build word2vec model
model
=
gensim.models.Word2Vec(tokenized_train,
size
=
500
,
window
=
100
,
min_count
=
30
,
sample
=
1e
-
3
)
# averaged word vector features
avg_wv_train_features
=
averaged_word_vectorizer(corpus
=
tokenized_train,
model
=
model,
num_features
=
500
)
avg_wv_test_features
=
averaged_word_vectorizer(corpus
=
tokenized_test,
model
=
model,
num_features
=
500
)
# tfidf weighted averaged word vector features
vocab
=
tfidf_vectorizer.vocabulary_
tfidf_wv_train_features
=
tfidf_weighted_averaged_word_vectorizer(corpus
=
tokenized_train,
tfidf_vectors
=
tfidf_train_features,
tfidf_vocabulary
=
vocab,
model
=
model,
num_features
=
500
)
tfidf_wv_test_features
=
tfidf_weighted_averaged_word_vectorizer(corpus
=
tokenized_test,
tfidf_vectors
=
tfidf_test_features,
tfidf_vocabulary
=
vocab,
model
=
model,
num_features
=
500
)
|
使用上面的特征提取器从文本文档中提取了全部必要的特征之后,基于前面讨论的四个指标,定义一个函数用来苹果分类模型,函数如下面代码段所示:
from
sklearn
import
metrics
import
numpy as np
def
get_metrics(true_labels, predicted_labels):
print
(
'Accuracy:'
, np.
round
(
metrics.accuracy_score(true_labels,
predicted_labels),
2
))
print
(
'Precision:'
, np.
round
(
metrics.precision_score(true_labels,
predicted_labels,
average
=
'weighted'
),
2
))
print
(
'Recall:'
, np.
round
(
metrics.recall_score(true_labels,
predicted_labels,
average
=
'weighted'
),
2
))
print
(
'F1 Score:'
, np.
round
(
metrics.f1_score(true_labels,
predicted_labels,
average
=
'weighted'
),
2
))
|
现在定义一个函数使用机器学习算法和训练数据来训练模型,使用训练的模型在测试数据上执行预测,接着使用上面的函数苹果模型预测性能:
def
train_predict_evaluate_model(classifier,
train_features, train_labels,
test_features, test_labels):
# build model
classifier.fit(train_features, train_labels)
# predict using model
predictions
=
classifier.predict(test_features)
# evaluate model prediction performance
get_metrics(true_labels
=
test_labels,
predicted_labels
=
predictions)
return
predictions
|
现在进入了 2 个机器学习算法,开始使用已经提取的特征建立模型。将使用前面提到的 scikit-learn 引入必要的分类算法,以节省花费在重写代码的时间和精力上:
from
sklearn.naive_bayes
import
MultinomialNB
from
sklearn.linear_model
import
SGDClassifier
mnb
=
MultinomialNB()
svm
=
SGDClassifier(loss
=
'hinge'
, n_iter
=
100
)
|
现在下面的代码将使用多项式朴素贝叶斯和支持向量机以及全部不同类型的特征进行模型训练、预测和评估:
# Multinomial Naive Bayes with bag of words features
mnb_bow_predictions
=
train_predict_evaluate_model(classifier
=
mnb,
train_features
=
bow_train_features,
train_labels
=
train_labels,
test_features
=
bow_test_features,
test_labels
=
test_labels)
|
Accuracy:
0.67
Precision:
0.72
Recall:
0.67
F1 Score:
0.65
|
# Support Vector Machine with bag of words features
svm_bow_predictions
=
train_predict_evaluate_model(classifier
=
svm,
train_features
=
bow_train_features,
train_labels
=
train_labels,
test_features
=
bow_test_features,
test_labels
=
test_labels)
|
Accuracy:
0.61
Precision:
0.67
Recall:
0.61
F1 Score:
0.62
|
# Multinomial Naive Bayes with tfidf features
mnb_tfidf_predictions
=
train_predict_evaluate_model(classifier
=
mnb,
train_features
=
tfidf_train_features,
train_labels
=
train_labels,
test_features
=
tfidf_test_features,
test_labels
=
test_labels)
|
Accuracy:
0.72
Precision:
0.78
Recall:
0.72
F1 Score:
0.7
|
# Support Vector Machine with tfidf features
svm_tfidf_predictions
=
train_predict_evaluate_model(classifier
=
svm,
train_features
=
tfidf_train_features,
train_labels
=
train_labels,
test_features
=
tfidf_test_features,
test_labels
=
test_labels)
|
Accuracy:
0.77
Precision:
0.77
Recall:
0.77
F1 Score:
0.77
|
# Support Vector Machine with averaged word vector features
svm_avgwv_predictions
=
train_predict_evaluate_model(classifier
=
svm,
train_features
=
avg_wv_train_features,
train_labels
=
train_labels,
test_features
=
avg_wv_test_features,
test_labels
=
test_labels)
|
Accuracy:
0.56
Precision:
0.58
Recall:
0.56
F1 Score:
0.56
|
# Support Vector Machine with tfidf weighted averaged word vector features
svm_tfidfwv_predictions
=
train_predict_evaluate_model(classifier
=
svm,
train_features
=
tfidf_wv_train_features,
train_labels
=
train_labels,
test_features
=
tfidf_wv_test_features,
test_labels
=
test_labels)
|
Accuracy:
0.53
Precision:
0.58
Recall:
0.53
F1 Score:
0.52
|
使用不同类型的特征建立了 6 个模型,使用测试数据评估了模型的性能。从上面的结果可以看到使用 TF-IDF 特征的 SVM 模型获得了最好的结果,准确率、精确率、召回率和 F1 score 均为 77%。可以建立 SVM TF-IDF 模型的混淆矩阵,以便了解模型性能不好的具体分类的情况:
import
pandas as pd
cm
=
metrics.confusion_matrix(test_labels, svm_tfidf_predictions)
pd.DataFrame(cm, index
=
range
(
0
,
20
), columns
=
range
(
0
,
20
))
|
Out[
47
]:
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
0
156
3
0
1
1
0
2
3
4
1
4
4
2
4
5
34
3
7
7
22
1
1
224
9
7
8
14
8
0
2
1
0
2
5
4
4
1
4
0
3
0
2
1
20
221
18
9
18
8
1
0
0
0
3
5
2
2
2
1
1
2
0
3
1
11
25
223
9
4
9
2
1
1
1
2
6
3
1
0
1
0
0
0
4
0
4
7
15
228
6
5
2
3
1
0
3
9
3
3
1
1
0
1
0
5
0
21
18
1
2
272
0
1
1
0
0
0
4
3
1
0
0
1
0
0
6
0
2
7
11
12
1
270
10
3
2
1
1
10
1
4
0
2
1
1
0
7
1
5
2
2
2
3
4
246
19
1
3
2
10
3
2
0
4
3
3
1
8
3
1
0
4
2
2
5
27
252
3
4
2
1
4
1
3
2
2
4
0
9
1
1
1
0
2
3
5
3
6
278
12
2
1
1
2
4
2
0
1
0
10
0
0
0
0
0
0
1
3
2
4
282
1
2
1
4
1
0
1
1
0
11
3
5
3
3
1
2
2
2
2
3
0
259
6
2
0
1
5
2
5
0
12
1
6
6
15
7
2
13
10
8
4
4
2
212
3
5
1
1
1
0
1
13
2
4
0
1
3
4
3
0
2
0
1
1
7
267
4
2
3
0
4
0
14
0
5
3
0
2
4
2
5
4
1
2
0
8
3
264
2
4
1
3
1
15
11
1
0
0
1
1
0
0
4
1
3
2
1
7
5
292
4
4
2
4
16
4
1
0
0
0
4
2
1
7
2
2
11
3
2
4
2
227
3
13
3
17
6
0
1
0
1
3
0
2
3
2
4
6
1
3
1
6
5
259
10
2
18
9
1
2
1
0
1
2
1
5
3
3
7
0
9
6
4
33
7
165
3
19
21
5
0
1
0
2
3
3
7
2
1
1
0
11
3
57
21
7
3
65
|
从上表混淆矩阵上,可以看到很多类标签为 0 的文档被错误地分类到类标签 15 里面,同样对于类标签 18 的很多文档被错误地分类到类标签 16 里面。很多类标签 19 的文档被错误地分类到类型标签 15 里面。打印类型名字,可以看到如下输出:
In [
48
]: class_names
=
dataset.target_names
...:
print
(class_names[
0
],
'->'
, class_names[
15
])
...:
print
(class_names[
18
],
'->'
, class_names[
16
])
...:
print
(class_names[
19
],
'->'
, class_names[
15
])
...:
...:
alt.atheism
-
> soc.religion.christian
talk.politics.misc
-
> talk.politics.guns
talk.religion.misc
-
> soc.religion.christian
|
从前面的输出可以看到错误分类与实际分类并没有显著的不同。Christian、religion 和 atheism 都是与商都和宗教存在有关的概念,可能会有相似的特征。杂项问题和强制都与政治有关,必然有相似的特征。可以使用下面的代码,进一步详细查看和分析被错误分类的问题:
import
re
num
=
0
for
document, label, predicted_label
in
zip
(test_corpus, test_labels, svm_tfidf_predictions):
if
label
=
=
0
and
predicted_label
=
=
15
:
print
(
'Actual Label:'
, class_names[label])
print
(
'Predicted Label:'
, class_names[predicted_label])
print
(
'Document:-'
)
print
(re.sub(
'\n'
,
' '
, document))
print
("")
num
+
=
1
if
num
=
=
4
:
break
|
打印结果:
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:
-
I would like a
list
of Bible contadictions
from
those of you who dispite being free
from
Christianity are well versed
in
the Bible.
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:
-
They spent quite a bit of time on the wording of the Constitution. They picked words whose meanings implied the intent. We have already looked
in
the dictionary to define the word. Isn
't this sufficient? But we were discussing it in relation to the death penalty. And, the Constitution need not define each of the words within. Anyone who doesn'
t know what cruel
is
can look
in
the dictionary (
and
we did).
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:
-
Our Lord
and
Savior David Keresh has risen! He has been seen alive! Spread the word!
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
-
Actual Label: alt.atheism
Predicted Label: soc.religion.christian
Document:
-
"This is your god"
(
from
John Carpenter's
"They Live,"
natch)
|
num
=
0
for
document, label, predicted_label
in
zip
(test_corpus, test_labels, svm_tfidf_predictions):
if
label
=
=
18
and
predicted_label
=
=
16
:
print
(
'Actual Label:'
, class_names[label])
print
(
'Predicted Label:'
, class_names[predicted_label])
print
(
'Document:-'
)
print
(re.sub(
'\n'
,
' '
, document))
print
()
num
+
=
1
if
num
=
=
4
:
break
|
打印结果:
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:
-
After the initial gun battle was over, they had
50
days to come out peacefully. They had their high priced lawyer,
and
judging by the posts here they had some public support. Can anyone come up with a rational explanation why the didn
't come out (even after they negotiated coming out after the radio sermon) that doesn'
t include the Davidians wanting to commit suicide
/
murder
/
general mayhem?
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:
-
Yesterday, the FBI was saying that at least three of the bodies had gunshot wounds, indicating that they were shot trying to escape the fire. Today
's paper quotes the medical examiner as saying that there is no evidence of gunshot wounds in any of the recovered bodies. At the beginning of this siege, it was reported that while Koresh had a class III (machine gun) license, today'
s paper quotes the government as saying, no, they didn
't have a license. Today'
s paper reports that a number of the bodies were found with shoulder weapons
next
to them, as
if
they had been using them
while
dying
-
-
which doesn
't sound like the sort of action I would expect from a suicide. Our government lies, as it tries to cover over its incompetence and negligence. Why should I believe the FBI'
s claims about anything
else
, when we can see that they are LYING? This system of government
is
beyond reform.
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:
-
Well,
for
one thing most,
if
not
all
the Dividians (depending on whether they could show they acted
in
self
-
defense
and
there were no illegal weapons), could have gone on with their life as they were living it. No one was forcing them to give up their religion
or
even their legal weapons. The Dividians had survived a change
in
leadership before so even
if
Koresch himself would have been convicted
and
sent to jail, they still could have carried on. I don
't think the Dividians were insane, but I don'
t see a reason
for
mass suicide (
if
the fire was intentional
set
by some of the Dividians.) We also don
't know that, if the fire was intentionally set from inside, was it a generally know plan or was this something only an inner circle knew about, or was it something two or three felt they had to do with or without Koresch'
s knowledge
/
blessing, etc.? I don't know much about Masada. Were some people throwing others over? Did mothers jump over with their babies
in
their arms?
Actual Label: talk.politics.misc
Predicted Label: talk.politics.guns
Document:
-
[email protected] (Russ Anderson) writes... The fact
is
that Koresh
and
his followers involved themselves
in
a gun battle to control the Mt Carmel
complex
. That
is
not
in
dispute. From what I remember of the trial, the authories couldn
't reasonably establish who fired first, the big reason behind the aquittal. _____ _____ \\\\\\/ ___/___________________ Mitchell S Todd \\\\/ / _____/__________________________ ________________ \\/ / mst4298@zeus._____/.'
.
'.'
.
'.'
.
'.'
.
'.'
.
'.'
.
'_'
_
'_/ \_____ \__ / / tamu.edu _____/.'
.
'.'
.
'.'
.
'.'
.
'.'
.
'.'
.
'.'
_
'_/ \__________\__ / / _____/_'
_
'_'
_
'_'
_
'_'
_
'_'
_
'_'
_
'_'
_'_
/
\_
/
/
__________
/
\
/
____
/
\\\\\\ \\\\\\
|
可以看到是如何分析和查看错误分类的文档的,然后回到前面步骤,调整优化特征提取方法,通过删除特征的单词或调整单词权重来减少或增加影响程度。