Rocchio algorithm fails on the test set: Incompatible dimension for X and Y matrices: X.shape[1]

I typed in the example from p. 222 of the book 白话大数据与机器学习:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.neighbors import NearestCentroid
from pprint import pprint
import sys

# Load the data
newsgroups_train = fetch_20newsgroups(subset='train')
pprint(list(newsgroups_train.target_names))
# Pick 4 topics
categories = ['alt.atheism','comp.graphics','soc.religion.christian','sci.med']
# Download the documents for those 4 topics
train_data = fetch_20newsgroups(subset = "train", categories = categories)
# The document text is in train_data.data. Tokenize and vectorize it
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_data.data)
# Then apply the TF-IDF transform to the vectorized result
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)

# Train the classifier on the TF-IDF result and the topic id of each document (train_data.target)
clf = NearestCentroid().fit(train_tfidf, train_data.target)
# Build the test set: two documents, one line each, then vectorize and apply TF-IDF
docs_new = {'OpenGL onthe GPU is fast','God is love'}
docs_new_counts = count_vect.fit_transform(docs_new)
docs_new_tfidf = tfidf_transformer.fit_transform(docs_new_counts)
print(sys.modules['__main__'])
print(__file__)

print(docs_new_tfidf)
# Predict
predicted = clf.predict(docs_new_tfidf)

# Print the results, pairing each document with its prediction via zip
for doc,category in zip(docs_new,predicted):
    print('%r to %s' %(doc,train_data.target_names[category]))

Error: ValueError: Incompatible dimension for X and Y matrices: X.shape[1] == 7 while Y.shape[1] == 35788
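
The 7 in this message appears to be the size of a vocabulary rebuilt from the two test sentences alone, because count_vect.fit_transform(docs_new) refits the vectorizer. The sketch below reproduces that count in plain Python. It assumes CountVectorizer's default behaviour of lowercasing the text and keeping tokens of two or more word characters; tokenize is a hypothetical helper, not part of sklearn.

```python
import re

# Assumed approximation of CountVectorizer's default tokenization:
# lowercase the text, then keep runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"\b\w\w+\b")

def tokenize(text):
    """Hypothetical helper that mimics the default tokenizer."""
    return TOKEN_PATTERN.findall(text.lower())

docs_new = ['OpenGL onthe GPU is fast', 'God is love']

# Refitting on the test documents builds a vocabulary from these two
# sentences only, so the feature count equals the number of distinct tokens.
test_vocabulary = sorted({tok for doc in docs_new for tok in tokenize(doc)})
print(test_vocabulary)
print(len(test_vocabulary))
```

Under these assumptions the two sentences yield seven distinct tokens ("is" is shared, and "onthe" counts as one token), which matches X.shape[1] == 7 in the error.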

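The feature vectors of the training set and the test set have different dimensions: the test vectors have only 7 features. I spent two days reading the official sklearn documentation without figuring out a fix. The official site does have an example that trains on the fetch_20newsgroups dataset.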

Link: http://scikit-learn.org/stable/auto_examples/text/document_clustering.html#sphx-glr-auto-examples-text-document-clustering-py

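I expected to find a method parameter that would expand the test feature vectors to the same dimensionality as the training set, but I did not find one.
Later, a Baidu search led me to a Stack Overflow post: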
https://stackoverflow.com/questions/41791956/scikit-learn-how-to-classify-data-train-and-data-test-with-a-different-features

The original answer:

from sklearn.neighbors import KNeighborsClassifier
knns = {}
for n_feats in range(1, xtrain.shape[-1] + 1):
    knns[n_feats] = KNeighborsClassifier(n_neighbors=7, weights='distance')
    knns[n_feats].fit(xtrain[:, :n_feats], ytrain)
The classify function should consume two parameters, namely the test data and the dictionary of classifiers. This way you ensure the classification is performed by a classifier that was trained using exactly the same features of the test data (and discarding the others):

def classify(test_data, classifiers):
    """Classify test_data using classifiers[n], which is the classifier
    trained with the first n features of test_data
    """
    X = np.asarray(test_data, dtype='float')
    n_feats = X.shape[-1]
    return classifiers[n_feats].predict(X)
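The gist is to reduce the training set to the same dimensionality as the test set.
So I tried that. The fetch_20newsgroups training set has far too many dimensions, so I did not train a classifier for every feature count: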
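
A caveat that may matter when this idea is applied to bag-of-words features: slicing xtrain[:, :n_feats] only helps if column i holds the same feature in both matrices. When the vectorizer is refitted on the test documents, each vocabulary is assumed here to be indexed alphabetically on its own, so the same column number can refer to different words. The sketch below illustrates this in plain Python; build_vocabulary and the sample sentences are hypothetical.

```python
import re

def build_vocabulary(docs):
    """Hypothetical helper: map each distinct token to a column index in
    alphabetical order (assumed to mirror a freshly fitted vectorizer)."""
    tokens = {tok for doc in docs
              for tok in re.findall(r"\b\w\w+\b", doc.lower())}
    return {tok: idx for idx, tok in enumerate(sorted(tokens))}

train_docs = ["the graphics card renders opengl scenes",
              "god is love and faith"]
test_docs = ["opengl on the gpu is fast"]

train_vocab = build_vocabulary(train_docs)
test_vocab = build_vocabulary(test_docs)

# Invert the mappings to look up which word is stored in each column.
train_columns = {idx: tok for tok, idx in train_vocab.items()}
test_columns = {idx: tok for tok, idx in test_vocab.items()}

# Compare the word behind each shared column number.
for col in range(min(len(train_columns), len(test_columns))):
    print(col, train_columns[col], test_columns[col])

# The same word also lands in different columns.
print(train_vocab["opengl"], test_vocab["opengl"])
```

If the columns do not line up like this, a classifier trained on the first n training columns is comparing unrelated words, which could explain predictions that look arbitrary.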
from time import time
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.datasets import fetch_20newsgroups
from sklearn.neighbors import NearestCentroid

def classify(test_data, classifiers):
    """Classify test_data using classifiers[n], which is the classifier
    trained with the first n features of the training data, where n is
    the number of features in test_data
    """
    n_feats = test_data.shape[-1]
    return classifiers[n_feats].predict(test_data)

# Load the data
newsgroups_train = fetch_20newsgroups(subset='train')
# Pick 4 topics
categories = ['comp.windows.x','sci.space','comp.sys.ibm.pc.hardware','sci.med']
# Download the documents for those 4 topics
train_data = fetch_20newsgroups(subset = "train", categories = categories)
print(train_data.target_names)
# The document text is in train_data.data. Tokenize and vectorize it
count_vect = CountVectorizer()
train_counts = count_vect.fit_transform(train_data.data)
# Then apply the TF-IDF transform to the vectorized result
tfidf_transformer = TfidfTransformer()
train_tfidf = tfidf_transformer.fit_transform(train_counts)

# Train on the TF-IDF result and the topic id of each document (train_data.target)
# One classifier is trained per feature count, using the first n_feats columns
knns = {}
for n_feats in range(20, 200):
    knns[n_feats] = NearestCentroid()
    knns[n_feats].fit(train_tfidf[:, :n_feats], train_data.target)

# Build the test set: two documents, one line each, then vectorize and apply TF-IDF
docs_new = {"""And International Space Station crew landed back on Earth in the grasslands of Kazakhstan after spending 186 days in space. Support workers helped the crew emerged safely from their capsule, which was charred by a fiery descent through the atmosphere""",
            """To jump-start adoption of Windows 10 the company is offering free upgrades to the vast majority of home users. Microsoft has set a target of 1 billion devices running windows 10 within three years of launch"""}
docs_new_counts = count_vect.fit_transform(docs_new)
docs_new_tfidf = tfidf_transformer.fit_transform(docs_new_counts)

# Predict
predicted = classify(docs_new_tfidf, knns)
print(predicted)
# Print the results, pairing each document with its prediction via zip
for doc,category in zip(docs_new,predicted):
    print('%r to %s' %(doc,train_data.target_names[category]))

Console output:
['comp.sys.ibm.pc.hardware', 'comp.windows.x', 'sci.med', 'sci.space']
[1 3]
'To jump-start adoption of Windows 10 the company is offering free upgrades to the vast majority of home users. Microsoft has set a target of 1 billion devices running windows 10 within three years of launch' to comp.windows.x
'And International Space Station crew landed back on Earth in the grasslands of Kazakhstan after spending 186 days in space. Support workers helped the crew emerged safely from their capsule, which was charred by a fiery descent through the atmosphere' to sci.space
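I had to deliberately pick one news item about Microsoft and one about the space station before the predicted categories matched. My first few attempts were all wrong.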
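
For comparison, here is a minimal sketch of another way to get matching dimensions: fit CountVectorizer and TfidfTransformer on the training data only, and call transform (rather than fit_transform) on the new documents, so that they are encoded with the training vocabulary. The four training sentences and label_names are made up for illustration and stand in for the 20 newsgroups data.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.neighbors import NearestCentroid

# Tiny made-up corpus (an assumption for illustration, not the real dataset).
train_docs = [
    "OpenGL rendering on the GPU is fast",
    "the graphics card draws 3D images",
    "God is love and faith",
    "the church teaches love and prayer",
]
train_labels = [0, 0, 1, 1]
label_names = ["graphics", "religion"]

count_vect = CountVectorizer()
tfidf_transformer = TfidfTransformer()

# fit_transform learns the vocabulary and IDF weights from the training data.
train_counts = count_vect.fit_transform(train_docs)
train_tfidf = tfidf_transformer.fit_transform(train_counts)
clf = NearestCentroid().fit(train_tfidf, train_labels)

docs_new = ["OpenGL on the GPU is fast", "God is love"]

# transform (not fit_transform) reuses the fitted vocabulary and IDF weights,
# so the test matrix has the same number of columns as the training matrix.
docs_new_counts = count_vect.transform(docs_new)
docs_new_tfidf = tfidf_transformer.transform(docs_new_counts)
print(train_tfidf.shape[1], docs_new_tfidf.shape[1])

predicted = clf.predict(docs_new_tfidf)
for doc, category in zip(docs_new, predicted):
    print('%r to %s' % (doc, label_names[category]))
```

Words in the new documents that never appeared in the training data are simply dropped by transform, so no padding or slicing of the feature matrix is needed.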
