[sklearn] Dimension mismatch between training set and test set (ValueError: X.shape[1] should be equal to ..., dimension mismatch, etc. when calling predict)

problem record

When using TF-IDF features for text classification, SVM first raised an error at prediction time:

ValueError: X.shape[1] = 3216 should be equal to 8610, the number of features at training time

I temporarily commented out the SVM model and switched to MultinomialNB, but predict still raised:

ValueError: dimension mismatch

So the cause is clearly not a single model; the problem lies in the data processing.
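A minimal sketch (with a made-up toy corpus) that reproduces this class of error: fitting a separate vectorizer on the test set produces a different number of columns than the one fitted on the training set, so any model trained on the first matrix rejects the second.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

X_train = ["the cat sat on the mat", "dogs and cats are friends"]
X_test = ["a bird flew over the mat"]

# Correct: one vectorizer fitted on the training corpus
train_matrix = TfidfVectorizer().fit_transform(X_train)
# Wrong: a second vectorizer re-fitted on the test corpus
test_matrix = TfidfVectorizer().fit_transform(X_test)

# The two matrices have different vocabulary sizes, hence the
# "X.shape[1] should be equal to ..." / "dimension mismatch" errors
print(train_matrix.shape[1], test_matrix.shape[1])
```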

looking for answers

Searching the SVM error

Searching with the SVM error message as the keyword, I found ValueError when using the linear SVM of scikit-learn python, which explains:

One possibility is that the data are loaded as sparse matrices and load_svmlight_file infers the number of features. If the test data contains features not seen in the training data, the resulting X_test may have a larger dimension. This can be avoided by passing the n_features parameter to load_svmlight_file to specify the number of features.

In other words, the test set contains features that are not in the training set, and the suggested fix is to specify the n_features parameter when reading the data with load_svmlight_file.
The documentation of sklearn.datasets.load_svmlight_file describes this, but I had not used that function to read my files, so this change would be hard to make; I kept looking for another way to solve the problem.
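For readers who do load data this way, a hedged sketch of that suggestion (the svmlight-format strings below are made up; zero_based=True is passed explicitly so both files are parsed with 0-based indices):

```python
from io import BytesIO
from sklearn.datasets import load_svmlight_file

train_data = b"1 0:0.5 7:1.2\n0 1:0.3\n"  # highest feature index 7 -> 8 columns
test_data = b"1 0:0.7 2:0.9\n"            # highest feature index only 2

X_train, y_train = load_svmlight_file(BytesIO(train_data), zero_based=True)

# Without n_features the test matrix gets its own, narrower width:
X_test_bad, _ = load_svmlight_file(BytesIO(test_data), zero_based=True)

# Passing n_features forces the test matrix to the training-set width:
X_test, y_test = load_svmlight_file(
    BytesIO(test_data), n_features=X_train.shape[1], zero_based=True
)

print(X_train.shape[1], X_test_bad.shape[1], X_test.shape[1])  # → 8 3 8
```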

Searching the MultinomialNB error

Searching with the naive Bayes error message instead turned up many related answers, and one of them hit the mark. Attaching one of the links: sklearn calls Naive Bayes predict() and reports a dimension mismatch error. It explains:

fit(): computes the intrinsic properties of the training set, such as its mean, variance, maximum, and minimum.
transform(): applies operations such as standardization, dimensionality reduction, and normalization, based on what fit() learned.
fit_transform(): a combination of fit and transform, i.e. it both learns from the data and converts it. transform() and fit_transform() both apply some unified processing to the data (e.g. standardization to ~N(0,1), scaling/mapping the data to a fixed interval, normalization, regularization, etc.).
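The split described above can be illustrated with a toy StandardScaler example (the values are made up): fit() learns the training statistics, and transform() applies those same statistics without re-learning them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [3.0], [5.0]])  # training mean is 3.0
X_test = np.array([[3.0]])

scaler = StandardScaler()
scaler.fit(X_train)               # learns mean_ and scale_ from the training set
print(scaler.mean_)               # [3.]
print(scaler.transform(X_test))   # [[0.]] — standardized with the *training* mean/std
```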

The code needs to be modified to:

tf = TfidfVectorizer()
tf.fit_transform(X_train)
tf.transform(X_test)

In other words, fit_transform both trains and transforms, so the test set only needs the transform step afterwards.
Checking my own code carefully, I found that in the TF-IDF code I wrote at the beginning, the training variables were processed with fit_transform, and then the test set also called that same function directly, which caused the error.
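A runnable sketch of the corrected flow (the corpus and labels below are made up for illustration): fit_transform on the training text only, then transform the test text with the same fitted vectorizer, so predict() sees matching dimensions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

X_train = ["spam offer now", "meeting at noon", "free spam offer", "lunch at noon"]
y_train = [1, 0, 1, 0]
X_test = ["free offer now"]

tf = TfidfVectorizer()
train_vec = tf.fit_transform(X_train)  # learn vocabulary + encode training set
test_vec = tf.transform(X_test)        # reuse the same vocabulary -> same width

clf = MultinomialNB().fit(train_vec, y_train)
print(clf.predict(test_vec))           # predicts without a dimension mismatch
```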

But after modifying the function, a new error appeared:

ValueError: Input has n_features=3216 while the model has been trained with n_features=8610

problem solved

Why is the dimension still wrong? I searched for the new error and found the link
ValueError: Input has n_features=10 while the model has been trained with n_features=4261.
I did not fully understand the solution given there, but the answer says:

You don't keep the CountVectorizer that originally fit the data.
This bagOfWords call fits a new CountVectorizer in its own scope:
new_word_counts = util.bagOfWords(["a place to listen to music it s making its way to the us"])
You want to use the one that was fitted on your training set.
You are also fitting your transformer on the whole of X, including X_test. You want to exclude your test set from any training (including transformations).

So the CountVectorizer was still being fitted on x_test. After checking the code carefully, I found that one fit_transform call on the CountVectorizer had been missed and never changed. Attaching the modified code:

vectorizer = CountVectorizer(min_df=1e-5)  # convert the words in the text into a term-frequency matrix; element a[i][j] is the frequency of word j in text i
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))       # count word occurrences on the training set, then fit and compute TF-IDF
tfidf_test = transformer.transform(vectorizer.transform(content_test))      # convert the test term-frequency matrix into TF-IDF values using the fitted models
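To show the corrected pipeline end to end, here is a self-contained run with a made-up corpus and labels (LinearSVC stands in for whatever SVM variant was used originally); both matrices now have the same width, so the classifier predicts without the X.shape[1] error.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import LinearSVC

contents = ["cheap pills online", "project status update",
            "cheap online deal", "status meeting today"]
labels = [1, 0, 1, 0]
content_test = ["cheap deal today"]

vectorizer = CountVectorizer(min_df=1e-5)
transformer = TfidfTransformer()

# Fit both the vectorizer and the transformer on the training set only
tfidf = transformer.fit_transform(vectorizer.fit_transform(contents))
# The test set is only transformed, never fitted
tfidf_test = transformer.transform(vectorizer.transform(content_test))

clf = LinearSVC().fit(tfidf, labels)
print(clf.predict(tfidf_test))
```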


Origin blog.csdn.net/m0_54352040/article/details/123947961