Split into test and train set before or after generating document term matrix?

Doctor :

I'm working on simple machine learning problems and I'm trying to build a classifier that can differentiate between spam and non-spam SMS. I'm confused as to whether I should generate the document-term matrix before splitting into test and train sets, or after.

I tried it both ways and found that the accuracy is slightly higher when I split the data before generating the document-term matrix. But to me, this makes no sense. Shouldn't the accuracy be the same? Does the order of these operations make any difference?

Prune :

Qualitatively, nothing forces you to do it one way or the other. However, proper procedure requires that you keep your training and test data entirely separate. The overall concept is that the test data are not represented in the training in any way; this is what lets the test set expose over-fitting. The test data (and later validation data) are samples that the trained model has never encountered during training.

Therefore, the test data should not be included in your pre-processing, which here means the document-term matrix. Including them breaks the separation, in that the model has, in one respect, "seen" the test data during training.
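The separation described above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the six toy messages, the 4/2 split, and the helper names `tokenize`, `fit_vocabulary`, and `transform` are all made up for the example and are not from the original question. The point is the order of operations: split first, build the vocabulary from the training messages only, then reuse that same vocabulary for the test messages.

```python
import random
from collections import Counter

# Hypothetical toy data; the real SMS corpus is not shown in the question.
messages = [
    ("win a free prize now", "spam"),
    ("are we still meeting for lunch", "ham"),
    ("free entry claim your prize", "spam"),
    ("call me when you get home", "ham"),
    ("urgent claim your free reward", "spam"),
    ("see you at lunch tomorrow", "ham"),
]

def tokenize(text):
    """Lower-case and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def fit_vocabulary(docs):
    """Map each distinct term in docs to a column index."""
    terms = sorted({tok for doc in docs for tok in tokenize(doc)})
    return {term: idx for idx, term in enumerate(terms)}

def transform(docs, vocab):
    """Build one row of term counts per document, using a fixed vocabulary.

    Terms that are not in the vocabulary are silently dropped.
    """
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[term] for term in vocab])
    return rows

# 1. Split first.
rng = random.Random(0)
shuffled = messages[:]
rng.shuffle(shuffled)
split = int(0.67 * len(shuffled))  # 4 training messages, 2 test messages
train, test = shuffled[:split], shuffled[split:]
train_texts = [text for text, _ in train]
test_texts = [text for text, _ in test]

# 2. Fit the vocabulary on the training messages only.
vocab = fit_vocabulary(train_texts)

# 3. Build both matrices with that training vocabulary.
X_train = transform(train_texts, vocab)
X_test = transform(test_texts, vocab)
```

Words that occur only in the test messages get no column at all here, which mirrors what happens when a trained classifier later meets unseen messages.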

Quantitatively, you need to do the split first, because that matrix is to be used for training the model against only the training set. When you included the test data in the matrix, you obtained a matrix that is slightly inaccurate in representing the training data: it no longer properly represents the data you're actually training against. This is why your model isn't quite as good as the one that followed proper separation procedures.
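To make the difference between the two matrices concrete, here is a small sketch that builds the vocabulary both ways and compares them. The messages and the helper names `vocabulary` and `document_frequency` are assumptions made for illustration; the sketch shows only how the matrix changes, not how much any particular classifier's accuracy would move.

```python
# Hypothetical toy messages, already split into train and test.
train_texts = [
    "win a free prize now",
    "are we still meeting for lunch",
    "free entry claim your prize",
]
test_texts = ["urgent reward waiting call now"]

def vocabulary(docs):
    """Sorted list of distinct whitespace-separated terms in docs."""
    return sorted({tok for doc in docs for tok in doc.lower().split()})

def document_frequency(term, docs):
    """Number of documents in docs that contain term."""
    return sum(term in doc.lower().split() for doc in docs)

vocab_train_only = vocabulary(train_texts)
vocab_all = vocabulary(train_texts + test_texts)

# Columns that exist only because the test message was included.
leaked_terms = sorted(set(vocab_all) - set(vocab_train_only))

# Corpus-level statistics shift as well, which matters for weightings
# such as TF-IDF that depend on document frequency.
df_now_train = document_frequency("now", train_texts)
df_now_all = document_frequency("now", train_texts + test_texts)

print(leaked_terms)
print(df_now_train, df_now_all)
```

In this toy case the combined matrix gains four columns that are all zero for every training row, and the document frequency of "now" changes, so the matrix built on all the data describes a slightly different corpus than the one the model is trained on.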

It's a subtle difference, mostly because the training and test sets are supposed to be random samples of the same population of possible inputs. Random differences between the two samples account for the small surprise you encountered.

