I'm working on the mushroom classification dataset (found here: https://www.kaggle.com/uciml/mushroom-classification).
I'm trying to split my data into training and testing sets for my models, but if I use the train_test_split method my models always achieve 100% accuracy. This is not the case when I split the data manually.
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score
import xgboost as xgb

x = data.copy()
y = x['class']    # target column
del x['class']    # keep only the features in x
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
This produces:
[[1299    0]
 [   0 1382]]
1.0
If I split the data manually, I get a more reasonable result.
x = data.copy()
y = x['class']
del x['class']
# Note: slicing x[0:5443] followed by x[5444:] skips row 5443 entirely,
# which is why the test set below has 2680 rows rather than 2681.
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)
print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))
Result:
[[2007    0]
 [ 336  337]]
0.8746268656716418
What could be causing this behaviour?
Edit: As requested, I'm including the shapes of the slices.
train_test_split:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Result:
(5443, 64)
(5443,)
(2681, 64)
(2681,)
Manual split:
x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]
print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)
Result:
(5443, 64)
(5443,)
(2680, 64)
(2680,)
I've also tried defining my own split function, and the resulting split likewise gives 100% classifier accuracy. Here's the code for the split:
def split_data(dataFrame, testRatio):
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame) * testRatio)
    dataCopy = dataCopy.sample(frac=1)   # shuffle the rows, much like train_test_split does
    y = dataCopy['class']
    del dataCopy['class']
    # returns x_train, x_test, y_train, y_test
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]
You got a bit lucky with train_test_split there. Your manual split takes the last rows of the file without shuffling, so its test set contains more genuinely unseen data, which gives a stricter validation than train_test_split, which shuffles the data internally before splitting.
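One quick way to see the difference is to compare the class balance of the two test sets. A small check along these lines (reusing the x and y from your question; the variable names here are just for illustration, and the counts from the shuffled split will vary from run to run):
# Shuffled split: the test set roughly mirrors the overall class balance.
_, _, _, y_test_shuffled = train_test_split(x, y, test_size=0.33)
print(y_test_shuffled.value_counts())

# Manual tail split: the last rows of the file in their original order.
y_test_tail = y[5444:]
print(y_test_tail.value_counts())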
For more reliable validation, use k-fold cross-validation, which evaluates the model once per fold of your data, using that fold as the test set and the remaining folds as the training set.
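A minimal sketch using scikit-learn's cross_val_score (x, y and the model are the same as in your question; the 5 folds and the stratified, shuffled splitter are just one reasonable choice):
from sklearn.model_selection import StratifiedKFold, cross_val_score
import xgboost as xgb

model = xgb.XGBClassifier()

# Each of the 5 folds is used once as the test set,
# with the remaining folds used for training.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, x, y, cv=cv, scoring='accuracy')

print(scores)         # per-fold accuracy
print(scores.mean())  # mean accuracy across folds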