100% classifier accuracy after using train_test_split

Don Andre :

I'm working on the mushroom classification data set (found here: https://www.kaggle.com/uciml/mushroom-classification).

I'm trying to split my data into training and testing sets for my models. However, if I use the train_test_split method, my models always achieve 100% accuracy. This is not the case when I split the data manually.

import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

# Separate the label from the (already encoded) features.
x = data.copy()
y = x['class']
del x['class']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

This produces:

[[1299    0]
 [   0 1382]]
1.0

If I split the data manually, I get a more reasonable result.

x = data.copy()
y = x['class']
del x['class']

x_train = x[0:5443]
x_test = x[5444:]   # note: x[0:5443] ends at row 5442, so row 5443 is skipped here
y_train = y[0:5443]
y_test = y[5444:]

model = xgb.XGBClassifier()
model.fit(x_train, y_train)
predictions = model.predict(x_test)

print(confusion_matrix(y_test, predictions))
print(accuracy_score(y_test, predictions))

Result:

[[2007    0]
 [ 336  337]]
0.8746268656716418

What could be causing this behaviour?

Edit: As requested, I'm including the shapes of the slices.

train_test_split:

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2681, 64)
(2681,)

Manual split:

x_train = x[0:5443]
x_test = x[5444:]
y_train = y[0:5443]
y_test = y[5444:]

print(x_train.shape)
print(y_train.shape)
print(x_test.shape)
print(y_test.shape)

Result:

(5443, 64)
(5443,)
(2680, 64)
(2680,)

I've tried defining my own split function, and that split also yields 100% classifier accuracy.

Here's the code for the split:

def split_data(dataFrame, testRatio):
    dataCopy = dataFrame.copy()
    testCount = int(len(dataFrame) * testRatio)
    dataCopy = dataCopy.sample(frac=1)  # shuffle the rows
    y = dataCopy['class']
    del dataCopy['class']
    # returns x_train, x_test, y_train, y_test
    return dataCopy[testCount:], dataCopy[0:testCount], y[testCount:], y[0:testCount]
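A call like the following reproduces the 0.33 split used above (the function expects the frame with the class column still present and returns x_train, x_test, y_train, y_test):

x_train, x_test, y_train, y_test = split_data(data, 0.33)
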
Desmond :

You got lucky there with your train_test_split. The split you're doing manually may leave more truly unseen data in the test set, so it validates more strictly than train_test_split, which shuffles the data internally before splitting.
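As a quick illustration (a minimal sketch reusing the x and y from the question): train_test_split shuffles rows by default, and passing shuffle=False makes it take a sequential slice much like your manual split.

from sklearn.model_selection import train_test_split

# Default behaviour: rows are shuffled before splitting.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33)

# shuffle=False takes the first 67% of rows as train and the rest as test,
# mirroring the manual slicing in the question.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, shuffle=False)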

For better validation, use k-fold cross-validation, which lets you verify the model's accuracy with each part of your data serving in turn as the test set while the remaining parts are used for training.
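A minimal sketch of that with scikit-learn, assuming the x and y from the question (StratifiedKFold keeps the class ratio similar in every fold):

import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

model = xgb.XGBClassifier()
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Each of the 5 folds serves once as the test set while the other
# four folds are used for training.
scores = cross_val_score(model, x, y, cv=cv, scoring='accuracy')
print(scores)         # per-fold accuracy
print(scores.mean())  # mean accuracy across folds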
