I've split my Pandas DataFrame into train_X and train_y parts, where train_X has all N columns and train_y has only the N-th column, the variable that I want to predict. Currently I'm doing:
train_X.drop("N-th column name", axis=1, inplace=True)
model = SomeSklearnModel()
model.fit(train_X, train_y)
Do I have to do it "by hand" (i.e. using drop() on train_X), or can I just do the third line and Scikit-learn will "know" which column train_y is and not use it for model training (only for checking results)?
You must declare X and y explicitly when calling fit on a sklearn estimator. Generally, by the time you're ready to split your data into training and testing sets, X should include model features only, so it should not include your target y.
There are many ways to do it, but here are a couple of common ways, using the iris dataset as an example:
# Setup
import pandas as pd

df_iris = pd.DataFrame({'sepal_length': [5.0, 4.8, 5.8, 5.7, 4.5, 6.0, 6.3, 4.8, 5.6, 6.4],
                        'sepal_width': [3.2, 3.4, 2.8, 4.4, 2.3, 3.0, 2.5, 3.4, 3.0, 2.8],
                        'petal_length': [1.2, 1.6, 5.1, 1.5, 1.3, 4.8, 5.0, 1.9, 4.5, 5.6],
                        'petal_width': [0.2, 0.2, 2.4, 0.4, 0.3, 1.8, 1.9, 0.2, 1.5, 2.1],
                        'target': ['setosa', 'setosa', 'virginica', 'setosa', 'setosa', 'virginica',
                                   'virginica', 'setosa', 'versicolor', 'virginica']})
If your target y is the last (N-th) column, you can use iloc slicing:
X = df_iris.iloc[:, :-1]
y = df_iris.iloc[:, -1]
Another way would be to use pop, which both drops and returns the column for assignment:
X = df_iris.copy()
y = X.pop('target')
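Note that pop mutates the frame it is called on, removing the column in place, which is why the example above works on a copy rather than on df_iris directly. A minimal illustration (using a small made-up frame, not the iris data):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'target': [0, 1]})
y = df.pop('target')  # returns the 'target' column and removes it from df

assert 'target' not in df.columns  # df no longer has the target column
assert list(y) == [0, 1]           # y holds the popped values
```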
Or using your own method with drop:
X = df_iris.drop('target', axis=1)
y = df_iris['target']
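All three approaches should produce identical X and y here, since 'target' is the last column. A quick sanity check (the df_iris frame from the setup is reproduced so this snippet runs standalone):

```python
import pandas as pd

df_iris = pd.DataFrame({'sepal_length': [5.0, 4.8, 5.8, 5.7, 4.5, 6.0, 6.3, 4.8, 5.6, 6.4],
                        'sepal_width': [3.2, 3.4, 2.8, 4.4, 2.3, 3.0, 2.5, 3.4, 3.0, 2.8],
                        'petal_length': [1.2, 1.6, 5.1, 1.5, 1.3, 4.8, 5.0, 1.9, 4.5, 5.6],
                        'petal_width': [0.2, 0.2, 2.4, 0.4, 0.3, 1.8, 1.9, 0.2, 1.5, 2.1],
                        'target': ['setosa', 'setosa', 'virginica', 'setosa', 'setosa', 'virginica',
                                   'virginica', 'setosa', 'versicolor', 'virginica']})

# Method 1: iloc slicing (assumes target is the last column)
X1 = df_iris.iloc[:, :-1]
y1 = df_iris.iloc[:, -1]

# Method 2: pop on a copy
X2 = df_iris.copy()
y2 = X2.pop('target')

# Method 3: drop, keeping df_iris untouched
X3 = df_iris.drop('target', axis=1)
y3 = df_iris['target']

assert X1.equals(X2) and X2.equals(X3)  # same feature frames
assert y1.equals(y2) and y2.equals(y3)  # same target series
```

Any of the resulting (X, y) pairs can then be passed to model.fit(X, y).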