I am doing model selection and hyperparameter tuning using GridSearchCV. From an initial experiment, it turns out that SVC with the rbf kernel has the best performance. The problem is that it is too slow on my 200K+ samples. Wrapping SVC in OneVsRestClassifier allows parallelizing it (n_jobs). However, GridSearchCV doesn't work for this embedded estimator when I use a Pipeline to test multiple estimators at the same time.
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV, StratifiedKFold

skf = StratifiedKFold(n_splits=5)
pipe = Pipeline([('clf', SVC())])  # Placeholder estimator
# Candidate learning algorithms and their hyperparameters
search_space = [{'clf': [OneVsRestClassifier(SVC(tol=0.1, gamma='scale', probability=True), n_jobs=-1)],
                 'clf__kernel': ['rbf', 'linear'],
                 'clf__C': [1, 10, 100]},
                {'clf': [LogisticRegression(tol=0.1, penalty='l1', solver='saga', multi_class='multinomial', n_jobs=8)],
                 'clf__C': [1, 10, 100]},
                {'clf': [RandomForestClassifier(n_jobs=8)],
                 'clf__n_estimators': [50, 100, 200, 300, 400],
                 'clf__max_depth': [10, 20, 30],
                 'clf__min_samples_leaf': [1, 2, 4],
                 'clf__min_samples_split': [2, 5, 10]},
                {'clf': [MultinomialNB()],
                 'clf__alpha': [0.1, 0.5, 1]}]
gs = GridSearchCV(pipe, search_space, cv=skf, scoring='accuracy', verbose=10)
I get the error:
Invalid parameter kernel
But according to GridSearch for an estimator inside a OneVsRestClassifier, this method should work. I think the Pipeline messed it up, since it basically adds another layer on top of OneVsRestClassifier. How exactly do I perform GridSearchCV for this nested estimator?
As is, the pipeline looks for a parameter kernel in OneVsRestClassifier, can't find one (unsurprisingly, since that estimator does not have such a parameter), and raises an error. Since you actually want the parameter kernel (and, subsequently, C) of SVC, you should go a level deeper: change the first entry of your search_space to:
{'clf': [OneVsRestClassifier(SVC(tol=0.1, gamma='scale', probability=True), n_jobs=-1)],
 'clf__estimator__kernel': ['rbf', 'linear'],
 'clf__estimator__C': [1, 10, 100]}
and you should be fine.
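If you are ever unsure about the exact nested name, scikit-learn can list every tunable parameter for you; here is a minimal sketch using get_params(), which all scikit-learn estimators expose:

```python
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipe = Pipeline([('clf', OneVsRestClassifier(SVC()))])

# get_params() returns every tunable parameter, with nested names
# joined by double underscores, e.g. clf__estimator__kernel
names = sorted(pipe.get_params().keys())
print([n for n in names if n.endswith('kernel') or n.endswith('__C')])
# -> ['clf__estimator__C', 'clf__estimator__kernel']
```

Any key printed here is a valid key for the search_space dictionaries.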
Irrespective of the error, however, your rationale for using this approach:

The problem is that it is too slow on my 200K+ samples. Wrapping SVC in OneVsRestClassifier allows parallelizing it (n_jobs).

is not correct. OneVsRestClassifier will parallelize the fitting of the n_classes different SVC estimators, not the SVC itself. In effect, you are trying to avoid a bottleneck (SVC) by wrapping something else around it (here OneVsRestClassifier) which imposes its own additional computational overhead, only to (unsurprisingly) find the bottleneck again in front of you.
We can demonstrate this with some timings on dummy data - let's try a somewhat realistic dataset of 10K samples, 5 features, and 3 classes:
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=10000, n_features=5, n_redundant=0, n_informative=5,
                           n_classes=3, n_clusters_per_class=1, random_state=42)
%timeit for x in range(10): SVC().fit(X,y)
# 1 loop, best of 3: 7.72 s per loop
%timeit for x in range(10): OneVsRestClassifier(SVC()).fit(X, y)
# 1 loop, best of 3: 21.1 s per loop
Well, that's your baseline difference; now let's see if setting n_jobs=-1 helps:
%timeit for x in range(10): OneVsRestClassifier(SVC(), n_jobs=-1).fit(X, y)
# 1 loop, best of 3: 19 s per loop
It does, but, unsurprisingly, only in relation to the un-parallelized OneVsRestClassifier, not in relation to SVC itself.
The difference gets worse with more features and classes; without going to your full case, here is the situation with 10 features and 5 classes (same number of samples, 10K):
X1, y1 = make_classification(n_samples=10000, n_features=10, n_redundant=0, n_informative=10,
                             n_classes=5, n_clusters_per_class=1, random_state=42)
%timeit for x in range(10): SVC().fit(X1,y1)
# 1 loop, best of 3: 10.3 s per loop
%timeit for x in range(10): OneVsRestClassifier(SVC()).fit(X1, y1)
# 1 loop, best of 3: 30.7 s per loop
%timeit for x in range(10): OneVsRestClassifier(SVC(), n_jobs=-1).fit(X1, y1)
# 1 loop, best of 3: 24.9 s per loop
So, I would seriously suggest reconsidering your approach (and your aims) here.
(All timings in Google Colab).
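As one possible direction (my suggestion, not something benchmarked above): since your search space already includes a linear kernel, LinearSVC, which is backed by liblinear and scales far better with the number of samples than the kernelized SVC, may be worth timing on your data. A minimal sketch on the same dummy dataset:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=10000, n_features=5, n_redundant=0,
                           n_informative=5, n_classes=3,
                           n_clusters_per_class=1, random_state=42)

# LinearSVC handles multi-class natively (one-vs-rest under the hood),
# so no OneVsRestClassifier wrapper is needed
clf = LinearSVC(dual=False, tol=0.1).fit(X, y)
print(round(clf.score(X, y), 2))
```

Whether its accuracy is competitive with the rbf kernel on your actual 200K+ samples is, of course, something only an experiment can tell.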