How to grid-search an estimator embedded in OneVsRestClassifier using Pipeline

ddd :

I am doing model selection and hyperparameter tuning using GridSearchCV. From an initial experiment, it turns out that SVC with an rbf kernel has the best performance. The problem is that it is too slow on my data (200K+ samples). Wrapping it in OneVsRestClassifier should allow the SVC to be parallelized (n_jobs). However, GridSearchCV does not work for this embedded estimator when I use a Pipeline to test multiple estimators at the same time.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import MultinomialNB

pipe = Pipeline([('clf', SVC())])  # Placeholder Estimator

# Candidate learning algorithms and their hyperparameters
search_space = [{'clf': [OneVsRestClassifier(SVC(tol=0.1, gamma='scale', probability=True), n_jobs=-1)],
                 'clf__kernel': ['rbf', 'linear'],
                 'clf__C': [1, 10, 100]},

                {'clf': [LogisticRegression(tol=0.1, penalty='l1', solver='saga', multi_class='multinomial', n_jobs=8)],
                 'clf__C': [1, 10, 100]},

                {'clf': [RandomForestClassifier(n_jobs=8)],
                 'clf__n_estimators': [50, 100, 200, 300, 400],
                 'clf__max_depth': [10, 20, 30],
                 'clf__min_samples_leaf': [1, 2, 4],
                 'clf__min_samples_split': [2, 5, 10]},

                {'clf': [MultinomialNB()],
                 'clf__alpha': [0.1, 0.5, 1]}]

# skf is a StratifiedKFold instance defined elsewhere
gs = GridSearchCV(pipe, search_space, cv=skf, scoring='accuracy', verbose=10)

I get the error

Invalid Parameter __kernel

But according to GridSearch for an estimator inside a OneVsRestClassifier, this method should work. I think it is the Pipeline that messes it up, since it basically adds another layer on top of OneVsRestClassifier. How exactly do I perform GridSearchCV for this nested estimator?

desertnaut :

As is, the pipeline looks for a parameter kernel in OneVsRestClassifier, can't find one (unsurprisingly, since the estimator does not have such a parameter), and raises an error. Since what you actually want is the parameter kernel (and, subsequently, C) of the SVC, you need to go one level deeper: change the first entry of your search_space to:

{'clf': [OneVsRestClassifier(SVC(tol=0.1, gamma='scale', probability=True), n_jobs=-1)],
 'clf__estimator__kernel': ['rbf', 'linear'],
 'clf__estimator__C': [1, 10, 100]}

and you should be fine.
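A quick way to convince yourself that the clf__estimator__<param> naming reaches the underlying SVC is to run the corrected entry on a small synthetic dataset. This is a minimal sketch for illustration only; the toy dataset and the reduced grid below are assumptions, not the asker's actual setup:

from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Small synthetic 3-class problem, just to exercise the parameter routing
X, y = make_classification(n_samples=500, n_features=5, n_informative=5,
                           n_redundant=0, n_classes=3, random_state=0)

pipe = Pipeline([('clf', SVC())])  # placeholder estimator, swapped in via the grid
search_space = [{'clf': [OneVsRestClassifier(SVC(gamma='scale'), n_jobs=-1)],
                 'clf__estimator__kernel': ['rbf', 'linear'],
                 'clf__estimator__C': [1, 10]}]

gs = GridSearchCV(pipe, search_space, cv=3, scoring='accuracy')
gs.fit(X, y)
print(gs.best_params_)  # contains 'clf__estimator__kernel' and 'clf__estimator__C'

Here 'clf' selects the pipeline step, 'estimator' is the parameter of OneVsRestClassifier that holds the SVC, and 'kernel'/'C' are the SVC parameters, which is why two levels of double underscores are needed.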

Irrespective of the error, however, your rationale for using this approach:

The problem is that it is too slow on my data (200K+ samples). Wrapping it in OneVsRestClassifier should allow the SVC to be parallelized (n_jobs).

is not correct. OneVsRestClassifier parallelizes the fitting of the n_classes different binary SVC estimators, not the SVC itself. In effect, you are trying to avoid a bottleneck (SVC) by wrapping something else around it (here OneVsRestClassifier) that adds its own computational overhead, only to (unsurprisingly) run into the same bottleneck again.

We can demonstrate this with some timings on dummy data - let's try a somewhat realistic dataset of 10K samples, 5 features, and 3 classes:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples = 10000, n_features=5, n_redundant=0, n_informative=5,
                             n_classes = 3, n_clusters_per_class=1, random_state=42)

%timeit for x in range(10): SVC().fit(X,y)
# 1 loop, best of 3: 7.72 s per loop

%timeit for x in range(10): OneVsRestClassifier(SVC()).fit(X, y)
# 1 loop, best of 3: 21.1 s per loop

Well, that's your baseline difference; now setting n_jobs=-1 helps:

%timeit for x in range(10): OneVsRestClassifier(SVC(), n_jobs=-1).fit(X, y)
# 1 loop, best of 3: 19 s per loop

but, unsurprisingly, it does so only in relation to the un-parallelized OneVsRestClassifier, not in relation to SVC itself.
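To make this concrete: OneVsRestClassifier fits one binary SVC per class, so n_jobs can at most spread those three fits over cores, while each individual SVC fit remains single-threaded. A small check (reusing the X, y defined above):

# One binary SVC per class; this is the ceiling on what n_jobs can parallelize
ovr = OneVsRestClassifier(SVC(), n_jobs=-1).fit(X, y)
print(len(ovr.estimators_))  # 3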

The difference gets worse with more features and classes; without going all the way to your full case, here is the situation with 10 features and 5 classes (same number of samples, 10K):

X1, y1 = make_classification(n_samples = 10000, n_features=10, n_redundant=0, n_informative=10,
                             n_classes = 5, n_clusters_per_class=1, random_state=42)

%timeit for x in range(10): SVC().fit(X1,y1)
# 1 loop, best of 3: 10.3 s per loop

%timeit for x in range(10): OneVsRestClassifier(SVC()).fit(X1, y1)
# 1 loop, best of 3: 30.7 s per loop

%timeit for x in range(10): OneVsRestClassifier(SVC(), n_jobs=-1).fit(X1, y1)
# 1 loop, best of 3: 24.9 s per loop

So, I would seriously suggest reconsidering your approach (and your aims) here.

(All timings in Google Colab).
