A Beginner's Introduction to Yellowbrick: A Python Library for Visualizing Machine Learning Models

Yellowbrick is a new Python library that extends the Scikit-Learn API to incorporate visualization into machine learning workflows.

Yellowbrick depends on several third-party libraries, including Scikit-Learn, Matplotlib, and NumPy.

Yellowbrick is an open-source, pure-Python project that extends Scikit-Learn with visual analysis and diagnostic tools. The Yellowbrick API also wraps matplotlib to create publishable graphs and interactive data exploration, while still allowing developers fine-grained control over the graphs. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.

Much of the model selection workflow has recently been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can hone in on quality models more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer toward final, interpretable models and avoid pitfalls.

The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to guide the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fitted and transformed as part of a Scikit-Learn Pipeline, providing visual diagnostics throughout the transformation of high-dimensional data.

Machine learning visualization helps us understand our results and know what actions to take to improve the model. That is Yellowbrick's mission.

Yellowbrick makes it easier for us to:

1. Select features

2. Tune hyperparameters

3. Interpret model scores

4. Visualize text data

Install

To install the Yellowbrick library, the easiest way is to use pip with the following command:

pip install yellowbrick

Use Yellowbrick

The Yellowbrick API is specifically designed to work perfectly with Scikit-Learn. Here is an example of a typical workflow with Scikit-Learn and Yellowbrick:
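Below is a minimal sketch of such a workflow. The occupancy dataset loader and the ClassificationReport visualizer are illustrative choices here, not steps prescribed by the original article:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from yellowbrick.classifier import ClassificationReport
from yellowbrick.datasets import load_occupancy

# Load a sample dataset (Yellowbrick ships with the occupancy data)
X, y = load_occupancy()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Wrap a plain Scikit-Learn estimator in a Yellowbrick visualizer
model = DecisionTreeClassifier()
visualizer = ClassificationReport(model)
visualizer.fit(X_train, y_train)    # fit() works exactly as in Scikit-Learn
visualizer.score(X_test, y_test)    # score() also draws the visualization
visualizer.show()                   # finalize and render the figure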

Yellowbrick's tutorial examples are available at the following URL:

https://pythonhosted.org/yellowbrick/examples/examples.html


[Figure: flowchart of Yellowbrick's visualizer modules]

Feature Visualization

In this example, we see how Rank2D uses a specific metric or algorithm to compare each pair of features in a dataset, then returns their rankings as a lower-left triangle plot:

from yellowbrick.features import Rank2D

visualizer = Rank2D(features=features, algorithm='covariance')
visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof()                   # Draw/show/poof the data

Having more features doesn't always equate to a better model. The more features a model has, the more sensitive the model is to errors due to variance. Therefore, we want to choose the fewest features needed to generate an effective model.

A common approach to feature elimination is to remove the features that are least important to the model, then re-evaluate during cross-validation whether the model actually performs better.

Feature importance is well suited for this task as it helps us visualize the relative importance of model features.

from yellowbrick.model_selection import FeatureImportances

viz = FeatureImportances(model)
viz.fit(X, y)
viz.show()


It appears that light is the most important feature for the DecisionTreeClassifier, followed by CO2 and temperature.

Since our data does not have many features, we won't remove humidity. But if a model has many features, we should eliminate the unimportant ones to prevent errors due to variance.

[Figure: examples of other Yellowbrick feature visualizations]

Model Visualization

In this example, we instantiate a Scikit-Learn classifier, then use Yellowbrick's ROCAUC class to visualize the classifier's tradeoff between sensitivity and specificity:

from sklearn.svm import LinearSVC
from yellowbrick.classifier import ROCAUC

model = LinearSVC()
model.fit(X, y)
visualizer = ROCAUC(model)
visualizer.score(X, y)
visualizer.poof()


Visualizing Data

Ranking Features

How correlated is each pair of features in the data? A two-dimensional ranking of features uses a ranking algorithm that considers one pair of features at a time. Here we score with Pearson correlation to detect collinear relationships.

from yellowbrick.features import Rank2D

visualizer = Rank2D(algorithm='pearson')
visualizer.fit(X, y)
visualizer.transform(X)
visualizer.show()


According to the plot, humidity is closely related to relative humidity, and light is closely related to temperature. This makes sense, since these features usually go hand in hand.

Class Balance

One of the biggest challenges with classification models is class imbalance in the training data. With unbalanced classes, a high F1 score may not be a good evaluation metric, since the classifier can achieve one by simply predicting the majority class every time.

Therefore, it is very important to visualize the distribution of classes. We can use ClassBalance to visualize the class distribution as a bar chart, as sketched below.
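A minimal sketch, assuming the target vector y and the class names from the occupancy data used elsewhere in this article:

from yellowbrick.target import ClassBalance

# Draw the class distribution as a bar chart
visualizer = ClassBalance(labels=["unoccupied", "occupied"])
visualizer.fit(y)          # only the target vector is needed
visualizer.show()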


It looks like there is far more unoccupied data than occupied. Knowing this, we can apply various techniques to deal with class imbalance, such as stratified sampling or class weighting, to get more informative results; one option is sketched below.
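For example, one common remedy (a sketch assuming a decision tree model, not a step from the original article) is to weight classes inversely to their frequency with Scikit-Learn's class_weight option:

from sklearn.tree import DecisionTreeClassifier

# "balanced" weights each class inversely proportional to its frequency,
# so the minority (occupied) class is not drowned out during training
model = DecisionTreeClassifier(class_weight="balanced")
model.fit(X, y)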

Visualize the results of the model

Now we come back to the question: what does an F1 score of 98% really mean? Will an increase in F1 score result in more profit for your company?
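As a reminder, F1 is the harmonic mean of precision and recall; this small helper (for illustration only) makes the trade-off explicit:

def f1(precision: float, recall: float) -> float:
    # F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print(f1(0.99, 0.97))   # roughly 0.98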

Yellowbrick provides a variety of tools that can be used to visualize the results of classification problems. Some of these you may not have heard of before, but they can be very helpful in interpreting your model.

Confusion Matrix

What percentage of predictions in the unoccupied class are wrong? What percentage in the occupied class? A confusion matrix helps us answer these questions.
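Here is a minimal sketch using Yellowbrick's ConfusionMatrix; the estimator, the train/test split, and the class names are assumptions for illustration:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from yellowbrick.classifier import ConfusionMatrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

model = DecisionTreeClassifier()
cm = ConfusionMatrix(model, classes=["unoccupied", "occupied"])
cm.fit(X_train, y_train)      # fit the underlying estimator
cm.score(X_test, y_test)      # score() fills in and draws the matrix
cm.show()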


It looks like the occupied class has a higher proportion of wrong predictions; therefore, we can try to increase the number of correct predictions in the occupied class to improve the score.

[Figure: examples of other Yellowbrick model-score visualizations]

How can we improve the model?

Now that we understand the model's performance, how can we improve it? To improve our model, we might want to:

  • Prevent our model from underfitting or overfitting

  • Find the most important features for the estimator

We will explore the tools Yellowbrick provides to help us figure out how to improve our model.

Validation Curve

A model can have many hyperparameters. We can choose hyperparameters that accurately predict the training data. A good way to find the best hyperparameters is to select a combination of these parameters via grid search.

But how do we know that these hyperparameters also accurately predict the test data? Plotting the effect of a single hyperparameter on training and test data is useful to determine whether an estimator is underfitting or overfitting for certain hyperparameter values.

A validation curve can help us find the sweet spot: hyperparameter values below or above it will underfit or overfit the data.

import numpy as np
from yellowbrick.model_selection import validation_curve

viz = validation_curve(
    model, X, y, param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="f1",
)


From the figure, we can see that as the maximum depth increases, the training score rises but the cross-validation score falls. This makes sense, because the deeper the decision tree is, the more easily it overfits.

So the sweet spot is where the cross-validation score has not yet begun to decrease, i.e., a maximum depth of 1.

Learning Curve

Does more data lead to better model performance? Not always; the estimator may become more sensitive to errors due to variance. This is where the learning curve is useful.

The learning curves show the relationship between the training score and the cross-validation test score for an estimator with different numbers of training samples.

import numpy as np
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import LearningCurve

# Create the learning curve visualizer
cv = StratifiedKFold(n_splits=12)
sizes = np.linspace(0.3, 1.0, 10)

visualizer = LearningCurve(model, cv=cv, scoring='f1', train_sizes=sizes)
visualizer.fit(X, y)        # Fit the data to the visualizer
visualizer.show()           # Finalize and render the figure


From the figure, we can see that around 8,700 training instances yields the best F1 score; with more training instances than that, the F1 score drops.

There are many other uses of Yellowbrick beyond those introduced above. The figure below shows that as the model search space becomes larger, search time increases exponentially.

[Figure: search time versus the size of the model search space]

The following figure shows Yellowbrick's hyperparameter space search.

[Figure: Yellowbrick hyperparameter space search]

Conclusion

Congratulations! You just learned how to create plots that help you interpret your model's results. Being able to understand your machine learning results makes it easier to find the next steps for improving performance.

Copyright statement: This article comes from the official account (python risk control model); do not plagiarize without permission. It follows the CC 4.0 BY-SA copyright agreement; please include the original source link and this statement when reprinting.


Origin: blog.csdn.net/toby001111/article/details/131806784