[Machine Learning] Understanding and Explanation of the Yellowbrick Package

1. Introduction

First, install the package:

pip install --user yellowbrick

Yellowbrick is a suite of visual analysis and diagnostic tools designed to facilitate machine learning with scikit-learn.

The library implements a new core API object, Visualizer, which is a scikit-learn estimator—an object that learns from data.

Similar to transformers or models, visualizers learn from data by creating a visual representation of the model selection workflow.

Visualizers allow users to steer the model selection process, building intuition around feature engineering, algorithm selection, and hyperparameter tuning.

For example, they can help diagnose common problems around model complexity and bias, heteroscedasticity, underfitting and overfitting, or class balance issues.

By applying visualization tools to the model selection workflow, Yellowbrick helps you steer toward successful predictive models faster.

Full documentation is available at scikit-yb.org, including a quick start guide for new users.


2. Visualizers

Visualizers are estimators — objects that learn from data — whose main goal is to create visualizations that provide insight into the model selection process.

In scikit-learn terms, they can act like transformers when visualizing the data space, or they can wrap a model estimator, similar to how ModelCV methods (e.g., RidgeCV, LassoCV) work.

Yellowbrick aims to create a sensible API similar to scikit-learn's. Some of the most popular visualizers include:

2.1 Classification Visualization

  1. Classification Report: A visual classification report that shows the model's precision, recall, and F1 scores per class in the form of a heat map;
  2. Confusion Matrix: a heatmap view of the class pair confusion matrix in multiclass classification;
  3. Discrimination Threshold: Visualization of precision, recall, F1 score, and queue rate against the discrimination threshold of a binary classifier;
  4. Precision-Recall Curve: Draw precision and recall scores for different probability thresholds;
  5. ROC/AUC: Plot receiver operating characteristic (ROC) and area under the curve (AUC).

2.2 Clustering Visualization

  1. Intercluster Distance Maps: Visualize the relative distance and size of clusters;
  2. KElbow Visualizer: Visualize clusters according to a specified scoring function, looking for the "elbow" in the curve;
  3. Silhouette Visualizer: Select k by visualizing the silhouette coefficient scores for each cluster in a single model.

2.3 Feature Visualization

  1. Manifold Visualization: High-dimensional visualization with manifold learning;
  2. Parallel Coordinates: horizontal visualization of instances;
  3. PCA Projection: instance projection based on principal components;
  4. RadViz Visualizer: separation of instances around a circular plot;
  5. Rank Features: Single or pairwise ranking of features to detect relationships.

2.4 Model Selection Visualization

  1. Cross Validation Scores: Displays the cross validation scores as a bar graph with the mean score plotted as a horizontal line;
  2. Feature Importances: Rank features according to performance within the model;
  3. Learning Curve: shows whether the model could benefit from more data or less complexity;
  4. Recursive Feature Elimination: Find the best subset of features based on importance;
  5. Validation Curve: Tuning a model based on a single hyperparameter.

2.5 Regression Visualization

  1. Alpha Selection: shows how the choice of alpha affects regularization;
  2. Cook's Distance: shows the influence of the instance on the linear regression;
  3. Prediction Error Plots: Find model faults along the target domain;
  4. Residuals Plot: Shows the difference in residuals for training and testing data.

2.6 Target Visualization

  1. Balanced Binning Reference: Generates a histogram with vertical lines showing suggested value points to bin data into evenly distributed bins;
  2. Class Balance: Shows the support for each class by plotting how frequently it occurs in the training and test data as a bar graph;
  3. Feature Correlation: Visualize the correlation between the features and the target variable.

2.7 Text Visualization

  1. Dispersion Plot: Visualize how key terms are dispersed throughout the corpus;
  2. PosTag Visualizer: plots counts of different parts of speech across tagged corpora;
  3. Token Frequency Distribution: Visualize the frequency distribution of terms in the corpus;
  4. t-SNE Corpus Visualization: Projecting documents using t-distributed stochastic neighbor embedding;
  5. UMAP Corpus Visualization: Draws similar documents closer together to discover clusters.


Origin: blog.csdn.net/wzk4869/article/details/130708289