Yellowbrick: a powerful feature engineering visualization tool!

Feature engineering is a crucial step before building a model, and exploratory data analysis is an essential part of that process.

This article introduces a very powerful feature engineering visualization tool: yellowbrick. It covers RadViz radar charts, one-dimensional ranking, PCA projection, feature importance, recursive feature elimination, regularization (alpha selection), residual plots, the elbow method, learning curves, validation curves, and more. With its help you can save a lot of exploration time and quickly get a grasp of your features.


Functionality

RadViz Radar Chart

RadViz is a multivariate data visualization algorithm that spaces each feature evenly around the circumference of a circle and normalizes the feature values. Data scientists typically use this method to detect separability between classes, e.g., is there an opportunity to learn something from the feature set, or is there just too much noise?

# Load the classification data set
# (load_data is the example-data helper used in the yellowbrick docs;
#  yellowbrick.datasets also provides loaders such as load_occupancy)
data = load_data("occupancy")

# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "C02", "humidity"]
classes = ["unoccupied", "occupied"]

# Extract the instances and target
X = data[features]
y = data.occupancy

# Import the visualizer
from yellowbrick.features import RadViz

# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features)

visualizer.fit(X, y)      # Fit the data to the visualizer
visualizer.transform(X)   # Transform the data
visualizer.poof()         # Draw/show/poof the data

[Figure: RadViz plot of the occupancy features]

From the radar chart above, we can see that among the five features, temperature has a relatively large influence on the target class.

Rank1D: One-Dimensional Ranking

The one-dimensional ranking of features uses a ranking algorithm that considers only a single feature at a time, by default the Shapiro-Wilk test, which assesses the normality of the distribution of instances for that feature. A bar chart is then drawn showing the relative rank of each feature.

from yellowbrick.features import Rank1D

# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(features=features, algorithm='shapiro')

visualizer.fit(X, y)                # Fit the data to the visualizer
visualizer.transform(X)             # Transform the data
visualizer.poof()                   # Draw/show/poof the data

[Figure: Rank1D bar chart of Shapiro rankings]

PCA Projection

PCA decomposition visualization uses principal component analysis to decompose high-dimensional data into two or three dimensions so that each instance can be plotted in a scatter plot. Using PCA means the projected dataset can be analyzed along the principal axes of variation and interpreted to determine whether a spherical distance metric can be exploited.
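The article jumps straight to the plot here, so below is a minimal sketch of the basic 2D projection for reference; it reuses the occupancy X and y from the RadViz example above and assumes the pre-1.0 yellowbrick import path (yellowbrick.features.pca) used elsewhere in this article:

# Minimal sketch: 2D PCA projection (assumes the occupancy X, y from above)
from yellowbrick.features.pca import PCADecomposition

visualizer = PCADecomposition(scale=True)  # standardize features before projecting
visualizer.fit_transform(X, y)             # fit PCA and project to two components
visualizer.poof()                          # draw the scatter plot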

[Figure: 2D PCA projection scatter plot]

Biplot

A PCA projection can be enhanced into a biplot, where the points are the projected instances and the vectors represent the structure of the data in high-dimensional space. By using the proj_features=True flag, a vector for each feature in the dataset is drawn on the scatter plot in the direction of that feature's maximum variance. These structures can be used to analyze the importance of features to the decomposition or to find correlated features for further analysis.

# Load the regression data set
data = load_data('concrete')

# Specify the features of interest and the target
target = "strength"
features = [
    'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]

# Extract the instance data and the target
X = data[features]
y = data[target]

# Import and instantiate the visualizer (pre-1.0 yellowbrick import path)
from yellowbrick.features.pca import PCADecomposition

visualizer = PCADecomposition(scale=True, proj_features=True)
visualizer.fit_transform(X, y)
visualizer.poof()

[Figure: PCA biplot of the concrete features]

Feature Importances

The feature engineering process involves selecting the minimum number of features needed to produce an effective model, because the more features a model contains, the more complex it is (and the sparser the data), and thus the more sensitive it is to errors due to variance. A common approach to eliminating features is to describe their relative importance to the model, then eliminate weak features or combinations of features, and re-evaluate during cross-validation to see whether the model improved.

In scikit-learn, Decision Tree models and ensembles of trees such as Random Forest, Gradient Boosting, and AdaBoost provide a feature_importances_ attribute when fitting. The Yellowbrick FeatureImportances visualizer utilizes this property to rank and plot relative importance.

import matplotlib.pyplot as plt

from sklearn.ensemble import GradientBoostingClassifier

from yellowbrick.features.importances import FeatureImportances

# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()

# Fit the visualizer on a classification dataset
# (assumes classification features/target, e.g. the occupancy X, y from above)
viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()

[Figure: relative feature importances from the GradientBoostingClassifier]

Recursive Feature Elimination (RFE)

Recursive Feature Elimination (RFE) is a feature selection method that trains a model and removes the weakest feature (or features) until a specified number of features is reached. Features are ranked by the model's coef_ or feature_importances_ attribute, and by recursively eliminating a small number of features per iteration, RFE attempts to remove dependencies and collinearity that may exist in the model.

RFE requires a specified number of features to keep, but it is usually not known in advance how many features are actually useful. To find the optimal number of features, cross-validation is used together with RFE to score different subsets of features and select the best-scoring set. The RFECV visualizer plots the number of features in the model against the cross-validation test score and its variability, and marks the selected number of features.

from sklearn.svm import SVC
from sklearn.datasets import make_classification

from yellowbrick.features import RFECV

# Create a dataset with only 3 informative features
X, y = make_classification(
    n_samples=1000, n_features=25, n_informative=3, n_redundant=2,
    n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0
)

# Create RFECV visualizer with linear SVM classifier
viz = RFECV(SVC(kernel='linear', C=1))
viz.fit(X, y)
viz.poof()

[Figure: RFECV curve for the synthetic classification dataset]

The plot shows the ideal RFECV curve: accuracy jumps to an excellent level once the three informative features are captured, then gradually decreases as non-informative features are added to the model. The shaded area indicates the cross-validation variability, one standard deviation above and below the mean accuracy score drawn by the curve.

Below is a real dataset where we can see the effect of RFECV on a credit default binary classifier.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold

data = load_data('credit')

target = 'default'
features = [col for col in data.columns if col != target]

X = data[features]
y = data[target]

cv = StratifiedKFold(5)
oz = RFECV(RandomForestClassifier(), cv=cv, scoring='f1_weighted')

oz.fit(X, y)
oz.poof()

[Figure: RFECV curve for the credit default classifier]

In this example we can see that 19 features were selected, although the model's f1 score does not seem to improve much beyond roughly 5 features. The choice of which features to eliminate plays a large role in determining the outcome of each recursion; modifying the step parameter to eliminate more than one feature at each step may help remove the worst features early and strengthen the rest (it can also be used to speed up feature elimination for datasets with a large number of features); a rough sketch follows below.
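As a hedged illustration of that last point, the sketch below assumes yellowbrick's RFECV forwards the step parameter to scikit-learn's recursive eliminator, and reuses the credit data, cv, and RandomForestClassifier from the example above:

# Eliminate 5 features per iteration instead of 1 to speed up the search
# (step is assumed to be passed through to scikit-learn's RFE)
oz = RFECV(RandomForestClassifier(), cv=cv, scoring='f1_weighted', step=5)
oz.fit(X, y)
oz.poof()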

Residuals Plot

In the context of a regression model, a residual is the difference between the observed value of the target variable (y) and the predicted value (ŷ), i.e., the error of the prediction. A residual plot shows the residuals on the vertical axis against the dependent variable on the horizontal axis, making it possible to detect regions of the target that are prone to more or less error.

from sklearn.linear_model import Ridge
from yellowbrick.regressor import ResidualsPlot

# Instantiate the linear model and visualizer
ridge = Ridge()
visualizer = ResidualsPlot(ridge)

# (X_train, X_test, y_train, y_test are assumed to come from a train/test
#  split of a regression dataset, e.g. the concrete data loaded earlier)
visualizer.fit(X_train, y_train)  # Fit the training data to the model
visualizer.score(X_test, y_test)  # Evaluate the model on the test data
visualizer.poof()                 # Draw/show/poof the data

[Figure: residuals plot for the Ridge model]

Alpha Selection (Regularization)

Regularization is designed to penalize model complexity, so the higher the alpha, the less complex the model, decreasing the error due to variance (overfitting). On the other hand, an alpha that is too high increases the error due to bias (underfitting). It is therefore important to choose the best alpha such that the error is minimized in both directions.

The AlphaSelection visualizer demonstrates how different alpha values affect model selection during the regularization of linear models. In general, alpha increases the effect of regularization: if alpha is zero there is no regularization, and the higher the alpha, the greater the impact of the regularization parameter on the final model.

import numpy as np

from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection

# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)

# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)

visualizer.fit(X, y)
g = visualizer.poof()

[Figure: alpha selection curve for LassoCV]

Class Prediction Error

The Class Prediction Error plot provides a quick way to see how well a classifier is predicting the correct class.

from sklearn.ensemble import RandomForestClassifier

from yellowbrick.classifier import ClassPredictionError

# Instantiate the classification model and visualizer
# (assumes a classification train/test split and a `classes` list naming the
#  target classes, as in the occupancy example at the top of the article)
visualizer = ClassPredictionError(
    RandomForestClassifier(), classes=classes
)

# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)

# Evaluate the model on the test data
visualizer.score(X_test, y_test)

# Draw visualization
g = visualizer.poof()

[Figure: class prediction error plot]

Of course, there are also visualizers for classification evaluation metrics, including the confusion matrix, ROC/AUC, precision/recall, and so on; a brief sketch follows below.
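As a quick, hedged example of one of them, the sketch below plots a confusion matrix; it assumes the same classification train/test split and classes list used above:

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ConfusionMatrix

# Confusion matrix visualizer (assumes X_train/X_test/y_train/y_test and classes)
cm = ConfusionMatrix(LogisticRegression(), classes=classes)
cm.fit(X_train, y_train)   # Fit the model on the training data
cm.score(X_test, y_test)   # Populate the matrix from the test predictions
cm.poof()                  # Draw the confusion matrix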

Discrimination Threshold (Binary Classification)

This visualizer shows precision, recall, F1 score, and queue rate as functions of the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. Typically this is set to 50%, but the threshold can be adjusted to increase or decrease sensitivity to false positives or to other application factors.

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold

# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = DiscriminationThreshold(logistic)

visualizer.fit(X, y)  # Fit the training data to the visualizer
visualizer.poof()     # Draw/show/poof the data

[Figure: discrimination threshold plot]

Clustering Elbow Method (KElbow)

KElbowVisualizer implements the "elbow" rule to help data scientists choose the optimal number of clusters by fitting the model for a range of values of K. If the line chart resembles an arm, the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.

In the example below, KElbowVisualizer fits a KMeans model on a sample dataset with 8 random clusters of points, for a range of K values from 4 to 11. We can see the "elbow" in the plot at K = 8, which in this case we know is the optimal number.

from sklearn.datasets import make_blobs

# Create synthetic dataset with 8 random clusters
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)

from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))

visualizer.fit(X)    # Fit the data to the visualizer
visualizer.poof()    # Draw/show/poof the data

[Figure: elbow plot for K = 4 to 11]

Intercluster Distance Maps

The intercluster distance map shows an embedding of the cluster centers in two dimensions while preserving the distances to the other centers: the closer the centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric; by default they are sized by membership, i.e., the number of instances belonging to each center, which gives a sense of the relative importance of the clusters. Note, however, that two clusters overlapping in 2D space does not imply that they overlap in the original feature space.

from sklearn.datasets import make_blobs

# Make 12 blobs dataset
X, y = make_blobs(centers=12, n_samples=1000, n_features=16, shuffle=True)

from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance

# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(n_clusters=9))

visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data

[Figure: intercluster distance map]

Model Selection: Learning Curve

Learning curves examine the relationship between the model's training score and its cross-validation test score as the number of training samples varies. This visualization is usually used to show two things:

  1. Does the model get better as the amount of data increases?

  2. Is the model more sensitive to bias or to variance?

Below is a learning curve visualization generated with yellowbrick (a hedged code sketch follows). The learning curve works for classification, regression, and clustering.
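A minimal sketch, assuming a classification dataset such as the occupancy X and y from earlier:

import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
from yellowbrick.model_selection import LearningCurve

# Cross-validation strategy and the training-set sizes to evaluate
cv = StratifiedKFold(n_splits=5)
sizes = np.linspace(0.3, 1.0, 10)

# Plot training vs. cross-validation scores as the training set grows
viz = LearningCurve(
    RandomForestClassifier(), cv=cv, scoring='f1_weighted', train_sizes=sizes
)
viz.fit(X, y)    # Fit the data to the visualizer
viz.poof()       # Draw the learning curve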

[Figure: learning curve]

Model Selection: Validation Curve

Model validation is used to determine how effective a model is on data it has been trained on and how well it generalizes to new input. To measure a model's performance, we first split the dataset into training and test sets, fit the model on the training data, and score it on the held-out test data.

To maximize the score, the model's hyperparameters must be chosen so that the model operates as well as possible in the specified feature space. Most models have multiple hyperparameters, and the best way to choose a combination of them is a grid search. However, it is sometimes useful to plot the influence of a single hyperparameter on the training and test scores to determine whether the model is underfitting or overfitting for certain hyperparameter values.

import numpy as np

from sklearn.tree import DecisionTreeRegressor
from yellowbrick.model_selection import ValidationCurve

# Load a regression dataset
data = load_data('energy')

# Specify features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]

# Extract the instances and target
X = data[features]
y = data[targets[0]]

viz = ValidationCurve(
    DecisionTreeRegressor(), param_name="max_depth",
    param_range=np.arange(1, 11), cv=10, scoring="r2"
)

# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()

[Figure: validation curve for max_depth of the DecisionTreeRegressor]

Summary

Yellowbrick is very easy to use. First, it solves the visualization problem in the feature engineering and modeling process, greatly simplifying the workflow; second, the various visualizations also help cover blind spots in modeling.

Link: https://github.com/DistrictDataLabs/yellowbrick


Reprinted from: blog.csdn.net/qq_34160248/article/details/132025418