Feature engineering is one of the most important jobs before building a model, and exploratory data analysis is an essential part of it.
This post introduces yellowbrick, a powerful visualization tool for feature engineering and model evaluation. It covers RadViz (radar) plots, one-dimensional ranking, PCA projection, feature importance, recursive feature elimination, regularization, residual plots, the elbow method, learning curves, validation curves, and more. With its help, you can save exploration time and grasp feature information quickly.
Function
Radar RadViz
RadViz is a multivariate data visualization algorithm that places each feature evenly around the circumference of a circle and normalizes each feature's values, so that every instance is drawn as a point inside the circle. Data scientists typically use it to detect separability between classes: is there an opportunity to learn something from the feature set, or is there simply too much noise?
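# Load the classification data set
# (load_data is a helper that is not defined in this post; it is assumed to return a pandas DataFrame)
data = load_data("occupancy")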
# Specify the features of interest and the classes of the target
features = ["temperature", "relative humidity", "light", "C02", "humidity"]
classes = ["unoccupied", "occupied"]
# Extract the instances and target
X = data[features]
y = data.occupancy
# Import the visualizer
from yellowbrick.features import RadViz
# Instantiate the visualizer
visualizer = RadViz(classes=classes, features=features)
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.poof() # Draw/show/poof the data
From the radar chart above, it can be seen that among the five dimensions, temperature has a relatively large impact on the target class.
One-Dimensional Ranking Rank1D
A one-dimensional ranking of features uses a ranking algorithm that considers only a single feature at a time. By default it uses the Shapiro-Wilk algorithm to assess the normality of the distribution of instances with respect to each feature, and then draws a bar chart showing the relative rank of each feature.
from yellowbrick.features import Rank1D
# Instantiate the 1D visualizer with the Shapiro ranking algorithm
visualizer = Rank1D(features=features, algorithm='shapiro')
visualizer.fit(X, y) # Fit the data to the visualizer
visualizer.transform(X) # Transform the data
visualizer.poof() # Draw/show/poof the data
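To make the ranking idea concrete, here is a minimal sketch that scores each column with the Shapiro-Wilk W statistic using SciPy. It is an illustration of the technique, not yellowbrick's own code; the helper `shapiro_rank` and the synthetic array `X_demo` are made-up names for this example.

```python
import numpy as np
from scipy.stats import shapiro

def shapiro_rank(X, names):
    """Return (name, W statistic) pairs, most normal-looking feature first."""
    # shapiro() returns (statistic, p-value); W close to 1 suggests normality
    scores = [(name, float(shapiro(X[:, i])[0])) for i, name in enumerate(names)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Synthetic data (assumption): one roughly normal column, one skewed column
rng = np.random.default_rng(0)
X_demo = np.column_stack([
    rng.normal(size=500),
    rng.exponential(size=500),
])

ranking = shapiro_rank(X_demo, ["gaussian", "skewed"])
for name, w in ranking:
    print(f"{name}: W = {w:.3f}")
```

A bar chart of these scores is, roughly speaking, what the Rank1D plot displays.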
PCA Projection
PCA decomposition visualization utilizes principal component analysis to decompose high-dimensional data into two or three dimensions so that each instance can be plotted in a scatterplot. The use of PCA means that a projected dataset can be analyzed along the principal axis of variation, and this dataset can be interpreted to determine whether a spherical distance metric can be exploited.
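The projection step itself can be sketched with scikit-learn alone. The snippet below standardizes a synthetic matrix and projects it onto two principal components; `X_demo` is random data invented for illustration, and the scaling step is an assumption about sensible preprocessing rather than a description of what any particular visualizer does internally.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 8-dimensional data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 8))

# Standardize so that no feature dominates because of its units
X_scaled = StandardScaler().fit_transform(X_demo)

# Project onto the two directions of greatest variance
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

# Share of the total variance captured by each component
explained = pca.explained_variance_ratio_
print(X_2d.shape, explained)
```

Each row of `X_2d` is one instance ready to be drawn in a scatterplot.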
Biplot
A PCA projection can be enhanced into a biplot, in which the points are the projected instances and the vectors represent the structure of the data in the high-dimensional space. With the proj_features=True flag, a vector for each feature in the dataset is drawn on the scatterplot in the direction of maximum variance for that feature. These vectors can be used to analyze the importance of a feature to the decomposition or to find features with correlated variance for further analysis.
# Load the classification data set
data = load_data('concrete')
# Specify the features of interest and the target
target = "strength"
features = [
'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
]
# Extract the instance data and the target
X = data[features]
y = data[target]
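# Import and instantiate the visualizer (import path as used by older yellowbrick releases)
from yellowbrick.features.pca import PCADecomposition
visualizer = PCADecomposition(scale=True, proj_features=True)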
visualizer.fit_transform(X, y)
visualizer.poof()
Feature Importance
The feature engineering process involves selecting the minimum set of features needed to produce an effective model, because the more features a model contains, the more complex it is (and the sparser the data), and therefore the more sensitive it is to error due to variance. A common approach to eliminating features is to describe their relative importance to the model, then eliminate weak features or combinations of features and re-evaluate to see whether the model fares better during cross-validation.
In scikit-learn, Decision Tree models and ensembles of trees such as Random Forest, Gradient Boosting, and AdaBoost provide a feature_importances_ attribute when fitting. The Yellowbrick FeatureImportances visualizer utilizes this property to rank and plot relative importance.
import matplotlib.pyplot as plt
from sklearn.ensemble import GradientBoostingClassifier
from yellowbrick.features.importances import FeatureImportances
# Create a new matplotlib figure
fig = plt.figure()
ax = fig.add_subplot()
viz = FeatureImportances(GradientBoostingClassifier(), ax=ax)
viz.fit(X, y)
viz.poof()
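For readers who want the raw numbers behind such a plot, the following sketch reads `feature_importances_` directly from a fitted scikit-learn ensemble and sorts the features by it. The synthetic dataset and the names `X_demo`, `y_demo`, and `order` are assumptions made for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): 6 features, only 2 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=2, n_redundant=0,
    random_state=0,
)

model = RandomForestClassifier(n_estimators=50, random_state=0)
model.fit(X_demo, y_demo)

# Impurity-based importances; they are normalized to sum to 1
importances = model.feature_importances_

# Feature indices from most to least important
order = np.argsort(importances)[::-1]
for idx in order:
    print(f"feature {idx}: {importances[idx]:.3f}")
```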
Recursive Feature Elimination
Recursive Feature Elimination (RFE) is a feature selection method that trains a model and removes the weakest feature (or features) until a specified number of features is reached. Features are ranked by the coef_ or feature_importances_ attribute of the model, and by recursively eliminating a small number of features per cycle, RFE attempts to remove dependencies and collinearity that may exist in the model.
RFE needs to retain a specified number of features, but it is usually not known in advance how many features are valid. To find the optimal number of features, cross-validation is used together with RFE to score different feature subsets and select the best-scoring collection of features. The RFECV visualizer plots the number of features in the model along with their cross-validation test scores and variability, and marks the selected number of features.
from sklearn.svm import SVC
from sklearn.datasets import make_classification
from yellowbrick.features import RFECV
# Create a dataset with only 3 informative features
X, y = make_classification(
n_samples=1000, n_features=25, n_informative=3, n_redundant=2,
n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0
)
# Create RFECV visualizer with linear SVM classifier
viz = RFECV(SVC(kernel='linear', C=1))
viz.fit(X, y)
viz.poof()
The plot shows an ideal RFECV curve: accuracy jumps to an excellent level once the three informative features are captured, then gradually decreases as non-informative features are added to the model. The shaded area indicates the variability of cross-validation, one standard deviation above and below the mean accuracy score drawn by the curve.
Below is a real dataset where we can see the effect of RFECV on a credit default binary classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold
data = load_data('credit')
target = 'default'
features = [col for col in data.columns if col != target]
X = data[features]
y = data[target]
cv = StratifiedKFold(5)
oz = RFECV(RandomForestClassifier(), cv=cv, scoring='f1_weighted')
oz.fit(X, y)
oz.poof()
In this example, we can see that 19 features were selected, although the f1 score of the model does not seem to improve much after about 5 features. The choice of which features to eliminate plays an important role in the outcome of each recursion; modifying the step parameter to eliminate more than one feature at each step may help to remove the worst features early and strengthen the remaining ones (it can also be used to speed up feature elimination on datasets with a large number of features).
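The effect of the step parameter can be tried out with scikit-learn's own `RFECV`, which performs the same cross-validated elimination without the plot. This is a small sketch on synthetic data; the dataset, the choice of logistic regression, and `step=2` are assumptions picked to keep the example quick.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 10 features, 3 of them informative
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=3, n_redundant=0,
    random_state=0,
)

# step=2 drops two features per round instead of one
selector = RFECV(
    LogisticRegression(max_iter=1000), step=2, cv=5, scoring="accuracy"
)
selector.fit(X_demo, y_demo)

n_selected = selector.n_features_   # number of features kept
mask = selector.support_            # boolean mask of the kept features
print(n_selected, mask)
```

A larger step means fewer model fits, at the cost of a coarser search over the number of features.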
Residuals Plot
In the context of a regression model, a residual is the difference between the observed value of the target variable (y) and its predicted value (ŷ), i.e., the error of the prediction. A residuals plot shows the residuals on the vertical axis and the dependent variable on the horizontal axis, which makes it possible to detect regions of the target that may be prone to more or less error.
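from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from yellowbrick.regressor import ResidualsPlot
# Split the data into training and test sets
# (assumes X and y hold a regression dataset, e.g. the concrete data loaded earlier)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Instantiate the linear model and visualizer
ridge = Ridge()
visualizer = ResidualsPlot(ridge)
visualizer.fit(X_train, y_train) # Fit the training data to the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.poof() # Draw/show/poof the data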
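The quantity being plotted is simple to compute by hand. The sketch below fits a Ridge model on synthetic data and derives the residuals as observed minus predicted; the data-generating coefficients and the names `X_demo`, `y_demo`, and `residuals` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic linear data with a little noise (assumption)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

model = Ridge().fit(X_tr, y_tr)
y_pred = model.predict(X_te)

# Residual = observed value minus predicted value
residuals = y_te - y_pred
print(residuals.mean(), residuals.std())
```

For a well-specified model the residuals should scatter around zero with no visible pattern.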
Regularized Alpha Selection
Regularization is designed to penalize model complexity, so the higher the α, the less complex the model, which reduces error due to variance (overfitting). On the other hand, an α that is too high increases error due to bias (underfitting). It is therefore important to choose an α that minimizes the error in both directions.
The AlphaSelection visualizer shows how different values of α influence model selection during the regularization of linear models. Generally speaking, α increases the effect of regularization: if α is zero there is no regularization, and the higher the α, the more the regularization parameter influences the final model.
import numpy as np
from sklearn.linear_model import LassoCV
from yellowbrick.regressor import AlphaSelection
# Create a list of alphas to cross-validate against
alphas = np.logspace(-10, 1, 400)
# Instantiate the linear model and visualizer
model = LassoCV(alphas=alphas)
visualizer = AlphaSelection(model)
visualizer.fit(X, y)
g = visualizer.poof()
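To see what a larger α does to a model, the following sketch fits scikit-learn's `Lasso` at several α values and records how many coefficients survive and how large they are in total. The synthetic data, the α grid, and the dictionaries `nonzero` and `norms` are assumptions chosen for this example.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (assumption): 5 features, two of which have no effect
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([3.0, -2.0, 0.0, 0.0, 1.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

nonzero = {}  # alpha -> number of non-zero coefficients
norms = {}    # alpha -> L1 norm of the coefficient vector
for alpha in (0.01, 0.1, 1.0, 10.0):
    coef = Lasso(alpha=alpha).fit(X_demo, y_demo).coef_
    nonzero[alpha] = int(np.count_nonzero(coef))
    norms[alpha] = float(np.abs(coef).sum())

for alpha in nonzero:
    print(f"alpha={alpha}: {nonzero[alpha]} non-zero, L1 norm {norms[alpha]:.3f}")
```

As α grows, coefficients shrink toward zero and eventually vanish, which is the loss of complexity described above.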
Class Prediction Error
The Class Prediction Error plot provides a quick way to see how well a classifier is predicting the correct class.
from sklearn.ensemble import RandomForestClassifier
from yellowbrick.classifier import ClassPredictionError
# Instantiate the classification model and visualizer
visualizer = ClassPredictionError(
RandomForestClassifier(), classes=classes
)
# Fit the training data to the visualizer
visualizer.fit(X_train, y_train)
# Evaluate the model on the test data
visualizer.score(X_test, y_test)
# Draw visualization
g = visualizer.poof()
Of course, there are also visualizations of classification evaluation indicators, including confusion matrix, AUC/ROC, recall rate/precision rate, etc.
Binary Classification Discrimination Threshold
This visualizer plots precision, recall, f1 score, and queue rate against the discrimination threshold of a binary classifier. The discrimination threshold is the probability or score at which the positive class is chosen over the negative class. It is typically set to 50%, but it can be adjusted to increase or decrease sensitivity to false positives or to account for other application factors.
from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import DiscriminationThreshold
# Instantiate the classification model and visualizer
logistic = LogisticRegression()
visualizer = DiscriminationThreshold(logistic)
visualizer.fit(X, y) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data
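The trade-off can also be inspected numerically. This sketch applies three thresholds to the predicted probabilities of a logistic regression and computes precision, recall, and queue rate (taken here to mean the fraction of instances flagged as positive) at each. The dataset is synthetic and the scores are computed on the training data purely to keep the example short.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

# Synthetic binary classification data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)
proba = model.predict_proba(X_demo)[:, 1]  # probability of the positive class

precisions, recalls, queue_rates = {}, {}, {}
for t in (0.2, 0.5, 0.8):
    # Predict positive only when the probability reaches the threshold
    y_hat = (proba >= t).astype(int)
    precisions[t] = precision_score(y_demo, y_hat, zero_division=0)
    recalls[t] = recall_score(y_demo, y_hat)
    queue_rates[t] = float(y_hat.mean())

for t in recalls:
    print(t, precisions[t], recalls[t], queue_rates[t])
```

Raising the threshold can only lower recall and queue rate, while precision usually (but not always) rises.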
Elbow Method for Clustering
KElbowVisualizer implements the "elbow" method to help data scientists choose the optimal number of clusters by fitting the model with a range of values for K. If the line chart resembles an arm, then the "elbow" (the point of inflection on the curve) is a good indication that the underlying model fits best at that point.
In the example below, KElbowVisualizer fits a KMeans model for a range of K values from 4 to 11 on a synthetic dataset with 8 random clusters of points. We can see the "elbow" in the plot when the model is fit with 8 clusters, which in this case we know to be the optimal number.
from sklearn.datasets import make_blobs
# Create synthetic dataset with 8 random clusters
X, y = make_blobs(centers=8, n_features=12, shuffle=True, random_state=42)
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer
# Instantiate the clustering model and visualizer
model = KMeans()
visualizer = KElbowVisualizer(model, k=(4,12))
visualizer.fit(X) # Fit the data to the visualizer
visualizer.poof() # Draw/show/poof the data
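The curve behind an elbow plot can be approximated by collecting the KMeans `inertia_` (the sum of squared distances from each point to its nearest center) for each K. The sketch below does this on a small synthetic dataset; the data, the range of K, and the use of inertia as the score are assumptions for illustration, and yellowbrick's own scoring may be computed differently.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 clusters (assumption)
X_demo, _ = make_blobs(n_samples=400, centers=4, n_features=2, random_state=42)

# Fit one KMeans model per candidate K and record its inertia
inertias = {}
for k in range(2, 8):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    inertias[k] = float(model.inertia_)

for k, value in inertias.items():
    print(k, round(value, 1))
```

The elbow is the K after which additional clusters reduce the inertia only slightly.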
Intercluster Distance Maps
The intercluster distance map shows an embedding of the cluster centers in 2D that preserves the distances to other centers. That is, the closer two centers are in the visualization, the closer they are in the original feature space. The clusters are sized according to a scoring metric. By default, they are sized by membership, i.e. the number of instances that belong to each center, which gives a sense of the relative importance of the clusters. Note, however, that two clusters overlapping in the 2D space does not imply that they overlap in the original feature space.
from sklearn.datasets import make_blobs
# Make 12 blobs dataset
X, y = make_blobs(centers=12, n_samples=1000, n_features=16, shuffle=True)
from sklearn.cluster import KMeans
from yellowbrick.cluster import InterclusterDistance
# Instantiate the clustering model and visualizer
visualizer = InterclusterDistance(KMeans(9))
visualizer.fit(X) # Fit the training data to the visualizer
visualizer.poof() # Draw/show/poof the data
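The two ingredients of such a map, a 2D embedding of the centers and a size per cluster, can be sketched with scikit-learn. The example below uses multidimensional scaling (MDS) for the embedding, which is one distance-preserving choice assumed here for illustration; the dataset and variable names are likewise made up.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.manifold import MDS

# Synthetic 8-dimensional data with 5 clusters (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=5, n_features=8, random_state=0)

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_demo)

# Embed the cluster centers in 2D while trying to preserve their distances
centers_2d = MDS(n_components=2, random_state=0).fit_transform(
    kmeans.cluster_centers_
)

# Cluster "size" by membership: number of instances assigned to each center
sizes = np.bincount(kmeans.labels_, minlength=5)
print(centers_2d)
print(sizes)
```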
Model Selection - Learning Curve
A learning curve shows the relationship between a model's training score and its cross-validated test score for varying numbers of training samples. This visualization is usually used to show two things:
- Will the model get better as the amount of data increases?
- Is the model more sensitive to error due to bias or to error due to variance?
Yellowbrick can generate learning curves for classification, regression, and clustering models.
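The numbers behind a learning curve can be computed with scikit-learn's `learning_curve` function. The following sketch does so for a Gaussian naive Bayes classifier on synthetic data; the estimator, dataset, and training-size grid are assumptions chosen for this example, and plotting is left out.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# Synthetic classification data (assumption)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

# Score the model at 5 training-set sizes using 5-fold cross-validation
sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X_demo, y_demo,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring="accuracy",
)

# Average over the folds to get one point per training size
train_mean = train_scores.mean(axis=1)
test_mean = test_scores.mean(axis=1)
for n, tr, te in zip(sizes, train_mean, test_mean):
    print(n, round(tr, 3), round(te, 3))
```

If the two curves converge at a low score, the model likely suffers from bias; a persistent gap between them points to variance.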
Model Selection - Validation Curve
Model validation is used to determine how effective a model is on the data it has been trained on and how well it generalizes to new inputs. To measure the performance of the model, we first split the dataset into training and test sets, fit the model on the training data, and score it on the held-out test data.
To maximize the score, the hyperparameters of the model must be chosen so that the model operates as well as possible in the specified feature space. Most models have multiple hyperparameters, and the best way to choose a combination of them is a grid search. However, it is sometimes useful to plot the effect of a single hyperparameter on the training and test scores to determine whether the model is underfitting or overfitting for certain hyperparameter values.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from yellowbrick.model_selection import ValidationCurve
# Load a regression dataset
data = load_data('energy')
# Specify features of interest and the target
targets = ["heating load", "cooling load"]
features = [col for col in data.columns if col not in targets]
# Extract the instances and target
X = data[features]
y = data[targets[0]]
viz = ValidationCurve(
DecisionTreeRegressor(), param_name="max_depth",
param_range=np.arange(1, 11), cv=10, scoring="r2"
)
# Fit and poof the visualizer
viz.fit(X, y)
viz.poof()
Summary
yellowbrick is very easy to use. First, it solves the visualization problem in feature engineering and modeling, which greatly simplifies the workflow; second, its various visualizations can reveal blind spots in modeling.
Link: https://github.com/DistrictDataLabs/yellowbrick