Exploring the Python Machine Learning Visualization Library Yellowbrick: A Tutorial

Background

When learning sklearn, besides working through the algorithms you also have to learn matplotlib for visualization. In my practical work visualization matters even more than modeling, but matplotlib's ease of use and aesthetics leave a lot to be desired. I gradually moved through plotly and seaborn and finally settled on Bokeh, because it integrates neatly with Flask, which makes data dashboards much easier to build.

A while back I noticed that this library makes data exploration more convenient, so today I set aside some time to learn it. I started from the original English documentation and then found that a Chinese version exists; it reads like a rough machine translation, but it still saves some effort, so I worked through it half by reading and half by copying, and along the way found a few places where it does not quite match the official docs.

# http://www.scikit-yb.org/zh/latest/tutorial.html

Model Selection Guide

In this tutorial, we will look at scores for a variety of Scikit-Learn models and compare them using Yellowbrick's visual diagnostic tools in order to select the best model for our data.

The model selection triple

Discussions of machine learning often focus on model selection. Whether it is logistic regression, random forests, Bayesian methods, or artificial neural networks, machine learning practitioners are usually quick to express their preferences. This is mostly for historical reasons. Although modern third-party libraries make deploying all kinds of machine learning models seem trivial, traditionally even applying and tuning one of these algorithms required years of study. As a result, machine learning practitioners tend to have a strong preference for particular (and more likely familiar) models over others.

However, model selection is more nuanced than simply picking the "right" or "wrong" algorithm. In practice, the workflow includes:

selecting and/or engineering the smallest and most predictive feature set,
choosing a set of algorithms from a model family, and
tuning the algorithm hyperparameters to optimize performance.

The model selection triple was first described in a 2015 SIGMOD paper by Kumar et al. Their paper, which discusses building a next-generation database system for predictive modeling, aptly observes that such a system is urgently needed because machine learning in practice is highly experimental. "Model selection," they explain, "is iterative and exploratory because the space of (model selection triples) is usually infinite, and it is generally impossible for analysts to know a priori which (combination) will yield satisfactory accuracy and/or insight."

Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can zero in on quality models far more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer toward final, interpretable models and avoid pitfalls.

The Yellowbrick library is a visual diagnostic platform for machine learning that lets data scientists steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of a Scikit-Learn Pipeline, providing visual diagnostics throughout the transformation of high-dimensional data.
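As a warm-up, here is a minimal sketch of my own (not from the original tutorial) showing the fit/score/poof pattern that Visualizers share with Scikit-Learn estimators. It assumes yellowbrick and scikit-learn are installed and borrows sklearn's built-in iris data purely for illustration.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from yellowbrick.classifier import ClassificationReport

X_iris, y_iris = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X_iris, y_iris, random_state=0)

model = LogisticRegression(max_iter=1000)   # any scikit-learn estimator
visualizer = ClassificationReport(model)    # wrap it in a Visualizer
visualizer.fit(X_tr, y_tr)                  # fit it like a normal estimator
visualizer.score(X_te, y_te)                # scoring populates the plot data
visualizer.poof()                           # render the figure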

About the Data

This tutorial uses a modified version of the mushroom dataset from the UCI Machine Learning Repository. Our goal is to predict whether a mushroom is poisonous or edible based on its characteristics.

The data consist of hypothetical sample descriptions corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota families. Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended (this last class was combined with the poisonous class).

Our document "agaricus-lepiota.txt", contains three nominally valued attribute information and the 8124 target instance mushrooms (edible 4208, 3916 toxic).

Let's load the data with Pandas.

import os
import pandas as pd
mushrooms = 'data/shrooms.csv'  # dataset file
dataset   = pd.read_csv(mushrooms)
# dataset.columns = names
dataset.head()

  | id | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | Unnamed: 24
0 | 1 | p | x | s | n | t | p | f | c | n | ... | w | w | p | w | o | p | k | s | u | NaN
1 | 2 | e | x | s | y | t | a | f | c | b | ... | w | w | p | w | o | p | n | n | g | NaN
2 | 3 | e | b | s | w | t | l | f | c | b | ... | w | w | p | w | o | p | n | n | m | NaN
3 | 4 | p | x | y | w | t | p | f | c | n | ... | w | w | p | w | o | p | k | s | u | NaN
4 | 5 | e | x | s | g | f | n | f | w | b | ... | w | w | p | w | o | e | n | a | g | NaN

5 rows × 25 columns

features = ['cap-shape', 'cap-surface', 'cap-color']
target   = ['class']
X = dataset[features]
y = dataset[target]
dataset.shape  # two fewer mushrooms than in the official docs
(8122, 25)
dataset.groupby('class').count()  # one fewer mushroom in each class

class | id | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | ... | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat | Unnamed: 24
e | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | ... | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 4207 | 0
p | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | ... | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 3915 | 0

2 rows × 24 columns

Feature Extraction

Our data, including the target, are categorical. To use machine learning, we need to convert these values to numeric features. To extract this structure from the dataset, we have to use Scikit-Learn transformers to convert the input dataset into something the model can fit on. Fortunately, Scikit-Learn provides a transformer for converting categorical labels into integer codes: sklearn.preprocessing.LabelEncoder. Unfortunately, it can only transform a single column (vector) at a time, so we have to adapt it so that it applies to multiple columns.
A question I still had: is each column here the "vector" of mushroom categories being encoded?

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
class EncodeCategorical(BaseEstimator, TransformerMixin):
    """
    Encodes a specified list of columns or all columns if None.
    """

    def __init__(self, columns=None):
        # If no columns are given, fall back to encoding every column at fit time
        self.columns  = [col for col in columns] if columns is not None else None
        self.encoders = None

    def fit(self, data, target=None):
        """
        Expects a data frame with named columns to encode.
        """
        # Encode all columns if columns is None
        if self.columns is None:
            self.columns = data.columns

        # Fit a label encoder for each column in the data frame
        self.encoders = {
            column: LabelEncoder().fit(data[column])
            for column in self.columns
        }
        return self

    def transform(self, data):
        """
        Uses the encoders to transform a data frame.
        """
        output = data.copy()
        for column, encoder in self.encoders.items():
            output[column] = encoder.transform(data[column])

        return output
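To sanity-check what this transformer does, here is a tiny example of my own; the toy frame and its column names are made up for illustration and are not part of the mushroom data.

import pandas as pd

toy = pd.DataFrame({'color': ['red', 'green', 'red'],
                    'size':  ['s', 'l', 'm']})
enc = EncodeCategorical(toy.columns)
print(enc.fit_transform(toy))
#    color  size
# 0      1     2   (LabelEncoder assigns integer codes alphabetically per column)
# 1      0     0
# 2      1     1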

Modeling and Evaluation

Common metrics for evaluating classifiers

Precision is the number of correct positive results divided by the total number of positive results returned (for example, of the mushrooms we predicted to be edible, how many actually are?).

Recall is the number of correct positive results divided by the number of positive results that should have been returned (for example, how many of the mushrooms that are poisonous did we correctly predict to be poisonous?).

The F1 score is a measure of a test's accuracy. It considers both the precision and the recall to compute the score. The F1 score can be interpreted as a weighted average of precision and recall, reaching its best value at 1 and its worst at 0.
precision = true positives / (true positives + false positives)

recall = true positives / (false negatives + true positives)

F1 score = 2 * ((precision * recall) / (precision + recall))
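A tiny worked example of the three formulas above, with made-up counts (my own addition for illustration):

tp, fp, fn = 8, 2, 4                 # made-up true positives, false positives, false negatives
precision = tp / (tp + fp)           # 0.8
recall    = tp / (tp + fn)           # 0.666...
f1 = 2 * (precision * recall) / (precision + recall)   # about 0.727
print(precision, recall, f1)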
Now we are ready to make some predictions!

Let's build a way to evaluate multiple estimators, first using conventional numerical scores (we will later compare these with some of Yellowbrick's visual diagnostics).

from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline
def model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(X.keys())),
         ('one_hot_encoder', OneHotEncoder(categories='auto')),  # categories='auto' added here, otherwise sklearn emits a warning
         ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    model.fit(X, y)

    expected  = y
    predicted = model.predict(X)

    # Compute and return the F1 score (the harmonic mean of precision and recall)
    return (f1_score(expected, predicted))
from sklearn.svm import LinearSVC, NuSVC, SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegressionCV, LogisticRegression, SGDClassifier
from sklearn.ensemble import BaggingClassifier, ExtraTreesClassifier, RandomForestClassifier
model_selection(X, y, LinearSVC())
0.6582119537920643
import warnings
warnings.filterwarnings("ignore", category=FutureWarning, module="sklearn")  # suppress sklearn FutureWarnings
model_selection(X, y, NuSVC())
0.6878837238441299
model_selection(X, y, SVC())
0.6625145971195017
model_selection(X, y, SGDClassifier())
0.5738408700629649
model_selection(X, y, KNeighborsClassifier())
0.6856846473029046
model_selection(X, y, LogisticRegressionCV())
0.6582119537920643
model_selection(X, y, LogisticRegression())
0.6578749058025622
model_selection(X, y, BaggingClassifier())
0.6873901878632248
model_selection(X, y, ExtraTreesClassifier())
0.6872294372294372
model_selection(X, y, RandomForestClassifier())
0.6992081007399714

Preliminary model evaluation

Based on the F1 scores above, which model performs best?

Visual model evaluation

Now let's refactor our model evaluation function to use Yellowbrick's ClassificationReport class, a model visualizer that displays precision, recall, and F1 scores. This visual model analysis tool integrates the numerical scores with a color-coded heatmap to support easy interpretation and detection, in particular the nuances of Type I and Type II errors, which are very relevant (life-or-death, even!) to our use case.

A Type I error (or "false positive") is detecting an effect that is not present (for example, concluding a mushroom is poisonous when it is actually edible).

A Type II error (or "false negative") is failing to detect an effect that is present (for example, believing a mushroom is edible when it is actually poisonous).
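To make the two error types concrete, they can be read directly off a confusion matrix. The snippet below is my own illustration with made-up labels (0 = edible, 1 = poisonous), not part of the original tutorial.

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]   # made-up ground truth
y_pred = [0, 1, 1, 0, 1, 0]   # made-up predictions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("false positives (Type I): ", fp)   # edible mushrooms called poisonous
print("false negatives (Type II):", fn)   # poisonous mushrooms called edible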

from sklearn.pipeline import Pipeline
from yellowbrick.classifier import ClassificationReport

def visual_model_selection(X, y, estimator):
    """
    Test various estimators.
    """
    y = LabelEncoder().fit_transform(y.values.ravel())
    model = Pipeline([
         ('label_encoding', EncodeCategorical(X.keys())),
         ('one_hot_encoder', OneHotEncoder()),
         ('estimator', estimator)
    ])

    # Instantiate the classification model and visualizer
    visualizer = ClassificationReport(model, classes=['edible', 'poisonous'])
    visualizer.fit(X, y)
    visualizer.score(X, y)
    visualizer.poof()
visual_model_selection(X, y, LinearSVC())

[Figure: ClassificationReport heatmap for LinearSVC]

# visualizations for the other classifiers are omitted here
visual_model_selection(X, y, RandomForestClassifier())

[Figure: ClassificationReport heatmap for RandomForestClassifier]
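If you do want the omitted reports, a sketch like the following (my own addition) renders them all in one pass:

for est in [NuSVC(), SVC(), SGDClassifier(), KNeighborsClassifier(),
            LogisticRegressionCV(), LogisticRegression(),
            BaggingClassifier(), ExtraTreesClassifier()]:
    visual_model_selection(X, y, est)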

Test

Now, which model looks best? Why?
Which model is most likely to save your life?
How does the experience of visual model evaluation differ from numerical model evaluation?

Reference: precision (Precision), recall (Recall), and the combined evaluation metric F1-Measure
http://www.makaidong.com/%E5%8D%9A%E5%AE%A2%E5%9B%AD%E7%83%AD%E6%87%96/437.shtml
The F1 score takes both precision and recall into account.
Visualization really is more intuitive. That's all for now ~

About the Author

yeayee on Zhihu, five years into Python, handy with Flask + MongoDB + sklearn + Bokeh


Origin: blog.51cto.com/14509091/2431116