Introduction to Supervised Learning

Table of Contents

1>Understand supervised learning

2>Supervised learning in OpenCV

3>Use scoring function to evaluate model performance


1>Understand supervised learning

label:

In simple linear regression, the label is what we want to predict, that is, the dependent variable y. A label could be the future price of a house, the species shown in a picture, or anything else we want to predict.

feature:

In simple linear regression, the feature is the input variable, that is, the independent variable X. Simple machine learning projects may use a single feature, while more complex machine learning projects may use millions of features, specified as follows:

\left \{ X_{1},X_{2},...,X_{n} \right \}

In the example of a spam detector, features might include:

  • The address of the sender.
  • The time period during which the email is sent.
  • Words in the email text.

The goal of supervised learning is to predict the label or target value of some data, so supervised learning can be divided into two forms:

  • Classification : Supervised learning that uses data to predict a category is called classification. For example, predicting whether an image contains a dog or a cat. Here the label of the data is the category: each sample belongs to exactly one category, never a mixture of two. When there are only two choices, the task is called two-class or binary classification. When there are more than two categories, such as predicting the next day's weather, it is called multi-class classification.
  • Regression : Supervised learning that uses data to predict a real (continuous) value is called regression. For example, predicting the value of a stock. Unlike predicting the category of a stock, the goal of regression is to predict the target value as accurately as possible.

2>Supervised learning in OpenCV

Building a machine learning model in OpenCV always follows the same logic:

  • Initialization : Call the model by name to create an empty instance of the model.
  • Setting parameters : If the model requires parameters, they can be set through setter functions.
  • Training the model : Each model must provide a method called train, which is used to fit the model to some data.
  • Predicting new labels : Each model must provide a method called predict, which is used to predict the labels of new data.
  • Evaluating the model : Each model must provide a method called calcError, which is used to evaluate the performance of the model.

3>Use scoring function to evaluate model performance

In binary classification, there are several ways to evaluate a classifier. Some common metrics are:

  • Accuracy (accuracy_score) : The ratio of the number of correct predictions to the total number of samples, positive and negative. For example, when classifying pictures as cats or dogs, accuracy is the proportion of pictures that are correctly classified as containing a cat or a dog.
  • Precision (precision_score) : Among the samples predicted to be positive, the proportion that are actually positive. For example, among all the pictures predicted to contain cats, the proportion of pictures that actually contain cats.
  • Recall (recall_score) : Among the samples that are actually positive, the proportion that are correctly predicted to be positive. For example, among all the pictures that actually contain cats, the proportion of pictures that have been correctly identified as cats.

Open a new IPython session:

ipython

Simulate some class labels that are either 0 or 1. First, set the seed of the random number generator to a fixed value:

import numpy as np
np.random.seed(42)

Generate 5 random labels that are either 0 or 1 by drawing random integers from the range 0 to (2-1):

y_true=np.random.randint(0, 2, size=5)
y_true
# Result: array([0, 1, 0, 0, 0])

Suppose there is a classifier that tries to predict the class labels. For the sake of illustration, assume this classifier is not very accurate and always predicts label 1:

y_pred=np.ones(5, dtype=np.int32)
y_pred
# Result: array([1, 1, 1, 1, 1])

Calculate the accuracy (two methods):

np.sum(y_true==y_pred)/len(y_true)
# Result: 0.2

from sklearn import metrics
metrics.accuracy_score(y_true, y_pred)
# Result: 0.2

Confusion matrix (rows: actual class, columns: predicted class):

                   Predicted positive     Predicted negative
Actual positive    TP (true positive)     FN (false negative)
Actual negative    FP (false positive)    TN (true negative)

Calculate TP (true positives, that is, we predicted 1 and the label is also 1):

true_a_positive=(y_true==1)
pred_a_positive=(y_pred==1)
true_positive=np.sum(pred_a_positive*true_a_positive)
true_positive
# Result: 1

Calculate TN (true negatives, that is, we predicted 0 and the label is also 0):

true_negative=np.sum((y_pred==0)*(y_true==0))
true_negative
# Result: 0

Calculate FP (false positives, that is, we predicted 1 but the label is 0):

false_positive=np.sum((y_pred==1)*(y_true==0))
false_positive
# Result: 4

Calculate FN (false negatives, that is, we predicted 0 but the label is 1):

false_negative=np.sum((y_pred==0)*(y_true==1))
false_negative
# Result: 0
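
The four counts above can also be cross-checked with scikit-learn's confusion_matrix. This is a minimal sketch, with the label arrays written out by hand so the block runs on its own. Note that scikit-learn orders the matrix as [[TN, FP], [FN, TP]] for labels 0 and 1, which differs from the layout of the confusion matrix table shown earlier.

```python
import numpy as np
from sklearn import metrics

y_true = np.array([0, 1, 0, 0, 0])   # same labels as generated above
y_pred = np.ones(5, dtype=np.int32)  # classifier that always predicts 1

# Rows are actual classes, columns are predicted classes, both ordered 0, 1.
cm = metrics.confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # prints: 0 4 0 1
```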

To make sure there is no mistake, calculate the accuracy again from these counts:

Accuracy=\frac{TP+TN}{TP+TN+FP+FN}

accuracy=(true_positive+true_negative)/len(y_true)
accuracy
# Result: 0.2

Calculate the precision:

Precision=\frac{TP}{TP+FP}

precision=true_positive/(true_positive+false_positive)
precision
# Result: 0.2

Use scikit-learn to check our math:

metrics.precision_score(y_true, y_pred)
# Result: 0.2

Calculate the recall:

Recall=\frac{TP}{TP+FN}

recall=true_positive/(true_positive+false_negative)
recall
# Result: 1.0

Use scikit-learn to check our math:

metrics.recall_score(y_true, y_pred)
# Result: 1.0
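
The manual steps above can be gathered into one small helper. This is only a sketch: the function name binary_scores is hypothetical, and it assumes that labels are 0 or 1, that at least one sample is predicted positive, and that at least one sample is actually positive (otherwise the divisions are undefined).

```python
import numpy as np

def binary_scores(y_true, y_pred):
    """Return accuracy, precision and recall for 0/1 labels (hypothetical helper)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))  # true positives
    tn = np.sum((y_pred == 0) & (y_true == 0))  # true negatives
    fp = np.sum((y_pred == 1) & (y_true == 0))  # false positives
    fn = np.sum((y_pred == 0) & (y_true == 1))  # false negatives
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Same labels and always-1 predictions as in the walkthrough above.
scores = binary_scores([0, 1, 0, 0, 0], [1, 1, 1, 1, 1])
print(scores)
```

For the always-1 classifier this reproduces the values computed step by step: accuracy 0.2, precision 0.2, recall 1.0.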

To score a regressor, use the scoring functions for mean squared error, explained variance, and the R-squared value:

  • mean_squared_error (mean squared error) : the average of the squared differences between the predicted value and the true value of each data point. The closer the mean squared error is to 0, the closer the predicted values are to the true values.

MSE=\frac{1}{n}\sum_{i=1}^{n} \left ( y_{i} - \hat{y_{i}}\right )^{2}

  • explained_variance_score (explained variance) : the fraction of the variance of the true values that the model accounts for. The closer the explained variance is to 1, the closer the predicted values are to the true values.
  • r2_score (R-squared value) : R^{2} is also called the coefficient of determination. The closer the coefficient of determination is to 1, the closer the predicted values are to the true values.
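
For reference, the other two scores can be written in the same notation as the MSE. These are the standard definitions, added here as an aid and not taken from the original text; \bar{y} denotes the mean of the true values and Var denotes the variance.

```latex
ExplainedVariance=1-\frac{Var\left ( y-\hat{y} \right )}{Var\left ( y \right )}

R^{2}=1-\frac{\sum_{i=1}^{n}\left ( y_{i}-\hat{y_{i}} \right )^{2}}{\sum_{i=1}^{n}\left ( y_{i}-\bar{y} \right )^{2}}
```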

Create another simulated data set: a linear space from 0 to 10 on the x-axis with 100 sample points:

x=np.linspace(0, 10, 100)

Real data always contains noise. To reflect this, we add noise to the target values y_true by adding noise to the sin function:

y_true=np.sin(x)+np.random.rand(x.size)-0.5

Here NumPy's rand function generates noise uniformly distributed in [0,1); subtracting 0.5 centers the noise at 0.
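
A quick check of this claim, as a sketch with a hypothetical sample size of 100000 and its own fixed seed: the shifted noise should lie in [-0.5, 0.5) and average close to 0.

```python
import numpy as np

rng = np.random.RandomState(0)   # separate generator, so the global seed is untouched
noise = rng.rand(100000) - 0.5   # uniform on [0, 1), shifted to [-0.5, 0.5)

print(noise.min() >= -0.5, noise.max() < 0.5)  # both bounds hold
print(abs(noise.mean()) < 0.01)                # the mean is close to 0
```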

Predict the y values (here the model is simply the noise-free sin function):

y_pred=np.sin(x)

Use Matplotlib to visualize it:

import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib
plt.plot(x, y_pred, linewidth=4, label='model')
plt.plot(x, y_true, 'o', label='data')
plt.xlabel('x')
plt.ylabel('y')
plt.legend(loc='lower left')

This produces a plot with the model drawn as a line and the noisy data drawn as points.

Use the mean squared error to judge the quality of the model:

mse=np.mean((y_true-y_pred)**2)
mse
# Result: 0.08531839480842378

Use scikit-learn to check our math:

metrics.mean_squared_error(y_true, y_pred)
# Result: 0.08531839480842378

Use the explained variance to judge the quality of the model. First compute the fraction of variance unexplained (fvu), then the fraction of variance explained (fve):

fvu=np.var(y_true-y_pred)/np.var(y_true)
fvu
# Result: 0.163970326266295

fve=1.0-fvu
fve
# Result: 0.836029673733705

Use scikit-learn to check our math:

metrics.explained_variance_score(y_true, y_pred)
# Result: 0.836029673733705

Use the coefficient of determination to judge the quality of the model:

r2=1.0-mse/np.var(y_true)
r2
# Result: 0.8358169419264746

Use scikit-learn to check our math:

metrics.r2_score(y_true, y_pred)
# Result: 0.8358169419264746
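
The explained variance (0.8360) and R^{2} (0.8358) above are close but not identical. One way to see why is sketched below on hypothetical data with a deliberately biased model: the variance of the residuals equals the MSE minus the squared mean residual, so the two scores differ exactly when the predictions are off on average.

```python
import numpy as np

rng = np.random.RandomState(0)
x = np.linspace(0, 10, 100)
y_true = np.sin(x) + rng.rand(x.size) - 0.5
y_pred = np.sin(x) + 0.1   # hypothetical model with a constant bias of 0.1

residual = y_true - y_pred
mse = np.mean(residual ** 2)
fve = 1.0 - np.var(residual) / np.var(y_true)   # explained variance
r2 = 1.0 - mse / np.var(y_true)                 # coefficient of determination

# np.var(residual) equals mse - residual.mean()**2, so the gap between the scores is:
gap = residual.mean() ** 2 / np.var(y_true)
print(np.isclose(fve - r2, gap))  # prints: True
```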

A constant model that always predicts the mean of y_true, without using the value of x, always has an R^{2} of 0:

metrics.r2_score(y_true, np.mean(y_true)*np.ones_like(y_true))
# Result: 0.0
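
The result is 0 because, for a constant prediction equal to the mean, the residual sum of squares equals the total sum of squares. The sketch below uses a hypothetical helper r2_manual (not part of scikit-learn) and made-up values; it also shows that a constant far from the mean scores below 0.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (hypothetical helper for illustration)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
mean_model = np.full_like(y_true, np.mean(y_true))  # always predicts 2.5
bad_model = np.full_like(y_true, 10.0)              # always predicts 10

print(r2_manual(y_true, mean_model))  # 0.0
print(r2_manual(y_true, bad_model))   # -45.0
```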

Origin blog.csdn.net/Kannyi/article/details/112446717