[Machine Learning] Basics of Supervised Learning Algorithms

The concept of supervised learning

  • The main task of supervised learning is to predict the label of a data object from its features; the corresponding learning algorithm needs to learn from experience.
  • Experience comes from labeled training data, i.e. data randomly sampled from a collection of data objects (also called samples).

Feature group (features)


Definition (feature group): In a supervised learning task, the vector $x = (x_1, x_2, \dots, x_n) \in \mathbb{R}^n$ is called the feature group, i.e. the feature vector, of the object. Let $X \subseteq \mathbb{R}^n$ be the set of all possible values of the feature vector; $X$ is called the sample space.

Label


Definition: In a regression problem, the training data contain a numerical label $y \in \mathbb{R}$; in a $k$-ary classification problem, the training data contain a vector label $y \in [0,1]^k$. Let $Y$ be the set of all possible label values; $Y$ is called the label space.
Classification problems:
For example, the handwritten digit recognition problem: $y \in [0,1]^{10}$

  1. Vector label for the digit 3: $y = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)$
  2. Scalar label for the digit 3: $y = 3$, $y \in Y$, $Y = \{0, 1, 2, 3, \dots, 9\}$ (a small conversion sketch follows below)
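A minimal sketch of converting between the two label encodings; the helper name `to_one_hot` is illustrative (not from the text), assuming 10 classes:

```python
import numpy as np

def to_one_hot(digit, num_classes=10):
    """Encode a scalar digit label as a one-hot vector label."""
    y = np.zeros(num_classes)
    y[digit] = 1.0
    return y

y_vec = to_one_hot(3)             # array([0., 0., 0., 1., 0., 0., 0., 0., 0., 0.])
y_scalar = int(np.argmax(y_vec))  # back to the scalar label 3
```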

Model


Definition: Let $X$ be the sample space and $Y$ the label space. The set $\Phi$ of mappings from $X$ to $Y$ is called the model space; any $h \in \Phi$ is called a model.

Supervised learning task


Definition: Given a sample space $X$, a label space $Y$, an unknown feature distribution $D$, and label distributions $\{D_x : x \in X\}$, the supervised learning task is to train a model $h$, i.e. a mapping from $X$ to $Y$, written $h : X \to Y$. For the feature vector $x$ of any sample, $h(x)$ is taken as the prediction of the label of $x$: $\hat{y} = h(x)$.

Loss function


Definition: Let $X$ be the sample space and $Y$ the label space. A loss function is a function $l : Y \times Y \to \mathbb{R}^+$ from $Y \times Y$ to the non-negative reals satisfying, for any $y \in Y$, $l(y, y) = 0$.

Example (0-1 loss):
$$l(y, \hat{y}) = \begin{cases} 0, & \text{if } y = \hat{y} \\ 1, & \text{otherwise} \end{cases}$$
where $\hat{y} = h(x)$, $x \in X$, $\hat{y} \in Y$ ($y$ is the true label of the sample and $\hat{y}$ is the predicted label for sample $x$).

For example, the squared loss function: $l(y, \hat{y}) = (y - \hat{y})^2$, $y, \hat{y} \in \mathbb{R}$.
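A minimal sketch of the two losses defined above; the function names are illustrative:

```python
def zero_one_loss(y, y_hat):
    """0-1 loss: 0 if the prediction equals the true label, 1 otherwise."""
    return 0.0 if y == y_hat else 1.0

def squared_loss(y, y_hat):
    """Squared loss for regression labels: (y - y_hat)^2."""
    return (y - y_hat) ** 2

print(zero_one_loss(3, 3), zero_one_loss(3, 7))  # 0.0 1.0
print(squared_loss(2.0, 2.5))                    # 0.25
```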

Test data and model metrics (test error)


In supervised learning, given a sample space $X$, a label space $Y$, an unknown feature distribution $D$, and label distributions $\{D_x : x \in X\}$, suppose the supervised learning algorithm outputs the model $h$. Given a data set $T = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(t)}, y^{(t)})\}$, where $x^{(1)}, x^{(2)}, \dots, x^{(t)} \sim D$ are sampled independently from $X$ according to the feature distribution $D$, and for any $1 \le i \le t$, $y^{(i)} \sim D_{x^{(i)}}$, we call $T$ the test data set. The average loss of the model $h$ on the test set $T$,
$$L_T(h) = \frac{1}{t} \sum_{i=1}^{t} l\big(h(x^{(i)}), y^{(i)}\big),$$
is used as the measure of the performance of $h$. When the test set is sufficiently large, this empirical loss approximates the expected loss well.

In practice, given a data set (whose distribution $D$ is unknown), we first randomly split it into a training set and a test set in some ratio, so that both sets follow the distribution $D$; we then train the model $h$ on the training set and compute its loss on the test set as the performance measure of $h$.
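A minimal sketch of this procedure (all names and data below are illustrative, not from the text): split a toy data set, then use the average loss of a fixed model `h` on the held-out part as the test error $L_T(h)$:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def average_loss(h, X, y, loss):
    """Empirical (average) loss of model h on the data set (X, y)."""
    return np.mean([loss(y_i, h(x_i)) for x_i, y_i in zip(X, y)])

# Toy data: x ~ Uniform[-1, 1], y = 2x + noise, and a fixed model h(x) = 2 * x[0]
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 1))
y = 2 * X[:, 0] + rng.normal(0, 0.3, size=200)
h = lambda x: 2 * x[0]

# Random split; the average squared loss on the test part is the test error L_T(h)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
test_error = average_loss(h, X_test, y_test, lambda y_true, y_hat: (y_true - y_hat) ** 2)
print(test_error)  # roughly 0.09 (the label noise variance), since h matches the true relation
```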

Empirical Loss Minimization Algorithm Architecture (Training Error)

Given a loss function $l$ and a data set $S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$, where $x^{(1)}, x^{(2)}, \dots, x^{(m)} \sim D$ are sampled independently from $X$ according to the feature distribution $D$, and for any $1 \le i \le m$, $y^{(i)} \sim D_{x^{(i)}}$, we call $S$ the training data set. For a model $h$ trained on $S$, the average loss of $h$ on the training data $S$,
$$L_S(h) = \frac{1}{m} \sum_{i=1}^{m} l\big(h(x^{(i)}), y^{(i)}\big),$$
is called the empirical loss of $h$.
Unconstrained empirical loss minimization algorithm architecture:
  Given: sample space $X$, label space $Y$, model space $\Phi$,
  loss function $l : Y \times Y \to \mathbb{R}^+$
  Input: $m$ training samples $S = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$
  Output model: $h_S = \arg\min_{h \in \Phi} L_S(h)$
Characteristic: $h_S$ can drive the empirical loss on the training data $S$ down to $L_S(h_S) = 0$, so the unconstrained empirical loss minimization algorithm is prone to overfitting (a toy sketch of the architecture follows below).
Note: "unconstrained" means the model space $\Phi$ is unconstrained, i.e. it consists of all admissible models.
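A toy sketch of this architecture, assuming a tiny finite model space $\Phi$ with three candidate models $h_a(x) = a \cdot x$ and the squared loss; the data set and all names are illustrative. The algorithm simply returns the candidate with the smallest empirical loss on $S$:

```python
import numpy as np

# Illustrative training data S = {(x, y)}
S = [(0.0, 0.1), (0.5, 0.9), (1.0, 2.1)]

# A tiny model space Phi: three candidate linear models h_a(x) = a * x
phi = {f"h_{a}": (lambda x, a=a: a * x) for a in (1.0, 2.0, 3.0)}

def empirical_loss(h, data):
    """L_S(h): average squared loss of h on the data set."""
    return np.mean([(h(x) - y) ** 2 for x, y in data])

# Empirical loss minimization: pick the model in Phi with the smallest L_S(h)
name, h_S = min(phi.items(), key=lambda item: empirical_loss(item[1], S))
print(name, empirical_loss(h_S, S))  # h_2.0 is the best of the three candidates
```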
Overfitting example (a small simulation sketch follows below):
  Suppose the sample space is $X = [-1, 1]$, the feature distribution $D$ is the uniform distribution on $X$, the label space is $Y = \mathbb{R}$, the label distribution $D_x = N(x, 0.3^2)$ is a normal distribution with mean $x$ and standard deviation $0.3$, and the loss function is the squared loss.
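A small simulation of this setting, assuming we compare a linear fit with a degree-9 polynomial fit (standing in for an essentially unconstrained model space); the polynomial interpolates the training points, driving the training error to nearly zero, while its test error is much larger:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample(n):
    """Draw n samples: x ~ Uniform[-1, 1], y ~ N(x, 0.3^2)."""
    x = rng.uniform(-1, 1, n)
    y = rng.normal(loc=x, scale=0.3)
    return x, y

x_train, y_train = sample(10)
x_test, y_test = sample(1000)

# Degree-1 fit (matches the true linear structure) vs. degree-9 fit
# (interpolates the 10 training points exactly -> near-zero training error).
for degree in (1, 9):
    coeffs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: training error {train_err:.4f}, test error {test_err:.4f}")
```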

Overfitting: small training error, large test error!

  1. $L_S(h_S)$: the empirical loss of $h_S$ on the training data $S$ (the training error)
  2. $L_T(h_S)$: the empirical loss of $h_S$ on the test data $T$ (the test error)
    Overfitting: small training error, large test error.
    Underfitting: relatively large training error.
    Good generalization: small training error and small test error.

Methods to improve the generalization ability of the model and prevent overfitting:

  1. Introduce model assumptions (make appropriate assumptions about the label distribution or the model structure; in the example above, one could assume a linear model)
  2. Regularization ($L_1$, $L_2$ regularization)
  3. Dropout (for deep neural networks)
  4. Enlarge the training set

Example of empirical loss minimization: blob (cluster) classification with a perceptron



from sklearn.datasets import make_blobs
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split


# Perceptron algorithm
class Perceptron:
    def __init__(self):
        self.b = None
        self.w = None

    def fit(self, X, y):
        m, n = X.shape
        w = np.zeros((n, 1))
        b = 0
        done = False
        # Keep sweeping over the training set until every sample is classified correctly
        while not done:
            done = True
            for i in range(m):
                x = X[i].reshape(1, -1)
                # Misclassified (or on the boundary): move w and b toward the sample
                if y[i] * (x.dot(w) + b) <= 0:
                    w = w + y[i] * x.T
                    b = b + y[i]
                    done = False
        self.w = w
        self.b = b

    def predict(self, x):
        # Sign of the signed distance to the separating hyperplane w.x + b = 0
        return np.sign(x.dot(self.w) + self.b)


# Build the data set
X, y = make_blobs(n_samples=100, centers=2, n_features=2, cluster_std=0.6, random_state=0)
y[y == 0] = -1  # the perceptron expects labels in {-1, +1}
data = pd.DataFrame(X, columns=['x1', 'x2'])
data['y'] = y
# Split the data with several different test_size values, as required
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
axes = axes.ravel()
# The left column is used for a single full-height panel showing the original data,
# so remove the two empty grid cells it would overlap before adding that panel.
fig.delaxes(axes[0])
fig.delaxes(axes[3])
ax1 = fig.add_subplot(131)
ax1.plot(data['x1'][data['y'] == 1], data['x2'][data['y'] == 1], "bs", ms=3)
ax1.plot(data['x1'][data['y'] == -1], data['x2'][data['y'] == -1], "rs", ms=3)
ax1.set_title('Original Data')


for i, test_size in enumerate([0.5, 0.4, 0.3, 0.2], start=1):
    print(i, test_size)
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(data[['x1', 'x2']], data['y'], test_size=test_size)
    # Train the perceptron
    model = Perceptron()
    model.fit(np.array(X_train), np.array(y_train))
    w = model.w
    b = model.b

    # Plot: map the four panels to the 2x2 block on the right (axes 1, 2, 4, 5);
    # axes 0 and 3 are covered by the full-height 'Original Data' panel.
    if i >= 3:
        i = i + 1
    ax = axes[i]
    ax.plot(data['x1'][data['y'] == 1], data['x2'][data['y'] == 1], "bs", ms=3)
    ax.plot(data['x1'][data['y'] == -1], data['x2'][data['y'] == -1], "rs", ms=3)
    ax.set_title('Test Size: {:.1f}'.format(test_size))
    x_0 = np.linspace(-1, 3.5, 200)
    # Decision boundary w1*x1 + w2*x2 + b = 0, i.e. x2 = -(w1*x1 + b) / w2
    line = -w[0] / w[1] * x_0 - b / w[1]
    ax.plot(x_0, line)

plt.subplots_adjust(hspace=0.3)
plt.show()



(Figure: the original data and the perceptron decision boundary obtained for each test_size split)
Conclusions:

  1. Under all test set ratios, the perceptron model shows a good classification effect and successfully separates the data set into two clusters.
  2. When the test set ratio is 0.2, the model performs best and its decision boundary is closest to the true dividing line.

Regularization algorithm


Foreword:

  • In the process of minimizing the empirical loss, a reasonable choice of model assumptions is an effective way to avoid overfitting.
  • How do we deal with the possibility of overfitting even after model assumptions have been chosen? With a regularization algorithm.

Common strategies for regularization:

  • $L_1$ regularization
  • $L_2$ regularization

Occam's Razor: Don't Multiply Entities Unnecessarily

For example, every $n$-ary linear function $h(x) = \langle w, x \rangle + b$, $x \in \mathbb{R}^n$, can be represented by $n + 1$ parameters: $w = (w_1, w_2, \dots, w_n)$ and $b$. Let $h_w$ denote the model represented by the parameter vector $w = (w_1, w_2, \dots, w_n)$.
In machine learning, the norm of the parameter vector $w$ is used to quantify the complexity of the model $h_w$:
$L_1$ norm: $\|w\|_1 = |w_1| + |w_2| + \dots + |w_n|$
$L_2$ norm: $\|w\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}$

Note: what we are after is a model $h_w$ with low complexity, i.e. the entries of the parameter vector $w$ should be few (sparse) and small in magnitude.
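A quick numeric sketch of the two norms for a concrete parameter vector (illustrative values; `np.linalg.norm` gives the same results):

```python
import numpy as np

w = np.array([0.5, -2.0, 0.0, 3.0])
l1 = np.sum(np.abs(w))        # |0.5| + |-2.0| + |0.0| + |3.0| = 5.5
l2 = np.sqrt(np.sum(w ** 2))  # sqrt(0.25 + 4.0 + 0.0 + 9.0) ≈ 3.64
print(l1, np.linalg.norm(w, 1))  # both 5.5
print(l2, np.linalg.norm(w))     # both ≈ 3.64
```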


$L_1$ norm: $\|w\|_1 = |w_1| + |w_2| + \dots + |w_n|$

  • Every non-zero parameter $w_j$ brings a penalty of size $\lambda |w_j|$.
  • When the $L_S$ values of candidate models are close, the algorithm prefers the model with the smaller $L_1$ norm.

$L_2$ norm: $\|w\|_2 = \sqrt{w_1^2 + w_2^2 + \dots + w_n^2}$

  • Every non-zero parameter $w_j$ brings a penalty of size $\lambda w_j^2$ (see the sketch below).
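As a sketch of how these penalties enter empirical loss minimization, the regularized objective is $L_S(h_w) + \lambda \|w\|_1$ (or $+\ \lambda \|w\|_2^2$). In scikit-learn this corresponds, up to constant factors in how the terms are scaled, to Lasso and Ridge regression; the data below are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.1, size=100)  # only two informative features

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: tends to set uninformative weights to exactly 0
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks all weights toward 0
print(lasso.coef_)
print(ridge.coef_)
```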


Origin blog.csdn.net/qq_25218219/article/details/129453359