Supervised learning study notes
The concept of supervised learning
- The main task of supervised learning is to predict the label of a data object from its features; the corresponding learning algorithm needs to learn from experience.
- Experience comes from labeled training data: data objects, also called samples, drawn at random from a collection of data objects.
feature group (features)
Definition: In a supervised learning task, the vector $x=(x_1,x_2,...,x_n)\in R^n$ is called the feature group (feature vector) of the object. Let $X\subseteq R^n$ be the set of all possible values of the feature group; $X$ is called the sample space.
Label
Definition: In a regression problem, the training data contain a numerical label $y\in R$; in a $k$-ary classification problem, the training data contain a vector label $y\in\{0,1\}^k$. Let $Y$ be the set of all possible label values; $Y$ is called the label space.
Classification problems:
For example, the handwritten digit recognition problem: $y\in\{0,1\}^{10}$
- Vector label for the digit 3: $y=(0,0,0,1,0,0,0,0,0,0)$
- Scalar label for the digit 3: $y=3$, $y\in Y$, $Y=\{0,1,2,...,9\}$
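As a quick illustration, a minimal numpy sketch of converting between the two encodings (the helper name to_one_hot is ours, not from the notes):

import numpy as np

def to_one_hot(label, k=10):
    # Scalar label -> vector label: put a 1 at the label's position
    y = np.zeros(k)
    y[label] = 1
    return y

y_vec = to_one_hot(3)        # (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
y_scalar = np.argmax(y_vec)  # back to the scalar label 3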
Model
Definition: Let $X$ be the sample space and $Y$ the label space. A set $\Phi$ of mappings from $X$ to $Y$ is called a model space, and any $h\in\Phi$ is called a model.
supervised learning tasks
Definition: Given sample space $X$, label space $Y$, an unknown feature distribution $D$, and label distributions $\{D_x : x\in X\}$, the supervised learning task is to train a model $h$, a mapping from $X$ to $Y$, denoted $h: X\to Y$. For any sample with feature vector $x$, $h(x)$ serves as the prediction of the label of $x$: $\hat{y}=h(x)$.
loss function
Definition: Let $X$ be the sample space and $Y$ the label space. A loss function is a function $l: Y\times Y\to R^+$ from $Y\times Y$ to the non-negative reals satisfying: for any $y\in Y$, $l(y,y)=0$.

For example, the 0-1 loss: $l(y,\hat{y})=\begin{cases}0, & \text{if } y=\hat{y}\\ 1, & \text{otherwise}\end{cases}$

where $\hat{y}=h(x)$, $x\in X$, $\hat{y}\in Y$ ($y$ is the true label of the sample, $\hat{y}$ is the predicted label for sample $x$).
Another example, the squared loss function: $l(y,\hat{y})=(y-\hat{y})^2$, $y,\hat{y}\in R$.
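A direct transcription of these two loss functions in Python (a sketch; the function names are illustrative):

# 0-1 loss: 0 if the prediction equals the true label, 1 otherwise
def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

# Squared loss for regression
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2

assert zero_one_loss(3, 3) == 0      # l(y, y) = 0, as the definition requires
assert squared_loss(1.0, 1.0) == 0.0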
Test data and model metrics (test error)
In supervised learning, given sample space $X$, label space $Y$, an unknown feature distribution $D$, and label distributions $\{D_x : x\in X\}$, suppose the model output by the supervised learning algorithm is $h$. Given a data set $T=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(t)},y^{(t)})\}$, where $x^{(1)},x^{(2)},...,x^{(t)}\sim D$ are sampled independently from $X$ according to the feature distribution $D$ and, for any $1\leq i\leq t$, $y^{(i)}\sim D_{x^{(i)}}$, we call $T$ the test data set, and use the average loss of model $h$ on $T$, $L_T(h)=\frac{1}{t}\sum_{i=1}^{t}l(h(x^{(i)}),y^{(i)})$, as a measure of the performance of $h$. When the test data size is sufficiently large, this empirical loss approximates the expected loss well.
Generally speaking, given a data set whose distribution $D$ is unknown, first split it at random into a training set and a test set in some proportion, so that both sets follow the distribution $D$; train the model $h$ on the training set and compute its loss on the test set as the performance measure of $h$.
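A minimal sketch of this procedure (the data, the split ratio, and the toy model h are all illustrative):

import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative data: features X, labels y with additive noise
X = np.random.uniform(-1, 1, size=(1000, 1))
y = X[:, 0] + np.random.normal(0, 0.3, size=1000)

# Randomly split into training and test sets (here 70% / 30%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

h = lambda x: x[:, 0]  # a toy model h(x) = x standing in for the trained model

# Test error: average squared loss L_T(h) on the test set
L_T = np.mean((h(X_test) - y_test) ** 2)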
Empirical loss minimization algorithm framework (training error)
Given a loss function $l$ and a set of data $S=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(t)},y^{(t)})\}$, where $x^{(1)},x^{(2)},...,x^{(t)}\sim D$ are sampled independently from $X$ according to the feature distribution $D$ and, for any $1\leq i\leq t$, $y^{(i)}\sim D_{x^{(i)}}$, we call $S$ the training data set. For a model $h$ trained on $S$, the average loss of $h$ on $S$, $L_S(h)=\frac{1}{t}\sum_{i=1}^{t}l(h(x^{(i)}),y^{(i)})$, is called the empirical loss of $h$.
Unconstrained empirical loss minimization algorithm framework:
Given: sample space $X$, label space $Y$, model space $\Phi$,
loss function $l: Y\times Y\to R^+$
Input: $m$ training data $S=\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),...,(x^{(m)},y^{(m)})\}$
Output model: $h_S=\arg\min_{h\in\Phi}L_S(h)$
Feature: the empirical loss of $h_S(x)$ on the training data $S$ can reach $L_S(h_S)=0$, so the unconstrained empirical loss minimization algorithm is prone to overfitting.
ps: "Unconstrained" means the model space $\Phi$ is unconstrained (the model space consisting of all eligible models).
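A minimal sketch of the framework, assuming a small finite (hence constrained) candidate set $\Phi=\{h_w(x)=wx\}$ and the squared loss; the data and the candidate grid are illustrative:

import numpy as np

# Toy training data: y is roughly x plus noise
x = np.random.uniform(-1, 1, 100)
y = x + np.random.normal(0, 0.3, 100)

# A small model space Phi: h_w(x) = w * x for a grid of candidate w
candidates = [lambda x, w=w: w * x for w in np.linspace(-2, 2, 41)]

# Empirical loss L_S(h) under the squared loss, minimized over Phi
L_S = [np.mean((h(x) - y) ** 2) for h in candidates]
h_S = candidates[int(np.argmin(L_S))]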
overfitting example:
Suppose the sample space $X=[-1,1]$, the feature distribution $D$ is the uniform distribution on $X$, the label space $Y=R$, the label distribution $D_x=N(x,0.3)$ is a normal distribution with mean $x$ and standard deviation 0.3, and the loss function is the squared loss.
Overfitting: small training error, large test error!
- The empirical loss $L_S(h_S)$ of $h_S(x)$ on the training data $S$ is the training error
- The empirical loss $L_T(h_S)$ of $h_S(x)$ on the test data $T$ is the test error

Overfitting: small training error, large test error
Underfitting: relatively large training error
Generalization ability: small training error, small test error
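Using the setup from the example above, a sketch (numpy only; the sample sizes and polynomial degrees are our choices) that contrasts an overfit and a well-fit model:

import numpy as np

# Data from the example: X = [-1, 1], D uniform, D_x = N(x, 0.3)
x_train = np.random.uniform(-1, 1, 10)
y_train = x_train + np.random.normal(0, 0.3, 10)
x_test = np.random.uniform(-1, 1, 1000)
y_test = x_test + np.random.normal(0, 0.3, 1000)

for degree in [1, 9]:
    # Fit a polynomial of the given degree by least squares;
    # degree 9 interpolates the 10 training points almost exactly
    coeffs = np.polyfit(x_train, y_train, degree)
    train_err = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_err = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(degree, train_err, test_err)
# Degree 9 typically shows a much smaller training error but a much
# larger test error than degree 1: overfitting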
Methods to improve the generalization ability of the model and prevent overfitting:
- Introduce model assumptions (make appropriate assumptions about the label distribution or the model structure; in the example above, one could assume a linear model)
- Regularization ($L_1$, $L_2$ regularization)
- Dropout (for deep neural networks)
- Enlarge the training set
Example of empirical loss minimization: blob classification
from sklearn.datasets import make_blobs
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.model_selection import train_test_split

# Perceptron algorithm
class Perception:
    def __init__(self):
        self.b = None
        self.w = None

    def fit(self, X, y):
        # Sweep the data repeatedly, updating w and b on every
        # misclassified sample, until all samples are classified correctly
        m, n = X.shape
        w = np.zeros((n, 1))
        b = 0
        done = False
        while not done:
            done = True
            for i in range(m):
                x = X[i].reshape(1, -1)
                if y[i] * (x.dot(w) + b) <= 0:
                    w = w + y[i] * x.T
                    b = b + y[i]
                    done = False
        self.w = w
        self.b = b

    def predict(self, x):
        return np.sign(x.dot(self.w) + self.b)

# Build the data set
X, y = make_blobs(n_samples=100, centers=2, n_features=2, cluster_std=0.6, random_state=0)
y[y == 0] = -1  # perceptron labels must be +1 / -1
data = pd.DataFrame(X, columns=['x1', 'x2'])
data['y'] = y

# One panel for the original data plus one panel per test_size split
fig, axes = plt.subplots(nrows=2, ncols=3, figsize=(12, 8))
axes = axes.ravel()
axes[5].set_axis_off()  # only five panels are used
ax1 = axes[0]
ax1.plot(data['x1'][data['y'] == 1], data['x2'][data['y'] == 1], "bs", ms=3)
ax1.plot(data['x1'][data['y'] == -1], data['x2'][data['y'] == -1], "rs", ms=3)
ax1.set_title('Original Data')
for i, test_size in enumerate([0.5, 0.4, 0.3, 0.2], start=1):
    # Split the data into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(data[['x1', 'x2']], data['y'], test_size=test_size)
    # Train the model
    model = Perception()
    model.fit(np.array(X_train), np.array(y_train))
    w = model.w
    b = model.b
    # Plot the data and the learned decision boundary
    ax = axes[i]
    ax.plot(data['x1'][data['y'] == 1], data['x2'][data['y'] == 1], "bs", ms=3)
    ax.plot(data['x1'][data['y'] == -1], data['x2'][data['y'] == -1], "rs", ms=3)
    ax.set_title('Test Size: {:.1f}'.format(test_size))
    # Decision boundary: w1*x1 + w2*x2 + b = 0  =>  x2 = -(w1*x1 + b) / w2
    x_0 = np.linspace(-1, 3.5, 200)
    line = -w[0] / w[1] * x_0 - b / w[1]
    ax.plot(x_0, line)
plt.subplots_adjust(hspace=0.3)
plt.show()
Conclusion:
- Under all test-set ratios, the perceptron model shows a good classification effect and successfully separates the data set into two clusters.
- When the test-set ratio is 0.2, the model performs best; its decision boundary is closest to the true dividing line.
regularization algorithm
Foreword:
- In empirical loss minimization, a reasonable choice of model assumptions is an effective way to avoid overfitting
- How to deal with the possibility of overfitting despite the chosen model assumptions? Regularization algorithms
Common strategies for regularization:
- $L_1$ regularization
- $L_2$ regularization
Occam's Razor: Don't Multiply Entities Unnecessarily
For example, every $n$-ary linear function $h(x)=\langle w,x\rangle+b$, $x\in R^n$, can be represented by $n+1$ parameters: $w=(w_1,w_2,...,w_n)$ and $b$. Let $h_w$ denote the model represented by this set of parameters.
In machine learning, the norm of the parameter vector $w$ is used to quantify the complexity of the model $h_w$:
$L_1$ norm: $|w|=|w_1|+|w_2|+...+|w_n|$
$L_2$ norm: $||w||=\sqrt{w_1^2+w_2^2+...+w_n^2}$
ps: what we are after is low complexity of $h_w$, that is, the entries of the vector $w$ should be few and small.
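Both norms are a single call in numpy (the vector w here is illustrative):

import numpy as np

w = np.array([0.5, -2.0, 0.0, 1.5])
l1 = np.linalg.norm(w, 1)  # |w_1| + |w_2| + ... + |w_n| = 4.0
l2 = np.linalg.norm(w, 2)  # sqrt(w_1^2 + ... + w_n^2) ~ 2.5495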
$L_1$ norm: $|w|=|w_1|+|w_2|+...+|w_n|$
- Every non-zero parameter $w_j$ brings a penalty of size $\lambda|w_j|$
- When the empirical losses $L_S$ of candidate models are close, the algorithm chooses the model with the smaller $L_1$ norm
$L_2$ norm: $||w||=\sqrt{w_1^2+w_2^2+...+w_n^2}$
- Every non-zero parameter $w_j$ brings a penalty of size $\lambda w_j^2$
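A sketch contrasting the two penalties using scikit-learn's Lasso ($L_1$) and Ridge ($L_2$); the data and the alpha values are illustrative:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative regression data with many irrelevant features
rng = np.random.RandomState(0)
X = rng.randn(100, 20)
y = X[:, 0] * 3 + rng.randn(100) * 0.1  # only feature 0 matters

# L1-regularized ERM (Lasso): penalty lambda * |w|, drives most w_j to exactly 0
lasso = Lasso(alpha=0.1).fit(X, y)
print(np.sum(lasso.coef_ != 0))  # few non-zero parameters

# L2-regularized ERM (Ridge): penalty lambda * ||w||^2, shrinks all w_j
ridge = Ridge(alpha=0.1).fit(X, y)
print(np.sum(ridge.coef_ != 0))  # typically all parameters stay non-zero

The $L_1$ penalty tends to zero out irrelevant parameters (few and small entries of $w$), while the $L_2$ penalty shrinks all parameters without making them exactly zero.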