Chapter 2 - Perceptron

Following the previous chapter of this study summary, we now move on to Chapter 2: the perceptron. Every statistical learning method consists of three elements, model + strategy + algorithm, so we will work through the perceptron in terms of these three elements.

Perceptron Model

Briefly, the perceptron is a binary linear classification model: its input is the feature vector of an instance and its output is the class of the instance, taking the values +1 or -1. In mathematical language the model can be described as:
\[
\text{Assume the input space (feature space) is } \chi \subseteq R^n \text{ and the output space is } Y = \{+1, -1\}. \text{ The input } x \in \chi \text{ is the feature vector of an instance,} \\
\text{corresponding to a point of the input space; the output } y \in Y \text{ is the class of the instance. The following function from the input space to the output space} \\
f(x) = sign(w \cdot x + b) \\
\text{is called the perceptron, where } w \text{ is the weight vector, } b \text{ is the bias, and } sign \text{ is the sign function:} \\
sign(x) = \left\{ \begin{aligned} +1&, & x \geq 0 \\ -1&, & x < 0 \end{aligned} \right.
\]
The perceptron model is a discriminative model; learning it amounts to finding a separating hyperplane that linearly divides the training data.
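As a minimal sketch of this decision function (not part of the original notes), f(x) = sign(w · x + b) can be written in a few lines of Python with NumPy; the weights, bias, and inputs below are made-up values purely for illustration.

```python
import numpy as np

def perceptron_predict(x, w, b):
    """f(x) = sign(w . x + b): +1 if w . x + b >= 0, else -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

# made-up parameters and inputs, purely for illustration
w = np.array([1.0, -2.0])
b = 0.5
print(perceptron_predict(np.array([3.0, 1.0]), w, b))  # w.x + b = 1.5  -> +1
print(perceptron_predict(np.array([0.0, 1.0]), w, b))  # w.x + b = -1.5 -> -1
```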

Perceptron learning strategies

For a linearly separable training data set, the goal of perceptron learning is to find a separating hyperplane that correctly separates the positive and negative instance points. To determine such a separating hyperplane, i.e. to determine a perceptron model, we need a learning strategy: define a loss function and minimize it.

A natural choice of loss function is the total number of misclassified points, but such a loss function is not continuously differentiable in w and b and is therefore hard to optimize. Another choice is the total distance from the misclassified points to the hyperplane, which can be described as:
\[
\text{For any point } x_0 \text{ in the input space } R^n, \text{ its distance to the hyperplane } S \text{ is } \frac{|w \cdot x_0 + b|}{||w||}, \text{ where } ||w|| \text{ is the } L_2 \text{ norm of } w.
\]
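A one-line NumPy version of this distance, with made-up values for w, b, and the point, might look like the following sketch.

```python
import numpy as np

def distance_to_hyperplane(x, w, b):
    """|w . x + b| / ||w||_2: geometric distance from x to the hyperplane w . x + b = 0."""
    return abs(np.dot(w, x) + b) / np.linalg.norm(w)

# made-up example: hyperplane x1 + x2 - 3 = 0 and the point (3, 3)
print(distance_to_hyperplane(np.array([3.0, 3.0]), np.array([1.0, 1.0]), -3.0))  # 3/sqrt(2) ≈ 2.12
```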
Next, for a misclassified point (x_i, y_i), the inequality -y_i(w \cdot x_i + b) > 0 holds.
\[
\because \text{ for a misclassified point, when } w \cdot x_i + b > 0 \text{ we have } y_i = -1, \text{ and when } w \cdot x_i + b < 0 \text{ we have } y_i = +1, \\
\therefore \text{ the total distance from the misclassified points to the hyperplane is } -\frac{1}{||w||} \sum_{x_i \in M} y_i(w \cdot x_i + b)
\]
Dropping the constant factor 1/||w||, the loss function of perceptron learning can accordingly be described as:
\[
\text{Given a training data set } T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}, \text{ where } x_i \in \chi = R^n, \; y_i \in Y = \{+1, -1\}, \; i = 1, 2, \dots, N, \\
\text{the loss function is defined as } L(w, b) = -\sum_{x_i \in M} y_i(w \cdot x_i + b), \text{ where } M \text{ is the set of misclassified points.}
\]
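The loss L(w, b) can be computed directly from this definition. The sketch below uses NumPy; the toy data and the deliberately bad parameters are made up for illustration.

```python
import numpy as np

def perceptron_loss(X, y, w, b):
    """L(w, b) = - sum of y_i (w . x_i + b) over the misclassified set M."""
    margins = y * (X @ w + b)      # y_i (w . x_i + b) for every training point
    M = margins <= 0               # mask of misclassified points
    return -np.sum(margins[M])

# made-up toy data and deliberately bad parameters
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(perceptron_loss(X, y, np.array([-1.0, 0.0]), 0.0))  # 7.0: both positive points are misclassified
```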

Perceptron algorithm

Perceptron learning is thus turned into the optimization problem of minimizing the loss function above, and the optimization method is stochastic gradient descent (SGD). (Each iteration picks one misclassified point and updates w and b.)

Original form

\[
\text{Input: training set } T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}, \text{ where } x_i \in \chi = R^n, \; y_i \in Y = \{+1, -1\}, \; i = 1, 2, \dots, N; \\
\text{learning rate } \eta \; (0 < \eta \leq 1); \\
\text{Output: } w, b; \text{ perceptron model } f(x) = sign(w \cdot x + b) \\
(1) \text{ Choose initial values } w_0, b_0; \\
(2) \text{ Select a data point } (x_i, y_i) \text{ from the training set;} \\
(3) \text{ If } y_i(w \cdot x_i + b) \leq 0: \\
w \leftarrow w + \eta y_i x_i \\
b \leftarrow b + \eta y_i \\
(4) \text{ Go to } (2) \text{ until there are no misclassified points in the training set.}
\]
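A straightforward NumPy sketch of this original-form procedure is shown below, assuming η = 1 by default and scanning the training set in order; the toy data set is made up, and the stopping condition mirrors step (4).

```python
import numpy as np

def train_perceptron(X, y, eta=1.0, max_epochs=1000):
    """Original form: scan the data, update w and b on every misclassified point."""
    w = np.zeros(X.shape[1])                     # (1) initial values w_0 = 0, b_0 = 0
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for xi, yi in zip(X, y):                 # (2) pick a training point
            if yi * (np.dot(w, xi) + b) <= 0:    # (3) misclassified?
                w += eta * yi * xi
                b += eta * yi
                errors += 1
        if errors == 0:                          # (4) stop when nothing is misclassified
            break
    return w, b

# made-up linearly separable toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(train_perceptron(X, y))  # w = [1. 1.], b = -3.0 for this scan order and eta = 1
```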

Dual form

The basic idea of the dual form is to express w and b as linear combinations of the instances x_i and their labels y_i, and to obtain w and b by solving for the coefficients, updating them gradually. The accumulated increments contributed by a point (x_i, y_i) can be described as:
\[
w = \sum_{i=1}^{N} \alpha_i y_i x_i \\
b = \sum_{i=1}^{N} \alpha_i y_i \\
\text{where } \alpha_i = n_i \eta. \text{ The more times an instance point is updated, the closer it is to the separating hyperplane and the harder it is to classify correctly.}
\]
Following the original form, the dual form can be described as:
\[
\text{Input: training set } T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\}, \text{ where } x_i \in \chi = R^n, \; y_i \in Y = \{+1, -1\}, \; i = 1, 2, \dots, N; \\
\text{learning rate } \eta \; (0 < \eta \leq 1); \\
\text{Output: } \alpha, b; \text{ perceptron model } f(x) = sign\left(\sum_{j=1}^{N} \alpha_j y_j x_j \cdot x + b\right), \text{ where } \alpha = (\alpha_1, \alpha_2, \dots, \alpha_N)^T \\
(1) \; \alpha \leftarrow 0, \; b \leftarrow 0; \\
(2) \text{ Select a data point } (x_i, y_i) \text{ from the training set;} \\
(3) \text{ If } y_i\left(\sum_{j=1}^{N} \alpha_j y_j x_j \cdot x_i + b\right) \leq 0: \\
\alpha_i \leftarrow \alpha_i + \eta \\
b \leftarrow b + \eta y_i \\
(4) \text{ Go to } (2) \text{ until there are no misclassified points.}
\]
In fact, the dual form involves the training instances only through inner products. For convenience, the inner products between all pairs of training instances can therefore be computed in advance and stored in a matrix, the so-called Gram matrix:
\[
G = [x_i \cdot x_j]_{N \times N}
\]
Comparing with the original form: the original form updates the whole vector w (and the bias b) at every step and recomputes w \cdot x_i for every check, which involves more computation, while the dual form computes the inner products between instances once and stores them, so each iteration only updates α_i and b, making the per-update cost smaller.
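Below is a sketch of the dual form that precomputes the Gram matrix once, as described above. The function name, the toy data, and the scan order are my own choices for illustration; the returned α and b (and the recovered w = Σ α_i y_i x_i) depend on the order in which points are visited.

```python
import numpy as np

def train_perceptron_dual(X, y, eta=1.0, max_epochs=1000):
    """Dual form: update alpha and b, reusing the precomputed Gram matrix G[i, j] = x_i . x_j."""
    N = X.shape[0]
    G = X @ X.T                    # Gram matrix, computed once up front
    alpha = np.zeros(N)            # alpha_i = n_i * eta, all start at 0
    b = 0.0
    for _ in range(max_epochs):
        errors = 0
        for i in range(N):
            # misclassified if y_i (sum_j alpha_j y_j (x_j . x_i) + b) <= 0
            if y[i] * (np.sum(alpha * y * G[:, i]) + b) <= 0:
                alpha[i] += eta
                b += eta * y[i]
                errors += 1
        if errors == 0:
            break
    w = (alpha * y) @ X            # recover w = sum_i alpha_i y_i x_i if needed
    return alpha, w, b

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])
print(train_perceptron_dual(X, y))  # alpha = [2. 0. 5.], w = [1. 1.], b = -3.0 for this scan order
```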

Novikoff's theorem and its derivation

\[
\text{Theorem: Let the training set } T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\} \text{ be linearly separable, where } x_i \in \chi = R^n, \; y_i \in Y = \{+1, -1\}, \; i = 1, 2, \dots, N. \text{ Then:} \\
(1) \text{ There exists a hyperplane } \hat{w}_{opt} \cdot \hat{x} = w_{opt} \cdot x + b_{opt} = 0 \text{ with } ||\hat{w}_{opt}|| = 1 \text{ that separates the training data set completely and correctly,} \\
\text{and there exists } \gamma > 0 \text{ such that for all } i = 1, 2, \dots, N: \\
y_i(\hat{w}_{opt} \cdot \hat{x}_i) = y_i(w_{opt} \cdot x_i + b_{opt}) \geq \gamma \\
(2) \text{ Let } R = \max_{1 \leq i \leq N} ||\hat{x}_i||. \text{ Then the number of misclassifications } k \text{ made by the perceptron algorithm on the training set satisfies} \\
k \leq \left(\frac{R}{\gamma}\right)^2
\]
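Before the proof, here is a tiny made-up illustration of the bound (not from the original notes): a one-dimensional training set with two points.
\[
\text{Take the toy set } T = \{(2, +1), (-2, -1)\} \text{ in } R^1 \text{ and the hyperplane } w_{opt} = 1, \; b_{opt} = 0, \text{ so that } ||\hat{w}_{opt}|| = 1. \\
\gamma = \min_i y_i(w_{opt} \cdot x_i + b_{opt}) = \min(2, 2) = 2, \qquad R = \max_i ||\hat{x}_i|| = ||(2, 1)|| = \sqrt{5} \\
\Rightarrow k \leq \left(\frac{R}{\gamma}\right)^2 = \frac{5}{4}, \text{ i.e. at most one update, which is exactly what happens when the algorithm is run from } \hat{w}_0 = 0 \text{ with } \eta = 1.
\]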

Proof

For convenience of the derivation, the bias b is absorbed into the weight vector w, and the input vector is extended accordingly by appending the constant 1.

$$
\text{With the notation above: } \hat{w} = (w^T, b)^T, \quad \hat{x} = (x^T, 1)^T, \quad \hat{x} \in R^{n+1}, \quad \hat{w} \in R^{n+1} \\
\text{1. Proof of (1):} \\
\text{Since the training set is linearly separable, there exists a hyperplane that separates it completely and correctly.} \\
\text{Take such a hyperplane } \hat{w}_{opt} \cdot \hat{x} = w_{opt} \cdot x + b_{opt} = 0 \text{ with } ||\hat{w}_{opt}|| = 1. \\
\text{For the finite set of points } i = 1, 2, \dots, N \text{ we have} \\
y_i(\hat{w}_{opt} \cdot \hat{x}_i) = y_i(w_{opt} \cdot x_i + b_{opt}) > 0 \\
\text{so taking } \gamma = \min_i \, y_i(w_{opt} \cdot x_i + b_{opt}) \text{ gives} \\
y_i(\hat{w}_{opt} \cdot \hat{x}_i) = y_i(w_{opt} \cdot x_i + b_{opt}) \geq \gamma \\
\text{2. Proof of (2):} \\
\text{The perceptron algorithm starts from } \hat{w}_0 = 0 \text{ and updates the weights whenever an instance is misclassified.} \\
\text{Let } \hat{w}_{k-1} = (w_{k-1}^T, b_{k-1})^T \text{ be the extended weight vector before the } k\text{-th misclassification.} \\
\text{If } (x_i, y_i) \text{ is misclassified by } \hat{w}_{k-1}, \text{ then} \\
y_i(\hat{w}_{k-1} \cdot \hat{x}_i) = y_i(w_{k-1} \cdot x_i + b_{k-1}) \leq 0 \\
\text{and } w \text{ and } b \text{ are updated as} \\
w_k \leftarrow w_{k-1} + \eta y_i x_i \\
b_k \leftarrow b_{k-1} + \eta y_i \\
\text{that is, } \hat{w}_k = \hat{w}_{k-1} + \eta y_i \hat{x}_i \\
\therefore \hat{w}_k \cdot \hat{w}_{opt} = (\hat{w}_{k-1} + \eta y_i \hat{x}_i) \cdot \hat{w}_{opt} \\
= \hat{w}_{k-1} \cdot \hat{w}_{opt} + \eta y_i \hat{w}_{opt} \cdot \hat{x}_i \\
\geq \hat{w}_{k-1} \cdot \hat{w}_{opt} + \eta\gamma \\
\geq \hat{w}_{k-2} \cdot \hat{w}_{opt} + 2\eta\gamma \\
\geq \hat{w}_{k-3} \cdot \hat{w}_{opt} + 3\eta\gamma \\
\dots \\
\geq k\eta\gamma \\
\text{Also, } ||\hat{w}_k||^2 = (\hat{w}_{k-1} + \eta y_i \hat{x}_i)^2 \\
= ||\hat{w}_{k-1}||^2 + 2\eta y_i \hat{w}_{k-1} \cdot \hat{x}_i + \eta^2 ||\hat{x}_i||^2 \\
\leq ||\hat{w}_{k-1}||^2 + \eta^2 ||\hat{x}_i||^2 \quad (\text{since } y_i \hat{w}_{k-1} \cdot \hat{x}_i \leq 0) \\
\leq ||\hat{w}_{k-1}||^2 + \eta^2 R^2 \\
\leq ||\hat{w}_{k-2}||^2 + 2\eta^2 R^2 \\
\dots \\
\leq k\eta^2 R^2 \\
\therefore \text{ combining the two inequalities:} \\
k\eta\gamma \leq \hat{w}_k \cdot \hat{w}_{opt} \leq ||\hat{w}_k|| \, ||\hat{w}_{opt}|| \leq \sqrt{k}\,\eta R \\
\therefore k^2\gamma^2 \leq kR^2 \\
\therefore k \leq \left(\frac{R}{\gamma}\right)^2
$$
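As an informal numerical check of the theorem (not part of the original notes), the sketch below runs the original-form algorithm on a made-up separable data set, counts the total number of updates k, and compares it with (R/γ)², where γ is taken from one particular unit-norm separating hyperplane (the one the algorithm itself found), which yields a valid though possibly loose bound.

```python
import numpy as np

# made-up linearly separable toy data
X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])
y = np.array([1, 1, -1])

# run the original-form algorithm from w = 0, b = 0 with eta = 1 and count every update k
w, b, k = np.zeros(2), 0.0, 0
while True:
    errors = 0
    for xi, yi in zip(X, y):
        if yi * (np.dot(w, xi) + b) <= 0:
            w, b, k = w + yi * xi, b + yi, k + 1
            errors += 1
    if errors == 0:
        break

# R = max ||x_hat_i|| over the extended inputs x_hat_i = (x_i, 1)
X_hat = np.hstack([X, np.ones((len(X), 1))])
R = np.max(np.linalg.norm(X_hat, axis=1))

# any unit-norm separating hyperplane gives a valid gamma; reuse the solution found above
w_hat = np.append(w, b) / np.linalg.norm(np.append(w, b))
gamma = np.min(y * (X_hat @ w_hat))

print(k, (R / gamma) ** 2)  # roughly: 7 286.0, so k <= (R / gamma)^2 holds
```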

Thoughts

1. What is the hypothesis space of the perceptron model? Where is the model's complexity reflected?

The perceptron is a linear classification model and belongs to the discriminative models. Its hypothesis space is the set of all linear classification models defined over the feature space, i.e. functions of the form w·x + b.

The complexity of the model is reflected in the number of features of the instances, i.e. the dimensionality of the feature space.


Source: www.cnblogs.com/cecilia-2019/p/11328010.html