[Notes - Statistical Learning Methods] Perceptron

A few days ago I thought I had finished reading the chapter on the perceptron, and I took some notes along the way.
This is now my third pass at organizing those notes.

0. Summary

  1. The perceptron is quite limited: it only handles binary classification problems on linearly separable datasets.
  2. The perceptron model is essentially a separating hyperplane.
  3. When the vector dimension (number of features) is very high, choose the dual form algorithm;
    when the number of vectors (number of samples) is very large, choose the original form algorithm.
  4. The differences and trade-offs between batch gradient descent and stochastic gradient descent (see the sketch after this list)
    Reference link: Formula and implementation comparison of Stochastic gradient descent and Batch gradient descent
  • Batch Gradient Descent (BGD)
    $ \theta \leftarrow \theta - \eta \sum_i \frac{\partial L_i}{\partial \theta} $
    that is, each update uses the gradient of the loss summed over all samples
    Disadvantage: each update is computationally expensive
    Advantages: tends toward the global optimum (for unimodal/convex problems) and is less affected by data noise
  • Stochastic Gradient Descent (SGD)
    $ \theta \leftarrow \theta - \eta \frac{\partial L_i}{\partial \theta} $
    that is, each update uses the gradient of the loss $ L_i $ of a single sample
    Advantage: training is fast
    Disadvantages: it does not necessarily converge to the global optimum (often only a local or sub-optimal solution, except for unimodal problems) and is more affected by data noise
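
A minimal sketch (not from the book) contrasting the two update schemes, using squared-error linear regression as a stand-in loss; the toy data, step size, and epoch count are arbitrary illustrative choices:

```python
# Contrast batch vs. stochastic updates on a toy least-squares problem.
# The perceptron itself uses the stochastic style of update.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, 3 features
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=100)  # noisy linear targets

eta = 0.05
theta_bgd = np.zeros(3)
theta_sgd = np.zeros(3)

for epoch in range(200):
    # Batch gradient descent: one update per pass, gradient over all samples
    # (averaged here instead of summed, to keep the step size stable).
    grad = X.T @ (X @ theta_bgd - y) / len(X)
    theta_bgd -= eta * grad

    # Stochastic gradient descent: one update per sample.
    for i in rng.permutation(len(X)):
        grad_i = X[i] * (X[i] @ theta_sgd - y[i])
        theta_sgd -= eta * grad_i

print(theta_bgd)  # both results should be close to true_theta
print(theta_sgd)
```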

1. Model

Input space $ \mathcal{X} \subseteq R^n $
Output space $ \mathcal{Y} = \{-1, +1\} $
Hypothesis space $ \mathcal{F} = \{ f \mid f(x) = \omega \cdot x + b \} $
Parameters $ \omega \in R^n, b \in R $
Model $ f(x) = \mathrm{sign}(\omega \cdot x + b) $

where the sign function is
\[ \mathrm{sign}(x)=\left\{\begin{matrix} +1 , & x \geqslant 0\\ -1 , & x < 0 \end{matrix}\right. \]

The linear equation
$ \omega \cdot x + b = 0 $
corresponds to a separating hyperplane in the feature space $ R^n $
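
A minimal sketch of the model $ f(x) = \mathrm{sign}(\omega \cdot x + b) $; the weight vector and bias below are arbitrary placeholder values, not trained parameters:

```python
# Perceptron decision function: +1 on one side of the hyperplane, -1 on the other.
import numpy as np

def predict(x, w, b):
    """Return +1 if w·x + b >= 0, otherwise -1."""
    return 1 if np.dot(w, x) + b >= 0 else -1

w = np.array([2.0, -1.0])   # placeholder weights in R^2
b = -0.5                    # placeholder bias
print(predict(np.array([1.0, 0.0]), w, b))  # +1: w·x + b = 1.5 >= 0
print(predict(np.array([0.0, 2.0]), w, b))  # -1: w·x + b = -2.5 < 0
```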

2. Strategy

(define the loss function and minimize it)
(note that a loss function should be non-negative)

To make the loss function easier to optimize, we choose the distance from the misclassified points to the hyperplane as the loss function.
The distance of any point \(x \in R^n\) from the separating hyperplane is
$ S=\frac{1}{\| \omega \|}|\omega \cdot x + b| $

Next, we adapt this distance so that it becomes a workable loss function:

  1. To make it continuously differentiable, remove the absolute value: for a misclassified point $ (x_i, y_i) $ we have $ y_i(\omega \cdot x_i + b) < 0 $, so its distance is
    $ S=-\frac{1}{\|\omega\|} y_i(\omega \cdot x_i + b) $
  2. Drop the coefficient $ \frac{1}{\|\omega\|} $, which does not change which points are misclassified (and saves computation), to get
    $ L(\omega, b)=-\sum_{x_i \in M} y_i(\omega \cdot x_i + b) $
    where $ M $ is the set of misclassified points (a small numerical sketch of this loss follows below)
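
A small numerical sketch of this loss on assumed toy data (the points and parameter values below are arbitrary):

```python
# Perceptron loss: L(w, b) = -sum over misclassified points of y_i * (w·x_i + b).
import numpy as np

def perceptron_loss(X, y, w, b):
    margins = y * (X @ w + b)       # functional margin of each sample
    misclassified = margins <= 0    # the set M: points on the wrong side (or on the hyperplane)
    return -np.sum(margins[misclassified])

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])  # toy points in R^2
y = np.array([1, 1, -1])

print(perceptron_loss(X, y, w=np.array([0.0, 1.0]), b=0.0))   # 1.0: only (1, 1) is misclassified
print(perceptron_loss(X, y, w=np.array([1.0, 1.0]), b=-3.0))  # 0.0: no misclassified points
```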

3. Algorithm

(how to solve the optimization problem)
Note that the final trained parameter values depend on the choice of initial values and on the order in which misclassified points are picked, so different runs generally give different results.

To minimize the loss function, we use stochastic gradient descent on the misclassified points.
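Spelling out the gradients behind the update rules below: taking partial derivatives of $ L(\omega, b) $ gives
\[ \nabla_\omega L(\omega, b) = -\sum_{x_i \in M} y_ix_i , \qquad \nabla_b L(\omega, b) = -\sum_{x_i \in M} y_i \]
so a stochastic step on a single misclassified point $ (x_i, y_i) $ moves against the gradient: $ \omega \leftarrow \omega + \eta y_ix_i $, $ b \leftarrow b + \eta y_i $.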

  1. Original form algorithm (a code sketch of both forms follows after this list)
  • Assign initial values $ \omega \leftarrow 0 , b \leftarrow 0 $
  • Pick a data point $ (x_i, y_i) $
  • Determine whether the data point is misclassified by the current model, that is, whether $ y_i(\omega \cdot x_i + b) \leqslant 0 $;
    if so, update
    \[ \begin{matrix} \omega &\leftarrow \omega + \eta y_ix_i \\ b &\leftarrow b + \eta y_i \end{matrix}\]
  • Repeat until no point is misclassified
  2. Dual form algorithm
    Note that in the original form algorithm, the final trained model parameters take the following form, where $ n_i $ counts how many times the i-th data point has triggered an update:
    \[ \begin{matrix} \omega &= \eta \sum_i n_iy_ix_i \\ b &= \eta \sum_i n_iy_i \end{matrix} \]
    So we can make the following simplification:
  • Assign initial values $ n \leftarrow 0, b \leftarrow 0 $
  • Pick a data point $ (x_i, y_i) $
  • Determine whether the data point is misclassified by the current model, that is, whether $ y_i(\eta \sum_j n_jy_jx_j \cdot x_i + b) \leqslant 0 $;
    if so, update
    \[ \begin{matrix} n_i &\leftarrow n_i + 1 \\ b &\leftarrow b + \eta y_i \end{matrix}\]
    To reduce the amount of calculation, we can precompute the inner products in the formula as the Gram matrix
    $ G=[x_i \cdot x_j]_{N \times N} $
  3. Choosing between the original form and the dual form
    Reference link: How to understand the dual form of the perceptron learning algorithm?
    When the vector dimension (number of features) is very high, computing the inner products is time-consuming, so the dual form with a precomputed Gram matrix should be chosen to speed things up. When the number of vectors (number of samples) is very large, the $ N \times N $ Gram matrix and the accumulated counts $ n_i $ become a burden rather than a saving, so the original form algorithm should be chosen.
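
A minimal code sketch of both training procedures on an assumed toy linearly separable dataset (the data, learning rate, and iteration cap are illustrative choices, not from the text):

```python
# Original form vs. dual form of perceptron training.
import numpy as np

X = np.array([[3.0, 3.0], [4.0, 3.0], [1.0, 1.0]])  # toy linearly separable data
y = np.array([1, 1, -1])
eta = 1.0

def train_original(X, y, eta, max_epochs=100):
    """Original form: update w and b directly on each misclassified point."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:    # misclassified
                w += eta * yi * xi
                b += eta * yi
                mistakes += 1
        if mistakes == 0:                        # no misclassified points left
            break
    return w, b

def train_dual(X, y, eta, max_epochs=100):
    """Dual form: track update counts n_i and use the precomputed Gram matrix."""
    n, b = np.zeros(len(X)), 0.0
    G = X @ X.T                                  # Gram matrix G[i, j] = x_i · x_j
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(len(X)):
            if y[i] * (eta * np.sum(n * y * G[:, i]) + b) <= 0:
                n[i] += 1
                b += eta * y[i]
                mistakes += 1
        if mistakes == 0:
            break
    w = eta * np.sum((n * y)[:, None] * X, axis=0)  # recover w = eta * sum_i n_i y_i x_i
    return w, b

print(train_original(X, y, eta))  # e.g. (array([1., 1.]), -3.0)
print(train_dual(X, y, eta))      # same separating hyperplane
```

Both functions stop as soon as a full pass produces no misclassified points, which is guaranteed to happen for linearly separable data.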
