Linear model

basic concept

  Linear models are functions that use linear combinations of attributes to make predictions:

  For an n-dimensional sample $\mathbf{x}=\{x_1,x_2,\dots,x_n\}$, the model learns a set of weights $w_1,w_2,\dots,w_n$ and a bias $b$, so that the prediction is:

  $f(\mathbf{x}) = w_1x_1 + w_2x_2+\dots+w_nx_n +b$

  In vector form this is $f(\mathbf{x}) = \mathbf{w}^T \mathbf{x}+b$.

  The weights of a linear model indicate the contribution of each attribute: the larger the (absolute) weight, the more important the attribute. Linear models can therefore serve as a kind of embedded feature selection through their coefficients, exposed as the `coef_` attribute in sklearn. The counterpart for tree models is the `feature_importances_` attribute, which measures how much each attribute contributes to the tree's splits according to some criterion (such as information entropy or the Gini index).
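As a quick illustration, here is a minimal sketch of reading off both attributes after fitting; the diabetes dataset and the tree depth are only illustrative choices, not part of the original discussion:

```python
# Embedded feature selection: compare a linear model's coefficients with a
# tree's feature importances on the same data (dataset choice is illustrative).
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

X, y = load_diabetes(return_X_y=True)

lin = LinearRegression().fit(X, y)
tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

print("linear coefficients:", lin.coef_)                 # sign + magnitude per attribute
print("tree importances:   ", tree.feature_importances_)  # non-negative, sum to 1
```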

 

Linear regression

  Let's first look at linear models for regression, and then move on to classification, because the linear idea is simpler and easier to understand in the regression setting.

 

one-dimensional dataset

  Let's start with data that has only one attribute, that is, we want to learn $f(x_i) = wx_i +b$ so that the difference between the predicted value $f(x_i)$ and the true value $y_i$ is minimized.

  So which objective function do we optimize? For regression problems, the most classic performance measure is the mean squared error $E(f;D)= \frac{1}{m} \sum_{i=1}^{m}{(f(x_i) - y_i)}^2$, where D is the dataset of m samples. We want to minimize the mean squared error. Graphically, this means finding a straight line that keeps the (vertical) distance from all the points to the line as small as possible.

  The solution is to take the partial derivatives of this function with respect to w and b, set them equal to 0, and solve for the two parameters.
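For reference, setting both partial derivatives to zero yields the standard closed-form solution (with $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$):

$$ w = \frac{\sum_{i=1}^{m} y_i (x_i - \bar{x})}{\sum_{i=1}^{m} x_i^2 - \frac{1}{m}\left(\sum_{i=1}^{m} x_i\right)^2}, \qquad b = \frac{1}{m}\sum_{i=1}^{m} (y_i - w x_i) $$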

 

multi-dimensional dataset

  Now let's expand the problem to high-dimensional datasets. The essence of the problem has not changed, but solving it takes a little more effort.

  In the multivariate case, we want to learn $f(\mathbf{x}_i) = \mathbf{w}^T \mathbf{x}_i+b$ such that the difference between the predicted value $f(\mathbf{x}_i)$ and the true value $y_i$ is as small as possible.

  To simplify the notation, we combine $\mathbf{w}$ and b into a new parameter vector $\hat{\mathbf{w}} = (\mathbf{w}; b)$, and we let $\mathbf{X}$ denote the original data matrix with a column of all 1s appended. That is, X is

$$ \left[ \begin{matrix} x_{11} & x_{12} & \cdots & x_{1n} & 1\\ x_{21} & x_{22} & \cdots & x_{2n}  & 1\\ \vdots & \vdots & \ddots & \vdots & \vdots  \\ x_{m1} & x_{m2} & \cdots & x_{mn} & 1 \\ \end{matrix} \right] $$

  The mean squared error in the high-dimensional case can now be written as $E(\hat{\mathbf{w}}) = {( \mathbf{y} - \mathbf{X} \hat{\mathbf{w}}) }^T ( \mathbf{y} - \mathbf{X} \hat{\mathbf{w}})$

  Dimension check: $\mathbf{y}$ is $m \times 1$ (m is the number of samples, n the number of attributes), $\mathbf{X}$ is $m \times (n+1)$, and $\hat{\mathbf{w}}$ is $(n+1) \times 1$.

  Ok, now our goal is to use some optimization method to minimize $E(\hat{\mathbf{w}})$. The first step is to take the derivative. (This part is harder if you are missing some background in matrix calculus and optimization.)

$\frac{ \partial E(\hat{\mathbf{w}}) }{\partial \hat{\mathbf{w}}} = 2 \mathbf{X}^T (\mathbf{X} \hat{\mathbf{w}} - \mathbf{y})$

When $\mathbf{X}^T \mathbf{X}$ is a full-rank or positive definite matrix, setting the derivative to zero gives the optimal solution $\hat{\mathbf{w}}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$. When the matrix is not full rank, a regularization term needs to be introduced: the function above is treated as an empirical loss, and a penalty term is added to it so that it becomes a structural loss.
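A minimal numpy sketch of both cases follows; the synthetic data and the regularization strength `lam` are arbitrary choices for illustration:

```python
# Normal-equation solution for multivariate linear regression, plus a ridge
# (L2-penalized) variant for when X^T X is not full rank. Data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 5
X = rng.normal(size=(m, n))
y = X @ rng.normal(size=n) + 0.1 * rng.normal(size=m)

Xhat = np.hstack([X, np.ones((m, 1))])          # append a column of 1s

# w_hat = (X^T X)^{-1} X^T y  (use solve instead of an explicit inverse)
w_hat = np.linalg.solve(Xhat.T @ Xhat, Xhat.T @ y)

# Ridge: add lam * I so the matrix is invertible even when X^T X is singular
lam = 1e-2
w_ridge = np.linalg.solve(Xhat.T @ Xhat + lam * np.eye(n + 1), Xhat.T @ y)

print("least-squares weights:", w_hat)
print("ridge weights:        ", w_ridge)
```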

 

Generalized Linear Model

  The models discussed above make the linear prediction approximate y directly, but we can also make the linear model $ \mathbf{w}^T \mathbf{x}+b$ approximate some transformation of y, such as $\ln(y)$. In that case we want $\ln(y) = \mathbf{w}^T \mathbf{x}+b$. More generally, for a monotone differentiable function $g(\cdot)$ we can consider $y = g^{-1}(\mathbf{w}^T \mathbf{x}+b)$; this is the generalized linear model.
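As a concrete instance of this idea, here is a small sketch of log-linear regression on synthetic data: fit $\ln y$ with an ordinary linear model, then map predictions back with the exponential (all numbers below are illustrative):

```python
# Log-linear regression: fit ln(y) by least squares, then map back with exp
# (i.e. g = ln, g^{-1} = exp). Data is synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(1)
m = 200
x = rng.uniform(0, 2, size=m)
y = np.exp(1.5 * x + 0.3) * rng.lognormal(sigma=0.05, size=m)  # positive targets

A = np.column_stack([x, np.ones(m)])
w, b = np.linalg.lstsq(A, np.log(y), rcond=None)[0]   # least squares on ln(y)

y_pred = np.exp(w * 2.0 + b)                          # predict at x = 2.0
print(f"learned w={w:.3f}, b={b:.3f}, prediction at x=2: {y_pred:.3f}")
```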

 

Log-odds regression (logistic regression)

  Now let's discuss how to handle classification problems. A linear model produces a real-valued output, while a binary classification problem needs a label such as {0,1}. Ideally we would use a unit step function: when the linear output is > 0 the prediction is 1, and when it is < 0 the prediction is 0. Unfortunately, step functions are not differentiable. So we look for a differentiable substitute, and the sigmoid function, whose graph is S-shaped, meets our requirements. The logistic function has the following form:

$ y = \frac{1}{1 + e^{-z}} $

The function graph is as follows:

(Figure 1-2: the S-shaped curve of the logistic function.)
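A short matplotlib sketch that reproduces this curve (the plotting range is an arbitrary choice):

```python
# Plot the logistic (sigmoid) function over an arbitrary range around zero.
import numpy as np
import matplotlib.pyplot as plt

z = np.linspace(-8, 8, 200)
y = 1.0 / (1.0 + np.exp(-z))

plt.plot(z, y)
plt.xlabel("z")
plt.ylabel("y = 1 / (1 + exp(-z))")
plt.title("Logistic function")
plt.show()
```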

In fact, this function does not map the linear output directly to a label; it maps it to a value between 0 and 1, which is quite interesting. We can interpret this value as the probability that the sample is a positive example. For instance, when the mapped value is 0.8 we consider the probability of being positive to be relatively high and classify the sample as the positive class; when the value is 0.2 we consider that probability to be small and classify it as the negative class. Outputting probabilities instead of hard class labels is common in many models, and it is in fact a very effective approach. Once you understand that the output of the logistic function is a probability value, the formulas that follow are easy to understand.

The linear part of the model is $z = \mathbf{w}^T \mathbf{x}+ b$; after applying the logistic function, this is converted into a value y between 0 and 1, so the result becomes:

$ y = \frac{1}{1 + e^{-(\mathbf{w}^T \mathbf{x}+ b)}}$, which, following the discussion above, we can read as the probability that the sample is a positive example.

This can be rearranged into $\ln \frac{y}{1-y} = \mathbf{w}^T \mathbf{x}+ b$

Since y is the probability of the positive class and 1-y is the probability of the negative class, their ratio $\frac{y}{1-y}$ is called the "odds", and taking its logarithm gives the "log-odds" (logit), which is where the method gets its name.

To find the parameters $\mathbf{w}$ and b here, we use the maximum likelihood method from parameter estimation.

Here is only the general idea. To apply maximum likelihood, we need the probabilities of the output being 1 and being 0. We can use the y defined above for the probability that the output is 1 and 1-y for the probability that the output is 0, and then build the likelihood function. After simplification, a convex function is obtained, which is solved according to convex optimization theory.
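A minimal gradient-descent sketch of this maximum-likelihood fit on synthetic two-class data; the learning rate and iteration count are arbitrary choices:

```python
# Logistic regression fitted by gradient descent on the negative log-likelihood.
import numpy as np

rng = np.random.default_rng(2)
m, n = 200, 2
X = np.vstack([rng.normal(-1, 1, size=(m // 2, n)),
               rng.normal(+1, 1, size=(m // 2, n))])
t = np.concatenate([np.zeros(m // 2), np.ones(m // 2)])   # labels in {0, 1}

Xhat = np.hstack([X, np.ones((m, 1))])                    # absorb b into w
w = np.zeros(n + 1)

for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-Xhat @ w))       # predicted P(y = 1 | x)
    grad = Xhat.T @ (p - t) / m               # gradient of the negative log-likelihood
    w -= 0.5 * grad                           # gradient-descent step

pred = (1.0 / (1.0 + np.exp(-Xhat @ w)) > 0.5).astype(int)
print("training accuracy:", (pred == t).mean())
```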

 

Linear Discriminant Analysis (LDA)

  Linear discriminant analysis also uses a linear function $y = \mathbf{w}^T \mathbf{x} $, whose purpose is to project our samples onto a one-dimensional space. When I first thought about what "one-dimensional" means here, I pictured mapping onto a straight line, which is indeed what the book draws: the samples are projected onto a line. But then, in higher dimensions, say three, would it be a projection onto a plane? No. The "one dimension" here should not be read off the picture; it means that y is a single number, i.e. one-dimensional, so the samples are mapped to one dimension.

  After projecting to one dimension, the intuition is that we want samples of the same class to end up as close together as possible, and samples of different classes as far apart as possible. From this idea we define our objective function. For a two-class problem {0,1}, let $\mu_0$ and $\mu_1$ denote the means of the class-0 and class-1 samples; note that $\mu_0$ and $\mu_1$ are n-dimensional vectors, where n is the number of features. Let $\Sigma_0$ and $\Sigma_1$ denote the covariance matrices of the class-0 and class-1 samples.

  Then, to push the two classes as far apart as possible, we make the distance between their projected centers as large as possible: that is, the bigger ${ \Vert \mathbf{w}^T \mu_0 - \mathbf{w}^T \mu_1 \Vert }_2 ^2$ the better.

  To keep samples of the same class as close together as possible, we make the projected covariances as small as possible: that is, the smaller $\mathbf{w}^T \Sigma_0 \mathbf{w} + \mathbf{w}^T \Sigma_1 \mathbf{w}$ the better. This quantity measures the scatter of the projected points within each class.

Therefore, our objective function is the following, which we want to be as large as possible.

$$J = \frac{{ \Vert \mathbf{w}^T \mu_0 - \mathbf{w}^T \mu_1 \Vert }_2^2}{\mathbf{w}^T \Sigma_0 \mathbf{w} + \mathbf{w}^T \Sigma_1 \mathbf{w}} = \frac{{ \Vert \mathbf{w}^T (\mu_0 - \mu_1) \Vert }_2^2}{\mathbf{w}^T \Sigma_0 \mathbf{w} + \mathbf{w}^T \Sigma_1 \mathbf{w}} = \frac{ \mathbf{w}^T (\mu_0 - \mu_1) (\mu_0 - \mu_1)^T \mathbf{w}}{\mathbf{w}^T  ( \Sigma_0  + \Sigma_1 ) \mathbf{w}}$$

  To rewrite this, we define the within-class scatter matrix and the between-class scatter matrix. The within-class scatter matrix is the denominator of the formula, $S_w = \Sigma_0 + \Sigma_1$, and the between-class scatter matrix is the numerator, $S_b = (\mu_0 - \mu_1) (\mu_0 - \mu_1)^T$, so the formula is rewritten as

$J = \frac {\mathbf{w}^T S_b \mathbf{w}} {\mathbf{w}^T S_w \mathbf{w}}$. So the goal of this optimization is to maximize the between-class distance while minimizing the within-class scatter.

The problem is solved with the Lagrange multiplier method, which for the two-class case yields $\mathbf{w} = S_w^{-1}(\mu_0 - \mu_1)$; in practice $S_w^{-1}$ is computed via singular value decomposition for numerical stability.
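A small numpy sketch of this two-class closed form on synthetic data (the data and the halfway threshold are illustrative choices):

```python
# Two-class LDA direction w = S_w^{-1}(mu0 - mu1), computed on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
X0 = rng.normal(loc=[-1.0, 0.0], scale=0.8, size=(100, 2))   # class 0
X1 = rng.normal(loc=[+1.5, 1.0], scale=0.8, size=(100, 2))   # class 1

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
S_w = np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)    # within-class scatter

w = np.linalg.solve(S_w, mu0 - mu1)                          # w proportional to S_w^{-1}(mu0 - mu1)
w /= np.linalg.norm(w)

# Project both classes onto w; a simple threshold halfway between the
# projected class means separates them.
midpoint = ((X0 @ w).mean() + (X1 @ w).mean()) / 2
print("projection direction:", w)
print("projected class means:", (X0 @ w).mean(), (X1 @ w).mean())
print("decision threshold:", midpoint)
```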
