Optimization: Modeling, Algorithms and Theory (Optimization Modeling - 1)


I am currently studying the book Optimization: Modeling, Algorithms and Theory and will record some notes here. I will also add some of my own understanding and try not to write too rigidly (the most basic material is of course still included).

Chapter 3 Optimization Modeling

This chapter will start with common modeling techniques, and then introduce common optimization models in statistics, signal processing, image processing, and machine learning. We will focus on explaining the ideas and practical implications behind optimization modeling.

3.1 Modeling techniques

3.1.1 Design of objective function

1. Least squares method

In my opinion, very few people study this book as complete beginners, so everyone should already be familiar with the least squares method.
Let $\phi_i(x): R^n\rightarrow R,\ i=1,2,\cdots,m$ be $n$-ary functions, and consider the following system of equations
$$b_i=\phi_i(x),\quad i=1,2,\cdots,m \tag{3.1.1}$$
where $b_i$ are known real numbers. This problem is not always solvable. First, if the number of equations $m$ exceeds the number of variables $n$, a solution to the system may not exist. Second, due to factors such as measurement error, the equations may not hold exactly. To handle practical situations, the idea of the least squares method is to minimize the squared $l_2$ norm of the error, that is,
$$\min_{x\in R^n}\sum_{i=1}^m\big(b_i-\phi_i(x)\big)^2 \tag{3.1.2}$$
If every $\phi_i(x)$ is a linear function, the problem is called linear least squares; otherwise it is nonlinear least squares.
The idea of least squares is very intuitive. If system (3.1.1) has a solution, then the global optimal value of problem (3.1.2) is 0 and solving it is equivalent to solving the system. If the system has no solution, problem (3.1.2) still gives, in a certain sense, the solution with the smallest error.

The least squares method uses the $l_2$ norm to measure the size of the error. Its two main advantages are:
(1) the squared $l_2$ norm is smooth and differentiable, which gives the objective function better properties;
(2) the $l_2$ norm is optimal for handling certain types of error; the reason will be given later.

Of course, least squares is not always the most reasonable choice (no free lunch). Depending on the actual problem, we often replace the $l_2$ norm with other norms. If we want to minimize the sum of the absolute values of the deviations, the corresponding model is:
$$\min_{x\in R^n}\sum_{i=1}^m|b_i-\phi_i(x)| \tag{3.1.3}$$
If we want to minimize the maximum deviation, the corresponding optimization model is:
$$\min_{x\in R^n}\max_i|b_i-\phi_i(x)| \tag{3.1.4}$$
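As a concrete toy illustration of these three criteria for linear $\phi_i(x)=a_i^Tx$, here is a small numpy sketch; the data and the candidate $x$ are arbitrary choices of mine and only serve to show how the objectives differ:

```python
# A minimal numpy illustration of the three error criteria (3.1.2)-(3.1.4)
# for linear phi_i(x) = a_i^T x; A, b, and x here are arbitrary toy data.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))        # rows are a_i
b = rng.standard_normal(50)             # observed b_i
x = np.zeros(3)                         # a candidate solution

r = b - A @ x                           # residuals b_i - phi_i(x)
print(np.sum(r**2))                     # least squares objective (3.1.2)
print(np.sum(np.abs(r)))                # sum of absolute deviations (3.1.3)
print(np.max(np.abs(r)))                # maximum deviation (3.1.4)

# The linear least squares problem itself has a direct solver:
x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
```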

2. Regularization

When modeling, we often want to impose properties on the solution. For example, when the optimal solution is not unique, not every solution is what we want. To make the solution smooth, to overcome ill-posedness of the problem, or to alleviate over-fitting (more smoothness is, in a sense, a disguised remedy for over-fitting), the model can be improved to
$$\min_{x\in R^n}\sum_{i=1}^m\big(b_i-\phi_i(x)\big)^2+\mu||x||_2^2 \tag{3.1.5}$$
where $\mu>0$ is a balance parameter. If a sparse solution is desired, the $l_0$ norm can be used to construct the following model (the $l_0$ norm is defined as the number of non-zero elements of a vector):
$$\min_{x\in R^n}\sum_{i=1}^m\big(b_i-\phi_i(x)\big)^2+\mu||x||_0 \tag{3.1.6}$$
where $\mu>0$ controls the sparsity of the solution. Since the $l_0$ norm is difficult to handle in practice, the $l_1$ norm is often used instead to promote sparsity; the model is as follows:
$$\min_{x\in R^n}\sum_{i=1}^m\big(b_i-\phi_i(x)\big)^2+\mu||x||_1 \tag{3.1.7}$$

In image processing, $x$ itself may not be sparse but may be sparse in a transform domain; the corresponding models are
$$\min_{x\in R^n}\sum_{i=1}^m\big(b_i-\phi_i(x)\big)^2+\mu||W(x)||_0 \tag{3.1.8}$$
and
$$\min_{x\in R^n}\sum_{i=1}^m\big(b_i-\phi_i(x)\big)^2+\mu||W(x)||_1 \tag{3.1.9}$$
where $W: R^n\rightarrow R^p$ denotes some transformation; commonly used choices include total variation and the wavelet transform.

The significance of the regularization term in the objective function is clear: we want the error to be as small as possible while the coefficients are not too large. If the coefficients are large, the regularization penalty is heavy, and when all coefficients shrink, the model naturally becomes smoother.
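To make the regularized models concrete, the following toy sketch evaluates the objectives (3.1.5)-(3.1.7) for a linear $\phi_i(x)=a_i^Tx$; the data and $\mu$ are arbitrary choices of mine:

```python
# Evaluating the regularized least squares objectives (3.1.5)-(3.1.7)
# for linear phi_i(x) = a_i^T x on toy data; mu is an arbitrary choice.
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 10)), rng.standard_normal(50)
x, mu = rng.standard_normal(10), 0.5

fit = np.sum((b - A @ x) ** 2)
print(fit + mu * np.sum(x ** 2))            # l2-squared regularization (3.1.5)
print(fit + mu * np.count_nonzero(x))       # l0 regularization (3.1.6)
print(fit + mu * np.sum(np.abs(x)))         # l1 regularization (3.1.7)
```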

3. Maximum likelihood estimation

In practical problems, much data comes from unknown distributions, and inferring the specific form of a distribution from data is very troublesome. Maximum likelihood estimation is a method commonly used in statistics to estimate a probability distribution: by maximizing the likelihood function, we make the observed data fit the assumed model as well as possible.

Here we consider a simple situation: assume we already know the data comes from a specific family of distributions, but we do not know the parameters. For convenience, let $p(a;x)$ be its probability mass function or probability density function, where $x$ is the unknown parameter. To estimate $x$, we take a sequence of independent and identically distributed sample points $a_1,a_2,\cdots,a_n$. The likelihood function is defined as the probability, under parameter $x$, of observing the set $\{a_i,\ i=1,2,\cdots,n\}$, that is,
$$L(x)=\prod_{i=1}^{n}p(a_i;x)$$
$L(x)$ is actually the joint probability (joint density) of these $n$ points, but the independent variable is now the parameter $x$.
The maximum likelihood estimate of the parameter is then defined as
$$\hat{x}\in\arg\max_{x\in\chi}L(x)$$
that is, the $x$ that maximizes $L(x)$, i.e. the parameter that makes the observed event most probable,
where $\chi$ is the parameter space. Assuming the maximum likelihood estimate exists, solving for it is essentially finding, within a family of distributions, the parameter most likely to have produced the sample (as noted above). In practice, maximizing the logarithm of the likelihood function is easier, i.e. we consider the maximization problem
$$\max_{x\in\chi}\ l(x)=\ln L(x) \tag{3.1.10}$$
This works because $\ln(x)$ is strictly monotonically increasing. In actual computation, I think the biggest reason is that the product turns into a sum, which is easier to work with.
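A tiny worked example of maximum likelihood (my own illustration, assuming NumPy and SciPy are available): for i.i.d. samples from $N(\mu,\sigma^2)$, numerically maximizing the log-likelihood should recover the closed-form MLE, namely the sample mean and the (biased) sample standard deviation.

```python
# A minimal sketch of maximum likelihood estimation: the data are assumed
# i.i.d. N(mu, sigma^2) and we maximize the log-likelihood numerically;
# for this model the closed-form MLE is the sample mean and the biased
# sample standard deviation, so the two answers should agree.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
a = rng.normal(loc=2.0, scale=1.5, size=500)       # observed samples a_1,...,a_n

def neg_log_likelihood(theta):
    mu, log_sigma = theta                           # optimize log(sigma) so that sigma > 0
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5*np.log(2*np.pi*sigma**2) - (a - mu)**2/(2*sigma**2))

res = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0]))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)                            # numerical MLE
print(a.mean(), a.std())                            # closed-form MLE for comparison
```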

4. Cost, loss, and gain functions

Many problems in operations research amount to minimizing cost (loss) or maximizing benefit. For example, when we play a game, every move has a corresponding score, and we naturally want the final score to be as high as possible. When traveling, we want the shortest route or the lowest travel cost for visiting all cities. When pricing items in a supermarket, we usually choose the most profitable price based on the price and its corresponding expected sales volume. All of these practical problems can be written as optimization problems whose objective is to minimize loss, to maximize benefit, or both (minimize risk while maximizing benefit).

5. Functional, variation

Many problems in physics and chemistry can be expressed as energy minimization. For example, in electronic-structure calculations, the stable state is computed by minimizing the interaction energy between atoms and electrons. Generally, the energy functional is defined on a function space, that is, the variable of the corresponding optimization problem is a function in an infinite-dimensional space. The corresponding optimality conditions can be obtained via the calculus of variations. Another method commonly used in practice is to apply a suitable discretization, pulling the minimization of the energy functional back from the infinite-dimensional space to a finite-dimensional one, thereby obtaining a discrete solution of the problem. Note that different discretizations generally correspond to different objective functions and different optimization problems.
PS: I haven’t learned this yet, so I can’t quite understand it in depth.

6. Relaxation

When the original problem is hard to solve, a commonly used technique in optimization is relaxation. The basic idea is to replace hard-to-handle terms in the objective function with simpler ones while retaining some properties of the original problem, making the problem easier to solve. For example, the $l_0$ norm is non-differentiable and non-convex; since the $l_1$ norm approximates the $l_0$ norm to some extent, in practice the $l_1$ norm is often used instead. Because the $l_1$ norm is convex, the theoretical analysis and algorithm design for the resulting model are simpler.
For low-rank optimization problems, the rank equals the number of non-zero singular values of the matrix, which is also non-convex and non-differentiable. The common approach is to replace it with the nuclear norm of the matrix (the $l_1$ norm of the vector of singular values), which yields a more tractable convex relaxation.

Another relaxation strategy, for a problem $\min f(x)$, is to replace $f(x)$ with a lower bound $f_R(x)$ of the objective function, where $f_R(x)$ should satisfy:
(1) $f_R(x)\le f(x),\ \forall x\in\chi$;
(2) $f_R(x)$ has a simple structure.
The Lagrangian function introduced later can in fact be regarded as such a relaxation of the original objective.
Note that the relaxed problem is not necessarily equivalent to the original one. The replacement of the $l_0$ norm by the $l_1$ norm that we keep mentioning is valid only under certain conditions, and the $l_2$ norm generally cannot be used as a relaxation of the $l_0$ norm.

3.1.2 Design of constraints

1. The physical nature of the problem itself

Depending on the practical meaning of the problem, the decision variables must satisfy various constraints. For example, in electronic structure calculations we require the orbital functions to be mutually orthogonal; in aircraft wing design, the change of wing shape is governed by a differential equation determined by the airflow around the wing, and in many cases non-negativity constraints apply. When linear or general equality observations are noisy, or more robustness is needed, we may also relax equality constraints into inequality constraints.

2. Equivalent substitution

For an optimization problem, if the objective function is a composite function, such as
$$\min_x f(Ax+b)$$
we often introduce a variable $y$ and the equality constraint $y=Ax+b$ and consider the equivalent constrained problem:
$$\min_{x,y} f(y)\quad s.t.\quad y=Ax+b$$
For the optimization problem $\min_x f(x)=h(x)+r(x)$, we often introduce $y$ and the constraint $x=y$, converting it into
$$\min_{x,y} h(x)+r(y)\quad s.t.\quad x=y$$
In this way the objective function is split. For inequality constraints, we can also convert them into equality constraints plus simple non-negativity or non-positivity constraints by introducing slack variables. For example, for the constraint $c(x)\le 0$, introduce $y\ge 0$; it can be equivalently converted into
$$c(x)+y=0,\quad y\ge 0$$
In addition, we can use the epigraph to obtain an equivalent form of a problem: for the optimization problem $\min_x f(x)$, by the definition of the epigraph it is equivalent to the optimization problem
$$\min_{x,t}\ t\quad s.t.\quad f(x)\le t$$
Equivalent reformulation of the constraints helps us study the mathematical properties of the problem more conveniently. After such transformations, many problems become the same type of problem, and we can then design a unified algorithm or model to solve them.
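As a small illustration of the epigraph reformulation (my own example, assuming the CVXPY package is available), the problem $\min_x ||Ax-b||_2$ can be written as $\min_{x,t} t$ s.t. $||Ax-b||_2\le t$:

```python
# A minimal sketch of the epigraph reformulation min t s.t. f(x) <= t,
# using f(x) = ||Ax - b||_2 as an example (CVXPY is assumed to be installed;
# this is my own illustration, not code from the book).
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((20, 5)), rng.standard_normal(20)

x = cp.Variable(5)
t = cp.Variable()
prob = cp.Problem(cp.Minimize(t), [cp.norm(A @ x - b, 2) <= t])
prob.solve()

# The optimal value equals that of the original problem min ||Ax - b||_2.
print(prob.value, np.linalg.norm(A @ x.value - b))
```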

3. Relaxation

When the constraints of the original model are too complex, we can also use relaxation to replace intractable constraints with tractable ones. The feasible region of the relaxed problem is larger than that of the original problem. For example, the box constraint $x\in[0,1]$ can replace the integer constraint $x\in\{0,1\}$, or the inequality constraint $c(x)\ge 0$ can replace the equality constraint $c(x)=0$.
There are two basic principles to follow when enlarging the feasible region:
(1) simplify the original constraints, that is, the relaxed problem must be easier to handle; otherwise why relax at all?
(2) do not enlarge the feasible region too much; over-enlarging it loses key information of the original problem, and solving the relaxation becomes meaningless.

There is a very natural question: what is the connection between the solution of the relaxed problem and that of the original problem? In general the relaxed problem is not equivalent to the original one, but under certain conditions it can be proved that the solution of the relaxed problem is also a solution of the original problem.

3.2 Regression analysis

3.2.1 Overview

The general regression model can be written in the following form:
$$b=f(a)+\epsilon \tag{3.2.1}$$
where $a\in R^d$ is the independent variable, $b\in R$ is the response variable, and $\epsilon\in R$ is the error (noise) of the model. In practice we generally only know the observed values of $a$ and $b$ while the error is unknown. The regression model uses $m$ observations $(a_i,b_i)$ to find the specific form of $f$, and then predicts the response variable for newly observed independent variables.

How to choose $f$ is very important. We could of course build a model satisfying $f(a_i)=b_i$ for all observations, but if $f$ is very complex its generalization ability will be poor, which is overfitting. Of course the fit cannot be too far off either. A good model must balance two aspects: a small error on the observed data, and a simple form.

3.2.2 Linear regression model

Let $(x_i,y_i),\ i=1,2,\cdots,m$ be the observed independent and response variables, with different data points independent of each other. Then for each data point (the book always writes the independent and response variables as $a$, $b$, $w$, etc., which I find uncomfortable, so I use my usual notation; the number of non-constant coefficients here is $n-1$ for convenience, so that after adding the constant term the total is a nicer-looking $n$):
$$y_i=w_1x_{i1}+w_2x_{i2}+\cdots+w_{n-1}x_{i,n-1}+b+\epsilon_i,\quad i=1,2,\cdots,m$$
where the $w_i$ are parameters to be determined. Append the constant term to the input features, $x_i=(x_i\quad 1)^T$, and let $w=(w_1,w_2,\cdots,w_{n-1},b)^T\in R^n$; then the linear regression model can be written as
$$y_i=w^Tx_i+\epsilon_i \tag{3.2.2}$$
If we want to write it in matrix form, we set
$$X=\begin{bmatrix} x_1^T\\ x_2^T\\ x_3^T\\ \vdots\\ x_m^T \end{bmatrix},\quad y=\begin{bmatrix} y_1\\ y_2\\ y_3\\ \vdots\\ y_m \end{bmatrix},\quad \epsilon=\begin{bmatrix} \epsilon_1\\ \epsilon_2\\ \epsilon_3\\ \vdots\\ \epsilon_m \end{bmatrix}$$
$X$ is an $m\times n$ matrix, $y$ and $\epsilon$ are $m\times 1$, and $w$ is $n\times 1$; we obtain the matrix form
$$y=Xw+\epsilon \tag{3.2.3}$$
Now we consider how to solve this linear regression model. Rather than simply taking error minimization for granted, we derive it via maximum likelihood, which is more mathematical.

Assume $\epsilon_i$ is Gaussian white noise, i.e. $\epsilon_i\sim N(0,\sigma^2)$. Then from $\epsilon_i=y_i-w^Tx_i$ we have
$$p(\epsilon_i)=p(y_i|x_i;w)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{(y_i-w^Tx_i)^2}{2\sigma^2}\Big)$$
The log-likelihood function is
$$l(w)=\ln\prod_{i=1}^m p(y_i|x_i;w)=-\frac{m}{2}\ln(2\pi)-m\ln\sigma-\sum_{i=1}^m\frac{(y_i-w^Tx_i)^2}{2\sigma^2}$$
Maximum likelihood estimation maximizes the likelihood function; after dropping constant terms, we obtain the following least squares problem:
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2 \tag{3.2.4}$$
Note that $\epsilon_i$ itself does not need to be known when constructing the maximum likelihood estimate. The most important point above is the connection established between solving the regression model and the least squares method: when the error is assumed to be Gaussian white noise, the least squares solution is exactly the solution of the linear regression model. Of course, if $\epsilon_i$ does not follow Gaussian white noise, the solution of the linear regression model is no longer the least squares solution; under certain other noise models it may instead be, for example, the solution of a least absolute deviation problem. I hope this is clear.
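A minimal simulation of this connection (toy sizes and noise level chosen by me): generate data from model (3.2.3) with Gaussian white noise and recover $w$ by solving the least squares problem (3.2.4).

```python
# A minimal simulation of model (3.2.3): generate y = Xw + eps with Gaussian
# white noise and recover w by solving the least squares problem (3.2.4).
# Sizes and the noise level are arbitrary toy choices.
import numpy as np

rng = np.random.default_rng(0)
m, n = 100, 4
X = np.hstack([rng.standard_normal((m, n - 1)), np.ones((m, 1))])  # constant column appended
w_true = np.array([1.5, -2.0, 0.5, 3.0])
y = X @ w_true + 0.1 * rng.standard_normal(m)

w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_hat)            # close to w_true when the noise is Gaussian and small
```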

3.2.3 Regularized linear regression model

Regularization is widely used in regression models. For example, when the number of features in a data set is greater than the number of samples, the solution of problem (3.2.4) is not unique, and regularization terms are needed to select solutions with particular properties.

1. Tikhonov regularization

To balance the fit of the model with the smoothness of the solution, Tikhonov regularization (ridge regression) adds the squared $l_2$ norm as a regularization term. Assuming $\epsilon_i$ is Gaussian white noise, linear regression with a squared $l_2$ regularizer actually solves the following problem:
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2+\mu||w||_2^2 \tag{3.2.6}$$
Thanks to the regularization term, the objective function of this problem is strongly convex; it effectively penalizes $w$. Another common variant is, given a parameter $\sigma>0$, to solve
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2 \quad s.t.\quad ||w||_2\le\sigma \tag{3.2.7}$$
When $\mu$ and $\sigma$ satisfy a certain relationship, the two problems have the same solution.
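Problem (3.2.6) has a closed-form solution obtained by setting the gradient to zero; a small numpy sketch (toy data, arbitrary $\mu$):

```python
# Ridge regression (3.2.6) in closed form: setting the gradient
# X^T(Xw - y) + 2*mu*w to zero gives w = (X^T X + 2*mu*I)^{-1} X^T y.
# mu and the data are toy choices for illustration.
import numpy as np

rng = np.random.default_rng(0)
m, n, mu = 30, 50, 1.0                       # more features than samples
X, y = rng.standard_normal((m, n)), rng.standard_normal(m)

w_ridge = np.linalg.solve(X.T @ X + 2 * mu * np.eye(n), X.T @ y)
print(np.linalg.norm(w_ridge))               # regularization keeps ||w||_2 moderate
```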

2. LASSO problem and its variants

If a sparse solution is desired, the $l_1$ norm can be added as the regularization term, giving the LASSO regression problem:
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2+\mu||w||_1$$
As for why the $l_1$ norm induces sparsity... well, a picture helps (the classic diamond-shaped constraint set). The LASSO model effectively performs feature selection.
Similar to problem (3.2.7), one can also consider
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2 \quad s.t.\quad ||w||_1\le\sigma \tag{3.2.8}$$
Considering the presence of the noise $\epsilon$, we can also, given $v>0$, consider the model
$$\min_{w\in R^n}||w||_1 \quad s.t.\quad ||Xw-y||\le v \tag{3.2.9}$$
The essential ideas of models (3.2.8) and (3.2.9) are similar, namely "make the $l_1$ norm of $w$ as small as possible while controlling the error", but they actually belong to different classes of optimization problems; we will explain this further in Chapter 4.
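One standard way to solve the LASSO model numerically is the proximal gradient (ISTA) iteration, whose proximal step is elementwise soft thresholding. Below is a minimal sketch with toy data and a hand-picked $\mu$; this is a generic method I am adding for illustration, not the algorithm the book prescribes at this point.

```python
# A minimal ISTA (proximal gradient) sketch for the LASSO model
# min 0.5||Xw - y||_2^2 + mu||w||_1; data, mu, and the step size are toy choices.
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 100
X = rng.standard_normal((m, n))
w_true = np.zeros(n); w_true[:5] = rng.standard_normal(5)     # sparse ground truth
y = X @ w_true + 0.01 * rng.standard_normal(m)

mu = 0.1
step = 1.0 / np.linalg.norm(X, 2) ** 2        # 1 / Lipschitz constant of the gradient
w = np.zeros(n)
for _ in range(500):
    g = X.T @ (X @ w - y)                     # gradient of the smooth part
    z = w - step * g
    w = np.sign(z) * np.maximum(np.abs(z) - step * mu, 0.0)   # soft thresholding

print(np.count_nonzero(np.abs(w) > 1e-6))     # the recovered w is sparse
```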

Of course, if $\epsilon$ is not Gaussian white noise, the loss function needs to be chosen according to the specific noise type.
The traditional LASSO problem asks for a sparse solution $w$, but sparsity in real problems can take many forms. If $w$ is required to have a group-sparse structure, i.e. the components of $w$ can be divided into $G$ groups and the parameters within each group must be simultaneously zero or simultaneously non-zero, the solution of the traditional LASSO problem cannot meet this requirement. The group LASSO model is therefore proposed:
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2+\mu\sum_{l=1}^G\sqrt{n_l}\,||w_{I_l}||_2 \tag{3.1.12}$$
where $I_l$ is the index set of the variables in the $l$-th group, with $|I_l|=n_l$ and $\sum_{l=1}^G n_l=n$.
When $n_l=1,\ l=1,2,\cdots,G$, problem (3.1.12) degenerates into the traditional LASSO problem. The regularization term can be seen as an $l_1$ norm over the group norms $||w_{I_l}||_2$; the group LASSO lifts sparsity from the level of individual features to the level of groups, but does not require sparsity within a group.
If we want both group sparsity and sparsity of individual features, the two regularization terms can be combined, like this:
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2+\mu_1\sum_{l=1}^G\sqrt{n_l}\,||w_{I_l}||_2+\mu_2||w||_1 \tag{3.1.13}$$

For some practical problems, the feature vector $w$ itself is not sparse but is sparse under a certain transformation, so the regularization term must be adjusted accordingly. The general form of the problem can be:
$$\min_{w\in R^n}\frac{1}{2}||Xw-y||_2^2+\mu||Fw||_1 \tag{3.2.14}$$
Of course, sparsity under a transformation can also be combined with sparsity of $w$ itself, so that the solution is sparse and, at the same time, the variation between components of $w$ is relatively gentle.

3.3 Logistic regression

In a classification problem, the output variable takes values in a discrete space. For binary classification there are only two possible labels, here taken to be $-1$ and $1$ (labels $0$ and $1$ are probably more familiar; both are standard, and using $-1$ and $1$ here makes the formulas a bit cleaner). We introduce one of the most classic and basic classification models, the logistic regression model. Given the feature $x$, logistic regression assumes that the probability that the sample belongs to class $1$ is
$$p(1|x;w)=P(y=1|x;w)=\theta(w^Tx)$$
where $\theta$ is the sigmoid function
$$\theta(z)=\frac{1}{1+\exp(-z)}$$
Then the probability of belonging to class $-1$ is
$$p(-1|x;w)=1-p(1|x;w)=\theta(-w^Tx)$$
Therefore the two cases can be written compactly as
$$p(y|x;w)=\theta(y\cdot w^Tx)$$
Assume the data pairs $\{x_i,y_i\},\ i=1,2,\cdots,m$ are independently and identically distributed. Then, given $x_1,x_2,\cdots,x_m$, the joint probability of $y_1,y_2,\cdots,y_m$ is
$$p(y_1,y_2,\cdots,y_m|x_1,x_2,\cdots,x_m;w)=\prod_{i=1}^m p(y_i|x_i;w)=\frac{1}{\prod_{i=1}^m\big(1+\exp(-y_i\,w^Tx_i)\big)}$$
We want to maximize this probability; taking the logarithm and negating, maximum likelihood estimation amounts to solving the following model:
$$\min_{w\in R^n}\sum_{i=1}^m\ln\big(1+\exp(-y_i\,w^Tx_i)\big) \tag{3.3.2}$$
Of course, various regularization terms can be appended to this model, for example
$$\min_{w\in R^n}\sum_{i=1}^m\ln\big(1+\exp(-y_i\,w^Tx_i)\big)+\lambda||w||_2^2 \tag{3.3.3}$$

If we regard $\{x_i,y_i\}$ as realizations of a pair of random variables $\{\alpha,\beta\}$, then the loss function can be written in expectation form
$$E\big[\ln\big(1+\exp(-\beta\,\alpha^Tw)\big)\big]$$
and, together with a regularization term, in the following abstract form:
$$\min_{w\in R^n}E\big[\ln\big(1+\exp(-\beta\,\alpha^Tw)\big)\big]+\lambda r(w)$$
where $r(\cdot)$ is the regularization term and $\lambda$ is the regularization parameter. In fact, many machine learning models can be written as the more general stochastic optimization problem (and its discrete version)
$$\min_{x\in R^n}f(x)+\lambda r(x) \tag{3.3.5}$$

where
$$f(x)=E\big[F(x,\xi)\big]$$
Concretely, this expectation is an integral or a plain average, depending on whether the underlying random variable $\xi$ is continuous or discrete.
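A minimal sketch of fitting the regularized logistic regression model (3.3.3) by plain gradient descent on synthetic data; the step size, $\lambda$, and the data are my own toy choices:

```python
# A minimal gradient-descent sketch for the l2-regularized logistic regression
# model (3.3.3); the data are synthetic and the step size is hand-picked,
# so this is an illustration rather than the book's algorithm.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 5
w_true = rng.standard_normal(n)
X = rng.standard_normal((m, n))
y = np.sign(X @ w_true + 0.1 * rng.standard_normal(m))   # labels in {-1, +1}

lam, step = 0.1, 0.01
w = np.zeros(n)
for _ in range(2000):
    margins = y * (X @ w)
    # d/dw sum ln(1+exp(-y_i w^T x_i)) = -sum y_i x_i * sigmoid(-y_i w^T x_i)
    grad = -(X.T @ (y / (1.0 + np.exp(margins)))) + 2 * lam * w
    w -= step * grad

print(np.mean(np.sign(X @ w) == y))                        # training accuracy
```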

3.4 Support vector machine

The support vector machine (SVM) is another widely used binary classification model. Let us first consider a simple case and assume the training data set is linearly separable.
(figure: linearly separable data with several candidate separating hyperplanes)
Given linearly separable data, the hyperplane that separates the two classes is generally not unique. What, then, is the ideal hyperplane? It should have the property that the data points are relatively far from it, which gives good robustness.
The distance from a point $x$ in the space to the hyperplane $w^Tx+b=0$ is
$$d=\frac{|w^Tx+b|}{||w||_2}$$
For a sample point $(x_i,y_i)$ with $y_i\in\{-1,1\}$, if it is classified correctly, then
$$y_i(w^Tx_i+b)>0,\quad i=1,2,\cdots,m$$
To find the ideal hyperplane, we want the minimum distance from the data points of the two classes to the hyperplane $w^Tx+b=0$ to be as large as possible, which gives the following initial model:
$$\max_{w,b,\gamma}\ \gamma \quad s.t.\quad \frac{y_i(w^Tx_i+b)}{||w||_2}\ge\gamma,\ i=1,2,\cdots,m \tag{3.4.1}$$
Here $\gamma$ is the minimum distance from all sample points to the hyperplane, and the goal is to maximize it.
Note that the constraints in (3.4.1) are equivalent to
$$y_i(w^Tx_i+b)\ge\gamma||w||_2$$

Now think about it: scaling $w$ and $b$ by the same positive factor does not change the hyperplane. For instance $w=1,b=2$ and $w=2,b=4$ describe the same line. So if we do not fix $||w||_2$, we obtain many solutions representing the same hyperplane, which is unnecessary; all we need is that one hyperplane. Therefore, for convenience, we impose $||w||_2=\frac{1}{\gamma}$, and the problem becomes equivalent to
$$\min_{w,b}\frac{1}{2}||w||_2^2 \quad s.t.\quad y_i(w^Tx_i+b)\ge 1,\ i=1,2,\cdots,m \tag{3.4.2}$$

The points $x_i$ for which $y_i(w^Tx_i+b)=1$ holds are called support vectors. It is not hard to see that the hyperplane parameters $w, b$ are completely determined by the support vectors.

When the linear separability assumption does not hold, we introduce a non-negative slack variable $\xi_i$ for each data point, allowing misclassified points; the constraints in (3.4.1) then become
$$\frac{y_i(w^Tx_i+b)}{||w||_2}\ge\gamma(1-\xi_i),\quad \xi_i\ge 0,\ i=1,2,\cdots,m$$

Here $\gamma(1-\xi_i)$ represents the (relaxed) margin of a possibly misclassified point. Obviously the amount of misclassification should not be too large; we control it through the slack variables via $\sum_{i=1}^m\xi_i$, and finally obtain
$$\min_{w,b,\xi}\frac{1}{2}||w||_2^2+\mu\sum_{i=1}^m\xi_i \quad s.t.\quad y_i(w^Tx_i+b)\ge 1-\xi_i,\ \xi_i\ge 0,\ i=1,2,\cdots,m \tag{3.4.3}$$
where $\mu$ is the penalty coefficient; increasing $\mu$ increases the penalty for misclassification.

Model (3.4.3) is also equivalent to the unconstrained optimization problem
$$\min_{w,b}\frac{1}{2}||w||_2^2+\mu\sum_{i=1}^m\max\{1-y_i(w^Tx_i+b),0\} \tag{3.4.4}$$

We see that $\max\{1-y_i(w^Tx_i+b),0\}$ penalizes points that do not satisfy the inequality $y_i(w^Tx_i+b)\ge 1$: it is positive when the inequality is violated and 0 otherwise. This can also be regarded as a penalty-function treatment of (3.4.2). Although $\max\{z,0\}$ is not differentiable, its simple form offers many possibilities for algorithm design, and introducing different penalty functions yields different SVM models.
In addition, when the training data contain redundant features, one may also replace $||w||_2^2$ with the $l_1$ norm.
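A minimal subgradient-descent sketch for the unconstrained hinge-loss form (3.4.4) on synthetic data; the step size and $\mu$ are hand-picked and this is only meant to illustrate the model, not to be a serious SVM solver:

```python
# A minimal subgradient-descent sketch for the soft-margin SVM in its
# unconstrained hinge-loss form (3.4.4); synthetic data, hand-picked
# parameters, illustration only.
import numpy as np

rng = np.random.default_rng(0)
m, n = 200, 2
X = rng.standard_normal((m, n))
y = np.where(X[:, 0] + X[:, 1] > 0, 1.0, -1.0)            # separable toy labels

mu, step = 1.0, 0.01
w, b = np.zeros(n), 0.0
for _ in range(1000):
    margins = y * (X @ w + b)
    active = margins < 1                                   # points with positive hinge loss
    # subgradient of 0.5||w||^2 + mu * sum max{1 - y_i(w^T x_i + b), 0}
    gw = w - mu * (y[active, None] * X[active]).sum(axis=0)
    gb = -mu * y[active].sum()
    w, b = w - step * gw, b - step * gb

print(np.mean(np.sign(X @ w + b) == y))                    # training accuracy
```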

3.5 Probabilistic graphical model

The probabilistic graphical model is an important concept in probability theory: it is a probabilistic model that uses a graph structure to describe conditional independence relations among multivariate random variables.
Given a random vector $X=(X_1,X_2,X_3,\cdots,X_n)$ in an $n$-dimensional space, the corresponding joint probability is an $n$-ary function. By the chain rule of conditional probability,
$$P(X=x)=\prod_{k=1}^n P(X_k=x_k|X_1=x_1,\cdots,X_{k-1}=x_{k-1})$$
Assume each variable $X_i,\ i=1,2,\cdots,n$ is discrete and takes $m$ values. Then, without any independence assumption, $(m^n-1)$ parameters are needed to determine the probability distribution. When I first saw this I puzzled over why it is this number, so let me give an example first (skip it if it is already clear). Suppose $X_1,X_2,X_3$ are three binary (0/1) variables. Without knowing any dependence structure, the distribution factors as
$$P(X=x)=P(X_1=x_1)P(X_2=x_2|X_1=x_1)P(X_3=x_3|X_1=x_1,X_2=x_2)$$
Now think: to know the distribution of $X$, how many parameters (probabilities) do we need?
We need $P(X_1=0)$ and $P(X_1=1)$,
$P(X_2=0|X_1=0), P(X_2=1|X_1=0), P(X_2=0|X_1=1), P(X_2=1|X_1=1)$,
$P(X_3=0|X_1=0,X_2=0), P(X_3=1|X_1=0,X_2=0), \ldots$
At first glance this looks like a lot, but once you know $P(X_1=0)$ you automatically know $P(X_1=1)$, so $P(X_1=x_1)$ only needs one parameter.
By the same reasoning, $P(X_2=x_2|X_1=x_1)$ needs two parameters and $P(X_3=x_3|X_1=x_1,X_2=x_2)$ needs four, for a total of $1+2+4=7$ parameters.
Now let us derive the general case: for $n$ variables where each $X_i$ takes $m$ values, why are $m^n-1$ parameters needed?
Following the same idea, $P(X_1=x_1)$ needs $m-1$ parameters, $P(X_2=x_2|X_1=x_1)$ needs $(m-1)m$ parameters,
and $P(X_n=x_n|X_1=x_1,\cdots,X_{n-1}=x_{n-1})$ needs $(m-1)m^{n-1}$ parameters.
So the total is $m-1+(m-1)m+\cdots+(m-1)m^{n-1}$; factoring out $(m-1)$ and summing the geometric series gives $m^n-1$.

All of this is to motivate the following point: if, given certain variables, other variables become independent, the parameter count drops. Going back to the first example, assume that given $X_2$, $X_1$ and $X_3$ are independent. Then
$$P(X=x)=P(X_1=x_1)P(X_2=x_2|X_1=x_1)P(X_3=x_3|X_1=x_1,X_2=x_2)=P(X_1=x_1)P(X_2=x_2|X_1=x_1)P(X_3=x_3|X_2=x_2)$$
By the earlier counting, only five parameters (two fewer) are needed to determine the joint distribution.

When there are many variables in a probabilistic model, their dependencies become more complex. A graphical model helps us understand the conditional independence relations among the random variables more intuitively. Next we introduce the undirected graphical model, also known as a Markov random field or Markov network, which uses an undirected graph to describe the joint distribution of a set of random variables with the Markov property.

Definition (Markov random field): Let $X=(X_1,X_2,\cdots,X_n)$ be a random vector and $G=(V,E)$ an undirected graph with $n$ nodes, where $V$ is the node set with $V=\{X_1,X_2,\cdots,X_n\}$ and $E$ is the set of edges between nodes. If $(G,X)$ satisfies the local Markov property, i.e. given the values of the neighbors of a variable $X_k$, it is independent of all the other variables:
$$P(X_k=x_k|X_{-k})=P(X_k=x_k|X_{N(k)})$$
where $X_{-k}$ denotes the set of random variables other than $X_k$ and $X_{N(k)}$ denotes the set of neighbors of $X_k$, i.e. the random variables directly connected to $X_k$ by an edge, then $(G,X)$ is called a Markov random field.
Assume the random vector in the graphical model follows a multivariate Gaussian distribution $N(\mu,\Sigma)$, and let $\Theta=\Sigma^{-1}$, the inverse of the covariance matrix $\Sigma$, be the precision matrix. We have the following proposition:
$$\theta_{ij}=0 \Leftrightarrow X_i \text{ and } X_j \text{ are independent given } X_k\ (k\neq i,j)$$
That is, if the element $\theta_{ij}$ of the precision matrix is 0, there is no edge between nodes $X_i$ and $X_j$ in the undirected graph, i.e. $X_i$ and $X_j$ are conditionally independent given the neighbor information; if $\theta_{ij}\neq 0$, there is an edge between $X_i$ and $X_j$. The precision matrix therefore encodes both the structure of the undirected graph and the parameters of the distribution.
In practice we care about how to learn the precision matrix from data. Using the likelihood function of the precision matrix, we obtain a convex optimization problem.

Specifically, given $m$ observed realizations $\{y^1,y^2,\cdots,y^m\}$ (each $y^i\in R^n$) of an $n$-dimensional Gaussian random vector $Y=(Y_1,Y_2,\cdots,Y_n)\sim N(\mu,\Sigma)$, $\mu\in R^n$, $\Sigma\in S^n_{++}$, the empirical covariance matrix is
$$S=\frac{1}{m}\sum_{i=1}^m(y^i-\bar{y})(y^i-\bar{y})^T$$
where $\bar{y}=\frac{1}{m}\sum_{i=1}^m y^i$ is the sample mean.

Some readers without a statistics background may ask what the empirical covariance is; my understanding is that it uses a set of data to estimate what the covariance matrix of the underlying distribution approximately looks like.
The log-likelihood function of the precision matrix $X$ is
$$l(X)=\ln\det(X)-Tr(XS),\quad X\succ 0$$
where $X\succ 0$ means that the variable $X$ ranges over the space of positive definite matrices. By maximizing the log-likelihood,
$$\max_{X\succ 0}\ l(X) \tag{3.5.1}$$
we obtain an estimate of the precision matrix $X$. In practice the graph is usually not fully connected, i.e. the precision matrix is sparse, so we build the following improved model:
$$\max_{X\succ 0}\ l(X)-\lambda||X||_1 \tag{3.5.2}$$
where $\lambda>0$ is a parameter controlling sparsity. The sparse solution of this model can be used to estimate conditional independence among high-dimensional random variables.

Besides likelihood-based modeling, we can also design the optimization problem directly in the loss-plus-regularization spirit. By definition, the true precision matrix is $\Theta=\Sigma^{-1}$, and $\Sigma$ can be estimated from samples by $S$, the empirical covariance matrix above. Our goal is to estimate $\Sigma^{-1}$ and make it sparse, which leads to the following optimization problem:
$$\min_{X}||SX-I||+\lambda||X||_1 \quad s.t.\quad X\succeq 0 \tag{3.5.3}$$
where $||\cdot||$ can be any norm (the Frobenius norm or the $l_1$ norm are the most common) and $X\succeq 0$ means $X$ takes values in the positive semidefinite cone. The meaning of each term is clear: the first term asks $X$ to be as close as possible to the inverse of $S$, $||X||_1$ asks $X$ itself to be sparse, and $X\succeq 0$ guarantees that the estimated precision matrix is positive semidefinite. A variant of the problem is
$$\min_{X}||X||_1 \quad s.t.\quad ||SX-I||\le\sigma,\ X\succeq 0 \tag{3.5.4}$$
which looks for the $X$ with the smallest $l_1$ norm among those within a given error tolerance; of course, if $\sigma$ is too small, the feasible region may be empty.
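A small numerical illustration (my own, not from the book) of the proposition that zeros of the precision matrix encode conditional independence: for a chain graph $X_1 - X_2 - X_3 - X_4$ the true precision matrix is tridiagonal, and the inverse of the empirical covariance $S$ approaches that sparsity pattern as the sample size grows.

```python
# Zeros in the precision matrix encode conditional independence: for a chain
# X1 - X2 - X3 - X4 the true precision matrix is tridiagonal, and inv(S)
# recovers that pattern approximately when the sample size m is large.
import numpy as np

Theta = np.array([[ 2., -1.,  0.,  0.],
                  [-1.,  2., -1.,  0.],
                  [ 0., -1.,  2., -1.],
                  [ 0.,  0., -1.,  2.]])        # precision matrix of a chain graph
Sigma = np.linalg.inv(Theta)

rng = np.random.default_rng(0)
m = 100000
Y = rng.multivariate_normal(np.zeros(4), Sigma, size=m)

S = np.cov(Y, rowvar=False, bias=True)          # empirical covariance (1/m convention)
print(np.round(np.linalg.inv(S), 2))            # close to Theta: near-zero off-chain entries
```

(For the penalized problem (3.5.2) itself, ready-made solvers exist; for example scikit-learn provides a GraphicalLasso estimator, if one prefers not to implement it by hand.)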

3.6 Phase retrieval

Phase retrieval is an important problem in signal processing: it recovers a signal from magnitude measurements of the signal in some transform domain.
The background is as follows: the object (signal) to be measured is placed at a specified position and illuminated; after diffraction imaging, a detector records its amplitude distribution, and we need to recover the original signal from this amplitude distribution. By the Fraunhofer diffraction equation, the optical field at the detector is well approximated by the Fourier transform of the observed object. However, since real detectors can only measure light intensity, we only obtain the amplitude information.

The phase of a signal usually carries rich information. The first column of the figure below shows two images Y and S; applying the two-dimensional discrete Fourier transform $F$ to each gives $F(Y)$ and $F(S)$. Since the transformed image $F(Y)$ is a complex matrix, it can be represented by its modulus $|F(Y)|$ and its phase $phase(F(Y))$, i.e.
$$F(Y)=|F(Y)|\odot phase(F(Y))$$
where $|F(Y)|$ takes the modulus of each element and $\odot$ denotes element-wise multiplication of matrices. Now swap the phases of $Y$ and $S$ while keeping the moduli, and apply the inverse Fourier transform $F^{-1}$:
$$\hat{S}=F^{-1}\big(|F(Y)|\odot phase(F(S))\big),\qquad \hat{Y}=F^{-1}\big(|F(S)|\odot phase(F(Y))\big)$$
(figure: the images reconstructed after swapping the Fourier phases of Y and S)
We can see that $\hat{S}$ essentially has the shape of $S$, while $\hat{Y}$ essentially has the shape of $Y$. This experiment tells us that the phase information may be more important than the modulus information.
(I am really not very familiar with signal processing, so I can only try my best to understand this part.)
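The phase-swap experiment is easy to reproduce with numpy's FFT routines; the sketch below uses two random arrays in place of the book's images Y and S (with natural images the visual effect described above is much more striking):

```python
# A minimal numpy sketch of the phase-swap experiment described above,
# using two random "images" in place of the pictures Y and S from the book.
import numpy as np

rng = np.random.default_rng(0)
Y = rng.random((64, 64))
S = rng.random((64, 64))

FY, FS = np.fft.fft2(Y), np.fft.fft2(S)
phase = lambda Z: np.exp(1j * np.angle(Z))          # unit-modulus phase factor

S_hat = np.fft.ifft2(np.abs(FY) * phase(FS)).real   # modulus of Y, phase of S
Y_hat = np.fft.ifft2(np.abs(FS) * phase(FY)).real   # modulus of S, phase of Y

# With natural images, S_hat visually resembles S and Y_hat resembles Y.
print(S_hat.shape, Y_hat.shape)
```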
In practical applications we do not necessarily sample the original signal with the Fourier transform. Given a complex signal $x=(x_0,x_1,\cdots,x_{n-1})^T\in C^n$ and a number of samples $m$, we can define the following linear transformation componentwise:
$$(A(x))_k=\bar{a}_k^Tx,\quad k=1,2,\cdots,m$$
where $a_k\in C^n$ are known complex vectors ($\bar{a}$ presumably denotes the complex conjugate). It is easy to verify that when the linear transformation $A$ is the discrete Fourier transform, $a_k$ has the form
$$a_k=\Big(e^{2\pi i\frac{k-1}{n}t}\Big)^{n-1}_{t=0},\quad k=1,2,\cdots,m$$
For general $a_k$, if the corresponding amplitude observation is denoted $b_k$, then the phase retrieval problem essentially amounts to solving the following system of quadratic equations:
$$b_k^2=|\bar{a}_k^Tx|^2,\quad k=1,2,\cdots,m \tag{3.6.1}$$
How should we understand this? My understanding is that each detected amplitude $b_k$ is obtained by applying one measurement (one vector $a_k$ of the transformation) to the original signal, so each amplitude has its own corresponding $a_k$. I do not fully understand it either; for now this is the best explanation I can give.

Although solving a system of linear equations is easy, solving a system of quadratic equations is NP-hard in general. Below we introduce two ways of converting the problem into a tractable optimization model.

1. Least squares model

A common approach is to transform problem (3.6.1) into a nonlinear least squares problem
$$\min_{x\in C^n}\sum_{i=1}^m\big(|\bar{a}_i^Tx|^2-b_i^2\big)^2 \tag{3.6.2}$$
The objective of this model is a differentiable quartic function, and the problem is non-convex. Compared with problem (3.6.1), model (3.6.2) handles observation noise better. In practice the following non-smooth model is also often used:
$$\min_{x\in C^n}\sum_{i=1}^m\big(|\bar{a}_i^Tx|-b_i\big)^2 \tag{3.6.3}$$
If $a_i$ and $x$ are all real, we obtain the corresponding real-valued models:
$$\min_{x\in R^n}\sum_{i=1}^m\big(|\langle a_i,x\rangle|^2-b_i^2\big)^2 \tag{3.6.4}$$
$$\min_{x\in R^n}\sum_{i=1}^m\big(|\langle a_i,x\rangle|-b_i\big)^2 \tag{3.6.5}$$

Because the phase retrieval problem has important practical applications, how to find the global optimal solution of models (3.6.2)-(3.6.5) has attracted widespread attention.
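As a toy illustration of model (3.6.4) in the real case, the sketch below runs plain gradient descent on the quartic objective; phase retrieval is non-convex, so with a random initialization this is not guaranteed to find the global solution, and the step size is a crude heuristic of mine:

```python
# A minimal real-valued sketch of the least squares model (3.6.4) solved by
# plain gradient descent; toy sizes, a hand-tuned step size, and no guarantee
# of reaching the global minimum (illustration only, not the book's algorithm).
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 80
A = rng.standard_normal((m, n))             # rows are the measurement vectors a_i
x_true = rng.standard_normal(n)
b2 = (A @ x_true) ** 2                       # observed squared amplitudes b_i^2

x = 0.1 * rng.standard_normal(n)
step = 0.1 / (m * n * np.mean(b2))           # crude step-size heuristic for these sizes
for _ in range(5000):
    r = (A @ x) ** 2 - b2
    grad = 4 * A.T @ (r * (A @ x))           # gradient of sum ((a_i^T x)^2 - b_i^2)^2
    x -= step * grad

# x can only be recovered up to a global sign in the real case.
print(min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)))
```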

2. Phase lift

Phase lift is another method for solving the phase retrieval problem. As mentioned before, the essential difficulty of phase retrieval lies in the quadratic equations (3.6.1). Note that
$$|\bar{a}_i^Tx|^2=\bar{a}_i^Tx\bar{x}^Ta_i=Tr(x\bar{x}^Ta_i\bar{a}_i^T)$$
Let $X=x\bar{x}^T$; then the system (3.6.1) can be transformed into
$$Tr(Xa_i\bar{a}_i^T)=b_i^2,\ i=1,2,\cdots,m,\quad X\succeq 0,\quad rank(X)=1 \tag{3.6.6}$$
If $X=x\bar{x}^T$, then the rank of $X$ is usually 1.

This is because $X$ is the product of the column vector $x$ and its conjugate transpose $\bar{x}^T$; such a matrix is called a rank-one matrix, since every column is a scalar multiple of $x$. Specifically, if $x$ is a non-zero complex vector, then $X$ has rank 1; if $x$ is the zero vector, then $X$ is the zero matrix with rank 0. In short, $X=x\bar{x}^T$ has rank 1 unless $x$ is the zero vector.

If system (3.6.1) has a solution $x$, then $X=x\bar{x}^T$ is a solution of system (3.6.6). For system (3.6.6) we consider the optimization problem
$$\min_{X}\ rank(X) \quad s.t.\quad Tr(Xa_i\bar{a}_i^T)=b_i^2,\ i=1,2,\cdots,m,\quad X\succeq 0 \tag{3.6.7}$$
Because a rank-one solution exists, the rank of the optimal solution of problem (3.6.7) is at most 1. Performing a rank-one decomposition $X=x\bar{x}^T$, every $cx$ with $c\in C$ and $|c|=1$ is a solution of equation (3.6.1).
The above discussion shows that problem (3.6.7) is equivalent to the phase retrieval problem (3.6.1); the formal difference is that the variable of problem (3.6.7) is the matrix $X$. The operation of turning a vector variable into a matrix variable is called "lifting"; its purpose is to turn constraints that are quadratic in the vector $x$ into constraints that are linear in the matrix $X$.

Because of the computational complexity of rank minimization, we relax the rank using the nuclear norm and obtain the following optimization problem:
$$\min_{X}\ Tr(X) \quad s.t.\quad Tr(Xa_i\bar{a_i}^T)=b_i^2,\ i=1,2,\cdots,m,\quad X\succeq 0 \tag{3.6.8}$$
where $Tr(X)=||X||_*$ follows from the positive semidefiniteness of $X$. Note that the objective of problem (3.6.8) is now a linear function; its only nonlinear part is the positive semidefinite constraint $X\succeq 0$. When problem (3.6.8) has a unique solution, the solution of the original phase retrieval problem can be obtained by a rank-one decomposition. Some papers have proved that when $m\ge c_0 n\ln n$ ($c_0$ is a problem-dependent constant), the solution of problem (3.6.8) is rank-one with high probability.

PS: a rank-one decomposition represents a matrix as (a combination of) rank-1 matrices. In linear algebra, a rank-one decomposition usually takes the form
$$A=uv^T$$
where $A$ is a matrix, $u$ and $v$ are vectors, and $\cdot^T$ denotes the transpose. The key feature is that $A$ can be expressed as the outer product of two vectors; $u$ and $v$ are the basis vectors of the decomposition, and their outer product $uv^T$ is a matrix of rank 1.

3.7 Principal component analysis

Principal component analysis (PCA) is an important technique for data processing and dimensionality reduction; it provides a way to represent points of a high-dimensional space in a low-dimensional subspace. Given data $a_i\in R^p,\ i=1,2,\cdots,n$, where $n$ is the number of samples, define $A=[a_1,a_2,\cdots,a_n]$. Without loss of generality we assume each row of $A$ sums to 0 (otherwise subtract the row mean; the relative structure of the data does not change).
The idea of principal component analysis is to find the subspace spanned by a few directions along which the sample points have the largest variance, and then project the data points onto that subspace to achieve dimensionality reduction.
PS: there is actually a more accessible way to understand principal components, which I give further below; the book's presentation here is a bit more mathematically formal.
Suppose we want to project the set of data points $\{a_i\}_{i=1}^n$ in $R^p$ onto a $d$-dimensional subspace of $R^p$ ($d<p$). Let $X\in R^{p\times d}$ be the column-orthogonal matrix formed by an orthonormal basis of this subspace.

PS: (regarding "$X\in R^{p\times d}$ is the column-orthogonal matrix formed by an orthonormal basis of the subspace": the matrix $X$ represents an orthonormal basis of this low-dimensional subspace. $X$ is a $p\times d$ matrix, where $p$ is the dimension of the original data and $d$ is the dimension of the target subspace. The columns of $X$ are orthonormal and form a basis of the subspace. Such a matrix is often called a "projection matrix" and can be used to project the original data points into the low-dimensional subspace.)

The projection of a data point $a_i$ onto the subspace spanned by $X$ is $P_X(a_i)=XX^Ta_i$. This is where I got stuck for a long time: $XX^T$ is $p\times p$ and $a_i$ is $p\times 1$ (or $1\times p$?), so I did not see how this projects into $d$ dimensions. (One way to see it: $X^Ta_i\in R^d$ gives the coordinates of $a_i$ in the subspace, while $XX^Ta_i$ expresses that projected point back in the original $R^p$ coordinates.) If anyone reading this can explain further, thank you!!

So let me describe PCA the way I understood it before.

First of all, to describe a vector we need a basis: it suffices to give the vector's projection onto the line of each basis vector. To reduce two-dimensional vectors to one dimension we need to find such a basis; as shown below, we can project all the two-dimensional data points onto $x_1$.

(figure: two-dimensional data points projected onto the $x_1$ axis)
The question is what this basis needs to satisfy. If we want to retain as much information as possible, the projected data should be as spread out as possible. How do we express spread? Right: variance.
So the problem becomes: find a basis such that, when all the data are expressed in this basis, the variance is maximized.

For reducing two dimensions to one we only need the single direction of maximum variance, but for higher dimensions there is another issue. To reduce three dimensions to two, we first find the direction of maximum variance, which fixes the first direction, and then choose a second direction.
If at this point we simply chose the direction with the largest variance again, it would almost coincide with the first one, and such a dimension would be useless. Intuitively, the two directions should represent as much information as possible, so we do not want them to be linearly correlated, because correlation means the two directions repeat the same information.

Mathematically, the covariance of two fields (attributes) expresses their correlation. As before, let each field have mean 0; then
$$cov(a,b)=\frac{1}{m}\sum_{i=1}^m a_ib_i$$

Everything so far is preparation; we have not reached PCA proper yet.
Suppose we have only two attributes $a$ and $b$, and form them into a matrix $X$:
$$X=\begin{pmatrix} a_1&a_2&\cdots&a_m\\ b_1&b_2&\cdots&b_m \end{pmatrix}$$
Multiply $X$ by $X^T$ and scale by $\frac{1}{m}$:
$$\frac{1}{m}XX^T=\begin{pmatrix} \frac{1}{m}\sum_{i=1}^m a_i^2 & \frac{1}{m}\sum_{i=1}^m a_ib_i\\ \frac{1}{m}\sum_{i=1}^m a_ib_i & \frac{1}{m}\sum_{i=1}^m b_i^2 \end{pmatrix}$$
A miracle happens: this is exactly the covariance matrix. The two diagonal elements are the variances of the two attributes, and the off-diagonal elements are the covariance of $a$ and $b$.
So what do we want now? Denote the transformed data by $Y$; we want
$$YY^T=\begin{pmatrix} a&0&0&0\\ 0&b&0&0\\ 0&0&c&0\\ 0&0&0&d \end{pmatrix}$$
to look something like this square matrix (the specific size is not the point): only the diagonal entries are non-zero and all other entries are 0.

Let the original data matrix be $X\in R^{p\times m}$ with covariance matrix $C$, and let $P\in R^{d\times p}$ be the matrix whose rows are the (reduced) basis vectors. Set $Y=PX\in R^{d\times m}$; then $Y$ is the data $X$ expressed in the basis $P$. Let the covariance matrix of $Y$ be $D$, and let us derive the relation between $D$ and $C$:
$$D=\frac{1}{m}YY^T=\frac{1}{m}(PX)(PX)^T=\frac{1}{m}PXX^TP^T=P\Big(\frac{1}{m}XX^T\Big)P^T=PCP^T$$
Now we see that the $P$ we are looking for is precisely the $P$ that diagonalizes the original covariance matrix. In other words, the optimization goal becomes: find a matrix $P$ such that $PCP^T$ is a diagonal matrix with diagonal elements sorted from large to small; the first $K$ rows of $P$ are then the basis we want for reducing to $K$ dimensions. For this we should thank the mathematicians who worked it all out in advance.

From the above, the covariance matrix is a real symmetric matrix, and real symmetric matrices have a series of very nice properties in linear algebra:
(1) eigenvectors corresponding to different eigenvalues of a real symmetric matrix are orthogonal;
(2) if an eigenvalue $\lambda$ has multiplicity $r$, then there exist $r$ linearly independent eigenvectors corresponding to $\lambda$, and these $r$ eigenvectors can be orthonormalized.
Hence an $n\times n$ real symmetric matrix always has $n$ orthonormal eigenvectors. Denote these $n$ eigenvectors by $e_1,e_2,\cdots,e_n$ and let
$$E=(e_1,e_2,\cdots,e_n)$$
对协方差矩阵有如下结论:
E T C E = A = [ λ ! 0 ⋯ 0 0 λ 2 ⋯ 0 ⋮ ⋮ ⋱ ⋮ 0 0 ⋯ λ n ] E^TCE=A=\begin{bmatrix} {\lambda_!}&{0}&{\cdots}&{0}\\ {0}&{\lambda_2}&{\cdots}&{0}\\ {\vdots}&{\vdots}&{\ddots}&{\vdots}\\ {0}&{0}&{\cdots}&{\lambda_n}\\ \end{bmatrix} ETCE=A= λ!000λ2000λn
So we have found the matrix $P$ we need: $P=E^T$.
That is, $P$ is the matrix whose rows are the orthonormalized eigenvectors of the covariance matrix $C$. If we arrange the rows of $P$ from top to bottom so that the corresponding eigenvalues in $\Lambda$ go from large to small, then multiplying the first $K$ rows of $P$ with the original matrix $X$ gives the dimension-reduced data matrix $Y$.
I had a doubt here: the directions picked out by these eigenvectors are guaranteed to be orthogonal, and the eigenvalues only tell us their relative sizes. Does this really guarantee that the leading direction has the largest variance among all possible directions? After searching, I found that there is indeed a mathematical derivation showing that the leading eigenvector is the direction that maximizes the variance.
See, for example, this article on Zhihu: https://zhuanlan.zhihu.com/p/338322144
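The screenshots of that derivation are not reproduced here, so below is my own sketch of the standard argument. To find a unit direction $w$ that maximizes the variance of the projected (zero-mean) data, we solve
$$\max_{w}\ w^TCw \quad \mathrm{s.t.}\quad w^Tw=1.$$
Introducing a Lagrange multiplier $\lambda$ and setting the gradient of $w^TCw-\lambda(w^Tw-1)$ to zero gives
$$Cw=\lambda w,$$
so every stationary point is an eigenvector of $C$, and at such a point the objective equals $w^TCw=\lambda w^Tw=\lambda$. Hence the maximal variance is the largest eigenvalue, attained at its eigenvector; repeating the argument on the orthogonal complement gives the subsequent directions.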

To summarize, suppose there are $m$ pieces of $n$-dimensional data; the steps are as follows (a small code sketch is given after the list):
(1) arrange the original data into an $n$-row, $m$-column matrix $X$;
(2) zero-center each row of $X$ (subtract the mean of that row from every entry);
(3) compute the covariance matrix $C=\frac{1}{m}XX^T$;
(4) compute the eigenvalues of the covariance matrix and the corresponding eigenvectors;
(5) arrange the eigenvectors as rows, from top to bottom in decreasing order of their eigenvalues, and take the first $k$ rows to form the matrix $P$;
(6) $Y=PX$ is the data reduced to $k$ dimensions.
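Putting the six steps together, here is a minimal NumPy sketch (my own illustration, not code from the book; names such as `pca` and `data` are made up):

```python
import numpy as np

def pca(data, k):
    """Reduce m samples of n-dimensional data to k dimensions.

    data: array of shape (n, m) -- each column is one sample,
          matching the X in R^{n x m} convention above.
    """
    n, m = data.shape
    X = data - data.mean(axis=1, keepdims=True)   # step (2): zero-center each row
    C = (X @ X.T) / m                             # step (3): covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)          # step (4): eigh handles symmetric matrices
    order = np.argsort(eigvals)[::-1]             # sort eigenvalues from large to small
    P = eigvecs[:, order[:k]].T                   # step (5): first k eigenvectors as rows
    Y = P @ X                                     # step (6): k-dimensional data
    return Y, P

# toy usage: 200 samples of 5-dimensional data reduced to 2 dimensions
rng = np.random.default_rng(0)
data = rng.normal(size=(5, 200))
Y, P = pca(data, k=2)
print(Y.shape)   # (2, 200)
```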

3.8 Matrix separation problem

The matrix separation problem is another important low-rank matrix computation problem. Given a matrix $M{\in}R^{m \times n}$, we want to decompose it into the sum of a low-rank matrix $X$ and a sparse matrix $S$, i.e. $X+S=M$, while keeping both the rank of $X$ and the $l_0$ norm of $S$ relatively small. This gives the model
$$\min_{X,S{\in}R^{m \times n}}\ \mathrm{rank}(X)+\mu\|S\|_0 \quad \mathrm{s.t.}\quad X+S=M \tag{3.8.1}$$
Since the model contains the rank of a matrix and the $l_0$ norm, it is hard to solve directly. So we replace the rank by the nuclear norm (the sum of the singular values of the matrix) and use the $l_1$ norm to enforce sparsity of the noise matrix:
$$\min_{X,S{\in}R^{m \times n}}\ \|X\|_*+\mu\|S\|_1 \quad \mathrm{s.t.}\quad X+S=M \tag{3.8.2}$$
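As an illustration of how (3.8.2) is commonly solved in practice, below is a rough ADMM-style sketch built from singular value thresholding (the proximal operator of the nuclear norm) and entrywise soft-thresholding (the proximal operator of the $l_1$ norm). The scheme, the penalty parameter `rho`, and the default choice of `mu` are my own illustration, not taken from the book:

```python
import numpy as np

def soft_threshold(A, tau):
    """Entrywise soft-thresholding: prox of tau * ||.||_1."""
    return np.sign(A) * np.maximum(np.abs(A) - tau, 0.0)

def svt(A, tau):
    """Singular value thresholding: prox of tau * ||.||_*."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ np.diag(soft_threshold(s, tau)) @ Vt

def robust_pca(M, mu=None, rho=1.0, n_iter=200):
    """Approximately solve  min ||X||_* + mu*||S||_1  s.t.  X + S = M."""
    m, n = M.shape
    if mu is None:
        mu = 1.0 / (2.0 * np.sqrt(max(m, n)))    # heuristic, in the spirit of mu = 1/(2*sqrt(m))
    X = np.zeros_like(M)
    S = np.zeros_like(M)
    Y = np.zeros_like(M)                         # dual variable for the constraint X + S = M
    for _ in range(n_iter):
        X = svt(M - S + Y / rho, 1.0 / rho)                # low-rank update
        S = soft_threshold(M - X + Y / rho, mu / rho)      # sparse update
        Y = Y + rho * (M - X - S)                          # dual ascent
    return X, S

# toy usage: low-rank part plus sparse spikes
rng = np.random.default_rng(0)
L = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 60))    # rank-3 component
S_true = np.where(rng.random((50, 60)) < 0.05, 5.0, 0.0)   # sparse component
M = L + S_true
X, S = robust_pca(M)
print(np.linalg.norm(M - X - S))   # how well X + S reproduces M
```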
The matrix separation problem is sometimes called robust principal component analysis (robust PCA): in image processing, the goal is to remove as much of the noise in the original data as possible and then find the optimal projection.

As an example, consider the video segmentation problem, which refers to extracting objects of interest from a video scene, e.g. separating out the static part of a video. Each frame of the video is a static picture. Although the static objects in each picture may show subtle differences between frames due to lighting changes, occlusion, translation, noise, and so on, there is undeniably a high degree of similarity between them. If the static parts of all frames are stacked into a matrix, then, because the static objects have a certain internal structure, this matrix must be of low rank (its rows or columns are nearly linearly dependent). Similarly, the dynamic parts of the video and other background factors can be regarded as noise. Our task then becomes decomposing the information matrix contained in the video into the sum of a structured low-rank matrix and a sparse noise matrix.

We use actual video data to illustrate. Suppose the video has $n$ frames in total and each frame (each picture) has $m$ pixels. We represent each picture as a column vector and put these column vectors together to form the given matrix $M$. Since the background of the lobby is basically the same in every frame, it corresponds to the low-rank part of $M$. Setting $\mu=\frac{1}{2\sqrt{m}}$ and solving model (3.8.2), we obtain the matrices $X$ and $S$ shown in the second and third columns of the figure below, respectively.
(figure: original video frames, the recovered low-rank background $X$, and the sparse foreground $S$)

3.9 Dictionary learning

Just as all kinds of knowledge can be expressed by permutations and combinations of words in a dictionary, the purpose of dictionary learning is to compress existing large-scale data sets and find the most basic principles hidden behind these data points.
Consider a data set $\{a_i\}_{i=1}^n$ in an $m$-dimensional space, $a_i{\in}R^m$, and assume that every $a_i$ is generated by the same dictionary, with the generated data containing noise. The linear model of dictionary learning can then be written as
$$a=Dx+e$$
where $D{\in}R^{m \times k}$ is an unknown dictionary, each of whose columns $d_i$ is a basis vector (atom) of the dictionary; $x$ is the vector of coefficients over the dictionary basis, also unknown; and $e$ is some kind of noise. To get an intuition, imagine you have 1,000,000 pictures, each of size $10\times 10$, i.e. 100-dimensional. Dictionary learning then finds $k$ 100-dimensional ($10\times 10$) "pictures" to serve as a dictionary, and represents the 1,000,000 pictures through combinations of these $k$ atoms.
The dictionary learning model differs from the multiple linear regression model because we need to solve for both the dictionary $D$ and the coefficients $x$ (each picture has its own corresponding $x$).

Generally speaking, the data dimension $m$ and the number of basis vectors in the dictionary $k$ are both much smaller than the number of observations $n$. If $k<m$, we call the dictionary $D$ incomplete; if $k>m$, we call the dictionary $D$ overcomplete; the dictionary with $k=m$ brings no improvement to the representation and is not considered in practice. (With an overcomplete dictionary the coefficients $x$ are generally sparse, which also has its advantages.)

When the noise $e$ is Gaussian white noise, a natural loss function is
$$f(D,X)=\frac{1}{2n}\|DX-A\|_F^2$$
where $A=[a_1,a_2,\cdots,a_n]{\in}R^{m \times n}$ collects all observed data and $X=[x_1,x_2,\cdots,x_n]{\in}R^{k \times n}$ collects all basis coefficients.
In actual computation we do not require the columns of $D$ to be orthogonal (i.e. the basis vectors need not be linearly independent), so a sample point $a_i$ may have many different representations; this redundancy lets us introduce sparsity into the representation, which is where an overcomplete dictionary helps. Sparsity also helps us quickly determine which basis vectors represent a sample point, thereby improving computation speed. Specifically, still under the assumption that $e$ is Gaussian white noise, we define the sparse coding loss function
$$f(D,X)=\frac{1}{2n}\|DX-A\|_F^2+\lambda\|X\|_1$$
where $\lambda$ is a regularization parameter whose size controls the sparsity of $X$. Note, however, that $f(D,X)$ only involves the product $DX$, so a minimizer would be forced to have $\|X\|_1{\rightarrow}0$: if $(D,X)$ were a minimizer, then $f(cD,\frac{1}{c}X)<f(D,X)$ for every $c>1$, since the product $DX$ is unchanged while the regularization term shrinks as $X$ is scaled down, making $f$ smaller.
Therefore, the sparsity-promoting regularization term by itself is meaningless here. An improved approach is to require that the norms of the basis vectors in the dictionary (i.e. the size of $D$) not be too large.
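Spelling out the scaling argument (assuming $\|X\|_1>0$): for any $c>1$,
$$f\Big(cD,\frac{1}{c}X\Big)=\frac{1}{2n}\Big\|(cD)\cdot\frac{1}{c}X-A\Big\|_F^2+\lambda\Big\|\frac{1}{c}X\Big\|_1=\frac{1}{2n}\|DX-A\|_F^2+\frac{\lambda}{c}\|X\|_1<f(D,X),$$
so any candidate minimizer with $\|X\|_1>0$ can always be improved by inflating $D$ and shrinking $X$; this is exactly why a size constraint on $D$ is needed.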
The final optimization problem is
$$\min_{D,X}\ \|DX-A\|_F^2+\lambda\|X\|_1 \quad \mathrm{s.t.}\quad \|D\|_F{\le}1 \tag{3.9.1}$$
With this constraint, the scaling trick $D\mapsto cD$, $X\mapsto\frac{1}{c}X$ can no longer be applied indefinitely: $D$ is prevented from becoming too large, and therefore $X$ is not driven to be arbitrarily small.
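For intuition about how (3.9.1) might be attacked, here is a rough alternating-minimization sketch: with $D$ fixed, take ISTA (proximal gradient) steps on $X$; with $X$ fixed, take a projected gradient step on $D$ and project back onto the ball $\|D\|_F\le 1$. The scheme and all step sizes are my own illustration, not the book's algorithm:

```python
import numpy as np

def soft_threshold(Z, tau):
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def dictionary_learning(A, k, lam=0.1, n_outer=50, n_inner=20):
    """Alternating minimization sketch for
       min_{D,X} ||D X - A||_F^2 + lam*||X||_1   s.t.  ||D||_F <= 1."""
    m, n = A.shape
    rng = np.random.default_rng(0)
    D = rng.normal(size=(m, k))
    D /= max(np.linalg.norm(D), 1.0)                   # project onto ||D||_F <= 1
    X = np.zeros((k, n))
    for _ in range(n_outer):
        # --- sparse coding: ISTA steps on X with D fixed ---
        L = 2.0 * np.linalg.norm(D, 2) ** 2 + 1e-12    # Lipschitz constant of the gradient in X
        for _ in range(n_inner):
            grad = 2.0 * D.T @ (D @ X - A)
            X = soft_threshold(X - grad / L, lam / L)
        # --- dictionary update: projected gradient step on D with X fixed ---
        gradD = 2.0 * (D @ X - A) @ X.T
        LD = 2.0 * np.linalg.norm(X, 2) ** 2 + 1e-12
        D = D - gradD / LD
        nrm = np.linalg.norm(D)
        if nrm > 1.0:
            D /= nrm                                   # project back onto ||D||_F <= 1
    return D, X

# toy usage: 200 noisy samples in R^20 generated from sparse combinations of 8 atoms
rng = np.random.default_rng(1)
D_true = rng.normal(size=(20, 8))
X_true = soft_threshold(rng.normal(size=(8, 200)), 1.0)    # sparse coefficients
A = D_true @ X_true + 0.01 * rng.normal(size=(20, 200))
D, X = dictionary_learning(A, k=8)
print(np.mean(np.abs(X) < 1e-8))   # fraction of exactly-zero coefficients
```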


Origin blog.csdn.net/abc1234564546/article/details/132555186