Machine Learning and High-Dimensional Information Retrieval - Note 5 - (Deep) Feedforward Neural Networks and related examples based on CVXOPT

Note 5 - (Deep) Feedforward Neural Networks and related examples

5.1 Definition and motivation of FNN

Roughly speaking, feedforward neural networks (FNNs) are a special class of functions that are very powerful at minimizing an expected loss, at the cost of training a large number of parameters. More precisely, consider an input random variable $\mathcal{X} \in \mathbb{R}^{p}$ and a function class $\mathcal{F}$. We want to find a function $f \in \mathcal{F}$ such that a certain loss function $L$ has the smallest expected value. For example, consider the simple loss function $L(f(\mathcal{X}))=\|f(\mathcal{X})-\mathcal{X}\|_{2}^{2}$, whose purpose is to reconstruct $\mathcal{X}$; minimizing it trains an autoencoder on samples of $\mathcal{X}$. Here, $f$ is the composition of an encoder function (mapping to a low-dimensional space) and a decoder function (mapping back to the original data space).

Another example, from supervised learning, is regression. Given a joint distribution of input and output variables $(\mathcal{X}, \mathcal{Y})$, the goal is to find the $f \in \mathcal{F}$ for which the expectation of $L(f(\mathcal{X}), \mathcal{Y})=\|f(\mathcal{X})-\mathcal{Y}\|_{2}^{2}$ is minimal. We discuss multi-class classification later in this note.

If all functions in $\mathcal{F}$ can be described by a set of parameters, say $\Theta \in \mathbb{R}^{N}$, and if we are given $n$ training samples $\mathbf{x}_{1}, \ldots, \mathbf{x}_{n}$, then these learning problems lead to the minimization problem

$$\hat{\Theta}=\arg \min _{\Theta \in \mathbb{R}^{N}} \frac{1}{n} \sum_{i} L\left(f_{\Theta}\left(\mathbf{x}_{i}\right)\right) . \tag{5.1}$$

A very important class of functions in many applications are the so-called feedforward neural networks (FNNs). A feedforward neural network is a concatenation of linear functions$^{1}$

$$\varphi_{\mathbf{W}}: \mathbb{R}^{p} \rightarrow \mathbb{R}^{m}, \quad \mathbf{h} \mapsto \mathbf{W h} \tag{5.2}$$

$^{1}$ This also includes affine functions, since we can simply append an extra component 1 to the input vector.

Each linear function is followed by a so-called activation function, which operates componentwise on a vector. An example of an activation function is the Rectified Linear Unit (ReLU)
$$\sigma(t):=\max \{0, t\}, \tag{5.3}$$

and there are several other activation functions, see Wikipedia. With a slight abuse of notation, we also denote the vector-valued activation function by $\sigma$:
$$\sigma: \mathbb{R}^{m} \rightarrow \mathbb{R}^{m}, \quad \mathbf{x} \mapsto\left[\begin{array}{c} \sigma\left(x_{1}\right) \\ \vdots \\ \sigma\left(x_{m}\right) \end{array}\right] \tag{5.4}$$

A feedforward neural network is a function
$$f: \mathbb{R}^{p} \rightarrow \mathbb{R}^{o}, \quad \mathbf{x} \mapsto \sigma_{l} \circ \varphi_{\mathbf{W}_{l}} \circ \cdots \circ \sigma_{1} \circ \varphi_{\mathbf{W}_{1}}(\mathbf{x}), \tag{5.5}$$

where, as usual, $\circ$ denotes the composition of functions. Different function classes $\mathcal{F}$ are defined by different activation functions, the number of layers $l$, and the sizes of the matrices $\mathbf{W}_{i} \in \mathbb{R}^{n_{i} \times m_{i}}$. If the number of layers exceeds three, we usually call such an FNN a deep feedforward neural network. Note that the output dimension $o$ is determined by the loss function, because the output of the FNN is used as the input of the loss.
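
As a minimal sketch of the definition above, the following NumPy code composes linear maps with ReLU activations. The function names `relu` and `fnn_forward`, the layer sizes, and the random weights are illustrative assumptions and not part of the original note.

```python
import numpy as np

def relu(t):
    """Componentwise ReLU: max{0, t}."""
    return np.maximum(0.0, t)

def fnn_forward(weights, x, activation=relu):
    """Evaluate sigma_l(W_l ... sigma_1(W_1 x)) for a list [W_1, ..., W_l]."""
    h = x
    for W in weights:
        h = activation(W @ h)  # linear map followed by the activation
    return h

rng = np.random.default_rng(0)
# Hypothetical layer sizes: p = 4 inputs, one hidden layer of width 5, o = 3 outputs.
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
x = rng.standard_normal(4)
out = fnn_forward(weights, x)
print(out.shape)  # (3,)
```

Using the same activation in every layer keeps the sketch short; in general each layer may have its own activation function.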

5.2 Training FNNs

Training FNNs in practice is an art in itself, and there are many tricks and regularization techniques that determine the failure or success of learning a powerful FNN. In this section, we focus on the basic method for finding the best FNN for the general problem (5.1), that is, finding the best weights $\hat{\Theta}=\left(\mathbf{W}_{1}, \ldots, \mathbf{W}_{l}\right)$ within a given FNN class. It is in fact a gradient descent method that iteratively updates the weights, known in the literature as backpropagation. Below we describe how it works.

Here, the most important tools, familiar from undergraduate mathematics courses, are the chain rule and the Jacobian matrix. To recap: let $g: \mathbb{R}^{k} \rightarrow \mathbb{R}^{l}$ and $h: \mathbb{R}^{l} \rightarrow \mathbb{R}^{m}$ be two functions, with $g$ differentiable at $\mathbf{x}$ and $h$ differentiable at $\mathbf{y}=g(\mathbf{x})$, and with Jacobian matrices $\mathbf{J}_{g}(\mathbf{x})$ and $\mathbf{J}_{h}(\mathbf{y})$. Then the composition $h \circ g: \mathbb{R}^{k} \rightarrow \mathbb{R}^{m}$ is differentiable at $\mathbf{x}$, with Jacobian matrix $\mathbf{J}_{h \circ g}(\mathbf{x})=\mathbf{J}_{h}(g(\mathbf{x})) \cdot \mathbf{J}_{g}(\mathbf{x})$.
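
The chain rule can be checked numerically. The sketch below approximates Jacobians by central finite differences and compares both sides of the identity; the helper `numerical_jacobian` and the example functions `g` and `h` are hypothetical choices made only for this illustration.

```python
import numpy as np

def numerical_jacobian(func, x, eps=1e-6):
    """Central finite-difference approximation of the Jacobian of func at x."""
    y = func(x)
    J = np.zeros((y.size, x.size))
    for k in range(x.size):
        step = np.zeros_like(x)
        step[k] = eps
        J[:, k] = (func(x + step) - func(x - step)) / (2 * eps)
    return J

# Hypothetical example functions g: R^2 -> R^3 and h: R^3 -> R^2.
def g(x):
    return np.array([x[0] * x[1], np.sin(x[0]), x[1] ** 2])

def h(y):
    return np.array([y[0] + y[1] * y[2], np.exp(y[0])])

x = np.array([0.3, -1.2])
J_composed = numerical_jacobian(lambda v: h(g(v)), x)               # Jacobian of h o g at x
J_product = numerical_jacobian(h, g(x)) @ numerical_jacobian(g, x)  # J_h(g(x)) . J_g(x)
print(np.allclose(J_composed, J_product, atol=1e-5))
```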

Examples

  • (Linearity of the derivative) If, in the setting above, $h$ is a linear transformation given by multiplication with a matrix $\mathbf{W}$, then

    $$\mathbf{J}_{\mathbf{W} \cdot g}(\mathbf{x})=\mathbf{W} \cdot \mathbf{J}_{g}(\mathbf{x})$$

  • Since the $i$-th output only depends on $x_{i}$, the Jacobian matrix of $\sigma$ from (5.4)
    is a square diagonal matrix of the form
    $$\mathbf{J}_{\sigma}(\mathbf{x})=\left[\begin{array}{lll} \sigma^{\prime}\left(x_{1}\right) & & \\ & \ddots & \\ & & \sigma^{\prime}\left(x_{m}\right) \end{array}\right]$$

  • To define the Jacobian matrix of the function that multiplies a matrix by a vector $\mathbf{x}$ on the right,
    $$\operatorname{mult}(\mathbf{x}): \mathbb{R}^{m \times n} \rightarrow \mathbb{R}^{m}, \quad \mathbf{W} \mapsto \mathbf{W} \mathbf{x},$$
    we first need to embed the $mn$ variables (here given in matrix structure) into $\mathbb{R}^{mn}$. This can be achieved in various ways, but here we choose to embed row by row, i.e.
    $$\pi: \mathbf{W}=\left[\begin{array}{c} \mathbf{w}_{1}^{\top} \\ \vdots \\ \mathbf{w}_{m}^{\top} \end{array}\right] \mapsto\left[\begin{array}{c} \mathbf{w}_{1} \\ \vdots \\ \mathbf{w}_{m} \end{array}\right] \in \mathbb{R}^{mn} \tag{5.9}$$

Note: $\pi$ maps an $m \times n$ matrix to a one-dimensional vector of length $m \cdot n$.

Then the Jacobian matrix takes the neatly structured form

$$\mathbf{J}_{\operatorname{mult}(\mathbf{x})}=\left[\begin{array}{ccc} \mathbf{x}^{\top} & & \\ & \ddots & \\ & & \mathbf{x}^{\top} \end{array}\right] \in \mathbb{R}^{m \times mn}.$$

Note that since $\operatorname{mult}(\mathbf{x})$ is linear, its Jacobian matrix does not depend on $\mathbf{W}$.
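
The block-diagonal form can be verified numerically. In this sketch the sizes `m` and `n` and the random data are arbitrary assumptions. NumPy's default reshape stacks rows, so it plays the role of the row-by-row embedding, and the Jacobian is built as a Kronecker product.

```python
import numpy as np

m, n = 3, 4
rng = np.random.default_rng(1)
W = rng.standard_normal((m, n))
x = rng.standard_normal(n)

# Row-by-row embedding pi: NumPy's default (C-order) reshape stacks the rows.
pi_W = W.reshape(m * n)

# Block-diagonal Jacobian with x^T repeated m times, built as a Kronecker product.
J_mult = np.kron(np.eye(m), x.reshape(1, n))

print(J_mult.shape)  # (3, 12)
print(np.allclose(J_mult @ pi_W, W @ x))
```

Because the map is linear in the entries of the matrix, applying the Jacobian to the embedded matrix reproduces the matrix-vector product itself.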

In order to compute the gradient of the cost function (5.1) with respect to the weights $\Theta=\left(\mathbf{W}_{l}, \ldots, \mathbf{W}_{1}\right)$, we note that it is an average, over the input data $\mathbf{x}$, of gradients of the function
$$F\left(\mathbf{W}_{l}, \ldots, \mathbf{W}_{1}\right):=L \circ \sigma_{l} \circ \varphi_{\mathbf{W}_{l}} \circ \cdots \circ \sigma_{1} \circ \varphi_{\mathbf{W}_{1}}(\mathbf{x}) .$$

Therefore, it suffices to compute the gradient of $F$ for a single input and then average over all training data $\mathbf{x}_{i}$. For convenience, we denote by

$$\mathbf{h}_{j}:=\sigma_{j} \circ \varphi_{\mathbf{W}_{j}} \circ \cdots \circ \sigma_{1} \circ \varphi_{\mathbf{W}_{1}}(\mathbf{x})$$

the output of the FNN after the $j$-th layer. For $0<j<l$, $\mathbf{h}_{j}$ is called the $j$-th hidden layer of the FNN, $\mathbf{h}_{l}$ is called the output layer, and $\mathbf{h}_{0}:=\mathbf{x}$ is the input of the FNN.

We denote by
$$\frac{\partial}{\partial \mathbf{W}_{j}} F \in \mathbb{R}^{1 \times m_{j} n_{j}}$$
the Jacobian matrix$^{2}$ of the function

$$\mathbf{W}_{j} \mapsto F\left(\mathbf{W}_{l}, \ldots, \mathbf{W}_{j}, \ldots, \mathbf{W}_{1}\right). \tag{5.13}$$

$^{2}$ Since $F$ is real-valued, this is also the transpose of the gradient.

Using the chain rule and the examples above, the derivatives with respect to the different weight matrices $\mathbf{W}_{j}$ are given by the following formulas:
$$\begin{aligned} \frac{\partial}{\partial \mathbf{W}_{l}} F &=\mathbf{J}_{L}\left(\mathbf{h}_{l}\right) \cdot \mathbf{J}_{\sigma_{l}}\left(\mathbf{W}_{l} \mathbf{h}_{l-1}\right) \cdot \mathbf{J}_{\operatorname{mult}\left(\mathbf{h}_{l-1}\right)} \\ \frac{\partial}{\partial \mathbf{W}_{l-1}} F &=\mathbf{J}_{L}\left(\mathbf{h}_{l}\right) \cdot \mathbf{J}_{\sigma_{l}}\left(\mathbf{W}_{l} \mathbf{h}_{l-1}\right) \cdot \mathbf{W}_{l} \cdot \mathbf{J}_{\sigma_{l-1}}\left(\mathbf{W}_{l-1} \mathbf{h}_{l-2}\right) \cdot \mathbf{J}_{\operatorname{mult}\left(\mathbf{h}_{l-2}\right)} \\ &\;\;\vdots \\ \frac{\partial}{\partial \mathbf{W}_{j}} F &=\mathbf{J}_{L}\left(\mathbf{h}_{l}\right) \cdot \mathbf{J}_{\sigma_{l}}\left(\mathbf{W}_{l} \mathbf{h}_{l-1}\right) \cdot \mathbf{W}_{l} \cdot \mathbf{J}_{\sigma_{l-1}}\left(\mathbf{W}_{l-1} \mathbf{h}_{l-2}\right) \cdot \mathbf{W}_{l-1} \cdots \mathbf{W}_{j+1} \cdot \mathbf{J}_{\sigma_{j}}\left(\mathbf{W}_{j} \mathbf{h}_{j-1}\right) \cdot \mathbf{J}_{\operatorname{mult}\left(\mathbf{h}_{j-1}\right)} . \end{aligned}$$

In practice, the initial weights are usually drawn randomly from a normal distribution and then updated individually using the above gradients and a step size $\alpha>0$. Algorithms and methods for selecting step sizes are beyond the scope of this lecture. Note that we must "reshape" the gradient (i.e. the transpose of $\frac{\partial}{\partial \mathbf{W}_{j}} F$) into matrix form by inverting the embedding $\pi_{j}$ of the matrix from (5.9). We finally obtain the update rule
$$\mathbf{W}_{j} \leftarrow \mathbf{W}_{j}-\alpha \pi_{j}^{-1}\left(\left(\frac{\partial}{\partial \mathbf{W}_{j}} F\right)^{\top}\right) . \tag{5.14}$$
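
The derivative formulas and the update rule above can be sketched in NumPy as follows. Two choices here are assumptions made for illustration and are not taken from the note: the activation is `tanh` (smooth, so a finite-difference check is reliable) and the loss is the squared error. The names `forward`, `loss`, and `backprop` are hypothetical. The row vector `delta` accumulates the product of Jacobians from left to right, and multiplying it by the block-diagonal Jacobian of mult and reshaping row by row gives an outer product.

```python
import numpy as np

def forward(weights, x):
    """Return the pre-activations z_j = W_j h_{j-1} and the layer outputs h_0, ..., h_l."""
    hs, zs = [x], []
    for W in weights:
        zs.append(W @ hs[-1])
        hs.append(np.tanh(zs[-1]))  # assumption: tanh as a smooth activation
    return zs, hs

def loss(weights, x, y):
    """Assumed loss for this sketch: squared error L(h_l) = ||h_l - y||^2."""
    return np.sum((forward(weights, x)[1][-1] - y) ** 2)

def backprop(weights, x, y):
    """Gradients of F with respect to every W_j, already reshaped to matrix form."""
    zs, hs = forward(weights, x)
    delta = 2.0 * (hs[-1] - y)                       # J_L(h_l) as a row vector
    grads = [None] * len(weights)
    for j in reversed(range(len(weights))):
        delta = delta * (1.0 - np.tanh(zs[j]) ** 2)  # times the diagonal J_sigma
        grads[j] = np.outer(delta, hs[j])            # delta times J_mult, reshaped by pi^{-1}
        delta = delta @ weights[j]                   # times W_j, moving one layer down
    return grads

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((3, 5))]
x, y = rng.standard_normal(4), rng.standard_normal(3)
grads = backprop(weights, x, y)

# Finite-difference check of one entry of the gradient with respect to W_1.
eps = 1e-6
W_plus = [W.copy() for W in weights]
W_minus = [W.copy() for W in weights]
W_plus[0][1, 2] += eps
W_minus[0][1, 2] -= eps
numeric = (loss(W_plus, x, y) - loss(W_minus, x, y)) / (2 * eps)
print(np.isclose(numeric, grads[0][1, 2], atol=1e-6))

# One gradient step W_j <- W_j - alpha * gradient, with an arbitrarily chosen step size.
alpha = 0.01
updated = [W - alpha * G for W, G in zip(weights, grads)]
```

For a full training set, the per-sample gradients would be averaged before the update, as described above.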

5.3 Multi-class classification using FNNs

The problem considered in multi-class classification is to assign an input $\mathcal{X} \in \mathbb{R}^{p}$ to one of several (say $C$) classes. We model the problem by a pair of random variables $(\mathcal{X}, \mathcal{Y})$, where $\mathcal{X} \in \mathbb{R}^{p}$ and $\mathcal{Y} \in\left\{\mathbf{e}_{1}, \ldots, \mathbf{e}_{C}\right\}$ is one of the $C$ standard basis vectors of $\mathbb{R}^{C}$. A realization $\mathbf{y}=\mathbf{e}_{c}$ means that the event of belonging to class $c$ has occurred. This modeling of the output variable is also called one-hot encoding.

The idea of using an FNN for multi-class classification is that $\mathcal{X}$ serves as the input of the network, and the output is a vector in $\mathbb{R}^{C}$ that should reflect the probability distribution of the classes given $\mathcal{X}$. More precisely, if $\mathbf{x}$ is a realization of $\mathcal{X}$ and $f$ denotes the FNN, then the $c$-th component of the output vector $\mathbf{h}_{l}:=f(\mathbf{x})$ should be the probability that $\mathcal{Y}$ belongs to class $c$ given $\mathbf{x}$, i.e.
$$\mathbf{h}_{l} \approx\left[\begin{array}{c} \operatorname{Pr}\left(\mathcal{Y}=\mathbf{e}_{1} \mid \mathcal{X}=\mathbf{x}\right) \\ \vdots \\ \operatorname{Pr}\left(\mathcal{Y}=\mathbf{e}_{C} \mid \mathcal{X}=\mathbf{x}\right) \end{array}\right]$$
Therefore, the last activation function $\sigma_{l}$ is chosen so that it outputs a probability distribution over the $C$ classes, which essentially means that the entries of the output vector lie between 0 and 1 and sum to 1. A prominent choice here is the softmax function, given by
$$\sigma: \mathbb{R}^{C} \rightarrow \mathbb{R}^{C}, \quad\left[\begin{array}{c} a_{1} \\ \vdots \\ a_{C} \end{array}\right] \mapsto \frac{1}{\sum_{c} \exp a_{c}}\left[\begin{array}{c} \exp a_{1} \\ \vdots \\ \exp a_{C} \end{array}\right]$$
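
A minimal NumPy sketch of the softmax function follows. Subtracting the maximum entry before exponentiating is a common numerical safeguard that is not part of the note; it does not change the result, because the common factor cancels in the quotient. The input values are arbitrary.

```python
import numpy as np

def softmax(a):
    """Softmax of a vector a; subtracting max(a) leaves the result unchanged
    but avoids overflow in exp."""
    shifted = a - np.max(a)
    e = np.exp(shifted)
    return e / np.sum(e)

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs)        # entries lie strictly between 0 and 1
print(probs.sum())  # sums to 1 up to rounding
```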

Exercise: Compute the Jacobian matrix of the softmax function.

The loss function needed to train the network must measure how well the predicted distribution matches the distribution observed through our training set $\left(\mathbf{x}_{i}, \mathbf{y}_{i}\right), i=1, \ldots, n$. Note that these observed distributions are deterministic, that is, if $\mathbf{x}_{i}$ is labeled with class $c$, then $y_{c}:=\operatorname{Pr}\left(\mathcal{Y}=\mathbf{e}_{c} \mid \mathcal{X}=\mathbf{x}_{i}\right)=1$, and the probability of every other class is 0. In general, a common approach to comparing two probability distributions $\mathbb{P}=\left(p_{1}, \ldots, p_{C}\right)$ and $\mathbb{Q}=\left(q_{1}, \ldots, q_{C}\right)$ over the same set of $C$ events is to use the cross-entropy, see Wikipedia:

$$H(\mathbb{P}, \mathbb{Q})=-\sum_{c=1}^{C} p_{c} \log q_{c} .$$

In our case, where one of the distributions is deterministic, this simplifies to the loss function
$$L\left(f(\mathcal{X}), \mathbf{y}_{c}\right)=-\log f(\mathcal{X})_{c}$$

where $f$ is an FNN with softmax as the output activation and $f(\mathcal{X})_{c}$ denotes its $c$-th entry. For training, as usual, we use the empirical expectation of the loss on the training data, which leads to the optimization problem (with $c_{i}$ denoting the class of $\mathbf{x}_{i}$)
$$\left(\hat{\mathbf{W}}_{l}, \ldots, \hat{\mathbf{W}}_{1}\right)=\arg \min _{\left(\mathbf{W}_{l}, \ldots, \mathbf{W}_{1}\right)}\left\{-\frac{1}{n} \sum_{i} \log f_{\left(\mathbf{W}_{l}, \ldots, \mathbf{W}_{1}\right)}\left(\mathbf{x}_{i}\right)_{c_{i}}\right\}$$
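
The following sketch illustrates how the cross-entropy with a one-hot target reduces to the negative log-probability of the true class, and how the empirical loss is averaged. The prediction values and the helper names `cross_entropy` and `mean_log_loss` are made up for illustration; class indices are 0-based here, unlike in the text.

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_c p_c log q_c, using the convention 0 * log q = 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return -np.sum(p[mask] * np.log(q[mask]))

def mean_log_loss(predictions, labels):
    """Empirical loss -1/n sum_i log f(x_i)_{c_i} for predicted rows and integer labels."""
    predictions = np.asarray(predictions, dtype=float)
    n = predictions.shape[0]
    return -np.mean(np.log(predictions[np.arange(n), labels]))

# Hypothetical softmax outputs for n = 2 samples and C = 3 classes.
predictions = np.array([[0.7, 0.2, 0.1],
                        [0.1, 0.3, 0.6]])
labels = np.array([0, 2])    # class indices c_i (0-based here)
one_hot = np.eye(3)[labels]  # one-hot encoded targets

per_sample = [cross_entropy(y, f) for y, f in zip(one_hot, predictions)]
print(np.isclose(np.mean(per_sample), mean_log_loss(predictions, labels)))
```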

5.4 CVXOPT Example

Machine learning tasks are often posed as optimization problems, e.g., minimizing an error function or maximizing a probability. Ideally, the optimization problem is convex, meaning that any local minimum is also a global minimum of the problem. It is assumed that you already have some basic knowledge of convex optimization. The purpose of this task is to become familiar with CVXOPT, one of the most widely used convex optimization toolboxes. Note: If CVXOPT does not accept a NumPy array, try converting it to double (for example with astype('double')).

  1. Go to cvxopt.org and follow the installation instructions for your distribution. For conda, you need to run
    conda install -c conda-forge cvxopt
  2. Browse the examples section on cvxopt.org for an overview of the capabilities of CVXOPT's different solvers.
  3. Implement a function minsq that takes a NumPy array A of shape (m, n) and a NumPy array y of shape (m,) as arguments and returns a NumPy array x of shape (n,) that solves the following problem.
    $$\min _{\mathbf{x}}\|\mathbf{A x}-\mathbf{y}\|$$
    Test the function with appropriate inputs and compare the results with those obtained using np.linalg.pinv. Experiment by adding white Gaussian noise to y.

5.4.1 Introduction to CVXOPT

5.4.1.1 Creating a matrix

CVXOPT has separate dense and sparse matrix objects. This example illustrates different ways of creating dense and sparse matrices.

Use the matrix() function to create a dense matrix; it can be created from a list (or iterator):

>>> from cvxopt import matrix
>>> A = matrix([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], (2,3))
>>> print(A)
[ 1.00e+00  3.00e+00  5.00e+00]
[ 2.00e+00  4.00e+00  6.00e+00]
>>> A.size
(2, 3)

or from a list of lists, where each inner list represents a column of the matrix:

>>> B = matrix([ [1.0, 2.0], [3.0, 4.0] ])
>>> print(B)
[ 1.00e+00  3.00e+00]
[ 2.00e+00  4.00e+00]

More broadly, inner lists can represent block columns.

>>> print(matrix([ [A] ,[B] ]))
[ 1.00e+00  3.00e+00  5.00e+00  1.00e+00  3.00e+00]
[ 2.00e+00  4.00e+00  6.00e+00  2.00e+00  4.00e+00]

For more information, refer to the CVXOPT documentation.

5.4.1.2 Matrix indexing

There are two ways to index dense and sparse matrices: single-parameter indexing and two-parameter indexing. In two-parameter indexing, the matrix is indexed using two index sets I and J.

>>> from cvxopt import matrix
>>> A = matrix(range(16),(4,4))
>>> print(A)
[  0   4   8  12]
[  1   5   9  13]
[  2   6  10  14]
[  3   7  11  15]
>>> print(A[[0,1,2,3],[0,2]])
[  0   8]
[  1   9]
[  2  10]
[  3  11]

An index set can be an integer, a list, a matrix of integers, or a slice.

>>> print(A[1,:])
[  1   5   9  13]
>>> print(A[::-1,::-1])
[ 15  11   7   3]
[ 14  10   6   2]
[ 13   9   5   1]
[ 12   8   4   0]

In single-parameter indexing, a matrix is indexed in vector form by considering the matrix in column-major order (that is, by stacking the columns from left to right).

>>> A[::5] = -1
>>> print(A)
[ -1   4   8  12]
[  1  -1   9  13]
[  2   6  -1  14]
[  3   7  11  -1]

This is useful for accessing parts of a matrix that are not submatrices, for example, the diagonal part of a matrix.
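
To see why the slice ::5 hits exactly the diagonal of a 4 x 4 matrix, the following sketch reproduces the column-major bookkeeping in NumPy. NumPy is used here only as an assumed stand-in for illustration; `order="F"` makes it fill the array column by column, as CVXOPT does.

```python
import numpy as np

m, n = 4, 4
# Same entries as the CVXOPT matrix above: filled column by column.
A = np.arange(m * n).reshape((m, n), order="F")

# In column-major order, linear index k corresponds to row k % m and column k // m.
linear_indices = list(range(0, m * n, 5))  # the slice ::5
positions = [(k % m, k // m) for k in linear_indices]
print(positions)  # [(0, 0), (1, 1), (2, 2), (3, 3)]

for row, col in positions:
    A[row, col] = -1
print(A)
```

In general, the diagonal of an m x m matrix corresponds to a step of m + 1 in the linear index.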

5.4.1.3 Solving linear programs

Linear programs can be specified via the solvers.lp() function. For example, we can solve the problem

$$\begin{array}{ll} \text{minimize} & 2x_1 + x_2 \\ \text{subject to} & -x_1 + x_2 \leq 1 \\ & x_1 + x_2 \geq 2 \\ & x_2 \geq 0 \\ & x_1 - 2x_2 \leq 4 \end{array}$$

as follows:

>>> from cvxopt import matrix, solvers
>>> A = matrix([ [-1.0, -1.0, 0.0, 1.0], [1.0, -1.0, -1.0, -2.0] ]) # the factors of the bounds
>>> b = matrix([ 1.0, -2.0, 0.0, 4.0 ]) # constants
>>> c = matrix([ 2.0, 1.0 ]) # minimized function
>>> sol=solvers.lp(c,A,b)
     pcost       dcost       gap    pres   dres   k/t
 0:  2.6471e+00 -7.0588e-01  2e+01  8e-01  2e+00  1e+00
 1:  3.0726e+00  2.8437e+00  1e+00  1e-01  2e-01  3e-01
 2:  2.4891e+00  2.4808e+00  1e-01  1e-02  2e-02  5e-02
 3:  2.4999e+00  2.4998e+00  1e-03  1e-04  2e-04  5e-04
 4:  2.5000e+00  2.5000e+00  1e-05  1e-06  2e-06  5e-06
 5:  2.5000e+00  2.5000e+00  1e-07  1e-08  2e-08  5e-08
>>> print(sol['x'])
[ 5.00e-01]
[ 1.50e+00]
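
The solver expects all inequalities in the form A x <= b, which is why the two ">=" constraints appear with flipped signs in the code above. As a plain NumPy sanity check (a sketch that does not call CVXOPT), one can confirm that the reported solution is feasible and reproduces the final objective value:

```python
import numpy as np

# Constraint matrix and right-hand side in the form A x <= b (one row per constraint).
A = np.array([[-1.0,  1.0],   # -x1 + x2 <= 1
              [-1.0, -1.0],   #  x1 + x2 >= 2, multiplied by -1
              [ 0.0, -1.0],   #  x2 >= 0, multiplied by -1
              [ 1.0, -2.0]])  #  x1 - 2 x2 <= 4
b = np.array([1.0, -2.0, 0.0, 4.0])
c = np.array([2.0, 1.0])

x = np.array([0.5, 1.5])          # solution reported by solvers.lp above
print(np.all(A @ x <= b + 1e-9))  # True: x is feasible
print(c @ x)                      # 2.5, matching the final pcost
```

The first two constraints hold with equality at this point, so the optimum lies at a vertex of the feasible region.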

5.4.1.4 Solving quadratic programs

Quadratic programs can be solved via the solvers.qp() function. For example, we can solve the QP

$$\begin{array}{ll} \text{minimize} & 2x_1^2 + x_2^2 + x_1 x_2 + x_1 + x_2 \\ \text{subject to} & x_1 \geq 0 \\ & x_2 \geq 0 \\ & x_1 + x_2 = 1 \end{array}$$

as follows:

>>> from cvxopt import matrix, solvers
>>> Q = 2*matrix([ [2, .5], [.5, 1] ])
>>> p = matrix([1.0, 1.0])
>>> G = matrix([[-1.0,0.0],[0.0,-1.0]])
>>> h = matrix([0.0,0.0])
>>> A = matrix([1.0, 1.0], (1,2))
>>> b = matrix(1.0)
>>> sol=solvers.qp(Q, p, G, h, A, b)
     pcost       dcost       gap    pres   dres
 0:  0.0000e+00  0.0000e+00  3e+00  1e+00  0e+00
 1:  9.9743e-01  1.4372e+00  5e-01  4e-01  3e-16
 2:  1.8062e+00  1.8319e+00  5e-02  4e-02  5e-16
 3:  1.8704e+00  1.8693e+00  6e-03  2e-03  1e-15
 4:  1.8749e+00  1.8748e+00  2e-04  6e-05  6e-16
 5:  1.8750e+00  1.8750e+00  2e-06  6e-07  7e-16
 6:  1.8750e+00  1.8750e+00  2e-08  6e-09  1e-15
>>> print(sol['x'])
[ 2.50e-01]
[ 7.50e-01]
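
The factor 2 in Q comes from the standard form (1/2) x^T Q x + p^T x that solvers.qp minimizes. The following NumPy sketch (no CVXOPT call; the grid search is only an illustrative check) confirms that this standard form matches the polynomial objective and that the reported solution is the best point on the feasible segment:

```python
import numpy as np

Q = 2 * np.array([[2.0, 0.5], [0.5, 1.0]])
p = np.array([1.0, 1.0])

def objective(x):
    """(1/2) x^T Q x + p^T x, the standard form used by solvers.qp."""
    return 0.5 * x @ Q @ x + p @ x

x = np.array([0.25, 0.75])  # solution reported above
x1, x2 = x
print(objective(x))         # 1.875
print(np.isclose(objective(x), 2 * x1**2 + x2**2 + x1 * x2 + x1 + x2))

# Compare against other feasible points on the segment x1 + x2 = 1, x >= 0.
ts = np.linspace(0.0, 1.0, 101)
values = [objective(np.array([t, 1.0 - t])) for t in ts]
print(ts[int(np.argmin(values))])
```

On the segment, the objective reduces to 2t^2 - t + 2 with t = x1, whose minimum is at t = 1/4.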

5.4.2 Example solutions

Implement a function minsq that takes a NumPy array A of shape (m, n) and a NumPy array y of shape (m,) as arguments and returns a NumPy array x of shape (n,) that solves the following problem.
$$\min _{\mathbf{x}}\|\mathbf{A x}-\mathbf{y}\|$$
Test the function with appropriate inputs and compare the results with those obtained using np.linalg.pinv. Experiment by adding white Gaussian noise to y.

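from cvxopt import matrix, solvers
import numpy as np

solvers.options['show_progress'] = False  # suppress the solver's iteration log

def minsq(A, y):
    # ||Ax - y||^2 = x^T A^T A x - 2 y^T A x + y^T y, so minimizing it is the QP
    # min 1/2 x^T P x + q^T x with P = A^T A and q = -A^T y.
    P = matrix(np.dot(A.T, A).astype('double'))
    q = matrix(-np.dot(A.T, y).astype('double'))
    sol = solvers.qp(P, q)
    return np.array(sol['x']).ravel()  # shape (n,), as required by the task

# Test problem: the noise-free solution is x = (1, 1); white Gaussian noise is added to y.
A = np.array([[10, 40], [20, 0], [-30, 40]])
y = np.array([50, 20, 10]) + np.random.randn(3,)

print('A:', A)
print('y:', y)
print('x:', minsq(A, y).squeeze())
print('np.dot(pinv(A),y):', np.dot(np.linalg.pinv(A), y))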

The results are as follows

A: [[ 10  40]
 [ 20   0]
 [-30  40]]
y: [49.88665691 19.21406554  9.38923507]
x: [0.99519146 0.98974651]
np.dot(pinv(A),y): [0.99519146 0.98974651]
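
The agreement between the two results is expected. Expanding the squared norm gives x^T A^T A x - 2 y^T A x + y^T y, so the least-squares problem is an unconstrained QP whose minimizer solves the normal equations A^T A x = A^T y. The NumPy-only sketch below (an assumed illustration that does not call CVXOPT; the random seed is arbitrary) checks this against np.linalg.pinv and np.linalg.lstsq:

```python
import numpy as np

A = np.array([[10.0, 40.0], [20.0, 0.0], [-30.0, 40.0]])
rng = np.random.default_rng(0)
y = np.array([50.0, 20.0, 10.0]) + rng.standard_normal(3)  # noisy right-hand side

P = A.T @ A
q = -A.T @ y

# The QP objective 1/2 x^T P x + q^T x has gradient P x + q, so the minimizer solves P x = -q.
x_normal = np.linalg.solve(P, -q)
x_pinv = np.linalg.pinv(A) @ y
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]

print(np.allclose(x_normal, x_pinv), np.allclose(x_normal, x_lstsq))
```

Solving the normal equations directly works here because A has full column rank; for ill-conditioned A, a least-squares routine such as np.linalg.lstsq is usually the safer choice.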

Origin blog.csdn.net/qq_37266917/article/details/122283464