Unless otherwise specified, the following dimensions are assumed.
$$
X \in \mathbb{R}^{n \times m} \\
A \in \mathbb{R}^{m \times n}
$$
Trace
$$
\operatorname{tr}(A) = \sum_i a_{ii} \\
\operatorname{tr}(A+B) = \operatorname{tr}(A) + \operatorname{tr}(B) \\
\operatorname{tr}(cA) = c \cdot \operatorname{tr}(A) \\
\operatorname{tr}(A) = \operatorname{tr}(A^T) \\
\operatorname{tr}(A^TB) = \operatorname{tr}(AB^T) = \sum_{i,j} (A \circ B)_{ij} \\
\operatorname{tr}(ba^T) = a^Tb
$$
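The last two identities are easy to sanity-check numerically. The following is a minimal sketch using NumPy; the shapes, the seed, and the variable names are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((3, 4))   # same shape as A
a = rng.standard_normal(5)
b = rng.standard_normal(5)

# tr(A^T B) = tr(A B^T) = sum of the elementwise (Hadamard) product
trace_atb = np.trace(A.T @ B)
trace_abt = np.trace(A @ B.T)
hadamard_sum = np.sum(A * B)

# tr(b a^T) = a^T b
outer_trace = np.trace(np.outer(b, a))
inner = a @ b
```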
Derivation
$$
f(x) = \|x-p\|_2 \;\Rightarrow\; \nabla f(x) = \frac{x-p}{\|x-p\|_2}, \quad x \ne p
$$
Vector
Subspace
$$
x, y \in V \;\Rightarrow\; \alpha x + \beta y \in V, \quad \forall \alpha, \beta \in \mathbb{R}
$$
Basis
A set $B$ of vectors of minimum cardinality such that $\operatorname{span}(B) = S$.
Norms
$$
\|x\| \ge 0 \\
\|x+y\| \le \|x\| + \|y\| \\
\|cx\| = |c| \cdot \|x\|
$$
Cauchy-Schwarz and Hölder Inequalities
For $p, q \ge 1$ with $\frac{1}{p} + \frac{1}{q} = 1$ (Hölder's inequality; Cauchy-Schwarz is the special case $p = q = 2$),
$$
|x^Ty| \le \|x\|_p \|y\|_q
$$
Theorems
Orthogonal Theorem
$$
X = S \oplus S^{\perp}, \quad \text{for every subspace } S \subseteq X
$$
Projection Theorem
$$
y^* = \arg\min_{x \in S} \|y - x\| \;\Rightarrow\; y^* \in S, \quad (y - y^*) \perp S
$$
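As a rough numerical illustration of the projection theorem, the sketch below projects a vector onto the span of two columns and checks that the residual is orthogonal to the subspace. The basis matrix `B`, the seed, and the use of `np.linalg.lstsq` are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((5, 2))   # columns span the subspace S
y = rng.standard_normal(5)

# Least-squares coefficients give the closest point y_star = B c in S.
c, *_ = np.linalg.lstsq(B, y, rcond=None)
y_star = B @ c
residual = y - y_star             # should be orthogonal to every column of B
```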
Matrix
Partition
$$
Ab = \sum_j a_j b_j \\
c^TA = \sum_i c_i A_i
$$
Here $a_j$ is the $j$-th column of $A$ and $A_i$ is its $i$-th row.
Range
$$
\mathcal{R}(A) = \{Ax : x \in \mathbb{R}^n\} \\
\mathbb{R}^m = \mathcal{R}(A) \oplus \mathcal{N}(A^T) \\
\mathbb{R}^n = \mathcal{R}(A^T) \oplus \mathcal{N}(A)
$$
Fundamental Theorem of Linear Algebra
$$
\forall w \in \mathbb{R}^m: \quad w = Ax + z, \quad z \in \mathcal{N}(A^T)
$$
Kernel
$$
K = X^TX
$$
Orthogonal
$$
AA^T = A^TA = I, \quad A^T = A^{-1}
$$
Schur Complements
$$
M = \begin{bmatrix} A & X \\ X^T & B \end{bmatrix}, \quad B \succ 0 \\
S = A - XB^{-1}X^T \\
M \succeq 0 \iff S \succeq 0
$$
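The equivalence can be checked on a small example. This is only a sketch: the helper `is_psd`, its tolerance, and the particular matrices are hypothetical choices, and $B$ is taken to be positive definite.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Return True if the symmetric matrix M has no eigenvalue below -tol."""
    return bool(np.all(np.linalg.eigvalsh(M) >= -tol))

A = np.array([[2.0, 0.0], [0.0, 1.0]])
B = np.array([[1.0]])                    # positive definite block
X = np.array([[1.0], [0.5]])

M = np.block([[A, X], [X.T, B]])
S = A - X @ np.linalg.solve(B, X.T)      # Schur complement of B in M
```

For these values both `is_psd(M)` and `is_psd(S)` hold; enlarging the entries of `X` makes both fail together.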
Positive Definiteness
For a symmetric matrix $A \in \mathbb{R}^{n \times n}$, positive semidefinite (PSD) means
$$
x^T A x \ge 0, \ \forall x \in \mathbb{R}^n \iff \lambda_i(A) \ge 0, \ \forall i
$$
The determinant of a PSD matrix is non-negative, and so are its diagonal entries.
The definition of positive definite (PD) replaces each $\ge$ with $>$ and restricts to $x \ne 0$.
If a PSD matrix is invertible, then it is PD.
Matrix Norms
Frobenius Norm
$$
\|A\|_F = \sqrt{\operatorname{tr}(AA^T)} = \sqrt{\sum_{i,j} |A_{ij}|^2} = \sqrt{\sum_{i} \lambda_i(AA^T)}
$$
Operator Norm
$$
\|A\|_p = \max_{\|u\|_p = 1} \|Au\|_p
$$
$\ell_1$ norm ($p = 1$): largest absolute column sum
$\ell_2$ norm ($p = 2$): $\sqrt{\lambda_{\max}(AA^T)} = \sigma_1$
$\ell_\infty$ norm ($p = \infty$): largest absolute row sum
Nuclear Norm
$$
\|A\|_* = \sum_i \sigma_i
$$
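The matrix norms above map directly onto `np.linalg.norm`. The sketch below compares each one with its characterization in terms of singular values or row and column sums; the example matrix is arbitrary.

```python
import numpy as np

A = np.array([[1.0, -2.0, 3.0],
              [4.0,  0.0, -1.0]])
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending

fro = np.linalg.norm(A, "fro")     # Frobenius norm
l1 = np.linalg.norm(A, 1)          # largest absolute column sum
l2 = np.linalg.norm(A, 2)          # largest singular value
linf = np.linalg.norm(A, np.inf)   # largest absolute row sum
nuc = np.linalg.norm(A, "nuc")     # sum of singular values
```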
Matrix Decomposition
Orthogonal-Triangular Decomposition (QR)
$$
A = QR
$$
For a square matrix $A$, $Q$ is orthogonal and $R$ is upper triangular.
For a non-square matrix with $m < n$, we still have $Q^TQ = I_m$. It can be useful to partition both $Q$ and $R$.
Cholesky Decomposition
If $A$ is PD, there is a lower triangular matrix $L$ such that
$$
A = LL^T
$$
Singular Value Decomposition (SVD)
For a non-zero matrix $A$ of rank $r$,
$$
A = U \Sigma V^T \\
Av_i = \sigma_i u_i, \quad A^T u_i = \sigma_i v_i, \quad i = 1, \dots, r \\
\sigma_i^2 = \lambda_i(AA^T) = \lambda_i(A^TA), \quad i = 1, \dots, r \\
\|A\|_F^2 = \operatorname{tr}(A^TA) = \sum_{i=1}^r \sigma_i^2 \\
A_k = \tilde U \tilde \Sigma \tilde V^T \\
\text{variance explained} = \eta_k = \frac{\|A_k\|_F^2}{\|A\|_F^2} = \frac{\sigma_1^2 + \dots + \sigma_k^2}{\sigma_1^2 + \dots + \sigma_r^2}
$$
Here $A_k$ is the truncation that keeps only the $k$ largest singular values and the corresponding singular vectors.
$u_i$ and $v_i$ are eigenvectors of $AA^T$ and $A^TA$ respectively.
A data point $x_j$ (the $j$-th column of $A$) in low dimension is $\tilde x_j = \tilde \Sigma \tilde V^T e_j$; to recover it, use $x_j' = \tilde U \tilde x_j$.
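A minimal NumPy sketch of the truncated SVD follows, assuming the data points are the columns of $A$. The matrix, the seed, and the choice $k = 2$ are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

A_k = U_k @ np.diag(s_k) @ Vt_k          # rank-k approximation
eta_k = np.sum(s_k**2) / np.sum(s**2)    # variance explained

# Low-dimensional coordinates of column j, and its reconstruction.
j = 0
x_tilde = np.diag(s_k) @ Vt_k[:, j]
x_rec = U_k @ x_tilde
```

The reconstruction `x_rec` equals column `j` of `A_k`, not of `A`; the gap between the two is what the discarded singular values account for.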
Spectral Decomposition
For a symmetric matrix $A$,
$$
A = U\Lambda U^T = \sum_i \lambda_i u_i u_i^T
$$
Rayleigh quotient
$$
\lambda_{\min} \le \frac{x^TAx}{x^Tx} \le \lambda_{\max}, \quad x \ne 0
$$
For any matrix $A$, the matrix gain (spectral norm) is
$$
\|A\|_2 = \max_{\|x\|_2 = 1} \|Ax\|_2 = \sqrt{\lambda_{\max}(A^TA)}
$$
Sample Covariance Matrix
$$
C = \frac{1}{m} \sum_{i=1}^m (x_i - \hat x)(x_i - \hat x)^T \\
s_i = w^Tx_i \\
\sigma^2 = \frac{1}{m} \sum_{i=1}^m (w^Tx_i - \hat s)^2 = w^TCw \\
\operatorname{tr}(C) = \frac{1}{m} \|X\|_F^2 \quad \text{(for centered data, } \hat x = 0\text{)}
$$
Since $w^TCw$ is a variance and therefore non-negative for every $w$, the covariance matrix is PSD.
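The relation between the variance of the projected scores and $w^TCw$ can be verified numerically. The sketch below assumes the samples are the columns of `X`; the sizes and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 3, 200
X = rng.standard_normal((n, m))          # columns are the samples x_i
x_hat = X.mean(axis=1, keepdims=True)    # sample mean
Xc = X - x_hat                           # centered data

C = (Xc @ Xc.T) / m                      # sample covariance matrix
w = rng.standard_normal(n)
s = w @ X                                # scores s_i = w^T x_i
var_s = np.mean((s - s.mean())**2)       # variance of the scores
```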
Ellipsoid
Let $P$ be a PD matrix, such that $P = LL^T = U \Lambda U^T$. The standard form is
$$
E = \{x : x^T P^{-1} x \le 1\}
$$
Another form, centered at $\hat x$, can be converted to the standard form.
$$
E = \{\hat x + Lz : \|z\|_2 \le 1\} \\
= \{x : \|L^{-1}(x - \hat x)\|_2 \le 1\} \\
= \{x : (x - \hat x)^T P^{-1} (x - \hat x) \le 1\}
$$
Linear Equation
Systems
$$
Ax = y
$$
With $A \in \mathbb{R}^{m \times n}$, there are $m$ equations and $n$ unknowns.
Overdetermined System: m > n; if A has full column rank, there is one solution or none
Underdetermined System: m < n; if A has full row rank, dim(set of solutions) = n-m
Square System: m = n
We can solve the system using the SVD.
$$
A = U \Sigma V^T \\
x' = V^T x, \quad y' = U^T y
$$
Assume that $\operatorname{rank}(A) = r < m$.
$$
\Sigma x' = y' \Rightarrow y_i' = \begin{cases} \sigma_i x_i', & i = 1, \dots, r \\ 0, & i = r+1, \dots, m \end{cases}
$$
If $y \notin \mathcal{R}(A)$, the system is infeasible.
If $y \in \mathcal{R}(A)$, the system is feasible, with $x_i' = y_i' / \sigma_i$ for $i = 1, \dots, r$; the remaining components $x_i'$, $i = r+1, \dots, n$, are free.
If $A$ has full column rank, then the solution is unique.
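The procedure can be sketched in NumPy as follows. The matrix is a made-up full-column-rank example with $y$ constructed to lie in the range of $A$, and the rank threshold `1e-12` is an assumption chosen for illustration.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [0.0, 1.0]])               # m = 3, n = 2, full column rank
x_true = np.array([1.0, -1.0])
y = A @ x_true                           # y is in R(A) by construction

U, s, Vt = np.linalg.svd(A, full_matrices=True)
r = int(np.sum(s > 1e-12))               # numerical rank
y_p = U.T @ y                            # y' = U^T y

feasible = np.allclose(y_p[r:], 0.0)     # components r+1..m must vanish
x_p = y_p[:r] / s[:r]                    # x'_i = y'_i / sigma_i
x = Vt[:r, :].T @ x_p                    # x = V x', free components set to zero
```

When $r < n$, setting the free components to zero as above picks the minimum-norm solution among all solutions.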
Linear Dynamical System
$$
x_{t+1} = A_t x_t
$$
The system is time-invariant if $A_t = A$. It can be extended to include inputs and an offset, or to an auto-regressive model.
$$
x_{t+1} = A x_t + b
$$
If $x_t$ converges as $t \rightarrow \infty$, the steady-state solution $\bar x$ satisfies $(I - A)\bar x = b$.
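The sketch below compares the fixed point of $(I - A)\bar x = b$ with the result of simply iterating the recursion. The matrix is chosen, as an assumption, to have all eigenvalues inside the unit circle so that the iteration converges.

```python
import numpy as np

A = np.array([[0.5, 0.1],
              [0.0, 0.3]])               # eigenvalues 0.5 and 0.3
b = np.array([1.0, 2.0])

x_bar = np.linalg.solve(np.eye(2) - A, b)   # solves (I - A) x = b

x = np.zeros(2)
for _ in range(200):                        # iterate x_{t+1} = A x_t + b
    x = A @ x + b
```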
Least Square
Plain
$$
\min_x \|Ax - y\|_2^2 \\
x^* = (A^TA)^{-1}A^Ty
$$
The closed form requires $A$ to have full column rank, so that $A^TA$ is invertible.
A weighted variation can be converted to the plain form with $A_w = WA$ and $y_w = Wy$.
$$
\min_x \|W(Ax - y)\|_2^2 = \min_x \|A_w x - y_w\|_2^2
$$
Constrained
$$
\min_x \|Ax - y\|_2^2 : Cx = d
$$
Let $x'$ be any point such that $Cx' = d$. The solution set of the constraint is $x = x' + Bz$, where the columns of $B$ form a basis of $\mathcal{N}(C)$. We can convert the problem to the unconstrained one
$$
\min_z \|A'z - y'\|_2^2 : A' = AB, \quad y' = y - Ax'
$$
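A minimal sketch of this null-space elimination, assuming a single constraint that forces the entries of $x$ to sum to one. The data, the seed, and the way the particular solution and the null-space basis are obtained (`lstsq` and the SVD) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))
y = rng.standard_normal(6)
C = np.array([[1.0, 1.0, 1.0]])          # constraint: entries of x sum to 1
d = np.array([1.0])

# Particular solution x' with C x' = d (the minimum-norm one).
x_p, *_ = np.linalg.lstsq(C, d, rcond=None)

# Basis B of N(C): right singular vectors beyond rank(C).
_, s, Vt = np.linalg.svd(C)
rank = int(np.sum(s > 1e-12))
B = Vt[rank:, :].T

# Unconstrained least squares in z, then map back to x.
z, *_ = np.linalg.lstsq(A @ B, y - A @ x_p, rcond=None)
x = x_p + B @ z
```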
Penalties
Take the squared L2 norm as an example, with $\lambda > 0$.
$$
\min_x \|Ax - y\|_2^2 + \phi(x) \\
= \min_x \|Ax - y\|_2^2 + \lambda \|x\|_2^2
$$
We can construct a new $A$ and $y$.
$$
A' = \begin{bmatrix} A \\ \sqrt{\lambda} I_n \end{bmatrix} \\
y' = \begin{bmatrix} y \\ 0_n \end{bmatrix}
$$
This way we can get the solution as below.
$$
x^* = (A'^TA')^{-1}A'^Ty' = (A^TA + \lambda I)^{-1}A^Ty
$$
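The equivalence between the augmented least-squares problem and the closed form can be checked numerically. This is a sketch with arbitrary data; `lam` stands for $\lambda$ and its value is an assumption.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((8, 3))
y = rng.standard_normal(8)
lam = 0.7
n = A.shape[1]

# Stack the augmented system A' = [A; sqrt(lam) I], y' = [y; 0].
A_aug = np.vstack([A, np.sqrt(lam) * np.eye(n)])
y_aug = np.concatenate([y, np.zeros(n)])
x_aug, *_ = np.linalg.lstsq(A_aug, y_aug, rcond=None)

# Closed form (A^T A + lam I)^{-1} A^T y.
x_closed = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
```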
Convex Optimization
Equality constraints are allowed if they are affine.
Linear Programming (LP)
$$
\min_x \ c^Tx + d : Ax \le b
$$
Quadratic Programming (QP)
$$
\min_x \ \frac{1}{2} x^THx + c^Tx + d : Ax \le b
$$
$H$ is a PSD matrix.
In the unconstrained case, if $H$ is PD, then
$$
f(x) = \frac{1}{2} (x - x^*)^T H (x - x^*) + d - \frac{1}{2} x^{*T} H x^* \\
x^* = -H^{-1}c
$$
If $H$ is PSD and $c \in \mathcal{R}(H)$, then every minimizer $x^*$ satisfies
$$
Hx^* + c = 0
$$
Otherwise, the problem is unbounded below.
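For the unconstrained PD case, the minimizer and the minimum value can be computed directly. The sketch below uses a small hypothetical $H$, $c$ and $d$.

```python
import numpy as np

H = np.array([[2.0, 0.5],
              [0.5, 1.0]])               # positive definite
c = np.array([-1.0, 1.0])
d = 3.0

def f(x):
    """Quadratic objective 0.5 x^T H x + c^T x + d."""
    return 0.5 * x @ H @ x + c @ x + d

x_star = -np.linalg.solve(H, c)          # x* = -H^{-1} c
f_star = d - 0.5 * x_star @ H @ x_star   # minimum value from the completed square
```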
Quadratic Constrained Quadratic Programming (QCQP)
$$
\min_x \ \frac{1}{2} x^T Q_0 x + a_0^T x : x^T Q_i x + a_i^T x \le b_i
$$
Second-Order Cone Programming (SOCP)
$$
\min_x \ c^Tx : \|A_i x + b_i\|_2 \le c_i^T x + d_i
$$
Robust Programming
$$
\min_x \max_{u \in U} f_0(x, u) : f_i(x, u) \le 0, \quad \forall u \in U
$$
Consider a single inequality with an uncertain coefficient vector.
$$
a^Tx \le b, \quad \forall a \in U
$$
Scenario Uncertainty
$U$ is a finite set.
$$
\max_{a \in U} a^Tx \le b
$$
Box Uncertainty
$$
U = \{a : \|a - \hat a\|_{\infty} \le \rho\} = \{\hat a + \rho u : \|u\|_{\infty} \le 1\} \\
\max_{a \in U} a^Tx = \hat a^T x + \rho \|x\|_1
$$
Spherical Uncertainty
$$
U = \{a : \|a - \hat a\|_2 \le \rho\} \\
\max_{a \in U} a^Tx = \hat a^T x + \rho \|x\|_2
$$
Ellipsoidal Uncertainty
$$
U = \{a : (a - \hat a)^T P^{-1} (a - \hat a) \le 1\} \\
\max_{a \in U} a^Tx = \hat a^T x + \|R^Tx\|_2, \quad P = RR^T
$$
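The box and spherical worst-case formulas can be sanity-checked by sampling. In the sketch below, the nominal vector `a_hat`, the point `x`, the radius `rho`, and the sample size are all made-up values; sampling can only confirm that no sampled point exceeds the closed form, while the explicit maximizers show the bound is attained.

```python
import numpy as np

rng = np.random.default_rng(6)
a_hat = np.array([1.0, -2.0, 0.5])
x = np.array([0.3, 1.0, -2.0])
rho = 0.4

# Closed-form worst cases.
box_worst = a_hat @ x + rho * np.linalg.norm(x, 1)
ball_worst = a_hat @ x + rho * np.linalg.norm(x, 2)

# Maximizers that attain them.
a_box = a_hat + rho * np.sign(x)
a_ball = a_hat + rho * x / np.linalg.norm(x, 2)

# Random points in each set never exceed the closed-form value.
u_box = rng.uniform(-1.0, 1.0, size=(1000, 3))
u_ball = rng.standard_normal((1000, 3))
u_ball /= np.maximum(1.0, np.linalg.norm(u_ball, axis=1, keepdims=True))
sampled_box = ((a_hat + rho * u_box) @ x).max()
sampled_ball = ((a_hat + rho * u_ball) @ x).max()
```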
Convexity
A set $C$ is said to be convex if it contains the line segment between any two of its points.
$$
x_1, x_2 \in C, \ \lambda \in [0,1] \Rightarrow \lambda x_1 + (1 - \lambda) x_2 \in C
$$
A function $f$ is convex if its domain is convex and $f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2)$ for all $x_1, x_2 \in \operatorname{dom} f$ and $\lambda \in [0,1]$. By convention, a convex function takes the value $+\infty$ outside its domain.
The epigraph of a function is
$$
\operatorname{epi} f = \{(x, t) : x \in \operatorname{dom} f, \ t \in \mathbb{R}, \ f(x) \le t\}
$$
$f$ is a convex function iff $\operatorname{epi} f$ is a convex set.
Optimality
Consider a convex problem with a differentiable objective
$$
\min_x f_0(x) : Ax = b
$$
With feasible set $\mathcal X = \{x : Ax = b\}$, a point $x^* \in \mathcal X$ is optimal iff
$$
\nabla f_0(x^*)^T (x - x^*) \ge 0, \quad \forall x \in \mathcal X
$$
Sufficiency follows from the first-order convexity condition
$$
f_0(x) \ge f_0(x^*) + \nabla f_0(x^*)^T (x - x^*)
$$
In a convex unconstrained problem with differentiable objective, $x$ is optimal iff $\nabla f_0(x) = 0$. If there is an equality constraint $Ax = b$, then $x$ is optimal iff $Ax = b$ and $\exists v : \nabla f_0(x) + A^Tv = 0$.
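For a quadratic objective $f_0(x) = \frac{1}{2} x^THx + c^Tx$ with $H$ PD, the equality-constrained condition becomes a linear system in $(x, v)$. The sketch below solves it for a small hypothetical instance; the objective and the data are assumptions chosen for illustration.

```python
import numpy as np

H = np.array([[3.0, 1.0], [1.0, 2.0]])   # positive definite, so f_0 is convex
c = np.array([-1.0, 0.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

n, p = H.shape[0], A.shape[0]
# Stack stationarity (H x + A^T v = -c) and feasibility (A x = b).
K = np.block([[H, A.T], [A, np.zeros((p, p))]])
sol = np.linalg.solve(K, np.concatenate([-c, b]))
x_opt, v = sol[:n], sol[n:]
grad = H @ x_opt + c                     # gradient of f_0 at the solution
```

The gradient at `x_opt` is orthogonal to the null space of `A`, which is exactly the statement that it equals $-A^Tv$ for some $v$.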
Hulls
Given a set of points $P = \{x_1, \dots, x_m\}$ in $\mathbb{R}^n$, each hull is the set of all $x$ of the following form.
Linear Hull
$$
x = \sum_{i=1}^m \lambda_i x_i
$$
Affine Hull
$$
x = \sum_{i=1}^m \lambda_i x_i : \sum_{i=1}^m \lambda_i = 1
$$
$\operatorname{aff} P$ is the smallest affine set containing $P$.
Conic Hull
$$
x = \sum_{i=1}^m \lambda_i x_i : \lambda_i \ge 0
$$
Convex Hull
$$
x = \sum_{i=1}^m \lambda_i x_i : \sum_{i=1}^m \lambda_i = 1, \ \lambda_i \ge 0
$$
Preserving Convexity
Intersection Rule
The intersection of convex sets is also a convex set, and this holds even for infinite families of convex sets. The intersection of finitely many halfspaces is called a polyhedron.
Affine Transformation
An affine mapping of a convex set is still convex.
Pointwise Maximization
If $(f_a)_{a \in A}$ is a family of convex functions, and $A$ is a set (not necessarily a convex set), then the pointwise max function is convex.
$$
f(x) = \max_{a \in A} f_a(x)
$$
Partial Minimization
If $g(x, y)$ is jointly convex in $(x, y)$ and $C$ is convex, then $f(x) = \min_{y \in C} g(x, y)$ is convex. This result trivially extends to partial minimization over a subset of the function's arguments.
Composition Function
$$
x \mapsto f(g(x))
$$
If $f$ is convex and nondecreasing, and $g$ is convex, then the composition is convex in $x$.
Constraints
Activeness
If $x^*$ satisfies $f_i(x^*) < 0$, then the $i$-th inequality constraint is inactive (slack) at the optimal solution $x^*$.
Problem Transformations
An optimization problem can be transformed into an equivalent one by:
- monotone transformation (scaling, logarithm, squaring)
- change of variables
- addition of slack variables
- epigraphic reformulation
- replacement of equality constraints with inequality ones
- elimination of inactive constraints (safe feature elimination)
- discovering hidden convexity
Duality
Problem Formulation
Consider an optimization problem in standard form
$$
p^* = \min_x f_0(x) \\
\text{s.t.} \quad f_i(x) \le 0, \quad i = 1, \dots, m \\
h_i(x) = 0, \quad i = 1, \dots, q
$$
Note that the objective function and the constraints are not necessarily convex.
Lagrangian
Vectors $\lambda$ and $v$ are referred to as Lagrange multipliers.
$$
\mathcal L(x, \lambda, v) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^q v_i h_i(x)
$$
The primal can be expressed as
$$
p^* = \min_x \max_{\lambda \ge 0, v} \mathcal L(x, \lambda, v)
$$
Recovering Primal Solutions
Suppose strong duality holds and $(\lambda^*, v^*)$ is dual-optimal. If $\mathcal L(x, \lambda^*, v^*)$ has a unique minimizer, then that minimizer is primal-optimal if it is primal-feasible; if it is not primal-feasible, no primal-optimal solution exists.
Weak Duality
The Minimax Inequality is
$$
p^* = \min_x \max_y F(x, y) \ge \max_y \min_x F(x, y) = d^*
$$
Therefore, weak duality states that
$$
g(\lambda, v) = \min_x \mathcal L(x, \lambda, v) \\
d^* = \max_{\lambda \ge 0, v} g(\lambda, v) \le p^*
$$
Strong Duality
Strong duality holds when $p^* = d^*$.
Sion’s Minimax Theorem
Let $X$ be convex and $Y$ be convex and compact (bounded and closed). If $F(x, y)$ is continuous, convex in $x$ over $X$ for each $y$, and concave in $y$ over $Y$ for each $x$, then
$$
\min_x \max_y F(x, y) = \max_y \min_x F(x, y)
$$
Slater’s Condition
If the problem is convex and strictly feasible, then strong duality holds. Strict feasibility means there exists $x_0 \in \operatorname{relint} D$ that satisfies the affine equality constraints and $f_i(x_0) < 0$ for all $i$.
Karush-Kuhn-Tucker Condition (KKT)
For a convex problem with differentiable functions in which strong duality holds, $x$ and $(\lambda, v)$ are primal-optimal and dual-optimal iff they satisfy the KKT conditions.
Primal feasibility: $f_i(x) \le 0$, $h_i(x) = 0$
Dual feasibility: $\lambda \ge 0$
Complementary slackness: $\lambda_i f_i(x) = 0$
Lagrangian stationarity: $\nabla_x \mathcal L(x, \lambda, v) = \nabla_x f_0(x) + \sum_{i=1}^m \lambda_i \nabla_x f_i(x) + \sum_{i=1}^q v_i \nabla_x h_i(x) = 0$
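A one-dimensional toy problem makes the duality and KKT statements concrete: minimize $x^2$ subject to $1 - x \le 0$. The problem, the hand-derived dual function $g(\lambda) = \lambda - \lambda^2/4$, and the grid used to maximize it are all assumptions of this sketch.

```python
import numpy as np

def f0(x):
    """Objective x^2."""
    return x**2

def f1(x):
    """Constraint function; feasibility means f1(x) <= 0, i.e. x >= 1."""
    return 1.0 - x

def g(lam):
    """Dual function: min over x of x^2 + lam*(1 - x), attained at x = lam/2."""
    return lam - lam**2 / 4

x_star, lam_star = 1.0, 2.0
p_star = f0(x_star)
lams = np.linspace(0.0, 5.0, 501)
d_star = g(lams).max()                   # dual optimum on a grid covering lam = 2

kkt = {
    "primal_feasible": f1(x_star) <= 0,
    "dual_feasible": lam_star >= 0,
    "complementary_slackness": np.isclose(lam_star * f1(x_star), 0.0),
    "stationarity": np.isclose(2 * x_star - lam_star, 0.0),
}
```

Every value of `g` on the grid stays at or below `p_star` (weak duality), and the maximum matches it (strong duality, as Slater's condition holds at, for example, $x_0 = 2$).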
Reference
- Optimization Models in Engineering (EECS 227 A), Laurent El Ghaoui, University of California Berkeley, Fall 2021