If not specified, the following conditions are assumed.
$X \in R^{n*m} \\ A \in R^{m*n}$

Trace

$tr(A) = \sum_i a_{ii} \\ tr(A+B) = tr(A) + tr(B) \\ tr(cA) = c \cdot tr(A) \\ tr(A) = tr(A^T) \\ tr(A^TB) = tr(AB^T) = \sum_{i,j} (A \circ B)_{ij}\\ tr(ba^T) = a^Tb$
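The identity $tr(A^TB) = tr(AB^T) = \sum_{i,j} (A \circ B)_{ij}$ can be checked numerically. The following is a minimal sketch in plain Python; the matrices and the helper names (`transpose`, `matmul`, `trace`) are illustrative assumptions, not part of the notes.

```python
# Two arbitrary 2x3 matrices chosen for illustration.
A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
B = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -3.0]]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(P, Q):
    # Row of P times column of Q.
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def trace(M):
    return sum(M[i][i] for i in range(len(M)))

tr_AtB = trace(matmul(transpose(A), B))   # trace of a 3x3 matrix
tr_ABt = trace(matmul(A, transpose(B)))   # trace of a 2x2 matrix
hadamard_sum = sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))
```

All three quantities agree even though $A^TB$ and $AB^T$ have different sizes.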

Derivation

$f(x) = ||x-p||_2 \Rightarrow \nabla f(x) = \frac {x-p} {||x-p||_2}, x \ne p$
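As a sanity check, the closed-form gradient of the Euclidean distance can be compared with central finite differences. This is a sketch under the assumption $x \ne p$; the step size `h` and the test point are arbitrary choices.

```python
import math

def f(x, p):
    """Euclidean distance ||x - p||_2."""
    return math.sqrt(sum((xi - pi) ** 2 for xi, pi in zip(x, p)))

def grad_f(x, p):
    """Closed-form gradient (x - p) / ||x - p||_2, valid only for x != p."""
    d = f(x, p)
    return [(xi - pi) / d for xi, pi in zip(x, p)]

def numeric_grad(x, p, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = []
    for i in range(len(x)):
        xp = list(x)
        xm = list(x)
        xp[i] += h
        xm[i] -= h
        g.append((f(xp, p) - f(xm, p)) / (2 * h))
    return g

x = [3.0, 4.0]
p = [0.0, 0.0]
analytic = grad_f(x, p)       # the unit vector pointing from p to x
numeric = numeric_grad(x, p)
```

The gradient is always a unit vector, which is why the distance function has slope 1 away from p.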

Vector

Subspace

$x, y \in V; \alpha x + \beta y \in V$

Basis

A basis of a subspace S is a set B of vectors of minimal cardinality such that span(B) = S.

Norms

$||x|| \ge 0 \\ ||x+y|| \le ||x|| + ||y|| \\ ||cx|| = |c| \cdot ||x||$

Cauchy-Schwarz Inequality

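$|x^Ty| \le ||x||_p ||y||_q; \forall p, q \ge 1: \frac 1 p + \frac 1 q = 1$

This general form is Hölder's inequality; Cauchy-Schwarz is the special case p = q = 2.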
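The bound can be spot-checked for a few conjugate pairs (p, q). The sketch below is illustrative only: the vectors and the helper `p_norm` are assumptions, and a numerical check on one example is of course not a proof.

```python
def p_norm(x, p):
    """l_p norm of a vector; p may be float('inf')."""
    if p == float("inf"):
        return max(abs(v) for v in x)
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

x = [1.0, -2.0, 3.0]
y = [4.0, 0.5, 1.0]
inner = abs(sum(a * b for a, b in zip(x, y)))   # |x^T y|

# Each pair satisfies 1/p + 1/q = 1 (with 1/inf read as 0).
pairs = [(2.0, 2.0), (3.0, 1.5), (1.0, float("inf"))]
bounds = [p_norm(x, p) * p_norm(y, q) for p, q in pairs]
holds = [inner <= b for b in bounds]
```

The first pair is the Cauchy-Schwarz case; the other two show that the same inner product is bounded by different norm products.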

Theorems

Orthogonal Theorem

$X = S \oplus S^{\perp}; \forall S \subset X$

Projection Theorem

$y^* = arg min_{x \in S} ||y - x|| \Rightarrow y^* \in S, (y-y^*) \perp S$
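For a one-dimensional subspace the projection has a closed form, which makes the orthogonality of the residual easy to verify. A minimal sketch, assuming $S = span\{s\}$ with an arbitrary vector `s`:

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

s = [1.0, 2.0, 2.0]          # S = span{s}
y = [3.0, 0.0, 3.0]          # point to project

t = dot(s, y) / dot(s, s)    # optimal coefficient along s
y_star = [t * si for si in s]
residual = [yi - pi for yi, pi in zip(y, y_star)]

# Any other point on the line is at least as far from y.
def dist_sq(coef):
    return sum((yi - coef * si) ** 2 for yi, si in zip(y, s))

others = [dist_sq(t + delta) for delta in (-0.5, -0.1, 0.1, 0.5)]
```

Here `dot(residual, s)` vanishes, which is exactly the condition $(y-y^*) \perp S$.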

Matrix

Partition

$Ab = \sum_j a_jb_j \\ c^TA = \sum_i c_iA_i$

Here $a_j$ is the j-th column of A and $A_i$ is its i-th row.

Range

$R(A) = \{Ax: x \in R^n\} \\ R^m = R(A) \oplus N(A^T) \\ R^n = R(A^T) \oplus N(A)$

Fundamental Theorem of Linear Algebra

$x \in R(A)^{\perp} \iff x \in N(A^T)$

Kernel

$K = X^TX$

Orthogonal

$AA^T = A^TA = I, A^T = A^{-1}$

Schur Complements

$M = \begin{bmatrix} A & X \\ X^T & B \\ \end{bmatrix} \\ S = A - XB^{-1}X^T \\ B \succ 0: M \succcurlyeq 0 \iff S \succcurlyeq 0$

Positive Definiteness

For a symmetric square matrix A, PSD means
$x^T A x \ge 0, \forall x \in R^m \iff \lambda_i(A) \ge 0$
The determinant of a PSD matrix is non-negative, and so are its diagonal entries.

The definition of PD replaces every $\ge$ with $>$ and requires $x \ne 0$ .

If a PSD matrix is invertible, then it is PD.

Matrix Norms

Frobenius Norm

$||A||_F = \sqrt{tr(AA^T)} = \sqrt{\sum_{i,j} |A_{ij}|^2} = \sqrt{\sum_{i} \lambda_i(AA^T)}$

Operator Norm

$||A||_p = max_{||u||_p=1} ||Au||_p$

l1 norm: largest absolute column sum

l2 norm: $\sqrt{\lambda_{max}(AA^T)} = \sigma_1$

l_inf norm: largest absolute row sum
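The column-sum and row-sum formulas can be checked against the definition $max_{||u||_p=1} ||Au||_p$ by exhibiting a vector u that attains each bound. The sketch below uses an arbitrary 2x3 matrix; the function names are illustrative.

```python
A = [[1.0, -2.0, 3.0],
     [-4.0, 5.0, -6.0]]

def norm_1(M):
    """Induced l1 norm: largest absolute column sum."""
    return max(sum(abs(v) for v in col) for col in zip(*M))

def norm_inf(M):
    """Induced l_inf norm: largest absolute row sum."""
    return max(sum(abs(v) for v in row) for row in M)

def matvec(M, u):
    return [sum(a * b for a, b in zip(row, u)) for row in M]

# l_inf: a sign vector matching the heaviest row attains the norm.
u = [-1.0, 1.0, -1.0]                        # ||u||_inf = 1
attained_inf = max(abs(v) for v in matvec(A, u))

# l1: the unit vector selecting the heaviest column attains the norm.
e3 = [0.0, 0.0, 1.0]                         # ||e3||_1 = 1
attained_1 = sum(abs(v) for v in matvec(A, e3))
```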

Nuclear Norm

$||A||_* = \sum_i \sigma_i$

Matrix Decomposition

Orthogonal-Triangular Decomposition (QR)

$A = Q R$

For a square matrix A, Q is orthogonal and R is upper triangular.

For a non-square matrix with m < n, we still have $Q^TQ = I_m$ . It can be useful to partition both Q and R.
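For the square case, one way to obtain Q and R is classical Gram-Schmidt orthogonalization of the columns of A. The sketch below assumes the columns are linearly independent and is meant for illustration only; Gram-Schmidt is a swapped-in textbook method (numerical libraries typically use Householder reflections instead).

```python
import math

def qr_gram_schmidt(A):
    """Classical Gram-Schmidt QR of a square matrix with independent columns."""
    n = len(A)
    cols = [list(c) for c in zip(*A)]        # columns of A
    q_cols = []
    R = [[0.0] * n for _ in range(n)]
    for j, a in enumerate(cols):
        v = list(a)
        for i, q in enumerate(q_cols):
            R[i][j] = sum(qk * ak for qk, ak in zip(q, a))   # component along q_i
            v = [vk - R[i][j] * qk for vk, qk in zip(v, q)]  # remove it
        R[j][j] = math.sqrt(sum(vk * vk for vk in v))
        q_cols.append([vk / R[j][j] for vk in v])
    Q = [list(r) for r in zip(*q_cols)]      # back to row-major storage
    return Q, R

A = [[3.0, 1.0],
     [4.0, 2.0]]
Q, R = qr_gram_schmidt(A)
QR = [[sum(Q[i][k] * R[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
QtQ = [[sum(Q[k][i] * Q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
```

`QR` reproduces A and `QtQ` is the identity up to rounding.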

Cholesky Decomposition

If A is PD, there exists a lower triangular matrix L such that
$A = LL^T$
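The factor L can be computed entry by entry, and the computation doubles as a PD test for symmetric input: it fails exactly when a pivot is not positive. A minimal sketch in plain Python, assuming A is symmetric (the function name `cholesky` is illustrative):

```python
import math

def cholesky(A):
    """Return lower-triangular L with A = L L^T.

    Assumes A is symmetric; raises ValueError if a pivot is not positive,
    which indicates that A is not positive definite.
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0.0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

A = [[4.0, 2.0],
     [2.0, 5.0]]
L = cholesky(A)

def is_pd(M):
    try:
        cholesky(M)
        return True
    except ValueError:
        return False
```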

Singular Value Decomposition (SVD)

For non-zero matrix A of rank r,
$A = U \Sigma V^T \\ Av_i = \sigma_i u_i, A^T u_i = \sigma_i v_i, i=1 \sim r \\ \sigma_i^2 = \lambda_i(AA^T) = \lambda_i(A^TA), i=1 \sim r \\ ||A||_F^2 = tr(A^TA) = \sum_{i=1}^r {\sigma_i^2} \\\\ A_k = \tilde U \tilde \Sigma \tilde V^T \\ variance \ explained = \eta_k = \frac {||A_k||_F^2}{||A||_F^2} = \frac {\sigma_1^2 + ... + \sigma_k^2}{\sigma_1^2 + ... + \sigma_r^2}$
$u_i$ and $v_i$ are eigenvectors of $AA^T$ and $A^TA$ respectively. $\tilde U, \tilde \Sigma, \tilde V$ keep only the k largest singular values and the corresponding singular vectors.

$x_j$ in low dimension is $\tilde x_j = \tilde \Sigma \tilde V^T e_j$ ; to recover it, use $x_j' = \tilde U \tilde x_j$ .
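The leading singular triplet and the variance explained by a rank-1 approximation can be estimated without a full SVD. The sketch below uses power iteration on $A^TA$, which is a swapped-in illustrative technique rather than anything prescribed by these notes; the matrix, the starting vector, and the iteration count are assumptions.

```python
import math

A = [[3.0, 0.0],
     [4.0, 5.0]]

def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

At = [list(c) for c in zip(*A)]

# Power iteration on A^T A converges to the top right singular vector v_1
# (assuming the start vector is not orthogonal to it).
v = [1.0, 0.0]
for _ in range(100):
    w = matvec(At, matvec(A, v))
    v = [t / norm2(w) for t in w]

Av = matvec(A, v)
sigma_1 = norm2(Av)                              # ||A v_1||_2 = sigma_1
u = [t / sigma_1 for t in Av]                    # u_1 = A v_1 / sigma_1
frob_sq = sum(a * a for row in A for a in row)   # ||A||_F^2
eta_1 = sigma_1 ** 2 / frob_sq                   # variance explained for k = 1
```

For this matrix $A^TA$ has eigenvalues 45 and 5, so the rank-1 approximation explains 90% of the squared Frobenius norm.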

Spectral Decomposition

A is a square symmetric matrix.
$A = U\Lambda U^T = \sum_i \lambda_i u_i u_i^T$
Rayleigh quotient
$\lambda_{min} \le \frac{x^TAx}{x^Tx} \le \lambda_{max}, x \ne 0$
For any matrix, the matrix gain (spectral norm) is
$||A||_2 = max_{||x||_2=1} ||Ax||_2 = \sqrt{\lambda_{max}(A^TA)}$
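The Rayleigh quotient bounds can be observed by sweeping unit vectors around the circle for a 2x2 symmetric matrix whose eigenvalues are known. This is a sketch on an assumed example matrix with eigenvalues 1 and 3.

```python
import math

A = [[2.0, 1.0],
     [1.0, 2.0]]                 # symmetric, eigenvalues 1 and 3

def rayleigh(A, x):
    """Rayleigh quotient x^T A x / x^T x for x != 0."""
    Ax = [sum(a * b for a, b in zip(row, x)) for row in A]
    return sum(a * b for a, b in zip(x, Ax)) / sum(v * v for v in x)

# Sample directions at one-degree steps.
quotients = [rayleigh(A, [math.cos(t), math.sin(t)])
             for t in (k * math.pi / 180 for k in range(360))]
q_min, q_max = min(quotients), max(quotients)
```

The extremes are attained along the eigenvectors, here the directions at 45 and 135 degrees.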

Sample Covariance Matrix

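$C = \frac 1 m \sum_{i=1}^m (x_i - \hat x)(x_i - \hat x)^T \\ s_i = w^Tx_i \\ \sigma^2 = \frac 1 m \sum_{i=1}^m (w^Tx_i - \hat s)^2 = w^TCw \\ tr(C) = \frac 1 m ||X||_F^2$

Here $\hat x$ and $\hat s$ are the sample means of the $x_i$ and the $s_i$ , and the trace identity assumes the columns of X are centered. Since $w^TCw = \sigma^2 \ge 0$ for every w, the covariance matrix is PSD.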
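The variance of the scores $s_i = w^Tx_i$ can be computed directly or as the quadratic form $w^TCw$. The sketch below compares both on four made-up points in $R^2$, using the 1/m normalisation and squared deviations; the data and the direction `w` are assumptions.

```python
# Four sample points x_i in R^2 (m = 4).
points = [[1.0, 2.0], [3.0, 0.0], [5.0, 4.0], [7.0, 2.0]]
m = len(points)
mean = [sum(p[k] for p in points) / m for k in range(2)]
centered = [[p[k] - mean[k] for k in range(2)] for p in points]

# C = (1/m) * sum_i (x_i - mean)(x_i - mean)^T
C = [[sum(c[a] * c[b] for c in centered) / m for b in range(2)] for a in range(2)]

w = [0.6, 0.8]
scores = [sum(wk * pk for wk, pk in zip(w, p)) for p in points]   # s_i = w^T x_i
s_mean = sum(scores) / m
var_direct = sum((s - s_mean) ** 2 for s in scores) / m
var_quadratic = sum(w[a] * C[a][b] * w[b] for a in range(2) for b in range(2))

# Trace identity on the centered data: tr(C) = (1/m) * ||X_centered||_F^2
trace_C = C[0][0] + C[1][1]
frob_sq_over_m = sum(v * v for c in centered for v in c) / m
```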

Ellipsoid

Let P be a PD matrix, such that $P = LL^T = U \Lambda U^T$ . The standard form is
$\{x: x^T P^{-1} x \le 1\}$
An ellipsoid given as an affine image of the unit ball, centered at $\hat x$ , converts to the standard form as follows.
$\{\hat x+Lz: ||z||_2 \le 1\} \\ = \{x: ||L^{-1}(x-\hat x)||_2 \le 1\} \\ = \{x: (x-\hat x)^TP^{-1}(x-\hat x) \le 1\}$

Linear Equation

Systems

$A x = y$

The number of equations is m and the number of unknowns is n.

Overdetermined System: m > n, one solution or none when A has full column rank

Underdetermined System: m < n, dim(set of solutions) = n-m when A has full row rank

Square System: m = n

We can solve the system using SVD.
$A = U \Sigma V^T \\ x' = V^T x, y' = U^T y$
Assume that rank(A) = r < m.
$\Sigma x' = y' \Rightarrow y_i' = \begin{cases} \sigma_i x_i', & i=1 \sim r \\ 0, & i=r+1 \sim m \\ \end{cases}$
If $y \notin R(A)$ , i.e. $y_i' \ne 0$ for some $i > r$ , the system is not feasible.

If $y \in R(A)$ , the system is feasible, $x_i'= y_i' / \sigma_i, i = 1 \sim r$ , and the remaining components of $x'$ are free.

If A also has full column rank (r = n), then the solution is unique.

Linear Dynamical System

$x_{t+1} = A_t x_t$

The system is time invariant if $A_t = A$ . It can be extended to include inputs and offset, or to an auto-regressive model.
$x_{t+1} = A x_t + b$
The steady-state solution, if it is reached as $t \rightarrow \infty$ , satisfies $x_{t+1} = x_t$ , i.e. $(I - A)x_t = b$ .
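A short simulation shows the iterates approaching the solution of $(I - A)x = b$. This sketch assumes a small A whose eigenvalues (0.5 and 0.2) lie inside the unit circle, so that the iteration converges; the values of A and b are made up.

```python
A = [[0.5, 0.1],
     [0.0, 0.2]]
b = [1.0, 2.0]

def step(x):
    """One update x_{t+1} = A x_t + b."""
    return [sum(a * v for a, v in zip(row, x)) + bi for row, bi in zip(A, b)]

# Iterate from the origin.
x = [0.0, 0.0]
for _ in range(200):
    x = step(x)

# Solve (I - A) x = b by back-substitution (I - A is upper triangular here).
x2 = b[1] / (1.0 - A[1][1])
x1 = (b[0] + A[0][1] * x2) / (1.0 - A[0][0])
steady = [x1, x2]
```

If A had an eigenvalue of magnitude 1 or more, the iterates would generally not settle, even when $(I - A)x = b$ has a solution.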

Least Squares

Plain

$min_x ||Ax-y||_2^2 \\ x^* = (A^TA)^{-1}A^Ty$

The closed form requires A to have full column rank, so that $A^TA$ is invertible.

A weighted variant reduces to the plain form with $A_w = WA$ and $y_w = Wy$ .
$min_x ||W(Ax-y)||_2^2 = min_x ||A_wx-y_w||_2^2$

Constrained

$min_x ||Ax-y||_2^2 : Cx = d$

Define $x^{'}$ s.t. $C x^{'} = d$ . The solution set of the constraint is $\{x' + Bz\}$ , where the columns of B form a basis of $N(C)$ , so that $x - x' \in N(C)$ . We can convert the problem to
$min_z ||A'z-y'||_2^2 : A' = AB, y' = y-Ax'$

Penalties

Take L2 norm as an example.
$min_x ||Ax-y||_2^2 + \phi(x)\\ = min_x ||Ax-y||_2^2 + \lambda||x||_2^2$
We can construct a new A and y.
$A' = \begin{bmatrix} A \\ \sqrt \lambda I_n \\ \end{bmatrix} \\ y' = \begin{bmatrix} y \\ 0_n \\ \end{bmatrix}$
This way we can get the solution as below.
$x^* = (A'^TA')^{-1}A'^Ty' = (A^TA+ \lambda I)^{-1}A^Ty$
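The equivalence between the penalized closed form and plain least squares on the stacked system can be confirmed on a tiny problem. The sketch below is restricted to two unknowns so that a 2x2 Cramer's-rule solver suffices; the data, `lam`, and the helper names are assumptions.

```python
import math

def solve_2x2(M, r):
    """Solve M x = r for a 2x2 matrix M by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(r[0] * M[1][1] - M[0][1] * r[1]) / det,
            (M[0][0] * r[1] - M[1][0] * r[0]) / det]

def normal_equations(A, y):
    """Return (A^T A, A^T y) for a matrix A with two columns."""
    cols = list(zip(*A))
    G = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    r = [sum(a * b for a, b in zip(c, y)) for c in cols]
    return G, r

A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 2.0, 2.0]
lam = 0.5

# Closed form: (A^T A + lam I)^{-1} A^T y
G, r = normal_equations(A, y)
G_reg = [[G[0][0] + lam, G[0][1]], [G[1][0], G[1][1] + lam]]
x_closed = solve_2x2(G_reg, r)

# Stacked form: plain least squares on A' = [A; sqrt(lam) I], y' = [y; 0]
s = math.sqrt(lam)
A_aug = A + [[s, 0.0], [0.0, s]]
y_aug = y + [0.0, 0.0]
G_aug, r_aug = normal_equations(A_aug, y_aug)
x_aug = solve_2x2(G_aug, r_aug)
```

Both routes return the same coefficients, since $A'^TA' = A^TA + \lambda I$ and $A'^Ty' = A^Ty$.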

Convex Optimization

Equality constraints are allowed if they are affine.

Linear Programming (LP)

$min_x \ c^Tx+d:Ax \le b$

Quadratic Programming (QP)

$min_x \ \frac 1 2 x^THx+c^Tx + d: Ax \le b$

H is a PSD matrix.

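If H is PD, then the unconstrained objective can be written as
$\frac 1 2 x^THx+c^Tx + d = \frac 1 2 (x-x^*)^T H (x-x^*) + d - \frac 1 2 x^{*T} H x^* \\ x^* = -H^{-1}c$
If H is PSD, and $c \in R(H)$ , then any unconstrained minimizer satisfies
$Hx^* + c = 0$
Otherwise, the unconstrained problem is unbounded below.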
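For the PD case, the minimizer $x^* = -H^{-1}c$ of the unconstrained quadratic can be checked by confirming that the gradient $Hx^* + c$ vanishes and that nearby points have larger objective values. A minimal sketch on an assumed 2x2 example:

```python
H = [[2.0, 1.0],
     [1.0, 3.0]]          # symmetric positive definite
c = [-1.0, -2.0]
d = 0.5

def f(x):
    """Objective 1/2 x^T H x + c^T x + d."""
    Hx = [sum(h * v for h, v in zip(row, x)) for row in H]
    quad = 0.5 * sum(a * b for a, b in zip(x, Hx))
    return quad + sum(a * b for a, b in zip(c, x)) + d

# x* = -H^{-1} c using the explicit inverse of a 2x2 matrix.
det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
x_star = [-(H[1][1] * c[0] - H[0][1] * c[1]) / det,
          -(-H[1][0] * c[0] + H[0][0] * c[1]) / det]

gradient = [sum(h * v for h, v in zip(row, x_star)) + ci for row, ci in zip(H, c)]
nearby = [f([x_star[0] + dx, x_star[1] + dy])
          for dx in (-0.1, 0.0, 0.1) for dy in (-0.1, 0.0, 0.1)]
```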

Quadratically Constrained Quadratic Programming (QCQP)

$min_x \ \frac 1 2 x^T Q_0 x + a_0^Tx: x^T Q_i x + a_i^T x\le b_i$

Second-Order Cone Programming (SOCP)

$min_x \ c^Tx: ||A_ix+b_i||_2 \le c_i^T x + d_i$

Robust Programming

$min_x max_{u \in U} f_0(x,u): f_i(x, u) \le 0, \forall u \in U$

Consider a single inequality with uncertain coefficient vector.
$a^Tx \le b, a \in U$

Scenario Uncertainty

U is finite.
$max_{a \in U} a^Tx \le b$

Box Uncertainty

$U = \{a: ||a-\hat a||_{\infty} \le \rho \} = \{\hat a + \rho u: ||u||_{\infty} \le 1 \} \\ max_{a \in U} a^Tx = \hat a^T x + \rho||x||_1$
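Because a linear function over a box attains its maximum at a vertex, the closed form $\hat a^T x + \rho||x||_1$ can be compared with a brute-force enumeration of all vertices. This sketch uses made-up values for `a_hat`, `rho`, and `x`, and is only practical in low dimension since the number of vertices is $2^n$.

```python
from itertools import product

a_hat = [1.0, -2.0, 0.5]
rho = 0.3
x = [2.0, 1.0, -4.0]

# Enumerate the 2^n vertices a = a_hat + rho * u with u in {-1, +1}^n.
vertex_max = max(
    sum((ah + rho * u) * xi for ah, u, xi in zip(a_hat, signs, x))
    for signs in product((-1.0, 1.0), repeat=len(x))
)

# Closed form: a_hat^T x + rho * ||x||_1
closed_form = sum(ah * xi for ah, xi in zip(a_hat, x)) + rho * sum(abs(xi) for xi in x)

# The robust constraint max_a a^T x <= b is a single convex constraint in x.
b = 0.5
robustly_feasible = closed_form <= b
```

The worst-case vertex takes $u_i = sign(x_i)$, which is where the l1 norm comes from.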

Spherical Uncertainty

$U = \{a: ||a-\hat a||_{2} \le \rho \} \\ max_{a \in U} a^Tx = \hat a^T x + \rho||x||_2$

Ellipsoidal Uncertainty

$U = \{a: (a-\hat a)^T P^{-1} (a-\hat a) \le 1 \} \\ max_{a \in U} a^Tx = \hat a^T x + ||R^Tx||_2, P = RR^T$

Convexity

A subset is said to be convex if it contains the line segment between any two points in it.
$x_1, x_2 \in C, \lambda \in [0,1] \Rightarrow \lambda x_1 + (1- \lambda) x_2 \in C$
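A function f is convex if its domain is convex and $f(\lambda x_1 + (1-\lambda) x_2) \le \lambda f(x_1) + (1-\lambda) f(x_2)$ for all $x_1, x_2 \in dom f, \lambda \in [0,1]$ . By convention, a convex function takes the value $+\infty$ outside its domain.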

The epigraph of a function is
$epi f = \{(x,t), x \in dom f, t \in R: f(x) \le t \}$
f is a convex function iff epi f is a convex set.

Optimality

Consider a problem with a convex, differentiable objective $f_0$
$min_x f_0(x): Ax=b$
with feasible set $\mathcal X = \{x: Ax=b\}$ . A feasible $x^*$ is optimal iff
$\nabla f_0(x^*)^T(x-x^*) \ge 0, \forall x \in \mathcal X$
Sufficiency follows from the first-order convexity condition
$f_0(x) \ge f_0(x^*) + \nabla f_0(x^*)^T(x-x^*)$
In a convex unconstrained problem with differentiable objective, x is optimal iff $\nabla f_0(x) = 0$ . If there is an equality constraint of $A x = b$ , then a feasible x is optimal iff $\exists v: \nabla f_0(x) + A^Tv = 0$ .

Hulls

Given a set of points $P = \{x_1, ..., x_m\}$ in $R^n$ , each hull is the set of all combinations of the form shown.

Linear Hull

$\sum_{i=1}^m \lambda_ix_i$

Affine Hull

$\sum_{i=1}^m \lambda_ix_i: \sum_{i=1}^m \lambda_i=1$

aff P is the smallest affine set containing $P$ .

Conic Hull

$\sum_{i=1}^m \lambda_ix_i: \lambda_i \ge 0$

Convex Hull

$\sum_{i=1}^m \lambda_ix_i: \sum_{i=1}^m \lambda_i=1, \lambda_i \ge 0$

Preserving Convexity

Intersection Rule

The intersection of convex sets is a convex set, and this holds even for infinite families of convex sets. The intersection of finitely many halfspaces is called a polyhedron.

Affine Transformation

An affine mapping of a convex set is still convex.

Pointwise Maximization

If $(f_a)_{a \in A}$ is a family of convex functions, and A is a set (not necessarily a convex set), then the pointwise max function is convex.
$max_{a \in A} f_a$

Partial Minimization

If g(x, y) is jointly convex in (x, y) and C is convex, then $min_{y \in C} g(x, y)$ is convex in x. This result trivially extends to partial minimization over a subset of the function’s arguments.

Composition Function

$x \rightarrow f(g(x))$

If f is convex and increasing, and g is convex, then the composition is convex w.r.t. x.

Constraints

Activeness

If $x^*$ satisfies $f_i(x^*) < 0$ , then the i-th inequality constraint is inactive (slack) at the optimal solution $x^*$ .

Problem Transformations

An optimization problem can be transformed into an equivalent one by any of the following:

monotone transformation (scaling, logarithm, squaring)
change of variables
addition of slack variables
epigraphic reformulation
replacement of equality constraints with inequality ones
elimination of inactive constraints (safe feature elimination)
discovering hidden convexity

Duality

Problem Formulation

Consider an optimization problem in standard form
$p^* = min_x f_0(x) \\ s.t.: f_i(x) \le 0, i=1 \sim m \\ h_i(x) = 0, i=1 \sim q$
Note that the objective function and the constraints are not necessarily convex.

Lagrangian

Vectors $\lambda$ and $v$ are referred to as Lagrange multipliers.
$\mathcal L(x, \lambda, v) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{i=1}^q v_i h_i(x)$
The primal can be expressed as
$p^* = min_x max_{\lambda \ge 0, v} \mathcal L(x, \lambda, v)$

Recovering Primal Solutions

If $\mathcal L(x, \lambda^*, v^*)$ has a unique minimizer, then either that minimizer is primal-feasible, in which case it is the primal-optimal solution, or it is not primal-feasible, in which case no primal-optimal solution exists.

Weak Duality

The Minimax Inequality is
$p^* = min_x max_y F(x,y) \ge max_y min_x F(x,y) = d^*$
Therefore, the weak duality indicates that
$g(\lambda, v) = min_x \mathcal L(x, \lambda, v) \\ d^* = max_{\lambda \ge 0, v} g(\lambda, v) \le p^*$
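A one-dimensional toy problem makes the bound concrete: minimize $x^2$ subject to $1 - x \le 0$. The sketch below evaluates the dual function on a grid of multipliers and checks that none of its values exceeds $p^*$. The problem itself and the grid are illustrative assumptions; the closed-form inner minimizer $x = \lambda/2$ comes from setting the derivative of the Lagrangian to zero.

```python
def f0(x):
    return x * x

def g(lam):
    """Dual function: min_x x^2 + lam * (1 - x), attained at x = lam / 2."""
    x = lam / 2.0
    return f0(x) + lam * (1.0 - x)

p_star = f0(1.0)                              # primal optimum is at x = 1

lams = [k / 100.0 for k in range(0, 501)]     # grid of multipliers on [0, 5]
dual_values = [g(lam) for lam in lams]
d_star = max(dual_values)
best_lam = lams[dual_values.index(d_star)]
```

Every dual value is a lower bound on $p^*$. In this example the best bound is tight, which is expected because the problem is convex and strictly feasible.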

Strong Duality

The strong duality is achieved when $p^* = d^*$ .

Sion’s Minimax Theorem

Let X be convex and Y be a convex compact set (bounded and closed). If F(x,y) is continuous, convex in x over X and concave in y over Y, then
$min_x max_y F(x,y) = max_y min_x F(x,y)$

Slater’s Condition

If the problem is convex and strictly feasible, then strong duality holds. Strict feasibility means there exists $x_0 \in relint D$ such that $f_i(x_0) < 0$ for all i, with the equality constraints satisfied.

Karush-Kuhn-Tucker Condition (KKT)

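For a convex problem with differentiable functions in which strong duality holds, a primal-dual pair $(x, \lambda, v)$ is optimal iff it satisfies the KKT conditions.

Primal feasibility: $f_i(x) \le 0, h_i(x) = 0$

Dual feasibility: $\lambda \ge 0$

Complementary slackness: $\lambda_i f_i(x) = 0$

Lagrangian stationarity: $\nabla_x \mathcal L(x, \lambda, v) = \nabla _x f_0(x) + \sum_{i=1}^m \lambda_i \nabla _x f_i(x) + \sum_{i=1}^q v_i \nabla _x h_i(x) = 0$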
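The four conditions can be checked mechanically for a candidate pair. The sketch below uses an assumed toy problem, minimize $(x_1-2)^2 + (x_2-1)^2$ subject to $x_1 + x_2 - 2 \le 0$, whose solution is the projection of (2, 1) onto the halfspace; the candidate values and tolerance are illustrative.

```python
x = [1.5, 0.5]        # candidate primal point
lam = 1.0             # candidate multiplier for the single inequality
tol = 1e-12

f1 = x[0] + x[1] - 2.0                               # constraint value f_1(x)
grad_f0 = [2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)]   # gradient of the objective
grad_f1 = [1.0, 1.0]                                 # gradient of the constraint
stationarity = [g0 + lam * g1 for g0, g1 in zip(grad_f0, grad_f1)]

kkt = {
    "primal_feasible": f1 <= tol,
    "dual_feasible": lam >= 0.0,
    "complementary_slackness": abs(lam * f1) <= tol,
    "stationary": all(abs(s) <= tol for s in stationarity),
}
```

All four entries are true here. The constraint is active ($f_1(x) = 0$), which is why a strictly positive multiplier is compatible with complementary slackness.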

Reference

Optimization Models in Engineering (EECS 227 A), Laurent El Ghaoui, University of California Berkeley, Fall 2021

Practical Linear Algebra and Convex Optimization (实用线性代数和凸优化)