normal distribution
$f(x)=\frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
μ: the mean of a random variable that follows a normal distribution
σ^2: the variance of this random variable
The normal distribution is denoted as N(μ, σ²)
Standard normal distribution: μ=0, σ=1
$f(x)=\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$
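As a quick sanity check, the density formula can be coded directly (a minimal sketch using only the standard library; the function name `normal_pdf` is my own):

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (math.sqrt(2.0 * math.pi) * sigma)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The standard normal (mu=0, sigma=1) peaks at x=0 with value 1/sqrt(2*pi).
peak = normal_pdf(0.0)
```

The density is symmetric about its mean, which makes a handy check.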
Gaussian function
One-dimensional:
$f(x)=ae^{-\frac{(x-b)^2}{2c^2}}$
With $a=\frac{1}{\sqrt{2\pi}c}$, $b=\mu$, and $c=\sigma$, this is exactly the normal distribution density.
Two-dimensional:
$f(x,y)=A\cdot\exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_x^2}+\frac{(y-y_0)^2}{2\sigma_y^2}\right)\right)$
Analysis:
Typical parameters when generating such a Gaussian kernel in code are:
- ksize: size of the Gaussian kernel
- sigma: standard deviation of the Gaussian
- center: coordinates of the peak of the Gaussian
- bias: offset of the peak's center point, used to produce a truncated Gaussian
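The parameter list above suggests a routine that generates a (possibly shifted) Gaussian kernel. A minimal numpy sketch under those assumptions — the function name, defaults, and the way `bias` shifts the peak are my own guesses, not the original code:

```python
import numpy as np

def gaussian_kernel(ksize=5, sigma=1.0, center=None, bias=0.0):
    """Sketch of a 2-D Gaussian kernel generator.

    ksize  -- side length of the square kernel
    sigma  -- standard deviation of the Gaussian
    center -- (row, col) of the peak; defaults to the kernel middle
    bias   -- assumed here to shift the peak, giving a truncated Gaussian
    """
    if center is None:
        center = ((ksize - 1) / 2.0, (ksize - 1) / 2.0)
    y0, x0 = center[0] + bias, center[1] + bias
    y, x = np.mgrid[0:ksize, 0:ksize]
    g = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()  # normalize so the weights sum to 1

kernel = gaussian_kernel(5, 1.0)
```

With `bias=0` the kernel is symmetric and peaks at its center.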
Norm
What is a norm?
Distance is a broad concept: any function that is non-negative, symmetric, and satisfies the triangle inequality can be called a distance.
A norm strengthens the concept of distance: in addition to the distance axioms, its definition requires one more operation, homogeneity under scalar multiplication.
Sometimes, for ease of understanding, a norm can be thought of as a distance.
Norms include vector norms and matrix norms
The vector norm characterizes the size of the vector in the vector space
Every vector in a vector space has a size, and the norm is what measures that size. Different norms measure size in different ways, just as the same distance can be measured in meters or in feet;
The matrix norm characterizes the magnitude of the change caused by the matrix
The operation AX = B transforms the vector X into B; the matrix norm measures the size of that transformation.
Here is a brief introduction to the definitions and meanings of the following vector norms
The most commonly used should be: L0 and L1
LP norm
Like the Minkowski distance, the Lp norm is not a single norm but a family of norms, defined as follows:
$L_p=||x||_p=\sqrt[p]{\sum_{i=1}^n |x_i|^p}, \quad x=(x_1,x_2,\dots,x_n)$
As p changes, the norm changes as well. A classic diagram of how the p-norm changes is as follows:
The figure shows how the set of points at distance (norm) 1 from the origin in three-dimensional space changes as p goes from infinity to 0.
Take the common L2 norm (p = 2) as an example: the norm is then the Euclidean distance, and the points in space whose Euclidean distance from the origin is 1 form a sphere.
In fact, for p < 1 the Lp norm does not satisfy the triangle inequality, so it is not a norm in the strict sense. Take p = 0.5 and the two-dimensional vectors (1,4) and (4,1), whose sum is (5,5):
$||(1,4)||_{0.5}+||(4,1)||_{0.5}=(1+\sqrt{4})^2+(\sqrt{4}+1)^2=18 < 20=(\sqrt{5}+\sqrt{5})^2=||(5,5)||_{0.5}$
Therefore, the term "Lp norm" here is only a loose, conceptual usage.
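The family definition and the failure of the triangle inequality for p < 1 can be checked numerically (a small sketch; `lp_norm` is my own helper):

```python
import numpy as np

def lp_norm(x, p):
    """Lp 'norm' of a vector (a true norm only for p >= 1)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

a = np.array([1.0, 4.0])
b = np.array([4.0, 1.0])

# For p = 0.5 the triangle inequality fails: ||a + b|| > ||a|| + ||b||.
lhs = lp_norm(a + b, 0.5)                # ||(5,5)||_0.5 = (sqrt(5)+sqrt(5))^2 = 20
rhs = lp_norm(a, 0.5) + lp_norm(b, 0.5)  # 9 + 9 = 18
```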
L0 norm
When p = 0 we obtain the L0 norm.
As noted above, the L0 norm is not a real norm; it is mainly used to count the number of non-zero elements in a vector.
Substituting p = 0 into the Lp definition above would give:
$||x||_0=\sqrt[0]{\sum_{i=1}^n x_i^0}$
There is a problem here: the zeroth power of a non-zero element is 1, but $0^0$ is undefined and a zeroth root is meaningless, so this form is hard to interpret. Under normal circumstances, therefore, everyone uses:
$||x||_0=\#\{i \mid x_i \neq 0\}$
represents the number of non-zero elements in the vector x.
For the L0 norm, the optimization problem is:
$\min ||x||_0 \quad s.t. \quad Ax=b$
In practical applications, the L0 norm has no convenient mathematical representation, so giving a tractable formulation of the problem above is difficult; it is in fact NP-hard. In practice, L0 optimization is therefore relaxed to optimization under the L1 or L2 norm.
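Under the counting definition above, computing the L0 "norm" of a concrete vector is straightforward (a minimal numpy sketch):

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -2.0, 5.0])

# ||x||_0 = number of non-zero entries of x
l0 = np.count_nonzero(x)  # 3 non-zero entries: 3, -2, 5
```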
L1 norm
The L1 norm is a norm we often see, and its definition is as follows:
$||x||_1=\sum_{i=1}^n |x_i|$
It represents the sum of the absolute values of the elements of the vector x.
The L1 norm has many names, such as the familiar Manhattan distance, minimum absolute error, etc.
Use the L1 norm to measure the difference between two vectors, such as the sum of absolute errors (Sum of Absolute Difference):
$SAD(x_1,x_2)=\sum_{i=1}^n |x_{1i}-x_{2i}|$
For the L1 norm, the optimization problem is as follows:
$\min ||x||_1 \quad s.t. \quad Ax=b$
Because of its nature, the solution of the L1 optimization problem is sparse, so the L1 norm is also called the sparse regularization operator. Sparsity obtained through L1 can be used to remove uninformative features. For example, when classifying users' movie preferences, a user may have 100 features of which only a dozen or so are useful for classification; most features, such as height and weight, may be useless, and the L1 norm can filter them out.
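The L1 norm and the SAD difference measure from above, computed on small vectors (a minimal numpy sketch):

```python
import numpy as np

x1 = np.array([1.0, -2.0, 3.0])
x2 = np.array([2.0, 0.0, 1.0])

l1 = np.sum(np.abs(x1))        # ||x1||_1 = 1 + 2 + 3 = 6
sad = np.sum(np.abs(x1 - x2))  # SAD = |1-2| + |-2-0| + |3-1| = 5
```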
L2 norm
The L2 norm is the most common and most frequently used norm; the familiar Euclidean distance is an L2 norm. Its definition is as follows:
$||x||_2=\sqrt{\sum_{i=1}^n x_i^2}$
It represents the square root of the sum of the squared elements of a vector.
Like the L1 norm, L2 can also measure the difference between two vectors, such as the sum of squared differences (Sum of Squared Difference):
$SSD(x_1,x_2)=\sum_{i=1}^n (x_{1i}-x_{2i})^2$
For the L2 norm, its optimization problem is as follows:
$\min ||x||_2 \quad s.t. \quad Ax=b$
The L2 norm is usually used as a regularization term in the objective function, preventing the model from becoming so complex that it overfits the training set, thereby improving the model's generalization ability.
L∞ norm
When $P=\infty$, we obtain the $L_\infty$ norm. It is mainly used to measure the maximum absolute value of the elements of a vector and, like L0, it is usually written directly as:
$||x||_\infty=\max_i(|x_i|)$
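The L∞ norm is the limit of the Lp norm as p grows, which a quick numerical check illustrates (a small sketch):

```python
import numpy as np

x = np.array([1.0, -7.0, 3.0])

linf = np.max(np.abs(x))  # ||x||_inf = 7

# For large p, the Lp norm is dominated by the largest |x_i|.
lp_big = np.sum(np.abs(x) ** 100) ** (1.0 / 100)
```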
gradient
What is a gradient?
Directional derivative: at a point, the derivative of the function along a given direction (one for every possible direction).
Gradient: the direction along which the directional derivative is largest, i.e., the direction of fastest change.
symbol
The gradient of the function f is written $\nabla f$ or $grad\ f$, where $\nabla$ (nabla) denotes the vector differential operator:
$\nabla=\frac{\partial}{\partial x}\bar i + \frac{\partial}{\partial y}\bar j + \frac{\partial}{\partial z}\bar k$
In Cartesian coordinates the gradient is expressed as:
$\nabla f=\left(\frac{\partial f}{\partial x},\frac{\partial f}{\partial y},\frac{\partial f}{\partial z}\right)=\frac{\partial f}{\partial x}\bold i + \frac{\partial f}{\partial y}\bold j + \frac{\partial f}{\partial z}\bold k$
The modulus of the gradient is:
$|grad\ f(x,y)|=\sqrt{\left(\frac{\partial f}{\partial x}\right)^2+\left(\frac{\partial f}{\partial y}\right)^2}$
Formulas
The same rules as for ordinary derivatives apply.
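Since the gradient's components are just partial derivatives, it can be estimated numerically with central differences (a minimal sketch; `grad` is my own helper):

```python
import numpy as np

def grad(f, p, h=1e-6):
    """Central-difference estimate of the gradient of f at point p."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        g[i] = (f(p + e) - f(p - e)) / (2.0 * h)
    return g

f = lambda v: v[0] ** 2 + 3.0 * v[1]  # f(x, y) = x^2 + 3y, so grad f = (2x, 3)
g = grad(f, [2.0, 5.0])               # approximately (4, 3)
```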
matrices and vectors
matrix
A two-dimensional array, denoted by a capital letter, e.g. A.
Matrix properties
Does not satisfy the commutative law: AxB is not equal to BxA
Associativity is satisfied: AxBxC = (AxB)xC = Ax(BxC)
$AA^{-1}=A^{-1}A=I$
Only square matrices can have an inverse matrix
Transpose: $B=A^T$, where $B_{ij}=A_{ji}$
The trace of the matrix: trace(A), the sum of the elements on the main diagonal
Determinant: det(A) or |A|
special matrix
Orthogonal matrix
$AA^T=E$ or $A^TA=E$
Properties:
- Each row and each column of A is a unit vector, and any two of them are orthogonal.
- |A| is 1 or -1
- $A^T=A^{-1}$
- Orthogonal matrices are usually denoted by Q
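A 2-D rotation matrix is a handy concrete orthogonal matrix for checking these properties (a minimal numpy sketch):

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # rotation matrices are orthogonal

QQt = Q @ Q.T            # should be the identity
Qinv = np.linalg.inv(Q)  # should equal Q^T
detQ = np.linalg.det(Q)  # should be 1 or -1
```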
matrix trace
trace(A), the sum of the elements on the main diagonal
Properties (let A be a matrix of order N):
- The trace of A equals the sum of its main-diagonal elements
- The trace of A equals the sum of the eigenvalues of A
- trace(AB) = trace(BA), where A and B need not be square; only AB (and BA) must be square
- tr(ABC) = tr(BCA) = tr(CAB), i.e., the trace is invariant under cyclic permutations
- trace(mA + nB) = m·trace(A) + n·trace(B)
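The trace properties above are easy to verify numerically, including trace(AB) = trace(BA) for non-square A and B (a minimal numpy sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((2, 3))  # A and B are not square,
B = rng.standard_normal((3, 2))  # but AB (2x2) and BA (3x3) are

tAB = np.trace(A @ B)
tBA = np.trace(B @ A)

C = rng.standard_normal((3, 3))
trace_C = np.trace(C)
eig_sum = np.sum(np.linalg.eigvals(C)).real  # trace equals the sum of eigenvalues
```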
vector
Vector: an n×1 matrix (an n-dimensional vector), denoted by a lowercase letter, e.g. a.
vector multiplication
Scalar product / dot product / inner product:
$a \cdot b = |a||b|\cos\theta$
$a \cdot b = (a_x,a_y,a_z) \cdot (b_x,b_y,b_z)=a_xb_x+a_yb_y+a_zb_z$
The result is a single number.
If the modulus of B is assumed to be 1, i.e. $|B|=1$, then:
$A \cdot B = |A|\cos\alpha$
That is, the inner product of A and B equals the scalar value of the projection of A onto the line along B.
Vector product / cross product / outer product:
$c = a \times b, \quad |c|=|a||b|\sin\theta$
The direction of c is perpendicular to both a and b, following the right-hand rule.
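Both products can be checked with numpy (a small sketch using perpendicular unit vectors):

```python
import numpy as np

a = np.array([1.0, 0.0, 0.0])
b = np.array([0.0, 1.0, 0.0])

dot = np.dot(a, b)  # |a||b|cos(theta); 0 here because a is perpendicular to b
c = np.cross(a, b)  # perpendicular to both a and b: (0, 0, 1)
```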
Eigenvalues, Eigenvectors
definition:
Let A be a square matrix of order n, if there is a number λ and a non-zero vector x
so that Ax=λx
Then: λ is an eigenvalue of A, and x is the eigenvector of A corresponding to λ
The solution space of (A-λE)x=0 is called the eigenspace (characteristic subspace) corresponding to λ
|A-λE| is called the characteristic polynomial of A
Solve:
From |A-λE| = 0, solve for the eigenvalues λ.
To find the eigenvectors, substitute each λ into (A-λE)x = 0 and simplify.
Properties:
- $A^2x=A(Ax)=A(\lambda x)=\lambda(Ax)=\lambda^2x$
- $A^kx=\lambda^kx$
- $A^{-1}x=\frac{1}{\lambda}x$
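These properties can be verified on a small matrix (a minimal numpy sketch):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])

lam, vecs = np.linalg.eig(A)
x = vecs[:, 0]  # eigenvector for eigenvalue lam[0]
lam0 = lam[0]

Ax = A @ x                    # should equal lam0 * x
A2x = A @ A @ x               # should equal lam0**2 * x
Ainvx = np.linalg.inv(A) @ x  # should equal (1/lam0) * x
```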
similarity matrix
If a matrix B can be expressed as $B=M^{-1}AM$,
we say that B is similar to A; similar matrices have the same eigenvalues.
Trace and determinant of matrix
The trace of the matrix equals the sum of its eigenvalues:
$tr(A)=\lambda_1+\lambda_2+\dots+\lambda_n$
The determinant of the matrix equals the product of its eigenvalues:
$det(A)=\lambda_1\lambda_2\cdots\lambda_n$
Feature decomposition:
An n×n diagonalizable matrix can be decomposed as:
$A=Q\Lambda Q^{-1}$
where:
- Q is an n×n square matrix whose i-th column is the eigenvector $q_i$ of A
- $\Lambda$ is a diagonal matrix whose diagonal entries are the corresponding eigenvalues, i.e., $\Lambda_{ii}=\lambda_i$
Only diagonalizable matrices have an eigendecomposition.
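The decomposition and the trace/determinant identities above can be confirmed numerically (a minimal numpy sketch):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

lam, Q = np.linalg.eig(A)  # columns of Q are eigenvectors of A
Lam = np.diag(lam)

A_rebuilt = Q @ Lam @ np.linalg.inv(Q)  # A = Q Lambda Q^{-1}
```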
matrix derivation
If $a = Wh$, where a and h are vectors and W is a matrix, then:
$\frac{\partial a}{\partial h}=W^T \qquad \frac{\partial a}{\partial W}=h^T$
bias and variance
Bias
Low bias: fits the training set well
High bias: does not fit the training data well, underfitting
Variance
Low variance: the loss is similar across different data sets
High variance: poor generalization ability, overfitting
knowledge of probability
Likelihood, probability
Expectation
The sum of each outcome multiplied by its probability; also called the mean. Symbol: E
For example: for a variable X, the probability that X takes the value x is P(X = x). Then:
$E(X)=\sum_x xP(X=x)$
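The expectation formula applied to a fair six-sided die (a minimal numpy sketch):

```python
import numpy as np

values = np.arange(1, 7)       # X takes values 1..6
probs = np.full(6, 1.0 / 6.0)  # each with probability 1/6

EX = np.sum(values * probs)    # E(X) = 3.5
```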
conditional expectation
For a variable X and condition Y = y, the probability that X takes the value x given Y = y is P(X = x | Y = y). Then:
$E(X|Y=y)=\sum_x xP(X=x|Y=y)$
Indicator function I
In derivations of machine learning algorithms you sometimes see a function $I$. What does this function mean?
$I$ stands for the indicator function.
It means:
- When the input is True, the output is 1
- When the input is False, the output is 0.
For example: $I(f(x_i)\neq y_i)$ outputs 1 when $f(x_i)$ is not equal to $y_i$, and 0 otherwise.
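A tiny sketch of the indicator function used to count misclassifications (the arrays are made-up example data):

```python
import numpy as np

def indicator(cond):
    """I(cond): 1 when cond is True, 0 when cond is False."""
    return 1 if cond else 0

y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])

# Sum of I(f(x_i) != y_i) counts how many predictions are wrong.
errors = sum(indicator(p != t) for p, t in zip(y_pred, y_true))  # 1
```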
This summary is still being updated...