[Machine Learning] Mathematical problems in machine learning (continuously updated)

normal distribution

$$f(x)=\frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

μ: the mean of a random variable that follows a normal distribution

σ^2: the variance of this random variable

The normal distribution is denoted N(μ, σ²)

Standard normal distribution: μ=0, σ=1

$$f(x)=\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$$
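As a quick check of the two densities above, here is a minimal NumPy sketch; the helper name `normal_pdf` is my own, not from the original post:

```python
import numpy as np

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) at x, straight from the formula above."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

x = np.linspace(-4.0, 4.0, 9)
print(normal_pdf(x))                      # standard normal: mu = 0, sigma = 1
print(normal_pdf(x, mu=1.0, sigma=2.0))   # general N(1, 4)

# the density integrates to ~1 (crude Riemann sum as a sanity check)
grid = np.linspace(-10.0, 10.0, 200001)
print(np.sum(normal_pdf(grid)) * (grid[1] - grid[0]))
```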

Gaussian function

One-dimensional:

$$f(x)=a\,e^{-\frac{(x-b)^2}{2c^2}}$$

With $a=\frac{1}{\sqrt{2\pi}\,c}$, $b=\mu$ and $c=\sigma$, this is exactly the normal distribution density.

Two-dimensional:

$$f(x,y)=A\exp\left(-\left(\frac{(x-x_0)^2}{2\sigma_x^2}+\frac{(y-y_0)^2}{2\sigma_y^2}\right)\right)$$
Analysis:

The parameters of the Gaussian function (as used in a kernel-generation routine; a sketch follows this list) are:

  • ksize: the size of the Gaussian kernel

  • sigma: the standard deviation (width) of the Gaussian

  • center: the coordinates of the center (peak) of the Gaussian

  • bias: the offset of the peak's center, used to control a truncated Gaussian
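Under the assumption that these parameters describe a kernel-generation routine, here is a minimal NumPy sketch; the function name `gaussian_kernel` and the default handling of `center` are my own choices, and the bias/truncation parameter is omitted:

```python
import numpy as np

def gaussian_kernel(ksize, sigma, center=None):
    """ksize x ksize grid sampled from the 2-D Gaussian above.

    center gives the peak coordinates (x0, y0); by default the middle
    of the kernel.
    """
    if center is None:
        center = ((ksize - 1) / 2.0, (ksize - 1) / 2.0)
    x0, y0 = center
    y, x = np.mgrid[0:ksize, 0:ksize]
    g = np.exp(-((x - x0) ** 2 + (y - y0) ** 2) / (2.0 * sigma ** 2))
    return g / g.sum()            # normalize so the weights sum to 1

print(gaussian_kernel(5, sigma=1.0).round(3))
```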

Norm

What is a norm?

Distance is a broad concept: any function that satisfies non-negativity, the identity property (the distance from a point to itself is zero), and the triangle inequality can be called a distance.

A norm is a strengthened notion of distance: compared with a distance, its definition additionally requires compatibility with scalar multiplication ($\|\alpha x\| = |\alpha|\,\|x\|$).

Sometimes, for ease of understanding, a norm can simply be thought of as a distance.

Norms include vector norms and matrix norms

The vector norm characterizes the size of the vector in the vector space

Every vector in a vector space has a size, and a norm is what measures it. Different norms measure size in different ways, just as length can be measured in meters or in feet;

The matrix norm characterizes the magnitude of the change caused by the matrix

The operation AX = B maps the vector X to the vector B; a matrix norm measures the size of this change.

Here is a brief introduction to the definitions and meanings of the following vector norms

The most commonly used are the L1 and L2 norms.

LP norm

Like the Minkowski distance, the Lp norm is not a single norm but a family of norms, defined as follows:
$$L_p=\|x\|_p=\sqrt[p]{\sum_{i=1}^n x_i^p}, \qquad x=(x_1,x_2,\dots,x_n)$$
As p varies, the norm changes as well. The classic diagram of the p-norm shows how the set of points at distance (norm) 1 from the origin in three-dimensional space changes as p goes from infinity down to 0.

Take the common L2 norm (p = 2) as an example: it is the Euclidean distance, and the points in space whose Euclidean distance from the origin is 1 form a sphere.

In fact, for p < 1 the Lp expression does not satisfy the triangle inequality, so it is not a norm in the strict sense. Take p = 0.5 and the two-dimensional vectors a = (1, 4) and b = (4, 1), whose sum is a + b = (5, 5):

$$\|a\|_{0.5}+\|b\|_{0.5}=(\sqrt{1}+\sqrt{4})^2+(\sqrt{4}+\sqrt{1})^2=18<20=(\sqrt{5}+\sqrt{5})^2=\|a+b\|_{0.5}$$
Therefore, calling Lp a "norm" for these values of p is only a loose, conceptual usage.
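A quick numeric check of this counterexample (plain NumPy; `lp` is just a helper name):

```python
import numpy as np

def lp(x, p):
    """(sum |x_i|^p)^(1/p): the Lp expression, a true norm only for p >= 1."""
    return np.sum(np.abs(x) ** p) ** (1.0 / p)

a, b = np.array([1.0, 4.0]), np.array([4.0, 1.0])
p = 0.5
print(lp(a, p) + lp(b, p))   # 9 + 9 = 18
print(lp(a + b, p))          # ||(5,5)||_0.5 = 20 > 18: triangle inequality fails
```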

L0 norm

When p = 0 we get the L0 norm.

As noted above, the L0 norm is not a true norm; it is mainly used to count the number of non-zero elements in a vector.

Plugging p = 0 into the Lp definition above would give:

$$\|x\|_0=\sqrt[0]{\sum_{i=1}^n x_i^0}$$
There is a problem here: the zeroth power of a non-zero element is 1, but the zeroth power of zero and the zeroth root are both ill-defined, so it is hard to give this expression a clear meaning. In practice, therefore, the L0 norm is defined as:
$$\|x\|_0=\#\{\,i \mid x_i \neq 0\,\}$$

i.e., the number of non-zero elements in the vector x.

For the L0 norm, the optimization problem is:
$$\min \|x\|_0 \quad \text{s.t.}\quad Ax=b$$
In practice the L0 norm has no convenient mathematical form, and the problem above is NP-hard, so the L0 optimization problem is usually relaxed to an optimization under the L1 or L2 norm.
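In code, the L0 "norm" is simply a count of non-zero entries; a minimal NumPy sketch:

```python
import numpy as np

x = np.array([0.0, 3.0, 0.0, -2.0, 0.0, 1.5])
l0 = np.count_nonzero(x)      # number of non-zero elements
print(l0)                     # 3
```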

L1 norm

The L1 norm is a norm we often see, and its definition is as follows:
$$\|x\|_1=\sum_{i=1}^n |x_i|$$

i.e., the sum of the absolute values of the elements of the vector x.

The L1 norm has many names, such as the familiar Manhattan distance, minimum absolute error, etc.

The L1 norm can be used to measure the difference between two vectors, e.g. the Sum of Absolute Differences (SAD):

$$SAD(x_1,x_2)=\sum_{i=1}^n|x_{1i}-x_{2i}|$$

For the L1 norm, the optimization problem is:

$$\min \|x\|_1 \quad \text{s.t.}\quad Ax=b$$
Because of its nature, the solution of the L1 optimization problem tends to be sparse, so the L1 norm is also called a sparse rule operator (a sparsity-inducing regularizer). L1 can therefore be used to obtain sparse features and drop uninformative ones. For example, when classifying a user's movie preferences, the user may have 100 features of which only a dozen or so are useful for classification; most features, such as height and weight, may be useless and can be filtered out by using the L1 norm.
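A minimal NumPy sketch of the L1 norm and the SAD measure defined above:

```python
import numpy as np

x = np.array([1.0, -2.0, 0.0, 3.0])
print(np.sum(np.abs(x)))            # L1 norm = 6.0
print(np.linalg.norm(x, ord=1))     # same value via NumPy's built-in

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 5.0])
sad = np.sum(np.abs(x1 - x2))       # Sum of Absolute Differences
print(sad)                          # 5.0
```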

L2 norm

The L2 norm is the most common and most frequently used norm; the familiar Euclidean distance is an L2 norm. Its definition is:

$$\|x\|_2=\sqrt{\sum_{i=1}^n x_i^2}$$

i.e., the square root of the sum of the squares of the elements of the vector.

Like the L1 norm, L2 can also measure the difference between two vectors, e.g. the Sum of Squared Differences (SSD):

$$SSD(x_1,x_2)=\sum_{i=1}^n(x_{1i}-x_{2i})^2$$
For the L2 norm, the optimization problem is:

$$\min \|x\|_2 \quad \text{s.t.}\quad Ax=b$$

The L2 norm is often used as a regularization term in the objective function, preventing the model from becoming so complex that it merely caters to the training set and overfits, thereby improving the model's ability to generalize.
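The corresponding sketch for the L2 norm and the SSD measure:

```python
import numpy as np

x = np.array([3.0, 4.0])
print(np.sqrt(np.sum(x ** 2)))      # L2 norm = 5.0
print(np.linalg.norm(x))            # default ord=2, same value

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([2.0, 0.0, 5.0])
ssd = np.sum((x1 - x2) ** 2)        # Sum of Squared Differences
print(ssd)                          # 9.0
```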

L∞ norm

When $p=\infty$ we get the $L_\infty$ norm, which is mainly used to measure the maximum absolute value of the elements of a vector. Like L0, it is usually written as:

$$\|x\|_\infty=\max_i(|x_i|)$$
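And a sketch for the L∞ norm:

```python
import numpy as np

x = np.array([1.0, -7.0, 3.0])
print(np.max(np.abs(x)))                 # 7.0
print(np.linalg.norm(x, ord=np.inf))     # same value
```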

gradient

What is a gradient?

Directional derivative: at a point, the derivative of the function along a given direction (there is one for every direction).

Gradient: the direction along which the directional derivative is largest, i.e., the direction of fastest change.

Notation

The gradient of a function f is written $\nabla f$ or $\operatorname{grad} f$, where $\nabla$ (nabla) denotes the vector differential operator:

$$\nabla=\frac{\partial}{\partial x}\bar i + \frac{\partial}{\partial y}\bar j + \frac{\partial}{\partial z}\bar k$$

In Cartesian coordinates the gradient is:

$$\nabla f=\left(\frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \frac{\partial f}{\partial z}\right)=\frac{\partial f}{\partial x}\mathbf{i} + \frac{\partial f}{\partial y}\mathbf{j} + \frac{\partial f}{\partial z}\mathbf{k}$$

The modulus of the gradient (in two dimensions) is:

$$|\operatorname{grad} f(x,y)|=\sqrt{\left(\frac{\partial f}{\partial x}\right)^2+\left(\frac{\partial f}{\partial y}\right)^2}$$

Formulas

The same rules as for ordinary derivatives apply.
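As an illustration, here is a central-difference approximation of the gradient; the function f(x, y) = x² + 3y is an arbitrary example, not from the original post:

```python
import numpy as np

def f(x, y):
    return x ** 2 + 3.0 * y            # example function; grad f = (2x, 3)

def numerical_gradient(f, x, y, h=1e-6):
    """Central differences approximate (df/dx, df/dy)."""
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return np.array([dfdx, dfdy])

g = numerical_gradient(f, 1.0, 2.0)
print(g)                               # ~ [2., 3.]
print(np.linalg.norm(g))               # modulus of the gradient
```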

matrices and vectors

matrix

Two-dimensional array.

Denoted by a capital letter, e.g. A.

Matrix properties

Matrix multiplication is not commutative: A×B is in general not equal to B×A.

Matrix multiplication is associative: A×B×C = (A×B)×C = A×(B×C)

$$AA^{-1}=A^{-1}A=I$$

Only a square matrix can have an inverse (and not every square matrix does).
Transpose: $B=A^T$, which means

$$B_{ij}=A_{ji}$$

The trace of the matrix: trace(A), the sum of the elements on the main diagonal

Determinant: det(A) or |A|
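A quick NumPy check of these properties on small random matrices (the matrices are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

print(np.allclose(A @ B, B @ A))                     # False: not commutative in general
print(np.allclose((A @ B) @ C, A @ (B @ C)))         # True: associative
print(np.allclose(A @ np.linalg.inv(A), np.eye(3)))  # A A^-1 = I
T = A.T
print(T[1, 2] == A[2, 1])                            # transpose: B_ij = A_ji
print(np.trace(A), np.linalg.det(A))                 # trace and determinant
```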

special matrix

Orthogonal matrix

$$AA^T=E \quad \text{or} \quad A^TA=E$$

Properties:

  1. Each row (and each column) of A is a unit vector, and any two of them are orthogonal to each other.

  2. |A| is 1 or -1

  3. $A^T=A^{-1}$

  4. Orthogonal matrices are usually denoted by Q
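A 2-D rotation matrix is a convenient concrete orthogonal matrix for checking these properties; a minimal sketch:

```python
import numpy as np

theta = 0.3
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2-D rotation: orthogonal

print(np.allclose(Q @ Q.T, np.eye(2)))            # Q Q^T = E
print(np.allclose(Q.T, np.linalg.inv(Q)))         # Q^T = Q^-1
print(np.linalg.det(Q))                           # +1 (a reflection would give -1)
print(np.linalg.norm(Q, axis=1))                  # each row is a unit vector
```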

matrix trace

trace(A), the sum of the elements on the main diagonal

Properties:

Let A be an n×n matrix.

  1. The trace of matrix A is equal to the sum of all main diagonal elements

  2. The trace of matrix A is equal to the sum of the eigenvalues of A

  3. trace(AB) = trace(BA), where A and B need not be square; only AB needs to be square
    tr(ABC) = tr(BCA) = tr(CAB) = …

  4. trace(mA+nB)=m trace(A)+n trace(B)
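A numeric spot-check of properties 2-4 (random matrices as placeholders):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))          # not square
B = rng.standard_normal((4, 3))          # not square, but A @ B is square
S = rng.standard_normal((3, 3))
T = rng.standard_normal((3, 3))

print(np.isclose(np.trace(A @ B), np.trace(B @ A)))           # property 3
print(np.isclose(np.trace(S), np.sum(np.linalg.eigvals(S))))  # property 2
m, n = 2.0, -3.0
print(np.isclose(np.trace(m * S + n * T),
                 m * np.trace(S) + n * np.trace(T)))          # property 4
```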

vector

Vector: an n×1 matrix (an n-dimensional vector).

Denoted by a lowercase letter, e.g. a.

vector multiplication

Scalar product (dot product, inner product):

$$a \cdot b = |a|\,|b|\cos\theta$$

$$a \cdot b = (a_x,a_y,a_z)\cdot(b_x,b_y,b_z) = a_x b_x + a_y b_y + a_z b_z$$

The result is a number.

If the modulus of b is assumed to be 1, i.e. $|b|=1$, then:

$$a \cdot b = |a|\cos\theta$$

That is, the inner product of a and b equals the scalar value of the projection of a onto the line along b.

Vector product (cross product, outer product):

$$c=a \times b, \qquad |c|=|a|\,|b|\sin\theta$$

The direction of c is perpendicular to both a and b, following the right-hand rule.
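A minimal NumPy illustration of both products:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 1.0])

print(np.dot(a, b))                  # inner product: 1*4 + 2*0 + 3*1 = 7
c = np.cross(a, b)                   # cross product
print(c)
print(np.dot(c, a), np.dot(c, b))    # both ~0: c is perpendicular to a and b

# projection interpretation: with |b| = 1, a . b is the length of a's projection onto b
b_unit = b / np.linalg.norm(b)
print(np.dot(a, b_unit))
```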

Eigenvalues, Eigenvectors

definition:

Let A be a square matrix of order n, if there is a number λ and a non-zero vector x

so that Ax=λx

Then: λ is an eigenvalue of A, and x is the eigenvector of A corresponding to λ

The solution space of (A-λE)x=0 is called the eigenspace (characteristic subspace) corresponding to λ

|A-λE| is called the characteristic polynomial of A

Solve:

According to |A-λE|=0, λ can be solved.
[Worked example omitted: the original post shows a 3×3 matrix in an image; solving |A-λE| = 0 for it gives the eigenvalues -1, 2, 2, and substituting each λ back into (A-λE)x = 0 and simplifying gives the corresponding eigenvectors.]

Properties:

  • $A^2x=A(Ax)=A(\lambda x)=\lambda(Ax)=\lambda^2x$
  • $A^kx=\lambda^kx$
  • $A^{-1}x=\frac{1}{\lambda}x$
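A sketch with np.linalg.eig verifying the definition and these properties; the 2×2 matrix is an arbitrary example, not the one from the original figure:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
vals, vecs = np.linalg.eig(A)        # eigenvalues and (column) eigenvectors
lam, x = vals[0], vecs[:, 0]

print(np.allclose(A @ x, lam * x))                  # A x = lambda x
print(np.allclose(A @ A @ x, lam ** 2 * x))         # A^2 x = lambda^2 x
print(np.allclose(np.linalg.inv(A) @ x, x / lam))   # A^-1 x = x / lambda
```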

Similar matrices

If a matrix B can be expressed as $B=M^{-1}AM$,

then we say that B and A are similar; similar matrices have the same eigenvalues.

Trace and determinant of matrix

The trace of a matrix equals the sum of its eigenvalues:

$$tr(A)=\lambda_1+\lambda_2+\dots+\lambda_n$$

The determinant of a matrix equals the product of its eigenvalues:

$$det(A)=\lambda_1\lambda_2\cdots\lambda_n$$

Eigendecomposition:

An n×n matrix (when diagonalizable) can be decomposed as:

$$A=Q\Lambda Q^{-1}$$

where:

  • Q is an n×n square matrix whose i-th column is the eigenvector $q_i$ of A
  • $\Lambda$ is a diagonal matrix whose diagonal entries are the corresponding eigenvalues, i.e. $\Lambda_{ii}=\lambda_i$

Only diagonalizable matrices admit an eigendecomposition.
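A minimal check of A = QΛQ⁻¹ and of the trace/determinant identities above, again on a small example matrix:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
vals, Q = np.linalg.eig(A)
Lam = np.diag(vals)

print(np.allclose(Q @ Lam @ np.linalg.inv(Q), A))   # A = Q Lambda Q^-1
print(np.isclose(np.trace(A), vals.sum()))          # tr(A) = sum of eigenvalues
print(np.isclose(np.linalg.det(A), vals.prod()))    # det(A) = product of eigenvalues
```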

Matrix derivatives

If $a = Wh$, where a and h are vectors and W is a matrix, then (in the layout convention common in machine-learning notes):

$$\frac{\partial a}{\partial h}=W^T, \qquad \frac{\partial a}{\partial W}=h^T$$
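These shorthands are easiest to verify through the chain rule: for a scalar L = cᵀ(Wh), the gradients are ∂L/∂h = Wᵀc and ∂L/∂W = chᵀ. A finite-difference sketch (the matrix and vectors are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((3, 4))
h = rng.standard_normal(4)
c = rng.standard_normal(3)

def L(W, h):
    return c @ (W @ h)               # scalar loss built on a = W h

# analytic gradients implied by da/dh = W^T and da/dW = h^T (via the chain rule)
grad_h = W.T @ c
grad_W = np.outer(c, h)

# finite-difference check for dL/dh
eps = 1e-6
num_grad_h = np.array([(L(W, h + eps * e) - L(W, h - eps * e)) / (2 * eps)
                       for e in np.eye(4)])
print(np.allclose(grad_h, num_grad_h))          # True

# finite-difference check for one entry of dL/dW
E00 = np.zeros_like(W); E00[0, 0] = 1.0
num_grad_W00 = (L(W + eps * E00, h) - L(W - eps * E00, h)) / (2 * eps)
print(np.isclose(grad_W[0, 0], num_grad_W00))   # True
```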

bias and variance

Bias

Low bias: the model fits the training set well.

High bias: the model does not fit the training data well; underfitting.

variance

Low variance: the loss changes little across different data sets.

High variance: poor generalization ability, overfitting

Probability basics

Likelihood, probability

Probability measures how likely an outcome is when the model parameters are fixed; likelihood measures how plausible parameter values are given the observed data. (The original post illustrates this distinction with a figure.)

Expectation

The sum of every possible outcome multiplied by its probability. Also called the mean; denoted E.

Expectation

For example, for a variable X where the probability that X takes the value x is P(X = x):


$$E(X)=\sum_x xP(X=x)$$

conditional expectation

For a variable X and the condition Y = y, the probability that X takes the value x given Y = y is P(X = x | Y = y):

$$E(X|Y=y) = \sum_x xP(X=x|Y=y)$$
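A small discrete example (the joint distribution below is made up) computing E(X) and E(X | Y = y) directly from these definitions:

```python
import numpy as np

# joint distribution P(X = x, Y = y) over a small grid
x_vals = np.array([0.0, 1.0, 2.0])
P = np.array([[0.10, 0.20],     # P(X=0, Y=0), P(X=0, Y=1)
              [0.30, 0.10],     # P(X=1, ...)
              [0.20, 0.10]])    # P(X=2, ...)

p_x = P.sum(axis=1)                         # marginal P(X = x)
E_X = np.sum(x_vals * p_x)                  # E(X) = sum_x x P(X=x)
print(E_X)                                  # 1.0

y = 1
p_x_given_y = P[:, y] / P[:, y].sum()       # P(X = x | Y = y)
E_X_given_y = np.sum(x_vals * p_x_given_y)  # E(X | Y = y)
print(E_X_given_y)                          # 0.75
```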

Indicator function I

In derivations of machine learning algorithms you sometimes see a function $I$. What does this function mean?

$I$ stands for the indicator function.

It means:

  • When the input is True, the output is 1
  • When the input is False, the output is 0.

For example, $I(f(x_i)\neq y_i)$ means: when $f(x_i)$ is not equal to $y_i$, the output is 1; otherwise the output is 0.
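In NumPy a boolean comparison already behaves like an indicator; a sketch of counting misclassifications with $\sum_i I(f(x_i)\neq y_i)$ (the labels below are made up):

```python
import numpy as np

y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])          # pretend these came from f(x_i)

indicator = (y_pred != y_true).astype(int)  # I(f(x_i) != y_i): 1 where wrong, 0 where right
print(indicator)                            # [0 0 1 0 1]
print(indicator.sum())                      # number of misclassified samples: 2
```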

This summary is still being updated...

Original article: blog.csdn.net/qq_41340996/article/details/124838329