Mathematical Foundations of Machine Learning: Linear Algebra

Starting with this article, I will write a series called "Mastering the Mathematical Foundations of Machine Learning: XX (Key Knowledge)", which covers the main mathematical foundations used in machine learning.

 

Why write this series?

  • The online articles are too comprehensive. Right away they recommend MIT's linear algebra course, various calculus texts, "Introduction to Calculus", "Introduction to Probability Theory", and so on. Many of them are even in English, so you first have to learn a pile of English just to understand the terminology. I don't think it is necessary to read them all, because much of what a course like MIT's linear algebra covers is rarely or never used in machine learning, yet is hard to understand: Markov matrices, the fast Fourier transform, Jordan form... it makes your head spin.
  • The online articles are too brief. Although machine learning does not require learning all of this mathematics thoroughly, it does rely on a lot of it. Many articles try to summarize every important mathematical foundation of machine learning in one place. I disagree! If an article is too brief, it might as well be a table of contents; and if it skips too many important foundations, it would be better not to write it at all.
  • To sort out and review. I will try to extract what I think is important, point out where each piece of mathematics is applied in machine learning, and write as accessibly and in as much depth as possible. This also helps me review and keeps the column updated!

 

Note: I will write down the mathematical foundations that I think are highly relevant to machine learning. Much of this knowledge comes from elsewhere, mainly from the book "Deep Learning"; I am just a knowledge porter adding my own opinions.

The following sections begin the description. The linear algebra part mainly covers:

  1. Scalars, vectors, matrices and tensors
  2. Matrix-vector operations
  3. Identity matrix and inverse matrix
  4. Determinant
  5. Variance, standard deviation, covariance matrix
  6. Norm
  7. Special types of matrices and vectors
  8. Eigendecomposition and what it means
  9. Singular value decomposition and its significance
  10. Moore-Penrose pseudoinverse
  11. Trace operation

Scalars, Vectors, Matrices and Tensors

  • Scalar: A scalar is a single number, usually denoted by a lowercase variable name. When we introduce a scalar, it is important to be clear about what kind of number it is. This should be noted when writing a paper; for example, when defining a natural-number scalar, we might say "let n ∈ N denote the number of elements".
  • Vector: In physics and engineering, a geometric vector is what most people who studied high-school mathematics and physics know as a vector. In linear algebra, after further abstraction, the notions of magnitude and direction may no longer apply, but we can simply think of a vector as an ordered list of numbers, where each individual number is identified by its index. Vectors are usually given bold lowercase names. When we need to write out the elements of a vector explicitly, we arrange them in a vertical column surrounded by square brackets (as shown below):

  • Matrix: A matrix is a two-dimensional array in which each element is identified by two indices instead of one. We usually give matrices bold uppercase variable names, such as A. If a real matrix has height m and width n, we say A \in R^{m \times n}. When we want to write out a matrix explicitly, we arrange its elements in an array surrounded by square brackets, as shown below:

  • Tensor: A tensor, as defined in linear algebra or geometric algebra, is a generalization of vectors and matrices. Put simply, a scalar can be regarded as a zero-order tensor, a vector as a first-order tensor, and a matrix as a second-order tensor. For example, any color image can be represented as a third-order tensor (like a three-dimensional array in the C language), whose three dimensions are the image height, width, and color channels. We use the font A to denote the tensor "A"; the element of tensor A with coordinates (i, j, k) is written A_{i,j,k}.
The importance of the above knowledge is self-evident; without it you cannot even begin to learn machine learning. Almost all operations are performed on vectors and matrices, and in TensorFlow all data is represented and operated on as tensors.
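As a quick illustration, here is a minimal numpy sketch (my own example, not from the original text) of the four kinds of objects as arrays of increasing rank:

```python
import numpy as np

s = np.float64(3.5)                      # scalar: a single number (rank 0)
v = np.array([1.0, 2.0, 3.0])            # vector: a column of numbers (rank 1)
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])               # matrix: 3 x 2, i.e. A in R^(3x2) (rank 2)
T = np.zeros((32, 32, 3))                # tensor: e.g. a 32x32 RGB image (rank 3)

print(v[1])        # a single element of the vector, identified by one index -> 2.0
print(A[2, 0])     # an element identified by two indices -> 5.0
print(T.shape)     # (32, 32, 3): height, width, color channels
```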

Matrix-Vector Operations

Matrix multiplication is one of the most important matrix operations. The matrix product of two matrices A and B is a third matrix C. For the product to be well defined, the number of columns of A must equal the number of rows of B. If A has shape m × n and B has shape n × p, then C has shape m × p. We write matrix multiplication by placing the matrices side by side, e.g. C = AB.

Specifically, this multiplication operation is defined as: C_{i,j} = \sum_{k} A_{i,k} B_{k,j}

As an example, it looks like this:

It should be noted that the standard product of two matrices is not the product of their corresponding elements. That operation does exist, however; it is called the element-wise product or Hadamard product, denoted A \odot B.

In particular, the dot product of two vectors x and y can be viewed as the matrix product x^{T}y. We can think of C_{i,j} as the dot product between the i-th row of A and the j-th column of B. Note that the product x^{T}y of two vectors is also called the inner product.

The matrix product obeys the distributive law: A(B + C) = AB + AC

Matrix products are also associative: A(BC) = (AB)C

But unlike the scalar product, the matrix product does not satisfy the commutative law (AB = BA does not always hold).

However, the dot product of two vectors is commutative: x^{T}y = y^{T}x
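A small numpy sketch of the three products discussed above (standard matrix product, Hadamard product, and vector dot product); the matrices here are made up purely for illustration:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6]])          # shape (2, 3)
B = np.array([[1, 0],
              [0, 1],
              [1, 1]])             # shape (3, 2): columns of A == rows of B

C = A @ B                          # standard matrix product, shape (2, 2)
H = A * A                          # Hadamard (element-wise) product, shape (2, 3)

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(x @ y == y @ x)                            # dot product is commutative -> True
print(np.allclose(A @ (B + B), A @ B + A @ B))   # distributive law -> True
```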

Matrix transpose:

  • The transpose of a matrix is defined by (A^{T})_{i,j} = A_{j,i}
  • RR^{T} is always a symmetric matrix, since (RR^{T})^{T} = (R^{T})^{T}R^{T} = RR^{T}; this proves that RR^{T} is symmetric.
It is worth studying matrix multiplication and the related operations, including what matrix multiplication means geometrically. In machine learning most computations are matrix and vector operations, and the Hadamard product appears, for example, in the derivation of backpropagation.
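And a quick check of the transpose facts above (random matrix, just for illustration):

```python
import numpy as np

R = np.random.rand(3, 4)
S = R @ R.T                 # product of a matrix with its own transpose
print(np.allclose(S, S.T))  # True: R R^T is symmetric
```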

Identity Matrix and Inverse Matrix

Linear algebra provides a powerful tool called the matrix inverse. For most matrices A, it lets us solve the equation Ax = b analytically.

To describe the matrix inverse, we first need the concept of the identity matrix: multiplying any vector by the identity matrix leaves it unchanged. We denote the identity matrix that preserves n-dimensional vectors by I_{n}. Formally, I_{n} \in R^{n \times n} and

\forall x \in R^{n}, \quad I_{n}x = x

The structure of the identity matrix is simple: all elements along the main diagonal are 1, and all other elements are 0. For example:

The inverse of matrix A is written A^{-1}, and it is defined as the matrix that satisfies the following condition:

A^{-1}A=I_{n}

Now we can solve Ax = b by the following steps:

From Ax = b, multiplying both sides on the left by A^{-1} gives A^{-1}Ax = A^{-1}b.

Since A^{-1}A = I_{n}, this becomes I_{n}x = A^{-1}b.

Finally: x = A^{-1}b

Actually computing the inverse of a matrix is relatively simple; what is more important and more useful is judging whether a matrix has an inverse at all, and that is the key difficulty. There are many ways to judge; here are a few simple ones:

  • A matrix that is not square (its number of rows does not equal its number of columns) has no inverse.
  • An invertible matrix is a non-singular matrix, and a non-singular matrix is an invertible matrix.
  • A square matrix whose determinant equals 0 is a singular matrix; in other words, a non-zero determinant is equivalent to being invertible.
The matrix inverse is widely used in machine learning, for example in logistic regression and SVMs, and such operations also appear in many papers, so it really is essential!
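A short numpy sketch of solving Ax = b (my own toy numbers); in practice np.linalg.solve is preferred over forming the inverse explicitly:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])

A_inv = np.linalg.inv(A)       # only works if det(A) != 0 (A is non-singular)
x1 = A_inv @ b                 # x = A^{-1} b, as derived above
x2 = np.linalg.solve(A, b)     # numerically preferable: solves Ax = b directly

print(np.allclose(x1, x2))     # True
print(np.allclose(A @ x1, b))  # True: the solution satisfies Ax = b
```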

Determinant

Determinant, denoted det(A): a function that maps a square matrix A to a real number. The determinant equals the product of the matrix's eigenvalues. The absolute value of the determinant measures how much the matrix expands or shrinks space when it participates in matrix multiplication. If the determinant is 0, space is completely contracted along at least one dimension and loses all volume; if the determinant is 1, the transformation preserves volume.

The determinant is itself a big topic that is easy to dig into deeply; if you do not want to go further, knowing the concept above is enough.
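A quick numerical check of the claim that the determinant equals the product of the eigenvalues (random matrix, purely for illustration):

```python
import numpy as np

A = np.random.rand(3, 3)
eigenvalues = np.linalg.eigvals(A)
print(np.isclose(np.linalg.det(A), np.prod(eigenvalues)))  # True (up to rounding)

singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])     # second row is a multiple of the first
print(np.linalg.det(singular))        # ~0: space is collapsed, no inverse exists
```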

Variance, Standard Deviation, Covariance

Variance: a measure of how spread out a random variable or a set of data is. The formula for the population variance is:

\sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}(X_{i} - \mu)^{2}

where \sigma^{2} is the population variance, X is the variable, \mu is the population mean, and N is the number of cases in the population. The same notation applies to the standard deviation formula below.

Standard deviation: also known as the root-mean-square deviation or experimental standard deviation; its formula is

\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(X_{i} - \mu)^{2}}

 

The standard deviation is the arithmetic square root of the variance. It reflects how dispersed a data set is: two data sets with the same mean do not necessarily have the same standard deviation.
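A minimal numpy check (note that np.var and np.std divide by N by default, i.e. the population formulas above; pass ddof=1 for the sample versions):

```python
import numpy as np

X = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = X.mean()
var = ((X - mu) ** 2).mean()       # population variance, as in the formula above
print(var, np.var(X))              # both 4.0
print(np.sqrt(var), np.std(X))     # standard deviation: 2.0
```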

Why do we need covariance?

We know that standard deviation and variance are generally used to describe one-dimensional data, but in real life we often encounter data sets with multiple dimensions; the simplest example is the scores of several school subjects. Faced with such a data set, we could of course compute the variance of each dimension independently, but usually we want to know more, for example whether a boy's wretchedness is related to his popularity with girls. Covariance is exactly the statistic used to measure the relationship between two random variables.

Covariance Matrix

The key to understanding the covariance matrix is to remember that it measures the covariance between different dimensions, not between different samples. Given a sample matrix, we must first be clear about whether a row is a sample or a dimension; once that is clear, the whole calculation flows naturally and you will not get confused.

As an example (taken from another blog post):

question:

There is a set of data consisting of four two-dimensional vectors: (1,2), (3,6), (4,2), (5,2). What is the covariance matrix of these four data points?

answer:

Since the data is two-dimensional, the covariance matrix is a 2×2 matrix, and each element of the matrix is:

element(i,j) = (all elements in dimension i - mean of dimension i) * (all elements in dimension j - mean of dimension j)

where "*" denotes the vector inner product: corresponding elements are multiplied and then summed. (Strictly speaking, the covariance also divides this sum by N or N-1; this example keeps the raw sums, which does not change the structure of the matrix.)

We first list the first dimension:

D1: (1,3,4,5) Mean: 3.25
D2: (2,6,2,2) Mean: 3

The following calculates the (1,2)th element of the covariance matrix:

Element (1,2)=(1-3.25,3-3.25,4-3.25,5-3.25)*(2-3,6-3,2-3,2-3)=-1

Similarly, we can calculate all four elements of the 2×2 matrix.

The end result is the matrix [[8.75, -1], [-1, 12]].

Let's analyze the above example. First look at the calculation process of element (1,1):

Take the first dimension of all the data points and compute its mean; the rest of the calculation is exactly the familiar "variance" procedure. In other words, element (1,1) is exactly the (unnormalized) variance, 8.75, of the four first-dimension values. Similarly, element (2,2) is the variance, 12, of the second dimension (also 4 elements).

Now look at element (1,2): it is precisely the covariance of x and y that we learned in statistics. Instead of measuring the dispersion of a single dimension on its own, it combines the deviations of two dimensions, and this is exactly what the "co" in "covariance matrix" means. From the calculation process and result it is also clear that element (2,1) equals element (1,2); that is, every covariance matrix is a symmetric matrix.

Summarize the characteristics of the covariance matrix:

  • The diagonal element (i,i) is the variance of the i-th dimension of the data.
  • The off-diagonal element (i,j) is the covariance between the i-th and j-th dimensions.
  • The covariance matrix is a symmetric matrix.

It's enough to know that for now.
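To reproduce the worked example with numpy: np.cov divides by N-1 by default (or by N with ddof=0), while the example above simply sums the deviation products, so we rescale to compare. This is only an illustrative check:

```python
import numpy as np

D1 = np.array([1.0, 3.0, 4.0, 5.0])   # first dimension of the four samples
D2 = np.array([2.0, 6.0, 2.0, 2.0])   # second dimension of the four samples

# np.cov expects one row per dimension (variable), one column per sample.
C = np.cov(np.vstack([D1, D2]), ddof=0) * len(D1)  # undo the 1/N factor to match the raw sums
print(C)
# [[ 8.75 -1.  ]
#  [-1.   12.  ]]
```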

This knowledge is also very basic and appears in many algorithms, for example the bias-variance trade-off, the variance problems (and their solutions) in RL, and the bivariate Gaussian distribution, whose shape is determined by its covariance matrix (this will be shown in detail in the next article on probability theory).

Norm

What is a norm? It sounds like jargon, but it is simply a way to measure the size of a vector. In machine learning we also often use norms to measure the size of matrices.

The L^{p} norm is defined as:

\left\| x \right\|_{p} = \left( \sum_{i} \left| x_{i} \right|^{p} \right)^{\frac{1}{p}}

(Don't worry about why it is defined this way; just remember that it measures the size of a vector or matrix.)

Common norms:

L^{1} norm ||x||_{1}: the sum of the absolute values of the elements of x;

L^{2} norm ||x||_{2}: the square root of the sum of the squares of the elements of x. This is the straight-line distance between two points; recall your junior-high geometry!

Note: when p = 2, the L^{2} norm is called the Euclidean norm. It is the Euclidean distance from the origin to the point identified by the vector x. The L^{2} norm appears very frequently in machine learning and is often written simply as ∥x∥, omitting the subscript 2. The squared L^{2} norm is also often used to measure the size of a vector; it can be computed simply as the dot product x^{T}x.

This knowledge also appears in major algorithms (such as SVM) and is closely related to the Euclidean and Manhattan distances among the common distance measures.
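A few norm computations with numpy corresponding to the definitions above:

```python
import numpy as np

x = np.array([3.0, -4.0])

print(np.linalg.norm(x, 1))               # L1 norm: |3| + |-4| = 7
print(np.linalg.norm(x, 2))               # L2 (Euclidean) norm: sqrt(9 + 16) = 5
print(x @ x)                              # squared L2 norm via the dot product x^T x = 25
print(np.linalg.norm(np.eye(3), 'fro'))   # Frobenius norm of a matrix
```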

Special types of matrices and vectors

Some special types of matrices and vectors are particularly useful and are referred to by name; some articles simply say "such-and-such matrix" or "such-and-such vector", so we should know what these matrices and vectors look like and what properties they have!

Diagonal matrix: contains non-zero entries only on the main diagonal; all other positions are zero. Formally, a matrix D is diagonal if and only if D_{i,j} = 0 for all i ≠ j.

Special case: the identity matrix is a diagonal matrix whose diagonal elements are all 1.

Unit vector: a vector whose modulus equals 1 (it has unit norm). Since it is a non-zero vector, a unit vector has a definite direction, and there are infinitely many unit vectors.

That is: for a unit vector, ||x||_{2} = 1.

Symmetric matrix: a matrix that is equal to its own transpose: A = A^{T}

Symmetric matrices often arise when the entries are generated by a two-argument function that does not depend on the order of its arguments. For example, if A is a distance matrix with A_{i,j} denoting the distance from point i to point j, then A_{i,j} = A_{j,i}, because the distance function is symmetric.

Orthogonal matrix: a square matrix whose rows are mutually orthonormal and whose columns are mutually orthonormal: A^{T}A = AA^{T} = I

This means A^{-1} = A^{T}

Orthogonal matrices are of interest because their inverse is very cheap to compute. Pay attention to the definition: counterintuitively, the rows of an orthogonal matrix are not merely orthogonal, they are orthonormal. There is no special term for a matrix whose rows or columns are orthogonal to each other but not orthonormal.
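A small sketch checking the properties above, using a 2-D rotation matrix as a standard example of an orthogonal matrix (my own example):

```python
import numpy as np

D = np.diag([1.0, 2.0, 3.0])                      # diagonal matrix: off-diagonal entries are 0
u = np.array([0.6, 0.8])
print(np.linalg.norm(u))                          # 1.0 -> u is a unit vector

theta = np.pi / 6
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # 2-D rotation: an orthogonal matrix
print(np.allclose(Q.T @ Q, np.eye(2)))            # True: columns are orthonormal
print(np.allclose(np.linalg.inv(Q), Q.T))         # True: the inverse is just the transpose
print(np.allclose(D, D.T))                        # True: D is also symmetric
```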

Eigen decomposition and what it means

Many mathematical objects can be better understood by breaking them down into their constituent parts, or by finding some of their properties that are general and not caused by the way we choose to represent them.

For example: integers can be decomposed into prime factors. We can represent the integer 12 in different ways, such as decimal or binary, but its prime factorization is always 12 = 2 × 2 × 3. From this representation we can read off useful facts, such as that 12 is not divisible by 5, or that any multiple of 12 is divisible by 3.

Just as we can discover some intrinsic properties of integers by decomposing prime factors, we can also decompose matrices to discover functional properties that are not obvious when matrices are represented as array elements.

  • Eigen decomposition is one of the most widely used matrix decompositions, i.e. we decompose a matrix into a set of eigenvectors and eigenvalues.
  • The eigenvector of a transformation (or matrix) is a vector whose direction is unchanged by that transformation; it is only stretched or shrunk in length.

Original definition of an eigenvector: AX = CX

It is easy to see that CX is the result of applying the transformation A to X, and CX obviously has the same direction as X. If X is an eigenvector, then C is the corresponding eigenvalue.

Solving: let A be an N × N square matrix with N linearly independent eigenvectors q_{i} (i = 1, ..., N).

Then A can be decomposed as A = Q \Lambda Q^{-1},

where Q is an N × N square matrix whose i-th column is the i-th eigenvector of A, and \Lambda is a diagonal matrix whose diagonal entries are the corresponding eigenvalues, that is, \Lambda_{ii} = C_{i}.

It should be noted that only diagonalizable matrices can be eigendecomposed in this way; a matrix that cannot be diagonalized has no eigendecomposition.
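A numpy sketch of the decomposition A = Q Λ Q^{-1}; a symmetric matrix is used here so that the eigendecomposition is guaranteed to exist (my own toy example):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [1.0, 3.0]])                   # symmetric, hence diagonalizable

eigenvalues, Q = np.linalg.eig(A)            # columns of Q are the eigenvectors
Lam = np.diag(eigenvalues)                   # Lambda: eigenvalues on the diagonal

print(np.allclose(A, Q @ Lam @ np.linalg.inv(Q)))   # True: A = Q Λ Q^{-1}

# An eigenvector keeps its direction: A q is just a scaled copy of q.
q0 = Q[:, 0]
print(np.allclose(A @ q0, eigenvalues[0] * q0))     # True
```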

Geometric and physical meanings of eigenvalues and eigenvectors:

In space, for a given transformation, the directions indicated by the eigenvectors are what matter most; the eigenvalues are secondary. Although we compute the eigenvalues first, the eigenvectors are the more essential objects. The eigenvectors are those vectors whose direction does not change after the given transformation (multiplication by the matrix), and the eigenvalues are the factors by which those eigenvectors are stretched or shrunk. In other words, if a matrix does not rotate certain vectors but only scales them, then those vectors are the eigenvectors of the matrix and the scaling factors are the eigenvalues.

The physical meaning is motion in the image: an eigenvector is stretched under the action of the matrix, and the amount of stretching is determined by the eigenvalue. If the eigenvalue is greater than 1, every eigenvector belonging to it is stretched; if the eigenvalue is between 0 and 1, the eigenvector shrinks; if the eigenvalue is less than 0, the eigenvector is flipped to the opposite direction (and scaled by the absolute value of the eigenvalue).

Note: textbooks often say that eigenvectors are vectors whose direction does not change under the matrix transformation. In fact, when the eigenvalue is negative, the matrix flips the eigenvector to the opposite direction; of course it is still an eigenvector. I also sympathize with the statement that eigenvectors do not change direction: one can say the eigenvector's line never changes, with the reversal absorbed into the negative eigenvalue. Eigenvectors are linear invariants in this sense.

An important application of eigendecomposition--PCA (Principal Component Analysis):

For example: a classification problem in machine learning. We are given 178 wine samples, each with 13 attributes such as alcohol content, acidity, and magnesium content, and the samples belong to 3 different types of wine. The task is to extract the characteristics of the three wines so that, given a new wine sample, we can judge which type it belongs to based on the existing data.

The raw data has 13 dimensions, but it contains redundancy, and the most direct way to reduce the amount of data is to reduce the dimensionality. Procedure: arrange the data set into a matrix R with 178 rows and 13 columns, subtract the mean and normalize; its covariance matrix C is then a 13 × 13 matrix. Perform an eigendecomposition of C, diagonalizing it as C = UDU^{T}, where U is the matrix of eigenvectors and D is the diagonal matrix of eigenvalues arranged in descending order. Then let R' = RU, which projects the data set onto the orthogonal basis formed by the eigenvectors. Here is the point: the columns of R' are ordered by the size of the corresponding eigenvalues, and the later columns correspond to small eigenvalues, so removing them has little effect on the data set as a whole. For example, we can simply drop the last 7 columns and keep only the first 6, and the dimensionality reduction is done.

This dimensionality-reduction method is called PCA (Principal Component Analysis). After the reduction, the classification error rate is almost the same as without it, but the amount of data to process is cut roughly in half (13 dimensions without reduction versus 6 after). Before deep learning, PCA was commonly used in image processing; it is a very good dimensionality-reduction method!
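A minimal PCA sketch following the recipe above, on random data standing in for the 178 × 13 wine matrix (the variable names and data are illustrative only, not from the original post):

```python
import numpy as np

R = np.random.rand(178, 13)                  # placeholder for the wine data matrix

R = R - R.mean(axis=0)                       # subtract the column means
C = np.cov(R, rowvar=False)                  # 13 x 13 covariance matrix
eigenvalues, U = np.linalg.eigh(C)           # eigh: C is symmetric

order = np.argsort(eigenvalues)[::-1]        # sort eigenvalues in descending order
U = U[:, order]

R_proj = R @ U                               # project onto the eigenvector basis (R' = RU)
R_reduced = R_proj[:, :6]                    # keep the 6 largest-variance directions
print(R_reduced.shape)                       # (178, 6)
```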

Singular Value Decomposition and Its Significance

Singular value decomposition factors a matrix A into the product of three matrices: A = UDV^{T}

Suppose A is an m × n matrix; then U is an m × m matrix, D is an m × n matrix, and V is an n × n matrix. Each of these matrices has a special structure by definition: U and V are orthogonal matrices, and D is a diagonal matrix. Note that D is not necessarily square.

Computing the SVD is more involved; for the details it is recommended to read a dedicated article on singular value decomposition.

The meaning of singular value decomposition:

The meaning of singular value decomposition is to view the matrix A as a linear transformation (it can, of course, also be viewed as a data or sample matrix). The effect of this transformation is the following: we can find a set of orthonormal basis vectors V in the original space and a corresponding set of orthonormal basis vectors U in the target space. We know that to understand what a matrix does, it is enough to look at its effect on a basis, and in an inner-product space we prefer to look at its effect on an orthonormal basis. The effect of A on the orthonormal basis V can be expressed as pure scaling along the corresponding directions of U! This greatly simplifies our understanding of matrices: no matter how complicated a matrix looks, its action on some orthonormal basis is just a scaling onto another orthonormal basis.

For a more detailed description, please see: The meaning of singular values

The same is true of eigendecomposition, which also simplifies our understanding of a matrix: for a diagonalizable matrix, the linear transformation simply stretches certain directions (the eigenvector directions) along those same directions.

With this understanding, we can look at the effect of a matrix on an arbitrary vector x. From the eigendecomposition point of view, we decompose x along the eigenvector directions, scale each component, and add the results back up. From the singular value decomposition point of view, we decompose x along the directions of V, scale each component while mapping it onto the corresponding direction of U, and then add the component results together.

Singular value decomposition has a lot to do with the eigendecomposition mentioned above, and my understanding is:

  • Not every matrix can be diagonalized (symmetric matrices always can), but every matrix has a singular value decomposition. With so many kinds of matrices, we can always view them from this unified, simple perspective, and one cannot help marveling at how wonderful the singular value decomposition is!
  • For the covariance matrix (or X^{T}X), the singular value decomposition coincides with the eigendecomposition, so SVD is one way to implement PCA.
The knowledge above may require some other prerequisites, but I don't think it is necessary to learn them all up front; they are not used much and can be picked up when you meet them. Knowing the main formula, its meaning, and its applications is enough, and the importance is clear at a glance: for matrix-transformation tasks such as dimensionality reduction (PCA) or recommender systems, SVD plays an important role.
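A numpy sketch of A = UDV^T and its link to the eigendecomposition of A^T A (random matrix, purely for illustration):

```python
import numpy as np

A = np.random.rand(5, 3)
U, s, Vt = np.linalg.svd(A, full_matrices=True)   # s holds the singular values

D = np.zeros((5, 3))
D[:3, :3] = np.diag(s)                            # D is diagonal but not square
print(np.allclose(A, U @ D @ Vt))                 # True: A = U D V^T

# The squared singular values are the eigenvalues of A^T A (this is why SVD can implement PCA).
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
print(np.allclose(s ** 2, eigvals))               # True
```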

Moore-Penrose pseudoinverse

For non-square matrices, the inverse is not defined. Suppose we want to solve a linear equation via a left inverse B of matrix A:

Ax=y

Multiplying both sides of the equation on the left by the left inverse B, we get:

x=By

Whether there is a unique mapping from A to B depends on the form of the problem.

If the number of rows of matrix A is greater than the number of columns, the above equation may not have a solution; if the number of rows of matrix A is less than the number of columns, then the above equation may have multiple solutions.

The Moore-Penrose pseudo-inverse allows us to handle this situation. The pseudo-inverse of matrix A is defined as:

A^{+} = \lim_{\alpha \to 0}(A^{T}A + \alpha I)^{-1}A^{T}

But practical algorithms for computing the pseudo-inverse are not based on this definition; instead they use the formula:

A^{+} = VD^{+}U^{T}

where U, D, and V are the matrices obtained from the singular value decomposition of A. The pseudo-inverse D^{+} of the diagonal matrix D is obtained by taking the reciprocal of its non-zero elements and then transposing the resulting matrix.

Note that the pseudo-inverse is again obtained through the singular value decomposition, a nice illustration of how these pieces of knowledge connect. The pseudo-inverse also appears widely in machine learning; for example, the closed-form solution of simple linear regression uses the generalized inverse, i.e., the pseudo-inverse.
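A small sketch using np.linalg.pinv (which computes the Moore-Penrose pseudo-inverse via SVD) on an overdetermined system, i.e. the least-squares setting of linear regression; the data here is random and only illustrative:

```python
import numpy as np

A = np.random.rand(10, 3)            # more rows than columns: Ax = y may have no exact solution
y = np.random.rand(10)

A_pinv = np.linalg.pinv(A)           # Moore-Penrose pseudo-inverse, computed via SVD
x = A_pinv @ y                       # least-squares solution x = A^+ y

x_lstsq, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.allclose(x, x_lstsq))       # True: pinv solution matches the least-squares solution
```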

Trace Operation

The trace operation returns the sum of the diagonal elements of a matrix:

Tr(A) = \sum_{i} A_{i,i}

Trace operations are useful for many reasons. Some matrix operations are difficult to describe without summation notation, but can be expressed clearly using matrix products and the trace. For example, the trace provides another way of describing the Frobenius norm of a matrix: ||A||_{F} = \sqrt{Tr(AA^{T})}

(You don't have to know what it is, just know that there is such an operation. If you are interested, of course you can learn about it)

By writing expressions in terms of the trace operation, we can manipulate them using many useful identities. For example, the trace is invariant under transposition: Tr(A) = Tr(A^{T})

The trace of a square matrix obtained by multiplying several matrices is unchanged if we move the last matrix to the front, provided the product is still well defined after the shift: Tr(ABC) = Tr(CAB) = Tr(BCA).

Trace operations are also commonly used; for example, they play a role in deriving the normal equations.
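A quick numerical check of the trace identities above (random matrices, just for illustration):

```python
import numpy as np

A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
C = np.random.rand(5, 3)

print(np.isclose(np.trace(A @ B @ C), np.trace(C @ A @ B)))   # cyclic property
print(np.isclose(np.trace(A @ B @ C), np.trace(B @ C @ A)))
print(np.isclose(np.linalg.norm(A, 'fro'),
                 np.sqrt(np.trace(A @ A.T))))                 # Frobenius norm via the trace

M = np.random.rand(3, 3)
print(np.isclose(np.trace(M), np.trace(M.T)))                 # invariant under transpose
```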
