1 Mathematical expression of Gaussian distribution
1.1 What is Gaussian distribution
The Gaussian distribution is also called the normal distribution. It is an important model, widely used to describe the distribution of continuous random variables, and it occupies a central position in data analysis and statistics, largely because of the central limit theorem. The central limit theorem states that the average Y of a set of independent and identically distributed random variables X1, X2, X3, ..., Xn with finite mathematical expectation and variance approximately obeys a normal distribution as n approaches infinity. In addition, many physical measurements are the sum of many independent random processes and therefore often follow Gaussian distributions.
The probability density function curve of the Gaussian distribution is bell-shaped, so it is often called a bell curve. A random variable X that obeys a Gaussian distribution with mathematical expectation μ and variance σ^2 is written as X ~ N(μ, σ^2). The mathematical expectation μ determines the center of the bell (the horizontal position of the curve), and the standard deviation σ determines the degree of dispersion (the width) of the curve.
When the mathematical expectation is 0 (μ = 0) and the variance is 1 (σ^2 = 1), the distribution is the standard normal distribution. The figure below shows several probability density function curves of the normal distribution with different parameters.
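As a quick check of these definitions, the density can be evaluated directly from its formula. The helper below is an illustrative sketch, not a library function:

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(mu, sigma^2) evaluated at x."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))

# The standard normal density peaks at x = 0 with value 1/sqrt(2*pi) ~ 0.3989.
peak = normal_pdf(0.0)
# Shifting mu moves the peak; shrinking sigma makes the curve taller and narrower.
shifted_peak = normal_pdf(2.0, mu=2.0, sigma=0.5)
```

Plotting `normal_pdf` over a grid of x values for several (μ, σ) pairs reproduces the family of bell curves described above.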
1.2 Key concepts
- Probability function: expresses the probability of an event as a function of the event variable.
- Probability distribution function: the probability that the value of a random variable ξ is less than a given value x. This probability is a function of x, called the distribution function of ξ (or distribution function for short), written F(x) = P(ξ < x) (-∞ < x < +∞). The distribution function determines the probability that the random variable falls into any interval.
- Probability density function: a function that describes the likelihood of a random variable taking values near a given point. Over an interval (a range of values of the event), the average probability density equals the total probability of the interval divided by the length of the interval; the density at a point is the limit of this ratio as the interval shrinks.
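The relationship between these concepts can be made concrete for the Gaussian case: the distribution function is the accumulated area under the density. A small numerical sketch, using the standard identity F(x) = (1 + erf((x - μ)/(σ√2)))/2:

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function F(x) = P(X < x) for N(mu, sigma^2), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2)) / (sigma * math.sqrt(2.0 * math.pi))

# Summing the density over small steps approximates the distribution function,
# illustrating that F is the integral of f.
x_target, step = 1.0, 1e-3
riemann = sum(normal_pdf(-8.0 + i * step) * step
              for i in range(int((x_target + 8.0) / step)))
# riemann closely matches normal_cdf(1.0).
```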
1.3 Univariate Gaussian distribution
If a random variable X obeys a Gaussian distribution with mean μ and variance σ^2, then its probability density function is:

f(x) = (1 / (σ√(2π))) · exp(-(x - μ)^2 / (2σ^2))
The graph of the Gaussian density is shaped like a bell. The figure below shows the standard normal distribution, where μ = 0 and σ = 1.
A non-standard normal density can be obtained from the standard normal density through the following three steps:
- Shift the curve μ units to the right along the x-axis (to the left if μ is negative).
- Stretch the x-axis of the density curve by a factor of σ.
- Compress the y-axis of the density curve by a factor of σ (so that the total area under the curve remains 1).
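In terms of samples rather than curves, the same three steps correspond to the affine map x = μ + σz applied to a standard normal variable z. A small simulation sketch (parameter values are arbitrary):

```python
import random

random.seed(0)

mu, sigma = 3.0, 2.0
# Draw standard normal samples, then apply z -> mu + sigma * z,
# which shifts the center to mu and stretches the spread by sigma.
z = [random.gauss(0.0, 1.0) for _ in range(100_000)]
x = [mu + sigma * zi for zi in z]

mean_x = sum(x) / len(x)
var_x = sum((xi - mean_x) ** 2 for xi in x) / len(x)
# mean_x is close to 3.0 and var_x close to sigma^2 = 4.0.
```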
If X obeys the distribution X ∼ N(μ, σ^2), then it has the following properties:
- Linear transformation: for constants a ≠ 0 and b, aX + b ∼ N(aμ + b, a^2σ^2).
- Standardization: Z = (X - μ)/σ ∼ N(0, 1).
1.4 Multivariate Gaussian distribution
1.4.1 Independent multivariate Gaussian distribution
Suppose X1, X2, ..., Xn are mutually independent, with Xi ∼ N(μi, σi^2). Because the variables are independent, the joint density is the product of the univariate densities:

p(x1, ..., xn) = ∏ p(xi) = (1 / ((2π)^(n/2) σ1σ2...σn)) · exp(-(1/2) Σ (xi - μi)^2 / σi^2)

If expressed in matrix form, define the symbols:

x = (x1, ..., xn)^T, μ = (μ1, ..., μn)^T, Σ = diag(σ1^2, ..., σn^2)

By substituting these variables, we can get:

p(x) = (1 / ((2π)^(n/2) |Σ|^(1/2))) · exp(-(1/2) (x - μ)^T Σ^(-1) (x - μ))

where |Σ| is the determinant of the covariance matrix Σ.
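The product form and the matrix form can be checked against each other numerically. The sketch below evaluates both for a two-variable case with a diagonal covariance (the point and parameter values are arbitrary):

```python
import math

def pdf_1d(x, mu, sigma):
    """Univariate Gaussian density."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def pdf_diag(x, mu, sigmas):
    """Matrix-form density with diagonal covariance Sigma = diag(sigmas[i]^2).
    For a diagonal matrix, |Sigma| is the product of its diagonal entries and
    the quadratic form reduces to a weighted sum of squares."""
    n = len(x)
    det = 1.0
    quad = 0.0
    for xi, mi, si in zip(x, mu, sigmas):
        det *= si ** 2
        quad += (xi - mi) ** 2 / si ** 2
    return math.exp(-0.5 * quad) / ((2 * math.pi) ** (n / 2) * math.sqrt(det))

point, mus, sigmas = [1.0, -0.5], [0.0, 1.0], [1.0, 2.0]
product_form = pdf_1d(point[0], mus[0], sigmas[0]) * pdf_1d(point[1], mus[1], sigmas[1])
matrix_form = pdf_diag(point, mus, sigmas)
# The two expressions agree when the components are independent.
```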
The following takes a binary (two-variable) Gaussian distribution with mutually independent variables as an example and plots its density surface:
As can be seen from the above figure, when the variables are independent of each other:
- The smaller the eigenvalues of the covariance matrix, the taller and sharper the density surface becomes.
- When the eigenvalues of the covariance matrix are equal, the projection of the density surface onto the X1, X2 plane is circular; when the eigenvalues are not equal, the projection is elliptical. When X1 and X2 are independent of each other, the major and minor axes of the ellipse are parallel to the coordinate axes. The larger the eigenvalue corresponding to a variable, the more dispersed that variable's distribution: in the binary Gaussian distribution, the variable with the larger eigenvalue corresponds to the major axis of the ellipse in the projection. Higher-dimensional Gaussian distributions generalize according to the same rule.
1.4.2 Gaussian distribution of multivariate correlation variables
When there is correlation between the variables, the covariance matrix is no longer a diagonal matrix but a general symmetric matrix, in which the off-diagonal element in row i and column j represents the covariance between variables Xi and Xj.
As can be seen from the above two images, when the variables are correlated, the biggest difference from the independent case is that the major and minor axes of the projected ellipse are no longer parallel to the coordinate axes. If we rotate the coordinate axes X1 and X2 so that they are parallel to the major and minor axes of the ellipse, as shown in the figure below:
then, as is known from the binary Gaussian distribution with independent variables, the new coordinates are independent of each other under the new coordinate system. This process is called decorrelation, and it is also the basis of the classic dimensionality-reduction method principal component analysis (PCA). What follows is the solution for the new coordinate system, and the expression of the coordinates of points from the original coordinate system in the new one.
Solve for the unit orthogonal eigenvectors of the covariance matrix from its characteristic equation (first find the eigenvectors, then orthogonalize and normalize them); these eigenvectors define the axes of the new coordinate system. At this point there is no correlation between the transformed coordinates.
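For a 2x2 symmetric covariance matrix, the eigendecomposition has a closed form, and rotating into the eigenvector basis diagonalizes the matrix. A minimal sketch with arbitrary illustrative values:

```python
import math

# A symmetric covariance matrix with correlated components (illustrative values).
a, b, c = 2.0, 0.8, 1.0   # Sigma = [[a, b], [b, c]]

# Closed-form eigenvalues of a 2x2 symmetric matrix.
half_trace = (a + c) / 2.0
radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
lam1, lam2 = half_trace + radius, half_trace - radius

# The rotation angle that aligns the axes with the eigenvectors.
theta = 0.5 * math.atan2(2.0 * b, a - c)
v1 = (math.cos(theta), math.sin(theta))    # unit eigenvector for lam1
v2 = (-math.sin(theta), math.cos(theta))   # perpendicular eigenvector for lam2

def quad(u, v):
    """Bilinear form u^T Sigma v."""
    return u[0] * (a * v[0] + b * v[1]) + u[1] * (b * v[0] + c * v[1])

# In the eigenvector basis, V^T Sigma V = diag(lam1, lam2):
d11, d22, d12 = quad(v1, v1), quad(v2, v2), quad(v1, v2)
# d12 is (numerically) zero, so the rotated coordinates are uncorrelated.
```

The same diagonalization is what PCA performs: the eigenvectors give the new axes, and the eigenvalues give the variances along them.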
2 The role of Gaussian distribution in deep learning
2.1 Reasons why Gaussian distribution is widely used
The Gaussian distribution (also known as the normal distribution or bell curve) is widely used in deep learning for the following reasons:
- Central limit theorem: the most important mathematical property behind the Gaussian distribution. The theorem states that the sum (or average) of many independent, identically distributed random variables tends toward a Gaussian distribution. This means that in practical problems, many phenomena can be approximately described by a Gaussian distribution.
- Parametric flexibility: the Gaussian distribution has two parameters, the mean and the standard deviation, through which the location and shape of the distribution can be flexibly adjusted. This enables the Gaussian distribution to adapt to the characteristics of different data sets and gives it strong fitting ability.
- Centrality and dispersion measures: the Gaussian distribution is symmetric, with its mean and median equal, making the mean a natural measure of the center of a data set. In addition, the standard deviation measures the dispersion of the data.
- Maximum likelihood estimation: in probability and statistics, maximum likelihood estimation is a commonly used parameter estimation method. The maximum likelihood estimates of the Gaussian parameters have simple closed forms (the sample mean and sample variance), which makes the Gaussian distribution convenient to apply.
In practical terms, the Gaussian distribution occurs with high frequency in natural and social phenomena. Many such phenomena are stochastic and can be described by Gaussian distributions; for example, measurement errors, demographic quantities, and financial market fluctuations are often modeled with Gaussian distributions.
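The central limit theorem can be seen in a small simulation: averages of uniform variables, which are not Gaussian at all individually, quickly look Gaussian. A sketch with arbitrary sample sizes:

```python
import random
import statistics

random.seed(1)

# Average n uniform(0, 1) variables; the CLT predicts the average is approximately
# normal with mean 1/2 and variance (1/12)/n.
n, trials = 30, 20_000
averages = [sum(random.random() for _ in range(n)) / n for _ in range(trials)]

mean_hat = statistics.fmean(averages)
std_hat = statistics.pstdev(averages)
predicted_std = (1.0 / 12.0 / n) ** 0.5

# For a normal distribution, about 68% of values fall within one standard deviation.
within_one_sd = sum(abs(v - mean_hat) <= std_hat for v in averages) / trials
```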
2.2 Application scenarios of Gaussian distribution
The Gaussian distribution (also known as the normal distribution) plays several important roles in deep learning models. The following are some main application scenarios:
- Parameter initialization: at the beginning of training, a neural network's weights usually need to be initialized. Drawing the initial weights from a Gaussian distribution (often a scaled variant of the standard normal) can help avoid saturating the activation functions early in training, ensuring that the initial weights are neither too large nor too small.
- Regularization: in some cases, a Gaussian distribution is used as a prior distribution and enters the loss function as a regularization term. This kind of regularization (such as L2 regularization) can help prevent overfitting by constraining the size of the weights.
- Generative models: in generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs), Gaussian distributions are often used to generate random noise in the latent space. These noise vectors are then used to generate data (such as images).
- Probabilistic modeling: in many probabilistic deep learning models, the Gaussian distribution is used to model the output variables, especially when dealing with continuous values (such as regression problems).
- Activation function: although less common, some special network structures use a Gaussian function as an activation function to simulate specific biological neural network behavior.
- Uncertainty estimation: in Bayesian neural networks, weights and biases are treated as random variables, and Gaussian distributions are usually used to describe their uncertainty. This approach can provide estimates of the uncertainty in model predictions.
- Feature extraction: in some image-processing techniques, such as Gaussian blur, using a Gaussian distribution as the weight kernel can help the model better extract image features during training.
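The parameter-initialization scenario above can be sketched in a few lines. This is a minimal illustration, not a framework implementation; the scale sqrt(2 / (fan_in + fan_out)) follows the Xavier/Glorot heuristic, and the function name is an assumption for this example:

```python
import math
import random

random.seed(42)

def gaussian_init(fan_in, fan_out):
    """Sketch of Gaussian weight initialization (Xavier/Glorot-style scale).
    The standard deviation sqrt(2 / (fan_in + fan_out)) keeps signal variance
    roughly stable as activations pass through the layer."""
    std = math.sqrt(2.0 / (fan_in + fan_out))
    return [[random.gauss(0.0, std) for _ in range(fan_out)]
            for _ in range(fan_in)]

weights = gaussian_init(256, 128)
flat = [w for row in weights for w in row]
# The empirical spread of the drawn weights matches the target scale.
empirical_std = (sum(w * w for w in flat) / len(flat)) ** 0.5
target_std = math.sqrt(2.0 / (256 + 128))
```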
The Gaussian distribution has become an important tool in deep learning due to its mathematical properties and universality in nature. It is particularly important in dealing with uncertainty, regularization and probabilistic modeling.