Gaussian distribution in deep learning

1 Mathematical expression of Gaussian distribution

1.1 What is Gaussian distribution

The Gaussian distribution, also called the normal distribution, is an important model that is widely used to describe the distribution of continuous random variables, and it occupies an important position in data analysis. Its importance in statistics comes largely from the central limit theorem. The theorem states that the average Y of a set of independent and identically distributed random variables X1, X2, X3, ..., Xn with finite expectation and variance approximately obeys a normal distribution as n approaches infinity. In addition, many physical measurements are the sum of many independent random processes and therefore often have approximately Gaussian distributions.
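The central limit theorem can be checked numerically. The following is a minimal sketch, assuming NumPy is available; the choice of Uniform(0, 1) as the underlying distribution and the values of `n` and `trials` are illustrative assumptions, not part of the original text. It averages non-Gaussian variables and checks that the standardized averages put roughly 68.3% of their mass within one standard deviation, as a normal variable would.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50           # number of i.i.d. variables averaged together
trials = 20000   # number of independent averages to draw

# Each row holds n draws from Uniform(0, 1), which is clearly not Gaussian.
samples = rng.uniform(0.0, 1.0, size=(trials, n))
averages = samples.mean(axis=1)

# Uniform(0, 1) has mean 1/2 and variance 1/12, so the average of n draws
# has mean 1/2 and standard deviation sqrt(1 / (12 n)).
mu = 0.5
sigma = np.sqrt(1.0 / (12.0 * n))
z = (averages - mu) / sigma

# For a standard normal variable, about 68.3% of the mass lies in [-1, 1].
frac_within_one_sigma = np.mean(np.abs(z) <= 1.0)
print(f"fraction within one sigma: {frac_within_one_sigma:.3f}")
```

Increasing `n` should bring the histogram of `z` closer to the bell curve.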

b9e26c4666824f979a031bbcf852ef0d.png

The probability density function of the Gaussian distribution is bell-shaped, so it is often called a bell curve. A random variable X that obeys a Gaussian distribution with mathematical expectation μ and variance σ^2 is written X ∼ N(μ, σ^2). The expectation μ gives the center of the bell (that is, the position of the curve), and the standard deviation σ gives the degree of dispersion of the curve.

ffcc84892ff2471ba32cec367addd9d4.png

When the mathematical expectation is 0 (μ = 0) and the variance is 1 (σ^2 = 1), the distribution is the standard normal distribution. The figure below shows the probability density function curves of several normal distributions with different parameters.

a22a739e570b48579be86dfdc530da83.png

1.2 Key concepts

  • Probability function: expresses the probability of an event as a function of event variables

  • Probability distribution function: the probability that a random variable ξ takes a value no greater than x. This probability is a function of x, called the distribution function of ξ and written F(x), that is, F(x) = P(ξ ≤ x) (-∞ < x < +∞). From it, the probability that the random variable falls into any interval can be determined, for example P(a < ξ ≤ b) = F(b) - F(a).

  • Probability density function: describes how likely a continuous random variable is to take values near a given point. The probability that the variable falls in an interval, divided by the length of that interval, gives the average density over the interval; the probability density at a point is the limit of this ratio as the interval length shrinks to zero.
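The relationship between the distribution function and the density can be illustrated with the normal distribution. This is a minimal sketch using only the Python standard library; the function names `normal_cdf` and `normal_pdf` are illustrative, and the closed form of the distribution function in terms of `math.erf` is the standard one for the normal distribution.

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """Distribution function F(x) = P(X <= x) of N(mu, sigma^2)."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Probability density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Probability of falling into an interval, from the distribution function.
p_interval = normal_cdf(1.0) - normal_cdf(-1.0)

# Density as probability per unit length over a very short interval.
x0, h = 0.5, 1e-4
ratio = (normal_cdf(x0 + h) - normal_cdf(x0)) / h

print(p_interval, ratio, normal_pdf(x0))
```

As `h` shrinks, `ratio` approaches `normal_pdf(x0)`, which is the sense in which the density is a probability per unit length.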

1.3 Univariate Gaussian distribution

If the random variable X obeys a Gaussian distribution with mean μ and variance σ^2, then:

93893a32f7874ece9e9f3b5d73546781.png
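For reference, the standard form of the univariate Gaussian probability density, which the formula image above presumably shows, is:

```latex
f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}
       \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right),
\qquad -\infty < x < +\infty
```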

The graph of the Gaussian density is bell-shaped. The figure below shows the standard normal distribution, where μ = 0 and σ = 1.
baaca883a73640c9a291d4c35e7574ce.png

A non-standard normal distribution can be obtained from the standard normal distribution through the following three steps:

  • Shift the curve μ units to the right

  • Stretch the density function along the x-axis by a factor of σ, about the new center x = μ

  • Compress the density function along the y-axis by a factor of σ, so that the area under the curve remains 1
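The three steps above amount to evaluating the standard normal density at (x - μ)/σ and dividing by σ. A minimal sketch, assuming NumPy and the usual closed-form density (function names and parameter values are illustrative):

```python
import numpy as np

def standard_normal_pdf(z):
    """Density of N(0, 1)."""
    return np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), written out directly."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

mu, sigma = 1.5, 2.0
x = np.linspace(-8.0, 11.0, 2001)

# Shift by mu and stretch the x-axis by sigma: evaluate phi at (x - mu) / sigma.
# Compress the y-axis by sigma: divide the result by sigma.
transformed = standard_normal_pdf((x - mu) / sigma) / sigma

direct = normal_pdf(x, mu, sigma)
max_abs_diff = np.max(np.abs(transformed - direct))

# The area under the transformed curve is still approximately 1 (Riemann sum).
dx = x[1] - x[0]
area = np.sum(transformed) * dx

print(max_abs_diff, area)
```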

If X obeys the distribution X ∼ N(μ, σ^2), then it has the following properties:

3b2f8862899240668e2b75938e50a72f.png

1.4 Multivariate Gaussian distribution

1.4.1 Independent multivariate Gaussian distribution

f7a2d27f4dea419694850a77874e5628.png

  If we let:

77621a1fa4dc4b63baa7d6755a910237.png

  We have:

87dce74c1b944321967703885e08f879.png

  Expressed in matrix form, this becomes:

114ef4774692427da9f1cc60a4e185cb.png

  Define symbols:

5ccb02dbf1ad49d18d24720b5577a514.png

7a0d7999ed70444c8cd8d7f142115f86.png

1baf5c7d21e146d7bc6448b49bcce6d1.png

  Substituting these symbols, we get:

3c11075cb50d40eda2c11ee93998281c.png
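The formula images in this subsection are not reproduced here, so the following sketch assumes the derivation ends at the usual matrix form of the density, p(x) = exp(-(x - μ)^T Σ^-1 (x - μ) / 2) / sqrt((2π)^d |Σ|), with a diagonal covariance matrix Σ. Under that assumption it checks numerically that the product of univariate densities equals the matrix form. The function names and the example values are illustrative.

```python
import numpy as np

def univariate_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2), applied elementwise."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def multivariate_pdf(x, mu, cov):
    """Matrix form of the d-dimensional Gaussian density."""
    d = mu.shape[0]
    diff = x - mu
    # Solve cov @ v = diff instead of forming the inverse explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    norm_const = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis_sq) / norm_const

mu = np.array([1.0, -2.0, 0.5])
sigma = np.array([0.5, 2.0, 1.0])   # per-variable standard deviations
cov = np.diag(sigma ** 2)           # diagonal covariance: independent variables
x = np.array([0.7, -1.0, 1.2])

product_form = np.prod(univariate_pdf(x, mu, sigma))
matrix_form = multivariate_pdf(x, mu, cov)

print(product_form, matrix_form)
```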

The following takes x = [x_1, x_2] as an example and plots the bivariate Gaussian distribution when the variables are independent:

f4f850e784dd42dc8b6b65920a8af2e7.png

5051d079d57d4e36929f8f62f88167c4.png

e991ab49488a47758164143ecafe3f32.png

7bfbdd39dead4c008dc2bc2586d08200.png

766afa687e944b459373c13083ea4272.png

0e4ffd072adb4756bfa2d886cdabf51a.png

606eff5a5d384401aadc3d4f44ad8c71.png

d99bca3882f343738907498fe303caf9.png

As can be seen from the figures above, when the variables are independent of each other:

  • The smaller the eigenvalues of the covariance matrix, the taller and sharper the graph of the density function.

  • When the eigenvalues of the covariance matrix are equal, the projection of the density function onto the X1-X2 plane is circular. When the eigenvalues are not equal, the projection is elliptical. When X1 and X2 are independent of each other, the major and minor axes of the ellipse are parallel to the coordinate axes, and the eigenvalues are simply the variances of X1 and X2. The larger the eigenvalue corresponding to a variable, the more dispersed that variable is. In the bivariate Gaussian distribution, the variable with the larger eigenvalue corresponds to the major axis of the ellipse in the projection. Higher-dimensional Gaussian distributions follow the same rule.
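These observations can be checked directly from the covariance matrix. The sketch below assumes the standard bivariate density, whose value at the mean is 1 / (2π sqrt(|Σ|)) and whose contour at Mahalanobis distance 1 has semi-axes equal to the square roots of the eigenvalues. The helper names and the three example matrices are illustrative.

```python
import numpy as np

def peak_height(cov):
    """Value of the bivariate Gaussian density at its mean."""
    return 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))

def contour_semi_axes(cov):
    """Semi-axis lengths of the contour at Mahalanobis distance 1, ascending."""
    eigenvalues = np.linalg.eigvalsh(cov)   # returned in ascending order
    return np.sqrt(eigenvalues)

cov_round = np.diag([1.0, 1.0])     # equal eigenvalues: circular contours
cov_small = np.diag([0.25, 0.25])   # smaller eigenvalues: taller, sharper peak
cov_ellipse = np.diag([4.0, 1.0])   # unequal eigenvalues: elliptical contours

for name, cov in [("round", cov_round), ("small", cov_small), ("ellipse", cov_ellipse)]:
    print(name, peak_height(cov), contour_semi_axes(cov))
```

In `cov_ellipse`, the variable with variance 4 lies along the major axis, which is twice as long as the minor axis.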

1.4.2 Multivariate Gaussian distribution with correlated variables

When the variables are correlated, the covariance matrix is no longer diagonal. It is still symmetric, but it now has nonzero off-diagonal elements: each element σ_{i,j}^2 of the matrix represents the covariance of variables i and j.

152c881da9bd4db6a0bbddff0b9c9bad.png

2d5ffd4b31f3485baa373da7225d55d1.png

1dec202296a3402e83241281a5731dc9.png

6e222cb944e749279ff028edb7def06b.png

As can be seen from the figures above, the biggest difference between correlated variables and independent variables is that the major and minor axes of the projected ellipse are no longer parallel to the coordinate axes. If we rotate the coordinate axes X1 and X2 to be parallel to the major and minor axes of the ellipse, as shown in the figure below:

b744d7fc5fc14e7593b00cbdff6f4a2f.png

then, as is known from the bivariate Gaussian distribution of independent variables, x_1' and x_2' are independent of each other in the new coordinate system. This process is called decorrelation, and it is also the basis of principal component analysis (PCA), the classic dimensionality reduction method. The following gives the solution for the new coordinate system and the expression for the coordinates, in the new system, of points given in the original coordinate system.
Solve the characteristic equation of the covariance matrix for its eigenvectors, then orthogonalize and normalize them to obtain orthonormal eigenvectors:

39f8c63f1b724c2d975ffb89616196e3.png

55471f5b51bc4c309e7d757ade9e269f.png

At this point, there is no correlation between x_1' and x_2'.
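The decorrelation step can be sketched numerically. The formulas in the images above are not reproduced here, so this example assumes the common convention x' = U^T (x - μ), where the columns of U are the orthonormal eigenvectors of the covariance matrix. The covariance values are illustrative, and `np.linalg.eigh` is used because it returns orthonormal eigenvectors of a symmetric matrix directly, so no separate orthogonalization is needed.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -1.0])
cov = np.array([[2.0, 1.2],
                [1.2, 1.0]])   # nonzero off-diagonal: x1 and x2 are correlated

# Columns of U are orthonormal eigenvectors; eigenvalues come in ascending order.
eigenvalues, U = np.linalg.eigh(cov)

# Coordinates in the rotated system: center the points, then project
# each one onto the eigenvectors.
X = rng.multivariate_normal(mu, cov, size=100_000)
X_rot = (X - mu) @ U

cov_rot_exact = U.T @ cov @ U                  # diagonal matrix of eigenvalues
cov_rot_sample = np.cov(X_rot, rowvar=False)   # estimated from the samples

print(np.round(cov_rot_exact, 6))
print(np.round(cov_rot_sample, 3))
```

The off-diagonal entries of both matrices are zero (up to sampling noise for the second), and the diagonal entries are the eigenvalues, that is, the variances along the rotated axes. Keeping only the axes with the largest eigenvalues is the idea behind PCA.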

2 The role of Gaussian distribution in deep learning

822cc8dd9b354d288a4253eece2504ad.webp

2.1 Reasons why Gaussian distribution is widely used

The Gaussian distribution (also known as the normal distribution or bell curve) is widely used in deep learning for the following reasons:

  • Central limit theorem: The most important mathematical property behind the Gaussian distribution is the central limit theorem. It states that the suitably normalized sum of a large number of independent random variables with finite variance tends toward a Gaussian distribution. This means that in practical problems, many phenomena can be approximately described by a Gaussian distribution.

  • Parametric flexibility: The Gaussian distribution has two parameters, the mean and the standard deviation, which set the location and the width of the distribution. This lets it be fitted to data sets with different centers and spreads.

  • Centrality and dispersion measures: The Gaussian distribution is symmetric about its mean, and its mean, median, and mode coincide, so the mean is a natural measure of the center of the data. In addition, the standard deviation, as the second parameter of the Gaussian distribution, measures the dispersion of the data.

  • Maximum likelihood estimation: In probability and statistics, maximum likelihood estimation is a commonly used parameter estimation method. For the Gaussian distribution, the maximum likelihood estimates have a closed form (the sample mean and the sample variance), which makes the distribution convenient to apply.
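As an illustration of the last point, the maximum likelihood estimates for a Gaussian are the sample mean and the square root of the average squared deviation (dividing by n). A minimal sketch on synthetic data, assuming NumPy; the true parameters and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

true_mu, true_sigma = 2.0, 1.5
data = rng.normal(true_mu, true_sigma, size=50_000)

# Closed-form maximum likelihood estimates for a Gaussian.
mu_hat = data.mean()
sigma_hat = np.sqrt(np.mean((data - mu_hat) ** 2))   # divides by n, not n - 1

def log_likelihood(x, mu, sigma):
    """Total Gaussian log-likelihood of the samples x."""
    n = x.size
    return (-0.5 * n * np.log(2.0 * np.pi * sigma**2)
            - 0.5 * np.sum((x - mu) ** 2) / sigma**2)

# The estimates should score at least as well as any other parameter pair.
best = log_likelihood(data, mu_hat, sigma_hat)
perturbed = log_likelihood(data, mu_hat + 0.1, sigma_hat * 1.1)

print(mu_hat, sigma_hat, best > perturbed)
```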

In practice, Gaussian distributions appear frequently in natural and social phenomena. Many such phenomena are stochastic and can be described, at least approximately, by Gaussian distributions. For example, the Gaussian distribution is widely used in fields such as measurement error, demographics, and financial market fluctuations.

2.2 Application scenarios of Gaussian distribution

615baebf530f487bb1929642bf4d6abc.png

The Gaussian distribution (also known as the normal distribution) plays several important roles in deep learning models. The following are some main application scenarios:

  • Parameter initialization: At the start of training, the weights of a neural network need to be initialized. Drawing them from a zero-mean Gaussian distribution (often the standard normal distribution scaled by a small factor) keeps the initial weights neither too large nor too small, which helps avoid saturating the activation functions in the early stages of training.

  • Regularization: In some cases, a Gaussian distribution is used as a prior distribution over the weights and enters the loss function as a regularization term. A zero-mean Gaussian prior corresponds to L2 regularization, which helps prevent overfitting by constraining the size of the weights.

  • Generative models: In generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs), random noise vectors in the latent space are often sampled from a Gaussian distribution. These noise vectors are then used to generate data (such as images).

  • Probabilistic modeling: In many probabilistic deep learning models, a Gaussian distribution is used to model the output variables, especially when dealing with continuous values (such as regression problems).

  • Activation function: Although less common, in some special network structures a Gaussian function can be used as the activation function to simulate specific biological neural network behavior.

  • Uncertainty estimation: In Bayesian neural networks, weights and biases are treated as random variables, and Gaussian distributions are usually used to describe their uncertainty. This approach can provide estimates of the uncertainty in model predictions.

  • Feature extraction: In some image processing techniques, such as Gaussian blur, a Gaussian function is used as the weighting kernel, which can help the model extract image features better during training.
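Three of the scenarios above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the API of any particular deep learning framework: the layer sizes, the He-style scaling sqrt(2 / fan_in), the prior width `tau`, and all function names are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1) Parameter initialization: zero-mean Gaussian weights whose standard
#    deviation shrinks with the fan-in (He-style scaling, one common choice).
fan_in, fan_out = 256, 128
init_std = np.sqrt(2.0 / fan_in)
W = rng.normal(loc=0.0, scale=init_std, size=(fan_in, fan_out))

# 2) Regularization: the negative log-density of a zero-mean Gaussian prior
#    N(0, tau^2) on the weights is an L2 penalty plus a constant.
def neg_log_gaussian_prior(w, tau):
    const = 0.5 * w.size * np.log(2.0 * np.pi * tau**2)
    return 0.5 * np.sum(w**2) / tau**2 + const

def l2_penalty(w, lam):
    return lam * np.sum(w**2)

tau = 0.1
lam = 1.0 / (2.0 * tau**2)   # matching L2 coefficient
prior_const = 0.5 * W.size * np.log(2.0 * np.pi * tau**2)

# 3) Probabilistic modeling: with a fixed-variance Gaussian output model,
#    the negative log-likelihood is a scaled squared error plus a constant.
def gaussian_nll(y, y_pred, sigma):
    n = y.size
    return (0.5 * np.sum((y - y_pred) ** 2) / sigma**2
            + 0.5 * n * np.log(2.0 * np.pi * sigma**2))

y = rng.normal(size=100)
pred_a = np.zeros(100)
pred_b = np.full(100, 0.3)
sigma = 0.5
nll_gap = gaussian_nll(y, pred_a, sigma) - gaussian_nll(y, pred_b, sigma)
sse_gap = np.sum((y - pred_a) ** 2) - np.sum((y - pred_b) ** 2)

print(W.std(), nll_gap, sse_gap / (2.0 * sigma**2))
```

Because the constants do not depend on the weights or the predictions, minimizing the Gaussian negative log-likelihood is equivalent to minimizing the squared error, and adding the negative log-prior is equivalent to adding an L2 penalty with coefficient 1 / (2 tau^2).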

ef35caf1b5f44080b34af6b43eefc303.png

The Gaussian distribution has become an important tool in deep learning due to its mathematical properties and universality in nature. It is particularly important in dealing with uncertainty, regularization and probabilistic modeling.

Origin blog.csdn.net/lsb2002/article/details/134935811