Fundamental issue
Our focus is on prior probabilities.
During training, the sample distributions of the training set and the test set should be approximately the same, so that their prior probabilities are approximately the same.
Probability density estimation problem
Naive Bayes classification problem
1. Restrictions:
2. Application example:
Spam classification divides emails into two classes: spam and not spam. During training, each sample is an email file, and its basic unit is the word.
Need to learn:
3. Calculation of P(d|c)
To avoid the case where a word that never appears in the training samples shows up in a test sample, which would make p equal to 0, we make the following improvement:
4. Judgment:
Then d belongs to C1; otherwise it belongs to C2.
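A minimal sketch of the whole pipeline in Python, with add-one (Laplace) smoothing so that unseen words never force the probability to 0. The toy emails and the helper name `train_nb` are invented for illustration:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Estimate priors P(c) and smoothed likelihoods P(w|c) from word lists."""
    priors = Counter(labels)
    total = len(labels)
    word_counts = {c: Counter() for c in priors}
    for words, c in zip(docs, labels):
        word_counts[c].update(words)
    vocab_size = len({w for words in docs for w in words})

    def log_posterior(words, c):
        lp = math.log(priors[c] / total)          # ln P(c)
        n_c = sum(word_counts[c].values())
        for w in words:
            # add-one smoothing: unseen words get a small but nonzero P(w|c)
            lp += math.log((word_counts[c][w] + 1) / (n_c + vocab_size))
        return lp

    return log_posterior, priors

docs = [["cheap", "pills", "buy"], ["meeting", "at", "noon"],
        ["buy", "cheap", "now"], ["lunch", "meeting", "agenda"]]
labels = ["spam", "ham", "spam", "ham"]
log_posterior, priors = train_nb(docs, labels)
pred = max(priors, key=lambda c: log_posterior(["buy", "pills"], c))
```

The decision rule compares ln P(c) + Σ ln P(w|c) across the two classes, which matches the judgment step above.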
Gaussian probability density estimate
Here, taking ln turns the product in p(x|c) into a sum.
where μ is the mean and Σ is the covariance matrix.
Summary: steps for Gaussian probability density estimation
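The summarized steps can be sketched as follows, simplifying to a diagonal covariance so each dimension is an independent 1-D Gaussian; the two toy classes are invented:

```python
import math

def fit_gaussian(samples):
    """Per-dimension mean and variance (diagonal-covariance simplification)."""
    n, d = len(samples), len(samples[0])
    mu = [sum(x[j] for x in samples) / n for j in range(d)]
    var = [sum((x[j] - mu[j]) ** 2 for x in samples) / n for j in range(d)]
    return mu, var

def log_likelihood(x, mu, var):
    # ln p(x|c): the logarithm turns the product over dimensions into a sum
    return sum(-0.5 * math.log(2 * math.pi * v) - (xj - m) ** 2 / (2 * v)
               for xj, m, v in zip(x, mu, var))

# Two invented 2-D classes
class_a = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0], [0.1, 0.2]]
class_b = [[3.0, 3.1], [2.8, 2.9], [3.2, 3.0], [3.1, 2.8]]
params = {"a": fit_gaussian(class_a), "b": fit_gaussian(class_b)}
pred = max(params, key=lambda c: log_likelihood([2.9, 3.0], *params[c]))
```

With a full covariance matrix Σ the per-class fit would use the multivariate density with μ and Σ as in the formula above; the diagonal case keeps the sketch short.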
Gaussian Mixture Model
When a single Gaussian model cannot fit the data, we use several Gaussian models together. This is a non-convex problem; it often cannot reach the global optimum, only a local optimum.
Taking partial derivatives directly would be particularly complicated, so the following methods are used instead:
The first two methods apply to all local-optimum problems, while the EM algorithm can only handle a certain class of local-extremum problems.
To fit the gradient-descent framework, we take the negative of E, so that the maximization problem becomes a minimization problem.
The EM algorithm has the following three advantages:
1. No parameters to tune
2. Simple to program
3. Elegant theory
EM algorithm
(1) Basic idea:
This idea is not exactly the same as the EM algorithm: it uses hard assignment.
The problem we face is a chicken-and-egg problem. If we knew which region the sample points cluster in, we could find the Gaussian distribution there and obtain μ and Σ; conversely, to find the Gaussian distributions we need to look at the density of the sample points. For such a problem, we can first randomly assume the Gaussian model each point belongs to and then iterate: under hard assignment, at each step every point is reassigned to the Gaussian model that gives it the higher probability, and the iteration continues. (In this form the algorithm is similar to K-means.)
When we don't know the attributes or parameters of the data, we first randomize an attribute or parameter and then iterate.
Hard assignment: if p1 > p2, then x belongs to model No. 1.
Soft assignment:
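The two rules can be contrasted on a toy 1-D two-component mixture (all parameters invented for illustration):

```python
import math

def gauss(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Two components with invented parameters and equal weights
mus, variances, weights = [0.0, 4.0], [1.0, 1.0], [0.5, 0.5]
x = 1.0
p = [w * gauss(x, m, v) for w, m, v in zip(weights, mus, variances)]

hard = p.index(max(p))             # hard: x belongs entirely to one model
soft = [pj / sum(p) for pj in p]   # soft: responsibilities, summing to 1
```

Hard assignment keeps only the winning model; soft assignment keeps the full posterior, which is what the EM algorithm for the mixture uses.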
(2) EM algorithm of Gaussian mixture model
Here, N denotes the Gaussian (normal) density function.
The N samples are apportioned among the k Gaussian distributions (in the form of these probabilities), and the effective counts N_k sum to N.
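A minimal 1-D sketch of the EM iteration for a Gaussian mixture (real MFCC data is multivariate; the initial means here are simply the data extremes, an invented choice for the example):

```python
import math

def em_gmm_1d(xs, init_means, iters=50):
    """EM for a 1-D Gaussian mixture; the number of initial means fixes k."""
    mus = list(init_means)
    k = len(mus)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities gamma[i][j] = P(component j | x_i)
        gamma = []
        for x in xs:
            p = [weights[j] * math.exp(-(x - mus[j]) ** 2 / (2 * variances[j]))
                 / math.sqrt(2 * math.pi * variances[j]) for j in range(k)]
            s = sum(p)
            gamma.append([pj / s for pj in p])
        # M-step: nj is the effective sample count N_j; the N_j sum to N
        for j in range(k):
            nj = sum(g[j] for g in gamma)
            mus[j] = sum(g[j] * x for g, x in zip(gamma, xs)) / nj
            variances[j] = sum(g[j] * (x - mus[j]) ** 2
                               for g, x in zip(gamma, xs)) / nj + 1e-6
            weights[j] = nj / len(xs)
    return mus, variances, weights

xs = [0.1, -0.2, 0.3, 0.0, 5.1, 4.8, 5.0, 5.2]
mus, variances, weights = em_gmm_1d(xs, [min(xs), max(xs)])
```

On this well-separated toy data the means converge near the two cluster centers and the weights stay close to 0.5 each.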
(3) Example of the EM algorithm: K-means clustering
Indicator function I(x): when the input is true, the output is 1; when the input is false, the output is 0.
Prove its convergence:
Here, the third step makes E smaller because, for fixed assignments, the mean minimizes the within-cluster sum of squared distances and therefore cannot increase E.
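The step above can be spelled out. Writing $C_k$ for cluster $k$ and $N_k$ for its size (notation introduced here just for this sketch), the K-means objective with the indicator function is:

```latex
E = \sum_{i=1}^{N} \sum_{k=1}^{K} I(x_i \in C_k)\, \lVert x_i - \mu_k \rVert^2
```

In the assignment step every $x_i$ moves to its nearest center, so no term grows. In the update step, for fixed assignments,

```latex
\frac{\partial}{\partial \mu_k} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
  = -2 \sum_{x_i \in C_k} (x_i - \mu_k) = 0
  \quad\Rightarrow\quad
  \mu_k = \frac{1}{N_k} \sum_{x_i \in C_k} x_i ,
```

so the mean is the minimizer and $E$ again cannot increase. Since $E \ge 0$ and never increases, it must converge.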
Example of its application: Image vector quantization based on K-means clustering
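A minimal sketch of vector quantization with K-means, using scalar grayscale pixels as the "vectors"; the toy image and 2-level codebook are invented:

```python
def kmeans_1d(values, centers, iters=20):
    """Lloyd's algorithm on scalars; the final centers form the codebook."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for v in values:
            # indicator-style hard assignment to the nearest center
            nearest = min(range(len(centers)),
                          key=lambda k: (v - centers[k]) ** 2)
            clusters[nearest].append(v)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

# Invented 4x4 grayscale "image", quantized to a 2-level codebook
pixels = [12, 15, 10, 14, 240, 235, 250, 245,
          13, 11, 238, 242, 16, 9, 247, 236]
codebook = sorted(kmeans_1d(pixels, centers=[0.0, 255.0]))
quantized = [min(codebook, key=lambda c: (p - c) ** 2) for p in pixels]
```

After quantization each pixel stores only a codebook index, which is the point of vector quantization for compression.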
Voiceprint recognition
(1) Preprocessing - Removal of silence
Because silent segments are highly similar to one another (they carry little speaker information), they need to be removed.
The zero-crossing rate is the number of times the speech waveform crosses the zero axis per unit time (usually one frame). (It is useful for recognizing consonants.)
During preprocessing, energy screening is applied first; the zero-crossing rate is then computed on the parts removed by the energy screening. If a removed part has a high zero-crossing rate, that section is restored.
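A sketch of this screening rule with invented frames and thresholds: low-energy frames are dropped unless their zero-crossing rate is high, which restores likely unvoiced consonants:

```python
def short_time_energy(frame):
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    # number of sign changes between consecutive samples
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)

def keep_frame(frame, energy_thresh, zcr_thresh):
    """Energy screening first; a low-energy frame is restored when its
    zero-crossing rate is high (it is likely an unvoiced consonant)."""
    if short_time_energy(frame) >= energy_thresh:
        return True
    return zero_crossing_rate(frame) >= zcr_thresh

voiced = [0.5, -0.4, 0.6, -0.5]         # high energy: kept
silence = [0.01, 0.02, 0.01, 0.02]      # low energy, few crossings: removed
consonant = [0.05, -0.05, 0.04, -0.06]  # low energy, many crossings: restored
```

The threshold values would be tuned on real data; the ones in the test below only separate these toy frames.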
(2) Extracted features: Mel-frequency cepstral coefficients (MFCC)
That is, from a piece of speech we infer the characteristics of the vocal tract (i.e., the speaker's physical traits).
Divide the speech into short segments. The recommended setting: each segment is 20 ms long, and a segment is taken every 10 ms, so 1 second of speech yields about 100 segments.
For each segment, extract a cepstral vector. Its dimension is configurable; typical choices are 12, 24, or 36.
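The segmentation above can be sketched as follows (the 16 kHz sampling rate is an assumption):

```python
def frame_signal(samples, sample_rate, frame_ms=20, hop_ms=10):
    """Cut speech into overlapping 20 ms segments taken every 10 ms."""
    frame_len = sample_rate * frame_ms // 1000
    hop = sample_rate * hop_ms // 1000
    return [samples[i:i + frame_len]
            for i in range(0, len(samples) - frame_len + 1, hop)]

one_second = [0.0] * 16000            # 1 s of silence at an assumed 16 kHz
frames = frame_signal(one_second, 16000)
```

This yields 99 frames for exactly one second of audio, since the last partial window is dropped; "about 100 segments per second" as stated above.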
(3) Use feature MFCC to make Gaussian mixture
The inputs X_i are the MFCC vectors. The number of mixture components K is set to 64, and each covariance matrix Σ is a diagonal matrix.
The weights π sum to 1, so the last one can be obtained by subtracting the others from 1; π therefore contributes 63 free parameters.
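The free-parameter count implied by this setup can be checked (12-dimensional MFCC is the choice from the previous section; 24 or 36 would change the totals):

```python
K, D = 64, 12        # 64 mixture components, 12-dim MFCC vectors
n_means = K * D      # one mean per component per dimension
n_vars = K * D       # diagonal covariance: one variance per dimension
n_weights = K - 1    # the weights sum to 1, so one of them is determined
total = n_means + n_vars + n_weights
```

A full (non-diagonal) Σ would need K·D·(D+1)/2 covariance parameters instead, which is why the diagonal restriction is used.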
(4) Data
XM2VTS consists of 295 people, each with 24 sound files, recorded in four sessions of 6 files each, with an interval of one month between sessions.
Files 1 and 4: 'zero one two three four five six seven eight ten'
Files 2 and 5: 'five zero six nine two eight one three seven four'
Files 3 and 6: 'Joe took father's green shoe bench out'
Use the 12 files from the first two sessions for training and the 12 files from the last two sessions for testing.
Its disadvantage: there must be no noise (noise changes the distribution of X), otherwise the accuracy drops significantly.
Proof of EM algorithm
The general form of the EM algorithm
Based on the above derivation of the EM algorithm, we derive the general form of the EM algorithm:
Convergence proof of EM algorithm
Since E(θ) <= 0, and the previous step shows that E never decreases from one iteration to the next, the sequence is monotone and bounded above, so it must converge.
Mapping the general EM algorithm onto the K-means algorithm
Algorithm steps:
Disadvantages of the EM algorithm:
Instability: the algorithm places certain requirements on the choice of initial values, and that choice is random; different initial values may lead to very different final results.