Machine Learning/Deep Learning - Study Notes: Concept Supplement (Part 1)

Study time: 2022.05.09~2022.05.11

Concept Supplement (Part 1)

While learning machine learning and deep learning, I find some concepts relatively unfamiliar (perhaps because I never studied statistics, operations research, and probability theory systematically and in depth; perhaps because the material I read is practice-oriented and light on theory). My understanding of some concepts is limited to those I have seen or used, and a few I can apply without knowing the underlying principles. So I want to study several frequently appearing concepts systematically. They mainly include:

  • Part 1: Maximum Likelihood Estimation, Bayes, Fourier, Markov, Conditional Random Fields;
  • Part 2: Convex Sets, Convex Functions and Convex Optimization, Optimization Algorithms (Optimizers), Overfitting and Underfitting, Regularization & Normalization, Loss Functions and Pseudo-Labels, etc.

1. Maximum Likelihood Estimation MLE

The reference sources for this part are: a thousand-word explanation of maximum likelihood estimation, maximum likelihood estimation, a detailed explanation of maximum likelihood estimation, and a thorough understanding of [maximum likelihood estimation].

For a series of observations, we can often find a specific distribution to describe them, but the parameters of the distribution are unclear. We then need **Maximum Likelihood Estimation (MLE)** to solve for the parameters of this distribution. In other words, maximum likelihood estimation provides a way to evaluate the parameters of a model given observed data, i.e.: "the model is given, the parameters are unknown".

1.1 Concept

In many machine learning problems, the input $x$ is a vector and the output $P(x)$ is a probability (for example, the probability that $x$ belongs to a certain class). For an observed dataset $D = \{x_1, x_2, \ldots, x_n\}$ in which every $x$ is independent and identically distributed, we model the probability distribution that the input $x$ satisfies as $P(D, θ)$; the prediction for a new input is then $P(x|D, θ)$, where $θ$ is a vector representing all parameters of the model to be estimated (the object of parameter estimation). So how do we solve for or estimate the value of $θ$?

Based on different understandings of the nature of $θ$, statisticians divide into two major schools: the frequentist school and the Bayesian school.

  • Frequentist school: believes that $θ$ is definite, with a true value, and the goal is to find or approximate this true value;
  • Bayesian school: believes that $θ$ is uncertain, with no unique true value, but instead obeys some probability distribution.

Therefore, there are also three different approaches for parameter estimation:

  • Maximum likelihood estimation: MLE (Maximum Likelihood Estimation) [frequentist school]
  • Maximum a posteriori estimation: MAP (Maximum A Posteriori) [Bayesian school]
  • Bayesian estimation: BE (Bayesian Estimation) [Bayesian school]

The purpose of maximum likelihood estimation is to use known sample outcomes to infer the parameter values that are most likely (with maximum probability) to have produced those outcomes.

Principle:

Maximum likelihood estimation is a statistical method built on the maximum likelihood principle, an application of probability theory to statistics. It provides a way to evaluate model parameters given observed data, i.e.: "the model is given, the parameters are unknown". We perform several trials, observe the results, and use those results to find the parameter value that maximizes the probability of the sample appearing; this is the maximum likelihood estimate.

1.2 Formula

For a set of observed samples $(x_1, x_2, \ldots, x_n)$, the likelihood function is: $L(θ) = L(x_1, x_2, \ldots, x_n; θ) = \prod^n_{i=1} p(x_i; θ)$, where $L(θ)$ is called the likelihood function of the sample.

The principle of maximum likelihood estimation: fix the sample observations $(x_1, x_2, \ldots, x_n)$ and adjust the parameter $θ$ so that $L(x_1, x_2, \ldots, x_n; θ') = \max_θ L(x_1, x_2, \ldots, x_n; θ)$. Here $θ'$ is called the maximum likelihood estimate of the parameter $θ$.

We solve:
$θ_{MLE} = \arg\max\ p(D, θ) = \arg\max\ p(x_1, θ)p(x_2, θ)\cdots p(x_n, θ) = \arg\max\ \prod^n_{i=1} p(x_i, θ)$
Converting to the log-likelihood (which simplifies solving for the parameter $θ$ and preserves the equivalence of maximum likelihood and maximum log-likelihood):
$θ_{MLE} = \arg\max\ \ln \prod^n_{i=1} p(x_i, θ) = \arg\max\ \sum^n_{i=1} \ln p(x_i, θ)$

Its loss-function form is $\arg\min\ \left(-\ln \prod^n_{i=1} p(x_i, θ)\right)$, i.e., the negative log-likelihood.

1.3 Calculation steps

We use samples from a normally distributed random variable to demonstrate the solution. The normal density is $f(x) = \frac{1}{\sqrt{2\pi}σ}\exp\left(-\frac{(x-μ)^2}{2σ^2}\right)$. The specific steps are as follows:

  1. Write down the likelihood function:
    $L(θ) = \prod^n_{i=1} f(x_i; θ) = \prod^n_{i=1} \frac{1}{\sqrt{2\pi}σ}\exp\left(-\frac{(x_i-μ)^2}{2σ^2}\right) = \left(\frac{1}{\sqrt{2\pi}σ}\right)^n \prod^n_{i=1} \exp\left(-\frac{(x_i-μ)^2}{2σ^2}\right)$

  2. Take the logarithm of the likelihood:
    $\ln L(θ) = \ln \prod^n_{i=1} f(x_i; θ) = -\frac{n}{2}\ln 2\pi - n\ln σ - \frac{1}{2σ^2}\sum^n_{i=1}(x_i-μ)^2$

  3. Take partial derivatives, set them to zero for the extremum, and obtain the system of equations:
    $\begin{cases} \frac{∂\ln L(μ,σ^2)}{∂μ} = \frac{1}{σ^2}\sum^n_{i=1}(x_i-μ) = 0 \\ \frac{∂\ln L(μ,σ^2)}{∂σ^2} = -\frac{n}{2σ^2}+\frac{1}{2σ^4}\sum^n_{i=1}(x_i-μ)^2 = 0 \end{cases} \ →\ \begin{cases} \sum^n_{i=1}(x_i-μ) = 0 \\ σ^2=\frac{1}{n}\sum^n_{i=1}(x_i-μ)^2 \end{cases}$

  4. Solve the system of equations to obtain:
    $\begin{cases} \hat{μ} = \bar{x} = \frac{1}{n}\sum^n_{i=1}x_i \\ \widehat{σ^2}=\frac{1}{n}\sum^n_{i=1}(x_i-\bar{x})^2 \end{cases}$

$\hat{μ},\ \widehat{σ^2}$ are the maximum likelihood estimates of the normal-distribution parameters $μ,\ σ^2$.
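A quick numerical check of these closed-form estimates; a minimal sketch using NumPy and SciPy (the true parameters, sample size, and optimizer settings are arbitrary choices for illustration):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=1000)    # samples, true mu=2.0, sigma=1.5

# Closed-form MLE from the equations above
mu_hat = x.mean()
sigma_hat = np.sqrt(((x - mu_hat) ** 2).mean())  # note 1/n, not 1/(n-1)

# Numerical MLE: minimize the negative log-likelihood directly
def neg_log_likelihood(params):
    mu, log_sigma = params                       # optimize log(sigma) so sigma > 0
    sigma = np.exp(log_sigma)
    return np.sum(0.5 * np.log(2 * np.pi) + log_sigma
                  + (x - mu) ** 2 / (2 * sigma ** 2))

res = minimize(neg_log_likelihood, x0=[0.0, 0.0])
print(mu_hat, sigma_hat)                         # closed form
print(res.x[0], np.exp(res.x[1]))                # numerical, agrees closely
```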

1.4 Features and Applications

Features of Maximum Likelihood Estimation:

  1. Simpler than other estimation methods;
  2. Convergence: unbiased or asymptotically unbiased; as the number of samples increases, the convergence improves;
  3. If the assumed class-conditional probability model is correct, it usually gives good results; but if the assumed model is biased, the estimates can be very poor.

Maximum likelihood estimation is widely used in theoretical machine learning research; in particular, when minimizing a machine learning loss function, a negative sign is often attached to the likelihood to turn maximization into minimization. When first studying linear regression, the error function can not only be established from MSE, but can also be derived from the perspective of the normal distribution and maximum likelihood estimation, as sketched below. Machine learning algorithms that use maximum likelihood estimation include Naive Bayes, the EM algorithm, and others.
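As a quick sketch of the linear-regression case (under the standard Gaussian-noise assumption, which the original does not spell out): assume $y_i = w^{\top}x_i + ε_i$ with $ε_i \sim N(0, σ^2)$. The negative log-likelihood is

$-\ln L(w) = \frac{n}{2}\ln(2\pi σ^2) + \frac{1}{2σ^2}\sum^n_{i=1}(y_i - w^{\top}x_i)^2,$

so maximizing the likelihood over $w$ is exactly minimizing the sum of squared errors, i.e., the MSE loss.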

In addition, the maximum likelihood function is particularly suited to probability-related problems, because solving the model often amounts to solving for parameters: batches of random samples are used, via maximization, to approximate or simulate the real world. It is often used to solve the likelihood function of discrete random variables, while the normal-distribution case is more common for continuous ones. A loss-function model built from maximum likelihood estimation still needs gradient descent to iteratively update and solve for the parameters.

2. Bayes

The main references for this section: a popular understanding of the Bayesian formula (theorem), a thousand-word explanation of maximum likelihood estimation.

Thomas Bayes, the inventor of Bayes' theorem, put forward an interesting hypothesis: "If there are 10 balls in a bag, some black and some white, but we don't know the ratio between them, can we judge the proportion of black and white balls in the bag only from the colors of the balls we draw?"

The above problem may conflict with the probability we learned in high school, where a typical problem looks like this: "There are 10 balls in a bag, 4 black and 6 white; if you randomly grab one, what is the probability that it is black?" No doubt, the answer is 0.4. This problem is simple because we know the ratio of black to white balls in advance, so the probability of drawing each is easy to compute. In some complex situations, however, we cannot know the "proportion", and that leads to the Bayesian question.

2.1 Bayes' theorem/formula

The following describes "Bayes' theorem" in an easy-to-understand way. In general, the probability of event A given that event B has occurred and the probability of event B given that event A has occurred are not the same; however, there is a definite relationship between the two, given by the following formula (called the "Bayesian formula"):
$P(A|B) = \frac{P(B|A)\cdot P(A)}{P(B)}$
The meanings of the symbols in the formula are as follows:

  • $P(A)$ is the most basic notation in probability and represents the probability that A occurs. It is called the "prior" because it does not take any information about B into account; [prior]

  • $P(B)$ represents the probability that B occurs, and also acts as a normalizing constant; [prior]

  • $P(B|A)$ is a conditional probability: the probability that event B occurs given that event A occurs. It is also called the "likelihood"; [likelihood]

  • $P(A|B)$ is a conditional probability: the probability that event A occurs given that event B occurs. This result is also called the "posterior probability"; [posterior]

    Note: the posterior probability can be regarded as the prior probability updated after seeing the data.

Therefore, the Bayesian formula can be stated as: posterior probability = (likelihood × prior probability) / normalizing constant.

The ratio $P(B|A)/P(B)$ is also sometimes called the standardized likelihood, so the Bayesian formula can also be stated as: posterior probability = standardized likelihood × prior probability.

In specific applications, $B$ stands for the observed data and $A$ represents the parameter to be estimated (possibly a vector).

Application example of Bayesian formula: Bayesian easy-to-understand derivation .
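As a tiny numerical illustration of the formula (the disease/test numbers below are made up for the example):

```python
# Bayes' rule: P(A|B) = P(B|A) * P(A) / P(B)
# Hypothetical numbers: a disease with 1% prevalence, a test with
# 95% sensitivity and a 5% false-positive rate.
p_disease = 0.01                      # prior P(A)
p_pos_given_disease = 0.95            # likelihood P(B|A)
p_pos_given_healthy = 0.05            # P(B|not A)

# Evidence P(B), by the law of total probability
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Posterior P(A|B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)  # ~0.161: a positive test is far from certain
```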

2.2 Steps of Bayesian Estimation

General steps (the goal is to find $P(θ|D)$):

  1. Determine the prior distribution $P(θ)$ of the parameter $θ$;

    Note: the conjugate prior for the parameters of the Bernoulli, binomial, negative binomial and geometric distributions is the Beta distribution; for the multinomial distribution it is the Dirichlet distribution; for the exponential distribution it is the Gamma distribution; for the mean of a Gaussian it is another Gaussian; and for the Poisson distribution it is the Gamma distribution.

  2. From the observed dataset $D=\{x_1, x_2, \ldots, x_n\}$, determine the likelihood $P(D|θ)$ of the parameter $θ$;

  3. Determine the marginal probability of the data, $P(D)$ (the normalizing constant);

    For a continuous random variable, $P(D) = \int_θ P(D|θ)\cdot P(θ)\, dθ$.

  4. Using the Bayesian formula, find the posterior distribution $P(θ|D)$ of the parameter $θ$;

  5. Find the Bayesian estimate of the parameter $θ$: $\hat{θ} = \int_θ θ\, P(θ|D)\, dθ$;

  6. Estimate the probability $P(\tilde{x}|D)$ that a new measurement $\tilde{x}$ occurs.

    The point of Bayesian estimation is not so much estimating the parameters as estimating the probability of new measurement data.

Take coin tossing as an example (a binomial distribution):

  1. The conjugate prior for the parameter of a binomial distribution is the Beta distribution. Since the likelihood function of $θ$ follows a binomial distribution, Bayesian estimation assumes that the prior distribution of $θ$ obeys $P(θ)\backsim Beta(\alpha, \beta)$. The probability density of the Beta distribution is (where $B$ is the Beta function, a normalizing constant that makes the total probability integrate to 1):
    $f(x;\alpha, \beta) = \frac{1}{B(\alpha, \beta)}x^{\alpha-1}(1-x)^{\beta-1}$

  2. From the observed dataset $D=\{x_1, x_2, \ldots, x_n\}$, compute the likelihood $P(D|θ)$ (suppose we observe 6 heads and 4 tails in 10 tosses, so $P(D|θ) = θ^6(1-θ)^4$);

  3. For a continuous random variable, $P(D) = \int_θ P(D|θ)\cdot P(θ)\, dθ$;

  4. Solve the posterior distribution $P(θ|D)$ according to the Bayesian formula:
    $P(θ|D) = \frac{P(D|θ)\cdot P(θ)}{\int_θ P(D|θ)\cdot P(θ)\, dθ} = \frac{θ^6(1-θ)^4\cdot\frac{θ^{\alpha-1}(1-θ)^{\beta-1}}{B(\alpha, \beta)}}{\int_θ θ^6(1-θ)^4\cdot\frac{θ^{\alpha-1}(1-θ)^{\beta-1}}{B(\alpha, \beta)}\, dθ} = \ldots = Beta(θ|\alpha+6, \beta+4)$

  5. From the mathematical expectation of the Beta distribution, $E(θ)=\frac{\alpha}{\alpha+\beta}$, we obtain the Bayesian estimate $\hat{θ}$ of the parameter $θ$ (here the prior is $Beta(3,3)$, so the posterior is $Beta(9,7)$):
    $\hat{θ} = \int_θ θ\,P(θ|D)\, dθ = E(θ) = \frac{9}{9+7} = 0.5625$

  6. Estimate the probability $P(\tilde{x}|D)$ of a new measurement $\tilde{x}$ (a numerical sketch of this procedure follows the list):
    $P(\tilde{x}|D) = \int_θ P(\tilde{x}|θ)\cdot P(θ|D)\, dθ = \int_θ P(\tilde{x}|θ)\cdot \frac{P(D|θ)\cdot P(θ)}{P(D)}\, dθ$
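A minimal numerical sketch of this Beta-Binomial update (the prior $Beta(3,3)$ and the 6-heads/4-tails data follow the worked example above):

```python
from scipy import stats

alpha, beta = 3, 3            # prior Beta(3, 3)
heads, tails = 6, 4           # observed data D: 6 heads, 4 tails in 10 tosses

# Conjugacy: the posterior is Beta(alpha + heads, beta + tails) = Beta(9, 7)
post = stats.beta(alpha + heads, beta + tails)

theta_hat = post.mean()       # Bayesian estimate = posterior mean
print(theta_hat)              # 9 / 16 = 0.5625

# For a Bernoulli model, the predictive P(next toss = heads | D)
# equals the posterior mean of theta as well
print(post.interval(0.95))    # a 95% credible interval for theta
```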

Conjugate Prior

" Conjugation " is a relatively powerful word in mathematics. The "yoke" is the wood used by the ox to pull the cart, and the two oxen that pull the same cart at the same time is the "conjugate" relationship. Extending this relationship to mathematics, as long as there are pairs of things, and when a more appropriate name cannot be found, they are often called "conjugates". "conjugate" means "to become joined together".

In Bayesian theory, if the prior distribution and the likelihood function can make the prior distribution and the posterior distribution have the same form, then the prior distribution and the likelihood function are said to be conjugate, and the result of the conjugation is to let The prior and the posterior have the same form .

The reason why the conjugate prior is used is to make the prior distribution and the posterior distribution have the same form, so on the one hand, it is in line with human intuition (they should be in the same form), on the other hand, it can form a prior chain, That is, the current posterior distribution can be used as the prior distribution for the next calculation, and if the form is the same, a chain can be formed.

2.3 Naive Bayes

Naive Bayes is a very simple classification algorithm. Its basic idea: for a given item to be classified, compute the probability of each category appearing given that this item appears, and assign the item to whichever category has the largest probability.

An important assumption of the naive Bayes classifier: the attributes corresponding to the classification are independent of each other (this is also the origin of the term "naive").

However, in practical applications, this is often difficult to achieve, so what should we do?

The answer is simple: properly account for the interdependence among some of the attributes. This relaxed form of classification is called semi-naive Bayesian classification. The most commonly used strategy is to assume that each attribute depends on at most one other attribute, called its super-parent attribute; this relationship is called One-Dependent Estimation (ODE).
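A minimal usage sketch of a naive Bayes classifier with scikit-learn's GaussianNB (the toy data is made up; GaussianNB assumes conditionally independent Gaussian features):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data: two features, two classes
X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.9], [3.2, 4.1]])
y = np.array([0, 0, 1, 1])

clf = GaussianNB()
clf.fit(X, y)                            # estimates per-class mean/variance per feature
print(clf.predict([[1.1, 2.0]]))         # -> [0]
print(clf.predict_proba([[1.1, 2.0]]))   # posterior P(class | x)
```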

Supplement: Concepts of Bayesian Networks .

2.4 Bayesian Optimization

The main reference for this section: Bayesian optimization and SMBO, Gaussian process regression, TPE.

Before starting, let's clarify the following common and confusing concepts in this field: AutoML, Bayesian Optimization (BO), Sequential Model Based Optimisation (SMBO), Gaussian Process Regression (GPR), Tree Parzen Estimator (TPE).

AutoML is the biggest concept, covering Bayesian Optimization (BO), Neural Architecture Search (NAS), and a lot of engineering stuff. The goal of AutoML is to allow people without experience in machine learning (including deep learning) to use a platform to construct, train, and use machine learning models. This platform is responsible for data management, model structure design, and model hyperparameter adjustment (hyper -parameter tuning) , model evaluation and usage, etc.

Bayesian Optimization (BO) is an advanced method for hyperparameter tuning in AutoML, listed alongside manual tuning, grid search (Grid Search), and random search (Random Search).

SMBO is a specific implementation method of Bayesian optimization. It is an iterative optimization method. Each iteration is a new hyperparameter combination experiment, and each iteration is based on the previous history. In fact, SMBO can be considered as the standard implementation of BO. I feel that the two are equivalent in many contexts.

Finally, there are Gaussian Process Regression (GPR) and Tree Parzen Estimator (TPE) , which are parallel concepts and two modeling strategies in the SMBO method.

2.4.1 Hyperparameter Tuning Method

There are four main methods for hyper-parameter tuning: manual tuning, grid search, random search, and Bayesian Optimization.

Manual parameter tuning is to adjust parameters based on human experience. Now simply compare grid search (left) and random search (right):

With the same number of hyperparameter combinations, random search explores more of the space (more precisely, the parameters of each dimension try more distinct values), so on average it finds good regions earlier than grid search does. In theory, random search covers the entire parameter space faster, instead of searching outward from local points like grid search.

However, grid search and random search are both non-prior searches, called uninformed search in some places: each step of the search takes no account of the points already explored. This is the main problem with grid/random search: they are "lazy" searches, groping around with eyes closed.

Bayesian optimization , on the other hand, is a kind of informed search, which uses the performance of the previously searched parameters to speculate on the next step, thereby reducing the search space and greatly improving the search efficiency. In a sense, Bayesian optimization is similar to manual parameter tuning, because our parameter tuning masters will also judge how to adjust the parameters in the next step based on the existing results and our own experience.

2.4.2 SMBO algorithm

Suppose a set of hyperparameter combinations is $X=x_1,x_2,...,x_n$ (each $x_i$ represents the value of one hyperparameter). Different hyperparameters have different effects; Bayesian optimization assumes a functional relationship between the hyperparameters and the loss function we ultimately need to optimize.

The general idea of Bayesian optimization: suppose we have a function $f: X → R$; we need to find $x^* = \arg\min_{x⊆X} f(x)$. ($x$ represents hyperparameters, not input data; $f$ is the loss function.)

SMBO stands for Sequential Model-Based Optimization. "Sequential" means optimizing iteratively, one trial at a time; SMBO is the simplest form of Bayesian optimization. The SMBO framework is as follows (framework pseudocode):

[figure: SMBO framework pseudocode]

The meaning of each symbol is as follows:

  • $f$ is the function we want to optimize, and $x$ is a hyperparameter combination. For example, if we want to train an image classification model, each choice of $x$ corresponds to one loss value of that classification model, and this functional relationship is $f$. In general, evaluating $f$ is very time-consuming;
  • $S$ is short for surrogate: we use $S$ as a surrogate function for $f$, and generally find the next hyperparameter to try by minimizing $S$; computing $S$ is usually much cheaper. In practice this minimization step is carried out by maximizing an acquisition function;
  • $H$ is the history: all previous records $\{x, f(x)\}$. We model $H$ to obtain their probability distribution model $M$.

The overall steps are as follows:

  1. From the existing tuning history $H = (x_{1:k}, f(x_{1:k}))$, build a probability distribution model $M$;
  2. Select the next hyperparameter $x_{k+1}$ according to the acquisition function;
  3. Add the new observation $(x_{k+1}, f(x_{k+1}))$ to $H$;
  4. Repeat steps 1-3 until the maximum number of iterations is reached (see the sketch below).
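A minimal sketch of this loop, with placeholder choices not taken from the original text: a Gaussian-process surrogate from scikit-learn, random candidate sampling instead of a true acquisition maximizer, and a toy objective:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def f(x):                                  # expensive objective (toy stand-in)
    return (x - 0.3) ** 2

rng = np.random.default_rng(0)
H_x = list(rng.uniform(0, 1, 3))           # a few random initial trials
H_y = [f(x) for x in H_x]

for k in range(20):
    # 1. Fit the probability model M to the history H
    gp = GaussianProcessRegressor().fit(np.array(H_x).reshape(-1, 1), H_y)
    # 2. Choose the next x with a simple acquisition (lower confidence bound)
    cand = rng.uniform(0, 1, 256).reshape(-1, 1)
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = float(cand[np.argmin(mu - 1.96 * sigma)])
    # 3. Add the new observation (x, f(x)) to H
    H_x.append(x_next)
    H_y.append(f(x_next))

print(H_x[int(np.argmin(H_y))])            # best x found, near 0.3
```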

Therefore, the main differences between different Bayesian optimization methods are: which probability model is used to model the history, and how to choose the acquisition function .

Therefore, a key step in Bayesian optimization is to model the objective function to be optimized and obtain the function's distribution $p(y|x)$, to see how much the function might fluctuate. With the distribution of the function, we in fact have a conditional distribution of $y$ given $x$. The most classic specific modeling methods are Gaussian Process Regression (GPR) and the Tree Parzen Estimator (TPE), covered in Section 2.4.4.

2.4.3 Expected Improvement (EI)

Now that we have a modeling approach, how do we choose the next hyperparameter based on the distribution of the objective function?

[figure: modeled objective with observed points; region 1 lies around the current best point $x_2$, region 2 is an unexplored area]

Among the points observed so far, $x_2$ is the best hyperparameter. In which direction should we search next? There are two strategies:

  • Exploitation: since $x_2$ is the best found so far, the points around it are presumably good too, so we search region 1 in the figure;
  • Exploration: although $x_2$ currently looks best, there is still a lot of space we have not explored, such as region 2 in the figure; surprises may be hidden there!

Both strategies have their own justification; we need to design an acquisition function to help us make the judgment. One of the most common schemes is Expected Improvement (EI), a compromise between exploration and exploitation. The formula for Expected Improvement (EI) is as follows:

$EI_{y^*}(x) = \int_{-\infty}^{y^*}(y^* - y)\, p(y|x)\, dy$

That is, we first fix a baseline $y^*$, and simply compute EI relative to $y^*$ to evaluate how good a hyperparameter is. The next hyperparameter we look for is then: $x_{new}=\arg\max_x EI_{y^*}(x)$

The form of EI means it prefers to select regions with small mean and large variance, thereby embodying the "exploration vs. exploitation trade-off".
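Under a Gaussian surrogate $p(y|x) = N(μ(x), σ(x)^2)$ this integral has a well-known closed form; a minimal sketch for minimization, where $y^*$ is the best observed value:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_star):
    """EI for minimization: E[max(y* - y, 0)] with y ~ N(mu, sigma^2)."""
    z = (y_star - mu) / sigma
    return (y_star - mu) * norm.cdf(z) + sigma * norm.pdf(z)

# Small mean and large variance both raise EI:
print(expected_improvement(mu=0.5, sigma=0.1, y_star=0.4))  # ~0.008
print(expected_improvement(mu=0.5, sigma=0.5, y_star=0.4))  # same mean, more variance -> ~0.15
```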

2.4.4 Modeling strategies for different probability distributions

① Bayesian optimization based on GPR; ② Bayesian optimization based on TPE. For a detailed introduction, see the popular-science article: Bayesian optimization and SMBO, Gaussian process regression, TPE.

The paper's experiments find that TPE generally performs better than GPR. As to why, the paper only offers guesses: TPE's modeling may be more accurate than GPR's, and a more conservative choice of threshold may be a better prior. But this is only an experiment on the paper's two datasets. In practice, the various open-source tools choose differently: some use GPR, some use TPE or other algorithms. Mature tools also generally improve on the paper's methods, for example in how to perform better initialization.

2.4.5 Python application

Advantages of Bayesian Optimization:

  • Bayesian tuning uses a Gaussian process, considers previous parameter information, and continually updates the prior; grid search does not consider previous parameter information;
  • Bayesian tuning needs few iterations and is fast; grid search is slow and easily causes a dimension explosion when there are many parameters;
  • Bayesian tuning remains robust on non-convex problems; grid search easily lands in a local optimum on non-convex problems.

For specific applications, I have looked at two libraries and some usage tutorials: bayes_opt (Bayesian optimizer, a Python implementation of Bayesian optimization) and Hyperopt (the parameter-tuning tool hyperopt).
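A minimal usage sketch with Hyperopt's TPE (the quadratic objective is a toy stand-in for a real train-and-validate run):

```python
from hyperopt import fmin, tpe, hp

def objective(x):
    # Stand-in for an expensive model-training run returning a validation loss
    return (x - 0.3) ** 2

best = fmin(
    fn=objective,
    space=hp.uniform('x', 0.0, 1.0),  # search space for the hyperparameter
    algo=tpe.suggest,                 # TPE as the SMBO modeling strategy
    max_evals=100,
)
print(best)  # e.g. {'x': 0.30...}
```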

3. Fourier Analysis

The main references for this section: explaining the Fourier transform in simple terms, and thoroughly understanding the Fourier transform.

3.1 What is the frequency domain

From birth, the world we see runs on time: the trend of stocks, the height of a person, the trajectory of a car all change over time. This way of observing the dynamic world with time as the reference is called time-domain analysis (time-series analysis). We take it for granted that everything in the world changes ceaselessly with time and never stands still. But if I told you to look at the world in another way, and you would see that the world is eternal, would you think I was crazy? I'm not crazy: this static world is called the frequency domain.

[figure: the same piece of music in the time domain (waveform) and in the frequency domain (score)]

In the time domain, we observe that the strings of the piano wiggle up and down for a while, just like the trend of a stock; in the frequency domain, there is only one eternal note.

So the world that, in your eyes, changes like falling leaves is actually just a piece of music long since composed in the arms of God. Sorry, this is not chicken soup for the soul, but a solid formula on the blackboard: Fourier told us that any periodic function can be regarded as a superposition of sine waves of different amplitudes and phases. In terms of the first example, any piece of music can be composed by tapping different keys at different times with different strengths.

The method that runs between the time domain and the frequency domain is the legendary Fourier analysis. Fourier analysis can be divided into the Fourier series (Fourier Series) and the Fourier transform (Fourier Transform); let's start with the simpler one.

3.2 What is the Fourier transform

In short, the Fourier transform decomposes an input signal into a superposition of a bunch of sine waves. Like most mathematical methods, the name comes from a man named Fourier. Let's start with some simple examples and move on. First, what is a wave: something that changes over time according to a certain pattern.

Here is an example of a wave:

[figure: an example wave]

This wave can be decomposed into a superposition of two sine waves. That is, when we add two sine waves together, we get the original wave.

[figure: two sine waves whose sum gives the original wave]

The Fourier transform lets us separate out the individual sine waves that make up a complex waveform. In this example you can almost do it by eye. Why does this matter? It turns out that many things in the real world interact on the basis of sine waves, and we usually call how fast a wave oscillates its frequency.

The most obvious example is sound: when we hear a sound, we do not hear that wavy line; we hear the different frequencies of the sine waves that compose the sound. Being able to separate those tones on a computer tells us what a person can actually hear: how high or low a sound is, or which notes the wave contains.

Even waves that do not look as if they are made of sine waves can be analyzed with this decomposition process (see the sketch below).
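A minimal sketch of this decomposition with NumPy's FFT (the two component frequencies, 3 Hz and 5 Hz, and their amplitudes are arbitrary choices):

```python
import numpy as np

fs = 100                                   # sampling rate (Hz)
t = np.arange(0, 1, 1 / fs)                # 1 second of samples
signal = np.sin(2 * np.pi * 3 * t) + 0.5 * np.sin(2 * np.pi * 5 * t)

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(len(signal), 1 / fs)
amps = 2 * np.abs(spectrum) / len(signal)  # amplitude of each frequency component

for f, a in zip(freqs, amps):
    if a > 0.1:
        print(f, a)    # -> 3.0 Hz with amplitude ~1.0, 5.0 Hz with amplitude ~0.5
```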

3.3 Spectrum of Fourier Series

Let's take a look at this guy. This wave is called a square wave.

[figure: a square wave]

Unlikely as it seems, it too decomposes into sine waves.

[figure: sine waves superimposed to approximate the square wave]

As more sine waves are superimposed, the rising parts of all the sine waves steepen the originally slowly rising curve, while at the top the falling parts of the sine waves cancel the rising parts, flattening the peak into a horizontal line. A rectangle is superimposed in just this way. But how many sine waves must be added to form a standard 90-degree square wave? Unfortunately, the answer is: infinitely many.

Not just rectangles: any waveform you can think of can be superimposed from sine waves in this way. This is the first counterintuitive point for someone new to Fourier analysis, but once you accept this setting, the game starts to get interesting. Keeping in mind the sine waves above accumulating into a square wave, let's look at them from another angle:

[figure: the square wave's sine-wave components in 3D, arranged by frequency]

In these figures, the black line at the front is the sum of all the sine waves superimposed, that is, the figure that is getting closer and closer to the rectangular wave. The sine waves arranged in different colors are the components of the rectangular wave. These sine waves are arranged in order of frequency from front to back, and the amplitude of each wave is different. Attentive readers must have discovered that there is a straight line between every two sine waves, which is not a dividing line, but a sine wave with an amplitude of 0! That is, some sine wave components are not needed in order to form a special curve.

Here, the sine waves of different frequencies are called frequency components.

**Well, here comes the key!!** If we consider the first, lowest-frequency component as "1", we have the basic unit for building the frequency domain. The basic unit of the time domain is "1 second"; if we take a sine wave $\cos(\omega_0 t)$ of angular frequency $\omega_0$ as the basis, then the basic unit of the frequency domain is $\omega_0$. With "1", we also need "0" to build a world; so what is "0" in the frequency domain? $\cos(0\cdot t)$ is a sine wave of infinite period: a straight line! Therefore, in the frequency domain, frequency 0 is also called the DC component. In the superposition of a Fourier series it only shifts the whole waveform up or down relative to the axis, without changing the wave's shape.

A sine wave is the projection of circular motion onto a straight line, so the basic unit of the frequency domain can also be understood as an ever-rotating circle:

[figure: a rotating circle generating a sine wave]

Having introduced the basic components of the frequency domain, we can look at the square wave as it appears in the frequency domain:

[figure: the square wave viewed from the frequency-domain side]

What is this weird thing? This is what a square wave looks like in the frequency domain: completely unrecognizable, isn't it? Textbooks generally stop here, leaving readers endless reverie and endless complaints. In fact, the textbooks only need to add one more picture: the frequency-domain image, commonly known as the spectrum, which is:

[figure: the spectrum (amplitude spectrum) of the square wave]

To be a little more clear:

[figure: the same spectrum, drawn more clearly]

Notice that in the spectrum, the amplitudes of the even-numbered components are all 0, corresponding to the colored lines in the earlier figure: sine waves of zero amplitude (see the sketch below).
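A minimal numerical sketch using the standard square-wave Fourier series $\frac{4}{\pi}\sum_{k\ odd}\frac{\sin(k\omega t)}{k}$, in which only odd harmonics appear, exactly the zero even components above:

```python
import numpy as np

t = np.linspace(0, 2 * np.pi, 1000)

def square_wave_partial_sum(n_terms, omega=1.0):
    # Square-wave Fourier series: only odd harmonics, amplitude 4/(pi*k);
    # the even harmonics have amplitude 0, matching the spectrum above
    s = np.zeros_like(t)
    for k in range(1, 2 * n_terms, 2):        # k = 1, 3, 5, ...
        s += (4 / (np.pi * k)) * np.sin(k * omega * t)
    return s

target = np.sign(np.sin(t))                    # the ideal square wave
for n in (1, 3, 10, 100):
    err = np.mean(np.abs(square_wave_partial_sum(n) - target))
    print(n, err)  # mean error shrinks, though the Gibbs overshoot at jumps persists
```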

3.4 Phase Spectrum of Fourier Series (FS)

The key word in the previous chapter was: from the side. The key word for this chapter is: from below.

At the beginning of this chapter, I would like to answer a question that many people have asked: what is Fourier analysis for?

Let's start with the most direct use. Whether listening to the radio or watching TV, we are familiar with one word: channel. A channel is a frequency channel; different channels use different frequencies as passages for transmitting information. Let's try something:

First draw sin(x) on paper. It need not be standard; the idea is what counts. Not hard. OK, next draw the graph of sin(3x)+sin(5x). Never mind standard: you probably can't even tell where the curve rises and where it falls, right? It doesn't matter if you can't draw it. I'll hand you the curve of sin(3x)+sin(5x), but on the premise that you don't know its equation. Now I ask you to remove sin(5x) from the picture and see what's left. That is basically impossible. But in the frequency domain? It's trivial: nothing more than a few vertical lines.

Therefore, many mathematical operations that look impossible in the time domain are easy in the frequency domain; this is where the Fourier transform is needed. In particular, removing certain specific frequency components from a curve, which engineers call filtering, is one of the most important concepts in signal processing, and it can only be done easily in the frequency domain.

Another more important but slightly more complicated use: solving differential equations. The importance of differential equations needs no introduction; they are used in every field. But solving them is quite a hassle, because besides addition, subtraction, multiplication and division, we also have to compute derivatives and integrals. The Fourier transform turns differentiation and integration into multiplication and division in the frequency domain, and college mathematics instantly becomes elementary-school arithmetic. Fourier analysis of course has other, more important uses, which we will mention as we go along.

Next we continue to talk about the phase spectrum:

Transforming the time domain into the frequency domain gives us the spectrum seen from the side, but that spectrum does not contain all the information of the time domain: it only records the amplitude of each corresponding sine wave, and says nothing about the phase. In the basic sine wave $A\sin(ωt+θ)$, amplitude, frequency and phase are all indispensable, and different phases determine the position of the wave. So for frequency-domain analysis, the (amplitude) spectrum alone is not enough; we also need a phase spectrum. Where is it? Look at the picture below; this time, to keep the picture from getting too cluttered, we use one with 7 superimposed waves.

[figure: 7 superimposed sine waves, with red dots marking the crest nearest the frequency axis]

Since sine waves are periodic, we need something to mark a sine wave's position. In the picture it is those little red dots: the crests closest to the frequency axis. How far is each crest from the frequency axis? To see this more clearly, we project the red points onto the lower plane as pink points. Of course, these pink points mark only the distance of the crest from the frequency axis, not the phase.

[figure: the crest positions projected onto the lower plane]

One concept needs correcting here: the time difference is not the phase difference. If the whole period is taken as 2π (or 360 degrees), the phase difference is the time difference's proportion of one period: divide the time difference by the period and multiply by 2π to get the phase difference.

In the complete stereogram, we divide each projected time difference by the period of its frequency in turn, obtaining the phase spectrum at the bottom. So: the spectrum is seen from the side, the phase spectrum from below.

Note that all the phases in the phase spectrum other than 0 are π. Since cos(t+π) = -cos(t), a wave with phase π is just flipped upside down. For the Fourier series of a periodic square wave, such a phase spectrum is already quite simple. Also worth noting: since cos(t+2π) = cos(t), phase differences are periodic, and π, 3π, 5π, 7π are all the same phase. The range of the phase spectrum is conventionally defined as (-π, π], so all the phase differences in the figure are π.

[figure: the complete stereogram: amplitude spectrum from the side, phase spectrum from below]

3.5 Fourier Transform

The essence of a Fourier series is to decompose a periodic signal into infinitely many separate (discrete) sine waves, but the universe doesn't seem to be periodic.

In this world, some things will never come again, and time will never stop marking those unforgettable pasts on time points. But these things often become our extremely precious memories, and they will pop up periodically in our brains after a period of time. Unfortunately, these memories are scattered fragments, often only the happiest memories, while the dull memories are gradually forgotten by us. Because the past is a continuous non-periodic signal, while the memory is a periodic discrete signal. Is there a mathematical tool to transform a continuous non-periodic signal into a periodic discrete signal? Sorry, not really.

For example, the Fourier series takes a function that is periodic and continuous in the time domain to one that is non-periodic and discrete in the frequency domain. The Fourier transform, which we discuss next, converts a signal that is non-periodic and continuous in the time domain into a signal that is non-periodic and continuous in the frequency domain. A picture makes this easy to understand:

[figure: Fourier series vs. Fourier transform in the time and frequency domains]

Or understand it from another angle: the Fourier transform is in fact Fourier analysis of a function whose period is infinite. Therefore, in the frequency domain the Fourier transform turns the discrete spectrum into a continuous spectrum. What does a continuous spectrum look like?

For ease of comparison, this time we view the spectrum from another perspective, the most commonly used view for the Fourier series, looking along the direction of increasing frequency.

[figure: the discrete spectrum, viewed toward higher frequencies]

The above is the discrete spectrum, so what does the continuous spectrum look like? Let your imagination run wild and imagine these discrete sine waves getting closer and closer, gradually becoming continuous...until they become like a rippling sea:

[figure: the discrete spectrum merging into a continuous one, like a rippling sea]

The original superposition of discrete spectra has become the accumulation of continuous spectra. Therefore, the calculation is also changed from the summation symbol to the integral symbol.

3.6 Euler's formula

Everyone met the concept of the imaginary number i in high school, but back then we only knew it as the square root of -1. What is its real meaning?

[figure: a number line with a unit segment multiplied by 3 and by -1]

Here is a number line, and on it a red segment of length 1. When multiplied by 3 its length changes, becoming the blue segment; when multiplied by -1 it becomes the green segment, or in other words, the segment rotates 180 degrees around the origin.

We know that multiplying by -1 is really multiplying by i twice, rotating the segment 180 degrees; so multiplying by i once rotates it 90 degrees. The answer is that simple.

[figure: multiplication by i as a 90-degree rotation into the imaginary axis]

At the same time we obtain a vertical imaginary axis. The real axis and the imaginary axis together form the complex plane. In this way we understand one function of multiplying by the imaginary number i: rotation.

Euler's formula: $e^{ix} = \cos x + i\sin x$. When $x$ equals π: $e^{iπ} + 1 = 0$. The key role of this formula is to unify the sine wave into a simple exponential form. Let's see what it means geometrically:

[figure: Euler's formula as a point circling the complex plane, tracing a spiral along the time axis]

What Euler's formula describes is a point that moves circularly on the complex plane as time changes. As time changes, it becomes a spiral on the time axis. If you only look at the real part of it, that is, the projection of the spiral on the left, it is the most basic cosine function. The projection on the right is a sine function.
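A minimal numerical check of Euler's formula and the two projections (NumPy handles complex numbers natively):

```python
import numpy as np

t = np.linspace(0, 4 * np.pi, 9)
spiral = np.exp(1j * t)                      # points on the unit circle over time

# Euler: e^{it} = cos t + i sin t
assert np.allclose(spiral.real, np.cos(t))   # real projection -> cosine
assert np.allclose(spiral.imag, np.sin(t))   # imaginary projection -> sine
print(np.exp(1j * np.pi) + 1)                # ~0, Euler's identity
```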

3.7 Exponential Fourier Transform FT

With the help of Euler's formula, we know that a superposition of sine waves can also be understood as the projection onto real space of a superposition of spirals. And if we want a vivid example of superimposed spirals, what would it be? Light waves.

So in fact, we have been exposed to the spectrum of light very early, but we have not understood the more important meaning of the spectrum. But the difference is that the spectrum obtained by the Fourier transform is not only a superposition of a limited frequency range such as visible light, but a combination of all frequencies from 0 to infinity.

Here we can understand the sine wave in two ways. The first was mentioned above: the projection of the spiral onto the real axis. The other requires another form of Euler's formula (the right-hand formula is obtained by adding $e^{it}$ and $e^{-it}$ and dividing by 2):

$e^{it} = \cos t + i\sin t \quad\Rightarrow\quad \cos t = \frac{e^{it} + e^{-it}}{2}$

How should this formula be understood? As we just said, $e^{it}$ can be seen as a counterclockwise spiral; $e^{-it}$ is then a clockwise spiral. And $\cos(t)$ is half the superposition of these two spirals of opposite rotation, because their imaginary parts cancel each other out!

Here, what we rotate counterclockwise is called positive frequency, and what we rotate clockwise is called negative frequency (note that it is not a complex frequency).

We have just seen the sea - the continuous Fourier transform spectrum, now think about what a continuous spiral would look like:

[figure: the continuous superposition of spirals]

Isn't it beautiful? Guess what this graph looks like in the time domain?

[figure: the corresponding time-domain picture]

Do you feel like you've been slapped hard? Mathematics is just such a thing that complicates simple problems. By the way, in the conch-like picture, for ease of viewing I show only the positive-frequency part, not the negative-frequency part. Look carefully and you can clearly see each spiral line on the conch diagram, each with a different amplitude (rotation radius), frequency (rotation period) and phase. Connecting all the spiral lines into a surface yields this conch diagram.

Well, at this point, I believe that everyone has a visual understanding of Fourier transform and Fourier series. We finally use a picture to summarize:

[figure: a summary diagram relating the Fourier series and the Fourier transform]

Honestly, after reading the blogger's post I still can't say I fully understand Fourier analysis, but compared with before, when I didn't even know what it was, I have now more or less grasped the concepts, which is real progress. Perhaps later, when I genuinely need Fourier analysis, I will understand it more deeply (then again, that may never happen).

In addition, on the relationship between Fourier and neural networks, I came across this article: compared with neural networks, why hasn't the famous Fourier transform become a unified function approximator?

4. Markov

References for this section: first acquaintance with the Markov model, the Markov model, the hidden Markov model, and HMM that even children can understand.

Invented by Andrei Markov (1856-1922), the Markov model is, in mathematics, a discrete-time stochastic process with the Markov property: given the known current state (present), its future evolution (future) does not depend on its past evolution (past).

The Markov model is a broad concept. By the model's definition and properties, stochastic processes with the Markov property, and stochastic models based on such processes, are collectively referred to as Markov models, including the familiar Markov Chains, Markov Decision Processes, and Hidden Markov Models (HMMs).

The Markov model is a statistical model, widely used in speech recognition, automatic part-of-speech tagging, phonetic-to-word conversion, probabilistic grammar, sequence classification and other natural language processing applications. After long development, and especially its success in speech recognition, it has become a general statistical tool; to date it is considered one of the most successful ways to implement fast and accurate speech recognition systems.

4.1 Markov property

To understand the Markov property, you need to understand the Generating Patterns . Generally speaking, there are the following two generating patterns:

  • Deterministic Patterns → deterministic systems

    • Consider a set of traffic lights: the sequence of the lights' color changes is red - red/yellow - green - yellow - red. The sequence can be modeled as a state machine in which each state of the traffic light follows from the previous state;
    • [figure: traffic-light state machine]
    • Note that each state depends uniquely on the previous state: if the light is green, the next color state is always yellow. That is, the system is deterministic;
    • Deterministic systems are relatively easy to understand and analyze, because the transitions between states are completely known.
  • Non -deterministic patterns → Markov

    • Consider the most classic example: the gambler's ruin. A gambler bets in a casino, winning with probability P and losing with probability (1-P), staking 1 yuan each time. Suppose the gambler starts with a stake of n yuan, which increases by 1 yuan on a win and decreases by 1 yuan on a loss. What is the probability that the gambler goes broke?

    • Consider two possible trajectories of the gambler's stake:

      • Gambler process 1: 0 → 1 → 2 → 1 → ?
      • Gambler process 2: 0 → 1 → ?
    • The probability of going broke from here on is the same in both trajectories: it does not depend on the earlier path, only on the one yuan held now. This is the Markov property, also called no aftereffect (memorylessness). Simply put, the future has nothing to do with the past, only with the present; the process just keeps moving forward.

      The "present" here, in applications, need not refer only to the current moment $t$; it can be understood more broadly: for instance, the value at time $i$ may depend only on the value at time $(i-n)$.

4.2 Markov process Markov process

If a random process satisfies the Markov property , it is called a Markov process . In this model, there are two assumptions: ① the random process satisfies the Markov property; ② the state transition matrix does not change with time.

For a first-order Markov model with $M$ states, there are $M^2$ possible state transitions, since any state can be the next state of every state. Each state transition has a probability value, called the state transition probability: the probability of moving from one state to another. All $M^2$ probabilities can be collected in a state transition matrix. Note the assumption that these probabilities do not change over time, which is very important (but often unrealistic).

The following state transition matrix shows the possible state transition probabilities for the weather example:

[figure: weather-example state transition matrix]

A Markov process is one in which each state transition depends only on the previous n states; this is called an n-order Markov model, where n is the number of states affecting the transition. The simplest Markov process is first-order: each state's transition depends only on the single state before it (also known as the homogeneous Markov chain assumption), which is the basis for many of the models discussed below.

For a first-order Markov model: the value at time $i$ depends on, and only on, the value at time $i-1$, namely:
$P(x_i|x_{i-1}, x_{i-2}, \ldots, x_1) = P(x_i|x_{i-1})$

The relationship and difference between Markov model and time series:

There is a certain relationship between the Markov model and time series; it is sometimes even said that the state sequence of a Markov process is a time series. From the perspective of the passage of time, that does not seem very problematic, but there are still many differences between them:

  • The Markov model is a probabilistic model: the observation at each time point is reflected in a state value, and the so-called state value is the probability of some category, which clearly differs from a time series;
  • In a Markov model, the relationship between the current state and the previous state is determined by the transition probabilities and the transition probability matrix, which also differs from a time series.

4.3 Markov Chain

A Markov process with discrete time and discrete states is called a Markov chain, written $X_n = X(n),\ n=0,1,2,\ldots$; a Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$

This discrete situation is actually the focus of our discussion. Many times we directly say that such a discrete situation is a Markov model.

  • State space: a Markov chain is a sequence of random variables $X_1, X_2, X_3, \ldots$ Each variable $X_i$ has several possible values; the set of all possible values is called the "state space", and the value of $X_n$ is the state at time $n$;
  • Transition probability: the conditional probability of the current value given the value at the previous moment is called the transition probability, written (the probability that the current state is $t$ given that the previous state is $s$): $P_{st} = P(x_i=t|x_{i-1}=s)$;
  • Transition probability matrix: since there is more than one state at each moment, a given current state can be reached from several previous states, so all the conditional probabilities form a matrix, called the "transition probability matrix": $P=[P_{ij}]_{n \times n}$, where $P_{ij}$ is the probability of being in state $i$ at time $t$ and in state $j$ at the next time $t+1$, and $n$ is the number of all possible states of the system;
  • Initial probability distribution: the Markov chain also contains the initial probability of each state: $Q=[q_1,q_2,\ldots,q_n]$, where $q_i$ is the probability that the system is in state $i$ at the initial moment.

As shown in the Markov chain in the figure below, the state is represented by a circle, and the edge represents the transition probability of the state.

[figure: a Markov chain with states drawn as circles and transition probabilities as edges]

The Markov chain also contains the initial probability of each state, which is called the initial state vector of the Markov chain. The size of the vector is equal to the number of states, as shown in the following initial state vector:

[figure: an initial state vector]

State transition distribution and initial distribution of states are two basic properties of Markov chains.

Markov chain convergence condition: whether a Markov chain model converges depends on the state transition matrix $P$. If the chain is irreducible and aperiodic, then for any initial distribution, $\lim_{n→∞} P^n_{ij} = π_j$, independently of the starting state $i$, where the stationary distribution $π$ is the unique solution of $πP = π$ with $\sum_i π_i = 1$.

This is called the convergence of the Markov chain. (The convergence of the Markov chain allows us to use the Markov chain to sample the sample set we need. The memorylessness of the Markov chain is the theoretical basis of the hidden Markov model and the conditional random field)

Transition matrix restrictions (see the numerical sketch below):

  • The state transitions of the Markov chain must not be cyclic (the chain must be aperiodic); a cyclic chain never converges;
  • Any two states must be connected, that is, reachable from each other in a finite number of steps;
  • The number of states must be finite;
  • The transition probabilities between states must be fixed over time.
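A minimal numerical check of this convergence (the 3-state transition matrix is an arbitrary example satisfying the restrictions above):

```python
import numpy as np

# Rows sum to 1: P[i, j] = probability of moving from state i to state j
P = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.3, 0.4]])

# Two very different initial distributions
q1 = np.array([1.0, 0.0, 0.0])
q2 = np.array([0.1, 0.1, 0.8])

for _ in range(50):          # iterate q <- q P
    q1 = q1 @ P
    q2 = q2 @ P

print(q1)   # both converge to the same stationary distribution pi,
print(q2)   # which satisfies pi = pi P and sum(pi) == 1
```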

4.4 Hidden Markov Model (HMM)

When I reached this part I was busy with other things and did not study it closely; I mainly reprint this detailed blog explanation of HMM. It deserves a closer look later.

The Hidden Markov Model (HMM) is a classic machine learning model, widely used in speech recognition, natural language processing, pattern recognition and other fields. Of course, with the current rise of deep learning, especially the popularity of neural sequence models such as RNN and LSTM, the status of HMM has declined. Still, as a classic model, learning HMM and its algorithms is very good for our problem-modeling ability and for broadening our algorithmic thinking.

When using a Hidden Markov Model (HMM) our problem generally has these two characteristics:

  • Problems are sequence-based, such as time series, or state sequences.
  • There are two types of data in the problem. One type of sequence data is observable, that is, observation sequence; and the other type of data is unobservable, that is, hidden state sequence, or state sequence for short.

With these two features, the problem can generally be solved with an HMM model. Such problems abound in real life. For example: when I talk to you, the series of continuous sounds I make is the observation sequence, and the actual sentence I want to express is the state sequence; your brain's task is to judge, from this series of continuous sounds, the content most likely being expressed.

Since the observable sequence and the hidden state sequence are probabilistically related, we can model this type of process as a hidden Markov process together with a set of observable states probabilistically related to that hidden process; this is the Hidden Markov Model.

4.4.1 Model Definition

HMM has four important concepts: observation , hidden state , transition probability and emission probability.

In simple terms, Hidden Markov = Markov Chain + a set of observations. The specific content is as follows (from: Explain HMM in detail ):

[Figure: formal definition of the HMM: hidden state set, observation set, transition matrix $A$, emission matrix $B$, and initial distribution $\pi$, together written $λ=(A,B,π)$ (from: Explain HMM in detail)]
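As a concrete illustration, here is a minimal sketch of the three parameter groups $λ=(A,B,π)$ as numpy arrays, using the classic weather/activity toy example. All numbers are invented for illustration; the forward, backward and Viterbi sketches later in this section reuse them.

```python
import numpy as np

states = ["Rainy", "Sunny"]               # hidden states
observations = ["walk", "shop", "clean"]  # observable symbols

pi = np.array([0.6, 0.4])                 # initial distribution over hidden states

A = np.array([                            # A[i, j] = P(state j at t+1 | state i at t)
    [0.7, 0.3],
    [0.4, 0.6],
])

B = np.array([                            # B[i, k] = P(observation k | state i)
    [0.1, 0.4, 0.5],
    [0.6, 0.3, 0.1],
])
```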

4.4.2 Three basic questions

  1. Probability calculation (evaluation) problem: given a model $λ=(A,B,π)$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, compute the probability $P(O|λ)$ of the observation sequence $O$ under the model $λ$;
  2. Learning problem: given an observation sequence $O=(o_1,o_2,\dots,o_T)$, estimate the model parameters $λ=(A,B,π)$ that maximize the observation probability $P(O|λ)$;
  3. Prediction problem/decoding problem: given a model $λ=(A,B,π)$ and an observation sequence $O=(o_1,o_2,\dots,o_T)$, find the state sequence $I$ that maximizes $P(I|O)$. That is, given an observation sequence, find the most likely corresponding state sequence.

4.4.3 Problem 1 - Probability calculation: Find the probability of an observation sequence


The forward-backward algorithm is a general term for the forward algorithm and the backward algorithm, both of which can be used to calculate the probability of the HMM observation sequence.

The forward algorithm is essentially a dynamic programming algorithm: we find a recursion over local states, and expand step by step from the solutions of sub-problems (prefixes of the observation sequence) to the solution of the whole problem.

[Figures: definition of the forward probability $α_t(i)$ and its recursion]
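A minimal numpy sketch of the forward algorithm under the standard recursion (initialize $α_1(i)=π_i b_i(o_1)$, recurse $α_{t+1}(j)=(\sum_i α_t(i)a_{ij})\,b_j(o_{t+1})$, terminate with $P(O|λ)=\sum_i α_T(i)$), reusing the made-up weather parameters above:

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward algorithm: returns P(O | lambda) for a discrete HMM.
    A: (n, n) transition matrix, B: (n, m) emission matrix,
    pi: (n,) initial distribution, obs: list of observation indices."""
    n = A.shape[0]
    T = len(obs)
    alpha = np.zeros((T, n))
    alpha[0] = pi * B[:, obs[0]]           # initialization
    for t in range(1, T):                  # recursion over prefixes
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha[-1].sum()                 # termination

# Same made-up weather model as above.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])
print(forward(A, B, pi, [0, 1, 2]))        # P("walk, shop, clean" | lambda)
```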

Now that we are familiar with using the forward algorithm to compute the probability of an HMM observation sequence, let's look at how the backward algorithm computes the same probability. The backward algorithm is very similar to the forward algorithm: both use dynamic programming, and the only difference is the choice of local state. The backward algorithm uses the "backward probability". So how is the backward probability defined?

[Figures: definition of the backward probability $β_t(i)$ and its recursion]
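And a matching sketch of the backward algorithm ($β_T(i)=1$, then $β_t(i)=\sum_j a_{ij}\, b_j(o_{t+1})\, β_{t+1}(j)$, terminating with $P(O|λ)=\sum_i π_i b_i(o_1) β_1(i)$). It returns the same probability as the forward sketch:

```python
import numpy as np

def backward(A, B, pi, obs):
    """Backward algorithm: also returns P(O | lambda), via backward probabilities."""
    n = A.shape[0]
    T = len(obs)
    beta = np.zeros((T, n))
    beta[-1] = 1.0                               # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):               # recursion over suffixes
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return (pi * B[:, obs[0]] * beta[0]).sum()   # termination

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])
print(backward(A, B, pi, [0, 1, 2]))             # matches the forward result
```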

4.4.4 Problem 2 - Learning problem: Estimating the parameters of the model

Next, we discuss solving for the HMM model parameters, which is the most complicated of the three HMM problems. Before studying it, it is recommended to read the linked summary of the EM algorithm principle and the accessible derivation of the EM algorithm , both of which are used below.


The principle of Baum-Welch algorithm:

[Figures: derivation of the Baum-Welch (EM) update equations]

Using the definition of the forward probability from the forward-backward section above (Hidden Markov Model HMM (II): Forward-Backward Algorithm for Evaluating Observation Sequence Probabilities), we get:

[Figures: the resulting update quantities expressed in terms of the forward and backward probabilities]

Summary of the Baum-Welch algorithm flow:

[Figure: summary of the Baum-Welch algorithm flow]

4.4.5 Problem 3 - Prediction Problem: Solving the Most Likely Hidden State Sequence

Next, we discuss the last of the three HMM problems: given the model and the observation sequence, find the most likely corresponding hidden state sequence. The most commonly used algorithm for the HMM decoding problem is the Viterbi algorithm, though other algorithms can also solve it. The Viterbi algorithm is a general dynamic programming algorithm for finding the shortest path through a sequence, and it can also be applied to many other problems.

The following focuses on decoding the most probable hidden state sequence of the HMM with the Viterbi algorithm.


In HMM, the Viterbi algorithm defines two local states for recursion. An overview of the Viterbi algorithm:

[Figure: definitions of the two Viterbi local states, $δ_t(i)$ and $Ψ_t(i)$, and their recursions]

Summary of Viterbi algorithm flow:

[Figure: summary of the Viterbi algorithm flow]
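A minimal numpy sketch of Viterbi decoding ($δ$ tracks the best path score ending in each state, $Ψ$ tracks the argmax predecessor for backtracking), again reusing the made-up weather parameters:

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi decoding: most likely hidden state sequence for a discrete HMM."""
    n = A.shape[0]
    T = len(obs)
    delta = np.zeros((T, n))           # delta[t, i]: best score of a path ending in state i at time t
    psi = np.zeros((T, n), dtype=int)  # psi[t, i]: argmax predecessor of state i at time t
    delta[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A        # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [delta[-1].argmax()]                   # backtrack from the best final state
    for t in range(T - 1, 0, -1):
        path.append(psi[t][path[-1]])
    return path[::-1]

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.1, 0.4, 0.5], [0.6, 0.3, 0.1]])
pi = np.array([0.6, 0.4])
print(viterbi(A, B, pi, [0, 1, 2]))    # [1, 0, 0]: Sunny, Rainy, Rainy
```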

4.4.6 Practical application

From a practical point of view, HMMs can be used through Python's hmmlearn library (see: Introduction to the use of hmmlearn ); a sketch follows.
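A minimal sketch based on hmmlearn's documented GaussianHMM interface; class and method names follow the library's documentation at the time of writing and may differ between versions, and the data here is random stand-in data:

```python
import numpy as np
from hmmlearn import hmm  # pip install hmmlearn

X = np.random.randn(200, 1)  # stand-in 1-D observation sequence, shape (n_samples, n_features)

model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100)
model.fit(X)             # problem 2: Baum-Welch (EM) parameter estimation
print(model.score(X))    # problem 1: log P(O | lambda), computed with the forward algorithm
print(model.predict(X))  # problem 3: Viterbi-decoded hidden state sequence
```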

4.5 Markov Random Field MRF

In this part, please read mainly: Probability Graph – Markov Random Field (MRF) .

A probabilistic graphical model (PGM) is a probability distribution represented by a graph. A probabilistic undirected graphical model, also known as a Markov random field (MRF) , represents a joint probability distribution , and its standard definition is:

Suppose the joint probability distribution P(V) is represented by an undirected graph G=(V, E), the nodes in the graph G represent random variables, and the edges represent the dependencies between random variables. If the joint probability distribution P(V) satisfies the pairwise, local or global Markov property, the joint probability distribution is called a probabilistic undirected graphical model or a Markov random field.

Given a set of random variables Y whose joint distribution P(Y) is represented by an undirected graph G=(V, E): a node v∈V of graph G represents a random variable $Y_v$, and an edge e∈E represents a dependency between two random variables.

Markov Random Field (MRF) is a typical Markov network . Each node in the graph represents one or a group of variables, and the edges between nodes represent the dependencies between two variables.

[Figure: an example Markov network]

A Markov random field is a random field with Markov properties. To understand what a Markov random field is, we must first understand what a random field is.

In probability theory, let S = {X1, …, Xn} be a family of random variables taking values in the sample space Ω = {0, 1, …, G − 1}^n. If π(ω) > 0 holds for all ω∈Ω, then π is called a random field.

When each location is randomly assigned a value from the phase space according to some distribution, the whole is called a random field. Take farming as an analogy; there are two concepts: sites and phase space. A "site" is like an acre of farmland, and the "phase space" is like the set of crops that can be planted. Planting different crops on different fields is like assigning each "site" of the random field a different value from the phase space. So, to put it plainly, a random field is the matter of which crops are grown on which fields.

  • Clique: A subset of nodes in a graph where any two nodes are connected by an edge;

  • Maximal clique: a clique to which no further node can be added while still remaining a clique.

Hammersley-Clifford theorem : in a Markov random field, the joint probability distribution over all variables can be decomposed, based on the cliques, into a product of potential functions, each of which depends on only one clique.

5. Conditional random field CRF

References in this section: How to explain a conditional random field (CRF) model with a simple and easy-to-understand example? , Graphical model (2) Conditional random field (CRF) , conditional random field (CRF) , principle, example, formula derivation and application of CRF conditional random field .

A conditional random field (CRF) is a conditional probability distribution model (i.e. a discriminative model) of one set of output random variables given another set of input random variables . Its defining characteristic is the assumption that the output random variables constitute a Markov random field.

This section unfolds in the order: probabilistic graphs, HMM, MEMM, and CRF.

5.1 Probabilistic graphs

Following Mr. Zong Chengqing's book, statistical probabilistic graphical models (probability graph models) are organized in an architecture like this:

[Figure: taxonomy of probabilistic graphical models, split into directed Bayesian networks and undirected Markov networks]

In a probabilistic graphical model, the data (samples) are modeled by a graph $G=(V,E)$:

  • $V$ denotes the nodes, i.e. the random variables (here a node can be a token or a label). Specifically, we use $Y=(y_1,y_2,\dots,y_n)$ to model the random variables, where $Y$ now denotes a batch of random variables (imagine a sequence containing many tokens), and $P(Y)$ is the distribution of these random variables;
  • $E$ denotes the edges, i.e. the probabilistic dependencies (understood concretely in combination with the HMM and CRF graphs below).

5.1.1 Directed and Undirected Graphs

As you can see from the above figure, Bayesian networks (belief networks) are all directed, and Markov networks are undirected. Therefore, the Bayesian network is suitable for modeling data with one-way dependencies, and the Markov network is suitable for modeling the interdependence between entities.

Specifically, their core difference lies in how to compute $P(Y)$, that is, how to express the joint probability of $Y=(y_1,y_2,\dots,y_n)$.

  1. Directed graph :

For a directed graph model, the joint probability is computed as: $P(x_1,x_2,\dots,x_n)=\prod_i P(x_i\mid \pi(x_i))$, where $\pi(x_i)$ denotes the parent nodes of $x_i$.

For example, for the following directed graph of (generalized) random variables, the joint probability is expressed as: $P(x_1,\dots,x_5) = P(x_1)\cdot P(x_2\mid x_1)\cdot P(x_3\mid x_2)\cdot P(x_4\mid x_2)\cdot P(x_5\mid x_3,x_4)$.

[Figure: a five-node directed graph with edges $x_1{\to}x_2$, $x_2{\to}x_3$, $x_2{\to}x_4$, $x_3{\to}x_5$, $x_4{\to}x_5$]
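A minimal sketch of this factorization with made-up conditional probability tables for five binary variables (all numbers are invented; the check at the end confirms the product defines a valid joint distribution):

```python
import itertools

# Made-up conditional probability tables; each row is normalized.
p_x1 = [0.4, 0.6]                          # p_x1[x1]
p_x2 = [[0.7, 0.3], [0.2, 0.8]]            # p_x2[x1][x2]
p_x3 = [[0.5, 0.5], [0.1, 0.9]]            # p_x3[x2][x3]
p_x4 = [[0.6, 0.4], [0.3, 0.7]]            # p_x4[x2][x4]
p_x5 = [[[0.9, 0.1], [0.4, 0.6]],          # p_x5[x3][x4][x5]
        [[0.3, 0.7], [0.2, 0.8]]]

def joint(x1, x2, x3, x4, x5):
    # P(x1..x5) = P(x1) P(x2|x1) P(x3|x2) P(x4|x2) P(x5|x3,x4)
    return (p_x1[x1] * p_x2[x1][x2] * p_x3[x2][x3]
            * p_x4[x2][x4] * p_x5[x3][x4][x5])

# Sanity check: a valid factorization sums to 1 over all assignments.
print(sum(joint(*xs) for xs in itertools.product([0, 1], repeat=5)))  # 1.0
```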

  2. Undirected graph :

For undirected graphs, it generally refers to Markov networks, as follows (generalized):

[Figure: a four-node undirected graph whose maximal cliques are $\{X_1,X_2,X_3\}$ and $\{X_2,X_3,X_4\}$]

If the graph is large, we factorize $P(Y)$, writing it as a product of several joint factors. How to decompose it? Divide the graph into several "cliques", where each must be a maximal clique (any two nodes in it are connected; in the figure above, $X_1,X_2,X_3$ is one maximal clique and $X_2,X_3,X_4$ is another). Then:
$$P(Y) = \frac{\prod_c \psi_c(Y_c)}{Z(x)}$$
where $Z(x) = \sum_Y \prod_c \psi_c(Y_c)$; this normalization is applied in order to turn the result into a probability.

So, for an undirected graph like the one shown above, the probability is: $P(Y) = \frac{\psi_1(X_1,X_2,X_3)\,\psi_2(X_2,X_3,X_4)}{Z(x)}$. Here $\psi_c(Y_c)$ is the factor over all random variables in the maximal clique $C$, and it is generally taken to be an exponential potential function: $\psi_c(Y_c) = e^{-E(Y_c)} = e^{\sum_k \lambda_k f_k(c,\,y|c,\,x)}$. The joint probability distribution of the probabilistic undirected graph can then be expressed as:
$$P(Y) = \frac{\prod_c \psi_c(Y_c)}{Z(x)} = \frac{\prod_c e^{\sum_k \lambda_k f_k(c,\,y|c,\,x)}}{Z(x)} = \frac{e^{\sum_c \sum_k \lambda_k f_k(y_i,\,y_{i-1},\,x,\,i)}}{Z(x)}$$
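A minimal brute-force sketch of this clique factorization for the four-node graph above, with made-up potential tables over binary variables (real potentials would come from learned weights):

```python
import itertools
import numpy as np

psi1 = np.random.rand(2, 2, 2)   # psi_1(X1, X2, X3): arbitrary positive values
psi2 = np.random.rand(2, 2, 2)   # psi_2(X2, X3, X4)

def unnormalized(x1, x2, x3, x4):
    # Product of clique potentials, before normalization.
    return psi1[x1, x2, x3] * psi2[x2, x3, x4]

# Partition function Z: sum of unnormalized scores over all assignments.
Z = sum(unnormalized(*xs) for xs in itertools.product([0, 1], repeat=4))

def P(x1, x2, x3, x4):
    return unnormalized(x1, x2, x3, x4) / Z

print(sum(P(*xs) for xs in itertools.product([0, 1], repeat=4)))  # 1.0
```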

5.1.2 Discriminative Models vs Generative Models

Under supervised learning, models can be divided into discriminative models and generative models.

5.2 Sequence Modeling

Common sequences include: time series data, text sentences, speech data, and so on. Sequences in a broad sense have these characteristics:

  • Nodes may or may not have dependencies between them
  • The nodes of the sequence are random or deterministic
  • The sequence varies linearly or non-linearly

There are different problem requirements for different sequences. The common sequence modeling methods are summarized as follows:

  • Fitting and predicting future nodes (or trend analysis):
    • Classical sequence modeling methods: AR, MA, ARMA, ARIMA
    • Regression fitting
    • Neural networks
  • Judging the category of a whole sequence, a classification problem:
    • HMM, CRF, general classifiers (ML models, NN models)
  • Analyzing the state corresponding to each time step, a sequence labeling problem:
    • HMM, CRF, RecurrentNNs

5.3 Definition and Form of CRF

Ah, I can't sort this part out any further. From here, read these articles as you see fit: Graphical Model (2) Conditional Random Field , Machine Learning - Conditional Random Field , Statistical Machine Learning - Conditional Random Field , Conditional Random Field . I'll follow up when I actually need to do named entity recognition. This material is quite unfriendly to someone who has not taken courses in algebra, statistics, or probability theory (only Advanced Mathematics 4) and is relying purely on self-study.

A conditional random field (CRF) is a conditional probability distribution model P(Y|X): given a set of input random variables X, it represents a Markov random field over another set of output random variables Y. That is to say, the characteristic of a CRF is the assumption that the output random variables constitute a Markov random field . Conditional random fields can be viewed as a generalization of the maximum entropy Markov model for labeling problems.

Most introductions cover linear-chain conditional random fields for sequence labeling problems; these are discriminative models that predict an output sequence from an input sequence.

[Figure: a linear-chain conditional random field]

From the problem description, for the sequence labeling problem, X is the observation sequence to be labeled, and Y is the label sequence (state sequence). During the learning process, the model parameters are trained by MLE or MLE with regularization; in the testing process, for a given observation sequence, the model needs to obtain the output sequence with the largest conditional probability.

If the random variables Y constitute a Markov random field represented by an undirected graph G=(V, E), i.e. $P(Y_v\mid X, Y_w, w\neq v)=P(Y_v\mid X, Y_w, w\sim v)$ holds for every node v∈V, then P(Y|X) is called a conditional random field. Here w≠v means all nodes other than v, and w∼v means all nodes connected to v by an edge. If we cover up the identical condition X on both sides of the equation, the formula can be depicted by the following figure:

[Figure: the equation above with X covered, leaving the plain Markov property of Y on the graph]

This is the definition of a Markov random field.

Linear chain conditional random field : the definition does not require X and Y to have the same structure, but in practice it is generally assumed that they do. For a linear chain conditional random field, each edge of graph G connects two adjacent nodes of the state sequence Y, a maximal clique C is a pair of adjacent nodes, and "X and Y have the same graph structure" means that each $X_i$ corresponds one-to-one with $Y_i$:
$$V=\{1,2,\dots,n\},\qquad E=\{(i,\,i+1)\},\quad i=1,2,\dots,n-1$$
Let $X=(X_1,\dots,X_n)$ and $Y=(Y_1,\dots,Y_n)$ be two sets of random variables. The linear chain conditional random field is then defined by (where only one side is considered when i equals 1 or n):
$$P(Y_i\mid X, Y_1,\dots,Y_{i-1},Y_{i+1},\dots,Y_n) = P(Y_i\mid X, Y_{i-1}, Y_{i+1}),\quad i=1,\dots,n$$
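As a concluding sketch, here is a brute-force linear-chain CRF over a tiny label set with made-up emission and transition scores. A real CRF learns these weights from data, and uses the forward algorithm and Viterbi instead of the exhaustive enumeration below, which only works for tiny examples:

```python
import itertools
import numpy as np

labels = [0, 1]                 # e.g. two tags
T = 3                           # sequence length
emit = np.random.rand(T, 2)     # emit[t, y]: score of label y at position t (depends on input x)
trans = np.random.rand(2, 2)    # trans[y_prev, y]: score of the transition y_prev -> y

def score(y_seq):
    # Unnormalized exp-score: exp of summed emission and transition feature scores.
    s = sum(emit[t, y] for t, y in enumerate(y_seq))
    s += sum(trans[y_seq[t - 1], y_seq[t]] for t in range(1, T))
    return np.exp(s)

# Partition function Z(x): sum over all possible label sequences.
Z = sum(score(y) for y in itertools.product(labels, repeat=T))

def P(y_seq):                   # conditional probability P(y | x)
    return score(y_seq) / Z

print(sum(P(y) for y in itertools.product(labels, repeat=T)))  # 1.0
best = max(itertools.product(labels, repeat=T), key=score)     # MAP sequence (what Viterbi computes)
print(best)
```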
