Bayesian Networks (1) - Basic Concepts

The content of this article is mainly summarized from the Coursera course Bayesian Methods for Machine Learning.

1. What is Bayesian probability

Consider a question: we have a coin; how do we judge the probability that it shows heads after being tossed?

  • Frequentist school: flip the coin 100 times and count how many times it comes up heads; the relative frequency of heads approximates the probability that the coin shows heads.

  • Bayesian school: from everyday experience, the probability of a coin showing heads is about 50%, so our judgment starts from this belief and is then adjusted according to the observed toss results (a small sketch follows below).
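
A minimal sketch in Python of the two viewpoints (the simulated tosses and the Beta(10, 10) prior below are illustrative choices, not from the course):

```python
import numpy as np
from scipy import stats

# Illustrative data: 100 simulated tosses of a fair coin (1 = heads)
rng = np.random.default_rng(0)
tosses = rng.binomial(1, 0.5, size=100)
n_heads, n = tosses.sum(), len(tosses)

# Frequentist answer: the relative frequency of heads (maximum likelihood)
p_mle = n_heads / n

# Bayesian answer: start from a prior centred on 0.5, e.g. Beta(10, 10),
# and update it with the observed counts; the posterior is Beta again
prior_a, prior_b = 10, 10
posterior = stats.beta(prior_a + n_heads, prior_b + n - n_heads)

print(f"MLE estimate: {p_mle:.3f}")
print(f"Posterior mean: {posterior.mean():.3f}")
```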

Described in the language of Bayesian probability theory, the example above says that the observer holds a pre-existing belief (the prior), obtains statistical evidence through observation (the evidence), evaluates how plausibly each hypothesis accounts for that evidence (the likelihood), and thereby arrives at a posterior belief (the posterior) that best represents the observer's state of knowledge.
Here, the core problem Bayesian probabilistic inference tries to solve is how to construct a logical system, satisfying certain consistency requirements, that assigns each assertion a real-valued measure of its plausibility, so that the observer can reason about the state of the available information. The observer's belief, or state of knowledge, about a variable is what the frequentist school would call a "probability distribution": the observer's state of knowledge is the assignment of plausibility to the various values the observed variable may take.
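
In symbols, Bayes' theorem ties these four quantities together:

$$
p(\theta \mid X) = \frac{p(X \mid \theta)\, p(\theta)}{p(X)},
\qquad
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}
$$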


The biggest difference between the frequentist school and the Bayesian school actually arises from how they view the parameter space. The parameter space is the set of possible values of the parameter you care about. The frequentist school (historically, Fisher) does not care about every detail of the parameter space; they believe the data were generated at one "true" parameter value somewhere in that space (even though you do not know what that value is), so their methodology starts from the question "which value is most likely to be the true one?". Hence tools such as maximum likelihood estimation and confidence intervals, whose very names tell you that all they care about is how confidently the single true parameter can be pinned down. The Bayesian school is the exact opposite: they care about every value in the parameter space, because we do not have a God's-eye view, so how could we know which value is the true one? Every value in the parameter space may be the value used by the real model; the difference lies only in its probability. They therefore introduce the prior and posterior distributions to quantify the probability of each value in the parameter space. The difference is best illustrated by imagining that your posterior distribution is bimodal: frequentist methods would choose the higher of the two peaks as the best guess, while Bayesians would report both values together with their corresponding probabilities.
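
A toy illustration of that last point (the bimodal posterior below is made up on a grid purely for the example):

```python
import numpy as np

# A made-up bimodal posterior over a parameter theta, tabulated on a grid
theta = np.linspace(0.0, 1.0, 1001)
posterior = 0.45 * np.exp(-0.5 * ((theta - 0.3) / 0.05) ** 2) \
          + 0.55 * np.exp(-0.5 * ((theta - 0.8) / 0.05) ** 2)
posterior /= posterior.sum()

# Single "best guess": the location of the highest peak
theta_best = theta[np.argmax(posterior)]

# Bayesian report: keep the whole distribution, e.g. the mass around each mode
mass_left = posterior[theta < 0.55].sum()
mass_right = posterior[theta >= 0.55].sum()
print(theta_best, mass_left, mass_right)
```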

2. Basic model

1. Offline classification model

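Roughly, the idea (in the standard Bayesian formulation, which is presumably what the course slide showed) is: train by computing a posterior over the model parameters $\theta$ from the training data, then predict by averaging the predictive distribution over that posterior:

$$
p(\theta \mid X_{tr}, y_{tr}) = \frac{p(y_{tr} \mid X_{tr}, \theta)\, p(\theta)}{p(y_{tr} \mid X_{tr})},
\qquad
p(y \mid x, X_{tr}, y_{tr}) = \int p(y \mid x, \theta)\, p(\theta \mid X_{tr}, y_{tr})\, d\theta
$$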

2. Online training model

Take the current model as the prior distribution and obtain the updated distribution from the new data.
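
In other words, the posterior after the first $k-1$ observations becomes the prior for observation $k$:

$$
p(\theta \mid x_1, \dots, x_k) \;\propto\; p(x_k \mid \theta)\, p(\theta \mid x_1, \dots, x_{k-1})
$$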

3. Conjugate distribution

From the above we can see that, since the evidence p(x) is obtained from the data, computing the posterior distribution requires the prior distribution and the likelihood. The likelihood is determined by the distribution model we choose to fit the data, so the question becomes how to obtain the prior distribution. Here we introduce the concept of the conjugate distribution.

Conjugate distribution:
Three distributions are involved: the prior, the likelihood, and the posterior. If the posterior determined by the prior and the likelihood belongs to the same family of distributions as the prior, then the prior is said to be the conjugate distribution of the likelihood, also known as the conjugate prior.

By introducing a conjugate prior, all three distributions take a known closed form, which paves the way for the parameter derivations that follow.
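
The classic example is a Beta prior with a Bernoulli (coin-toss) likelihood. After observing $k$ heads in $n$ tosses:

$$
p(\theta) = \mathrm{Beta}(\theta \mid a, b) \propto \theta^{a-1}(1-\theta)^{b-1},
\qquad
p(X \mid \theta) \propto \theta^{k}(1-\theta)^{n-k}
$$

$$
p(\theta \mid X) \propto \theta^{k+a-1}(1-\theta)^{n-k+b-1} = \mathrm{Beta}(\theta \mid a+k,\; b+n-k)
$$

The posterior is again a Beta distribution, so the Beta distribution is the conjugate prior of the Bernoulli (and binomial) likelihood.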

Here are some typical distributions.

1. Gaussian distribution and its likelihood calculation

(1) Distribution description
Example: the distance you run each day

(2) Calculation description
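
A small sketch of the calculation (the daily running distances below are invented; the conjugate update assumes the variance is known and a Gaussian prior on the mean):

```python
import numpy as np
from scipy import stats

# Invented daily running distances, in km
x = np.array([5.1, 4.8, 6.0, 5.5, 4.9])

# Log-likelihood of the data under a Gaussian with fixed mean and std
mu, sigma = 5.0, 0.5
log_lik = stats.norm(mu, sigma).logpdf(x).sum()

# Conjugate update: with a Gaussian prior on the mean (known variance),
# the posterior over the mean is Gaussian again
mu0, tau0 = 5.0, 1.0                      # prior mean and prior std of the mean
n = len(x)
post_var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
post_mu = post_var * (mu0 / tau0**2 + x.sum() / sigma**2)

print(log_lik, post_mu, np.sqrt(post_var))
```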

2. Beta distribution and its likelihood calculation

(1) Distribution description
Example: movie ratings, etc.

(2) Calculation description
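
A sketch of the rating example (the counts are made up), treating each rating as positive/negative and updating a Beta prior:

```python
from scipy import stats

# Made-up counts: 40 positive and 10 negative ratings for a movie
pos, neg = 40, 10

# Uniform Beta(1, 1) prior over "the probability a viewer likes the movie"
prior_a, prior_b = 1, 1
posterior = stats.beta(prior_a + pos, prior_b + neg)

print(posterior.mean())          # posterior mean rating
print(posterior.interval(0.95))  # 95% credible interval
```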

4. Latent variables

The cases above involve only a single variable. In reality, in multivariate scenarios the variables are related to one another, for example judging height from age and weight, where age and weight are themselves related. In such cases computing the probability becomes complicated.


At this point latent (hidden) variables are introduced; all the observed variables are connected through the latent variable, which makes the probability model much simpler.
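
Concretely, with a latent variable $t$ and the observed variables assumed conditionally independent given $t$ (the usual simplification), the model factorizes as:

$$
p(x_1, \dots, x_d) = \sum_{t} p(t) \prod_{i=1}^{d} p(x_i \mid t)
$$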

5. Gaussian Mixture Model (GMM)

The latent variable does not have to take a single value: multiple component distributions can be added as needed to fit the data, and this is the Gaussian mixture model.

The way the model is trained also changes: the goal of the computation is to find the parameters that maximize the product, over the data points, of the probabilities obtained by summing over the latent variable (the marginal likelihood).
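
A sketch using scikit-learn's GaussianMixture, which fits the mixture with the EM algorithm (the synthetic data below are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic 1-D data drawn from two Gaussians
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-2.0, 1.0, 300),
                    rng.normal(3.0, 0.5, 200)]).reshape(-1, 1)

# Fit a 2-component Gaussian mixture with EM
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(gmm.weights_)               # mixing proportions p(t)
print(gmm.means_.ravel())         # component means
print(gmm.predict_proba(X[:5]))   # posterior p(t | x) for the first points
```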
