Machine learning basics - learn the naive Bayes model in one article

In this article we talk about the naive Bayes model, a classic and very simple machine learning model that is well suited to beginners.

As its name suggests, the naive Bayes model is closely related to Bayes' theorem. We already introduced Bayes' theorem in the earlier article on the three-doors game; let us first briefly recall the formula:

\[P(A|B)=\frac{P(A)P(B|A)}{P(B)}\]

We call \(P(A)\) and \(P(B)\) prior probabilities, so Bayes' formula estimates a posterior probability from priors and a conditional probability. In other words, it reasons from effect back to cause: given an event that has occurred, we explore its possible causes. The naive Bayes model is built on this principle, and the principle itself is very simple, naive even: when a sample may belong to more than one category, we simply choose the most probable one.
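To make the formula concrete, here is a tiny numeric sketch in Python; the three input numbers are made up purely for illustration.

# All numbers are hypothetical, chosen only to illustrate the formula above.
p_a = 0.01          # prior P(A)
p_b_given_a = 0.9   # conditional probability P(B|A)
p_b = 0.05          # prior P(B)

# Bayes' theorem: P(A|B) = P(A) * P(B|A) / P(B)
p_a_given_b = p_a * p_b_given_a / p_b
print(p_a_given_b)  # 0.18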

Since it selects the category a sample belongs to, the naive Bayes model is clearly a classification algorithm.

Before we introduce the algorithm itself, let us get familiar with a few concepts. Several of them were already covered in previous articles, so treat this as a review.


Prior probability


The prior probability is actually easy to understand. Put plainly, a prior probability is one we can work out in advance by experiment or calculation. For example, a coin landing heads up, hitting a red light at an intersection, or rain falling tomorrow.

Some of these we can measure by experiment, and some we can estimate from past experience. Either way, the probabilities of these events are relatively clear before any question is asked. Since they can be determined before we build a probability model, they are called prior probabilities.
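As a small illustration of determining a probability in advance by experiment, here is a hedged sketch that estimates the prior of a coin landing heads by simulation; the fairness of the coin is an assumption of this toy example.

import random

random.seed(0)
# estimate the prior probability of "heads" by repeating the coin-toss experiment many times
flips = [random.random() < 0.5 for _ in range(100000)]
print(sum(flips) / len(flips))   # close to 0.5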


Posterior probability


The posterior probability is, intuitively, the opposite of the prior: it cannot be obtained directly from experience or from previous experiments. It refers to the probability that an event which has already happened was caused by one particular reason rather than another.

Take a student sitting an exam, for example. The probability of passing can be measured: test the same student several times, or gather statistics over a batch of students, either is feasible. But suppose that before the exam the student can choose either to revise or to play games. Revising will obviously increase the probability of passing, while playing games may lower it or leave it unchanged; we do not know. Now suppose we learn that Bob has passed the exam and we want to know whether he revised beforehand. That is a posterior probability.

Logically, it is the reverse of a conditional probability. A conditional probability is the probability that event B occurs given that event A has occurred; the posterior probability is the probability that event A occurred, given that we already know event B has occurred.
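To tie the exam example to the formula, here is a minimal sketch; the probabilities for revising and passing are invented for illustration only.

# All numbers below are assumptions made up for this toy example.
p_review = 0.5              # prior: probability the student revised before the exam
p_pass_given_review = 0.8   # conditional: probability of passing after revising
p_pass_given_game = 0.4     # conditional: probability of passing after playing games instead

# total probability of passing
p_pass = p_pass_given_review * p_review + p_pass_given_game * (1 - p_review)

# posterior: probability the student revised, given that we know he passed
p_review_given_pass = p_pass_given_review * p_review / p_pass
print(p_review_given_pass)  # 0.666...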


Likelihood Estimation


This term is everywhere: every article introducing Bayesian methods mentions it, yet few explain the concept clearly.

The English word is likelihood. Semantically it is very close to probability; the two were probably distinguished mainly in translation. Mathematically they are also written almost identically, both as \(P(x|\theta)\).

The difference is in what is being asked. Probability asks: given that we already know the parameter \(\theta\), what is the probability that event x occurs? Likelihood focuses on the parameter \(\theta\) given that the event has occurred. So, naturally, likelihood estimation means estimating the parameters of a probability distribution from observed outcomes. Maximum likelihood estimation is then easy to understand: given that event x has occurred, find the value of \(\theta\) that makes it most likely.

Here is a very simple example. Suppose we have an opaque box containing some black balls and some white balls, but we do not know how many of each. To explore the ratio, we draw a ball from the box 10 times, putting it back each time. Suppose the result is 7 black and 3 white; what proportion of the balls in the box is black?

This question could hardly be simpler, right? Isn't it a primary-school problem? Out of 10 draws, 7 were black, so obviously the proportion of black balls should be 70%. What could possibly be wrong with that?

On the surface, nothing. But strictly speaking it is wrong, because the result of an experiment does not represent the probability itself. Put simply, a box with 70% black balls can produce 7 black and 3 white, but a box with 50% black balls can produce the same result. How can we then conclude that the box must contain 70% black balls?

This is where the likelihood function comes in.


Likelihood function


Let us substitute the black-and-white ball experiment into the likelihood formula above. The outcome of the experiment is fixed; that is our event x. What we want is the proportion of black balls, which is the parameter \(\theta\). Since we draw with replacement, the probability of drawing a black ball is the same on every draw, so by the binomial distribution we can write the probability of the observed outcome x as:

\[P(x|\theta)=\theta^7(1-\theta)^3=f(\theta)\]

This formula is our likelihood function. It reflects, for different parameter values, the probability that the event x occurs. What we have to do is find the value of \(\theta\) that maximizes \(f(\theta)\).

The calculation is very simple: differentiate with respect to \(\theta\), set the derivative to zero, and solve for \(\theta\). The result is, of course, that the function attains its maximum at \(\theta = 0.7\).
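If you prefer not to take the derivative by hand, a quick numeric check on a fine grid (a sketch using numpy) gives the same answer:

import numpy as np

# evaluate the likelihood on a fine grid and check where it peaks
theta = np.linspace(0, 1, 100001)
likelihood = theta ** 7 * (1 - theta) ** 3
print(theta[np.argmax(likelihood)])   # 0.7, matching the result from the derivative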

We can also plot \(f(\theta)\) to get an intuitive feel for the probability distribution.

import numpy as np
import matplotlib.pyplot as plt

# evaluate the likelihood f(theta) = theta^7 * (1 - theta)^3 over the range of possible theta
x = np.linspace(0, 1, 100)
y = np.power(x, 7) * np.power(1 - x, 3)

plt.plot(x, y)
plt.xlabel('value of theta')
plt.ylabel('value of f(theta)')
plt.show()

This also confirms our intuition. It is not that the proportion of black balls in the box is 70% because we drew 70% black balls; rather, a box with 70% black balls is the one that makes drawing 70% black balls most probable.


The model in detail


Now for the main event. Let us look at Bayes' formula again:

\[P(A|B)=\frac{P(A)P(B|A)}{P(B)}\]

Let us transform the formula a little. Suppose C is the set of all events related to event B; obviously \(A \in C\). Suppose the set C contains m events, written \(C_1, C_2, \cdots, C_m\).

Then

\[P(B)=\sum_{i=1}^mP(B|C_i)P(C_i)\]

When we look for the cause of event B, we consider every possible cause in the set C and then pick out the one with the greatest probability as the answer.
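Here is a small sketch of that idea with made-up numbers: expand P(B) over the causes, then pick the cause with the largest posterior.

# Hypothetical numbers: three mutually exclusive causes C_1, C_2, C_3 covering all cases.
p_c = [0.5, 0.3, 0.2]             # priors P(C_i); they sum to 1
p_b_given_c = [0.1, 0.6, 0.8]     # conditional probabilities P(B|C_i)

# law of total probability: P(B) = sum over i of P(B|C_i) * P(C_i)
p_b = sum(pb * pc for pb, pc in zip(p_b_given_c, p_c))
print(p_b)   # 0.39

# posterior over the causes, then pick the most probable one given that B occurred
posteriors = [pb * pc / p_b for pb, pc in zip(p_b_given_c, p_c)]
print(posteriors.index(max(posteriors)))   # 1, i.e. C_2 is the most likely cause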

The principle we use for classification is exactly the same: for a sample x, we compute the probability that it belongs to each category and then select the most probable one as the final prediction. This simple idea is the principle of the naive Bayes model.

Suppose \(x = \{a_1, a_2, \cdots, a_n\}\), where each \(a_j\) is the value of one feature dimension of the sample x. Likewise, we have a set of categories \(C = \{y_1, y_2, \cdots, y_m\}\), where each y denotes one particular category. What we have to do is compute the probability that x belongs to each category y and select the category with the maximum probability as the final classification result.

Writing out the probability according to Bayes' formula:

\[P(y_i|x)=\frac{P(x|y_i)P(y_i)}{P(x)}\]

Here \(P(x)\) is a constant that stays the same for every \(y_i\), so it can be ignored; we only need to focus on the numerator.

At this point we make an important assumption: we assume that the feature values of the sample x in different dimensions are independent of each other.

This assumption is very simple but very important. Without it, the probability would become so complicated that it would be almost impossible for us to compute. It is precisely because of this simple assumption that the model is called the naive Bayes model; that is where the name comes from. Of course, since the English name is naive Bayes, you could just as well read "naive" in its everyday sense.

With this assumption in place, it becomes much easier to expand the formula:

\[P(y_i|x) \propto P(y_i)P(a_1|y_i)P(a_2|y_i)\cdots P(a_n|y_i)=P(y_i)\prod_{j=1}^nP(a_j|y_i)\]

Here \(P(y_i)\) is the prior probability, which we can obtain from experiments or other means, while \(P(a_j|y_i)\) cannot be obtained directly; we need to compute it with statistical methods.
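Once the prior and the conditional probabilities are available, the per-class score is just this product; the numbers below are placeholders for illustration only.

# Hypothetical, pre-computed quantities for one class y_i and one sample x = (a_1, a_2, a_3).
prior = 0.4                        # P(y_i), e.g. the frequency of class y_i in the training data
conditionals = [0.2, 0.5, 0.1]     # P(a_j | y_i) for the sample's feature values

score = prior
for p in conditionals:
    score *= p
print(score)   # 0.004, proportional to P(y_i|x); compute this for every class and compare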

If \(a_j\) is a discrete value, this is very simple: we just count, among the occasions on which \(y_i\) occurred, the proportion in which the value \(a_j\) appeared. Suppose that over our experiments \(y_i\) occurred M times in total and \(a_j\) occurred N times among them; then clearly:

\[P(a_j|y_i)=\frac{N}{M}\]

To guard against M = 0, we can add smoothing terms to the numerator and denominator, so the final result is written:

\[P(a_j|y_i)=\frac{N+\alpha}{M+\beta}\]
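A common concrete choice for the smoothing terms is Laplace smoothing, where \(\beta\) is \(\alpha\) times the number of distinct feature values; here is a minimal sketch on made-up data.

# Toy training data (made up): each pair is (feature value a_j, class label y).
samples = [('sunny', 'play'), ('sunny', 'play'), ('rainy', 'play'),
           ('rainy', 'stay'), ('sunny', 'stay'), ('rainy', 'stay')]

alpha = 1.0            # smoothing constant added to the numerator
n_values = 2           # number of distinct feature values ('sunny', 'rainy')

def smoothed_conditional(a, y):
    # N: how often feature value a appears among samples of class y
    n = sum(1 for feat, label in samples if feat == a and label == y)
    # M: total number of samples of class y
    m = sum(1 for _, label in samples if label == y)
    # Laplace smoothing: beta in the formula above is alpha times the number of feature values
    return (n + alpha) / (m + alpha * n_values)

print(smoothed_conditional('sunny', 'play'))   # (2 + 1) / (3 + 2) = 0.6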

But what if \(a_j\) is a continuous value? A continuous variable can take far too many values; we obviously cannot compute a probability for every single one of them, nor can we collect enough samples for each. What should we do in this situation?

Continuous values are not a problem: we can assume that the variable follows a normal distribution. The normal curve then describes the probability distribution of this variable.

Taking the normal curve as an example, the cumulative value at a position x is the area under the curve from negative infinity to x. This area lies between 0 and 1, and we can use it to represent a probability f(x). In fact, assuming that the variables in different dimensions each follow a normal distribution is also the idea behind the Gaussian Mixture Model (GMM); that goes beyond this article, so we will not expand on it here.

In short: if a value is discrete, we estimate the probability by counting proportions; if it is continuous, we compute the probability from the assumed normal distribution. Either way, we can multiply the n terms \(P(a_j|y_i)\) together to obtain a score for \(P(y_i|x)\); finally, we compare the scores for all the categories y and pick the largest one as the classification result.
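In practice, implementations such as Gaussian naive Bayes estimate the mean and variance of each continuous feature per class and plug the observed value into the normal density to stand in for \(P(a_j|y_i)\); the numbers below are assumptions for illustration.

import math

def gaussian_density(value, mean, var):
    # density of the class-conditional normal distribution at the observed feature value
    return math.exp(-(value - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Hypothetical per-class statistics estimated from training samples of class y_i.
mean_height, var_height = 170.0, 36.0
print(gaussian_density(175.0, mean_height, var_height))   # used in place of P(a_j|y_i)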

The above process is completely correct, but there is a small problem.

Each \(P(a_j|y_i)\) is a floating-point number, and probably a very small one, and we need to compute the product of n such numbers. Because of floating-point precision limits, once the product becomes smaller than the smallest representable value it underflows, and we can no longer compare the sizes of two probabilities.

To solve this problem, we transform the chain of floating-point multiplications: we take the logarithm of both sides of the equation, which converts the product of floating-point numbers into a sum:

\[ \begin{eqnarray} P(y_i|x) &\propto& P(y_i)\prod_{j=1}^nP(a_j|y_i) \\ \log P(y_i|x) &\propto& \log P(y_i) + \sum_{j=1}^n \log P(a_j|y_i) \end{eqnarray} \]

Because the logarithm is a monotonic function, we can directly compare the results after taking logs, which avoids the problem caused by limited precision.
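A quick sketch of why this matters (assuming Python 3.8+ for math.prod): the raw products of many small probabilities underflow and become indistinguishable, while their log sums still rank the classes correctly.

import math

# Many small conditional probabilities: the raw products underflow to 0.0,
# while the sums of logarithms remain comparable between classes.
probs_class_a = [1e-5] * 80
probs_class_b = [2e-5] * 80

print(math.prod(probs_class_a) == math.prod(probs_class_b) == 0.0)   # True: both products underflow

log_a = sum(math.log(p) for p in probs_class_a)
log_b = sum(math.log(p) for p in probs_class_b)
print(log_b > log_a)   # True: the log scores can still be compared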

That covers the principles of the naive Bayes model. In a later article we will look at applying the naive Bayes model to text classification.
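If you want to experiment before the follow-up article, scikit-learn ships ready-made naive Bayes classifiers; this is only a hedged sketch on random toy data, assuming scikit-learn is installed.

import numpy as np
from sklearn.naive_bayes import GaussianNB

# Toy data (made up): two clusters of 2-dimensional continuous features, one per class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

model = GaussianNB()       # naive Bayes with the normal-distribution assumption discussed above
model.fit(X, y)
print(model.predict([[0.1, -0.2], [2.8, 3.1]]))   # expected to print [0 1]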

This article was a bit harder than usual; if you got something out of it, please consider following.
