Machine learning theory study: Naive Bayes

I have been reading "Statistical Learning Methods" recently. My plan is to understand the theory first and then, later on, focus on implementing my own ML library in C++. That's the plan, so let's get to it. In day-to-day work this algorithm is not used very often, but its main advantage is that it performs very well on small samples, on the order of dozens or hundreds of examples. Even if you have thousands of samples it is still worth a try, since it runs fast. In addition, compared with other machine learning algorithms this probabilistic model is highly interpretable; in short, it is easy to understand. Naive Bayes has many variants, such as Gaussian and multinomial naive Bayes, and many evaluation tools, such as the Brier score, the log-likelihood, and the reliability (calibration) curve. After all, naive Bayes has very few parameters to tune, so if you want to use it but find there is nothing to adjust, you can try calibration with a reliability curve; you may be pleasantly surprised. I won't cover those here, but you can look into them yourself if you are interested. Enough preamble, let's get started.


Table of Contents

I. Overview

II. Learning and Classification with Naive Bayes

III. Parameter Estimation in the Naive Bayes Method

 3.1 The Naive Bayes Algorithm

IV. Exploring Bayes: Sample Imbalance


Naive Bayes is a classification method based on Bayes' theorem together with the assumption of conditional independence between features. For a given training data set, it first learns the joint probability distribution of the input and output under the conditional independence assumption; then, for a given input, it uses Bayes' theorem to find the output that maximizes the posterior probability. One thing to stress is that the premise of the naive Bayes algorithm is that the features are conditionally independent: if the features are strongly correlated with each other, the algorithm will not work very well. At the same time, even with high-dimensional features the Bayesian approach can perform well on very little data and is fast. Features produced by PCA and other dimensionality-reduction methods carry certain internal correlations, so they are not well suited to the naive Bayes algorithm.

I. Overview

Naive Bayes is a supervised learning algorithm that directly models the probabilistic relationship between labels and features, and it is an algorithm dedicated to classification. Naive Bayes is rooted in Bayesian theory from probability and mathematical statistics, so it is a probabilistic model with an impeccable theoretical pedigree. Next, let's get to know this simple and fast probabilistic algorithm.

Naive Bayes is considered one of the simplest classification algorithms. First, we need some basic probability theory. Suppose there are two random variables X and Y, which can take the values x and y respectively. With these two random variables we can define two probabilities:

Key concepts: joint probability and conditional probability

Joint probability: the probability that X takes the value x and Y takes the value y at the same time, written P(X=x, Y=y).

Conditional probability: the probability that Y takes the value y given that X takes the value x, written P(Y=y|X=x).

For example, let X be "temperature" and Y be "ladybug hibernation". X takes values in {0, 1}, where 0 means the temperature has not dropped below 0 degrees and 1 means it has; Y takes values in {0, 1}, where 0 means the ladybug does not hibernate and 1 means it does. Suppose the two events individually have the following probabilities:

  • P(X=1)=50%: the probability that the temperature drops below 0 degrees is 50%, so P(X=0)=50%.
  • P(Y=1)=70%: the probability that the ladybug hibernates is 70%, so P(Y=0)=30%.

Then the joint probability of these two events is P(X=1, Y=1), the probability that both events happen at the same time: the temperature drops below 0 degrees and the ladybug goes into hibernation.

The conditional probability between the two events is P(Y=1|X=1), the probability that the ladybug goes into hibernation given that the temperature has dropped below 0 degrees. In other words, the temperature dropping below 0 degrees influences, to some extent, whether the ladybug hibernates. In probability theory we can show that the joint probability of two events equals the conditional probability of one event given the other, multiplied by the probability of the conditioning event itself.

Written out, the relation above is:

$$P(X=x,\, Y=y) \;=\; P(Y=y \mid X=x)\,P(X=x) \;=\; P(X=x \mid Y=y)\,P(Y=y)$$

Rearranging, we obtain Bayes' theorem:

$$P(Y=y \mid X=x) \;=\; \frac{P(X=x \mid Y=y)\,P(Y=y)}{P(X=x)}$$

This formula is the root of all our Bayesian algorithms. We can regard the feature X as the conditioning event and the label Y we want to solve for as the outcome that is affected once the condition holds. The probability relating the two, P(Y|X), is called the posterior probability of the label in machine learning: we observe the condition first and then solve for the outcome. The probability that the label Y takes a certain value without any restriction at all is written P(Y); in contrast to the posterior probability, it is completely unconditioned and is called the prior probability of the label. Finally, P(X|Y) is called the class-conditional probability: the probability that X takes a certain value when the value of Y is fixed. Now, something interesting appears.
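To make this concrete, here is a minimal Python sketch that plugs the ladybug numbers above into Bayes' theorem. Note that the class-conditional probability P(X=1 | Y=1) = 0.6 is an assumed value I chose for illustration; it is not given anywhere in the text.

```python
# Bayes' theorem on the ladybug example.
# P(X=1 | Y=1) = 0.6 is an assumed, illustrative value (not from the text).

p_y1 = 0.70           # prior P(Y=1): the ladybug hibernates
p_x1 = 0.50           # evidence P(X=1): temperature drops below 0 degrees
p_x1_given_y1 = 0.60  # assumed class-conditional probability P(X=1 | Y=1)

# posterior = likelihood * prior / evidence
p_y1_given_x1 = p_x1_given_y1 * p_y1 / p_x1
print(f"P(Y=1 | X=1) = {p_y1_given_x1:.2f}")  # -> 0.84
```

Under that assumed likelihood, observing the cold snap raises the probability of hibernation from the prior 0.70 to a posterior of 0.84.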

II. Learning and Classification with Naive Bayes

Suppose the output class label takes values in $\{c_1, c_2, \ldots, c_K\}$, the input is a feature vector $X = (X^{(1)}, X^{(2)}, \ldots, X^{(n)})$, and the training data set is $T = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. The class-conditional probability distribution is:

$$P(X=x \mid Y=c_k) \;=\; P\big(X^{(1)}=x^{(1)}, \ldots, X^{(n)}=x^{(n)} \,\big|\, Y=c_k\big), \quad k = 1, 2, \ldots, K$$

In Bayes' theorem the numerator requires P(X|Y). Naive Bayes imposes a conditional independence assumption on this conditional distribution (this is where the "naive" in the name comes from). Under the conditional independence assumption we can therefore write:

$$P(X=x \mid Y=c_k) \;=\; \prod_{j=1}^{n} P\big(X^{(j)}=x^{(j)} \,\big|\, Y=c_k\big)$$

For the denominator P(X) of Bayes' theorem, we can compute it with the law of total probability:

$$P(X=x) \;=\; \sum_{k=1}^{K} P(X=x \mid Y=c_k)\,P(Y=c_k)$$

Therefore, when classifying with naive Bayes, for a given input x we use the learned model to compute the posterior distribution P(Y=c_k|X=x) and output the class with the largest posterior probability as the class of x. The posterior is computed via Bayes' theorem:

$$P(Y=c_k \mid X=x) \;=\; \frac{P(X=x \mid Y=c_k)\,P(Y=c_k)}{\sum_{k} P(X=x \mid Y=c_k)\,P(Y=c_k)}$$

Substituting the conditional independence assumption:

$$P(Y=c_k \mid X=x) \;=\; \frac{P(Y=c_k)\,\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k} P(Y=c_k)\,\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$

The naive Bayes classifier can then be expressed as:

$$y = f(x) = \arg\max_{c_k} \frac{P(Y=c_k)\,\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)}{\sum_{k} P(Y=c_k)\,\prod_{j} P(X^{(j)}=x^{(j)} \mid Y=c_k)}$$

 

In this formula, P(Y=c_k) is easy to estimate from the training set, but P(X) and P(X|Y) are not as straightforward. In the example below we use the law of total probability to compute the denominator; even with just two features there are already four class-conditional probabilities to evaluate, and as the number of features grows the work in the denominator grows exponentially, while the P(X|Y) term in the numerator also becomes harder and harder to compute.

In the actual classification computation, the denominator is the same for every class we compare, so we do not need to compute it at all; we only need to compare the sizes of the numerators. (If the posterior probabilities themselves are wanted, we can compute each class's numerator and sum them to recover the denominator, which still avoids evaluating the probability of a sample under the full joint distribution of all features.) Choosing the class with the largest posterior in this way is called maximum a posteriori (MAP) estimation. For MAP estimation we only need the numerators: for a given sample we look up the probability of each of its feature values under each class and multiply them together with the class prior,

$$y = \arg\max_{c_k} P(Y=c_k)\,\prod_{j=1}^{n} P\big(X^{(j)}=x^{(j)} \mid Y=c_k\big)$$

Let's look at an example first; after working through it, the classification procedure will be clear.

| Index | Temperature (X1) | Ladybug's age (X2) | Hibernates (Y) |
|-------|------------------|--------------------|----------------|
| 0  | Below zero | 10 days    | Yes |
| 1  | Below zero | 20 days    | Yes |
| 2  | Above zero | 10 days    | No  |
| 3  | Below zero | One month  | Yes |
| 4  | Below zero | 20 days    | No  |
| 5  | Above zero | Two months | No  |
| 6  | Below zero | One month  | No  |
| 7  | Below zero | Two months | Yes |
| 8  | Above zero | One month  | No  |
| 9  | Above zero | 10 days    | No  |
| 10 | Below zero | 20 days    | No  |

Now we want to predict whether a 20-day-old ladybug will hibernate when the temperature is below zero.

For the numerators, counting in the table gives:

$$P(Y=\text{yes}) = \tfrac{4}{11},\quad P(X_1=\text{below zero} \mid Y=\text{yes}) = \tfrac{4}{4} = 1,\quad P(X_2=\text{20 days} \mid Y=\text{yes}) = \tfrac{1}{4}$$

$$P(Y=\text{no}) = \tfrac{7}{11},\quad P(X_1=\text{below zero} \mid Y=\text{no}) = \tfrac{3}{7},\quad P(X_2=\text{20 days} \mid Y=\text{no}) = \tfrac{2}{7}$$

so the numerator for "yes" is $\tfrac{4}{11}\cdot 1 \cdot \tfrac{1}{4} = \tfrac{1}{11} \approx 0.091$ and the numerator for "no" is $\tfrac{7}{11}\cdot\tfrac{3}{7}\cdot\tfrac{2}{7} = \tfrac{6}{77} \approx 0.078$.

For the denominator, summing the two numerators gives:

$$P(X) = \tfrac{1}{11} + \tfrac{6}{77} = \tfrac{13}{77} \approx 0.169$$

and therefore $P(Y=\text{yes} \mid X) = \tfrac{1/11}{13/77} = \tfrac{7}{13} \approx 0.538$ and $P(Y=\text{no} \mid X) = \tfrac{6}{13} \approx 0.462$.

The threshold is set to 0.5: if the posterior probability of hibernating is greater than 0.5 we predict hibernation, otherwise not. Since $7/13 \approx 0.538 > 0.5$, we predict that a 20-day-old ladybug in sub-zero conditions will hibernate. This completes one prediction.
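As a sanity check on the arithmetic, here is a small Python sketch (the data encoding and variable names are mine) that reproduces the counts and posteriors from the table above.

```python
from collections import Counter

# Ladybug data from the table above: (temperature, age, hibernates)
data = [
    ("below zero", "10 days",    "yes"),
    ("below zero", "20 days",    "yes"),
    ("above zero", "10 days",    "no"),
    ("below zero", "one month",  "yes"),
    ("below zero", "20 days",    "no"),
    ("above zero", "two months", "no"),
    ("below zero", "one month",  "no"),
    ("below zero", "two months", "yes"),
    ("above zero", "one month",  "no"),
    ("above zero", "10 days",    "no"),
    ("below zero", "20 days",    "no"),
]

query = ("below zero", "20 days")             # 20-day-old ladybug, sub-zero weather
label_counts = Counter(row[-1] for row in data)

numerators = {}
for c in label_counts:
    rows = [row for row in data if row[-1] == c]
    prior = label_counts[c] / len(data)       # P(Y = c)
    likelihood = 1.0
    for j, value in enumerate(query):         # prod_j P(X_j = x_j | Y = c)
        likelihood *= sum(row[j] == value for row in rows) / len(rows)
    numerators[c] = prior * likelihood        # class prior times likelihood

evidence = sum(numerators.values())           # P(X) via the law of total probability
for c, num in numerators.items():
    print(f"P(Y={c} | X) = {num / evidence:.3f}")
# -> P(Y=yes | X) = 0.538, P(Y=no | X) = 0.462
```

The printed posterior for "yes" is 7/13 ≈ 0.538, matching the hand calculation.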

III. Parameter Estimation in the Naive Bayes Method

As the previous section shows, to apply naive Bayes we need to estimate P(Y) and P(X|Y). These probabilities can be estimated by maximum likelihood. The maximum likelihood estimate of the prior probability P(Y=c_k) is:

$$P(Y=c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}, \quad k = 1, 2, \ldots, K$$

In other words, we simply compute the proportion of the N training samples whose label is c_k, which gives us the probability distribution P(Y).

Assuming that the set of possible values of the j-th feature $x^{(j)}$ is $\{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}$, the maximum likelihood estimate of the conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ is:

$$P(X^{(j)}=a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^{N} I\big(x_i^{(j)} = a_{jl},\, y_i = c_k\big)}{\sum_{i=1}^{N} I(y_i = c_k)}, \quad j=1,\ldots,n;\;\; l=1,\ldots,S_j;\;\; k=1,\ldots,K$$

In the formula, $x_i^{(j)}$ is the j-th feature of the i-th sample, $a_{jl}$ is the l-th possible value of the j-th feature, and $I(\cdot)$ is the indicator function.

It can be seen from the formula that computing the conditional probability $P(X^{(j)}=a_{jl} \mid Y=c_k)$ amounts to finding, among the samples with label $c_k$, the proportion whose j-th feature equals $a_{jl}$.

3.1 The Naive Bayes Algorithm

Input: training data set $T=\{(x_1,y_1),(x_2,y_2),\ldots,(x_N,y_N)\}$ and an instance $x$

Output: Classification of instance x

  • Calculate prior probability and conditional probability

Prior probability:

$$P(Y=c_k) = \frac{\sum_{i=1}^{N} I(y_i=c_k)}{N}, \quad k = 1, 2, \ldots, K$$

Conditional probability:

$$P(X^{(j)}=a_{jl} \mid Y=c_k) = \frac{\sum_{i=1}^{N} I\big(x_i^{(j)}=a_{jl},\, y_i=c_k\big)}{\sum_{i=1}^{N} I(y_i=c_k)}$$

  • For the given instance $x=(x^{(1)}, x^{(2)}, \ldots, x^{(n)})$, compute, for each class $c_k$:

$$P(Y=c_k)\,\prod_{j=1}^{n} P\big(X^{(j)}=x^{(j)} \mid Y=c_k\big)$$

  • Determine the class of $x$:

$$y = \arg\max_{c_k} P(Y=c_k)\,\prod_{j=1}^{n} P\big(X^{(j)}=x^{(j)} \mid Y=c_k\big)$$
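Putting the estimation step and the classification step together, here is a compact from-scratch Python sketch of this algorithm for categorical features. It is only an illustration of the counting formulas above (class and variable names are my own), not a production implementation; in particular it uses plain maximum likelihood estimates and omits the Laplace smoothing one would normally add.

```python
from collections import defaultdict

class CategoricalNaiveBayes:
    """Naive Bayes for categorical features, fitted by maximum likelihood counts."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        class_count = {c: y.count(c) for c in self.classes}
        # Prior: P(Y = c) = count(y_i = c) / N
        self.prior = {c: class_count[c] / len(y) for c in self.classes}
        # Conditional: P(X_j = a | Y = c) = count(x_i^(j) = a, y_i = c) / count(y_i = c)
        self.cond = defaultdict(float)
        for xi, yi in zip(X, y):
            for j, value in enumerate(xi):
                self.cond[(j, value, yi)] += 1.0 / class_count[yi]
        return self

    def predict(self, x):
        # MAP decision: argmax_c  P(Y = c) * prod_j P(X_j = x_j | Y = c)
        scores = {}
        for c in self.classes:
            score = self.prior[c]
            for j, value in enumerate(x):
                score *= self.cond[(j, value, c)]   # 0.0 if this pair was never seen
            scores[c] = score
        return max(scores, key=scores.get)


# The ladybug data set from Section II.
X = [("below zero", "10 days"), ("below zero", "20 days"), ("above zero", "10 days"),
     ("below zero", "one month"), ("below zero", "20 days"), ("above zero", "two months"),
     ("below zero", "one month"), ("below zero", "two months"), ("above zero", "one month"),
     ("above zero", "10 days"), ("below zero", "20 days")]
y = ["yes", "yes", "no", "yes", "no", "no", "no", "yes", "no", "no", "no"]

model = CategoricalNaiveBayes().fit(X, y)
print(model.predict(("below zero", "20 days")))  # -> yes
```

Running it on the ladybug data reproduces the prediction from Section II.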

IV. Exploring Bayes: Sample Imbalance

The Complement Naive Bayes (CNB) algorithm is an improvement on the standard multinomial naive Bayes algorithm. CNB was originally created to address the various problems caused by the "naive" assumption in Bayesian classifiers: its inventors hoped to find a mathematical way around the naive assumption so that the algorithm would not have to care whether all features are conditionally independent. From this starting point they built Complement Naive Bayes, which both handles sample imbalance and, to some extent, sidesteps the naive assumption. In experiments, the parameter estimates of CNB have proved more stable than those of ordinary multinomial naive Bayes, and it is particularly suitable for data sets with imbalanced classes. On text-classification tasks CNB can sometimes outperform multinomial naive Bayes, which is why the complement variant is becoming popular. As for exactly how CNB avoids the naive assumption and improves on the class-imbalance problem, there is deep mathematics and involved proofs behind it; if you are interested, see this paper:

Rennie, J. D., Shih, L., Teevan, J., & Karger, D. R. (2003). Tackling the poor assumptions of naive Bayes text classifiers. In ICML (Vol. 3, pp. 616-623).

In simple terms, CNB computes the weight of each feature for a class from the complement of that class, i.e., from all samples that do not belong to that class. Following Rennie et al. (2003), the weights are:

$$\hat{\theta}_{ci} = \frac{\alpha_i + \sum_{j:\, y_j \neq c} x_{ij}}{\alpha + \sum_{j:\, y_j \neq c} \sum_{k} x_{kj}}, \qquad w_{ci} = \log \hat{\theta}_{ci}, \qquad w_{ci} \leftarrow \frac{w_{ci}}{\sum_{i} |w_{ci}|}$$

Here $j$ indexes the samples and $x_{ij}$ is the value of feature $i$ in sample $j$, usually a count or a TF-IDF value in text classification; $\alpha$ is the smoothing coefficient, just as in standard multinomial naive Bayes (with $\alpha_i$ the per-feature smoothing term and $\alpha = \sum_i \alpha_i$). This seemingly complicated formula is actually simple: the numerator is the sum of the values of feature $i$ over all samples whose label is not $c$, and the denominator is the same sum taken over all features $k$ for those samples. In effect it is the multinomial estimate turned inside out: each class's parameters are estimated from its complement.
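scikit-learn ships this algorithm as ComplementNB, so in practice you rarely implement the weight formula yourself. Below is a minimal usage sketch on made-up word-count data with a deliberate class imbalance; the data, its dimensions, and the Poisson rates are invented purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import ComplementNB, MultinomialNB

# Made-up word-count data with a strong class imbalance: class 1 is rare.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.poisson(lam=[5, 1, 1, 2], size=(90, 4)),  # 90 samples of class 0
    rng.poisson(lam=[1, 4, 3, 1], size=(10, 4)),  # 10 samples of class 1
])
y = np.array([0] * 90 + [1] * 10)

cnb = ComplementNB(alpha=1.0).fit(X, y)   # alpha is the smoothing coefficient
mnb = MultinomialNB(alpha=1.0).fit(X, y)

x_new = np.array([[1, 5, 4, 1]])          # a count vector that looks like the rare class
print("ComplementNB :", cnb.predict(x_new))
print("MultinomialNB:", mnb.predict(x_new))
```

With heavily imbalanced classes like this, CNB estimates each class's weights from the much larger complement of that class, which is exactly the situation it was designed for.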

References:

"Statistical Learning Methods" 2nd Edition 

Summary of the Principles of the Naive Bayes Algorithm

