Naive Bayesian Classification Algorithm and Example Demonstration

In this article, let's learn a new family of machine learning methods: Bayesian algorithms.
We start from the most basic one, the naive Bayes classification algorithm, and work through the principles behind it.

Consider the following classification problem: each sample has only two features, and the label is either 0 or 1.
We want to predict the label of a sample whose two feature values are a and b.

The core idea of the naive Bayes classification algorithm is simple: using probability theory, directly compute the probability that the label is 0 and the probability that it is 1 given that the two feature values are a and b, then pick the label with the larger probability as the classification result.

Bayes' formula

Since the algorithm is built on probability theory, we first need the bits of probability it relies on.
Embarrassingly, although I did well in my undergraduate probability course, I have forgotten most of it, so let's start from the basic conditional probability formula:
$$P(B \mid A) = \frac{P(AB)}{P(A)}$$
Here, $P(B \mid A)$ is the probability that B occurs given that A has occurred (the posterior probability), $P(AB)$ is the probability that A and B occur at the same time, and $P(A)$ is the probability that A occurs (the prior probability).

Let's revisit the conditional probability formula with a simple example. Suppose there are 5 identical boxes and exactly one of them contains money. Let A be the event "the first box contains no money" and B the event "the second box contains money". By simple counting, we know that
$$P(A) = 4/5, \quad P(AB) = 1/5, \quad P(B \mid A) = 1/4$$
Clearly these values satisfy the conditional probability formula above: $(1/5)/(4/5) = 1/4$.
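As a tiny sanity check, the counting argument can be reproduced by enumerating the five equally likely positions of the money (a throwaway illustration, not part of the algorithm itself):

```python
from fractions import Fraction

# The money is equally likely to be in any of boxes 1..5.
positions = range(1, 6)

# Event A: no money in box 1; event B: money in box 2.
A  = [p for p in positions if p != 1]
AB = [p for p in positions if p != 1 and p == 2]

P_A  = Fraction(len(A), len(positions))   # 4/5
P_AB = Fraction(len(AB), len(positions))  # 1/5
print(P_A, P_AB, P_AB / P_A)              # 4/5 1/5 1/4  ->  P(B|A) = P(AB)/P(A)
```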

Swapping the roles of A and B in the conditional probability formula gives
$$P(A \mid B) = \frac{P(AB)}{P(B)}$$
Combining the two formulas, we get
$$P(B \mid A) \cdot P(A) = P(A \mid B) \cdot P(B) = P(AB)$$
Rearranging once more gives
$$P(B \mid A) = \frac{P(A \mid B) \cdot P(B)}{P(A)}$$
Ha, there it is: Bayes' formula.

Algorithm principle

At first glance this formula does not seem to have much to do with machine learning, so let's restate it in machine learning terms:
$$P(\text{label} \mid \text{features}) = \frac{P(\text{features} \mid \text{label}) \cdot P(\text{label})}{P(\text{features})}$$
Going back to the classification problem at the beginning of this article, with two features we can compute $P(0 \mid \text{feature}_1=a, \text{feature}_2=b)$ and $P(1 \mid \text{feature}_1=a, \text{feature}_2=b)$. If the former is larger, the sample is classified as 0; otherwise it is classified as 1.
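As a minimal sketch, the decision rule itself is just an argmax over the candidate labels; the function name and the dictionary used to pass in the posteriors below are illustrative assumptions, not part of any library:

```python
def classify(posteriors):
    """Given a dict mapping each label to P(label | features),
    return the label with the largest posterior probability."""
    return max(posteriors, key=posteriors.get)

# e.g. if P(0 | feature1=a, feature2=b) = 0.7 and P(1 | ...) = 0.3,
# the predicted label is 0.
print(classify({0: 0.7, 1: 0.3}))  # -> 0
```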

Next, let's take $P(0 \mid \text{feature}_1=a, \text{feature}_2=b)$ as an example and analyze how it is computed in practice.
$$P(0 \mid \text{feature}_1=a, \text{feature}_2=b) = \frac{P(\text{feature}_1=a, \text{feature}_2=b \mid 0) \cdot P(0)}{P(\text{feature}_1=a, \text{feature}_2=b)}$$
Start with the simplest term, $P(0)$: given a training set, $P(0)$ can be estimated directly by counting how many samples carry label 0 and dividing by the total number of samples.

Then consider $P(\text{feature}_1=a, \text{feature}_2=b \mid 0)$: in a given training set there may be very few samples, or even none, whose feature values are exactly a and b at the same time, so counting this combination directly in the training set is unreliable.
To estimate this probability stably, we make an assumption here: the features are independent of each other (this is the "naive" assumption).
With this assumption, the value factorizes as
$$P(\text{feature}_1=a, \text{feature}_2=b \mid 0) = P(\text{feature}_1=a \mid 0) \cdot P(\text{feature}_2=b \mid 0)$$
Both factors on the right-hand side are easily obtained from training set counts.
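As a minimal sketch, assuming the training set is stored as a list of (feature 1, feature 2, label) tuples, these counting estimates might look like the following (the function names and the toy data are my own, purely for illustration):

```python
from fractions import Fraction

def estimate_prior(samples, label):
    """P(label) = (# samples with this label) / (# samples)."""
    return Fraction(sum(1 for *_, y in samples if y == label), len(samples))

def estimate_conditional(samples, feature_index, value, label):
    """P(feature_index = value | label), counted among the samples
    that carry the given label."""
    subset = [s for s in samples if s[-1] == label]
    return Fraction(sum(1 for s in subset if s[feature_index] == value), len(subset))

# Toy training set: (feature1, feature2, label)
train = [("a", "b", 0), ("a", "c", 0), ("d", "b", 1), ("a", "b", 1), ("d", "c", 0)]

print(estimate_prior(train, 0))                # P(0)              = 3/5
print(estimate_conditional(train, 0, "a", 0))  # P(feature1=a | 0) = 2/3
print(estimate_conditional(train, 1, "b", 0))  # P(feature2=b | 0) = 1/3
```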

Next, let's briefly prove the factorization above. First expand the conditional probability formula:
$$P(AB \mid C) = \frac{P(ABC)}{P(C)}$$
Rewrite the right-hand side as a product:
$$P(AB \mid C) = \frac{P(AC)}{P(C)} \cdot \frac{P(ABC)}{P(AC)}$$
The two factors on the right are themselves conditional probabilities, so
$$P(AB \mid C) = P(A \mid C) \cdot P(B \mid AC)$$
Because A and B are assumed to be independent of each other, whether or not A occurs does not affect the probability of B occurring once C is given, that is,
$$P(B \mid AC) = P(B \mid C)$$
Substituting this back into the previous formula, we get
$$P(AB \mid C) = P(A \mid C) \cdot P(B \mid C)$$
Taking A as "feature 1 = a", B as "feature 2 = b", and C as "label = 0", the formula becomes
$$P(\text{feature}_1=a, \text{feature}_2=b \mid 0) = P(\text{feature}_1=a \mid 0) \cdot P(\text{feature}_2=b \mid 0)$$
which is exactly the factorization we set out to prove.

Finally, look at the denominator $P(\text{feature}_1=a, \text{feature}_2=b)$. Under the same independence assumption, it factorizes as
$$P(\text{feature}_1=a, \text{feature}_2=b) = P(\text{feature}_1=a) \cdot P(\text{feature}_2=b)$$

In fact, whether we compute $P(0 \mid \text{feature}_1=a, \text{feature}_2=b)$ or $P(1 \mid \text{feature}_1=a, \text{feature}_2=b)$, the denominator is the same $P(\text{feature}_1=a, \text{feature}_2=b)$.
If we only need to compare the two posteriors to decide which is larger, the denominator does not need to be computed at all.
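Putting the pieces together, here is a minimal from-scratch sketch of the whole prediction step that compares only the numerators and skips the shared denominator (again, the function names and the toy data are illustrative assumptions, not from the original article):

```python
from fractions import Fraction

def nb_score(samples, features, label):
    """Unnormalized naive Bayes score:
    P(label) * prod_i P(feature_i = value_i | label).
    The shared denominator P(features) is deliberately skipped."""
    labelled = [s for s in samples if s[-1] == label]
    score = Fraction(len(labelled), len(samples))  # prior P(label)
    for i, value in enumerate(features):
        score *= Fraction(sum(1 for s in labelled if s[i] == value), len(labelled))
    return score

def nb_predict(samples, features, labels=(0, 1)):
    """Return the label whose unnormalized score is largest."""
    return max(labels, key=lambda y: nb_score(samples, features, y))

# Toy training set: (feature1, feature2, label)
train = [("a", "b", 0), ("a", "c", 0), ("d", "b", 1), ("a", "b", 1), ("d", "c", 0)]
print(nb_predict(train, ("a", "b")))  # compares the two numerators and picks the larger
```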

Example demonstration

In this section, we use an example that circulates widely online to demonstrate the calculation process of the naive Bayes classification algorithm.

The following table lists, for boys with different feature values, whether a girl is willing to marry them.
The features are looks (1 = good-looking, 0 = plain), personality (1 = good, 0 = bad), height (2/1/0 = tall/medium/short) and motivation (1 = self-motivated, 0 = not); the label is marry (1 = yes, 0 = no).

Now suppose there is a boy whose feature values are looks = 0, personality = 0, height = 0, motivation = 0. Should the girl marry him or not?

Leaving the algorithm aside for a moment, let's first make a prediction from plain intuition.
From the values in the table it is easy to see that whenever the girl chooses to marry, the boy always has at least some good features; that is, she follows a fairly "normal" way of thinking.
So faced with a boy whose four features are all poor, she would naturally choose not to marry.
In other words, from a probabilistic point of view, the probability of not marrying should be much higher than the probability of marrying.

Next, let's see what conclusion the algorithm reaches and how it gets there.
The first step is to calculate the probability of marrying:

$$P(\text{marry}=1 \mid \text{looks}=0, \text{personality}=0, \text{height}=0, \text{motivation}=0) = \frac{P(\text{looks}=0 \mid \text{marry}=1) \cdot P(\text{personality}=0 \mid \text{marry}=1) \cdot P(\text{height}=0 \mid \text{marry}=1) \cdot P(\text{motivation}=0 \mid \text{marry}=1) \cdot P(\text{marry}=1)}{P(\text{looks}=0) \cdot P(\text{personality}=0) \cdot P(\text{height}=0) \cdot P(\text{motivation}=0)}$$

In this formula, $P(\text{marry}=1)$, $P(\text{looks}=0)$, $P(\text{personality}=0)$, $P(\text{height}=0)$ and $P(\text{motivation}=0)$ are all computed with the same counting logic; we take $P(\text{marry}=1)$ as an example to briefly describe the calculation.

There are 10 samples in the table in total, and 6 of them have marry = 1, so
$$P(\text{marry}=1) = 6/10 = 3/5$$
Following the same logic, we get
$$P(\text{looks}=0) = 2/5, \quad P(\text{personality}=0) = 3/10, \quad P(\text{height}=0) = 1/2, \quad P(\text{motivation}=0) = 3/10$$

$P(\text{looks}=0 \mid \text{marry}=1)$, $P(\text{personality}=0 \mid \text{marry}=1)$, $P(\text{height}=0 \mid \text{marry}=1)$ and $P(\text{motivation}=0 \mid \text{marry}=1)$ likewise share the same counting logic; we take $P(\text{looks}=0 \mid \text{marry}=1)$ as an example to describe the calculation.

First filter the original table down to the rows with marry = 1, which gives the following new table.


The new table contains 6 samples, 3 of which have looks = 0, so
$$P(\text{looks}=0 \mid \text{marry}=1) = 3/6 = 1/2$$
Following the same logic, we get
$$P(\text{personality}=0 \mid \text{marry}=1) = 1/6, \quad P(\text{height}=0 \mid \text{marry}=1) = 1/6, \quad P(\text{motivation}=0 \mid \text{marry}=1) = 1/6$$
With all of these values, the probability that the girl marries is
$$P(\text{marry}=1 \mid \text{looks}=0, \text{personality}=0, \text{height}=0, \text{motivation}=0) = \frac{\frac{1}{2} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{1}{6} \cdot \frac{3}{5}}{\frac{2}{5} \cdot \frac{3}{10} \cdot \frac{1}{2} \cdot \frac{3}{10}}$$

Similarly, the probability that the girl does not marry is
$$P(\text{marry}=0 \mid \text{looks}=0, \text{personality}=0, \text{height}=0, \text{motivation}=0) = \frac{\frac{1}{4} \cdot \frac{1}{2} \cdot 1 \cdot \frac{1}{2} \cdot \frac{2}{5}}{\frac{2}{5} \cdot \frac{3}{10} \cdot \frac{1}{2} \cdot \frac{3}{10}}$$
Since the probability of marrying is smaller than the probability of not marrying, the algorithm concludes that the girl will choose not to marry.
Taking the ratio of the two further, the probability of not marrying turns out to be 18 times the probability of marrying, which matches our intuitive expectation.
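As a quick numerical check of this example, the two numerators (the shared denominator cancels when taking the ratio) can be evaluated directly from the probabilities above:

```python
from fractions import Fraction as F

# Numerators of the two posteriors; the common denominator
# P(looks=0)*P(personality=0)*P(height=0)*P(motivation=0) cancels out.
score_marry     = F(1, 2) * F(1, 6) * F(1, 6) * F(1, 6) * F(3, 5)  # = 1/720
score_not_marry = F(1, 4) * F(1, 2) * 1 * F(1, 2) * F(2, 5)        # = 1/40

print(score_not_marry / score_marry)  # -> 18: "not marry" is 18 times more likely
```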

Code

The naive Bayes classification algorithm involves no optimization procedure, only counting, so personally I feel there is little need for a code implementation.

However, my completionist streak demands that every post have a standard "code implementation" section, hence this one.
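Still, for readers who want something runnable, below is a minimal from-scratch sketch that implements exactly the counting logic described in this post (the class and variable names are my own, and the toy rows at the bottom are a made-up stand-in, not the table from the example above):

```python
from fractions import Fraction
from collections import Counter

class NaiveBayes:
    """Naive Bayes for discrete features, using the plain counting
    estimates described in this post (no smoothing)."""

    def fit(self, X, y):
        n = len(y)
        self.labels = sorted(set(y))
        # Prior P(label), estimated from label counts.
        self.prior = {c: Fraction(y.count(c), n) for c in self.labels}
        # Per-label, per-column counts, so that
        # P(feature_i = v | label) = count / (# samples with that label).
        self.counts, self.n_per_label = {}, {}
        for c in self.labels:
            rows = [x for x, t in zip(X, y) if t == c]
            self.n_per_label[c] = len(rows)
            self.counts[c] = [Counter(col) for col in zip(*rows)]
        return self

    def score(self, x, c):
        """Unnormalized posterior: P(c) * prod_i P(x_i | c)."""
        s = self.prior[c]
        for i, v in enumerate(x):
            s *= Fraction(self.counts[c][i][v], self.n_per_label[c])
        return s

    def predict(self, x):
        return max(self.labels, key=lambda c: self.score(x, c))

# Toy stand-in data (looks, personality, height, motivation) -> marry.
X = [(1, 1, 2, 1), (0, 1, 1, 1), (1, 0, 0, 0), (0, 0, 0, 0)]
y = [1, 1, 0, 0]
model = NaiveBayes().fit(X, y)
print(model.predict((0, 0, 0, 0)))  # -> 0 (do not marry) for this toy data
```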


Originally published at blog.csdn.net/taozibaby/article/details/128702631