"Machine learning in practice" sorting out the ideas of junk mail

Naive Bayes Algorithm

The original book focuses on hands-on practice, which makes the underlying idea hard to grasp at first. Here is the general process and model; I hope it is helpful to everyone, and it also helps me organize my own thinking.

First, Bayes' formula: p(y|x) = p(x|y)p(y) / p(x)
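As a quick sanity check of the formula with made-up spam-filter numbers (none of these values come from the book):

```python
# made-up numbers: 40% of mail is spam; the word "offer" shows up
# in 30% of spam messages and in 5% of normal messages
p_spam = 0.4
p_word_given_spam = 0.30
p_word_given_ham = 0.05

# p(word) by total probability, then Bayes: p(spam|word) = p(word|spam)p(spam)/p(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(p_spam_given_word)   # ≈ 0.8
```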

Pros: remains effective even with a small amount of data.
Cons: has strict requirements on the format of the input data; it applies to nominal data and requires the feature values to be relatively independent of each other.
Core idea of the algorithm: suppose you are given an article. The probability that it belongs to category A is P(A), and the probability that it belongs to category B is P(B). If P(A) > P(B), the article is judged to belong to category A.
The next step is how to calculate those probabilities.
We now have an article containing 9 words, as follows: a, b, c, d, e, f, g, m, n. Its category is A.
Now apply Bayes' formula:
Find p(A|a,b,c,d,e,f,g,m,n), which means the probability of category A given the words a,b,c,d,e,f,g,m,n:
p(A|a,b,...,n) = (p(a,b,...,n|A) p(A)) / p(a,b,...,n) = (p(a|A) p(b|A) ... p(n|A) p(A)) / p(a,b,...,n)
(the second step uses the assumption that the words are independent of each other given the class).
Now look at p(a,b,c,...,n|A): it becomes the probability of these words appearing under category A.
How do we find the probabilities of these words?
We can do this:
Suppose you are given N articles. List all the distinct words that appear in these N documents and turn them into a bag-of-words model, as follows (a, b, and so on are each individual words):
[a, b, c, d, ...] This is the bag of words; it contains every word that has appeared. Assume there are m words in it.
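A minimal sketch of building that bag of words, assuming each article has already been split into a list of words (the toy documents and the function name are my own illustration):

```python
def create_vocab_list(documents):
    """Collect every distinct word that appears across all documents."""
    vocab = set()
    for doc in documents:
        vocab |= set(doc)       # union with this document's words
    return sorted(vocab)        # fixed order, so each word gets a stable index

# four toy "articles", already tokenized
docs = [['a', 'c', 'e', 'f'], ['a', 'c', 'f', 'g'], ['b', 'd', 'e'], ['b', 'd', 'h']]
print(create_vocab_list(docs))  # ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
```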
1) To find p(a,b,c,d,...|A), we first create a zero matrix with N rows and m columns; each row of this matrix has the length of the bag of words. We then compare each of the N documents against this bag of words.
To vectorize an article, compare its words with the bag of words: a word that appears is recorded as 1, and a word that does not appear is recorded as 0.
In the end we get a matrix like the one below (PS: this is just an example to show what it looks like).
a  b  c  d  e  f  g  h  i  | category
1  0  1  0  1  1  0  0  0  | B
1  0  1  0  0  1  1  0  0  | A
0  1  0  1  1  0  0  0  0  | B
0  1  0  1  0  0  0  1  0  | A
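A minimal sketch of that vectorization step (the toy documents are chosen so that the output reproduces the matrix above; the helper name is my own, in the spirit of the book's set-of-words function):

```python
vocab = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i']

def words_to_vec(vocab, document):
    """Mark 1 for every vocabulary word present in the document, 0 otherwise."""
    vec = [0] * len(vocab)
    for word in document:
        if word in vocab:
            vec[vocab.index(word)] = 1
    return vec

docs   = [['a', 'c', 'e', 'f'],    # category B
          ['a', 'c', 'f', 'g'],    # category A
          ['b', 'd', 'e'],         # category B
          ['b', 'd', 'h']]         # category A
labels = ['B', 'A', 'B', 'A']

matrix = [words_to_vec(vocab, doc) for doc in docs]
for row, label in zip(matrix, labels):
    print(row, label)              # first row: [1, 0, 1, 0, 1, 1, 0, 0, 0] B
```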
2) The next step is to find p(a|A) ... p(n|A).
Following the example above, we select all the rows whose category is A.
Then p(a|A) = 1/7, p(b|A) = 1/7, p(c|A) = 1/7, p(d|A) = 1/7, p(e|A) = 0, ... (class A contains 7 words in total).
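A quick numeric check of where those values come from, using the two class-A rows of the example matrix (no smoothing yet):

```python
# the two rows of the matrix above whose category is A
a_rows = [[1, 0, 1, 0, 0, 1, 1, 0, 0],
          [0, 1, 0, 1, 0, 0, 0, 1, 0]]

counts = [sum(col) for col in zip(*a_rows)]   # per-word counts within class A
total = sum(counts)                           # 7 words in class A overall
print([c / total for c in counts])
# 1/7 for a, b, c, d, f, g, h -- but 0 for e and i
```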
There is a problem here: according to the formula above, p(a,b,c,d,...|A) = p(a|A)p(b|A)..., so p(e|A) = 0 would make the whole probability 0.
So we no longer initialize the counts with 0 but with 1: a word that appears once is recorded as 2. Likewise the denominator is no longer just the sum of all word counts in class A; to prevent large errors when the number of words is extremely small, we add 2 to the denominator (3 would also work, as long as it does not distort the data too much).
3) For convenience of calculation, we save the per-word results under A in a vector [a|A, b|A, ..., i|A],
       and likewise the per-word results under B in a vector [a|B, b|B, ..., i|B].
       Category A articles account for 1/2 of all articles (this is the class prior).
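Putting steps 2) and 3) together, here is a minimal training sketch under the assumptions above: counts start at 1, the denominators start at 2, and the class prior is the fraction of articles labelled A. It expects the 0/1 matrix and label list from the vectorization sketch; the function and variable names are my own.

```python
def train_naive_bayes(matrix, labels, positive='A'):
    """Return ([p(w|A) ...], [p(w|B) ...], p(A)) with the +1 / +2 smoothing above."""
    m = len(matrix[0])
    num_a = [1] * m        # counts initialised to 1 instead of 0
    num_b = [1] * m
    denom_a = 2.0          # denominators start at 2 instead of 0
    denom_b = 2.0
    for row, label in zip(matrix, labels):
        if label == positive:
            num_a = [n + x for n, x in zip(num_a, row)]
            denom_a += sum(row)
        else:
            num_b = [n + x for n, x in zip(num_b, row)]
            denom_b += sum(row)
    p_word_a = [n / denom_a for n in num_a]      # the vector [a|A, b|A, ..., i|A]
    p_word_b = [n / denom_b for n in num_b]      # the vector [a|B, b|B, ..., i|B]
    p_a = labels.count(positive) / len(labels)   # class A prior (1/2 in this example)
    return p_word_a, p_word_b, p_a
```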
4) The last step is to take the article we want to test, compare it with the bag of words and convert it into a vector, something like [0, 1, 0, 1, 0, ...]. We call it d.
We compute d * [a|A, b|A, ..., i|A] element-wise and then add up the elements of the result (I call this the word coincidence degree),
   then multiply by the proportion of category A articles (1/2 in the example above)
  to get the probability that this article is A.
Similarly, we compute d * [a|B, b|B, ..., i|B], add up the elements,
   and multiply by the proportion of category B articles
 to get the probability that this article is B.
Whichever probability is higher, that is the category the article is assigned to.
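Finally, a minimal sketch of step 4), reusing the helpers from the sketches above (so run those first). One note on the "add up, then multiply by the prior" description: in the book's implementation the conditional probability vector is stored as logarithms, so summing the per-word terms is equivalent to multiplying the probabilities; the sketch below takes the logs at classification time instead, which gives the same result. Variable names are my own.

```python
from math import log

def classify(doc_vec, p_word_a, p_word_b, p_a):
    """Score both classes and return the label with the higher (log-)probability."""
    # summing logs of the per-word probabilities == multiplying the probabilities;
    # only the words actually present in the document (vector entry 1) contribute
    score_a = sum(x * log(p) for x, p in zip(doc_vec, p_word_a)) + log(p_a)
    score_b = sum(x * log(p) for x, p in zip(doc_vec, p_word_b)) + log(1 - p_a)
    return 'A' if score_a > score_b else 'B'

# usage: train on the toy matrix from above, then classify a hypothetical test article
p_word_a, p_word_b, p_a = train_naive_bayes(matrix, labels)
test_doc = words_to_vec(vocab, ['b', 'd', 'h'])
print(classify(test_doc, p_word_a, p_word_b, p_a))   # prints: A
```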
Super detailed code with comments can be found in another article of mine.








Origin blog.csdn.net/qq_37633207/article/details/79178472