Principles of Naive Bayes text classification

Preface

This article briefly studies how the Naive Bayes algorithm classifies text.

Bayesian algorithm

The Bayesian method converts the probability of "belonging to a certain class given a certain feature" into the probability of "having a certain feature given a certain class"; it is a supervised learning method.

Posterior probability = prior probability × adjustment factor

This is what Bayesian inference means: we first estimate a "prior probability", then incorporate the experimental results to see whether the experiment strengthens or weakens the prior, and thereby obtain a "posterior probability" that is closer to the truth.

Formula

p(yi|x) = p(yi) * p(x|yi) / p(x)

p(a certain category yi | a certain feature x) = p(a certain category yi) * p(a certain feature x | a certain category yi) / p(a certain feature x)

According to the formula, calculating the probability of "belonging to a certain category given a certain feature" can be converted into calculating the probability of "having a certain feature given a certain category".

  • Prior probability: p(yi) is called the prior probability, that is, the probability that event yi occurs before event x is observed.

  • Posterior probability: p(yi|x) is called the posterior probability, that is, the probability that event yi occurs after event x has been observed; it is the value we compute from the observation.

  • Adjustment factor: p(x|yi)/p(x) is the adjustment factor, also called the likelihood function (Likelihood); it brings the estimated probability closer to the real probability.
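A small worked example with invented numbers: suppose the prior is p(spam) = 0.5, the likelihood is p("free"|spam) = 0.6, and the word "free" appears in 40% of all mail, so p("free") = 0.4. The adjustment factor is then 0.6 / 0.4 = 1.5, and observing "free" raises the estimate to the posterior p(spam|"free") = 0.5 × 1.5 = 0.75.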

Naive Bayes Algorithm

Naive Bayes theory rests on the independence of random variables: as far as text classification is concerned, Naive Bayes treats any two words in a sentence as mutually independent, i.e., each dimension of an object's feature vector is independent of the others. This is the theoretical basis of Naive Bayes. The process is as follows (a code sketch follows this list):

  • In the first stage, the training data is turned into a training sample set, e.g., TF-IDF vectors.

  • In the second stage, p(yi) is calculated for each class.

  • In the third stage, the conditional probability p(ai|yi) is calculated for every feature attribute under every category.

  • In the fourth stage, p(x|yi)p(yi) is calculated for each class.

  • In the fifth stage, the class with the largest p(x|yi)p(yi) is taken as the category of x.
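As a rough illustration of these five stages, here is a minimal sketch using scikit-learn; the choice of library and the toy documents and labels are my own assumptions, not from the article:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy training set, invented for illustration
docs = ["win free money now", "meeting schedule tomorrow",
        "free prize claim now", "project status meeting"]
labels = ["spam", "ham", "spam", "ham"]

# Stage 1: TF-IDF vectors. Stages 2-3 (p(yi) and p(ai|yi)) happen inside
# MultinomialNB.fit; stages 4-5 (argmax of p(x|yi)p(yi)) happen inside predict.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
print(model.predict(["free money prize"]))  # expected: ['spam']
```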

Problem

Suppose x is a text to be classified with features {a1, a2, ..., am}; the known category set is {y1, y2, ..., yn}; find the category to which x belongs.

The basic idea

If p(yi|x) = max{p(y1|x), p(y2|x), p(y3|x), ..., p(yn|x)}, then x belongs to category yi.
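For instance, if the posteriors for three categories were already known (invented numbers here), picking the category of x is just taking the maximum:

```python
# Hypothetical posteriors p(yi|x) for three categories, invented for illustration
posteriors = {"y1": 0.2, "y2": 0.7, "y3": 0.1}
category = max(posteriors, key=posteriors.get)
print(category)  # y2, since p(y2|x) is the largest
```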

How to calculate p(yi|x)

Use the Bayes formula:

p(yi|x) = p(x|yi)*p(yi) / p(x)

Since p(x) is the same for every category, the problem is converted to calculating p(x|yi)*p(yi) for each category and taking the category with the largest p(x|yi)*p(yi) as the category to which x belongs.

Calculate p(x|yi)

Since Naive Bayes assumes that the features are mutually independent given the class:

p(x|yi) = p(a1|yi)*p(a2|yi)*...*p(am|yi)

p(x|yi)/p(x) is the adjustment factor
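For example, if the features of x are {free, prize}, then under the independence assumption p(x|spam) = p(free|spam) * p(prize|spam), and each factor can be counted from the training set as described next.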

Calculate p(ai|yi)

p(ai|yi) can be obtained from the training set, which is already labeled by category, by counting the conditional probability of each feature under each category.

At this point p(yi|x), and hence the category to which x belongs, has been solved step by step: everything reduces to the conditional probabilities p(ai|yi) computed from the training set.

The training process uses the training set to estimate the factors of the adjustment term, p(x|yi) = p(a1|yi)*p(a2|yi)*...*p(am|yi), so the quality of the training set directly affects the accuracy of the prediction results.
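To make the training step concrete, here is a hand-rolled sketch that estimates p(yi) and p(ai|yi) from a labeled, tokenized training set. Two details are my additions, not from the article: Laplace smoothing (alpha), so a word unseen in some category does not zero out the whole product, and log probabilities, so the product of many small factors does not underflow. The toy data is invented:

```python
import math
from collections import Counter, defaultdict

def train(docs, labels, alpha=1.0):
    """docs: tokenized documents; labels: their categories.
    Returns log p(yi) and smoothed log p(ai|yi)."""
    class_counts = Counter(labels)                 # for the priors p(yi)
    word_counts = defaultdict(Counter)             # word counts per category
    for words, y in zip(docs, labels):
        word_counts[y].update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    log_priors = {y: math.log(n / len(labels)) for y, n in class_counts.items()}
    log_cond = {}
    for y, counts in word_counts.items():
        total = sum(counts.values())               # all word occurrences in class y
        log_cond[y] = {w: math.log((counts[w] + alpha) / (total + alpha * len(vocab)))
                       for w in vocab}
    return log_priors, log_cond, vocab

def classify(words, log_priors, log_cond, vocab):
    """argmax over log p(yi) + sum of log p(ai|yi) -- equivalent to p(x|yi)p(yi)."""
    scores = {y: log_priors[y] + sum(log_cond[y][w] for w in words if w in vocab)
              for y in log_priors}
    return max(scores, key=scores.get)

# Toy usage, invented data
docs = [["free", "win", "money"], ["meeting", "tomorrow"], ["win", "free", "prize"]]
labels = ["spam", "ham", "spam"]
log_priors, log_cond, vocab = train(docs, labels)
print(classify(["free", "prize"], log_priors, log_cond, vocab))  # spam
```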

TF-IDF

TF-IDF (term frequency–inverse document frequency) is a commonly used weighting method for information retrieval and data mining.

TF-IDF = TF * IDF

The main idea of TF-IDF is: if a word or phrase appears in an article with high frequency (high TF) and rarely appears in other articles (high IDF), it is considered to have good category-distinguishing ability and to be suitable for classification.

TF

It means term frequency (Term Frequency), that is, how often a word appears in a document:

TFi = Ndti / (Ndt1 + Ndt2 + ... + Ndtn)

That is, the number of times the word appears in the document divided by the total number of occurrences of all words in the document.

IDF

It means inverse document frequency (Inverse Document Frequency), an adjustment coefficient for the importance of a word that measures whether the word is a common one. If a word is relatively rare overall but appears many times in a given article, it is likely to reflect the characteristics of that article and is exactly the keyword we need.

The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient:

IDF = log(Nd/Ndt)

Nd is the total number of documents, and Ndt is the number of documents containing the word.

If a word is very common, Ndt is large, so the IDF value becomes small and the TF-IDF value shrinks with it, which greatly weakens the weight of common words.
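A rough sketch of the two formulas above (without the smoothing that real libraries apply; the toy documents are invented):

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: tokenized documents. Returns one {word: TF-IDF} dict per document."""
    Nd = len(docs)                                  # total number of documents
    Ndt = Counter(w for d in docs for w in set(d))  # documents containing each word
    results = []
    for d in docs:
        counts = Counter(d)
        total = sum(counts.values())                # occurrences of all words in d
        results.append({w: (c / total) * math.log(Nd / Ndt[w])
                        for w, c in counts.items()})
    return results

# "free" appears in every document, so its IDF is log(2/2) = 0 and its TF-IDF vanishes
print(tf_idf([["free", "win"], ["free", "meeting"]]))
```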

Summary

The Naive Bayes algorithm reduces the classification problem step by step until everything can be estimated from the training set; the derivation is worth studying and scrutinizing.
