Machine Learning Algorithm (7): Naive Bayes Algorithm

1 Introduction to Naive Bayes

Naive Bayes is a classification algorithm, often used for text classification. Its output is the probability that a given sample belongs to a particular category.

2 Bayesian formula

Probability Basics Review

  • Joint probability: the probability that several events all occur at the same time
    • Written as: P(A, B)
  • Conditional probability: the probability that event A occurs given that another event B has already occurred
    • Written as: P(A|B)
  • Mutual independence: if P(A, B) = P(A)P(B), then events A and B are said to be independent of each other
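
For intuition, here is a tiny sketch (our own die-roll example, not part of the case below). For a fair six-sided die, let event A be "the roll is even" and event B be "the roll is greater than 3":

outcomes = {1, 2, 3, 4, 5, 6}  # a fair six-sided die
A = {2, 4, 6}                  # event A: the roll is even
B = {4, 5, 6}                  # event B: the roll is greater than 3

p = lambda e: len(e) / len(outcomes)
print(p(A & B))         # joint probability P(A, B) = 1/3
print(p(A & B) / p(B))  # conditional probability P(A|B) = 2/3
print(p(A) * p(B))      # 1/4 != 1/3, so A and B are NOT independent here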

Bayes' formula:

P(A|B) = P(B|A) * P(A) / P(B)
Let's understand this formula through a case: "judging how much the goddess likes you".

[Sample table: 7 candidates, each recorded with occupation (programmer, product manager, ...), body type (fit or overweight), and whether the goddess likes him]

The questions are as follows:

What is the probability that the goddess likes a candidate?
What is the probability that a candidate's occupation is programmer and his body type is fit?
Given that the goddess likes a candidate, what is the probability that his occupation is programmer?
Given that the goddess likes a candidate, what is the probability that his occupation is programmer and he is overweight?

The calculated results are:

P(Like) = 4/7
P(Programmer, Fit) = 1/7 (a joint probability)
P(Programmer|Like) = 2/4 = 1/2 (a conditional probability)
P(Programmer, Overweight|Like) = 1/4
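
These counts can be checked mechanically. A minimal sketch; the seven rows below are a hypothetical reconstruction, chosen only to be consistent with the four results above:

samples = [
    # (occupation, body type, liked) -- hypothetical rows matching the quoted counts
    ("programmer", "overweight", False),
    ("product manager", "fit", True),
    ("programmer", "fit", True),
    ("programmer", "overweight", True),
    ("designer", "fit", False),
    ("designer", "overweight", False),
    ("product manager", "fit", True),
]

liked = [s for s in samples if s[2]]
print(len(liked) / len(samples))  # P(Like) = 4/7
print(sum(1 for o, b, _ in samples if o == "programmer" and b == "fit") / len(samples))  # 1/7
print(sum(1 for o, _, _ in liked if o == "programmer") / len(liked))  # 1/2
print(sum(1 for o, b, _ in liked if o == "programmer" and b == "overweight") / len(liked))  # 1/4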

Thought question:

  • How do we calculate the probability that Xiao Ming is liked by the goddess, given that Xiao Ming is a product manager and is overweight?
    That is, P(Like|Product manager, Overweight) = ?

At this point we need Naive Bayes.

Applying Bayes' formula, the thought question becomes:

P(Like|Product manager, Overweight) = P(Product manager, Overweight|Like) * P(Like) / P(Product manager, Overweight)

Evaluating the formula above, we find:

  • P(Product manager, Overweight|Like) and P(Product manager, Overweight) are both 0, so the result cannot be computed. This is because our sample is too small to be representative.
  • In real life there are certainly people who are both product managers and overweight, so P(Product manager, Overweight) cannot truly be 0;
  • Moreover, the events "occupation is product manager" and "body weight is overweight" are usually considered independent, yet with our limited 7 samples, P(Product manager, Overweight) = P(Product manager)P(Overweight) does not hold.

Naive Bayes can help us solve this problem:

  • Naive Bayes, simply put, is Bayes' formula combined with the assumption that features are independent of each other.
  • In other words, Naive Bayes is "naive" precisely because it assumes the features are mutually independent.

Therefore, solving the thought question in the Naive Bayes way gives:

P(Product manager, Overweight) = P(Product manager) * P(Overweight) = 2/7 * 3/7 = 6/49
P(Product manager, Overweight|Like) = P(Product manager|Like) * P(Overweight|Like) = 1/2 * 1/4 = 1/8
P(Like|Product manager, Overweight) = P(Product manager, Overweight|Like) * P(Like) / P(Product manager, Overweight) = (1/8 * 4/7) / (6/49) = 7/12
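
As a sanity check, here is a minimal sketch verifying this arithmetic with Python's fractions module (the counts are the ones quoted above):

from fractions import Fraction

p_like = Fraction(4, 7)                   # P(Like)
p_pm = Fraction(2, 7)                     # P(Product manager)
p_overweight = Fraction(3, 7)             # P(Overweight)
p_pm_given_like = Fraction(1, 2)          # P(Product manager|Like)
p_overweight_given_like = Fraction(1, 4)  # P(Overweight|Like)

# Naive assumption: treat the two features as independent
p_evidence = p_pm * p_overweight                          # 6/49
p_likelihood = p_pm_given_like * p_overweight_given_like  # 1/8

print(p_likelihood * p_like / p_evidence)  # 7/12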

3 Laplace smoothing coefficient

If Bayes' formula is applied to the article classification scenario, we can view it as follows:
P(C|F1, F2, ...) = P(F1, F2, ...|C) * P(C) / P(F1, F2, ...)

where C is a document category and F1, F2, ... are the feature words appearing in the document. Under the naive independence assumption, P(F1, F2, ...|C) = P(F1|C) * P(F2|C) * ..., with each P(Fi|C) estimated from word frequencies in the training set.
Let's work through a case.

Requirement: using the first four training samples (articles), determine whether the fifth article belongs to the China category.

Doc   Words in the document                      China class?
1     Chinese Beijing Chinese                    Yes
2     Chinese Chinese Shanghai                   Yes
3     Chinese Macao                              Yes
4     Tokyo Japan Chinese                        No
5     Chinese Chinese Chinese Tokyo Japan        ?

P(C|Chinese, Chinese, Chinese, Tokyo, Japan)
= P(Chinese, Chinese, Chinese, Tokyo, Japan|C) * P(C) / P(Chinese, Chinese, Chinese, Tokyo, Japan) 
= P(Chinese|C)^3 * P(Tokyo|C) * P(Japan|C) * P(C) / [P(Chinese)^3 * P(Tokyo) * P(Japan)]

# We need to decide whether this article belongs to the China class;
# either way, the denominator is the same:

# First, compute the probabilities for the China class:
P(Chinese|C) = 5/8
P(Tokyo|C) = 0/8
P(Japan|C) = 0/8

# Next, compute the probabilities for the not-China class:
P(Chinese|not C) = 1/3
P(Tokyo|not C) = 1/3
P(Japan|not C) = 1/3

Problem: in the example above, P(Tokyo|C) and P(Japan|C) are both 0, which is unreasonable. If many entries in the word-frequency table are 0, the final product is very likely to be 0 as well.

Solution: the Laplace smoothing coefficient.

The Laplace-smoothed estimate is:

P(F1|C) = (Ni + α) / (N + αm)

where Ni is the number of times feature word F1 appears in documents of category C, N is the total number of feature words in documents of category C, α is the smoothing coefficient (usually 1), and m is the number of distinct feature words in the training set.

# Again we decide whether the article belongs to the China class:
# In this example, m = 6 (the number of distinct feature words in the training set)

First, compute the probabilities for the China class:
    P(Chinese|C) = 5/8 --> 6/14
    P(Tokyo|C) = 0/8 --> 1/14
    P(Japan|C) = 0/8 --> 1/14

Next, compute the probabilities for the not-China class:
    P(Chinese|not C) = 1/3 --> 2/9
    P(Tokyo|not C) = 1/3 --> 2/9
    P(Japan|not C) = 1/3 --> 2/9
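
To finish the classification, compare the two unnormalized posteriors. A minimal sketch using exact fractions; the priors P(C) = 3/4 and P(not C) = 1/4 are read off the four training documents (3 China, 1 not):

from fractions import Fraction

# Priors from the training set: 3 of the 4 documents are China class
p_c, p_not_c = Fraction(3, 4), Fraction(1, 4)

# Smoothed likelihoods computed above
chinese_c, tokyo_c, japan_c = Fraction(6, 14), Fraction(1, 14), Fraction(1, 14)
chinese_n, tokyo_n, japan_n = Fraction(2, 9), Fraction(2, 9), Fraction(2, 9)

# Test document: Chinese Chinese Chinese Tokyo Japan
score_c = chinese_c**3 * tokyo_c * japan_c * p_c
score_n = chinese_n**3 * tokyo_n * japan_n * p_not_c
print(score_c > score_n)  # True -> the article is classified as China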

4 Naive Bayes API usage

sklearn.naive_bayes.MultinomialNB(alpha=1.0)

  • Naive Bayes Classification
  • alpha: Laplace smoothing coefficient
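
A minimal usage sketch on the toy corpus from Section 3 (CountVectorizer is used here for the bag-of-words features; it is a common companion choice, not mandated by the API):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus from the China-class example above
docs = [
    "Chinese Beijing Chinese",
    "Chinese Chinese Shanghai",
    "Chinese Macao",
    "Tokyo Japan Chinese",
]
labels = [1, 1, 1, 0]  # 1 = China class, 0 = not China

vec = CountVectorizer()
X = vec.fit_transform(docs)  # word-count features

model = MultinomialNB(alpha=1.0)  # alpha is the Laplace smoothing coefficient
model.fit(X, labels)

test = vec.transform(["Chinese Chinese Chinese Tokyo Japan"])
print(model.predict(test))        # [1] -> classified as China
print(model.predict_proba(test))  # posterior probability of each class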

Naive Bayes application case: product review sentiment analysis

5 Summary of Naive Bayes Algorithm

5.1 Advantages and Disadvantages of Naive Bayes

(1) Advantages

  • The Naive Bayes model originates from classical mathematical theory and has stable classification performance.
  • It is not very sensitive to missing data, and the algorithm is relatively simple; it is often used for text classification.
  • It classifies accurately and quickly.

(2) Disadvantages

  • Because it assumes that sample attributes are independent, it performs poorly when the feature attributes are correlated.
  • Prior probabilities must be estimated, and they often depend on modeling assumptions; many different prior models are possible, so a poorly chosen prior model can sometimes degrade prediction performance.
  • Prior probability: intuitively, "prior" means before the event occurs; it is a probability obtained from past experience and analysis, reasoning "from cause to effect".
  • Posterior probability: the event has already occurred, and there may be several possible causes; determining the probability that a particular cause produced it is reasoning "from effect to cause".
  • Prior probability is what is commonly called probability, and posterior probability is a kind of conditional probability, although a conditional probability is not necessarily a posterior probability. Bayes' formula converts a prior probability into a posterior probability.

5.2 Difficulties with Naive Bayes

(1) Naive Bayes principle

The Naive Bayes method is a classification method based on Bayes' theorem together with the conditional independence assumption on features.

  • For a given item x to be classified, the learned model computes the posterior probability distribution over categories,
  • that is, the probability of each target category given that x is observed; the category with the largest posterior probability is taken as the category of x (see the decision rule below).
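
Written out, the decision rule (in standard notation, consistent with the factorization in (3) below) is:

y = argmax over c_k of [ P(Y = c_k) * P(X^(1) = x^(1)|Y = c_k) * ... * P(X^(n) = x^(n)|Y = c_k) ]

The denominator P(X = x) is omitted because it is the same for every candidate category.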

(2) Where does the "naivety" of Naive Bayes lie?

When computing the conditional probability distribution P(X=x|Y=c_k), Naive Bayes introduces a strong conditional independence assumption: given Y, the values of the feature components of X are mutually independent.

(3) Why is the conditional independence assumption introduced?

To avoid the combinatorial explosion and sample sparsity problems that arise when estimating the full joint conditional probability directly.
Under the assumption, the conditional probability factorizes as:

P(X = x|Y = c_k) = P(X^(1) = x^(1)|Y = c_k) * ... * P(X^(n) = x^(n)|Y = c_k)
(4) What should we do if the probability is 0 when estimating the conditional probability P(X|Y)?

The solution to this problem is Bayesian estimation.
Simply put, a constant λ is introduced into the frequency estimates (a minimal sketch follows the list below):

  • when λ = 0, this reduces to ordinary maximum likelihood estimation;
  • when λ = 1, it is called Laplace smoothing.
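
Here is a minimal from-scratch sketch of this smoothed estimation for word-count text features like the China example above; the function names and the choice to leave the class prior unsmoothed are our own assumptions:

from collections import Counter

def train_nb(docs, labels, lam=1.0):
    # Estimate class priors and smoothed word likelihoods.
    # lam=0 gives maximum likelihood estimates; lam=1 gives Laplace smoothing.
    vocab = {w for d in docs for w in d.split()}
    priors, likelihoods = {}, {}
    for c in set(labels):
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        words = [w for d in class_docs for w in d.split()]
        counts = Counter(words)
        priors[c] = len(class_docs) / len(docs)  # unsmoothed class prior
        likelihoods[c] = {w: (counts[w] + lam) / (len(words) + lam * len(vocab))
                          for w in vocab}
    return priors, likelihoods

def predict(doc, priors, likelihoods):
    # Choose the class with the largest unnormalized posterior.
    scores = {}
    for c in priors:
        score = priors[c]
        for w in doc.split():
            if w in likelihoods[c]:  # words never seen in training are skipped
                score *= likelihoods[c][w]
        scores[c] = score
    return max(scores, key=scores.get)

On the four training documents from Section 3, predict("Chinese Chinese Chinese Tokyo Japan", *train_nb(docs, labels)) returns the China class when lam=1, matching the hand calculation above; with lam=0, P(Tokyo|C) = 0 zeroes out the China score, reproducing the problem described in (4).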

(5) The attribute independence assumption rarely holds in practice, so why does Naive Bayes still work well?

  • Before using a classifier, the first (and most important) step is usually feature selection, whose purpose is to remove strongly correlated features and keep relatively independent ones;
  • For classification, it is enough for the conditional probabilities of the categories to be ranked correctly; the correct class can be obtained without exact probability values;
  • If the dependencies between attributes affect all categories equally, or their effects cancel each other out, the conditional independence assumption reduces computational complexity without hurting performance.

5.3 Differences from logistic regression

(1) Difference 1:

Naive Bayes is a generative model:

  • it performs Bayesian estimation on the available samples to learn the prior probability P(Y) and the conditional probability P(X|Y),
  • then forms the joint distribution P(X, Y),
  • and finally applies Bayes' theorem to obtain P(Y|X);

while LR (logistic regression) is a discriminative model:

  • it models the conditional probability P(Y|X) directly, fitting it by maximizing the log-likelihood;

(2) Difference 2:

  • Naive Bayes relies on a strong conditional independence assumption (given the class Y, the values of the feature variables are mutually independent),
  • while LR does not require this assumption.

(3) Difference 3:

  • Naive Bayes works well on small data sets,
  • while LR is suited to large-scale data sets.

Discriminative and Generative Models

In short: a generative model learns the joint distribution P(X, Y) and derives P(Y|X) from it, while a discriminative model learns the conditional distribution P(Y|X) directly.
