Machine Learning (1) - Naive Bayes

0. Idea: For a given item x to be classified, the learned model computes the posterior probability distribution, i.e. the probability of each target category given that x is observed, and assigns x to the class with the largest posterior probability. The posterior probability is computed with Bayes' theorem.

Key: To avoid combinatorial explosion and data sparsity when applying Bayes' theorem, the conditional independence assumption is introduced: the features used for classification are assumed to be conditionally independent given the class.

 

1. What is "naive" about Naive Bayes?

Simply put: when Bayes' theorem is used to obtain the joint probability P(X, Y), the conditional probability P(X|Y) has to be estimated. For this step Naive Bayes makes a strong conditional independence assumption (given Y, the components of X are independent of each other), that is, P(X1=x1, X2=x2, ..., Xj=xj | Y=yk) = P(X1=x1|Y=yk) * P(X2=x2|Y=yk) * ... * P(Xj=xj|Y=yk).
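As a small illustration (the tables, words, and class names below are made up, not from the article), the independence assumption means the class-conditional likelihood is just a product of per-feature lookups:

```python
# Minimal sketch of the factorization under the naive assumption.
# P(X_j = value | Y = class), assumed to have been estimated from training counts
cond_prob = {
    "spam":     [{"free": 0.6, "hi": 0.4}, {"now": 0.7, "later": 0.3}],
    "not_spam": [{"free": 0.1, "hi": 0.9}, {"now": 0.2, "later": 0.8}],
}
prior = {"spam": 0.3, "not_spam": 0.7}          # P(Y = class)

def joint(x, y):
    """P(Y=y) * prod_j P(X_j = x_j | Y=y), i.e. the factorized joint probability."""
    p = prior[y]
    for j, value in enumerate(x):
        p *= cond_prob[y][j][value]
    return p

x = ["free", "now"]
scores = {y: joint(x, y) for y in prior}
print(max(scores, key=scores.get))   # class with the largest (unnormalized) posterior
```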

 

2. What is the difference between Naive Bayes and LR?

Simply put:

(1) Naive Bayes is a generative model: from the training samples it estimates the prior probability P(Y) and the conditional probability P(X|Y), which give the joint distribution P(X, Y), and then applies Bayes' theorem to obtain P(Y|X). LR is a discriminative model: it obtains the conditional probability P(Y|X) directly by maximizing the log-likelihood;

(2) Naive Bayes relies on a strong conditional independence assumption (given the class Y, the feature variables take their values independently of each other), while LR does not require this;

(3) Naive Bayes is suitable for small datasets, while LR is suitable for large-scale datasets.
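A minimal scikit-learn sketch (toy data, assuming scikit-learn is installed; not from the article) of what each model actually stores after fitting, which is the practical face of the generative vs. discriminative difference:

```python
# Generative vs. discriminative: what gets learned.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])  # two toy classes
y = np.array([0] * 50 + [1] * 50)

nb = GaussianNB().fit(X, y)
lr = LogisticRegression().fit(X, y)

print(nb.class_prior_)           # P(Y): learned class priors
print(nb.theta_)                 # per-class feature means, part of P(X|Y)
print(lr.coef_, lr.intercept_)   # weights of the direct model for P(Y|X)
```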

 

The former is a generative model and the latter is a discriminative model; the difference between the two is essentially the difference between generative and discriminative models.

1) Naive Bayes obtains the prior probability P(Y) and the conditional probability P(X|Y) from the training samples. For a given instance it computes the joint probability and from it the posterior probability. In other words, it models how the data was generated and classifies accordingly: the instance is assigned to the category most likely to have produced it.

Advantages: converges faster as the sample size grows; still applicable when hidden variables exist.

Disadvantages: longer training time; needs many samples; wastes computing resources.

2) In contrast, logistic regression does not care about the class proportions in the sample or the probability of features appearing within a class; it directly specifies the form of the prediction model. Each feature is assumed to have a weight, and the training data is used to update the weight vector w, typically by gradient methods, to obtain the final expression (a minimal sketch follows after this list).

Advantages: direct prediction is often more accurate; it simplifies the problem; it reflects the data distribution and the differing characteristics of the categories; it is suitable when there are many categories to distinguish.

Disadvantages: slower convergence; not suitable when hidden variables are present.
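The gradient update mentioned above, as a hedged sketch for the binary case (the function and parameter names such as fit_logreg, lr, and n_iter are illustrative, not from the article):

```python
# Binary logistic regression trained by gradient ascent on the log-likelihood.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.1, n_iter=1000):
    """X: (n, d) features, y: (n,) labels in {0, 1}. Returns weights w (bias appended)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append a bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xb @ w)                # current estimate of P(Y=1 | x)
        grad = Xb.T @ (y - p)              # gradient of the log-likelihood
        w += lr * grad / len(y)            # ascend the likelihood
    return w
```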

 

3. What if the probability is 0 when estimating the conditional probability P(X|Y)?

Simply put: a smoothing term λ is added to the counts (Bayesian estimation); when λ = 1 this is called Laplace smoothing.
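In the standard formulation (notation assumed here, not taken from the article: N training samples, S_j possible values of the j-th feature, K classes), the smoothed estimates are:

```latex
P_\lambda(X^{(j)} = a_{jl} \mid Y = c_k)
  = \frac{\sum_{i=1}^{N} I\!\left(x_i^{(j)} = a_{jl},\, y_i = c_k\right) + \lambda}
         {\sum_{i=1}^{N} I\!\left(y_i = c_k\right) + S_j \lambda},
\qquad
P_\lambda(Y = c_k) = \frac{\sum_{i=1}^{N} I\!\left(y_i = c_k\right) + \lambda}{N + K\lambda}.
```

With λ = 0 this reduces to the maximum likelihood estimate; with λ = 1 it is Laplace smoothing, and every probability stays strictly positive.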

 

4. Advantages and disadvantages of Naive Bayes

Advantages: performs well on small-scale data; suitable for multi-class tasks; suitable for incremental training.

Disadvantages: it is very sensitive to how the input data is represented (discrete, continuous, extremely small values, and so on).

 

 

5. Why is the attribute independence assumption difficult to hold in practice, but Naive Bayes can still achieve better results?

1) For classification, as long as the posterior probabilities of the categories are ranked in the correct order, the correct classification results even without precise probability values;

2) If the dependencies between attributes affect all categories equally, or their effects cancel each other out, the conditional independence assumption reduces computational overhead without hurting performance.

 

6. Why maximize the posterior probability?

Maximizing the posterior is equivalent to minimizing the expected risk. Suppose the 0-1 loss function is chosen, i.e. the loss is 0 for a correct classification and 1 for an error. Minimizing the expected risk under this loss then leads to the maximum-posterior rule.
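A sketch of the usual derivation, conditioning on X = x:

```latex
L(y, f(x)) = \begin{cases} 0, & y = f(x) \\ 1, & y \neq f(x) \end{cases}
\qquad
R(f) = E\!\left[L(Y, f(X))\right] = E_X \sum_{k=1}^{K} L(c_k, f(X))\, P(c_k \mid X)
```

```latex
f(x) = \arg\min_{y} \sum_{k=1}^{K} L(c_k, y)\, P(c_k \mid X = x)
     = \arg\min_{y} \bigl(1 - P(y \mid X = x)\bigr)
     = \arg\max_{c_k} P(c_k \mid X = x).
```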

 

7. Algorithm problem:

In practical projects, the probability values are often very small decimals, and multiplying many tiny decimals together can easily cause numerical underflow, making the product 0.

Solution: take the natural logarithm of the product, turning the repeated multiplication into a sum of logarithms.
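A small sketch of why the log trick matters (the numbers are illustrative):

```python
# Compare the naive product with the log-space sum for many small probabilities.
import math

probs = [1e-5] * 80                      # many tiny conditional probabilities

product = 1.0
for p in probs:
    product *= p                         # underflows to 0.0 in double precision

log_score = sum(math.log(p) for p in probs)   # stays finite and comparable across classes

print(product)     # 0.0 due to underflow
print(log_score)   # about -921; ranking classes by this sum is safe
```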

In addition, note that the raw inputs may have different lengths, so they need to be mapped to feature vectors of a common length (taking text classification as an example): for a sentence of words, the vector length is the size of the entire vocabulary, and each position holds the number of times the corresponding word appears.
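A minimal bag-of-words sketch of that mapping (the vocabulary and sentence are made up):

```python
# Map variable-length sentences to fixed-length count vectors over a vocabulary.
vocab = ["free", "money", "now", "meeting", "tomorrow"]    # toy vocabulary
index = {word: i for i, word in enumerate(vocab)}

def to_count_vector(sentence):
    vec = [0] * len(vocab)                 # one slot per vocabulary word
    for word in sentence.lower().split():
        if word in index:
            vec[index[word]] += 1          # position = word, value = occurrence count
    return vec

print(to_count_vector("Free money now now"))   # [1, 1, 2, 0, 0]
```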

 

8. How the prior and conditional probabilities are estimated:

a. Discrete features: count the frequency of each feature value within each category in the training sample. If the estimated probability of some feature value is 0, the entire probability product becomes 0 (the data-sparsity problem), which breaks the premise that each feature value contributes on an equal footing.

Solution 1: use Bayesian estimation, adding λ to the counts (Laplace smoothing when λ = 1), as in the formula given in Section 3 above.

Solution 2: use clustering to find related keywords for words that do not appear, and estimate the missing probability as the average of the probabilities of those related words.

b. Continuous features: assume the values follow a Gaussian (normal) distribution within each class, i.e. estimate the per-class sample mean and variance and plug them into the Gaussian density. A combined sketch for both cases follows this list.
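A hedged sketch of both estimators on toy data (the feature values and labels are made up; λ = 1 for the discrete case):

```python
# Estimate P(X|Y) for a discrete feature (with Laplace smoothing) and
# for a continuous feature (per-class Gaussian mean/variance).
from collections import Counter

# toy training data: (discrete_value, continuous_value, label)
data = [("sunny", 30.1, "yes"), ("rainy", 18.0, "no"),
        ("sunny", 27.5, "yes"), ("rainy", 16.2, "no"), ("sunny", 22.3, "no")]
values = ["sunny", "rainy"]          # possible discrete values (S_j = 2)
lam = 1.0                            # Laplace smoothing

for label in ["yes", "no"]:
    rows = [r for r in data if r[2] == label]

    # discrete feature: smoothed conditional probabilities
    counts = Counter(r[0] for r in rows)
    cond = {v: (counts[v] + lam) / (len(rows) + len(values) * lam) for v in values}

    # continuous feature: Gaussian parameters per class
    xs = [r[1] for r in rows]
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)

    print(label, cond, round(mean, 2), round(var, 2))
```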

 
