Machine Learning Notes 07---Naive Bayesian Classifier

1. Bayesian Decision Theory

Bayesian decision theory is the basic method for making decisions under the probability framework. For classification tasks, in the ideal situation where all relevant probabilities are known, Bayesian decision theory considers how to choose the optimal class label based on these probabilities and the misclassification losses.

    Bayes' formula:

        P(c|x) = P(c) P(x|c) / P(x)

    Here P(c) is the "prior" probability of the class; P(x|c) is the class-conditional probability of the sample x with respect to the class label c, also called the "likelihood"; and P(x) is the "evidence" factor used for normalization. For a given sample x, the evidence factor P(x) is independent of the class label, so the problem of estimating the posterior P(c|x) reduces to estimating the prior P(c) and the likelihood P(x|c).
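
    To make the formula concrete, here is a minimal numeric sketch in Python; the prior, likelihood, and evidence values are made up purely for illustration.

```python
# A minimal numeric illustration of Bayes' formula with made-up values.
prior = 0.3          # P(c): hypothetical class prior
likelihood = 0.6     # P(x|c): hypothetical class-conditional probability of x
evidence = 0.45      # P(x): hypothetical normalizing evidence factor

posterior = prior * likelihood / evidence   # P(c|x) by Bayes' formula
print(posterior)     # 0.4
```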

    The class prior probability P(c) expresses the proportion of each class of samples in the sample space. By the law of large numbers, when the training set contains sufficiently many independent and identically distributed samples, P(c) can be estimated by the frequency with which each class appears.

    For the class conditional probability P(x|c), the situation is harder: it involves the joint probability over all attributes of x, so estimating it directly from sample frequencies runs into serious difficulties. In practice, many attribute-value combinations never appear in the training set at all, so estimating P(x|c) directly by frequency is clearly infeasible; instead, maximum likelihood estimation (under an assumed parametric form of the distribution) can be used.
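
    As a rough illustration of the maximum likelihood idea, the sketch below assumes a one-dimensional Gaussian form for p(x|c); under that assumption the maximum likelihood estimates are simply the sample mean and the (biased) sample variance. The data values are invented.

```python
import numpy as np

# Hypothetical attribute values observed for one class c.
x_c = np.array([1.2, 0.8, 1.0, 1.4, 0.9])

# Maximum likelihood estimates under a Gaussian assumption p(x|c) = N(mu, sigma^2):
mu_hat = x_c.mean()    # MLE of the mean
var_hat = x_c.var()    # MLE of the variance (biased, divides by n)

def gaussian_likelihood(x, mu, var):
    """Density of N(mu, var) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(mu_hat, var_hat, gaussian_likelihood(1.1, mu_hat, var_hat))
```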

2. Naive Bayesian Classifier

    It is not difficult to see that the main difficulty in estimating the posterior probability P(c|x) from the Bayesian formula lies in the class-conditional probability P(x|c): it is a joint probability over all attributes and is hard to estimate directly from limited training samples. (Estimating the joint probability directly from limited samples runs into combinatorial explosion in computation and sample sparsity in the data; the more attributes there are, the worse the problem becomes.) To sidestep this obstacle, the naive Bayes classifier adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be mutually independent. In other words, each attribute is assumed to influence the classification result independently.

    Based on the attribute conditional independence assumption, the Bayesian formula can be rewritten as

        P(c|x) = P(c) P(x|c) / P(x) = (P(c) / P(x)) · ∏ i=1..d P(xi|c)

    where d is the number of attributes and xi is the value of x on the i-th attribute.

    Since P(x) is the same for all classes, the Bayesian decision criterion becomes

        h_nb(x) = argmax_c  P(c) ∏ i=1..d P(xi|c)

    This is the expression of the naive Bayes classifier.
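
    A minimal sketch of this decision rule, assuming the priors and per-attribute conditional probabilities have already been estimated (the numbers below are placeholders). Log-probabilities are summed instead of multiplying the probabilities directly, only to avoid numerical underflow when many attributes are involved.

```python
import math

# Hypothetical estimates: priors P(c) and conditionals P(x_i | c) for one sample x.
priors = {"yes": 0.6, "no": 0.4}
conditionals = {                      # P(x_i | c) for each attribute value of x
    "yes": [0.5, 0.7, 0.3],
    "no":  [0.2, 0.4, 0.6],
}

def h_nb(priors, conditionals):
    """Naive Bayes decision rule: argmax_c P(c) * prod_i P(x_i | c)."""
    scores = {}
    for c in priors:
        score = math.log(priors[c])
        for p in conditionals[c]:
            score += math.log(p)       # summing logs == multiplying probabilities
        scores[c] = score
    return max(scores, key=scores.get)

print(h_nb(priors, conditionals))      # "yes" for these placeholder numbers
```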

    Obviously, the training process of the naive Bayes classifier amounts to estimating the class prior probability P(c) from the training set D and estimating the conditional probability P(xi|c) for each attribute.

    Let Dc denote the set of class-c samples in the training set D. If there are sufficient independent and identically distributed samples, the class prior probability is easily estimated as

        P(c) = |Dc| / |D|

    For discrete attributes, let Dc,xi denote the subset of Dc whose value on the i-th attribute is xi; the conditional probability P(xi|c) can then be estimated as

        P(xi|c) = |Dc,xi| / |Dc|

    For continuous attributes, a probability density function can be used instead. Assume p(xi|c) ~ N(μc,i, σc,i²), where μc,i and σc,i² are the mean and variance of the values taken on the i-th attribute by the class-c samples. Then

        p(xi|c) = 1 / (√(2π) σc,i) · exp( −(xi − μc,i)² / (2 σc,i²) )
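
    The following sketch ties the three estimates together on a tiny, invented training set with one discrete and one continuous attribute: P(c) as a class frequency, P(xi|c) as a within-class frequency for the discrete attribute, and a per-class Gaussian density for the continuous attribute.

```python
import numpy as np

# Toy training data (invented for illustration): (discrete attr, continuous attr, class)
D = [("A", 0.9, "pos"), ("A", 1.1, "pos"), ("B", 1.0, "pos"),
     ("B", 2.1, "neg"), ("A", 1.9, "neg")]

classes = {c for _, _, c in D}

# Class prior P(c) = |D_c| / |D|
prior = {c: sum(1 for *_, y in D if y == c) / len(D) for c in classes}

# Discrete attribute: P(x_1 | c) = |D_{c,x_1}| / |D_c|
def p_discrete(value, c):
    D_c = [row for row in D if row[2] == c]
    return sum(1 for row in D_c if row[0] == value) / len(D_c)

# Continuous attribute: p(x_2 | c) ~ N(mu_c, sigma_c^2)
def p_continuous(value, c):
    vals = np.array([row[1] for row in D if row[2] == c])
    mu, var = vals.mean(), vals.var()
    return np.exp(-(value - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(prior["pos"], p_discrete("A", "pos"), p_continuous(1.0, "pos"))
```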

     It should be noted that if some attribute value never appears together with some class in the training set, directly estimating the conditional probability P(xi|c) by frequency and then applying the Bayesian decision criterion is problematic: the product in the decision rule evaluates to zero, so no matter what values the sample takes on its other attributes, that class can never be chosen. This is clearly unreasonable.

    In order to prevent the information carried by the other attributes from being "erased" by attribute values that never appear in the training set, "smoothing" is usually applied when estimating the probabilities, and the Laplacian correction is the most common choice. Specifically, let N denote the number of possible classes in the training set D, and Ni the number of possible values of the i-th attribute; the class prior probability and the conditional probability are then corrected to

        P̂(c) = (|Dc| + 1) / (|D| + N)

        P̂(xi|c) = (|Dc,xi| + 1) / (|Dc| + Ni)

 Note: The Laplacian correction essentially assumes a uniform distribution over attribute values and classes; this is an additional prior about the data that is introduced into the naive Bayes learning process.

    Obviously, the Laplacian correction avoids the zero probability estimates caused by insufficient training samples. Moreover, as the training set grows, the influence of the prior introduced by the correction gradually becomes negligible, so the estimates tend toward the actual probability values.
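
    A minimal sketch of the Laplacian correction on invented data; the +1 in the numerators and the N (number of classes) or Ni (number of values of the i-th attribute) in the denominators correspond exactly to the corrected formulas above.

```python
# Laplacian-corrected estimates on a toy, invented discrete training set.
D = [("A", "pos"), ("A", "pos"), ("B", "pos"), ("B", "neg"), ("C", "neg")]

classes = sorted({y for _, y in D})
values = sorted({x for x, _ in D})
N, N_i = len(classes), len(values)      # number of classes / attribute values

def prior_hat(c):
    """P_hat(c) = (|D_c| + 1) / (|D| + N)."""
    return (sum(1 for _, y in D if y == c) + 1) / (len(D) + N)

def cond_hat(x, c):
    """P_hat(x_i | c) = (|D_{c,x_i}| + 1) / (|D_c| + N_i)."""
    D_c = [row for row in D if row[1] == c]
    return (sum(1 for v, _ in D_c if v == x) + 1) / (len(D_c) + N_i)

# "C" never occurs with class "pos", yet its smoothed probability is non-zero:
print(prior_hat("pos"), cond_hat("C", "pos"))   # 0.571..., 0.166...
```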

    Naive Bayes classifiers are used in various ways in real-world tasks. If the task demands high prediction speed, all the probability estimates involved can be computed from the given training set in advance and stored, so that prediction only requires a "table lookup". If the task data change frequently, a "lazy learning" approach can be adopted: no training is done up front, and the probability estimates are computed from the current data set when a prediction request arrives. If data keep arriving, incremental learning can be realized on top of the existing estimates by simply updating the counts involved in the attribute values of the newly added samples and correcting the probability estimates.
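
    As a hedged sketch of the incremental-learning idea, one can store the class and attribute-value counts (rather than the final probabilities), so that absorbing a new sample only increments a few counters; the table layout and names below are hypothetical.

```python
from collections import defaultdict

# Stored sufficient statistics (counts), hypothetical layout: probabilities are
# recomputed on demand, so adding a sample is just a few increments.
class_count = defaultdict(int)                       # |D_c|
value_count = defaultdict(lambda: defaultdict(int))  # |D_{c, x_i}| per attribute index
n_samples = 0

def add_sample(x, c):
    """Incrementally absorb one sample x = (x_1, ..., x_d) with class label c."""
    global n_samples
    n_samples += 1
    class_count[c] += 1
    for i, v in enumerate(x):
        value_count[c][(i, v)] += 1

add_sample(("A", "small"), "pos")
add_sample(("B", "large"), "neg")
print(class_count["pos"] / n_samples)   # current estimate of P(pos)
```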

3. Extension - Semi-Naive Bayesian Classifier

    In order to reduce the difficulty of estimating the posterior probability P(c|x) from the Bayesian formula, the naive Bayes classifier adopts the attribute conditional independence assumption, but this assumption is often hard to satisfy in real tasks. People have therefore tried to relax it to some extent, giving rise to a class of learning methods called "semi-naive Bayesian classifiers".

    The basic idea of the semi-naive Bayesian classifier is to take the interdependence among some attributes into account appropriately, so that the full joint probability need not be computed, while relatively strong attribute dependencies are not completely ignored either. The One-Dependent Estimator (ODE) is the most commonly used strategy for semi-naive Bayesian classifiers. As the name suggests, "one-dependent" means that each attribute is assumed to depend on at most one other attribute besides the class, that is:

        P(c|x) ∝ P(c) ∏ i=1..d P(xi | c, pai)

    where pai is the attribute that xi depends on, called the parent attribute of xi. For each attribute xi, once its parent attribute pai is determined, the probability P(xi|c, pai) can be estimated in the same way as the Laplacian-corrected conditional probabilities above.
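
    A small sketch, on invented data, of how P(xi|c, pai) could be estimated with a Laplacian correction once the parent attribute pai is known.

```python
# Hedged sketch: estimating P(x_i | c, pa_i) with a Laplacian correction,
# where pa_i is the (assumed known) parent attribute of attribute i.
# Toy rows: (x_parent, x_i, class), all values invented.
D = [("A", "u", "pos"), ("A", "v", "pos"), ("B", "u", "pos"),
     ("A", "u", "neg"), ("B", "v", "neg")]

N_i = len({xi for _, xi, _ in D})      # number of possible values of attribute i

def p_one_dependent(xi, c, pa):
    """P_hat(x_i | c, pa_i) = (|D_{c,pa,x_i}| + 1) / (|D_{c,pa}| + N_i)."""
    D_c_pa = [row for row in D if row[2] == c and row[0] == pa]
    match = sum(1 for row in D_c_pa if row[1] == xi)
    return (match + 1) / (len(D_c_pa) + N_i)

print(p_one_dependent("u", "pos", "A"))   # (1 + 1) / (2 + 2) = 0.5
```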

    Different ways of determining the parent attributes thus produce different one-dependent classifiers. Two simple examples are given below:

    ① SPODE (Super-Parent ODE)

    SPODE assumes that all attributes depend on the same attribute, called the "super-parent"; the super-parent attribute is then determined by model selection methods such as cross-validation.

    ② AODE (Averaged One-Dependent Estimator)

    AODE is a more powerful one-dependent classifier based on an ensemble learning mechanism. Unlike SPODE, which determines the super-parent attribute through model selection, AODE tries to build a SPODE with each attribute as the super-parent, and then integrates the SPODEs with sufficient training-data support into the final result.
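
    A rough sketch of the AODE idea on an invented data set: one SPODE is formed with each attribute as the super-parent, and their (unsmoothed) scores are summed to rank the classes; the threshold for "sufficient training-data support" is omitted for brevity.

```python
# Toy discrete data set: each row is (x_1, x_2, class); all values invented.
D = [("A", "u", "pos"), ("A", "v", "pos"), ("B", "u", "neg"), ("B", "v", "neg")]
classes = sorted({y for *_, y in D})
d = 2                                           # number of attributes

def aode_score(x, c):
    """Sum over super-parents j of P(c, x_j) * prod_i P(x_i | c, x_j) (unsmoothed)."""
    total = 0.0
    for j in range(d):                          # each attribute as the super-parent
        D_c_j = [r for r in D if r[d] == c and r[j] == x[j]]
        p_joint = len(D_c_j) / len(D)           # P(c, x_j)
        prod = p_joint
        for i in range(d):
            if not D_c_j:
                prod = 0.0
                break
            prod *= sum(1 for r in D_c_j if r[i] == x[i]) / len(D_c_j)
        total += prod
    return total

x = ("A", "u")
print(max(classes, key=lambda c: aode_score(x, c)))   # "pos" on this toy data
```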

Summary:

    Since relaxing the attribute conditional independence assumption to the one-dependent assumption can improve generalization performance, can generalization be improved further by considering higher-order dependencies among attributes? That is, ODE can be extended to kDE by replacing the single parent attribute pai with a set pai containing k attributes. It should be noted, however, that as k increases, the number of training samples required to accurately estimate the probability P(xi|c, pai) grows exponentially. Therefore, if the training data are very plentiful, the generalization performance may improve; but with limited samples the method falls into the quagmire of estimating high-order joint probabilities.

Refer to Zhou Zhihua's "Machine Learning"

Origin blog.csdn.net/m0_64007201/article/details/127587678