Li Hang - Statistical Learning Methods - Notes 4: Naive Bayes

Naive Bayes

Introduction: Naive Bayes is a classification method based on Bayes' theorem and the conditional independence assumption of features. For a given training data set, it first learns the joint probability distribution of the input and output under the "conditional independence of features" assumption. Then, based on this model, for a given input x it uses Bayes' theorem to output the class y with the largest posterior probability.

Naive Bayes is simple, and both learning and prediction are efficient, so it is a commonly used method.

Basic method: Naive Bayes learns the joint probability distribution \(P(X, Y)\) from the training data set. Specifically, it learns the prior probability distribution and the conditional probability distribution, from which the joint probability distribution follows.

Prior probability distribution
\[P(Y = c_k), \quad k = 1, 2, \ldots, K\]

Conditional probability distribution
\[P(X = x \ | \ Y = c_k) = P(X = (x^{(1)}, x^{(2)}, \ldots, x^{(n)}) \ | \ Y = c_k), \quad k = 1, 2, \ldots, K\]

The conditional probability distribution has an exponential number of parameters, so estimating it directly is not feasible. Suppose \(x^{(j)}\) can take \(S_j\) possible values; then the number of parameters is \(K \prod_{j=1}^{n} S_j\).
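As a small worked count (the numbers are chosen here only for illustration): with \(K = 2\) classes and \(n = 10\) binary features (\(S_j = 2\) for every \(j\)), the full conditional distribution requires
\[K \prod_{j=1}^{n} S_j = 2 \times 2^{10} = 2048\]
parameters, whereas under the conditional independence assumption introduced next only \(K \sum_{j=1}^{n} S_j = 2 \times 10 \times 2 = 40\) are needed.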

Conditional independence assumption: Naive Bayes assumes that "the features used for classification are conditionally independent given the class." This is a strong assumption that makes the algorithm simpler (hence "naive"), sometimes at the expense of some classification accuracy.
\[\begin{split} P(X = x \ | \ Y = c_k) &= P(x^{(1)}, x^{(2)}, \ldots, x^{(n)} \ | \ Y = c_k) \\ &= \prod_{j=1}^{n} P(X^{(j)} = x^{(j)} \ | \ Y = c_k) \end{split}\]
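A minimal sketch of this factorization in Python (the feature values and probability tables below are hypothetical, invented only for illustration):

```python
# Conditional independence: P(X = x | Y = c_k) = prod_j P(X^(j) = x^(j) | Y = c_k)

# Hypothetical per-feature conditional tables for one fixed class c_k:
# cond[j][value] = P(X^(j) = value | Y = c_k)
cond = [
    {"sunny": 0.6, "rainy": 0.4},   # feature 1
    {"hot": 0.3, "mild": 0.7},      # feature 2
]

def class_conditional_likelihood(x, cond):
    """Multiply the per-feature conditional probabilities."""
    p = 1.0
    for j, value in enumerate(x):
        p *= cond[j][value]
    return p

print(class_conditional_likelihood(["sunny", "mild"], cond))  # 0.6 * 0.7 = 0.42
```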

Naive Bayes classifier

Bayes' theorem
\[\begin{split} P(Y | X) &= \frac{P(Y) \ P(X | Y)}{P(X)} \\ &= \frac{P(Y) \ P(X | Y)}{\sum_Y P(Y) \ P(X | Y)} \end{split}\]
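As a small numeric illustration of this formula (the values are invented, not from the text): with two classes and \(P(Y=1) = 0.3\), \(P(X=x | Y=1) = 0.2\), \(P(Y=0) = 0.7\), \(P(X=x | Y=0) = 0.1\),
\[P(Y=1 | X=x) = \frac{0.3 \times 0.2}{0.3 \times 0.2 + 0.7 \times 0.1} = \frac{0.06}{0.13} \approx 0.46\]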

Classification
When classifying, for a given input \(x\), the learned model is used to compute the posterior probability distribution \(P(Y = c_k \ | \ X = x)\), and the class with the largest posterior probability is output as the class of \(x\).

\[\begin{split} P(Y=c_k \ | \ X=x) &= \frac{P(X = x \ | \ Y = c_k) \ P(Y = c_k)}{P(X = x)} \\ &= \frac{P(X = x \ | \ Y = c_k) \ P(Y = c_k)}{\sum_k P(X = x \ | \ Y = c_k) \ P(Y = c_k)} \\ &= \frac{P(Y = c_k) \ \prod_j P(X^{(j)}=x^{(j)} | Y = c_k)}{\sum_k P(Y = c_k) \ \prod_j P(X^{(j)}=x^{(j)} | Y = c_k)} \end{split}\]

The naive Bayes classifier can therefore be expressed as
\[y = f(x) = \arg\max_{c_k} P(Y = c_k | X = x)\]

Note that the denominator is the same for all \(c_k\), so it can be dropped, giving:
\[y = \arg\max_{c_k} P(Y = c_k) \prod_j P(X^{(j)} = x^{(j)} | Y = c_k)\]
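A minimal sketch of this decision rule in Python (the prior and conditional tables are hypothetical, chosen only to illustrate the arg max):

```python
# y = argmax_{c_k} P(Y = c_k) * prod_j P(X^(j) = x^(j) | Y = c_k)

# Hypothetical learned parameters.
prior = {"spam": 0.4, "ham": 0.6}        # P(Y = c_k)
cond = {                                 # cond[c_k][j][value] = P(X^(j) = value | Y = c_k)
    "spam": [{"yes": 0.8, "no": 0.2}, {"yes": 0.7, "no": 0.3}],
    "ham":  [{"yes": 0.1, "no": 0.9}, {"yes": 0.3, "no": 0.7}],
}

def predict(x):
    """Return the class that maximizes the unnormalized posterior."""
    scores = {}
    for c_k, p_c in prior.items():
        score = p_c
        for j, value in enumerate(x):
            score *= cond[c_k][j][value]
        scores[c_k] = score
    return max(scores, key=scores.get)

print(predict(["yes", "no"]))  # "spam": 0.4*0.8*0.3 = 0.096 beats "ham": 0.6*0.1*0.7 = 0.042
```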

Maximum likelihood estimation

Maximum Likelihood Estimation: a parameter estimation method that works backward from the observed sample results to the parameter values most likely to have produced them.

In naive Bayes, learning means estimating \(P(Y = c_k)\) and \(P(X^{(j)} = x^{(j)} | Y = c_k)\).
Maximum likelihood estimation can be used to estimate these probabilities.
\[P(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k)}{N}\]

Suppose the set of possible values of the \(j\)-th feature \(x^{(j)}\) is \(\{a_{j1}, a_{j2}, \ldots, a_{jS_j}\}\).

\[P(X^{(j)} = a_{jl} \ | \ Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl}, y_i = c_k)}{\sum_{i=1}^{N}I(y_i = c_k)}\]
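A minimal sketch of these counting estimates in Python (the toy training set below is made up for illustration; real data would replace it):

```python
from collections import Counter, defaultdict

# Toy training set: (features, label) pairs; values are hypothetical.
data = [
    (["1", "S"], -1), (["1", "M"], -1), (["1", "M"], 1),
    (["2", "S"], -1), (["2", "M"], 1),  (["2", "L"], 1),
]
N = len(data)

# P(Y = c_k) = count(y_i = c_k) / N
class_count = Counter(y for _, y in data)
prior = {c: n / N for c, n in class_count.items()}

# P(X^(j) = a_jl | Y = c_k) = count(x_i^(j) = a_jl and y_i = c_k) / count(y_i = c_k)
feature_count = defaultdict(Counter)          # keyed by (class, feature index)
for x, y in data:
    for j, value in enumerate(x):
        feature_count[(y, j)][value] += 1

cond = {key: {v: n / class_count[key[0]] for v, n in counts.items()}
        for key, counts in feature_count.items()}

print(prior)         # {-1: 0.5, 1: 0.5}
print(cond[(1, 1)])  # P(X^(2) = value | Y = 1), e.g. {'M': 0.666..., 'L': 0.333...}
```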

Bayesian estimation: with maximum likelihood estimation, some probability estimates may turn out to be zero. This affects the computation of the posterior probability and biases the classification. The solution is to use Bayesian estimation.

\[P_{\lambda}(Y = c_k) = \frac{\sum_{i=1}^{N} I(y_i = c_k) + \lambda}{N + K \lambda}\]

\[P_{\lambda}(X^{(j)} = a_{jl} \ | \ Y = c_k) = \frac{\sum_{i=1}^{N} I(x_i^{(j)} = a_{jl}, y_i = c_k) + \lambda}{\sum_{i=1}^{N}I(y_i = c_k) + S_j \lambda}\]

When \(\lambda = 0\) this reduces to maximum likelihood estimation; when \(\lambda = 1\) it is called Laplace smoothing.
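A minimal sketch of the smoothed estimate in Python (the function name and counts below are illustrative assumptions, not from the text):

```python
def smoothed_conditional(count_jl_k, count_k, S_j, lam=1.0):
    """Bayesian estimate of P(X^(j) = a_jl | Y = c_k) with smoothing parameter lam.

    count_jl_k : number of training samples with x_i^(j) = a_jl and y_i = c_k
    count_k    : number of training samples with y_i = c_k
    S_j        : number of possible values of feature j
    lam = 1.0 gives Laplace smoothing; lam = 0 recovers the maximum likelihood estimate.
    """
    return (count_jl_k + lam) / (count_k + S_j * lam)

# A feature value never seen together with class c_k still gets non-zero probability:
print(smoothed_conditional(0, 3, 3))         # (0 + 1) / (3 + 3 * 1) ≈ 0.167
print(smoothed_conditional(0, 3, 3, lam=0))  # 0.0, the maximum likelihood estimate
```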
