Nuggets notes: Naive Bayes Model

1-- basic theorems and definitions

  • Conditional probability formula:
    \[ P(A|B) = \dfrac{P(AB)}{P(B)} \]

  • Total probability formula:
    \[ P(A) = \sum_{i=1}^{n} P(AB_i) = \sum_{i=1}^{n} P(B_i)P(A|B_i) \]

  • Bayesian formula:
    \[ P(B_i|A) = \dfrac{P(AB_i)}{P(A)} = \dfrac{P(B_i)P(A|B_i)}{\sum_{j=1}^{n} P(B_j)P(A|B_j)} \]

  • Probability sum rules:
    \[ P\left(X=x_i\right)=\sum_{j} P\left(X=x_i, Y=y_j\right) \]

    \[ P\left(X\right)=\sum_Y P\left(X,Y\right) \]

  • Probability product rule:
    \[ P\left(X=x_i,Y=y_j\right)=P\left(Y=y_j|X=x_i\right)P\left(X=x_i\right) \]

    \[ P\left(X,Y\right)=P\left(Y|X\right)P\left(X\right) \]

  • Generative learning method:

    Use the training data to learn estimates of \(P(X|Y)\) and \(P(Y)\), and obtain the joint probability distribution:
    \[ P(X, Y) = P(Y)P(X|Y) \]
    then the posterior probability distribution \(P(Y|X)\) can be obtained. The specific probability estimation method may be maximum likelihood estimation or Bayesian estimation. A quick numerical check of the formulas in this section is sketched below.
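
A minimal sketch in plain Python; the priors \(P(B_i)\) and likelihoods \(P(A|B_i)\) below are made-up numbers chosen purely for illustration.

```python
# A minimal numerical check of the total probability and Bayes formulas.
# The priors P(B_i) and likelihoods P(A | B_i) are made-up illustrative values.

priors = {"B1": 0.5, "B2": 0.3, "B3": 0.2}        # P(B_i), sums to 1
likelihoods = {"B1": 0.9, "B2": 0.5, "B3": 0.1}   # P(A | B_i)

# Total probability formula: P(A) = sum_i P(B_i) P(A | B_i)
p_a = sum(priors[b] * likelihoods[b] for b in priors)

# Bayes formula: P(B_i | A) = P(B_i) P(A | B_i) / P(A)
posteriors = {b: priors[b] * likelihoods[b] / p_a for b in priors}

print(p_a)         # 0.62
print(posteriors)  # the posteriors sum to 1
assert abs(sum(posteriors.values()) - 1.0) < 1e-12
```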

2 - Description of model

Naive Bayes (\(Naive\) \(Bayes\)) is a classification method based on Bayes' theorem and the conditional independence assumption on the features.

For a given training data set, it first learns the joint probability distribution of the input and output based on the conditional independence assumption; then, for a given input \(x\), it uses Bayes' theorem to compute the class \(y\) with the maximum posterior probability as the output.

Maximizing the posterior probability is equivalent to minimizing the expected risk when the \(0-1\) loss function is used.

As a typical generative learning method, naive Bayes is simple and efficient in both learning and prediction, which makes it a commonly used model.

The following describes the classical multinomial naive Bayes classifier.

3 - model assumptions

  1. The training data are generated independently and identically distributed according to \(P(X, Y)\).

  2. Conditional independence assumption: the features used for classification are conditionally independent given the class, namely (see the sketch after this list):
    \[ \begin{aligned} P\left(X=x | Y=c_{k}\right) &= P\left(X^{(1)}=x^{(1)}, \cdots, X^{(n)}=x^{(n)} | Y=c_{k}\right) \\ &= \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \end{aligned} \]
    This is a strong assumption. It sacrifices some classification performance, but it greatly reduces the number of conditional probabilities the model has to estimate, which makes learning and prediction much simpler, more efficient, and easy to implement.

    The model under the conditional independence assumption can be regarded as the simplest probabilistic graphical model.
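
To make the effect of assumption 2 concrete, the short sketch below counts the parameters that would have to be estimated without and with conditional independence: on the order of \(K\prod_{j}S_j\) versus \(K\sum_{j}S_j\). The class and feature counts are made up for illustration.

```python
from math import prod

K = 3             # number of classes (illustrative)
S = [4, 5, 3, 6]  # S_j: number of possible values of each feature (illustrative)

# Without conditional independence: one probability per joint feature value per class.
full_joint_params = K * prod(S)   # 3 * 360 = 1080

# With conditional independence: one small table per feature per class.
naive_bayes_params = K * sum(S)   # 3 * 18 = 54

print(full_joint_params, naive_bayes_params)
```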

4 - Model strategies

  1. Maximum likelihood estimation
  2. Posterior probability maximization

5 - Model Input

Training set \(T=\left\{\left(x_{1}, y_{1}\right),\left(x_{2}, y_{2}\right), \cdots,\left(x_{N}, y_{N}\right)\right\}\), where \(x_i \in \mathcal{X} \subseteq \mathbf{R}^{n}\), \(i=1,2,\dots,N\), \(y_i \in \mathcal{Y}=\{c_1, c_2, \dots, c_K\}\), \(|\mathcal{Y}|=K\); \(x_{i}=\left(x_{i}^{(1)}, x_{i}^{(2)}, \cdots, x_{i}^{(n)}\right)^{\mathrm{T}}\), where \(x_{i}^{(j)}\) is the \(j\)-th feature of the \(i\)-th sample, \(j=1,2,\dots,n\), \(x_{i}^{(j)} \in \{a_{j1}, a_{j2}, \dots, a_{jS_j}\}\), and \(a_{jl}\) is the \(l\)-th possible value of the \(j\)-th feature, \(l=1,2,\dots,S_j\).

In addition, an instance \(x\) is given.

6 - Model derivation

By the assumptions we have
\[ P\left(X=x, Y=c_{k}\right)=P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)=P\left(Y=c_{k} | X=x\right) P(X=x) \]
Taking the last two expressions and rearranging gives:
\[ \begin{aligned} P\left(Y=c_{k} | X=x\right) &= \frac{P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)}{P(X=x)} \\ &= \frac{P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x, Y=c_{k}\right)} \\ &= \frac{P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)}{\sum_{k} P\left(X=x | Y=c_{k}\right) P\left(Y=c_{k}\right)} \\ &= \frac{P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)} \end{aligned} \]
Here the second equality uses the sum rule of probability in the denominator, the third uses the conditional probability (product) formula, and the last uses the conditional independence assumption.

The naive Bayes classifier can then be expressed directly as:
\[ y=f(x)=\arg \max _{c_{k}} \frac{P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)}{\sum_{k} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)} \]
Note that the denominator in this expression is the same for all \(c_k\), so it can be dropped, which gives:
\[ y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \]
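
A minimal sketch of this decision rule, assuming the prior and conditional probability tables have already been estimated; the class names, feature values, and probabilities below are hypothetical placeholders.

```python
# Decision rule: y = argmax_k P(Y=c_k) * prod_j P(X^(j)=x^(j) | Y=c_k),
# assuming the probability tables were estimated beforehand.
# All numbers below are hypothetical placeholders.

prior = {"c1": 0.6, "c2": 0.4}     # P(Y = c_k)
cond = {                           # cond[c_k][j][value] = P(X^(j)=value | Y=c_k)
    "c1": [{"a": 0.7, "b": 0.3}, {"s": 0.2, "t": 0.8}],
    "c2": [{"a": 0.1, "b": 0.9}, {"s": 0.5, "t": 0.5}],
}

def classify(x):
    scores = {}
    for c_k in prior:
        score = prior[c_k]
        for j, value in enumerate(x):
            score *= cond[c_k][j][value]   # the denominator is dropped: same for all c_k
        scores[c_k] = score
    return max(scores, key=scores.get)

print(classify(["a", "t"]))  # compares 0.6*0.7*0.8 = 0.336 vs 0.4*0.1*0.5 = 0.02 -> "c1"
```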

7-- parameter estimation

  1. Maximum likelihood estimation
  • Prior probability:
    \[ P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K \]

  • Conditional probability:
    \[ \begin{array}{l}{P\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}} \\ {j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K}\end{array} \]

where \(I\) is the indicator function.
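
The maximum likelihood estimates above are just counting; the following sketch computes them on a tiny made-up training set.

```python
from collections import Counter, defaultdict

# Tiny made-up training set: each sample is (features, label).
T = [(("a", "s"), "c1"), (("a", "t"), "c1"), (("b", "t"), "c1"),
     (("b", "s"), "c2"), (("b", "t"), "c2")]

N = len(T)
class_count = Counter(y for _, y in T)

# Prior: P(Y=c_k) = count(y_i = c_k) / N
prior = {c: class_count[c] / N for c in class_count}

# Conditional: P(X^(j)=a_jl | Y=c_k) = count(x_i^(j)=a_jl, y_i=c_k) / count(y_i=c_k)
cond_count = defaultdict(Counter)   # keyed by (class, feature index j)
for x, y in T:
    for j, value in enumerate(x):
        cond_count[(y, j)][value] += 1
cond = {key: {v: n / class_count[key[0]] for v, n in counts.items()}
        for key, counts in cond_count.items()}

print(prior)            # {'c1': 0.6, 'c2': 0.4}
print(cond[("c1", 0)])  # {'a': 2/3, 'b': 1/3}
```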

  2. Bayesian estimation

If some feature value never appears together with some class in the training set, the maximum likelihood estimate of the corresponding conditional probability is \(0\), and the whole product in the decision rule then evaluates to \(0\) no matter what the other factors are. To prevent this, Bayesian estimation is introduced as follows.

  • Conditional probability:
    \[ P_{\lambda}\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+S_{j} \lambda} \]
    where \(\lambda \geqslant 0\).

  • Prior probability:
    \[ P_{\lambda}\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+\lambda}{N+K \lambda} \]

When \(\lambda = 0\), this reduces to maximum likelihood estimation.

A common choice is \(\lambda = 1\), known as Laplace smoothing.

Consider the smoothed conditional probability formula above: for any \(l=1,2, \cdots, S_{j}\) and \(k=1,2, \cdots, K\), we have
\[ \begin{array}{l}{P_{\lambda}\left(X^{(j)}=a_{j l} | Y=c_{k}\right)>0} \\ {\sum_{l=1}^{S_{j}} P_{\lambda}\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=1}\end{array} \]
so it is indeed a probability distribution. The same holds for the smoothed prior. In essence, Laplace smoothing assumes that feature values and classes are uniformly distributed a priori, which amounts to additionally introducing prior information about the data. It fixes the problem of probability estimates being \(0\) when the training set is insufficient, and as the training set grows the influence of the introduced prior gradually becomes negligible.
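
A small sketch of the effect of Laplace smoothing (\(\lambda = 1\)) on the conditional probability estimate; the counts and the number of feature values are hypothetical.

```python
# Laplace smoothing for a conditional probability estimate.
# All counts below are hypothetical.

def smoothed_conditional(count_value_and_class, count_class, S_j, lam=1.0):
    """P_lambda(X^(j)=a_jl | Y=c_k) = (count + lambda) / (class count + S_j * lambda)."""
    return (count_value_and_class + lam) / (count_class + S_j * lam)

S_j = 3           # the j-th feature takes 3 possible values
count_class = 10  # number of training samples with y_i = c_k

# A feature value never seen together with class c_k in the training set:
print(smoothed_conditional(0, count_class, S_j, lam=0.0))  # 0.0  -> wipes out the whole product
print(smoothed_conditional(0, count_class, S_j, lam=1.0))  # 1/13 -> small but non-zero

# The smoothed estimates over all S_j values still sum to 1:
counts = [0, 4, 6]  # hypothetical counts of the 3 values within class c_k
probs = [smoothed_conditional(c, count_class, S_j) for c in counts]
assert abs(sum(probs) - 1.0) < 1e-12
```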

8 - algorithmic process (maximum likelihood estimation)

Input: see Section 5 (model input), together with an instance \(x\).

Output: the class of instance \(x\).

  1. Compute the prior probabilities and the conditional probabilities:
  • Prior probabilities (a total of \(K\) equations):
    \[ P\left(Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}{N}, \quad k=1,2, \cdots, K \]

  • Conditional probabilities (a total of \(K\sum_{j=1}^{n}S_j\) equations):
    \[ \begin{array}{l}{P\left(X^{(j)}=a_{j l} | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=a_{j l}, y_{i}=c_{k}\right)}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)}} \\ {j=1,2, \cdots, n ; \quad l=1,2, \cdots, S_{j} ; \quad k=1,2, \cdots, K}\end{array} \]

  2. For the given instance \(x=\left(x^{(1)}, x^{(2)}, \cdots, x^{(n)}\right)^{\mathrm{T}}\), compute:
    \[ P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right), \quad k=1,2, \cdots, K \]

  3. Determine the class of instance \(x\):
    \[ y=\arg \max _{c_{k}} P\left(Y=c_{k}\right) \prod_{j=1}^{n} P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right) \]
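
Putting the three steps together, here is a compact end-to-end sketch (maximum likelihood estimation, discrete features); the class name `MultinomialNaiveBayes` and the toy training data are made up for illustration.

```python
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Naive Bayes for discrete features, fitted with maximum likelihood estimation."""

    def fit(self, X, y):
        # Step 1: estimate priors and conditional probabilities by counting.
        N = len(y)
        self.class_count = Counter(y)
        self.prior = {c: n / N for c, n in self.class_count.items()}
        self.cond = defaultdict(Counter)   # (class, feature index) -> value counts
        for x_i, y_i in zip(X, y):
            for j, value in enumerate(x_i):
                self.cond[(y_i, j)][value] += 1
        return self

    def predict(self, x):
        # Steps 2-3: compute P(Y=c_k) * prod_j P(X^(j)=x^(j)|Y=c_k) and take the argmax.
        best_class, best_score = None, -1.0
        for c_k, n_k in self.class_count.items():
            score = self.prior[c_k]
            for j, value in enumerate(x):
                score *= self.cond[(c_k, j)][value] / n_k
            if score > best_score:
                best_class, best_score = c_k, score
        return best_class

# Made-up training data for illustration.
X = [("a", "s"), ("a", "t"), ("b", "t"), ("b", "s"), ("b", "t")]
y = ["c1", "c1", "c1", "c2", "c2"]

model = MultinomialNaiveBayes().fit(X, y)
print(model.predict(("a", "t")))  # "c1"
```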

9 - Gaussian naive Bayes classifier

The above is the multinomial naive Bayes classifier, which applies when the features take discrete values.

If the features in the data set are continuous numerical variables, the Gaussian naive Bayes classifier described in this section should be chosen instead.

Suppose each feature \(x^{(j)}\) follows a Gaussian distribution within each class, namely:
\[ P\left(x^{(j)} | Y=c_{k}\right) \sim \mathcal{N}\left(\mu_{j, k}, \sigma_{j, k}^{2}\right) \]
where \(\mu_{j,k}\) and \(\sigma_{j,k}\) are the mean and standard deviation of feature \(x^{(j)}\) over the training samples belonging to class \(c_k\). The conditional probability can then be expressed as:
\[ P\left(x^{(j)} | Y=c_{k}\right)=\frac{1}{\sqrt{2 \pi} \sigma_{j, k}} \exp \left(-\frac{\left(x^{(j)}-\mu_{j, k}\right)^{2}}{2 \sigma_{j, k}^{2}}\right) \]
The other steps follow the same idea; refer to the multinomial naive Bayes classifier.
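
A sketch of the Gaussian conditional probability above; the per-class means, standard deviations, and prior values are made-up placeholders.

```python
import math

def gaussian_likelihood(x_j, mu_jk, sigma_jk):
    """P(x^(j) | Y=c_k) under the Gaussian assumption."""
    coef = 1.0 / (math.sqrt(2.0 * math.pi) * sigma_jk)
    return coef * math.exp(-((x_j - mu_jk) ** 2) / (2.0 * sigma_jk ** 2))

# Made-up per-class statistics for one feature: (mean, standard deviation).
stats = {"c1": (1.0, 0.5), "c2": (3.0, 1.0)}
prior = {"c1": 0.5, "c2": 0.5}

x_j = 1.4
scores = {c: prior[c] * gaussian_likelihood(x_j, mu, sigma)
          for c, (mu, sigma) in stats.items()}
print(max(scores, key=scores.get))  # "c1": the point is much closer to class c1's mean
```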

10 - Bernoulli naive Bayes classifier

In some tasks, such as text mining, the features \(x^{(j)}\) are binary (\(0-1\)) variables, in which case the Bernoulli naive Bayes classifier is preferred.

Suppose the conditional distribution of each feature \(x^{(j)}\) given the class is a Bernoulli distribution.

Let the feature \(x^{(j)} \in \{0, 1\}\), and denote:
\[ p=P\left(X^{(j)}=1 | Y=c_{k}\right)=\frac{\sum_{i=1}^{N} I\left(x_{i}^{(j)}=1, y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N} I\left(y_{i}=c_{k}\right)+K \lambda} \]
The conditional probability can therefore be written as:
\[ P\left(X^{(j)}=x^{(j)} | Y=c_{k}\right)=p \cdot x^{(j)}+(1-p) \cdot\left(1-x^{(j)}\right) \]
The other steps follow the same idea; refer to the multinomial naive Bayes classifier.
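
A sketch of the Bernoulli conditional probability above, given an already-estimated \(p = P(X^{(j)}=1 | Y=c_k)\); the value of \(p\) is hypothetical.

```python
def bernoulli_conditional(x_j, p):
    """P(X^(j)=x^(j) | Y=c_k) = p*x^(j) + (1-p)*(1-x^(j)) for a binary feature x^(j) in {0, 1}."""
    return p * x_j + (1.0 - p) * (1.0 - x_j)

# p is the (smoothed) estimate of P(X^(j)=1 | Y=c_k); the value here is hypothetical.
p = 0.8
print(bernoulli_conditional(1, p))  # 0.8 -> P(X^(j)=1 | Y=c_k)
print(bernoulli_conditional(0, p))  # 0.2 -> P(X^(j)=0 | Y=c_k)
```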

11-- Special topic: why does no loss function appear?

The following proves that maximizing the posterior probability is equivalent to minimizing the expected risk.

Choose the \(0-1\) loss function:
\[ L(Y, f(X))=\left\{\begin{array}{ll}{1,} & {Y \neq f(X)} \\ {0,} & {Y=f(X)}\end{array}\right. \]
where \(f(X)\) is the classification decision function. The expected risk is then:
\[ R_{\mathrm{exp}}(f)=E[L(Y, f(X))] \]
The expectation is taken with respect to the joint distribution \(P(X, Y)\), so taking the conditional expectation gives:
\[ R_{\mathrm{exp}}(f)=E_{X} \sum_{k=1}^{K}\left[L\left(c_{k}, f(X)\right)\right] P\left(c_{k} | X\right) \]
To minimize the expected risk, it suffices to minimize it pointwise for each \(X=x\), namely:
\[ \begin{aligned} f(x) &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} L\left(c_{k}, y\right) P\left(c_{k} | X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}} \sum_{k=1}^{K} P\left(y \neq c_{k} | X=x\right) \\ &=\arg \min _{y \in \mathcal{Y}}\left(1-P\left(y=c_{k} | X=x\right)\right) \\ &=\arg \max _{y \in \mathcal{Y}} P\left(y=c_{k} | X=x\right) \end{aligned} \]
Note the step from the first line to the second: for the loss \(L(c_k, y)\), when \(y = c_k\) the loss is \(L(c_k, y)=0\) and the corresponding posterior term contributes nothing; only when \(y \neq c_k\) is \(L(c_k, y)=1\) and the posterior term kept, so the first line can be written in the simplified form of the second. The step from the second line to the third is straightforward. From the third line to the fourth, note that \(\arg \min\) becomes \(\arg \max\): minimizing the loss has been converted into maximizing the posterior probability. Thus minimizing the expected risk is equivalent to maximizing the posterior probability, which is exactly what naive Bayes uses:
\[ f(x)=\arg \max _{c_{k}} P\left(Y=c_{k} | X=x\right) \]
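
The equivalence can also be checked numerically: for a fixed (made-up) posterior over the classes, the class that minimizes the expected \(0-1\) loss is exactly the class with the maximum posterior probability.

```python
# Numerical check: minimizing expected 0-1 loss == maximizing the posterior.
# The posterior values are made up for illustration.
posterior = {"c1": 0.2, "c2": 0.5, "c3": 0.3}  # P(Y = c_k | X = x)

def expected_01_loss(y):
    # sum_k L(c_k, y) P(c_k | X=x), with L = 1 when c_k != y and 0 otherwise
    return sum(p for c_k, p in posterior.items() if c_k != y)

risk_minimizer = min(posterior, key=expected_01_loss)
map_class = max(posterior, key=posterior.get)
assert risk_minimizer == map_class
print(risk_minimizer)  # "c2"
```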
