[Eating the book together] "Machine Learning" Chapter 7: Bayesian Classifiers

Chapter 7 Bayesian Classifiers

7.1 Bayesian decision theory

  For classification tasks, in the ideal situation where all relevant probabilities are known, Bayesian decision theory considers how to select the optimal class label based on these probabilities and the misclassification losses. Assume there are $N$ possible class labels, i.e. $\mathcal{Y} = \{c_1, c_2, \ldots, c_N\}$, and let $\lambda_{ij}$ be the loss incurred by misclassifying a sample whose true label is $c_j$ as $c_i$. Based on the posterior probability $P(c_i|\mathbf{x})$, the expected loss of classifying sample $\mathbf{x}$ as $c_i$, i.e. the "conditional risk" on sample $\mathbf{x}$, is
$$R(c_i|\mathbf{x}) = \sum_{j=1}^{N} \lambda_{ij} P(c_j|\mathbf{x})$$
  Our task is to find a criterion $h: \mathcal{X} \mapsto \mathcal{Y}$ that minimizes the overall risk:
$$R(h) = \mathbb{E}_{\mathbf{x}}\big[R(h(\mathbf{x})\,|\,\mathbf{x})\big]$$
  Obviously, for each sample $\mathbf{x}$, if $h$ minimizes the conditional risk $R(h(\mathbf{x})|\mathbf{x})$, then the overall risk $R(h)$ is also minimized. This leads to the Bayes decision rule: to minimize the overall risk, simply choose for each sample the class label that minimizes the conditional risk $R(c|\mathbf{x})$, that is,
$$h^*(\mathbf{x}) = \mathop{\arg\min}_{c \in \mathcal{Y}} R(c|\mathbf{x})$$
  Here $h^*$ is called the Bayes optimal classifier, and the corresponding $R(h^*)$ is called the Bayes risk. $1 - R(h^*)$ reflects the best performance the classifier can possibly achieve, i.e. the theoretical upper bound on the accuracy attainable by machine learning.
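  As a concrete illustration (my own sketch, not from the book), the snippet below computes the conditional risks for a made-up posterior and a made-up loss matrix, then picks the Bayes-optimal label; all numbers are purely illustrative.

```python
import numpy as np

# Toy setting: 3 classes, hypothetical posterior P(c_i | x) for one sample x
posterior = np.array([0.2, 0.5, 0.3])

# Loss matrix: loss[i, j] = lambda_ij, the cost of predicting c_i when the true class is c_j
# (zero on the diagonal: correct decisions cost nothing)
loss = np.array([
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
])

# Conditional risk R(c_i | x) = sum_j lambda_ij * P(c_j | x)
risk = loss @ posterior

# Bayes decision rule: pick the label with the smallest conditional risk
best = int(np.argmin(risk))
print("conditional risks:", risk)          # [1.1, 0.5, 1.1]
print("Bayes-optimal label index:", best)  # 1
```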

  It is not hard to see that to use the Bayes decision rule to minimize the decision risk, one must first obtain the posterior probability $P(c|\mathbf{x})$. However, this is usually difficult to obtain directly in real tasks, so what machine learning aims to do is to estimate the posterior probability $P(c|\mathbf{x})$ as accurately as possible from the training data. There are currently two strategies:

  • Discriminative models: given $\mathbf{x}$, model $P(c|\mathbf{x})$ directly to predict $c$.
  • Generative models: first model the joint probability distribution $P(\mathbf{x}, c)$, and then obtain $P(c|\mathbf{x})$ from it:

$$P(c|\mathbf{x}) = \frac{P(\mathbf{x}, c)}{P(\mathbf{x})} = \frac{P(c)P(\mathbf{x}|c)}{P(\mathbf{x})}$$

  Here $P(c)$ is the class "prior" probability, $P(\mathbf{x}|c)$ is the class-conditional probability of sample $\mathbf{x}$ given class label $c$ (also called the "likelihood"), and $P(\mathbf{x})$ is the "evidence" factor used for normalization. For a given sample $\mathbf{x}$, the evidence factor $P(\mathbf{x})$ is independent of the class label, so the problem of estimating $P(c|\mathbf{x})$ is transformed into how to estimate the prior $P(c)$ and the likelihood $P(\mathbf{x}|c)$ from the training data $D$.

  The class prior probability $P(c)$ expresses the proportion of each class of samples in the sample space. By the law of large numbers, when the training set contains sufficiently many independent and identically distributed samples, $P(c)$ can be estimated by the frequency with which each class appears.

  For the class-conditional probability $P(\mathbf{x}|c)$, since it involves the joint probability of all attributes of $\mathbf{x}$, estimating it directly from the frequency of sample occurrences runs into serious difficulties.

  • Prior probability: the probability assigned to an event or parameter before new information is observed.
  • Posterior probability: the re-estimated probability of an event or parameter after new information has been observed.
  • Likelihood: the probability of observing the new information given the event or parameter.

7.2 Maximum Likelihood Estimation

  A common strategy for estimating the class-conditional probability is to assume that it follows a probability distribution of a certain form, and then estimate the parameters of that distribution from the training samples. Suppose $P(\mathbf{x}|c)$ has a definite form and is uniquely determined by a parameter vector $\theta_c$; the task is then to use the training set $D$ to estimate the parameter $\theta_c$.

  The following is a brief introduction to the views of the two schools of statistics on parameter estimation:

  • Frequentist school: the parameter, though unknown, is an objectively existing fixed value; its value can be determined by optimizing a criterion such as the likelihood function. This gives rise to statistical learning.
  • Bayesian school: the parameter is an unobserved random variable that itself follows a distribution; one can assume the parameter obeys a prior distribution and then compute its posterior distribution from the observed data. This gives rise to Bayesian learning.

  Let $D_c$ denote the set of class-$c$ samples in the training set $D$. Assuming these samples are independent and identically distributed, the likelihood of the parameter $\theta_c$ for the dataset $D_c$ is
$$P(D_c|\theta_c) = \prod_{\mathbf{x} \in D_c} P(\mathbf{x}|\theta_c)$$
  Considering that the continued multiplication of many small probabilities easily causes numerical underflow, the log-likelihood is usually used instead:
$$LL(\theta_c) = \log P(D_c|\theta_c) = \sum_{\mathbf{x} \in D_c} \log P(\mathbf{x}|\theta_c)$$
  Maximum likelihood estimation of $\theta_c$ then amounts to finding the parameter value $\hat{\theta}_c$ that maximizes the likelihood $P(D_c|\theta_c)$, i.e.
$$\hat{\theta}_c = \mathop{\arg\max}_{\theta_c} LL(\theta_c)$$
  Let's take an example of maximum likelihood estimation below.
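  As a hedged sketch (not taken from the book's worked example), suppose the class-conditional distribution of a single continuous attribute is assumed to be Gaussian; the maximum likelihood estimates are then the sample mean and the biased sample variance. The data values below are made up for illustration.

```python
import numpy as np

# Hypothetical class-c observations of one continuous attribute
D_c = np.array([0.697, 0.774, 0.634, 0.608, 0.556, 0.403, 0.481, 0.437])

# Under the assumption P(x | c) = N(mu_c, sigma_c^2), the MLE solutions are
# the sample mean and the biased sample variance (divide by n, not n-1).
mu_hat = D_c.mean()
sigma2_hat = ((D_c - mu_hat) ** 2).mean()

# Log-likelihood at the estimate: LL(theta_c) = sum_x log N(x; mu_hat, sigma2_hat)
ll = -0.5 * len(D_c) * np.log(2 * np.pi * sigma2_hat) \
     - ((D_c - mu_hat) ** 2).sum() / (2 * sigma2_hat)

print(f"mu_hat = {mu_hat:.4f}, sigma2_hat = {sigma2_hat:.4f}, LL = {ll:.4f}")
```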

7.3 Naive Bayes Classifier

  The main difficulty in estimating the posterior probability from Bayes' rule is that the class-conditional probability is a joint probability over all attributes, which is hard to estimate directly from a limited set of training samples: computing it directly runs into a combinatorial explosion, and the data suffer from sample sparsity (the more attributes there are, the more severe the problem).

  To sidestep this obstacle, the naive Bayes classifier adopts the "attribute conditional independence assumption": given the class, all attributes are assumed to be mutually independent, i.e. each attribute is assumed to influence the classification result independently, as shown below, where $d$ is the number of attributes and $x_i$ is the value of $\mathbf{x}$ on the $i$-th attribute.
$$P(c|\mathbf{x}) = \frac{P(c)P(\mathbf{x}|c)}{P(\mathbf{x})} = \frac{P(c)}{P(\mathbf{x})}\prod_{i=1}^{d} P(x_i|c)$$
  Based on the Bayesian decision criterion, the naive Bayes classifier is
$$h_{nb}(\mathbf{x}) = \mathop{\arg\max}_{c \in \mathcal{Y}} P(c)\prod_{i=1}^{d} P(x_i|c)$$
  Let $D_c$ denote the set of class-$c$ samples in the training set $D$. If there are sufficiently many independent and identically distributed samples, the class prior probability can be estimated as $P(c) = \frac{|D_c|}{|D|}$.

  For discrete attributes, let $D_{c,x_i}$ denote the subset of $D_c$ whose samples take value $x_i$ on the $i$-th attribute; the conditional probability $P(x_i|c)$ can then be estimated as
$$P(x_i|c) = \frac{|D_{c,x_i}|}{|D_c|}$$
  For continuous attributes, a probability density function can be used instead. Assume $p(x_i|c) \sim \mathcal{N}(\mu_{c,i}, \sigma_{c,i}^2)$, where $\mu_{c,i}$ and $\sigma_{c,i}^2$ are the mean and variance of the values taken by class-$c$ samples on the $i$-th attribute; then
$$p(x_i|c) = \frac{1}{\sqrt{2\pi}\,\sigma_{c,i}} \exp\!\left(-\frac{(x_i - \mu_{c,i})^2}{2\sigma_{c,i}^2}\right)$$
  Let's take an example of the naive Bayes classifier below.
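  The following minimal sketch (my own illustration, using a hypothetical toy dataset rather than the book's watermelon table) estimates class priors and per-attribute conditional probabilities from discrete training samples and classifies a new sample with the naive Bayes rule above.

```python
from collections import Counter, defaultdict

# Hypothetical toy dataset: each sample is ({attribute: value}, class_label)
train = [
    ({"color": "green", "root": "curled"}, "good"),
    ({"color": "black", "root": "curled"}, "good"),
    ({"color": "green", "root": "stiff"},  "bad"),
    ({"color": "white", "root": "stiff"},  "bad"),
    ({"color": "black", "root": "curled"}, "good"),
    ({"color": "white", "root": "curled"}, "bad"),
]

# Class priors P(c) = |D_c| / |D|
class_counts = Counter(label for _, label in train)
priors = {c: n / len(train) for c, n in class_counts.items()}

# Conditional probabilities P(x_i | c) = |D_{c,x_i}| / |D_c|
cond_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
for attrs, label in train:
    for a, v in attrs.items():
        cond_counts[(label, a)][v] += 1

def predict(attrs):
    """Naive Bayes rule: argmax_c P(c) * prod_i P(x_i | c)."""
    scores = {}
    for c in priors:
        score = priors[c]
        for a, v in attrs.items():
            score *= cond_counts[(c, a)][v] / class_counts[c]
        scores[c] = score
    return max(scores, key=scores.get), scores

label, scores = predict({"color": "green", "root": "curled"})
print(label, scores)   # "good" wins for this toy sample
```

  In practice a smoothing correction (e.g. Laplace smoothing) is usually applied so that an attribute value unseen for some class does not zero out the whole product.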

7.4 Semi-Naive Bayesian Classifiers

  The basic idea of the semi-naive Bayes classifier is to take into account, to an appropriate extent, the interdependence among some of the attributes, so that it neither has to compute the full joint probability nor completely ignores relatively strong attribute dependencies. "One-dependent estimation" (ODE) is the most commonly used strategy for semi-naive Bayes classifiers: it assumes that each attribute depends on at most one other attribute besides the class, that is,
$$P(c|\mathbf{x}) \propto P(c)\prod_{i=1}^{d} P(x_i|c, pa_i)$$
  Here $pa_i$ is the attribute on which attribute $x_i$ depends, called the parent attribute of $x_i$. The most direct way to determine each attribute's parent is to assume that all attributes depend on the same attribute, called the "super-parent", and then determine the super-parent attribute through model-selection methods such as cross-validation; a rough sketch of this idea follows.
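  A minimal sketch of the super-parent idea, assuming a hypothetical discrete dataset in the same format as the earlier naive Bayes example: every other attribute is conditioned on both the class and one chosen super-parent attribute, so the score is $P(c, x_{sp})\prod_{i \neq sp} P(x_i|c, x_{sp})$.

```python
from collections import Counter, defaultdict

# Hypothetical discrete dataset: ({attribute: value}, class_label)
train = [
    ({"color": "green", "root": "curled", "sound": "dull"},  "good"),
    ({"color": "black", "root": "curled", "sound": "dull"},  "good"),
    ({"color": "black", "root": "curled", "sound": "crisp"}, "good"),
    ({"color": "green", "root": "stiff",  "sound": "crisp"}, "bad"),
    ({"color": "white", "root": "stiff",  "sound": "crisp"}, "bad"),
    ({"color": "white", "root": "curled", "sound": "dull"},  "bad"),
]
super_parent = "root"   # would normally be chosen by cross-validation; fixed here for brevity

# Counts for P(c, x_sp) and P(x_i | c, x_sp)
joint_sp = Counter()                    # (class, sp_value) -> count
cond = defaultdict(Counter)             # (class, sp_value, attribute) -> Counter of values
for attrs, c in train:
    sp = attrs[super_parent]
    joint_sp[(c, sp)] += 1
    for a, v in attrs.items():
        if a != super_parent:
            cond[(c, sp, a)][v] += 1

def predict(attrs):
    """Super-parent score: P(c, x_sp) * prod_{i != sp} P(x_i | c, x_sp)."""
    sp = attrs[super_parent]
    scores = {}
    for (c, s), n in joint_sp.items():
        if s != sp:
            continue                    # this class never co-occurred with x_sp in the data
        score = n / len(train)
        for a, v in attrs.items():
            if a != super_parent:
                score *= cond[(c, sp, a)][v] / n
        scores[c] = score
    return scores

print(predict({"color": "black", "root": "curled", "sound": "dull"}))
```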

7.5 Bayesian Nets

  A Bayesian network uses a directed acyclic graph (DAG) to describe the dependencies among attributes and uses conditional probability tables to describe the joint probability distribution of the attributes. Specifically, a Bayesian network $B$ consists of two parts, a structure $G$ and parameters $\Theta$, i.e. $B = \langle G, \Theta \rangle$. The network structure $G$ is a directed acyclic graph in which each node corresponds to an attribute, and two attributes are connected by an edge if they have a direct dependency; the parameters $\Theta$ quantitatively describe this dependency. Letting $\pi_i$ be the set of parent nodes of attribute $x_i$ in $G$, $\Theta$ contains the conditional probability table $\theta_{x_i|\pi_i} = P_B(x_i|\pi_i)$ for each attribute.

  Let's illustrate Bayesian networks with an example. In the book's figure (not reproduced here), "color" directly depends on "good melon" and "sweetness", while "root" directly depends on "sweetness"; furthermore, the conditional probability table quantifies the dependence of "root" on "sweetness", e.g. $P(\text{root}=\text{stiff} \mid \text{sweetness}=\text{high}) = 0.1$, and so on.

(1) Structure

  The structure of a Bayesian network effectively expresses conditional independence among attributes: given its set of parent nodes, each attribute is assumed to be independent of its non-descendant attributes. Thus $B = \langle G, \Theta \rangle$ defines the joint probability distribution of the attributes $x_1, x_2, \ldots, x_d$ as
$$P_B(x_1, x_2, \ldots, x_d) = \prod_{i=1}^{d} P_B(x_i|\pi_i) = \prod_{i=1}^{d} \theta_{x_i|\pi_i}$$
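  As an illustration under assumed, made-up conditional probability tables, the sketch below evaluates this factorization for a tiny network with structure $x_1 \to x_3$, $x_1 \to x_4$, $x_2 \to x_4$ over binary attributes.

```python
# Tiny hypothetical Bayesian network over binary variables x1..x4
# Structure: x1 -> x3, x1 -> x4, x2 -> x4  (so pi_3 = {x1}, pi_4 = {x1, x2})
P_x1 = {0: 0.6, 1: 0.4}
P_x2 = {0: 0.7, 1: 0.3}
P_x3_given_x1 = {(0, 0): 0.9, (0, 1): 0.1,               # keys: (x1, x3)
                 (1, 0): 0.2, (1, 1): 0.8}
P_x4_given_x1x2 = {(0, 0, 0): 0.95, (0, 0, 1): 0.05,     # keys: (x1, x2, x4)
                   (0, 1, 0): 0.40, (0, 1, 1): 0.60,
                   (1, 0, 0): 0.30, (1, 0, 1): 0.70,
                   (1, 1, 0): 0.10, (1, 1, 1): 0.90}

def joint(x1, x2, x3, x4):
    """P_B(x1,x2,x3,x4) = P(x1) P(x2) P(x3|x1) P(x4|x1,x2)."""
    return (P_x1[x1] * P_x2[x2]
            * P_x3_given_x1[(x1, x3)]
            * P_x4_given_x1x2[(x1, x2, x4)])

# The factorized joint distribution must sum to 1 over all assignments
total = sum(joint(a, b, c, d)
            for a in (0, 1) for b in (0, 1) for c in (0, 1) for d in (0, 1))
print(joint(1, 0, 1, 1), total)   # 0.4*0.7*0.8*0.7 = 0.1568, total = 1.0
```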
  The figure below (from the book, not reproduced here) shows the three typical dependency structures among variables in a Bayesian network. In the "common parent" structure, given the value of the parent node $x_1$, $x_3$ and $x_4$ are conditionally independent; in the "sequential" structure, given the value of $x$, $y$ and $z$ are conditionally independent; the V-structure, also called a "collider" structure, is different: given the value of the child node $x_4$, $x_1$ and $x_2$ are necessarily not independent, but if the value of $x_4$ is completely unknown, then $x_1$ and $x_2$ in the V-structure are independent of each other. Such independence is called "marginal independence".

  To analyze conditional independence between variables in a directed graph, "directed separation" (d-separation) can be used. The directed graph is first converted into its "moral graph" in two steps (see the sketch after this list):

  • Find all V-structures in the directed graph and add an undirected edge between the two parent nodes of each V-structure.
  • Change all directed edges into undirected edges.
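  A minimal sketch of this moralization step, assuming the graph is given as a hypothetical parent-list dictionary: it adds an edge between every pair of parents that share a child and then drops edge directions.

```python
from itertools import combinations

def moral_graph(parents):
    """Build the moral graph of a DAG given as {node: [parent nodes]}.

    Step 1: connect every pair of parents that share a child (the two parents
    of each V-structure).  Step 2: drop edge directions.
    Returns the undirected edge set as frozensets.
    """
    edges = set()
    for child, ps in parents.items():
        for p in ps:                      # original directed edges, now undirected
            edges.add(frozenset((p, child)))
        for a, b in combinations(ps, 2):  # connect co-parents
            edges.add(frozenset((a, b)))
    return edges

# Hypothetical DAG: x1 -> x3, x1 -> x4, x2 -> x4  (x4 is a collider)
parents = {"x3": ["x1"], "x4": ["x1", "x2"], "x1": [], "x2": []}
for e in sorted(moral_graph(parents), key=sorted):
    print(sorted(e))
# [['x1', 'x2'], ['x1', 'x3'], ['x1', 'x4'], ['x2', 'x4']]
```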

(2) Learning

  The primary task of Bayesian network learning is to find the most "appropriate" Bayesian network from the training dataset. "Score-and-search" is a common approach to this problem: define a scoring function that evaluates how well a Bayesian network fits the training data, and then search for the Bayesian network structure that is optimal under this scoring function.

  Commonly used scoring functions are usually based on information-theoretic criteria, which regard the learning problem as a data-compression task: the goal of learning is to find a model that can describe the training data with the shortest code length. For Bayesian network learning, the model is a Bayesian network; moreover, each Bayesian network describes a probability distribution over the training data and comes with its own encoding mechanism that gives shorter codes to frequently occurring samples. We should therefore choose the Bayesian network with the shortest overall code length, which is exactly the "minimum description length" (MDL) criterion.

  Given the training set $D = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_m\}$, the scoring function of a Bayesian network $B = \langle G, \Theta \rangle$ on $D$ can be written as follows, where $|B|$ is the number of parameters of the Bayesian network and $f(\theta)$ denotes the number of bits needed to encode each parameter $\theta$:
$$s(B|D) = f(\theta)|B| - LL(B|D)$$
  The first term computes the number of bits needed to encode the Bayesian network $B$, and the second term, the log-likelihood $LL(B|D) = \sum_{i=1}^{m}\log P_B(\mathbf{x}_i)$, measures how well the probability distribution $P_B$ corresponding to $B$ describes $D$. The learning task thus becomes an optimization task: find the Bayesian network $B$ that minimizes the scoring function $s(B|D)$.

  If $f(\theta) = 1$, i.e. each parameter is described with 1 bit, we obtain the AIC scoring function
$$AIC(B|D) = |B| - LL(B|D)$$
  If $f(\theta) = \frac{1}{2}\log m$, i.e. each parameter is described with $\frac{1}{2}\log m$ bits, we obtain the BIC scoring function
$$BIC(B|D) = \frac{\log m}{2}|B| - LL(B|D)$$
  If $f(\theta) = 0$, i.e. the length of encoding the network is not counted at all, then the scoring function degenerates into the negative log-likelihood, and correspondingly the learning task degenerates into maximum likelihood estimation.
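  A rough sketch of scoring one fixed structure on discrete data under the formulas above; the dataset and structure are made up, the parameters are fitted by maximum likelihood, and $|B|$ counts the free parameters of each conditional probability table.

```python
import math
from collections import Counter

# Hypothetical binary data over attributes a, b, c and a candidate structure a -> c <- b
data = [
    {"a": 0, "b": 0, "c": 0}, {"a": 0, "b": 1, "c": 1},
    {"a": 1, "b": 0, "c": 1}, {"a": 1, "b": 1, "c": 1},
    {"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 1, "c": 1},
]
parents = {"a": [], "b": [], "c": ["a", "b"]}
values = {v: sorted({row[v] for row in data}) for v in parents}

def bic_score(parents, data):
    m = len(data)
    ll, n_params = 0.0, 0
    for node, ps in parents.items():
        # |B| contribution: (|values(node)| - 1) free parameters per parent configuration
        n_conf = math.prod(len(values[p]) for p in ps)
        n_params += n_conf * (len(values[node]) - 1)
        # Maximum-likelihood CPT from counts, then this node's log-likelihood contribution
        joint = Counter((tuple(row[p] for p in ps), row[node]) for row in data)
        marg = Counter(tuple(row[p] for p in ps) for row in data)
        for row in data:
            conf = tuple(row[p] for p in ps)
            ll += math.log(joint[(conf, row[node])] / marg[conf])
    return (math.log(m) / 2) * n_params - ll   # BIC(B|D) = (log m / 2)|B| - LL(B|D)

print(round(bic_score(parents, data), 3))
```

  Structure search would compare this score across candidate graphs and keep the one with the smallest value.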

(3) Inference

  Inference refers to the process of inferring the variables to be queried from the observed values of known variables (the evidence). Ideally, one would compute the posterior probability exactly from the joint probability distribution defined by the Bayesian network, but this is an NP-hard problem. In practical applications, approximate inference in a Bayesian network is often carried out with Gibbs sampling, a random sampling method sketched below.
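  The book presents Gibbs sampling as an algorithm box; here is a hedged sketch of the idea for the tiny made-up network used in the factorization example above: each non-evidence variable is resampled in turn from its conditional distribution given the current values of all other variables, and the query probability is approximated by the frequency of visited states.

```python
import random

random.seed(0)

# Same made-up CPTs as the earlier toy network: x1 -> x3, x1 -> x4, x2 -> x4
P_x1 = {0: 0.6, 1: 0.4}
P_x2 = {0: 0.7, 1: 0.3}
P_x3 = {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.8}                  # (x1, x3)
P_x4 = {(0, 0, 0): 0.95, (0, 0, 1): 0.05, (0, 1, 0): 0.40, (0, 1, 1): 0.60,
        (1, 0, 0): 0.30, (1, 0, 1): 0.70, (1, 1, 0): 0.10, (1, 1, 1): 0.90}  # (x1, x2, x4)

def joint(s):
    return P_x1[s["x1"]] * P_x2[s["x2"]] * P_x3[(s["x1"], s["x3"])] \
           * P_x4[(s["x1"], s["x2"], s["x4"])]

def gibbs(query, evidence, n_samples=20000, burn_in=1000):
    """Estimate P(query variable = 1 | evidence) by Gibbs sampling."""
    state = {v: evidence.get(v, random.choice((0, 1))) for v in ("x1", "x2", "x3", "x4")}
    free = [v for v in state if v not in evidence]
    hits = 0
    for t in range(n_samples + burn_in):
        for v in free:                      # resample each non-evidence variable in turn
            p = {val: joint({**state, v: val}) for val in (0, 1)}
            state[v] = 1 if random.random() < p[1] / (p[0] + p[1]) else 0
        if t >= burn_in and state[query] == 1:
            hits += 1
    return hits / n_samples

# Approximate P(x1 = 1 | x4 = 1); the exact value for these CPTs is about 0.702
print(round(gibbs("x1", {"x4": 1}), 3))
```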

7.6 EM algorithm

  Real-life training samples are often "incomplete"; for example, a watermelon's root may have fallen off, so we cannot tell whether it was "curled" or "stiff". Unobserved variables are formally called "latent variables". Let $\mathbf{X}$ denote the set of observed variables, $\mathbf{Z}$ the set of latent variables, and $\Theta$ the model parameters. To perform maximum likelihood estimation of $\Theta$, we would maximize the log-likelihood
$$LL(\Theta|\mathbf{X}, \mathbf{Z}) = \ln P(\mathbf{X}, \mathbf{Z}|\Theta)$$
  However, since $\mathbf{Z}$ are latent variables, the above expression cannot be maximized directly. Instead, we can take the expectation over $\mathbf{Z}$ and maximize the log "marginal likelihood" of the observed data:
$$LL(\Theta|\mathbf{X}) = \ln P(\mathbf{X}|\Theta) = \ln \sum_{\mathbf{Z}} P(\mathbf{X}, \mathbf{Z}|\Theta)$$
  The EM (Expectation-Maximization) algorithm is a commonly used tool for estimating parameters in the presence of latent variables. It is an iterative method whose basic idea is: if the parameters $\Theta$ were known, the optimal values of the latent variables $\mathbf{Z}$ could be inferred from the training data (the E-step); conversely, if the values of $\mathbf{Z}$ were known, it would be convenient to perform maximum likelihood estimation of the parameters $\Theta$ (the M-step).

  Therefore, taking an initial value $\Theta^0$ as the starting point, the following steps can be executed iteratively until convergence:

  • Based on $\Theta^t$, infer the expectation of the latent variables $\mathbf{Z}$, denoted $\mathbf{Z}^t$.
  • Based on the observed variables $\mathbf{X}$ and $\mathbf{Z}^t$, perform maximum likelihood estimation of the parameters $\Theta$, and denote the result $\Theta^{t+1}$.

  If, instead of taking the expectation of $\mathbf{Z}$, we compute the probability distribution $P(\mathbf{Z}|\mathbf{X}, \Theta^t)$ of the latent variables $\mathbf{Z}$ based on $\Theta^t$, the two steps of the EM algorithm become:

  • E-step: with the current parameters $\Theta^t$, infer the latent-variable distribution $P(\mathbf{Z}|\mathbf{X}, \Theta^t)$ and compute the expectation of the log-likelihood $LL(\Theta|\mathbf{X}, \mathbf{Z})$ with respect to $\mathbf{Z}$:

$$Q(\Theta|\Theta^t) = \mathbb{E}_{\mathbf{Z}|\mathbf{X},\Theta^t}\, LL(\Theta|\mathbf{X}, \mathbf{Z})$$

  • M-step: find the parameters that maximize the expected likelihood, i.e.

$$\Theta^{t+1} = \mathop{\arg\max}_{\Theta} Q(\Theta|\Theta^t)$$
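  As a hedged illustration of these two steps (not an example from the book), the sketch below runs EM on made-up one-dimensional data assumed to come from a mixture of two Gaussians: the latent variable is the unknown component each point belongs to, the E-step computes its posterior (the responsibilities), and the M-step re-estimates the mixture parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: two Gaussian clusters with unobserved component assignments (latent Z)
data = np.concatenate([rng.normal(0.0, 1.0, 150), rng.normal(5.0, 1.5, 100)])

# Initial parameters Theta^0: mixing weights, means, variances
w = np.array([0.5, 0.5])
mu = np.array([data.min(), data.max()])
var = np.array([1.0, 1.0])

def gauss(x, m, v):
    return np.exp(-(x - m) ** 2 / (2 * v)) / np.sqrt(2 * np.pi * v)

for t in range(100):
    # E-step: responsibilities gamma[k, n] = P(z_n = k | x_n, Theta^t)
    dens = np.vstack([w[k] * gauss(data, mu[k], var[k]) for k in range(2)])
    gamma = dens / dens.sum(axis=0)

    # M-step: maximize the expected complete-data log-likelihood Q(Theta | Theta^t)
    Nk = gamma.sum(axis=1)
    w = Nk / len(data)
    mu = (gamma @ data) / Nk
    var = (gamma * (data - mu[:, None]) ** 2).sum(axis=1) / Nk

print("weights:", w.round(3), "means:", mu.round(3), "variances:", var.round(3))
```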


Source: blog.csdn.net/qq_44528283/article/details/130658848