Machine Learning - Bayesian networks - Notes

Description of Bayesian networks:

  1) A Bayesian network, also known as a belief network or a directed acyclic graphical model, is a probabilistic graphical model that models causal reasoning under uncertainty in the way humans reason; its network topology is a directed acyclic graph (DAG). It describes a set of n random variables {X1, X2, ..., Xn} together with their conditional probability distributions (Conditional Probability Distributions, CPDs).

  2) In general, the nodes of the directed acyclic graph in a Bayesian network represent random variables, which may be observable variables, hidden variables, or unknown parameters. An arrow connecting two nodes indicates that the two random variables are causally related (or conditionally dependent). When two nodes are joined by a single arrow, the node at the tail is the "cause" (parent) and the node at the head is the "effect" (child), and the pair is assigned a conditional probability value.

  3) Given its parents (direct predecessors), each node is conditionally independent of its non-descendants.

A simple Bayesian network:

  p(a,b,c) = p(c|a,b)p(b|a)p(a)

  Corresponding directed acyclic graph (figure missing; edges: a → b, a → c, b → c):


Fully connected Bayesian network:

  Every pair of nodes is connected by an edge, so by the chain rule:

  p(x1,...,xk)=p(xk|x1,...,xk-1)...p(x2|x1)p(x1)

  

A "normal" Bayesian network:

  

  1) Some edges are missing (compared with the fully connected network)

  2) Intuitively: x1 and x2 are independent; x6 and x7 are conditionally independent given x4

  3) The joint distribution of x1, x2, ..., x7 is:

  p(x1)p(x2)p(x3)p(x4|x1,x2,x3)p(x5|x1,x3)p(x6|x4)p(x7|x4,x5)
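
  To make the factorization concrete, here is a minimal Python sketch that evaluates the joint probability above as the product of its local factors. All CPT values below are made-up assumptions for binary variables, not from the original notes.

    # Minimal sketch: evaluate the factorized joint of the 7-node network above.
    # All numbers below are made-up CPT values for binary variables (0/1).
    p_x1 = {1: 0.6, 0: 0.4}
    p_x2 = {1: 0.3, 0: 0.7}
    p_x3 = {1: 0.5, 0: 0.5}

    def p_x4(x4, x1, x2, x3):          # P(x4 | x1, x2, x3)
        p1 = 0.1 + 0.25 * (x1 + x2 + x3)
        return p1 if x4 == 1 else 1 - p1

    def p_x5(x5, x1, x3):              # P(x5 | x1, x3)
        p1 = 0.2 + 0.3 * (x1 + x3)
        return p1 if x5 == 1 else 1 - p1

    def p_x6(x6, x4):                  # P(x6 | x4)
        p1 = 0.9 if x4 == 1 else 0.2
        return p1 if x6 == 1 else 1 - p1

    def p_x7(x7, x4, x5):              # P(x7 | x4, x5)
        p1 = 0.15 + 0.4 * (x4 + x5)
        return p1 if x7 == 1 else 1 - p1

    def joint(x1, x2, x3, x4, x5, x6, x7):
        # p(x1)p(x2)p(x3)p(x4|x1,x2,x3)p(x5|x1,x3)p(x6|x4)p(x7|x4,x5)
        return (p_x1[x1] * p_x2[x2] * p_x3[x3]
                * p_x4(x4, x1, x2, x3) * p_x5(x5, x1, x3)
                * p_x6(x6, x4) * p_x7(x7, x4, x5))

    print(joint(1, 0, 1, 1, 1, 1, 0))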

Three basic Bayesian network structures:

  First, the concept of D-separation: a graphical method for determining conditional independence. That is, given a directed acyclic graph (DAG), the D-separation criterion lets us quickly decide whether two nodes are conditionally independent.

Form 1: head-to-head

  The first Bayesian network structure:

  (figure missing: a → c ← b)

  P(a,b,c) = P(a)*P(b)*P(c|a,b)

  

  Summing over c (since Σc P(c|a,b) = 1):

  → P(a,b) = P(a)*P(b)

  That is, when c is unknown (not observed), the path between a and b is blocked, so a and b are independent.
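
  As a numeric check of this head-to-head behavior, here is a small sketch under an assumed toy model (a and b are fair coins and c is deterministically a OR b; these values are not from the original notes): summing out c gives P(a,b) = P(a)*P(b), while conditioning on c = 1 makes a and b dependent (the usual "explaining away" effect).

    from itertools import product

    # Toy head-to-head network a -> c <- b (all numbers are assumptions):
    # a and b are independent fair coins, c is deterministically a OR b.
    def p_abc(a, b, c):
        p_c_given_ab = 1.0 if c == (a | b) else 0.0
        return 0.5 * 0.5 * p_c_given_ab            # P(a)*P(b)*P(c|a,b)

    # Summing out c: P(a,b) = P(a)*P(b) = 0.25 for every pair -> a, b independent.
    for a, b in product([0, 1], repeat=2):
        print("P(a=%d,b=%d) =" % (a, b), sum(p_abc(a, b, c) for c in [0, 1]))

    # Conditioning on c = 1: P(a,b|c) no longer factorizes, so a and b become
    # dependent once c is observed ("explaining away").
    pc1 = sum(p_abc(a, b, 1) for a, b in product([0, 1], repeat=2))
    for a, b in product([0, 1], repeat=2):
        print("P(a=%d,b=%d|c=1) =" % (a, b), p_abc(a, b, 1) / pc1)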

Form 2: tail-to-tail

  (figure missing: a ← c → b)

  From this graphical model we have: P(a,b,c) = P(c)*P(a|c)*P(b|c)
  so: P(a,b,c)/P(c) = P(a|c)*P(b|c)
  since P(a,b|c) = P(a,b,c)/P(c)
  we obtain: P(a,b|c) = P(a|c)*P(b|c)
  that is: given c, the path between a and b is blocked,
      so a and b are conditionally independent.

Form 3: head-to-tail

  (figure missing: a → c → b)

  P(a,b,c)=P(a)*P(c|a)*P(b|c)

  P(a,b|c) = P(a,b,c)/P(c) = P(a)*P(c|a)*P(b|c)/P(c) = P(a,c)*P(b|c)/P(c) = P(a|c)*P(b|c)

  That is: given c, the path between a and b is blocked, so a and b are conditionally independent.

 

Naive Bayes:

1) Naive Bayes assumptions

  The probability that a feature occurs is independent of the other features (feature independence); more precisely, the features are conditionally independent given the class.

  Each feature is equally important (feature balance).

2) Deriving Naive Bayes

  Naive Bayes (NB) is a supervised learning algorithm that applies Bayes' theorem under the simple assumption that the features are mutually independent given the class.

  For a given feature vector x1, x2, ..., xn,

  the probability of category y can be obtained from Bayes' formula:

  P(y|x1,...,xn) = P(y) · P(x1,...,xn|y) / P(x1,...,xn)

  Using the naive independence assumption:

  P(xi|y,x1,...,xi-1,xi+1,...,xn)=P(xi|y)

  The probability of category y then simplifies to:

  P(y|x1,...,xn) = P(y) · ∏i P(xi|y) / P(x1,...,xn)

  For a given sample, P(x1, x2, ..., xn) is a constant, so:

  P(y|x1,...,xn) ∝ P(y) · ∏i P(xi|y)

  Therefore, the predicted category is:

  ŷ = argmax_y  P(y) · ∏i P(xi|y)
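
  The decision rule above can be written directly in a few lines. The following is a minimal sketch that assumes the prior P(y) and the per-feature likelihoods P(xi|y) have already been estimated; all numbers and the predict helper are made up for illustration.

    import math

    # Made-up estimates for a two-class problem with 3 binary features.
    prior = {"spam": 0.4, "ham": 0.6}              # P(y)
    likelihood = {                                  # P(xi = 1 | y)
        "spam": [0.8, 0.6, 0.1],
        "ham":  [0.2, 0.3, 0.5],
    }

    def predict(x):
        # argmax_y  log P(y) + sum_i log P(xi|y); logs avoid numerical underflow
        best_y, best_score = None, -math.inf
        for y in prior:
            score = math.log(prior[y])
            for xi, p1 in zip(x, likelihood[y]):
                score += math.log(p1 if xi == 1 else 1 - p1)
            if score > best_score:
                best_y, best_score = y, score
        return best_y

    print(predict([1, 1, 0]))    # -> "spam" with these made-up numbers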

Gaussian Naive Bayes:

  The class of a sample is obtained by MAP (Maximum A Posteriori) estimation: P(y) is estimated from the samples and a reasonable model is established for P(xi|y).
  
  The features are assumed to follow a Gaussian distribution, namely:

  P(xi|y) = 1/sqrt(2π·σyi²) · exp( -(xi - μyi)² / (2σyi²) )

  The parameters μyi and σyi can be estimated using MLE (maximum likelihood estimation).
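
  As a sketch of this MLE step on assumed toy data (not from the original notes), the class-conditional mean and variance of each feature are estimated from the training samples of that class and plugged into the Gaussian density; scikit-learn's GaussianNB follows the same idea.

    import numpy as np

    # Made-up training data: 2 continuous features, class labels 0/1.
    X = np.array([[1.0, 2.1], [0.9, 1.9], [3.0, 4.2], [3.2, 3.8]])
    y = np.array([0, 0, 1, 1])

    # MLE of the Gaussian parameters of P(xi|y): per-class mean and variance.
    params = {c: (X[y == c].mean(axis=0), X[y == c].var(axis=0) + 1e-9)
              for c in np.unique(y)}

    def log_gauss(x, mu, var):
        # log N(x; mu, var), evaluated per feature
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

    def predict(x, prior):
        # MAP: argmax_y  log P(y) + sum_i log N(xi; mu_yi, sigma_yi^2)
        scores = {c: np.log(prior[c]) + log_gauss(x, *params[c]).sum()
                  for c in params}
        return max(scores, key=scores.get)

    print(predict(np.array([1.1, 2.0]), prior={0: 0.5, 1: 0.5}))   # -> 0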

Multinomial Naive Bayes:

  The features are assumed to follow a multinomial distribution, so for each category y there is a parameter vector θy = (θy1, θy2, ..., θyn), where n is the number of features and θyi is the probability P(xi|y).

  The parameters are estimated by MLE with smoothing as follows:

  θ̂yi = (Nyi + α) / (Ny + α·n)

  With training set T, the counts are:

  Nyi = Σ(x∈T, class y) xi   (the total count of feature i over the samples of class y),   Ny = Σi Nyi

  Here, α = 1 is called Laplace smoothing and α < 1 is called Lidstone smoothing.
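
  A short scikit-learn sketch of the multinomial model with the smoothing parameter α discussed above; the word-count matrix is made up, and alpha=1.0 corresponds to Laplace smoothing while 0 < alpha < 1 would be Lidstone smoothing.

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB

    # Made-up word-count matrix: rows are documents, columns are vocabulary terms.
    X = np.array([[2, 1, 0, 0],
                  [1, 2, 0, 1],
                  [0, 0, 3, 1],
                  [0, 1, 2, 2]])
    y = np.array(["spam", "spam", "ham", "ham"])

    # alpha=1.0 -> Laplace smoothing; 0 < alpha < 1 -> Lidstone smoothing.
    clf = MultinomialNB(alpha=1.0)
    clf.fit(X, y)

    print(clf.predict([[1, 1, 0, 0]]))   # likely "spam" on this toy data
    print(clf.feature_log_prob_)         # log of the smoothed estimates of theta_yi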

Laplace smoothing

  p(x1|c1) means: the probability that the word x1 appears in the spam class c1.
   Notation: x1 is some word in the email under examination.
   n1: the number of times the word x1 appears across all spam emails. If x1 never appears, then n1 = 0.
   n: the total number of word occurrences in all documents belonging to class c1.

  This gives the formula: p(x1|c1) = n1/n

  Laplace smoothing: p(x1|c1) = (n1+1)/(n+N), where N is the size of the vocabulary. The corrected denominator ensures that the probabilities still sum to 1.

  Similarly, p(x1) is handled with the same smoothing scheme.
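
  A tiny Python sketch of this count-based estimate (the word list and vocabulary are made up): without smoothing an unseen word would get probability 0, while the Laplace-smoothed estimate stays nonzero.

    from collections import Counter

    # Made-up bag of words collected from all spam (c1) documents.
    spam_words = ["buy", "now", "cheap", "buy", "offer"]
    counts = Counter(spam_words)

    n = len(spam_words)    # total word occurrences in class c1
    N = 5                  # assumed vocabulary size: buy, now, cheap, offer, winner

    def p_word_given_spam(word, alpha=1.0):
        # Smoothed estimate (n1 + alpha) / (n + alpha*N); alpha=1 is Laplace smoothing.
        return (counts[word] + alpha) / (n + alpha * N)

    print(p_word_given_spam("buy"))      # (2+1)/(5+5) = 0.3
    print(p_word_given_spam("winner"))   # unseen word: (0+1)/(5+5) = 0.1, not 0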

Thoughts on Naive Bayes

  Laplace smoothing avoids the algorithm failure caused by 0/0.
   What we compare is the relative size of P(c1|x) and P(c2|x); by the formula P(c|x) = P(x|c)*P(c)/P(x), both share the same denominator P(x), so in practice this factor need not be computed.
  Problem: a word that appears many times in a sample and a word that appears only once produce the same 0/1 word vector.
    Fix: switch from 0/1 vectors to count vectors or TF-IDF vectors.
  How to measure the distance between two documents?
    Use the cosine of the angle between their vectors (cosine similarity).

   How to choose suitable hyperparameters?

    Cross-validation (a sketch combining these points follows below).
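
  A minimal scikit-learn sketch tying these points together on made-up toy documents: TF-IDF vectors, cosine similarity between two documents, and cross-validation over the smoothing hyperparameter α.

    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity
    from sklearn.model_selection import GridSearchCV
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    # Made-up toy documents and labels.
    docs = ["cheap offer buy now", "buy cheap now", "meeting at noon",
            "project meeting notes", "cheap buy offer", "noon project meeting"]
    labels = ["spam", "spam", "ham", "ham", "spam", "ham"]

    # TF-IDF vectors instead of 0/1 vectors.
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(docs)

    # Cosine of the angle between two document vectors.
    print(cosine_similarity(X[0], X[1]))

    # Cross-validation to pick the smoothing hyperparameter alpha.
    pipe = make_pipeline(CountVectorizer(), MultinomialNB())
    grid = GridSearchCV(pipe, {"multinomialnb__alpha": [0.01, 0.1, 0.5, 1.0]}, cv=3)
    grid.fit(docs, labels)
    print(grid.best_params_)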
