EM Algorithm Notes: the pLSA Model

The pLSA model (probabilistic Latent Semantic Analysis) is a topic model grounded in probability and statistics. It forms a simple Bayesian network, and its parameters can be learned with the EM algorithm. Probabilistic latent semantic analysis is used in information retrieval, filtering, natural language processing, machine learning from text, and related fields.

D denotes documents, Z denotes topics (the latent classes), and W denotes words; suppose there are N documents, M words, and K topics.
 P(d_i) is the probability that document d_i occurs;
 P(z_k | d_i) is the probability that topic z_k appears in document d_i;
 P(w_j | z_k) is the probability that word w_j appears given topic z_k.
Each topic follows a multinomial distribution over all the words, and each document follows a multinomial distribution over all the topics.
The generative process for the whole corpus is as follows (a minimal sampling sketch is given after the list):
 select a document d_i with probability P(d_i);
 select a topic z_k with probability P(z_k | d_i);
 generate a word w_j with probability P(w_j | z_k).
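To make the generative story concrete, here is a minimal sampling sketch in NumPy; the matrices p_d, p_z_d, and p_w_z below are toy parameters invented for illustration, not values from the notes:

```python
import numpy as np

# Toy pLSA parameters (hypothetical values, chosen only for this sketch).
rng = np.random.default_rng(0)
p_d = np.array([0.5, 0.5])              # P(d_i): two documents
p_z_d = np.array([[0.9, 0.1],           # P(z_k | d_i): rows are documents,
                  [0.2, 0.8]])          # columns are the K = 2 topics
p_w_z = np.array([[0.7, 0.2, 0.1],      # P(w_j | z_k): rows are topics,
                  [0.1, 0.3, 0.6]])     # columns are the M = 3 words

def sample_pair():
    """Draw one (document, word) pair by following the generative process."""
    i = rng.choice(len(p_d), p=p_d)             # 1) pick d_i with P(d_i)
    k = rng.choice(p_z_d.shape[1], p=p_z_d[i])  # 2) pick z_k with P(z_k | d_i)
    j = rng.choice(p_w_z.shape[1], p=p_w_z[k])  # 3) pick w_j with P(w_j | z_k)
    return i, j                                 # the topic z_k stays hidden

print([sample_pair() for _ in range(5)])
```

Only the (d_i, w_j) pairs are returned, which mirrors the fact that the topic assignment is never observed.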
The observed data are the (d_i, w_j) pairs; the topic z_k is a latent variable. The joint distribution of a pair (d_i, w_j) is

P(d_i, w_j) = P(d_i)\, P(w_j \mid d_i), \qquad P(w_j \mid d_i) = \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i).
The parameters P(w_j | z_k) and P(z_k | d_i) correspond to several groups of multinomial distributions; computing the topic distribution of every document is the goal of the model.
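Restating this in matrix form (my own notation, not from the original notes): collect the parameters into \Theta with \Theta_{ik} = P(z_k \mid d_i) and \Phi with \Phi_{kj} = P(w_j \mid z_k); then the sum over topics is exactly a matrix product,

\bigl[ P(w_j \mid d_i) \bigr]_{N \times M} = \Theta_{N \times K}\, \Phi_{K \times M},

so the topic distribution of document d_i is simply the i-th row of \Theta.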

Maximum likelihood estimation: let n(d_i, w_j) be the number of times word w_j appears in document d_i. The log-likelihood of the whole corpus is

\log L = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j)
       = \sum_{i=1}^{N} n(d_i) \log P(d_i) + \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i),

where n(d_i) = \sum_j n(d_i, w_j) is the length of document d_i.
Analysis of the objective function:

The observed data are the (d_i, w_j) pairs, and the topic z_k is a latent variable. The first term above involves only P(d_i), not the parameters we want to learn, so it can be dropped.

The objective function:

\ell = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \sum_{k=1}^{K} P(w_j \mid z_k)\, P(z_k \mid d_i).

The unknown variables (the parameters) are P(w_j | z_k) and P(z_k | d_i).
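As a sanity check on this objective, here is a minimal sketch of evaluating \ell with NumPy. The array layout is my own convention for these notes: n_dw is the D×W count matrix, p_w_z holds P(w_j | z_k) as K×W, and p_z_d holds P(z_k | d_i) as D×K:

```python
import numpy as np

def log_likelihood(n_dw, p_w_z, p_z_d):
    """ell = sum_i sum_j n(d_i, w_j) * log sum_k P(w_j|z_k) P(z_k|d_i)."""
    p_w_d = p_z_d @ p_w_z          # (D, W): P(w_j | d_i), the inner sum over k
    mask = n_dw > 0                # terms with zero count contribute nothing
    return float(np.sum(n_dw[mask] * np.log(p_w_d[mask])))
```

EM is guaranteed not to decrease this value from one iteration to the next, so printing it each round is a cheap correctness check.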

 1) Use successive approximation: assume P(z_k | d_i) and P(w_j | z_k) are known and compute the posterior probability of the latent variable z_k;

 2) With (d_i, w_j, z_k) now treated as known, maximize the expectation of the likelihood function with respect to the parameters P(z_k | d_i) and P(w_j | z_k), obtain the optimal solutions P(z_k | d_i) and P(w_j | z_k), and substitute them back into step 1). Iterating this loop is precisely the EM algorithm.

E-step: find the posterior probability of the latent topic z_k.

Assume P(z_k | d_i) and P(w_j | z_k) are known, and compute the posterior probability of the latent variable z_k:

P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}.
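Under the same array conventions as before, a minimal vectorized sketch of this E-step (a direct transcription of the formula, not code from the original post):

```python
import numpy as np

def e_step(p_w_z, p_z_d):
    """Posterior P(z_k | d_i, w_j) for all i, j, k, as a (D, W, K) array."""
    # numerator[i, j, k] = P(w_j | z_k) * P(z_k | d_i)
    numerator = p_z_d[:, None, :] * p_w_z.T[None, :, :]
    # denominator: normalize over the topic axis k
    return numerator / numerator.sum(axis=2, keepdims=True)
```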

With (d_i, w_j, z_k) treated as known, maximize the expectation of the likelihood function with respect to the parameters P(z_k | d_i) and P(w_j | z_k), obtain the optimal solutions P(z_k | d_i) and P(w_j | z_k), substitute them back into the E-step, and iterate.

The expectation of the likelihood function with respect to the parameters P(z_k | d_i) and P(w_j | z_k) is

E = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k \mid d_i, w_j) \log\bigl[ P(w_j \mid z_k)\, P(z_k \mid d_i) \bigr].

Completing the construction of the objective function: E is a function of the parameters P(w_j | z_k) and P(z_k | d_i), and since each group of probabilities must sum to one, we add the constraint conditions

\sum_{j=1}^{M} P(w_j \mid z_k) = 1, \qquad \sum_{k=1}^{K} P(z_k \mid d_i) = 1.

Clearly, this is an extremum problem with only equality constraints, so it can be solved with the method of Lagrange multipliers.

Solving the objective function: introduce Lagrange multipliers \tau_k and \rho_i for the two groups of constraints and form the Lagrangian

H = E + \sum_{k=1}^{K} \tau_k \Bigl( 1 - \sum_{j=1}^{M} P(w_j \mid z_k) \Bigr) + \sum_{i=1}^{N} \rho_i \Bigl( 1 - \sum_{k=1}^{K} P(z_k \mid d_i) \Bigr).

Finding the stationary points, i.e. setting the partial derivatives with respect to the parameters to zero:

\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j) - \tau_k\, P(w_j \mid z_k) = 0,
\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j) - \rho_i\, P(z_k \mid d_i) = 0.

Analysis of the first equation: summing it over j and using \sum_j P(w_j \mid z_k) = 1 gives \tau_k = \sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m)\, P(z_k \mid d_i, w_m).

Similarly, analysis of the second equation: summing it over k and using \sum_k P(z_k \mid d_i) = 1 gives \rho_i = \sum_{j=1}^{M} n(d_i, w_j) = n(d_i).

The solutions at the extremum (the M-step):

P(w_j \mid z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{\sum_{m=1}^{M} \sum_{i=1}^{N} n(d_i, w_m)\, P(z_k \mid d_i, w_m)}, \qquad
P(z_k \mid d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j)\, P(z_k \mid d_i, w_j)}{n(d_i)}.
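A minimal sketch of the M-step implementing exactly these two update formulas, again under the array conventions used above (the (D, W, K) posterior comes from the E-step sketch):

```python
import numpy as np

def m_step(n_dw, posterior):
    """Re-estimate P(w_j | z_k) and P(z_k | d_i) from weighted counts."""
    # weighted[i, j, k] = n(d_i, w_j) * P(z_k | d_i, w_j)
    weighted = n_dw[:, :, None] * posterior
    # P(w_j | z_k): sum over documents i, then normalize over words j
    p_w_z = weighted.sum(axis=0).T                 # (K, W)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    # P(z_k | d_i): sum over words j, then divide by n(d_i)
    p_z_d = weighted.sum(axis=1)                   # (D, K)
    p_z_d /= n_dw.sum(axis=1, keepdims=True)
    return p_w_z, p_z_d
```

Note that the two normalizers are exactly the Lagrange multipliers \tau_k and \rho_i derived above.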
E-step (the posterior probability of z_k):

P(z_k \mid d_i, w_j) = \frac{P(w_j \mid z_k)\, P(z_k \mid d_i)}{\sum_{l=1}^{K} P(w_j \mid z_l)\, P(z_l \mid d_i)}.
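Putting the two steps together gives the whole algorithm. The sketch below reuses the e_step and m_step functions from above, uses random initialization, and runs a fixed number of iterations in place of a real convergence test; the count matrix is a toy example invented for illustration:

```python
import numpy as np

def plsa(n_dw, K, n_iter=100, seed=0):
    """Fit pLSA by alternating the E-step and M-step."""
    rng = np.random.default_rng(seed)
    D, W = n_dw.shape
    # random starting points, rows normalized into valid distributions
    p_w_z = rng.random((K, W)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((D, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        posterior = e_step(p_w_z, p_z_d)        # E-step: P(z_k | d_i, w_j)
        p_w_z, p_z_d = m_step(n_dw, posterior)  # M-step: re-estimate parameters
    return p_w_z, p_z_d

# toy corpus: 4 documents, 6 words, 2 topics
counts = np.array([[4, 3, 2, 0, 0, 1],
                   [5, 2, 3, 1, 0, 0],
                   [0, 1, 0, 4, 3, 5],
                   [1, 0, 0, 3, 4, 4]])
p_w_z, p_z_d = plsa(counts, K=2)
print(np.round(p_z_d, 2))  # each row is the learned topic mixture of one document
```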

pLSA summary:

 1) pLSA is applied in information retrieval, filtering, natural language processing, and related fields; it models both the word distributions and the topic distributions, and uses the EM algorithm to learn the parameters.
 2) Although the derivation is somewhat involved, the final formulas are concise and clear and match intuition well, so they repay careful study; moreover, the derivation uses the EM algorithm, which makes it good material for learning EM itself.

Origin: www.cnblogs.com/yang901112/p/11621568.html