Systematic Learning of NLP (XXIII) - LDA Topic Model

Reposted from: https://blog.csdn.net/kisslotus/article/details/78427585

1. Introduction
In machine learning, LDA commonly refers to one of two models: Linear Discriminant Analysis or Latent Dirichlet Allocation. In this article, LDA refers only to Latent Dirichlet Allocation. LDA occupies a very important position among topic models and is widely used for text classification.

LDA was proposed by Blei, David M., Ng, Andrew Y., and Jordan in 2003 to infer the topic distribution of documents. It gives the topic of each document in a corpus in the form of a probability distribution, so that after extracting the topic distributions of the documents, we can cluster topics or classify texts according to those distributions.

2. Prior knowledge
The LDA model involves a lot of mathematical knowledge, which is probably the main reason LDA seems obscure. This section introduces the mathematics involved in LDA. Readers with a solid mathematical background can skip this section.

The knowledge involved in LDA includes: the binomial distribution, the Gamma function, the Beta distribution, the multinomial distribution, the Dirichlet distribution, Markov chains, MCMC, Gibbs sampling, and the EM algorithm.

2.1 Bag-of-words model
LDA uses the bag-of-words model. In the bag-of-words model, a document is reduced to the words it contains: we only consider whether each word appears, regardless of the order in which the words appear. Under this model, "I love you" and "you love me" are equivalent. The opposite of the bag-of-words model is the n-gram model, which does take word order into account.
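As a quick illustration (a minimal Python sketch added here, not part of the original post; the example strings and the use of collections.Counter are my own choices), two word sequences that differ only in order map to the same bag of words:

```python
from collections import Counter

# Two "documents" that contain the same words in a different order.
doc_a = "I love you".lower().split()
doc_b = "you love I".lower().split()

# A bag-of-words representation keeps only word counts, not word order,
# so both documents map to the same bag.
bag_a = Counter(doc_a)
bag_b = Counter(doc_b)
print(bag_a == bag_b)  # True
```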

2.2 Binomial distribution

The binomial distribution is the n-fold repetition of the Bernoulli distribution: if X ~ B(n, p), its probability mass function is:

P(K = k) = \binom{n}{k} p^k {(1-p)}^{n-k}
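As a sanity check (a minimal sketch of my own, assuming SciPy is available; the values of n, p, and k are arbitrary), the closed-form pmf can be compared against SciPy's binomial distribution:

```python
from math import comb
from scipy.stats import binom

n, p, k = 10, 0.3, 4
# P(K = k) computed from the closed-form pmf above ...
manual = comb(n, k) * p**k * (1 - p)**(n - k)
# ... and from scipy's binomial distribution; the two values should agree.
print(manual, binom.pmf(k, n, p))
```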

2.3 Multinomial distribution

The multinomial distribution is the extension of the binomial distribution to the multi-dimensional case. In the multinomial setting, the outcome of a single trial is no longer 0 or 1 but one of several discrete values (1, 2, 3, ..., k). Its probability mass function is:

P(x_1, x_2, ..., x_k; n, p_1, p_2, ..., p_k) = \frac{n!}{x_1!...x_k!}{p_1}^{x_1}...{p_k}^{x_k}
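For instance (an illustrative sketch of my own; the face probabilities and counts are made up), SciPy's multinomial distribution evaluates this pmf directly:

```python
from scipy.stats import multinomial

# Roll a 3-sided "dice" n = 6 times with face probabilities p.
n = 6
p = [0.5, 0.3, 0.2]
x = [3, 2, 1]          # observed counts per face, summing to n
print(multinomial.pmf(x, n=n, p=p))
```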

2.4 Gamma function

The Gamma function is defined as:

\Gamma(x) = \int_0^\infty t^{x-1}e^{-t}dt

Integration by parts yields the following property of the Gamma function: \Gamma(x + 1) = x\Gamma(x)

The Gamma function can be seen as an extension of the factorial to the real numbers, and it satisfies: \Gamma(n) = (n-1)!
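A quick numerical check of both properties (my own sketch, assuming SciPy is available):

```python
from math import factorial
from scipy.special import gamma

# Gamma(n) = (n-1)! for positive integers ...
for n in range(1, 6):
    print(n, gamma(n), factorial(n - 1))

# ... and Gamma(x+1) = x * Gamma(x) also holds for non-integer x.
print(gamma(4.5), 3.5 * gamma(3.5))
```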

2.5 Beta distribution

The Beta distribution is defined as follows: for parameters α > 0 and β > 0, a random variable x taking values in the interval [0, 1] has probability density function:

f(x; \alpha, \beta) = \frac{1}{B(\alpha, \beta)} x^{\alpha - 1} {(1-x)}^{\beta-1}

where \frac{1}{B(\alpha, \beta)} = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}
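As a check (an illustrative sketch of my own; the parameter values are arbitrary), the density written with Gamma functions matches SciPy's Beta pdf:

```python
from scipy.special import gamma
from scipy.stats import beta

a, b, x = 2.0, 5.0, 0.3
# Normalising constant 1/B(a, b) written with Gamma functions, as above.
coef = gamma(a + b) / (gamma(a) * gamma(b))
manual = coef * x**(a - 1) * (1 - x)**(b - 1)
print(manual, beta.pdf(x, a, b))  # the two values should match
```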

2.6 Conjugate prior distribution

In Bayesian probability theory, if the posterior distribution P(θ|x) and the prior distribution p(θ) belong to the same family of distributions, then the prior and posterior are called conjugate distributions, and the prior is called a conjugate prior of the likelihood function.

The Beta distribution is the conjugate prior of the binomial distribution, and the Dirichlet distribution is the conjugate prior of the multinomial distribution.

Taking the Beta and binomial distributions as an example, conjugacy means that when the data follow a binomial distribution, the prior and posterior distributions of the parameter can both be kept in the form of a Beta distribution. The benefit of this form is that we can give the parameters of the prior distribution a very clear physical meaning, this meaning carries over to the posterior distribution, and the transition from prior to posterior as new data arrive also has a natural physical interpretation.
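To make this conjugacy concrete (a worked step added here, not part of the original text): suppose the prior on p is Beta(α, β) and we observe k successes in n binomial trials. Then Bayes' rule gives

P(p \mid k) \propto p^{k}{(1-p)}^{n-k} \cdot p^{\alpha-1}{(1-p)}^{\beta-1} \propto p^{\alpha+k-1}{(1-p)}^{\beta+n-k-1}

which is again a Beta distribution, namely Beta(α + k, β + n - k).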

2.7 Dirichlet distribution

The probability density function of the Dirichlet distribution is: f(x_1, x_2, ..., x_k; \alpha_1, \alpha_2, ..., \alpha_k) = \frac{1}{B(\alpha)}\prod_{i=1}^{k}{x_i}^{\alpha_i-1}

where B(\alpha) = \frac{\prod_{i=1}^{k}\Gamma(\alpha_i)}{\Gamma(\sum_{i=1}^{k}\alpha_i)}, \quad \sum_{i=1}^{k}x_i = 1

From the definitions of the Beta, binomial, Dirichlet, and multinomial distributions, we can verify that the Beta distribution is the conjugate prior of the binomial distribution and that the Dirichlet distribution is the conjugate prior of the multinomial distribution.

For a random variable following a Beta distribution, the mean can be estimated as α / (α + β). The Dirichlet distribution has an analogous result: E(p) = \biggl( \frac{\alpha_1}{\sum_{i=1}^K \alpha_i}, \frac{\alpha_2}{\sum_{i=1}^K \alpha_i}, \cdots, \frac{\alpha_K}{\sum_{i=1}^K \alpha_i} \biggr)
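A small numerical check of the Dirichlet mean (my own sketch; the α values are arbitrary):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])
# The i-th component of the mean is alpha_i / sum(alpha), matching the formula above.
print(dirichlet.mean(alpha))    # [0.2 0.3 0.5]
print(alpha / alpha.sum())      # same values
```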

These two results are very important and will be used later in the mathematical derivation of LDA.

2.9 MCMC and Gibbs sampling
In real-world applications, it is often difficult to obtain the exact probability distribution, so approximate inference methods are commonly used. Approximate inference methods fall roughly into two categories: the first is sampling, which completes the approximation through randomization; the second is deterministic approximation, typified by variational inference. (These are solution methods; the detailed content is not reproduced here.)
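Although the details are omitted above, a toy illustration of Gibbs sampling may help (my own sketch, not from the source; the bivariate normal target and the correlation value are arbitrary choices). The sampler alternately draws each coordinate from its conditional distribution given the other:

```python
import numpy as np

rng = np.random.default_rng(0)
rho, n_samples = 0.8, 5000
x, y = 0.0, 0.0
samples = []
# Gibbs sampling for a standard bivariate normal with correlation rho:
# alternately draw each variable from its conditional distribution
# given the current value of the other variable.
for _ in range(n_samples):
    x = rng.normal(rho * y, np.sqrt(1 - rho**2))
    y = rng.normal(rho * x, np.sqrt(1 - rho**2))
    samples.append((x, y))

samples = np.array(samples)
print(np.corrcoef(samples[:, 0], samples[:, 1])[0, 1])  # should be close to rho
```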

3. Text modeling
A document can be seen as an ordered sequence of words d = (ω1, ω2, ⋯, ωN). From a statistical point of view, generating a document can be viewed as the result of God throwing dice: each throw of the dice produces one word, and throwing the dice N times produces a document of N words. In statistical text modeling, we want to guess how God plays this game, which involves two core questions:

1. What dice does God have?
2. How does God throw the dice?
The first question concerns the parameters of the model: the probability of each face of the dice corresponds to a model parameter. The second question concerns the rules of the game: God may have several different kinds of dice, and God throws them according to certain rules to produce the word sequence.

3.1 Unigram Model
In the Unigram Model, we use the bag-of-words model and assume that documents are independent of one another and that the words within a document are independent of one another. Suppose our dictionary contains V words ν1, ν2, ⋯, νV; then the simplest Unigram Model assumes that God generates text according to the following rules:

1. God has only one dice, which has V faces, one face per word, with different probabilities for different faces;
2. Each throw of the dice produces the word corresponding to the face that comes up; if a document has N words, the dice is simply thrown N times independently to generate those N words.

(The detailed derivation is omitted here.)
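A minimal simulation of this unigram "single dice" process (my own sketch; the vocabulary and face probabilities are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["apple", "banana", "cherry", "date"]   # the V faces of the dice
p = np.array([0.4, 0.3, 0.2, 0.1])              # fixed face probabilities
# One document of N words: throw the same dice N times independently.
doc = rng.choice(vocab, size=8, p=p)
print(list(doc))
```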

3.1.2 The Bayesian perspective
Bayesian statisticians take a different view of the above model: they would sharply criticize the assumption that God has only one fixed dice as unreasonable. In the Bayesian view, all parameters are random variables, so the dice probabilities \vec{p} in the above model are not a fixed constant but themselves a random variable. Therefore, from the Bayesian point of view, God plays the game according to the following procedure:

1. There is a jar containing infinitely many dice of all kinds, each with V faces;
2. God draws one dice from the jar and keeps throwing that same dice until all the words of the corpus have been generated.
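A minimal simulation of this Bayesian version (my own sketch; the hyperparameter and sizes are arbitrary): first draw the dice itself from a Dirichlet prior, then throw it repeatedly:

```python
import numpy as np

rng = np.random.default_rng(1)
V, N = 4, 8
alpha = np.ones(V)                 # symmetric Dirichlet hyperparameter
# "Pick one dice from the jar": draw the face probabilities from a Dirichlet prior.
p = rng.dirichlet(alpha)
# Then throw that single dice N times to generate the corpus words.
words = rng.choice(V, size=N, p=p)
print(p, words)
```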

LDA is a Bayesian model, and a Bayesian model inevitably involves three elements: the "prior distribution", the "data (likelihood)", and the "posterior distribution". In Bayesian terms:

prior distribution + data (likelihood) = posterior distribution

This is in fact easy to understand, because it matches the way people think. For example, consider your perception of good and bad people. Your prior is: 100 good people and 100 bad people, so you believe good and bad people are half and half. Now you are helped by 2 good people (data) and cheated by 1 bad person (data), so you obtain a new posterior distribution: 102 good and 101 bad. Your posterior now says there are slightly more good people than bad people. This posterior then becomes your new prior; after you are helped by 1 more good person (data) and cheated by 3 more bad people (data), you update your posterior again: 103 good and 104 bad. And so the distribution keeps being updated.

In fact, the key point has already been stated: we represent the data (likelihood) with a multinomial distribution and the prior with a Dirichlet distribution; then, using Bayes' formula (approximated, for example, by MCMC sampling), the posterior distribution is computed, and because the prior and posterior are conjugate it is again a Dirichlet distribution. The model's distributions are updated iteratively in this way.

LDA topic model

With all this groundwork in place, we can finally begin the LDA topic model.

Our problem is this: we have M documents, and the d-th document contains Nd words.

Our goal is to find the topic distribution of each document and the word distribution of each topic. In the LDA model, we need to assume a number of topics K, so that all the distributions are based on these K topics. What, then, does the LDA model look like? (The model diagram from the original post is omitted here.)

LDA assumes that the prior distribution of a document's topics is a Dirichlet distribution, i.e., for any document d, its topic distribution θd is: θd ~ Dirichlet(\vec{\alpha}), where the hyperparameter α of the distribution is a K-dimensional vector.

LDA also assumes that the prior distribution of a topic's words is a Dirichlet distribution, i.e., for any topic k, its word distribution βk is: βk ~ Dirichlet(\vec{\eta}), where the hyperparameter η of the distribution is a V-dimensional vector, and V is the number of words in the vocabulary.

For the n-th word of any document d in the data, we can obtain its topic number zdn from the topic distribution θd: zdn ~ multi(θd)

Given this topic number, we obtain the word wdn from the word distribution of that topic: wdn ~ multi(βzdn)
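Putting the four steps together, here is a minimal sketch of the LDA generative process (my own illustration, not from the original; M, K, V, N_d and the hyperparameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
M, K, V, N_d = 3, 2, 6, 10            # documents, topics, vocabulary size, words per doc
alpha = np.full(K, 0.5)               # document-topic hyperparameter
eta = np.full(V, 0.1)                 # topic-word hyperparameter

beta = rng.dirichlet(eta, size=K)     # beta_k ~ Dirichlet(eta): one word distribution per topic
docs = []
for d in range(M):
    theta_d = rng.dirichlet(alpha)    # theta_d ~ Dirichlet(alpha): topic distribution of doc d
    words = []
    for n in range(N_d):
        z_dn = rng.choice(K, p=theta_d)        # topic of the n-th word
        w_dn = rng.choice(V, p=beta[z_dn])     # word drawn from that topic's word distribution
        words.append(w_dn)
    docs.append(words)
print(docs)
```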

The main task in understanding the LDA topic model is understanding the generative model above. In this model, we have M Dirichlet distributions over document topics together with the corresponding M sets of multinomial topic-assignment data, so that (\vec{\alpha} \to θd \to \vec{z}_d) forms a Dirichlet-multinomial conjugate pair, and the Bayesian inference approach described earlier can be used to obtain the posterior Dirichlet distribution of the document's topics.

If, in document d, the number of words assigned to the k-th topic is n_d^{(k)}, then the corresponding multinomial count vector can be written as \vec{n}_d = (n_d^{(1)}, n_d^{(2)}, ..., n_d^{(K)})

Using the Dirichlet-multinomial conjugacy, the posterior distribution of θd is: Dirichlet(θd | \vec{\alpha} + \vec{n}_d)

By the same token, for the topic-word distributions, we have K Dirichlet distributions over topic words together with the corresponding K sets of multinomial word data, so that (\vec{\eta} \to βk \to \vec{w}_{(k)}) forms a Dirichlet-multinomial conjugate pair, and the same Bayesian inference approach can be used to obtain the posterior Dirichlet distribution of the topic's words.

If, in the k-th topic, the number of occurrences of the v-th word is n_k^{(v)}, then the corresponding multinomial count vector can be written as \vec{n}_k = (n_k^{(1)}, n_k^{(2)}, ..., n_k^{(V)})

Using the Dirichlet-multinomial conjugacy, the posterior distribution of βk is: Dirichlet(βk | \vec{\eta} + \vec{n}_k)

Since the generation of a word by a topic does not depend on any specific document, the document-topic distributions and the topic-word distributions are independent. Understanding the M + K Dirichlet-multinomial conjugate pairs above is understanding the basic principle of LDA.

The question now is: given this LDA model, how do we solve for the topic distribution of each document and the word distribution of each topic?

There are two general methods: the first is based on the Gibbs sampling algorithm, and the second is based on the variational-inference EM algorithm.
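For example (a minimal sketch of my own, assuming scikit-learn is available; the toy corpus, K, and prior values are arbitrary), scikit-learn's LatentDirichletAllocation implements the variational-inference approach:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors sold shares on the market",
]
# Bag-of-words counts, as required by LDA.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

K = 2
lda = LatentDirichletAllocation(
    n_components=K,           # number of topics K
    doc_topic_prior=0.5,      # alpha: Dirichlet prior on document-topic distributions
    topic_word_prior=0.1,     # eta: Dirichlet prior on topic-word distributions
    learning_method="batch",  # variational EM over the whole corpus
    random_state=0,
)
doc_topic = lda.fit_transform(X)   # per-document topic distributions
print(doc_topic.round(2))
print(lda.components_.shape)       # (K, V): unnormalised topic-word weights
```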
