[Deep Learning] Sequence Generation Models (3): N-Gram Statistical Model

N-Gram Statistical Model

  The N-gram model is a commonly used sequence modeling method, especially when the problem of data sparsity has to be dealt with. The model is based on the Markov assumption: the generation of the current word depends only on its preceding $N-1$ words.
  The core idea of the N-gram model is to use the history of the previous $N-1$ words to estimate the conditional probability of the current word. For an N-gram model, this conditional probability can be written as:

$$p(x_t \mid \mathbf{x}_{1:(t-1)}) \approx p(x_t \mid \mathbf{x}_{(t-N+1):(t-1)})$$

where $\mathbf{x}_{(t-N+1):(t-1)}$ denotes the sequence of $N-1$ words from $x_{t-N+1}$ to $x_{t-1}$.

  • When $N = 1$, it is called a unigram model.
    • The generation of each word is independent of all preceding words.
  • When $N = 2$, it is called a bigram model.
    • The generation of each word depends only on the single word immediately before it.
  • When $N = 3$, it is called a trigram model.
    • The generation of each word depends on its two preceding words.
  • By analogy, as $N$ increases, the amount of history the model takes into account also increases (a small sketch of n-gram extraction follows this list).
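  As a concrete illustration (added here, not part of the original post; the tokenized sentence is made up), the sketch below enumerates the unigrams, bigrams, and trigrams of a toy sentence, which is exactly the notion of "the current word plus its preceding $N-1$ words" used above:

```python
# A minimal sketch: enumerating the n-grams of a toy tokenized sentence.
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = "the cat sat on the mat".split()

print(ngrams(sentence, 1))  # unigrams: ('the',), ('cat',), ('sat',), ...
print(ngrams(sentence, 2))  # bigrams:  ('the', 'cat'), ('cat', 'sat'), ...
print(ngrams(sentence, 3))  # trigrams: ('the', 'cat', 'sat'), ('cat', 'sat', 'on'), ...
```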

  This Markov assumption simplifies probabilistic modeling, but it also introduces the problem of data sparsity: especially when $N$ is large, many of the possible word combinations appear only rarely, or not at all, in the training data. To address data sparsity, smoothing techniques are commonly used to assign some prior probability to combinations that never appear in the training data. Additive smoothing is a common smoothing technique that adds a small constant $\delta$ to the counts when computing conditional probabilities, so that unseen combinations are not assigned zero probability.

1. Unigram model

1.1 Overview

  • Definition: The unigram model is a special case of the N-gram statistical model in which each word is generated independently of all other words, with no dependence on context.

  • Model:
      In the unigram model, each word in the sequence occurs independently. For a given sequence $\mathbf{x}_{1:T}$:

$$p(\mathbf{x}_{1:T}; \boldsymbol{\theta}) = \prod_{t=1}^{T} p(x_t) = \prod_{k=1}^{|V|} \theta_k^{m_k}$$

where $p(x_t)$ is the probability of word $x_t$ over the vocabulary $V$, and $m_k$ is the number of times the $k$-th word of the vocabulary appears in the sequence (a small worked example follows this list).
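  As a small worked example (added for illustration, not in the original post): for a vocabulary $V = \{a, b\}$ and the sequence $\mathbf{x}_{1:3} = (a, b, a)$ we have $m_a = 2$ and $m_b = 1$, so

$$p(\mathbf{x}_{1:3}; \boldsymbol{\theta}) = p(a)\,p(b)\,p(a) = \theta_a^{2}\,\theta_b^{1}.$$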

1.2 Generation probability

  • Multinomial distribution assumption: Word generation follows a multinomial distribution whose parameters are the individual word probabilities.
      In the unigram model, the probability of each word depends only on that word's overall frequency, regardless of its position. This can be modeled by a multinomial distribution with parameter vector $\boldsymbol{\theta} = [\theta_1, \theta_2, \ldots, \theta_{|V|}]$, where $\theta_k$ is the probability of selecting the $k$-th word of the vocabulary.

1.3 Maximum likelihood estimation

  • Maximum likelihood estimation for the unigram model can be formulated as a constrained optimization problem

    • Goal: Determine the multinomial distribution parameters through maximum likelihood estimation to maximize the likelihood of the entire training set.

    • Constrained optimization: introduce Lagrange multipliers to obtain frequency estimates.

  • Specific process

    • Log-likelihood function:

      For a training set $\{\mathbf{x}^{(n)}_{1:T_n}\}_{n=1}^{N'}$, the log-likelihood is:

$$\log \left( \prod_{n=1}^{N'} p(\mathbf{x}^{(n)}_{1:T_n}; \boldsymbol{\theta}) \right) = \log \prod_{k=1}^{|V|} \theta_k^{m_k} = \sum_{k=1}^{|V|} m_k \log \theta_k$$

      where $m_k$ is the number of times the $k$-th word appears in the entire training set.

    • Maximum likelihood estimation problem:
        This is a maximum likelihood estimation problem: we need to find the parameter $\boldsymbol{\theta}$ that maximizes the log-likelihood function, subject to the constraint $\sum_{k=1}^{|V|} \theta_k = 1$.

    • Introducing a Lagrange multiplier:

      Introduce a Lagrange multiplier $\lambda$ and define the Lagrangian $\Lambda(\boldsymbol{\theta}, \lambda)$ as:

$$\Lambda(\boldsymbol{\theta}, \lambda) = \sum_{k=1}^{|V|} m_k \log \theta_k + \lambda \left( \sum_{k=1}^{|V|} \theta_k - 1 \right)$$

    • Taking the partial derivatives:

      Take the partial derivatives of the Lagrangian with respect to $\theta_k$ and $\lambda$ and set them to zero:

$$\frac{\partial \Lambda(\boldsymbol{\theta}, \lambda)}{\partial \theta_k} = \frac{m_k}{\theta_k} + \lambda = 0, \quad k = 1, 2, \ldots, |V|$$

$$\frac{\partial \Lambda(\boldsymbol{\theta}, \lambda)}{\partial \lambda} = \sum_{k=1}^{|V|} \theta_k - 1 = 0$$

      The first equation gives $\theta_k = -\frac{m_k}{\lambda}$; substituting this into the constraint $\sum_{k=1}^{|V|} \theta_k = 1$ yields $\lambda = -\sum_{k=1}^{|V|} m_k$, and therefore $\theta_k = \frac{m_k}{\bar{m}}$, where $\bar{m} = \sum_{k'=1}^{|V|} m_{k'}$ is the total length (number of words) of the document collection.

    • Final result:
        The maximum likelihood estimate for the unigram model is therefore equivalent to frequency estimation: the estimate of parameter $\theta_k$ is $\frac{m_k}{\bar{m}}$ (a minimal code sketch follows).
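  The frequency estimate above is easy to reproduce in code. Below is a minimal sketch (the toy corpus and tokenization are invented for illustration) that computes $\theta_k = m_k / \bar{m}$ by simple counting:

```python
# Minimal sketch: unigram maximum likelihood estimation as frequency counting.
from collections import Counter

# Toy corpus (illustrative only); each sentence is already tokenized.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

counts = Counter(word for sentence in corpus for word in sentence)  # m_k
total = sum(counts.values())                                        # m_bar (corpus length)

theta = {word: m_k / total for word, m_k in counts.items()}         # theta_k = m_k / m_bar

print(theta["the"])         # 4 / 12 ≈ 0.333
print(sum(theta.values()))  # ≈ 1.0 (up to floating point): the normalization constraint holds
```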

2. N-gram model

  In the N-gram model, the conditional probability $p(x_t \mid \mathbf{x}_{(t-N+1):(t-1)})$ is the probability of the $t$-th word given its preceding $N-1$ words. This probability can be obtained by maximum likelihood estimation:

$$p(x_t \mid \mathbf{x}_{(t-N+1):(t-1)}) = \frac{m(\mathbf{x}_{(t-N+1):t})}{m(\mathbf{x}_{(t-N+1):(t-1)})}$$

where $m(\mathbf{x}_{(t-N+1):t})$ is the number of times the sequence $\mathbf{x}_{(t-N+1):t}$ occurs in the training data, and $m(\mathbf{x}_{(t-N+1):(t-1)})$ is the number of times the sequence $\mathbf{x}_{(t-N+1):(t-1)}$ occurs. A minimal code sketch of this estimate for a bigram model follows.
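  For a bigram model ($N = 2$) the estimate reduces to counting word pairs and dividing by the count of the preceding word. A minimal sketch, using an invented toy corpus and ignoring sentence-boundary handling:

```python
# Minimal sketch: maximum likelihood estimation of a bigram model,
#   p(x_t | x_{t-1}) = m(x_{t-1}, x_t) / m(x_{t-1}).
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

bigram_counts = Counter()   # m(x_{t-1}, x_t)
unigram_counts = Counter()  # m(x_{t-1})
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

def p_mle(word, prev):
    """MLE of p(word | prev); unseen pairs get probability zero."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(p_mle("cat", "the"))  # m(the, cat) / m(the) = 1 / 4 = 0.25
print(p_mle("mat", "sat"))  # unseen pair -> 0.0: the data sparsity problem
```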

3. Smoothing techniques

3.1 Data sparsity problem

  • Challenge: N-gram models face the problem of data sparsity, especially for N-gram combinations that never appear in the training data; data sparsity causes the model to assign zero probability to such unseen combinations.

3.2 Smoothing techniques

  N-gram models face the problem of data sparsity, especially when the training set is small. Data sparsity means that, due to insufficient training samples, the model assigns zero probability to some N-gram combinations that are possible but never observed in the training set, which hurts the model's ability to generalize. In natural language processing this problem is particularly pronounced because word frequencies in natural language follow Zipf's law: a small number of very frequent words account for most occurrences, while the vast majority of words appear only rarely.
  Smoothing is a family of techniques for addressing data sparsity. The basic idea is to assign some probability mass to unseen events, so that the model does not penalize them excessively. Some common smoothing techniques are listed below.

  1. Additive smoothing:
      Additive smoothing is a simple and intuitive technique that adds a small constant $\delta$ to the counts used in the probability estimate, so that no N-gram receives zero probability. For the conditional probability of the N-gram model, the additively smoothed estimate is:

$$p(x_t \mid \mathbf{x}_{(t-N+1):(t-1)}) = \frac{m(\mathbf{x}_{(t-N+1):t}) + \delta}{m(\mathbf{x}_{(t-N+1):(t-1)}) + \delta |V|}$$

    where the constant $\delta$ takes a value between 0 and 1; when $\delta = 1$ this is called add-one (Laplace) smoothing. A minimal code sketch of additive smoothing follows this list.

  2. Good-Turing smoothing:
      Good-Turing smoothing is a more complex but more effective technique that uses the observed frequencies of events to estimate the probability mass of unobserved events. It reallocates probability from observed events toward low-frequency and unseen events, and it requires frequency-distribution statistics of the training data.

  3. Kneser-Ney smoothing:
      Kneser-Ney smoothing is an advanced smoothing technique that is particularly well suited to N-gram models. It combines the counts of N-grams with those of their (N-1)-gram prefixes, improving the model by recursively backing off to shorter contexts.
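  As a sketch of the simplest of these techniques, the code below applies additive smoothing to the toy bigram counts used in the earlier sketch (the corpus and the value of $\delta$ are illustrative; Good-Turing and Kneser-Ney smoothing are not implemented here):

```python
# Minimal sketch: additive (add-delta) smoothing for a bigram model,
#   p(x_t | x_{t-1}) = (m(x_{t-1}, x_t) + delta) / (m(x_{t-1}) + delta * |V|).
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]

bigram_counts = Counter()
unigram_counts = Counter()
for sentence in corpus:
    unigram_counts.update(sentence)
    bigram_counts.update(zip(sentence, sentence[1:]))

vocab_size = len(unigram_counts)  # |V| = 7 for this toy corpus

def p_additive(word, prev, delta=0.5):
    """Additively smoothed estimate: unseen bigrams get a small nonzero probability."""
    return (bigram_counts[(prev, word)] + delta) / (unigram_counts[prev] + delta * vocab_size)

print(p_additive("cat", "the"))  # seen:   (1 + 0.5) / (4 + 0.5 * 7) = 0.2
print(p_additive("mat", "sat"))  # unseen: (0 + 0.5) / (2 + 0.5 * 7) ≈ 0.091 instead of 0
```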

4. Application

  • Widely used: N-gram models are widely used in natural language processing, including speech recognition, machine translation, and Pinyin input methods.


Reposted from: blog.csdn.net/m0_63834988/article/details/135050694