Language Models and Recurrent Neural Networks (LM, RNNs)

Language Model

Language modeling is the task of predicting what word comes next.

A language model is a system that assigns a probability to a piece of text:
$$p(s)=p(w_1w_2\cdots w_m)=\prod_{i=1}^m p(w_i\mid w_1\cdots w_{i-1})$$

where $w_i$ can be any word in the vocabulary $V=\{w_1,\cdots,w_{|V|}\}$.
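To make the chain-rule factorization concrete, here is a minimal Python sketch; `cond_prob` is a hypothetical stand-in for whatever model supplies $p(w_i\mid w_1\cdots w_{i-1})$, and the uniform toy model below is purely illustrative:

```python
import math

def sentence_log_prob(words, cond_prob):
    """Chain rule: log p(w_1..w_m) = sum_i log p(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(cond_prob(w, words[:i]))  # log p(w_i | history)
    return total

# Toy usage: a "model" that ignores the history and spreads probability
# uniformly over a 4-word vocabulary.
vocab = {"the", "students", "opened", "their"}
uniform = lambda w, history: 1.0 / len(vocab)
print(sentence_log_prob(["the", "students", "opened", "their"], uniform))  # 4 * log(1/4)
```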

As you type words on a keyboard, the model can predict the next word:

I'll meet you at the [cafe, airport, office, etc.].


N-Gram Language Model

How do we learn a language model? One approach is to learn an n-gram language model. An n-gram is a chunk of n consecutive words, for example:

  • unigrams: “the”, “students”, “opened”, “their”;
  • bigrams: “the students”, “students opened”, “opened their”;
  • trigrams: “the students opened”, “students opened their”;
  • 4-grams, etc.

How do we get these n-gram probabilities? By counting them in some large corpus of text (an approximate maximum likelihood estimate):
$$p(w_i\mid w_{i-n+1}\cdots w_{i-1})\approx\frac{\text{count}(w_{i-n+1}\cdots w_i)}{\text{count}(w_{i-n+1}\cdots w_{i-1})}$$
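As a concrete toy example, a bigram ($n=2$) estimator built from raw counts might look like the following sketch; the two-sentence corpus and the function names are made up for illustration:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate p(w_i | w_{i-1}) by counting bigrams and their contexts."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        context_counts.update(tokens[:-1])                  # count(w_{i-1})
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))  # count(w_{i-1} w_i)
    def prob(word, prev):
        return bigram_counts[(prev, word)] / context_counts[prev] if context_counts[prev] else 0.0
    return prob

prob = train_bigram_lm(["the students opened their books",
                        "the students opened their minds"])
print(prob("opened", "students"))  # 1.0  (2 / 2)
print(prob("books", "their"))      # 0.5  (1 / 2)
```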


Problems with n-gram language models

  • sparsity: many n-grams never occur in the corpus, so their estimated probabilities are zero; smoothing and backoff are used to mitigate this (see the sketch after this list);
  • storage: the model must store counts for all n-grams it has seen, so model size grows with the corpus (and with n).
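As one standard remedy for the sparsity problem, add-k (Laplace) smoothing gives every possible n-gram a small pseudo-count so unseen n-grams receive non-zero probability. A minimal sketch, assuming count dictionaries like the ones in the bigram example above:

```python
def add_k_prob(word, prev, bigram_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed bigram probability: every possible bigram gets a
    pseudo-count of k, so unseen bigrams receive small non-zero mass."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)
```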

Generating text with an n-gram language model

Text is generated with an n-gram model by repeatedly sampling the next word from the conditional distribution given the previous n−1 words.

Long sentences generated by a 3-gram model tend to be incoherent. We would need to condition on more than three words at a time, but that makes the model far too large.
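A minimal sampling loop for a bigram model might look like the sketch below (again assuming the toy count dictionaries from earlier); at each step the next word is drawn in proportion to its conditional count:

```python
import random

def generate(bigram_counts, max_len=20, start="<s>"):
    """Sample a sentence from a bigram model: at each step, draw the next
    word with probability proportional to count(prev, w)."""
    sentence, prev = [], start
    for _ in range(max_len):
        candidates = [(w, c) for (p, w), c in bigram_counts.items() if p == prev]
        if not candidates:
            break
        words, counts = zip(*candidates)
        prev = random.choices(words, weights=counts, k=1)[0]
        sentence.append(prev)
    return " ".join(sentence)
```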


Recurrent Neural Network Language Models

Window-based neural network model

We can reuse the window-based neural model that was applied to Named Entity Recognition, feeding it a fixed window of previous words and predicting the next word.

Remaining problems: the window size is fixed and usually too small, and the inputs are not processed symmetrically (words at different window positions are multiplied by entirely different weights).
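For reference, a fixed-window neural LM of this kind can be sketched as follows; the layer sizes are arbitrary and this is an illustration of the idea rather than the exact model from the lecture:

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Embed a fixed window of previous words, concatenate the embeddings,
    pass them through one hidden layer, and predict the next word."""
    def __init__(self, vocab_size, embed_dim=64, window=4, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, window_ids):                # (batch, window) word ids
        e = self.embed(window_ids).flatten(1)     # (batch, window * embed_dim)
        h = torch.tanh(self.hidden(e))            # (batch, hidden)
        return self.out(h)                        # (batch, vocab_size) logits
```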


An RNN Language Model


Training an RNN Language Model

For each sentence $\pmb x$ in a large corpus of text:

  • compute the output distribution $\hat{\pmb y}^{(t)}$ for every step $t$;
  • the loss on step $t$ is the cross-entropy between the predicted probability distribution $\hat{\pmb y}^{(t)}$ and the true next word $\pmb y^{(t)}$ (the one-hot vector for $\pmb x^{(t+1)}$);
  • average these to get the overall loss for the sentence $\pmb x$ (length $T$):
    $$\begin{aligned} &J(\theta)=\frac{1}{T}\sum_{t=1}^T J^{(t)}(\theta)\\ &J^{(t)}(\theta)=CE\big(\pmb y^{(t)},\hat{\pmb y}^{(t)}\big)=-\sum_{w\in V}\pmb y_w^{(t)}\log\hat{\pmb y}_w^{(t)}=-\log\hat{\pmb y}_{\pmb x_{t+1}}^{(t)} \end{aligned}$$

The full training process runs the RNN over each sentence, computes the per-step losses, and backpropagates to update the parameters.
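A minimal PyTorch sketch of one training step under the definitions above; the model sizes, the toy random batch, and the class name `RNNLM` are placeholders:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):                       # x: (batch, T) word ids
        h, _ = self.rnn(self.embed(x))          # hidden states, (batch, T, hidden)
        return self.out(h)                      # logits for y_hat^(t), (batch, T, vocab)

vocab_size = 1000
model = RNNLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                 # cross-entropy against the true next word

x = torch.randint(0, vocab_size, (8, 20))       # toy inputs  x^(1)..x^(T)
y = torch.randint(0, vocab_size, (8, 20))       # toy targets x^(2)..x^(T+1)
logits = model(x)
loss = loss_fn(logits.reshape(-1, vocab_size), y.reshape(-1))  # J(theta), averaged over steps
loss.backward()
optimizer.step()
optimizer.zero_grad()
```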


Evaluating Language Models

Entropy of Language

Given a word sequence $\pmb w=w_1\cdots w_m$, the entropy of the word sequence is
$$H(\pmb w)=-\sum_{\pmb w\in L}p(\pmb w)\log p(\pmb w)$$

Treating the language $L$ as a stochastic process, the per-word entropy rate is
$$H(L)=-\lim_{n\to\infty}\frac{1}{n}\sum_{\pmb w\in L}p(\pmb w)\log p(\pmb w)$$


Cross Entropy of Language

In the language-modeling setting, let $u(\pmb w)$ be the true distribution and $v(\pmb w)$ the model's predicted distribution. The cross entropy of the language model $L$ is
$$H(u,v)=\frac{1}{m}\Bbb E_u[-\log v(\pmb w)]=-\lim_{m\to\infty}\frac{1}{m}\sum_{\pmb w\in L}u(\pmb w)\log v(\pmb w)$$

To summarize, by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability.
$$H(L)=-\lim_{m\to\infty}\frac{1}{m}\log p(w_1\cdots w_m)$$

In practice, the true distribution of the language is approximated by the training corpus; that is, we take
$$u(x\mid w_1\cdots w_{i-1})=\begin{cases}1,&x=w_i\\0,&x\neq w_i\end{cases}$$
so the cross entropy of the language model is approximately
$$H(u,v)=-\frac{1}{m}\log p(w_1\cdots w_m)$$
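Under this approximation, the cross entropy is simply the average negative log probability the model assigns to the observed tokens. A small self-contained sketch with made-up per-token probabilities:

```python
import math

def cross_entropy(token_probs):
    """H(u, v) ≈ -(1/m) log p(w_1..w_m), with p(w_1..w_m) factored into the
    per-token conditional probabilities assigned by the model."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

probs = [0.2, 0.1, 0.5, 0.25]      # hypothetical p(w_i | history) values
print(cross_entropy(probs))        # average negative log-likelihood, in nats
```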


Perplexity

The standard evaluation metric for language models is perplexity.
$$\text{perplexity}=\prod_{t=1}^T\left(\frac{1}{P_{LM}\big(x^{(t+1)}\mid x^{(t)},\ldots,x^{(1)}\big)}\right)^{1/T}$$
This is equal to the exponential of the cross-entropy loss $J(\theta)$:
$$\text{perplexity}=e^{H(\pmb w)}=p(w_1\cdots w_m)^{-\frac{1}{m}}$$

If we pick words at random according to the language model's probability distribution, the perplexity equals the expected number of samples needed to draw the correct word. For a given word sequence, the higher the probability the language model assigns to it, the lower the perplexity and the closer the model is to the true distribution.
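Continuing in the same spirit, perplexity is the exponential of that cross entropy, equivalently the inverse geometric mean of the per-token probabilities; the numbers below are made up:

```python
import math

probs = [0.2, 0.1, 0.5, 0.25]                     # hypothetical per-token probabilities
cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(cross_entropy)              # exp of the cross entropy
# Same value computed directly as the inverse geometric mean p(w_1..w_m)^(-1/m):
assert abs(perplexity - math.prod(probs) ** (-1 / len(probs))) < 1e-9
print(perplexity)
```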

Reference: https://www.zhihu.com/question/58482430
