Language Models and Recurrent Neural Networks (LM, RNNs)

Language Model

Language modeling is the task of predicting what word comes next.

A language model is a system that assigns a probability to a piece of text:
$$p(s)=p(w_1w_2\cdots w_m)=\prod_{i=1}^m p(w_i\mid w_1\cdots w_{i-1})$$

where $w_i$ can be any word in the vocabulary $V=\{w_1,\cdots,w_{|V|}\}$.
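To make the chain-rule factorization concrete, here is a minimal Python sketch; `cond_prob` is a hypothetical stand-in for whatever model supplies $p(w_i\mid w_1\cdots w_{i-1})$, and the uniform toy model below is purely illustrative:

```python
import math

def sentence_log_prob(words, cond_prob):
    """Chain rule: log p(w_1..w_m) = sum_i log p(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(cond_prob(w, words[:i]))  # log p(w_i | history)
    return total

# Toy usage: a "model" that ignores the history and spreads probability
# uniformly over a 4-word vocabulary.
vocab = {"the", "students", "opened", "their"}
uniform = lambda w, history: 1.0 / len(vocab)
print(sentence_log_prob(["the", "students", "opened", "their"], uniform))  # 4 * log(1/4)
```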

As you type words on a keyboard, the model can predict the next word:

I'll meet you at the [cafe, airport, office, etc.].


N-Gram Language Model

How do we learn a language model? One approach is to learn an n-gram language model. An n-gram is a chunk of n consecutive words, for example:

  • unigrams: “the”, “students”, “opened”, “their”;
  • bigrams: “the students”, “students opened”, “opened their”;
  • trigrams: “the students opened”, “students opened their”;
  • 4-grams, etc.

How do we get these n-gram probabilities? By counting them in some large corpus of text (an approximate maximum likelihood estimate):
$$p(w_i\mid w_{i-n+1}\cdots w_{i-1})\approx\frac{\text{count}(w_{i-n+1}\cdots w_i)}{\text{count}(w_{i-n+1}\cdots w_{i-1})}$$
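As a concrete toy example, a bigram ($n=2$) estimator built from raw counts might look like the following sketch; the two-sentence corpus and the function names are made up for illustration:

```python
from collections import Counter

def train_bigram_lm(corpus):
    """Estimate p(w_i | w_{i-1}) by counting bigrams and their contexts."""
    context_counts, bigram_counts = Counter(), Counter()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split()
        context_counts.update(tokens[:-1])                  # count(w_{i-1})
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))  # count(w_{i-1} w_i)
    def prob(word, prev):
        return bigram_counts[(prev, word)] / context_counts[prev] if context_counts[prev] else 0.0
    return prob

prob = train_bigram_lm(["the students opened their books",
                        "the students opened their minds"])
print(prob("opened", "students"))  # 1.0  (2 / 2)
print(prob("books", "their"))      # 0.5  (1 / 2)
```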


Problems with n-gram language models

  • sparsity: many n-grams never occur in the corpus, so their estimated probabilities are zero; smoothing and backoff are used to mitigate this (see the sketch after this list);
  • storage: the model must store counts for all n-grams it has seen, so model size grows with the corpus (and with n).
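As one standard remedy for the sparsity problem, add-k (Laplace) smoothing gives every possible n-gram a small pseudo-count so unseen n-grams receive non-zero probability. A minimal sketch, assuming count dictionaries like the ones in the bigram example above:

```python
def add_k_prob(word, prev, bigram_counts, context_counts, vocab_size, k=1.0):
    """Add-k smoothed bigram probability: every possible bigram gets a
    pseudo-count of k, so unseen bigrams receive small non-zero mass."""
    return (bigram_counts[(prev, word)] + k) / (context_counts[prev] + k * vocab_size)
```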

Generating text with an n-gram language model

Text is generated with an n-gram model by repeatedly sampling the next word from the conditional distribution given the previous n−1 words.

Long sentences generated by a 3-gram model tend to be incoherent. We would need to condition on more than three words at a time, but that makes the model far too large.
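A minimal sampling loop for a bigram model might look like the sketch below (again assuming the toy count dictionaries from earlier); at each step the next word is drawn in proportion to its conditional count:

```python
import random

def generate(bigram_counts, max_len=20, start="<s>"):
    """Sample a sentence from a bigram model: at each step, draw the next
    word with probability proportional to count(prev, w)."""
    sentence, prev = [], start
    for _ in range(max_len):
        candidates = [(w, c) for (p, w), c in bigram_counts.items() if p == prev]
        if not candidates:
            break
        words, counts = zip(*candidates)
        prev = random.choices(words, weights=counts, k=1)[0]
        sentence.append(prev)
    return " ".join(sentence)
```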


Recurrent Neural Network Language Models

Window-based neural network model

We can reuse the window-based neural model that was applied to Named Entity Recognition, feeding it a fixed window of previous words and predicting the next word.

Remaining problems: the window size is fixed and usually too small, and the inputs are not processed symmetrically (words at different window positions are multiplied by entirely different weights).
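For reference, a fixed-window neural LM of this kind can be sketched as follows; the layer sizes are arbitrary and this is an illustration of the idea rather than the exact model from the lecture:

```python
import torch
import torch.nn as nn

class FixedWindowLM(nn.Module):
    """Embed a fixed window of previous words, concatenate the embeddings,
    pass them through one hidden layer, and predict the next word."""
    def __init__(self, vocab_size, embed_dim=64, window=4, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.hidden = nn.Linear(window * embed_dim, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, window_ids):                # (batch, window) word ids
        e = self.embed(window_ids).flatten(1)     # (batch, window * embed_dim)
        h = torch.tanh(self.hidden(e))            # (batch, hidden)
        return self.out(h)                        # (batch, vocab_size) logits
```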


An RNN Language Model


Training an RNN Language Model

For each sentence $\pmb x$ in a large corpus of text:

  • compute the output distribution $\hat{\pmb y}^{(t)}$ for every step $t$;
  • the loss on step $t$ is the cross-entropy between the predicted probability distribution $\hat{\pmb y}^{(t)}$ and the true next word $\pmb y^{(t)}$ (the one-hot vector for $\pmb x^{(t+1)}$);
  • average these to get the overall loss for the sentence $\pmb x$ (length $T$):
    $$\begin{aligned} &J(\theta)=\frac{1}{T}\sum_{t=1}^T J^{(t)}(\theta)\\ &J^{(t)}(\theta)=CE\big(\pmb y^{(t)},\hat{\pmb y}^{(t)}\big)=-\sum_{w\in V}\pmb y_w^{(t)}\log\hat{\pmb y}_w^{(t)}=-\log\hat{\pmb y}_{\pmb x_{t+1}}^{(t)} \end{aligned}$$

The full training process runs the RNN over each sentence, computes the per-step losses, and backpropagates to update the parameters.
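A minimal PyTorch sketch of one training step under the definitions above; the model sizes, the toy random batch, and the class name `RNNLM` are placeholders:

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, x):                       # x: (batch, T) word ids
        h, _ = self.rnn(self.embed(x))          # hidden states, (batch, T, hidden)
        return self.out(h)                      # logits for y_hat^(t), (batch, T, vocab)

vocab_size = 1000
model = RNNLM(vocab_size)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()                 # cross-entropy against the true next word

x = torch.randint(0, vocab_size, (8, 20))       # toy inputs  x^(1)..x^(T)
y = torch.randint(0, vocab_size, (8, 20))       # toy targets x^(2)..x^(T+1)
logits = model(x)
loss = loss_fn(logits.reshape(-1, vocab_size), y.reshape(-1))  # J(theta), averaged over steps
loss.backward()
optimizer.step()
optimizer.zero_grad()
```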


Evaluating Language Models

Entropy of Language

Given a word sequence $\pmb w=w_1\cdots w_m$, the entropy of the word sequence is
$$H(\pmb w)=-\sum_{\pmb w\in L}p(\pmb w)\log p(\pmb w)$$

Treating the language $L$ as a stochastic process, the per-word entropy rate is
$$H(L)=-\lim_{n\to\infty}\frac{1}{n}\sum_{\pmb w\in L}p(\pmb w)\log p(\pmb w)$$


Cross Entropy of Language

In the language-modeling setting, let $u(\pmb w)$ be the true distribution and $v(\pmb w)$ the model's predicted distribution. The cross entropy of the language model $L$ is
$$H(u,v)=\frac{1}{m}\Bbb E_u[-\log v(\pmb w)]=-\lim_{m\to\infty}\frac{1}{m}\sum_{\pmb w\in L}u(\pmb w)\log v(\pmb w)$$

To summarize, by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability.
$$H(L)=-\lim_{m\to\infty}\frac{1}{m}\log p(w_1\cdots w_m)$$

In practice, the true distribution of the language is approximated by the training corpus; that is, we take
$$u(x\mid w_1\cdots w_{i-1})=\begin{cases}1,&x=w_i\\0,&x\neq w_i\end{cases}$$
so the cross entropy of the language model is approximately
$$H(u,v)=-\frac{1}{m}\log p(w_1\cdots w_m)$$
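Under this approximation, the cross entropy is simply the average negative log probability the model assigns to the observed tokens. A small self-contained sketch with made-up per-token probabilities:

```python
import math

def cross_entropy(token_probs):
    """H(u, v) ≈ -(1/m) log p(w_1..w_m), with p(w_1..w_m) factored into the
    per-token conditional probabilities assigned by the model."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

probs = [0.2, 0.1, 0.5, 0.25]      # hypothetical p(w_i | history) values
print(cross_entropy(probs))        # average negative log-likelihood, in nats
```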


Perplexity

The standard evaluation metric for language models is perplexity.
$$\text{perplexity}=\prod_{t=1}^T\left(\frac{1}{P_{LM}\big(x^{(t+1)}\mid x^{(t)},\ldots,x^{(1)}\big)}\right)^{1/T}$$
This is equal to the exponential of the cross-entropy loss $J(\theta)$:
$$\text{perplexity}=e^{H(\pmb w)}=p(w_1\cdots w_m)^{-\frac{1}{m}}$$

If we pick words at random according to the language model's probability distribution, the perplexity equals the expected number of samples needed to draw the correct word. For a given word sequence, the higher the probability the language model assigns to it, the lower the perplexity and the closer the model is to the true distribution.
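Continuing in the same spirit, perplexity is the exponential of that cross entropy, equivalently the inverse geometric mean of the per-token probabilities; the numbers below are made up:

```python
import math

probs = [0.2, 0.1, 0.5, 0.25]                     # hypothetical per-token probabilities
cross_entropy = -sum(math.log(p) for p in probs) / len(probs)
perplexity = math.exp(cross_entropy)              # exp of the cross entropy
# Same value computed directly as the inverse geometric mean p(w_1..w_m)^(-1/m):
assert abs(perplexity - math.prod(probs) ** (-1 / len(probs))) < 1e-9
print(perplexity)
```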

Reference: https://www.zhihu.com/question/58482430
