Language Model
Language modeling is the task of predicting what word comes next: given a sequence of words $x^{(1)}, x^{(2)}, \dots, x^{(t)}$, compute the probability distribution of the next word,

$$P\left(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\right)$$

where $x^{(t+1)}$ can be any word in the vocabulary $V$. Equivalently, we can think of a language model as a system that assigns a probability to a piece of text:

$$P\left(x^{(1)}, \dots, x^{(T)}\right) = \prod_{t=1}^{T} P\left(x^{(t)} \mid x^{(t-1)}, \dots, x^{(1)}\right)$$
As you type on a keyboard, a language model can predict the next word:

I'll meet you at the [cafe, airport, office, ...].
N-Gram Language Model
How do we learn a language model? One classical answer: learn an n-gram language model. An n-gram is a chunk of n consecutive words, for example:

- unigrams: "the", "students", "opened", "their";
- bigrams: "the students", "students opened", "opened their";
- trigrams: "the students opened", "students opened their";
- 4-grams: "the students opened their";
- etc.
How do we get these n-gram probabilities? By counting them in some large corpus of text (an approximate maximum-likelihood estimate). Assuming $x^{(t+1)}$ depends only on the preceding $n-1$ words (the Markov assumption):

$$P\left(x^{(t+1)} \mid x^{(t)}, \dots, x^{(t-n+2)}\right) \approx \frac{\operatorname{count}\left(x^{(t+1)}, x^{(t)}, \dots, x^{(t-n+2)}\right)}{\operatorname{count}\left(x^{(t)}, \dots, x^{(t-n+2)}\right)}$$
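A minimal sketch of this counting estimate (the toy corpus and function name are illustrative, not from the original):

```python
from collections import Counter

def ngram_probs(tokens, n):
    """Estimate n-gram probabilities by counting (maximum likelihood):
    P(word | context) = count(context + word) / count(context)."""
    context_counts = Counter()
    ngram_counts = Counter()
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        ngram_counts[(context, tokens[i + n - 1])] += 1
        context_counts[context] += 1
    return {(ctx, w): c / context_counts[ctx]
            for (ctx, w), c in ngram_counts.items()}

corpus = "the students opened their books the students opened their minds".split()
probs = ngram_probs(corpus, 3)
print(probs[(("opened", "their"), "books")])  # 0.5: "opened their" is followed by "books" half the time
```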
Problems with n-gram language models:

- Sparsity: most n-grams never occur in the corpus, so their estimated probabilities are zero (and if the context itself never occurs, the estimate is undefined). Smoothing and backoff are used to mitigate the sparsity problem; a sketch of simple smoothing follows this list.
- Storage: the model must store counts for all n-grams observed in the corpus, so model size grows with the corpus (and grows rapidly with n).
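A minimal sketch of add-k smoothing, one of the simplest smoothing schemes (backoff, which falls back to shorter contexts, is more involved; the count dictionaries here are toy values):

```python
def smoothed_prob(ngram_counts, context_counts, vocab_size,
                  context, word, k=1.0):
    """Add-k smoothing: pretend every (context, word) pair occurred
    k extra times, so unseen n-grams get a small nonzero probability."""
    num = ngram_counts.get((context, word), 0) + k
    den = context_counts.get(context, 0) + k * vocab_size
    return num / den

ngram_counts = {(("opened", "their"), "books"): 1}
context_counts = {("opened", "their"): 1}
# "exams" was never seen after "opened their", but still gets probability 1/11
print(smoothed_prob(ngram_counts, context_counts, vocab_size=10,
                    context=("opened", "their"), word="exams"))
```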
Generating text with an n-gram language model

The generation process with an n-gram model: repeatedly sample the next word from the conditional distribution given the last n-1 words, append it, and slide the window forward (a sketch follows below).

Long passages generated by a 3-gram model are incoherent, because the model only ever conditions on the previous two words. We would need to consider more than three words at a time to capture longer dependencies, but increasing n makes the model far too large.
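A minimal sketch of trigram generation (the corpus and function names are illustrative):

```python
import random
from collections import defaultdict

def build_trigram_model(tokens):
    """Map each two-word context to the list of observed next words
    (repetitions preserve the empirical distribution)."""
    model = defaultdict(list)
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        model[(a, b)].append(c)
    return model

def generate(model, context, length=20):
    """Repeatedly sample the next word given the last two words."""
    out = list(context)
    for _ in range(length):
        candidates = model.get(tuple(out[-2:]))
        if not candidates:
            break  # unseen context; a real system would back off
        out.append(random.choice(candidates))
    return " ".join(out)

tokens = "the students opened their books and the students opened their minds".split()
model = build_trigram_model(tokens)
print(generate(model, ("the", "students")))
```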
Recurrent Neural Network Language Model
Window-based neural language model

We can reuse the window-based neural model that was previously applied to Named Entity Recognition: concatenate the embeddings of a fixed window of words and predict the next word from them (sketched below).

Remaining problems: the window size is fixed and usually too small, and there is no symmetry in how the inputs are processed, since each position in the window is multiplied by completely different weights.
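A sketch of the fixed-window model in the standard formulation (a window of four words is assumed for concreteness):

$$
\begin{aligned}
e &= \left[e^{(1)}; e^{(2)}; e^{(3)}; e^{(4)}\right] && \text{concatenated word embeddings} \\
h &= f\left(We + b_1\right) && \text{hidden layer} \\
\hat{y} &= \operatorname{softmax}\left(Uh + b_2\right) \in \mathbb{R}^{|V|} && \text{distribution over the next word}
\end{aligned}
$$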
An RNN Language Model
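The core equations, in the standard RNN LM formulation ($e^{(t)}$ is the embedding of input word $x^{(t)}$; crucially, the same weights $W_h$ and $W_e$ are reused at every time step, so inputs of any length can be processed):

$$
\begin{aligned}
h^{(t)} &= \sigma\left(W_h h^{(t-1)} + W_e e^{(t)} + b_1\right), \quad h^{(0)} \text{ an initial hidden state} \\
\hat{y}^{(t)} &= \operatorname{softmax}\left(U h^{(t)} + b_2\right) \in \mathbb{R}^{|V|}
\end{aligned}
$$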
Training an RNN Language Model

For each sentence in a big corpus of text:

- compute the output distribution $\hat{y}^{(t)}$ for every step $t$;
- the loss on step $t$ is the cross-entropy between the predicted probability distribution $\hat{y}^{(t)}$ and the true next word $y^{(t)}$ (the one-hot vector for $x^{(t+1)}$):

$$J^{(t)}(\theta) = CE\left(y^{(t)}, \hat{y}^{(t)}\right) = -\sum_{w \in V} y_w^{(t)} \log \hat{y}_w^{(t)} = -\log \hat{y}_{x^{(t+1)}}^{(t)}$$

- average over all steps to get the overall loss for one sentence of length $T$:

$$J(\theta) = \frac{1}{T} \sum_{t=1}^{T} J^{(t)}(\theta)$$

The RNN training process, sketched in code below:
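A minimal PyTorch sketch of one training step (the layer sizes, the toy batch, and the choice of nn.RNN are illustrative assumptions, not from the original):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # assumed toy sizes

class RNNLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        h, _ = self.rnn(self.embed(x))  # hidden states for every step t
        return self.out(h)              # logits over V at every step

model = RNNLM()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss()  # per-step cross-entropy J^(t), averaged

# toy batch of two "sentences": predict tokens[1:] from tokens[:-1]
tokens = torch.randint(0, vocab_size, (2, 11))
inputs, targets = tokens[:, :-1], tokens[:, 1:]

logits = model(inputs)  # (batch=2, T=10, vocab_size)
loss = criterion(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()   # backpropagation through time
optimizer.step()
```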
Evaluating Language Models
Entropy of Language

Given a word sequence $w_1, \dots, w_n$ from a language $L$, the entropy of the sequence is

$$H(w_1, \dots, w_n) = -\sum_{w_1^n \in L} p(w_1, \dots, w_n) \log p(w_1, \dots, w_n)$$

Viewing the language $L$ as a stochastic process, the per-word entropy rate is

$$H(L) = \lim_{n \to \infty} \frac{1}{n} H(w_1, \dots, w_n) = -\lim_{n \to \infty} \frac{1}{n} \sum_{w_1^n \in L} p(w_1, \dots, w_n) \log p(w_1, \dots, w_n)$$
Cross Entropy of Language

In the language-modeling setting, the true distribution of the language is $p$ and the model's predicted distribution is $m$; the cross-entropy of the language model $m$ on $p$ is

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \sum_{w_1^n \in L} p(w_1, \dots, w_n) \log m(w_1, \dots, w_n)$$

By the Shannon-McMillan-Breiman theorem, if the language is stationary and ergodic, this limit can be computed from a single sufficiently long sample:

$$H(p, m) = \lim_{n \to \infty} -\frac{1}{n} \log m(w_1, \dots, w_n)$$

To summarize, by making some incorrect but convenient simplifying assumptions, we can compute the entropy of some stochastic process by taking a very long sample of the output and computing its average log probability.

In practice, the true distribution of the language is approximated by the training corpus: taking the sample $w_1, \dots, w_n$ to be long enough, the cross-entropy of the language model is approximated by

$$H(p, m) \approx -\frac{1}{n} \log m(w_1, \dots, w_n)$$
Perplexity
The standard evaluation metric for language models is perplexity:

$$\text{perplexity} = \prod_{t=1}^{T} \left( \frac{1}{P_{LM}\left(x^{(t+1)} \mid x^{(t)}, \dots, x^{(1)}\right)} \right)^{1/T}$$

This is equal to the exponential of the cross-entropy loss $J(\theta)$:

$$\text{perplexity} = \exp\left( \frac{1}{T} \sum_{t=1}^{T} -\log \hat{y}_{x^{(t+1)}}^{(t)} \right) = \exp\left(J(\theta)\right)$$

If we pick words at random according to the language model's probability distribution, the perplexity equals the expected number of draws needed to sample the correct word. For a given word sequence, the higher the probability the language model assigns to it, the lower the perplexity, and the closer the model is to the true distribution.
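A minimal sketch of the cross-entropy / perplexity computation (the per-step probabilities are made-up toy values):

```python
import math

# model-assigned probability of the true next word at each step t
step_probs = [0.2, 0.1, 0.45, 0.05, 0.3]  # assumed toy values

# average cross-entropy loss J(theta) = -(1/T) * sum(log p_t)
cross_entropy = -sum(math.log(p) for p in step_probs) / len(step_probs)
perplexity = math.exp(cross_entropy)  # exp of the cross-entropy loss

print(f"cross-entropy: {cross_entropy:.3f}, perplexity: {perplexity:.1f}")
```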
References: https://www.zhihu.com/question/58482430