Evaluation Metrics (3)

Perplexity (Chinese: 困惑度) is a concept from information theory. It can be used to measure the uncertainty of a random variable, and it can also be used to measure how well a model has been trained. In general, the higher the perplexity of a random variable, the greater its uncertainty; the higher the perplexity of a model at inference time, the worse the model's performance, and vice versa.

Perplexity of the probability distribution of a random variable

For a discrete random variable $X$ with probability distribution $p(x)$, the corresponding perplexity is
$$2^{H(p)} = 2^{-\sum_{x \in X} p(x)\log_2 p(x)}$$
where $H(p)$ is the entropy of the distribution $p$. It can be seen that the greater the entropy of a random variable, the greater its perplexity, and the greater its uncertainty.
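To make the formula concrete, here is a minimal sketch in Python; the function name `distribution_perplexity` and the example probabilities are illustrative, not taken from the original post.

```python
import math

def distribution_perplexity(p):
    """Perplexity 2^H(p) of a discrete distribution given as a list of probabilities."""
    entropy = -sum(p_x * math.log2(p_x) for p_x in p if p_x > 0)
    return 2 ** entropy

# A uniform distribution over 4 outcomes has entropy 2 bits, hence perplexity 4;
# a peaked distribution is less uncertain and has lower perplexity.
print(distribution_perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
print(distribution_perplexity([0.7, 0.1, 0.1, 0.1]))      # ~2.56
```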

Perplexity of the model distribution

Perplexity can also be used to measure how well a model has been trained, that is, how close the model distribution is to the sample distribution. Generally speaking, during training, the closer the model distribution gets to the sample distribution, the better the model is trained.

Suppose there is a batch of data $x_1, x_2, x_3, \ldots, x_n$ whose empirical distribution is $p_r(x)$, and that we have trained a model $p_\theta(x)$. The quality of the model distribution $p_\theta(x)$ can then be measured by the perplexity
$$2^{H(p_r, p_\theta)} = 2^{-\sum_{i=1}^{n} p_r(x_i)\log_2 p_\theta(x_i)}$$
where $H(p_r, p_\theta)$ is the cross entropy between the empirical distribution $p_r$ and the model distribution $p_\theta$. If each sample $x_i$ is assumed to occur with equal probability, i.e. $p_r(x_i) = \frac{1}{n}$, the perplexity of the model distribution simplifies to
$$2^{H(p_r, p_\theta)} = 2^{-\frac{1}{n}\sum_{i=1}^{n} \log_2 p_\theta(x_i)}$$
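As a rough sketch, the simplified formula can be evaluated directly from the probabilities the model assigns to held-out samples; the helper `model_perplexity` and the numbers below are made up for illustration.

```python
import math

def model_perplexity(model_probs):
    """Perplexity 2^{H(p_r, p_theta)} assuming each of the n samples has empirical weight 1/n.

    model_probs: the probability p_theta(x_i) the trained model assigns to each sample x_i.
    """
    n = len(model_probs)
    cross_entropy = -sum(math.log2(p) for p in model_probs) / n
    return 2 ** cross_entropy

# The closer the model's probabilities are to 1, the lower the cross entropy and the perplexity.
print(model_perplexity([0.9, 0.8, 0.95]))   # ~1.13
print(model_perplexity([0.3, 0.2, 0.25]))   # ~4.05
```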

Perplexity in NLP

In the field of NLP, a language model can be used to compute the probability of a sentence. For a sentence $s = w_1, w_2, w_3, \ldots, w_n$, its generation probability can be computed as
$$\begin{aligned} p(s) &= p(w_1, w_2, \ldots, w_n) \\ &= \prod_{i=1}^{n} p(w_i \mid w_1, w_2, \ldots, w_{i-1}) \end{aligned}$$
Once a language model has been trained, how do we judge its quality? This is where perplexity comes into play. Generally speaking, the test sets used to evaluate language models consist of reasonable, high-quality corpora, so the lower the perplexity of a language model on the test set, the better the model has been trained, and vice versa.
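The chain-rule product can be written out directly; the conditional probabilities below are invented purely to show the computation, not taken from any real model.

```python
# Hypothetical conditional probabilities p(w_i | w_1, ..., w_{i-1}) that a trained
# language model might assign to the three tokens of a short sentence.
cond_probs = [0.2, 0.05, 0.1]

# Chain rule: the sentence probability is the product of the conditional probabilities.
sentence_prob = 1.0
for p in cond_probs:
    sentence_prob *= p

print(sentence_prob)  # ~0.001
```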

Having seen how to compute the probability of a sentence, for the sentence $s = w_1, w_2, w_3, \ldots, w_n$ its perplexity can be defined as
$$\begin{aligned} \mathrm{perplexity}(s) &= p(s)^{-\frac{1}{n}} \\ &= p(w_1, w_2, \ldots, w_n)^{-\frac{1}{n}} \\ &= \sqrt[n]{\frac{1}{p(w_1, w_2, \ldots, w_n)}} \\ &= \sqrt[n]{\prod_{i=1}^{n} \frac{1}{p(w_i \mid w_1, w_2, \ldots, w_{i-1})}} \end{aligned}$$
Clearly, the higher the probability a model assigns to the sentences in the test set, the lower the perplexity.
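A short sketch of this definition, again assuming per-token conditional probabilities are available; the function name `sentence_perplexity` is hypothetical, and the computation is done in log space, which is equivalent to the n-th root form above.

```python
import math

def sentence_perplexity(cond_probs):
    """perplexity = p(s)^(-1/n), computed in log space to avoid underflow on long sentences."""
    n = len(cond_probs)
    log2_prob = sum(math.log2(p) for p in cond_probs)
    return 2 ** (-log2_prob / n)

# Reusing the illustrative conditional probabilities from the earlier sketch:
print(sentence_perplexity([0.2, 0.05, 0.1]))  # ~10.0, i.e. (1 / 0.001) ** (1 / 3)
```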
