Easily playing with the open-source large language model BLOOM (3)

Foreword

Updated 2023/03/31: added top-p and trimmed some redundant wording. I hadn't planned to continue this series, but text-generation strategies are a general topic, so let's cover them completely.
Picking up where we left off: the previous post introduced greedy search and beam search. One drawback of beam search that wasn't mentioned there is its lack of randomness — the output looks too "reasonable" and doesn't read like something written by a human. This post therefore introduces sampling: the two sampling parameters top_k and top_p, and the related temperature value, which can be set to add a reasonable amount of randomness to the generated text.

Random sampling

Random sampling means exactly what it sounds like: when choosing the next word, pick it at random according to the probability distribution, as shown in the figure below:
[Figure: sampling from the next-word probability distribution, e.g. nice 0.5, dog 0.4, car 0.1]
With greedy search, the highest-probability words nice and drives would always be selected, but with sampling even car, whose probability is only 0.1, may be chosen. Let's look at the code:

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
import time

a1 = time.time()
checkpoint = "bigscience/bloom-1b1"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)
inputs = tokenizer.encode("我和我的猫都很想你", return_tensors="pt")  # prompt
outputs = model.generate(inputs, min_length=150, max_length=200,
                         do_sample=True, temperature=1.0)
print(tokenizer.decode(outputs[0]))  # decode the generated token ids with the tokenizer
a2 = time.time()
print(f'time cost is {a2 - a1} s')

The main change is in the parameters passed to model.generate. Setting do_sample=True makes generation random each time. temperature=1.0 is the default value; it is written out explicitly here to aid understanding later. With this value, the probability distribution is left completely unmodified before sampling.

The generated results are as follows:

我和我的猫都很想你,你却不在,我知道你也在等我,我有些伤感,
但还是要相信一切都会好起来的,一切,包括我和你猫。 
图片发自简书app 我们这群猫,虽然现在都离不开我们,但还是有很多地方,比如我的家、我们家后院、我的小狗、我家的猫咪,
还有小狗、小猫...还有很多呢,他们只是需要我... 
图片发自简书app 图片发自简书app 图片发自简书app 图片发自简书app 图片发自简书app 图片发自简书app 图 片发自简书app 图片发自简书app 你说,我还能做什么?</s>
time cost is 29.499494075775146 s

Aside from the repeated text toward the end, the text generated at the beginning is also fairly incoherent. This is the consequence of not setting the temperature value.

Temperature setting

Let's first look at the softmax function, which might more intuitively be called "soft argmax". Simply put, it converts a series of real values into a probability distribution that sums to 1. Our next-word distribution covers many words, and their probabilities add up to 1; for example, in the figure above, {nice, dog, car} correspond to the probabilities {0.5, 0.4, 0.1}. Setting the temperature value actually means setting the β value of the softmax distribution.
[Figure: the softmax function with inverse-temperature β — softmax(x_i) = exp(β·x_i) / Σ_j exp(β·x_j)]
In the most extreme case, soft argmax tends toward argmax, i.e. a one-hot vector like {0, 0, 0, ..., 1, ..., 0} in which a single entry is 1 and all the others are 0. When does that happen? When β tends to infinity; since the temperature T equals 1/β, T tends to 0. At that point one word's probability approaches 1 (0.999999...) and every other word's probability approaches 0. The distribution then effectively becomes greedy search: even though we are nominally sampling, no other word can be chosen, only the word with the highest probability.
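To make the temperature effect concrete, here is a small sketch (illustration only, not code from the original post): it applies softmax with different temperatures to a toy three-word distribution. The logits are chosen purely for the example so that T=1.0 reproduces roughly {0.5, 0.4, 0.1}.

import numpy as np

def softmax_with_temperature(logits, temperature):
    """Divide the logits by T (i.e. multiply by β = 1/T) before normalizing."""
    scaled = np.array(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()          # subtract the max for numerical stability
    exps = np.exp(scaled)
    return exps / exps.sum()

words = ["nice", "dog", "car"]
logits = [np.log(0.5), np.log(0.4), np.log(0.1)]   # gives ~{0.5, 0.4, 0.1} at T=1.0

for T in (1.0, 0.7, 0.001, 100):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: " + ", ".join(f"{w}={p:.3f}" for w, p in zip(words, probs)))

T=1.0 leaves the distribution as it is, T=0.7 sharpens it (low-probability words such as car become even less likely), T close to 0 collapses it onto the single best word, which is exactly greedy search, and a very large T flattens it toward a uniform distribution.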

The sample code is as follows:

outputs = model.generate(inputs, min_length=150, do_sample=True,
                         max_length=200, temperature=0.001)

Here the temperature is set to a very small value, and the generated result is as follows:

我和我的猫都很想你。 猫咪:你什么时候回来? 
狗狗:明天。 猫咪:明天? 狗狗:明天。 
猫咪:明天? 狗狗:明天。 猫咪:明天? 
狗狗:明天。 猫咪:明天? 狗狗:明天。 
猫咪:明天? 狗狗:明天。 猫咪:明天? 
狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明 天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:明天? 狗狗:明天。 猫咪:
time cost is 35.493085861206055 s

The result is exactly the same as the greedy-search output from the previous post.

So what if we set the temperature very high?

outputs = model.generate(inputs, min_length=150, do_sample=True,
                         max_length=200, temperature=100)

With T=100, β is close to 0 and the probability distribution approaches {1/n, 1/n, ..., 1/n}: every word has (almost) the same probability of being selected. The generated text is:

我和我的猫都很想你:猫小狗喐呔猫 小白兔子,
还有"毛孩儿的猫阿玛 祝你过日子平实和暖昧、开一家宠物食品商店。 这几件都是送 给老爸们的.我买的礼物",
是希望他们一直走在那里开这猫饭窝里的饭呢... 春节刚贴的对家长的条子里:"孩子们对父亲节的怀念和对母亲节日感激,有家人们对父母的敬戴祝福以及对爱的肯定"这是对他们说的了......
我的爱爸爸- 我从小没有受过父亲的教育, 
但我 爱在春节与家人有快乐一起回家参加他们的生活了 "对父母有最好的爱的时候是第一次见外"是对家人父母说:爱自己才是最完美的呢 对外孙 女儿 奶奶奶奶,说有朋友请,大家见面,就象吃晚饭时我们用膳的""老"! 我是一个不会唱歌不认爹我总想妈妈,每次吃
time cost is 35.58598041534424 s

As you can see, the generated result is complete garbage with no structure at all, because every word is selected with (nearly) equal probability.

Finally, here is a result with temperature=0.7:

我和我的猫都很想你,因为你是我的偶像。
在人生路上,我们都在跌跌撞撞,我们都在摸爬滚打,但我们最终都选择了安稳,选择了安逸,选择了安逸。
因为你的存在,让我们 的生活变得安稳,让我们的生活变得温暖。
我知道,你现在过得很好,你一切都很好,我一定会好好照顾你的,所以,我会一直爱你 ,保护你,保护你。

This text is at least coherent. So random sampling needs a reasonable temperature value. What does a reasonable temperature do? As shown in the figure below:
[Figure: the sharpened distribution at a lower temperature — e.g. car drops from 0.1 to roughly 0.02]
The whole distribution becomes steeper: the probability of car drops from 0.1 to about 0.02, making it harder to select. A reasonable temperature keeps some randomness while making the words that should be chosen (the higher-probability ones) more likely to be chosen, so the output retains a certain amount of logic.

Top-k and Top-p

Top-k keeps only the K highest-probability words. For example, top_k=50 keeps the fifty most likely words and ignores all the others; sampling then happens only among those fifty.

Top-p keeps the smallest set of top-ranked words whose cumulative probability exceeds the set value. For example, with top_p=0.95, if the first word has probability 0.6 and the second 0.35, only these two words are kept (0.6 + 0.35 = 0.95) and everything after them is ignored.

From these descriptions you can see that top-p adapts to the shape of the distribution better than top-k: it flexibly chooses how many words to keep, whereas top-k ignores the cumulative probability. Of course, the two are often used together in practice; a small sketch of both filters is shown below.
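As an illustration only (this is not code from the post, and the probabilities are made up), here is a minimal sketch of how top-k and top-p filtering could be applied to a toy distribution before sampling:

import numpy as np

def top_k_filter(probs, k):
    """Keep the k most probable words, zero out the rest, renormalize."""
    probs = np.array(probs, dtype=np.float64)
    keep = np.argsort(probs)[-k:]                 # indices of the k largest probabilities
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

def top_p_filter(probs, p):
    """Keep the smallest high-probability prefix whose cumulative mass reaches p."""
    probs = np.array(probs, dtype=np.float64)
    order = np.argsort(probs)[::-1]               # highest probability first
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1   # how many words to keep
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = [0.6, 0.35, 0.03, 0.02]       # toy next-word distribution
print(top_k_filter(probs, k=2))       # ≈ [0.632, 0.368, 0.0, 0.0] after renormalizing
print(top_p_filter(probs, p=0.95))    # the same two words survive in this example

With transformers you do not implement any of this yourself; you simply pass top_k and/or top_p to model.generate, as in the calls shown in this post.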

Let's look at the actual effect on BLOOM:

outputs = model.generate(inputs, min_length=150, do_sample=True,
                         no_repeat_ngram_size=2, max_length=200, top_p=0.8)

no_repeat_ngram_size is added here to prevent repeated n-grams from being generated. The result is as follows:

我和我的猫都很想你。 在你最需要的时候,你总是会出现在我身边。 
你总是那么细心,温柔,善良,无私。 
无论发生什么事情,我们都会彼此鼓励,分担,互相理解,帮助,让你感到骄傲。 
从你开始,我的世界就多了你一个可爱的伙伴,我们的友谊从那时起就开始了。 
有时你会让我很生气,因为你的那些任性,但是当我生气的时候你就会走过来抱着我,亲吻我,轻轻地亲我。 
你会对我那么好,你会对我照顾那么周到, 会为我做任何事情。我无法用言语表达我对你的爱,无法形容我对你在我心里的位置。 
你的好让我感到温暖,让我觉得你是世界上最棒的 女人。 
我知道你是爱我的,但我还是不舍得离开你,我甚至想让你陪着我过一辈子。 
希望你能幸福,我希望我们能一起过一生。 
我已经很久没有联系你了,我很想见你,想和你

Well then. Doesn't the result look pretty good?
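As a closing note, the knobs introduced in this post are often combined in a single call. The following is just a sketch with illustrative parameter values, not settings recommended by the original post:

outputs = model.generate(inputs, min_length=150, max_length=200,
                         do_sample=True, temperature=0.7,
                         top_k=50, top_p=0.9,
                         no_repeat_ngram_size=2)
print(tokenizer.decode(outputs[0]))

Here temperature reshapes the distribution, top-k and top-p restrict sampling to the most plausible words, and no_repeat_ngram_size suppresses verbatim repetition of 2-grams.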


Original article: blog.csdn.net/weixin_43945848/article/details/129611861