The rise of GPT-style large language models has triggered a surge of interest in combining reinforcement learning with language generation. This article walks you through RLHF.

DRL: Reinforcement Learning with Language Models

With the explosion of ChatGPT, combining reinforcement learning (RL) with language generation models (Language Models) has become increasingly popular.

A video explanation of ChatGPT can be found here.

A detailed introduction to the project can be found here.

In this project, we use the open-source library trl to build several examples of updating a language model (GPT-2) with the reinforcement learning algorithm PPO, including:

  • A positive comment generation bot based on a Chinese sentiment model (No Human Reward)

  • A positive comment generation bot based on human scoring (With Human Reward)

  • Training a reward model (Reward Model) from ranked lists (Rank List)

  • A Rank List annotation platform

1. A positive comment generation bot based on a Chinese sentiment model (No Human Reward)

Suppose we already have a ready-made language model (a Chinese GPT-2 in this example). Given a short prompt, the model can continue it into a longer piece of text, for example:

prompt: 刚收到货,感觉有

output 1: 刚收到货,感觉有 点 不 符 合 预 期 ,不 好            ("a bit below expectations, not good")
output 2: 刚收到货,感觉有 挺 无 奈 的 送 货 速 度 不 太 行      ("quite frustrating, the delivery speed isn't great")
...
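For illustration, here is a minimal sketch of this continuation step with transformers. The checkpoint name is an assumption (any Chinese GPT-2 will do); the project's own script may load a different one:

# Minimal prompt-continuation sketch. The checkpoint below is an assumed
# Chinese GPT-2; substitute whichever model the project actually loads.
from transformers import BertTokenizer, GPT2LMHeadModel, TextGenerationPipeline

model_name = "uer/gpt2-chinese-cluecorpussmall"
tokenizer = BertTokenizer.from_pretrained(model_name)
model = GPT2LMHeadModel.from_pretrained(model_name)
generator = TextGenerationPipeline(model, tokenizer)

# Continue the prompt into a longer comment (sampling makes each run differ).
print(generator("刚收到货,感觉有", max_length=32, do_sample=True))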

We now want the language model to learn to generate "positive sentiment" comments, but the current GPT model has no notion of sentiment; for example, neither of the two generations above is positive.

To this end, we use reinforcement learning to fine-tune the existing GPT model so that it learns to generate "positive sentiment" comments as much as possible.

In reinforcement learning, whenever the model produces a generation we need to tell it how good that result is; that is, we assign a score (reward) to each generated output, for example:

output 1: 刚收到货,感觉有 点 不 符 合 预 期 ,不 好                -> 0.2
output 2: 刚收到货,感觉有 挺 无 奈 的 送 货 速 度 不 太 行          -> 0.1
output 3: 刚收到货,感觉有 些 惊 喜 于 货 物 质 量                  -> 0.9  ("pleasantly surprised by the quality")
...

This would be a very lengthy process if we had to manually score each output (we will do this in another example).

Therefore, we introduce a separate "sentiment model" to stand in for the scores a human would give.

"Emotion recognition model" we use the built-in sentiment-analysis pipeline in transformers to implement.

The model was trained on a dataset of online reviews and can judge whether a sentence expresses positive or negative sentiment.
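As a rough illustration, the snippet below shows how such a pipeline can be called. The checkpoint name is an assumption (a Chinese sentiment classifier trained on shopping reviews), and the exact label names depend on the checkpoint:

from transformers import pipeline

# Assumed checkpoint: a Chinese sentiment classifier trained on e-commerce
# reviews. Substitute the model your copy of the project actually uses.
senti_pipe = pipeline(
    "sentiment-analysis",
    model="uer/roberta-base-finetuned-jd-binary-chinese",
)

print(senti_pipe("刚收到货,感觉有些惊喜于货物质量"))
# e.g. [{'label': 'positive (...)', 'score': 0.98}]  -- label names vary by checkpoint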

We use this sentiment model's output score (0.0~1.0) as the reward for the GPT model's generations, and use the reinforcement learning algorithm PPO to iteratively update GPT.

1.1 Training process

The entire PPO + GPT2 training process is as follows:

  1. Randomly pick a prompt, e.g. "这部电影很" ("This movie is very...")

  2. The GPT model generates an answer conditioned on the prompt, e.g.: "This movie is very good~"

  3. Feed GPT's generated answer to the sentiment model and obtain a score (reward), e.g. 0.9

  4. Optimize the GPT model using the reward.

This loop repeats until training ends; a minimal code sketch of one iteration is shown below.
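The sketch below is written against trl's 0.x PPOTrainer API, so exact signatures may differ from the project's ppo_sentiment_example.py; the two checkpoint names are assumptions:

import torch
from transformers import AutoTokenizer, pipeline
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

gpt2_name = "uer/gpt2-chinese-cluecorpussmall"                  # assumed GPT-2 checkpoint
reward_name = "uer/roberta-base-finetuned-jd-binary-chinese"    # assumed sentiment checkpoint

tokenizer = AutoTokenizer.from_pretrained(gpt2_name)
model = AutoModelForCausalLMWithValueHead.from_pretrained(gpt2_name)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(gpt2_name)
ppo_trainer = PPOTrainer(PPOConfig(batch_size=1, mini_batch_size=1),
                         model, ref_model, tokenizer)
senti_pipe = pipeline("sentiment-analysis", model=reward_name)

prompt = "这部电影很"                                                      # 1. pick a prompt
query = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).input_ids[0]
full = ppo_trainer.generate(query, max_new_tokens=16)                      # 2. generate a continuation
response = full.squeeze()[len(query):]
text = tokenizer.decode(torch.cat([query, response]), skip_special_tokens=True)
result = senti_pipe(text)[0]                                               # 3. sentiment as reward
# Use the positive-class probability (label names depend on the checkpoint).
reward = result["score"] if "positive" in result["label"] else 1.0 - result["score"]
ppo_trainer.step([query], [response], [torch.tensor(reward)])              # 4. PPO update on this sample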

1.2 Start training

The project is implemented with pytorch + transformers; install the dependencies before running:

pip install -r ../requirements.txt

Run the training script:

python ppo_sentiment_example.py

Once training has started, the terminal prints output like the following:

...
epoch 0 mean-reward: 0.7271811366081238
Random Sample 5 text(s) of model output:
1. 刚收到货,感觉不 错 , 会 冒 充 收 银 员 在 果 盘 盘 底 , 就
2. 说实话,真的很般 般 , 一 般 都 是 饭 点 去 , 没 办 法 我 现
3. 说实话,真的很怪 不 得 刚 开 的 没 多 久 , 现 在 上 海 这 个
4. 这部电影很啊 , 所 以 , 也 算 一 个 抛 砖 引 玉 。 昨 天
5. 这次购物总的来说体验很[SEP] ~ 满 意 谢 谢 送 货 很 快 [SEP] 为 什 么 输 出
  1%|| 1/157 [00:55<2:23:53, 55.34s/it]
epoch 1 mean-reward: 0.7439988851547241
Random Sample 5 text(s) of model output:
1. 这次购物总的来说体验很我 不 知 道 表 盘 这 是 男 人 的? 听 说 女 人
2. 这部电影很金 士 顿 鉴 定 和 暗 暗 [SEP] 正 品 。 是 正 品 这
3. 刚收到货,感觉是 有 些 人 吃 不 尽 的 名 字 ! ~ 世 界 几 大
4. 说实话,真的很对 不 起 这 个 价 钱 , 可 能 是 因 为 做 出 来
5. 说实话,真的很非 电 。 31. 可 说 是 食 堂 , 没 怎 么 规 划
  1%|█▎                                                                                                    | 2/157 [01:51<2:24:31, 55.95s/it]
epoch 2 mean-reward: 0.8219242691993713
...

Here mean-reward is the model's average score in that epoch (feedback from the sentiment model), and Random Sample shows sample sentences generated by the model in the current epoch.

The various training metrics (including the reward curve) are saved under logs/PPO-Sentiment-Zh.png.

At the beginning of training, GPT generates fairly random answers, the average reward is not very high, and some "negative" comments are still produced.

As training progresses, GPT gradually learns to favor "positive" sentiment comments.

2. A comment generation bot based on human scoring (With Human Reward)

In the first example, the model's reward comes from another model.

In this example, we will make a platform to support human scoring.

We start the annotation platform:

python terminal_main.py 

The model's generations then appear in the terminal, and we iterate the model by manually typing in a reward.
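The core idea is simply to replace the sentiment model's score with a number typed by a human. A minimal illustrative helper (not the project's actual code; terminal_main.py's prompt format may differ) might look like this:

import torch

def collect_human_reward(generated_text: str) -> torch.Tensor:
    # Show one generation to the annotator and read a score from the terminal.
    print(f"model output: {generated_text}")
    score = float(input("your reward (0.0 ~ 1.0): "))
    return torch.tensor(score)

# Inside the PPO loop, this tensor replaces the sentiment model's score:
# ppo_trainer.step([query], [response], [collect_human_reward(text)])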

3. Training a Reward Model from manual rankings

Here we train a scoring model (the Reward Model) from ranked sequences.

In the training dataset data/reward_datasets/sentiment_analysis, each line is one ranked sequence, with items separated by \t.

Items ranked higher express more "positive sentiment", and items ranked lower express more "negative sentiment".

1.买过很多箱这个苹果了,一如既往的好,汁多味甜~	2.名不副实。	3.拿过来居然屏幕有划痕,顿时就不开心了	4.什么手机啊!一台充电很慢,信号不好!退了!又买一台竟然是次品。
1.一直用沙宣的洗发露!是正品!去屑止痒润发护发面面俱到!	2.觉得比外买的稀,好似加了水的	3.非常非常不满意,垃圾。	4.什么垃圾衣服,买来一星期不到口袋全拖线,最差的一次购物
...
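A common way to train a reward model from such ranked lists is the pairwise loss used in InstructGPT: for every pair where one item is ranked above another, push the better item's score higher. The sketch below shows that standard formulation; the project's own loss may be implemented differently (its logged loss values are negative):

import torch
import torch.nn.functional as F

def rank_loss(scores: torch.Tensor) -> torch.Tensor:
    # `scores` holds the reward model's scalar outputs for the candidates of one
    # dataset line, ordered best first. For every pair (i, j) with i ranked
    # above j, minimize -log(sigmoid(score_i - score_j)).
    losses = []
    for i in range(len(scores)):
        for j in range(i + 1, len(scores)):
            losses.append(-F.logsigmoid(scores[i] - scores[j]))
    return torch.stack(losses).mean()

# Example with scores for one 4-item ranked list:
print(rank_loss(torch.tensor([2.0, 1.0, -0.5, -1.5])))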

Start the training script:

sh train_reward_model.sh

After successfully starting the training, the terminal will print the following information:

...
global step 10, epoch: 1, loss: -0.51766, speed: 0.21 step/s
global step 20, epoch: 1, loss: -0.55865, speed: 0.22 step/s
global step 30, epoch: 1, loss: -0.60930, speed: 0.21 step/s
global step 40, epoch: 1, loss: -0.65024, speed: 0.21 step/s
global step 50, epoch: 1, loss: -0.67781, speed: 0.22 step/s
Evaluation acc: 0.50000
best F1 performence has been updated: 0.00000 --> 0.50000
global step 60, epoch: 1, loss: -0.69296, speed: 0.20 step/s
global step 70, epoch: 1, loss: -0.70710, speed: 0.20 step/s
...

The training curves are saved to logs/reward_model/sentiment_analysis/ERNIE Reward Model.png.

After training completes, run the inference script to see how the trained model scores new text:

python inference_reward_model.py

We enter two comment sentences:

texts = [
    '买过很多箱这个苹果了,一如既往的好,汁多味甜~',
    '一台充电很慢,信号不好!退了!又买一台竟然是次品。。服了。。'
]

>>> tensor([[10.6989], [-9.2695]], grad_fn=<AddmmBackward>)

As we can see, the positive comment scores about 10.70, while the negative comment scores about -9.27.
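For reference, here is a minimal sketch of how such a reward model produces these scalars, assuming an ERNIE encoder plus a linear head on the [CLS] vector; the class and checkpoint names below are illustrative, not the project's actual code:

import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    # Illustrative reward model: ERNIE encoder + linear head -> one scalar per text.
    def __init__(self, encoder_name: str = "nghuyong/ernie-3.0-base-zh"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.reward_head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, **inputs) -> torch.Tensor:
        cls = self.encoder(**inputs).last_hidden_state[:, 0]   # [CLS] representation
        return self.reward_head(cls)                           # shape: [batch, 1]

texts = [
    '买过很多箱这个苹果了,一如既往的好,汁多味甜~',
    '一台充电很慢,信号不好!退了!又买一台竟然是次品。。服了。。',
]
tokenizer = AutoTokenizer.from_pretrained("nghuyong/ernie-3.0-base-zh")
model = RewardModel()
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
print(model(**batch))   # two scalars, one per comment (untrained weights here)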


4. Manual ranking (RankList) labeling platform

If you want to build a custom ranking dataset for the Reward Model training in Section 3, you can use the labeling tool provided in this project.

The platform is built with streamlit, so install this third-party package before using it:

pip install streamlit==1.17.0

Then, run the following command to enable the annotation platform:

sh start_ranklist_labler.sh

Open the IP and port in a browser (the default port is 8904 and can be changed in start_ranklist_labler.sh) to access the annotation platform.

Click the 随机 prompt ("random prompt") button to randomly select a prompt from the prompt pool (the pool can be modified via MODEL_CONFIG['random_prompts'] in ranklist_labeler.py).

Rank the 4 answers generated by the model from highest to lowest score, then click the 存储当前排序 ("save current ranking") button at the bottom to save the current ranking to the local dataset.

The dataset is stored in data/human_labeled/total_dataset.tsv (configurable via MODEL_CONFIG['dataset_file'] in ranklist_labeler.py); each row is one rank_list, with items separated by \t:

今天早晨我去了 一 趟 酒 店 , 在 check in 的 时 候 我 也 在 , 但 是 那 位 前 任 不 让 我 进 去 , 直 接 说 了 一 句	今天早晨我去了 中 介 的 办 公 楼 , 看 了 我 的 婚 纱 照 , 拍 的 时 候 已 经 是 晚 上 十 一 点 有 点 累 了 , 我	今天早晨我去了 天 津 , 因 为 天 气 真 是 糟 糕 , 天 都 是 蓝 色 的 , 但 我 在 一 个 山 坡 上 , 因 为 时 间 短	今天早晨我去了 你 们 工 作 室 , 一 片 混 乱 , 有 什 么 问 题 都 没 有 , 还 有 一 些 工 作 人 员 乱 来 乱 走 ,
...
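A small sketch of how these rows can be read back for reward-model training (hypothetical helper, path as above):

def load_rank_lists(path: str = "data/human_labeled/total_dataset.tsv"):
    # Each line is one rank list: candidate texts separated by tabs, best first.
    rank_lists = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            items = [t.strip() for t in line.rstrip("\n").split("\t") if t.strip()]
            if items:
                rank_lists.append(items)
    return rank_lists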

You can also click the Dataset button at the top of the annotation page to view the data stored so far.

After labeling is complete, refer to Section 3 to train a custom Reward Model.

Reference links:

https://mp.weixin.qq.com/s/1v4Uuc1YAZ9MRr1UWMH9xw

https://zhuanlan.zhihu.com/p/595579042

https://zhuanlan.zhihu.com/p/606328992



Origin blog.csdn.net/sinat_39620217/article/details/132416416