Step 1 - Supervised Fine-Tuning
# Move into the first step of the pipeline
cd training/step1_supervised_finetuning/
# Run the training script
bash training_scripts/single_gpu/run_1.3b.sh
# Evaluate the model
bash evaluation_scripts/run_prompt.sh
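Before launching, it can help to confirm that the launch script is visible from the current directory, since the commands above use paths relative to the step-1 folder. A minimal pre-flight sketch (the check itself is our addition, not part of the DeepSpeed Chat scripts):

```shell
# Pre-flight check: verify the SFT launch script exists relative to the
# current directory before invoking it (path taken from the commands above).
script="training_scripts/single_gpu/run_1.3b.sh"
if [ -f "$script" ]; then
  status="found"
else
  status="missing"   # e.g. when run outside training/step1_supervised_finetuning/
fi
echo "SFT launch script: $status"
```

If the script reports `missing`, re-check that you ran the `cd` command first.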
Step 2 - Reward Model
# Move into the second step of the pipeline
cd training/step2_reward_model_finetuning
# Run the training script
bash training_scripts/single_gpu/run_350m.sh
# Evaluate the model
bash evaluation_scripts/run_eval.sh
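Step 3 expects the actor and reward-model checkpoints from steps 1 and 2 to exist, so it is worth verifying that the reward-model run produced output before moving on. A sketch of such a check, assuming the step-2 script writes its model to an `output` directory (check the `--output_dir` value in your `run_350m.sh` for the actual location):

```shell
# Check that the step-2 run left a non-empty checkpoint directory behind.
# "output" is an assumed directory name; adjust to your script's output path.
ckpt_dir="output"
if [ -d "$ckpt_dir" ] && [ -n "$(ls -A "$ckpt_dir" 2>/dev/null)" ]; then
  ready="yes"
else
  ready="no"
fi
echo "reward-model checkpoint ready: $ready"
```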
Step 3 - RLHF
Step 3 is the most complex part of the 3-step InstructGPT pipeline; DeepSpeed Chat's Hybrid Engine provides enough acceleration to keep training time (and cost) manageable. See the Reinforcement Learning from Human Feedback (RLHF) documentation for more details. If you already have fine-tuned actor and reward model checkpoints, you can simply run the following script to launch PPO training.
# Move into the final step of the pipeline
cd training/step3_rlhf_finetuning/
# Run the training script
bash training_scripts/single_gpu/run_1.3b.sh
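The three steps above can be chained from the repository root. A sketch of that sequence, guarded so it is a no-op when the repo layout is absent; note that the argument-free invocations mirror the commands above, and your step-3 script may additionally take actor/critic checkpoint paths as arguments:

```shell
# End-to-end sketch: run SFT, reward-model training, then RLHF in order,
# from the DeepSpeed-Chat root. Skips the launches if the layout is missing.
ok=1
for step in step1_supervised_finetuning step2_reward_model_finetuning step3_rlhf_finetuning; do
  [ -d "training/$step" ] || ok=0
done
if [ "$ok" = 1 ]; then
  ( cd training/step1_supervised_finetuning && bash training_scripts/single_gpu/run_1.3b.sh )
  ( cd training/step2_reward_model_finetuning && bash training_scripts/single_gpu/run_350m.sh )
  ( cd training/step3_rlhf_finetuning && bash training_scripts/single_gpu/run_1.3b.sh )
else
  echo "repo layout not found; run this from the DeepSpeed-Chat root"
fi
```

Each step runs in a subshell so the working directory is restored between steps.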