[Paper Reading] Learning to summarize from human feedback


2023-09-05 22:08:23

Preface

More articles on large language models can be found at: ShiyuNee/Awesome-Large-Language-Models: Papers about large language models (github.com)

The repository is updated continuously.

Abstract

Training a model to optimize for human preferences can significantly improve summary quality.

Method

High-level methodology


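Starting from an initial model that has been fine-tuned with supervised learning (SFT) on the summarization dataset, the method goes through the following three steps:

1. Collect samples from existing models and have humans compare them
   - For each Reddit post, summaries are collected from several sources (the current model, the initial model, the original reference summaries, and other baselines). A batch of summary pairs is then sent to human labelers, who are asked to pick the best summary for a given Reddit post
2. Learn a reward model from the human comparisons
   - The reward model scores a <post, summary> pair; a higher score means a better summary. Its scores must agree with human preferences, i.e. a summary that humans judge to be better should receive a higher score
3. Train a policy against the reward model
   - The reward model scores the outputs generated by the policy, and reinforcement learning uses that score to optimize the current policy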
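To make steps 1 and 2 more concrete, here is a minimal sketch of what a single human comparison might look like as a data record, and of how one could check whether a scoring function agrees with human preferences. The names `Comparison`, `preference_accuracy`, and `toy_reward` are hypothetical and not from the paper; `toy_reward` is only a word-overlap stand-in for a learned reward model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Comparison:
    """One human judgment: which of two summaries of a post is better."""
    post: str
    summaries: Tuple[str, str]  # (summary_0, summary_1)
    preferred: int              # index (0 or 1) chosen by the labeler


def preference_accuracy(reward_fn: Callable[[str, str], float],
                        comparisons: List[Comparison]) -> float:
    """Fraction of comparisons where the higher-scored summary is the preferred one."""
    if not comparisons:
        raise ValueError("need at least one comparison")
    correct = 0
    for c in comparisons:
        scores = [reward_fn(c.post, s) for s in c.summaries]
        # Ties are resolved in favor of summary 1 (an arbitrary choice for this sketch).
        predicted = 0 if scores[0] > scores[1] else 1
        correct += int(predicted == c.preferred)
    return correct / len(comparisons)


def toy_reward(post: str, summary: str) -> float:
    """Hypothetical stand-in scorer: counts summary words that appear in the post."""
    post_words = set(post.lower().split())
    return float(sum(w in post_words for w in summary.lower().split()))


if __name__ == "__main__":
    data = [
        Comparison("my cat ate my homework", ("cat ate homework", "dog ran away"), 0),
        Comparison("i moved to a new city", ("nothing happened", "moved to new city"), 1),
    ]
    print(preference_accuracy(toy_reward, data))
```

A real reward model would replace `toy_reward` with a learned network, but the agreement check itself stays the same: the score ordering is compared against the label chosen by the human.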

Dataset and task

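- Data: filtered from the TL;DR dataset. The filtered set contains 123,169 posts, 5% of which are held out for validation
  - Why not the more commonly used CNN/DM dataset?
    - Because that dataset is too easy: simple extractive models already perform very well on it
- Task: train a model that generates summaries shorter than 48 tokens that are as good as possible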

Models

All models use the GPT-3 architecture; the human feedback experiments are run on the 1.3B and 6.7B parameter models

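- Pretrained models: autoregressive GPT-3 style language models
- Supervised baselines: GPT-3 fine-tuned on the filtered TL;DR dataset. These models are used to initialize the policy and the RM, to sample summary pairs for comparison, and as baselines for evaluation. In the final human evaluations, all models are sampled at T=0 (greedy decoding), because this was found to work best
- Reward models: initialized from the supervised baseline above, with a randomly initialized linear layer added to output a scalar score. Given two summaries $(y_0, y_1)$ of a post $x$, the model judges which one is better. If $y_1$ is the better one, the loss can be written as:

  $$\text{loss}(r_\theta) = -\mathbb{E}_{(x, y_0, y_1) \sim D}\left[\log \sigma\big(r_\theta(x, y_1) - r_\theta(x, y_0)\big)\right]$$

  where $r_\theta(x, y)$ is the scalar score the reward model assigns to summary $y$ of post $x$, $\sigma$ is the sigmoid function, and $D$ is the dataset of human comparisons
- Human feedback policies: initialized from the supervised baseline above. A policy is trained against the RM above with RL, using the PPO algorithm. A penalty term is added to the reward: the KL divergence between the learned policy and the original supervised model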
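The two training signals in the list above can be sketched numerically. The pairwise reward model loss is $-\log \sigma(r_{\text{preferred}} - r_{\text{other}})$, and the paper writes the KL-penalized reward as $R(x, y) = r_\theta(x, y) - \beta \log\left[\pi^{\text{RL}}(y \mid x) / \pi^{\text{SFT}}(y \mid x)\right]$. The sketch below works on plain floats rather than real model outputs; the function names and the `beta` values used in the examples are assumptions made for illustration, not values from the paper.

```python
import math


def pairwise_rm_loss(r_preferred: float, r_other: float) -> float:
    """Loss for one comparison: -log(sigmoid(r_preferred - r_other)).

    Uses the identity -log(sigmoid(d)) = log(1 + exp(-d)) and branches on the
    sign of d so that exp() never receives a large positive argument.
    """
    diff = r_preferred - r_other
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def kl_penalized_reward(rm_score: float,
                        logp_policy: float,
                        logp_sft: float,
                        beta: float) -> float:
    """Reward used for RL: the RM score minus beta * log(pi_RL(y|x) / pi_SFT(y|x)).

    logp_policy and logp_sft are the log-probabilities of the same sampled
    summary under the learned policy and the supervised model, respectively.
    """
    return rm_score - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # Equal scores: the RM is undecided, so the loss is log(2).
    print(pairwise_rm_loss(0.0, 0.0))
    # Scoring the preferred summary higher gives a lower loss than the reverse.
    print(pairwise_rm_loss(2.0, 0.0), pairwise_rm_loss(0.0, 2.0))
    # A summary that the policy finds much more likely than the SFT model is penalized.
    print(kl_penalized_reward(1.0, logp_policy=-4.0, logp_sft=-6.0, beta=0.5))
```

The penalty grows as the policy assigns more probability than the supervised model to its own samples, which discourages the policy from drifting toward outputs the reward model was never trained to judge.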


Discussion

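Limitations: both training and data collection are very time-consuming, so the method could not be scaled up to larger models

Future directions:

- The method can be applied to any task where samples can be compared
- The authors hope to extend it to tasks where humans cannot easily evaluate model outputs
- Forms of human feedback other than binary comparisons could be used

Broader impacts: the paper explores a general-purpose technique that can be used in a wide range of machine learning applications.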


Reposted from blog.csdn.net/qq_52852138/article/details/131253071
