Paper Title

On Reinforcement Learning and Distribution Matching for Fine-Tuning Language Models with no Catastrophic Forgetting

Paper Authors

Tomasz Korbak, Hady Elsahar, Germán Kruszewski, Marc Dymetman

Paper Abstract

The availability of large pre-trained models is changing the landscape of Machine Learning research and practice, moving from a training-from-scratch to a fine-tuning paradigm. While in some applications the goal is to "nudge" the pre-trained distribution towards preferred outputs, in others it is to steer it towards a different distribution over the sample space. Two main paradigms have emerged to tackle this challenge: Reward Maximization (RM) and, more recently, Distribution Matching (DM). RM applies standard Reinforcement Learning (RL) techniques, such as Policy Gradients, to gradually increase the reward signal. DM prescribes to first make explicit the target distribution that the model is fine-tuned to approximate. Here we explore the theoretical connections between the two paradigms, and show that methods such as KL-control developed for RM can also be construed as belonging to DM. We further observe that while DM differs from RM, it can suffer from similar training difficulties, such as high gradient variance. We leverage connections between the two paradigms to import the concept of baseline into DM methods. We empirically validate the benefits of adding a baseline on an array of controllable language generation tasks such as constraining topic, sentiment, and gender distributions in texts sampled from a language model. We observe superior performance in terms of constraint satisfaction, stability and sample efficiency.
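
The abstract's central technical move is importing the baseline, a variance-reduction device from RM-style policy gradients, into DM methods. As a hedged illustration only (not the authors' implementation), the sketch below contrasts the variance of a REINFORCE-style gradient estimator with and without a mean-signal baseline on a toy categorical "policy". The vocabulary size and the `reward` vector are hypothetical stand-ins for whichever per-sample signal the paradigm uses: a reward in RM, or an importance weight toward the target distribution in DM.

```python
# Minimal sketch of baseline-based variance reduction for a policy-gradient-style
# estimator. Toy setup (assumed, not from the paper): a categorical "policy" over
# a 5-symbol vocabulary and a hypothetical per-symbol signal `reward`.
import numpy as np

rng = np.random.default_rng(0)

logits = np.zeros(5)                           # parameters of the toy policy
reward = np.array([0.1, 0.2, 0.9, 0.2, 0.1])   # hypothetical per-sample signal


def sample_gradients(logits, n_samples, use_baseline):
    """Per-sample REINFORCE-style gradient estimates of E_x[reward(x)] w.r.t. logits."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    xs = rng.choice(len(probs), size=n_samples, p=probs)
    # Batch-mean baseline; estimating it from the same batch adds a small bias,
    # kept here only to keep the sketch short.
    baseline = reward[xs].mean() if use_baseline else 0.0
    grads = []
    for x in xs:
        # Score function for a softmax policy: grad log pi(x) = onehot(x) - probs.
        score = -probs.copy()
        score[x] += 1.0
        grads.append((reward[x] - baseline) * score)
    return np.array(grads)


for use_baseline in (False, True):
    grads = sample_gradients(logits, n_samples=10_000, use_baseline=use_baseline)
    print(f"baseline={use_baseline}: summed per-coordinate variance "
          f"{grads.var(axis=0).sum():.4f}")
```

Subtracting the baseline leaves the estimator's expectation essentially unchanged while shrinking its variance, which is the kind of stability and sample-efficiency gain the abstract reports when the idea is carried over to DM objectives.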
