Document detail
ID

oai:arXiv.org:2408.09834

Topic
Computer Science - Artificial Intelligence
Author
Xie, Shiming; Chen, Hong; Yu, Fred; Sun, Zeye; Wu, Xiuyu; Hu, Yingfan
Category

Computer Science

Year

2024

Listing date

9/4/2024

Keywords
human preference; DPO
Metrics

Abstract

Learning from human preference is a paradigm used in the large-scale language model (LLM) fine-tuning step to better align a pretrained LLM with human preferences for downstream tasks.

Previously, this relied on the reinforcement learning from human feedback (RLHF) algorithm to optimize the LLM policy to align with these preferences without drifting too far from the original model.

Recently, Direct Preference Optimization (DPO) has been proposed to solve the alignment problem with a simplified RL-free method.

Using preference pairs of chosen and rejected data, DPO models the relative log probability as an implicit reward function and directly optimizes the LLM policy with a simple binary cross-entropy objective.
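
For reference, the implicit reward and objective described here correspond to the standard DPO loss of Rafailov et al. (2023); the notation below is an illustrative restatement, not taken from this article:

$\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ scales the implicit reward.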

DPO is quite straightforward and easy to understand.

It performs efficiently and well in most cases.

In this article, we analyze the working mechanism of $\beta$ in DPO, disclose its syntactic difference between the RL algorithm and DPO, and examine the potential shortcomings introduced by the DPO simplification.

With these insights, we propose MinorDPO, which is better aligned with the original RL algorithm and increases the stability of the preference optimization process.
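
As a concrete illustration of the objective described above, here is a minimal PyTorch-style sketch of the standard DPO loss (not the MinorDPO variant, whose exact reject penalty is not spelled out in this abstract); the function and variable names are illustrative assumptions:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Inputs are per-sequence log-probabilities, shape (batch,).
    # Implicit rewards: beta-scaled log-ratios between policy and frozen reference.
    chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
    # Binary cross-entropy on the reward margin: -log sigmoid(chosen - rejected).
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()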

Comment: 8 pages, 19 figures

Xie, Shiming; Chen, Hong; Yu, Fred; Sun, Zeye; Wu, Xiuyu; Hu, Yingfan, 2024, Minor DPO reject penalty to increase training robustness
