Secrets of RLHF in Large Language Models Part II: Reward Modeling
Published in CoRR abs/2401.06080, 2024
From a data perspective, we propose a method to measure the strength of preferences within the data, based on a voting mechanism across multiple reward models. From an algorithmic standpoint, we introduce contrastive learning to enhance the ability of reward models to distinguish between chosen and rejected responses, thereby improving model generalization.
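The voting idea can be illustrated with a minimal sketch: each reward model in an ensemble "votes" through the reward gap it assigns between the chosen and rejected response, and the aggregated gap serves as a preference-strength score. The sketch below assumes a simple mean/standard-deviation aggregation; the `RewardModel` signature and the name `preference_strength` are illustrative, not taken from the paper or its released code.

```python
# Illustrative sketch: estimating preference strength for a
# (chosen, rejected) pair by voting across an ensemble of reward models.
from typing import Callable, List
import statistics

# A reward model maps (prompt, response) to a scalar reward.
RewardModel = Callable[[str, str], float]


def preference_strength(
    reward_models: List[RewardModel],
    prompt: str,
    chosen: str,
    rejected: str,
) -> float:
    """Mean reward gap (chosen minus rejected) across the ensemble,
    normalized by its standard deviation. A large positive value means
    the models agree the chosen response is better; a value near zero
    or negative flags a weak, ambiguous, or possibly mislabeled pair."""
    gaps = [rm(prompt, chosen) - rm(prompt, rejected) for rm in reward_models]
    mean_gap = statistics.mean(gaps)
    std_gap = statistics.stdev(gaps) if len(gaps) > 1 else 1.0
    return mean_gap / (std_gap + 1e-8)
```

One natural use of such a score is to down-weight or re-examine low-strength pairs before training the reward model.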
Recommended citation: Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang: Secrets of RLHF in Large Language Models Part II: Reward Modeling. CoRR abs/2401.06080 (2024) http://xuanjing-huang.github.io/files/reward.pdf