Abstract
Reinforcement learning from human feedback (RLHF) has emerged as the primary method for aligning large language models (LLMs) with human preferences. The RLHF process typically starts by training a reward model (RM) on human preference data. Conventional RMs are trained on pairs of responses to the same user request, with relative ratings indicating which response humans prefer. The trained RM then serves as a proxy for human preferences. However, due to the black-box nature of RMs, their outputs lack interpretability: humans cannot intuitively understand why an RM judges a response to be good or bad. Because RMs act as proxies for human preferences, it is desirable for them to be human-interpretable, both to ensure that their internal decision processes are consistent with human preferences and to prevent reward hacking during LLM alignment. To build RMs with interpretable preferences, we propose a two-stage approach: i) train an Absolute-Rating Multi-Objective Reward Model (ArmoRM) with multi-dimensional absolute-rating data, where each dimension corresponds to a human-interpretable objective (e.g., honesty, verbosity, safety); ii) employ a Mixture-of-Experts (MoE) strategy with a gating network that automatically selects the most suitable reward objectives based on the context. We efficiently trained an ArmoRM with Llama-3 8B and a gating network consisting of a shallow MLP on top of the ArmoRM. Our trained model, ArmoRM-Llama3-8B, obtains state-of-the-art performance on RewardBench, a benchmark for evaluating RMs for language modeling. Notably, our model surpasses the LLM-as-a-judge method with GPT-4 judges by a clear margin, and approaches the performance of the much larger Nemotron-4 340B reward model. Our code and model are released at https://github.com/RLHFlow/RLHF-Reward-Modeling.
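To make the two-stage design concrete, the following is a minimal PyTorch sketch of an ArmoRM-style reward head, not the released implementation: a linear regression head maps the LM's final hidden state to k interpretable objective scores, and a shallow gating MLP over the prompt representation produces softmax weights that combine those scores into a scalar reward. All layer sizes, names, and the choice of k are illustrative assumptions.

```python
import torch
import torch.nn as nn


class ArmoRMStyleHead(nn.Module):
    """Illustrative sketch of an ArmoRM-style reward head (hypothetical, not the released code).

    Stage i): `regression_head` maps the LM's final hidden state of a
    (prompt, response) pair to k human-interpretable objective scores
    (e.g., honesty, verbosity, safety).
    Stage ii): `gating` is a shallow MLP over the prompt hidden state that
    outputs non-negative mixing weights summing to 1 (a soft mixture over
    objectives), so the scalar reward is a convex combination of
    interpretable objective scores.
    """

    def __init__(self, hidden_size: int = 4096, num_objectives: int = 16):
        super().__init__()
        # Multi-objective absolute-rating regression head: (hidden) -> (k scores).
        self.regression_head = nn.Linear(hidden_size, num_objectives)
        # Shallow gating MLP (sizes are assumptions for the sketch).
        self.gating = nn.Sequential(
            nn.Linear(hidden_size, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_objectives),
        )

    def forward(self, prompt_hidden: torch.Tensor, pair_hidden: torch.Tensor):
        # Per-objective scores for the (prompt, response) pair: shape (batch, k).
        objective_scores = self.regression_head(pair_hidden)
        # Context-dependent mixing weights computed from the prompt alone: (batch, k).
        weights = torch.softmax(self.gating(prompt_hidden), dim=-1)
        # Scalar reward: convex combination of the interpretable objectives.
        reward = (weights * objective_scores).sum(dim=-1)
        return reward, objective_scores, weights
```

In this sketch, the gate is conditioned on the prompt rather than the response, so the mixing weights reflect what the request calls for (e.g., weighting safety highly for a risky query) instead of properties of the answer itself; inspecting `objective_scores` and `weights` is what makes the resulting preference interpretable.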
| Original language | English |
|---|---|
| Title of host publication | EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024 |
| Editors | Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen |
| Publisher | Association for Computational Linguistics (ACL) |
| Pages | 10582-10592 |
| Number of pages | 11 |
| ISBN (Electronic) | 9798891761681 |
| DOIs | |
| Publication status | Published - 2024 |
| Externally published | Yes |
| Event | 2024 Findings of the Association for Computational Linguistics, EMNLP 2024 |
| Location | Hybrid, Miami, United States |
| Duration | 12 Nov 2024 → 16 Nov 2024 |
Publication series
| Name | EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024 |
|---|---|
Conference
| Conference | 2024 Findings of the Association for Computational Linguistics, EMNLP 2024 |
|---|---|
| Country/Territory | United States |
| City | Hybrid, Miami |
| Period | 12/11/24 → 16/11/24 |
Bibliographical note
Publisher Copyright: © 2024 Association for Computational Linguistics.