Boosting Policy and Process Reward Models with Monte Carlo Tree Search in Open-Domain QA

Chi-min CHAN, Chunpu XU, Junqi ZHU, Jiaming JI, Donghai HONG, Pengcheng WEN, Chunyang JIANG, Zhen YE, Yaodong YANG, Wei XUE, Sirui HAN*, Yike GUO*

*Corresponding author for this work

Research output: Chapter in Book/Conference Proceeding/Report – Conference paper published in a book (peer-reviewed)

Abstract

The recent introduction of OpenAI's o1/o3 models represents a significant milestone in developing strong reasoning capabilities in Large Language Models (LLMs). By allocating more computational budget at test time, LLMs can explore more accurate, higher-quality solutions. However, this paradigm has primarily been verified in domains with well-defined criteria for responses, such as coding and mathematics. Inspired by its success, we aim to extend it to the more nuanced setting of open-domain question answering. Specifically, we employ search mechanisms such as Monte Carlo Tree Search (MCTS) to improve both the policy model and the reward model, achieving better performance under test-time scaling strategies.

Our contributions are twofold. For the training phase, we demonstrate that our approach surpasses previous SOTA automatic data annotation methods and various public instruction-tuning datasets while using fewer data points, offering a more data-efficient solution for training robust models. For the inference phase, we reuse the intermediate values collected during training data construction to train a process reward model called PRM+. This model employs a novel two-stage training method to provide finer-grained guidance across the generation trajectory, introducing no additional overhead during training data collection and further enhancing performance as test-time computation scales. Experimental results show that our method effectively improves the performance of both the policy model and the reward model.
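The abstract's core idea — running MCTS over generation steps and reusing the tree's intermediate node values as free process-reward labels — can be illustrated with a toy sketch. This is not the paper's implementation: the action set, sequence length, and terminal reward below are stand-ins I invented for illustration (an LLM would sample reasoning steps, and the terminal reward would come from an answer verifier or judge).

```python
import math
import random

# Toy stand-in for step-wise generation: build a sequence of up to 4 steps,
# each drawn from a small action set. The terminal reward prefers sequences
# whose sum hits a target, standing in for an outcome verifier.
ACTIONS = [1, 2, 3]
MAX_LEN = 4
TARGET = 8

def terminal_reward(seq):
    # 1.0 for hitting the target exactly, decaying linearly with distance.
    return max(0.0, 1.0 - abs(sum(seq) - TARGET) / TARGET)

class Node:
    def __init__(self, seq, parent=None):
        self.seq = seq              # prefix of steps taken so far
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    @property
    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def uct_select(node, c=1.4):
    # Standard UCT: exploit high-value children, explore rarely-visited ones.
    return max(
        node.children.values(),
        key=lambda ch: ch.value + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def rollout(seq):
    # Random playout from a prefix to a terminal sequence.
    seq = list(seq)
    while len(seq) < MAX_LEN:
        seq.append(random.choice(ACTIONS))
    return terminal_reward(seq)

def mcts(n_iters=500, seed=0):
    random.seed(seed)
    root = Node(())
    for _ in range(n_iters):
        node = root
        # 1. Selection: descend while fully expanded and non-terminal.
        while len(node.seq) < MAX_LEN and len(node.children) == len(ACTIONS):
            node = uct_select(node)
        # 2. Expansion: add one untried child if non-terminal.
        if len(node.seq) < MAX_LEN:
            action = random.choice([a for a in ACTIONS if a not in node.children])
            node.children[action] = Node(node.seq + (action,), parent=node)
            node = node.children[action]
        # 3. Simulation to a terminal state.
        reward = rollout(node.seq)
        # 4. Backpropagation of the outcome along the path.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent
    return root

def collect_prm_labels(root):
    # Harvest (prefix, mean value) pairs from the finished tree. These
    # intermediate values come for free from search and can supervise a
    # process reward model, in the spirit of the PRM+ training data.
    labels, stack = {}, [root]
    while stack:
        node = stack.pop()
        if node.visits > 0:
            labels[node.seq] = node.value
        stack.extend(node.children.values())
    return labels

root = mcts()
labels = collect_prm_labels(root)
```

After search, `labels` maps every visited step prefix to its averaged downstream outcome, which is exactly the kind of per-step signal a process reward model needs but that outcome-only annotation cannot provide.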
Original language: English
Title of host publication: Findings of the Association for Computational Linguistics: ACL 2025
Editors: Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Publisher: Association for Computational Linguistics (ACL)
Pages: 7433-7451
Number of pages: 19
Publication status: Accepted/In press - May 2025
Event: The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) - Vienna, Austria
Duration: 27 Jul 2025 to 1 Aug 2025

Conference

Conference: The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025)
Country/Territory: Austria
City: Vienna
Period: 27/07/25 to 1/08/25
