Daily Arxiv

This page curates AI-related papers published worldwide.
All content is summarized using Google Gemini, and the page is operated on a non-profit basis.
Copyright for each paper belongs to its authors and their institutions; please credit the source when sharing.

Prior Prompt Engineering for Reinforcement Fine-Tuning

Created by
  • Haebom

Authors

Pittawat Taveekitworachai, Potsawee Manakul, Sarana Nutanong, Kunat Pipatanakul

Outline

This paper investigates the effectiveness of prior prompt engineering (pPE) in reinforcement fine-tuning (RFT). While previous RFT research has focused primarily on algorithms, reward design, and data curation, the design of the prior prompt, i.e., the instruction prepended to queries during training (for example, guidance to reason step by step), has been understudied. The paper asks whether different pPE approaches can induce distinct behaviors in language models (LMs) after RFT. Five strategies from inference-time prompt engineering (iPE), namely reasoning, planning, code-based reasoning, knowledge recall, and null-example utilization, are translated into pPE and applied to Qwen2.5-7B, with performance evaluated on benchmarks such as AIME2024, HumanEval+, and GPQA-Diamond. Experimental results show that all pPE-trained models outperform their iPE-prompted counterparts, with the null-example pPE approach achieving the largest gains, most notably on AIME2024 and GPQA-Diamond. Furthermore, using a behavior-classification framework, the authors show that different pPE strategies instill distinct behavioral styles in the resulting models.
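To make the idea concrete, the minimal sketch below (not the authors' code; the prompt texts, function names, and policy-update call are illustrative assumptions) shows how a pPE instruction is prepended to every training query before RFT rollouts, whereas in iPE the same instruction would only be added at inference time.

```python
# Minimal sketch of pPE during RFT, assuming a generic policy model with
# `generate` and `policy_update` methods and a verifiable reward function.
# Prompt texts and names below are hypothetical, not taken from the paper.

PPE_PROMPTS = {
    "reasoning": "Think step by step before giving the final answer.",
    "planning": "First write a short plan, then follow it to answer.",
    "code_based": "Reason by writing and mentally executing code.",
    "knowledge_recall": "Recall relevant facts before answering.",
    "null_example": "Consider examples where no valid answer exists before answering.",
}

def build_training_prompt(query: str, strategy: str) -> str:
    """Prepend the chosen prior prompt (pPE) to a raw training query."""
    return f"{PPE_PROMPTS[strategy]}\n\n{query}"

def rft_step(model, queries, reward_fn, strategy="null_example"):
    """One hypothetical RFT update: roll out completions on pPE-augmented
    queries, score them with a verifiable reward, and update the policy
    (e.g., with a GRPO/PPO-style step)."""
    prompts = [build_training_prompt(q, strategy) for q in queries]
    completions = model.generate(prompts)                 # policy rollouts
    rewards = [reward_fn(q, c) for q, c in zip(queries, completions)]
    model.policy_update(prompts, completions, rewards)    # reinforcement update
    return rewards
```

At evaluation time, the pPE-trained model is queried without the prepended instruction, which is what distinguishes the comparison from iPE, where the instruction is added only at inference.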

Takeaways, Limitations

Takeaways:
pPE is shown to be an important and previously understudied component of RFT.
Diverse iPE strategies can be repurposed as pPE to improve model performance after RFT.
The null-example pPE approach yields the largest performance gains.
pPE is effective for steering the model's behavioral style.
The results underscore the importance of pPE for future RFT research.
Limitations:
The results are limited to a single model (Qwen2.5-7B) and a small set of benchmarks, so their generalizability is uncertain.
Further research is needed on other LMs and other RFT algorithms.
Interactions among pPE strategies and their optimal combinations remain unexplored.
The computational cost and efficiency of pPE are not analyzed.