Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

Created by: Haebom

Authors

Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi

💡 Overview

This paper questions the prevailing claim that reinforcement learning with verifiable rewards (RLVR) reliably improves large language models on structured tasks such as math and code. The authors argue that reported RLVR gains may be overstated because of three confounders: mismatched budgets, an inflated number of attempts, and data contamination. They propose a new evaluation methodology to test for these effects.
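To make the attempt-count confounder concrete, here is a minimal Python sketch, not the paper's protocol. It uses the standard unbiased pass@k estimator, and the per-problem counts are made-up illustrative numbers. The names `pass_at_k`, `mean_pass_at_k`, `base_counts`, and `rlvr_counts` are assumptions introduced for this example. It shows how comparing an RLVR model at k = 8 against a base model at k = 1 can produce a gap that vanishes once both are scored at the same k.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: n samples drawn, c correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n: int, k: int) -> float:
    """Average pass@k over a set of problems."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical data: number of correct answers out of n = 8 samples per problem.
n = 8
base_counts = [1, 0, 2, 1]
rlvr_counts = [4, 0, 5, 3]

# Mismatched comparison: RLVR gets 8 attempts, the base model gets 1.
mismatched_gap = mean_pass_at_k(rlvr_counts, n, 8) - mean_pass_at_k(base_counts, n, 1)

# Matched comparison: both models get 8 attempts.
matched_gap = mean_pass_at_k(rlvr_counts, n, 8) - mean_pass_at_k(base_counts, n, 8)

# A simple saturation curve: pass@k for both models at every matched k.
curve = [
    (k, mean_pass_at_k(base_counts, n, k), mean_pass_at_k(rlvr_counts, n, k))
    for k in range(1, n + 1)
]
```

On this toy data the mismatched gap is large while the matched gap at k = 8 is zero, because both models solve the same three of four problems given enough attempts. Real evaluations would also need to match token budgets and prompts, which this sketch does not model.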

🔑 Implications and Limitations

• RLVR performance gains depend heavily on whether budgets, prompts, and dataset versions are matched. Once these confounders are removed, previously reported performance gaps can shrink or disappear.
• Current RLVR measurement practice can overestimate capability gains and overlook reliability costs, so stricter and more transparent evaluation standards are needed.
• The proposed "tax-aware minimum standard" comprises budget-matched saturation curves, calibration, abstention tracking, LLM-judge robustness checks, and contamination screening, all intended to support sound RLVR training and evaluation. Reasoning gains reported without applying it should be treated as provisional.