VERIFY-RL: Verifiable Recursive Decomposition for Reinforcement Learning in Mathematical Reasoning

Created by
  • Haebom

Authors

Kaleem Ullah Qasim, Jiashu Zhang, Hao Li, Muhammad Kafeel Shaheen

💡 Overview

This study proposes Verify-RL, a reinforcement learning method for solving complex mathematical problems. Using symbolic differentiation, Verify-RL requires every problem decomposition to satisfy three verifiable conditions: structural complexity reduction, solution containment, and formal rule derivation. This overcomes the limitations of existing heuristic decomposition methods. By effectively eliminating invalid decompositions, the method more than doubles accuracy on the hardest problems and improves overall performance by 40%.
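The summary names the three conditions but does not define them, so the following is only a minimal sketch of one possible reading, not the authors' implementation. It assumes expressions are nested tuples, uses node count as a stand-in for structural complexity, and checks solution containment numerically at a few sample points. The names `RULES`, `verify`, `size`, and the rule labels are all hypothetical.

```python
import math

# Expressions are nested tuples (an assumption of this sketch):
# ("x",), ("const", c), ("add", a, b), ("mul", a, b), ("sin", a), ("cos", a)
X = ("x",)

def size(e):
    """Node count, used here as a proxy for structural complexity."""
    return 1 + sum(size(c) for c in e[1:] if isinstance(c, tuple))

def evaluate(e, x):
    """Evaluate expression e at the numeric point x."""
    op = e[0]
    if op == "x":
        return x
    if op == "const":
        return e[1]
    if op == "add":
        return evaluate(e[1], x) + evaluate(e[2], x)
    if op == "mul":
        return evaluate(e[1], x) * evaluate(e[2], x)
    if op == "sin":
        return math.sin(evaluate(e[1], x))
    if op == "cos":
        return math.cos(evaluate(e[1], x))
    raise ValueError(f"unknown operator: {op}")

def diff(e):
    """Reference symbolic derivative with respect to x (no simplification)."""
    op = e[0]
    if op == "x":
        return ("const", 1.0)
    if op == "const":
        return ("const", 0.0)
    if op == "add":
        return ("add", diff(e[1]), diff(e[2]))
    if op == "mul":
        return ("add", ("mul", diff(e[1]), e[2]), ("mul", e[1], diff(e[2])))
    if op == "sin":
        return ("mul", ("cos", e[1]), diff(e[1]))
    if op == "cos":
        return ("mul", ("mul", ("const", -1.0), ("sin", e[1])), diff(e[1]))
    raise ValueError(f"unknown operator: {op}")

# Hypothetical rule table: rule name -> (operator it applies to, number of
# subproblems, function that combines the sub-derivatives d into the parent's).
RULES = {
    "sum": ("add", 2, lambda e, d: ("add", d[0], d[1])),
    "product": ("mul", 2,
                lambda e, d: ("add", ("mul", d[0], e[2]), ("mul", e[1], d[1]))),
    "chain_sin": ("sin", 1, lambda e, d: ("mul", ("cos", e[1]), d[0])),
}

def verify(e, rule, subs, points=(0.3, 1.1, 2.7)):
    """Return True only if the proposed decomposition passes all three checks."""
    # Formal rule derivation: the rule must exist and apply to e's top operator.
    if rule not in RULES:
        return False
    op, arity, combine = RULES[rule]
    if e[0] != op or len(subs) != arity:
        return False
    # Structural complexity reduction: every subproblem is strictly smaller.
    if any(size(s) >= size(e) for s in subs):
        return False
    # Solution containment: combining the sub-solutions must reproduce the
    # parent's derivative (checked numerically at sample points only).
    combined = combine(e, [diff(s) for s in subs])
    target = diff(e)
    return all(math.isclose(evaluate(combined, p), evaluate(target, p))
               for p in points)

problem = ("mul", X, ("sin", X))                    # d/dx [x * sin(x)]
print(verify(problem, "product", [X, ("sin", X)]))  # True
print(verify(problem, "product", [X, X]))           # False
```

In a training loop, a check like this could gate the reward so that invalid decompositions earn nothing; how Verify-RL actually shapes its reward is not stated in this summary.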

🔑 Implications and Limitations

  • The study demonstrates that verifiable recursive decomposition, implemented through symbolic differentiation, is a key factor in improving reinforcement learning performance on mathematical reasoning.
  • Automated verification removes the uncertainty of existing decomposition methods that rely on intuition, and increases reliability.
  • The proposed verification conditions (structural complexity reduction, solution containment, formal rule derivation) provide a systematic approach to solving complex problems.
  • General applicability to all mathematical domains, and scalability to problems of varying complexity, require further research.