
Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

Posted by
  • Haebom

Authors

Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang

💡 Overview

This paper points out the limitations of GRPO, a reinforcement learning method for strengthening the reasoning ability of language models, and proposes ConSPO as an improvement. GRPO optimizes the policy by maximizing the score gap between verified positive and negative outcomes. However, it has two problems: it uses surrogate scores that are not aligned with the actual sequence probability, and it assigns the same weight to every outcome. To address these problems, ConSPO uses the log-probability normalized by sequence length as the score and improves performance through contrastive learning between positive and negative outcomes.
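To make the two ingredients named above concrete, here is a minimal sketch of a length-normalized log-probability score and a contrastive loss over positive and negative rollouts. The summary does not give ConSPO's exact objective, so the InfoNCE-style form, the `margin` and `temperature` parameters, and both function names are assumptions made for illustration only.

```python
import math


def length_normalized_logprob(token_logprobs):
    """Score a response by its mean per-token log-probability."""
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def contrastive_loss(pos_scores, neg_scores, margin=0.0, temperature=1.0):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    Each positive score is contrasted against all negative scores.
    `margin` is subtracted from the positive score, so a positive has to
    beat the negatives by at least that much to reach a low loss.
    """
    if not pos_scores:
        raise ValueError("need at least one positive score")
    losses = []
    for s_pos in pos_scores:
        logits = [(s_pos - margin) / temperature]
        logits += [s_neg / temperature for s_neg in neg_scores]
        # Numerically stable log-sum-exp over the positive and all negatives.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log of the softmax probability given to the positive.
        losses.append(log_z - logits[0])
    return sum(losses) / len(losses)


# Hypothetical per-token log-probabilities for one correct and one
# incorrect response to the same prompt.
pos = [length_normalized_logprob([-0.1, -0.2, -0.3])]
neg = [length_normalized_logprob([-1.0, -2.0])]
loss = contrastive_loss(pos, neg, margin=0.0)
```

In this sketch the loss shrinks as the verified-correct response becomes more likely per token than the incorrect one, and a larger `margin` demands a wider gap before the loss becomes small.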

🔑 Implications and Limitations

• Offers a new perspective on how verifiable rewards are used in reinforcement learning.
• Clearly identifies two main limitations of GRPO: likelihood-misaligned surrogate scores and score-insensitive credit assignment.
• The proposed ConSPO outperforms existing methods on a range of reasoning tasks, demonstrating the practical effectiveness of reinforcement learning with verifiable rewards.
• Curriculum learning and margin design play an important role in maximizing the effect of ConSPO.
• The general applicability of ConSPO and validation of its performance across different LLM architectures remain topics for future research.
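The equal-weighting limitation attributed to GRPO above can be illustrated with the commonly used group-normalized advantage. This is only a sketch of that standard formulation, not code from the paper. The function name, the `eps` constant, and the choice of population standard deviation are assumptions, and implementations differ on these details.

```python
import statistics


def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumed choice
    return [(r - mean) / (std + eps) for r in rewards]


# Four hypothetical rollouts for one prompt: two verified correct, two incorrect.
rewards = [1.0, 1.0, 0.0, 0.0]
advantages = grpo_advantages(rewards)
```

With binary rewards, every correct rollout receives the same advantage and every incorrect rollout receives the same negative one, regardless of how likely the policy already considers each response. This is one way to read the term score-insensitive credit assignment.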