
Blockwise Advantage Estimation for Multi-Objective RL with Verifiable Rewards

Created by
  • Haebom

Authors

Kirill Pavlenko, Alexander Golubev, Simon Karasik, Boris Yangel

💡 Overview

Existing GRPO methods assign a single scalar advantage to every token, which causes interference between objectives in structured generation tasks that have several objectives. This paper addresses the problem with Blockwise Advantage Estimation, which computes a separate advantage for each objective and applies it only to the corresponding block of text. The proposed method mitigates reward interference and offers a modular approach to optimizing sequential objectives without additional rollouts.
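The contrast described above, one scalar advantage per sample versus one advantage per objective applied to its own block, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the two-block layout (an answer block followed by a confidence block), the toy rewards, and the mean/standard-deviation group normalization are all assumptions made for the example.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-8):
    """Group-relative normalization (assumed form): (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def scalar_token_advantages(block_lengths, rewards_per_objective):
    """Single-scalar variant: sum the objective rewards per sample, normalize
    across the group, and give every token of a sample the same advantage."""
    totals = [sum(rs) for rs in zip(*rewards_per_objective)]
    adv = group_normalize(totals)
    return [[a] * sum(lengths) for a, lengths in zip(adv, block_lengths)]


def blockwise_token_advantages(block_lengths, rewards_per_objective):
    """Blockwise variant: normalize each objective on its own and apply the
    result only to the tokens of the block that objective belongs to."""
    per_objective = [group_normalize(rs) for rs in rewards_per_objective]
    out = []
    for i, lengths in enumerate(block_lengths):
        tokens = []
        for k, n in enumerate(lengths):
            tokens.extend([per_objective[k][i]] * n)
        out.append(tokens)
    return out


# Hypothetical group of 4 sampled completions, each with 2 blocks.
# block_lengths[i][k] = number of tokens in block k of sample i.
block_lengths = [[3, 2], [2, 2], [3, 1], [2, 2]]
# rewards_per_objective[k][i] = reward of objective k for sample i.
rewards_per_objective = [
    [1.0, 1.0, 0.0, 0.0],  # objective 0: answer correctness
    [0.0, 1.0, 1.0, 0.0],  # objective 1: confidence quality
]

scalar = scalar_token_advantages(block_lengths, rewards_per_objective)
blockwise = blockwise_token_advantages(block_lengths, rewards_per_objective)
```

In this toy group, sample 0 has a correct answer but a poor confidence block. With the summed scalar reward its total equals the group mean, so every token receives zero advantage. With blockwise assignment the answer tokens receive a positive advantage and the confidence tokens a negative one, which is the kind of interference the summary refers to.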

🔑 Implications and Limitations

• In multi-objective reinforcement learning, estimating an independent advantage for each objective can effectively mitigate interference between objectives.
• The Outcome-Conditioned Baseline approximates intermediate state values without expensive nested rollouts, which improves computational efficiency.
• The proposed method maintains performance without complex reward engineering and preserves the test-time gains of Confidence-Weighted Ensembling.
• When estimating advantages for later blocks, the reward is conditioned on the sampled prefix. This remains an open challenge and may require further research.
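The summary does not spell out how the Outcome-Conditioned Baseline is computed, so the following is only one plausible reading, written as a sketch. It assumes that each sampled prefix can be labeled with a discrete intermediate outcome (here, whether the answer block was correct) and that the baseline for a later block is the mean later-block reward among samples in the same group that share that outcome. The function names, the outcome labels, and the rewards are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def pooled_advantages(later_rewards):
    """Reference point: subtract one group-wide mean, ignoring the prefix."""
    baseline = mean(later_rewards)
    return [r - baseline for r in later_rewards]


def outcome_conditioned_advantages(prefix_outcomes, later_rewards):
    """Assumed rule: the baseline for a sample is the mean later-block reward
    of the samples whose prefix produced the same intermediate outcome."""
    buckets = defaultdict(list)
    for outcome, reward in zip(prefix_outcomes, later_rewards):
        buckets[outcome].append(reward)
    baselines = {outcome: mean(rs) for outcome, rs in buckets.items()}
    return [r - baselines[o] for o, r in zip(prefix_outcomes, later_rewards)]


# Hypothetical group of 4 samples drawn once; no extra rollouts are used.
prefix_outcomes = ["correct", "correct", "wrong", "wrong"]
later_rewards = [1.0, 0.5, 0.5, 0.0]

pooled = pooled_advantages(later_rewards)
conditioned = outcome_conditioned_advantages(prefix_outcomes, later_rewards)
```

Under these assumptions the pooled baseline gives the second and third samples the same zero advantage even though they followed different prefixes, while the conditioned baseline compares each later block only against samples that reached the same intermediate outcome. Only statistics from the already sampled group are used, which matches the summary's point about avoiding nested rollouts.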