Unifying Group-Relative and Self-Distillation Policy Optimization via Sample Routing

Created by
  • Haebom
Authors

Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, Tat-Seng Chua

💡 Overview

This paper proposes Sample-Routed Policy Optimization (SRPO), a unified framework that addresses two problems in RLVR, the reinforcement learning approach used for post-training large language models: the coarse credit assignment of GRPO and the instability of SDPO. SRPO routes correct samples to GRPO's reward-based reinforcement learning and failed samples to SDPO's logit-level correction, and it uses an entropy-based dynamic weighting mechanism to emphasize reliable distillation targets.
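The routing described above can be illustrated with a small sketch. This is not the paper's implementation: the summary gives no formulas, so the correctness threshold, the standardized group-relative advantage, and the weight `1 - normalized entropy` are all assumptions chosen for illustration, and every function name here is hypothetical.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: each reward standardized within its sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def route_samples(rewards, correct_threshold=1.0):
    """Split sample indices by outcome (threshold is an assumption).

    Correct samples would go to the reward-based update,
    failed samples to the logit-level distillation update.
    """
    correct = [i for i, r in enumerate(rewards) if r >= correct_threshold]
    failed = [i for i, r in enumerate(rewards) if r < correct_threshold]
    return correct, failed


def entropy_weight(teacher_probs):
    """Hypothetical reliability weight in [0, 1] for a distillation target.

    Computed as 1 minus the entropy of the teacher distribution divided by
    its maximum possible entropy, so confident (low-entropy) targets weigh
    more. Assumes at least two vocabulary entries.
    """
    entropy = -sum(p * math.log(p) for p in teacher_probs if p > 0)
    return 1.0 - entropy / math.log(len(teacher_probs))


# One group of four rollouts for the same prompt, with binary verifiable rewards.
rewards = [1.0, 0.0, 1.0, 0.0]
correct, failed = route_samples(rewards)
advantages = group_relative_advantages(rewards)

# A confident teacher distribution gets a higher weight than a uniform one.
confident = entropy_weight([0.97, 0.01, 0.01, 0.01])
uniform = entropy_weight([0.25, 0.25, 0.25, 0.25])
```

In this sketch only the indices in `correct` would use `advantages`, while each failed sample's distillation loss would be scaled per token by `entropy_weight`. How the paper actually combines the two terms is not stated in this summary.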

🔑 Implications and Limitations

• Combines the strengths of GRPO and SDPO to achieve both fast early training and long-term training stability.
• Shows improved performance and efficiency over existing methods, and in particular substantially raises the benchmark average on the Qwen3-8B model.
• The entropy-based dynamic weighting mechanism effectively manages the reliability of the distillation signal.
• The generalizability of the proposed SRPO framework and its performance across diverse model architectures still need to be verified.