Reinforcement-aware Knowledge Distillation for LLM Reasoning

Created by
  • Haebom

Authors

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

💡 Overview

This paper proposes knowledge distillation (KD) as a way to reduce the high inference cost of large language models (LLMs) whose reasoning performance has been improved through reinforcement learning (RL). It argues that existing KD methods suffer from distribution mismatch and conflicting objectives when applied in an RL setting. To overcome this, it introduces RL-aware distillation (RLAD), which imitates the teacher only when doing so helps the policy update during RL. Its core technique, Trust Region Ratio Distillation (TRRD), uses a PPO/GRPO-style probability-ratio objective to make distillation advantage-aware and trust-region-bounded.
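For orientation, the PPO/GRPO-style probability-ratio objective mentioned above builds on the standard clipped-surrogate form. The sketch below shows only that generic form for a single sample. The function name `clipped_ratio_objective`, the `eps` default, and the choice of which distribution supplies `logp_ref` are illustrative assumptions; this summary does not specify how TRRD brings the teacher into the ratio, so this should not be read as the paper's objective.

```python
import math


def clipped_ratio_objective(logp_new, logp_ref, advantage, eps=0.2):
    """Generic PPO-style clipped surrogate for one sample (illustrative only).

    logp_new:  log-probability of the sampled action under the policy being updated
    logp_ref:  log-probability under the reference distribution in the ratio's
               denominator (assumption: which distribution this is in TRRD is
               not stated in the summary)
    advantage: advantage estimate for the sample
    eps:       trust-region half-width (hypothetical default)
    """
    # Probability ratio, computed from log-probabilities for numerical stability.
    ratio = math.exp(logp_new - logp_ref)
    # Restrict the ratio to the trust region [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic (lower) bound of the unclipped and clipped surrogate terms.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the objective stops rewarding the policy once the ratio exceeds `1 + eps`. With a negative advantage, it stops rewarding further decreases once the ratio drops below `1 - eps`. This is the sense in which such an objective is both advantage-aware and trust-region-bounded.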

🔑 Implications and Limitations

• RLAD effectively addresses the teacher-student distribution mismatch that arises during RL training and reduces the conflict between imitation and reward maximization.
• Across a range of logical reasoning and math benchmarks, RLAD outperforms conventional offline distillation, standard GRPO, and KL-based distillation.
• TRRD naturally balances exploration, exploitation, and imitation, which leads to efficient knowledge transfer.
• The complexity of the proposed method and its dependence on specific RL algorithms (PPO/GRPO) are points that future work may need to consider.