
SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

Posted by
  • Haebom

Authors

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

💡 Overview

๊ธฐ์กด์˜ ์˜จ-ํด๋ฆฌ์‹œ ๊ฐ•ํ™”ํ•™์Šต(RL)์€ ์–ธ์–ด ๋ชจ๋ธ์˜ ์ถ”๋ก  ์ •๋ ฌ์— ํšจ๊ณผ์ ์ด์ง€๋งŒ, ํ† ํฐ ์ˆ˜์ค€์˜ ํฌ๋ ˆ๋”ง ํ• ๋‹น์ด ์–ด๋ ต๋‹ค๋Š” ๋‹จ์ ์ด ์žˆ์Šต๋‹ˆ๋‹ค. ์ œ์•ˆ๋œ SCOPE๋Š” ์ด๋Ÿฌํ•œ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด ๋กค์•„์›ƒ์˜ ์ •ํ™•๋„์— ๋”ฐ๋ผ ๋‘ ๊ฐ€์ง€ ๊ฒฝ๋กœ๋กœ ๊ฐ๋… ์‹ ํ˜ธ๋ฅผ ๋ถ„๊ธฐํ•˜๋Š” ์ด์ค‘ ๊ฒฝ๋กœ ์ ์‘ํ˜• ๊ฐ€์ค‘์น˜ ๋ถ€์—ฌ ํ”„๋ ˆ์ž„์›Œํฌ๋ฅผ ๋„์ž…ํ•ฉ๋‹ˆ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์ž˜๋ชป๋œ ๊ถค์ ์—๋Š” ๊ต์ • ๋Šฅ๋ ฅ์ด ๋†’์€ ๊ฒฝ์šฐ๋ฅผ ์šฐ์„ ์‹œํ•˜๊ณ , ์˜ฌ๋ฐ”๋ฅธ ๊ถค์ ์—๋Š” ๋‚ฎ์€ ํ™•์‹ ๋„๋ฅผ ๊ฐ€์ง„ ์ƒ˜ํ”Œ์— ์ง‘์ค‘ํ•˜์—ฌ ์„ฑ๋Šฅ์„ ํ–ฅ์ƒ์‹œํ‚ต๋‹ˆ๋‹ค.

🔑 Implications and Limitations

• Differentiating the supervision signal by rollout correctness can make on-policy reinforcement learning more efficient.
• Focusing on samples with high corrective capability and exploiting low-confidence samples improves the model's training efficiency and performance.
• Group-level normalization, which accounts for differences in difficulty across prompts, effectively adjusts the weight distribution.
• Further analysis of SCOPE's overall performance gains is needed, as is a study of how well it generalizes across different LLM architectures.