
Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Posted by
  • Haebom

Authors

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

💡 Overview

This paper proposes Prune-OPD, a framework for improving the efficiency of on-policy distillation (OPD) on long-sequence reasoning tasks. Prune-OPD detects "prefix drift" in real time, that is, the point at which the student model's reasoning diverges from the teacher model. It then lowers the weight given to teacher rewards that are no longer reliable and dynamically stops the rollout, so that compute can be reallocated more efficiently. The goal is to raise computational efficiency while maintaining or improving performance.
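The detect, down-weight, and stop loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the drift signal (a sliding-window mean of the teacher's log-probability for each student-sampled token), the function name `prune_rollout`, and the parameters `drift_threshold` and `window` are all hypothetical choices made for this example.

```python
from typing import List, Tuple


def prune_rollout(teacher_logprobs: List[float],
                  drift_threshold: float = -4.0,
                  window: int = 4) -> Tuple[int, List[float]]:
    """Return (cut, weights) for one student rollout.

    teacher_logprobs[t] is the teacher's log-probability of the token the
    student sampled at step t. The drift signal used here is an assumption
    for illustration; the summary does not specify the actual criterion.
    """
    weights: List[float] = []
    for t in range(len(teacher_logprobs)):
        # Mean teacher log-prob over the most recent `window` tokens.
        recent = teacher_logprobs[max(0, t - window + 1): t + 1]
        mean_lp = sum(recent) / len(recent)
        if mean_lp < drift_threshold:
            # Sustained disagreement: treat as prefix drift and stop here.
            return t, weights
        # Weight falls linearly from 1 (full agreement) to 0 (at threshold).
        weights.append(min(1.0, 1.0 - mean_lp / drift_threshold))
    return len(teacher_logprobs), weights


# The teacher agrees for two tokens, then assigns very low probability.
cut, weights = prune_rollout([-0.5, -1.0, -6.0, -7.0, -8.0, -9.0], window=2)
print(cut, weights)
```

In this sketch the tokens before `cut` keep a teacher signal scaled by `weights`, and the tokens from `cut` onward are never generated or scored, which is where the compute saving would come from.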

🔑 Implications and Limitations

• Maximized computational efficiency: When prefix drift occurs, Prune-OPD cuts unnecessary computation and concentrates on reliable supervision signals, which can substantially shorten training time.
• Performance preserved or improved: Despite the gain in training efficiency, the method preserves or improves performance on complex benchmarks.
• Dynamic training-budget management: When no prefix drift occurs, the training window is expanded so that long-range supervision signals are retained. This distinguishes the approach from simply shortening the rollout length.
• Fine-grained drift detection and response: The approach depends on a mechanism that detects prefix drift accurately in real time and responds in proportion to its severity. Further research may be needed on questions such as how to set the drift-detection threshold.
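The dynamic budget idea in the list above (expand the window when no drift is seen, rather than always shortening rollouts) might look like the following. This is a hypothetical update rule written only to illustrate the contrast with fixed truncation; the name `next_window`, the `grow` factor, and the rule itself are assumptions and do not come from the paper.

```python
def next_window(current: int, cut: int, max_len: int, grow: float = 1.5) -> int:
    """Pick the rollout length budget for the next step (hypothetical rule).

    current: the budget used for the last rollout.
    cut:     the position where drift was detected, or >= current if none.
    """
    if cut >= current:
        # No drift inside the window: extend it to keep long-range supervision.
        return min(max_len, int(current * grow))
    # Drift at `cut`: keep the budget close to the reliable prefix.
    return max(1, cut)


print(next_window(1024, 1024, 8192))  # no drift, window grows
print(next_window(1024, 300, 8192))   # drift at token 300, window shrinks
```

A fixed shorter rollout length would correspond to always returning the same small value, which would discard long-range supervision even for rollouts that never drift.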