
Blending Supervised and Reinforcement Fine-Tuning with Prefix Sampling

Posted by
  • Haebom

Authors

Zeyu Huang, Tianhao Cheng, Zihan Qiu, Zili Wang, Yinghui Xu, Edoardo M. Ponti, Ivan Titov

💡 Overview

This paper proposes Prefix-RFT, a hybrid approach that combines supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT), weighing the strengths and weaknesses of each so that the model both learns to imitate demonstration data and improves through exploration. In experiments on mathematical reasoning problems, Prefix-RFT outperformed standalone SFT and standalone RFT, as well as parallel mixed-policy RFT methods. The results indicate that it effectively combines the complementary properties of SFT and RFT.
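The summary above does not spell out the mechanism, so the following is only a minimal sketch of one plausible reading of "prefix sampling" suggested by the title: keep a random-length prefix of a demonstration and let the current policy generate the rest, so each rollout mixes imitation and exploration. The function `build_hybrid_rollout`, the `continue_fn` callback, the uniform choice of prefix length, and the toy policy are all hypothetical illustrations, not the authors' implementation.

```python
import random
from typing import Callable, List, Tuple


def build_hybrid_rollout(
    demonstration: List[str],
    continue_fn: Callable[[List[str]], List[str]],
    rng: random.Random,
) -> Tuple[List[str], int]:
    """Keep a random-length prefix of the demonstration and let the policy finish it.

    Assumption: the prefix length is drawn uniformly from 0..len(demonstration).
    Length 0 is a pure policy rollout; the full length reproduces the demonstration
    before the policy appends anything.
    """
    prefix_len = rng.randint(0, len(demonstration))  # inclusive on both ends
    prefix = demonstration[:prefix_len]
    continuation = continue_fn(prefix)  # tokens produced by the policy
    return prefix + continuation, prefix_len


def toy_policy(prefix: List[str]) -> List[str]:
    """Stand-in for a language model: always emits two placeholder tokens."""
    return ["<policy_token>", "<policy_token>"]


rng = random.Random(0)
demo = ["step1", "step2", "step3", "answer"]
rollout, prefix_len = build_hybrid_rollout(demo, toy_policy, rng)
print(prefix_len, rollout)
```

In an actual training loop, the completed rollout would then be scored by a reward signal; how prefix tokens and policy-generated tokens enter the loss is a design decision this summary does not describe.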

🔑 Implications and Limitations

• Presents an effective hybrid training method that addresses two problems at once: the behavior-cloning problem of SFT and the tendency of RFT to learn unpredictable behaviors.
• Demonstrates stronger performance on mathematical reasoning problems than either single training method or the parallel mixed approach, pointing to new possibilities for language model fine-tuning.
• Confirms that Prefix-RFT remains effective under changes in the quality and quantity of demonstration data, but further validation across other domains and problem types is still needed.