How Fast Should a Model Commit to Supervision? Training Reasoning Models on the Tsallis Loss Continuum

Created by
  • Haebom

Authors

Chu-Cheng Lin, Eugene Ie

💡 Overview

This paper uses a single framework, the "Tsallis loss continuum", to explain how supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR) should be ordered when training reasoning models, and why each falls short when used alone. The proposed Tsallis loss $J_Q$ reduces to RLVR at $q=0$ and takes a form similar to conventional supervised learning at $q=1$; training sequentially from $q=1 \to 0$ corresponds to the existing SFT-then-RLVR pipeline. Within this framework the paper explains theoretically why SFT mitigates the cold-start problem and why RLVR is robust to noise, and it proposes two new training methods, GARL and PAFT, which it reports improve performance.

🔑 Implications and Limitations

• The SFT-then-RLVR training order is shown to be theoretically justified, and the role of each stage is clearly identified.
• The "Tsallis loss continuum" provides a useful framework for understanding existing training methods in a unified way and for designing new training strategies.
• The proposed GARL and PAFT methods effectively mitigate the cold-start problem and, on certain datasets, achieve substantial performance gains over existing methods.
• The optimal value of $q$ for GARL and PAFT varies with the stability and training characteristics of the dataset, so further study and exploration are needed.
• The Monte Carlo estimator is biased, so methods for reducing this bias are a possible direction for future work.
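One plausible mechanism behind the Monte Carlo bias noted above can be sketched numerically. The example assumes a plug-in estimator: draw $n$ samples, take the empirical success rate $K/n$, and evaluate $\ln_q(K/n)$ in place of $\ln_q(p)$. Because $\ln_q$ is concave for $q > 0$, Jensen's inequality makes this estimate too low on average, while at $q=0$ it is linear and unbiased. The plug-in form and the function names are assumptions for illustration; the summary does not specify which estimator the paper analyzes.

```python
import math


def q_log(x: float, q: float) -> float:
    """Tsallis q-logarithm for q < 1 (finite at x = 0, so K = 0 is handled)."""
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)


def expected_plugin_estimate(p: float, q: float, n: int) -> float:
    """Exact expectation of ln_q(K/n) when K ~ Binomial(n, p)."""
    total = 0.0
    for k in range(n + 1):
        prob = math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)
        total += prob * q_log(k / n, q)
    return total


def plugin_bias(p: float, q: float, n: int) -> float:
    """Bias of the assumed plug-in estimator: E[ln_q(K/n)] - ln_q(p)."""
    return expected_plugin_estimate(p, q, n) - q_log(p, q)


if __name__ == "__main__":
    p = 0.1
    for q in (0.0, 0.5):
        for n in (8, 64):
            print(f"q={q} n={n:<3} bias={plugin_bias(p, q, n):+.5f}")
```

In this sketch the bias is zero at $q=0$, negative at $q=0.5$, and shrinks as the number of samples grows.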