
Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty

Posted by
  • Haebom

Authors

Mehul Damani, Isha Puri, Stewart Slocum, Idan Shenfeld, Leshem Choshen, Yoon Kim, Jacob Andreas

💡 Overview

This paper argues that training language models (LMs) with reinforcement learning (RL) on binary reward functions alone fails to account for predictive uncertainty, with two side effects: degraded calibration and more incorrect responses. To address this, the authors propose RLCR (Reinforcement Learning with Calibration Rewards), a method that improves prediction accuracy and confidence estimates at the same time. RLCR uses a reward function that combines a binary correctness score with a Brier score, so the model is trained to optimize its confidence estimate alongside the correctness of its answer.
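To make the reward concrete, here is a minimal sketch of a correctness-plus-Brier reward. It assumes the combined reward is the binary correctness score minus the squared gap between the stated confidence and that correctness score; the summary above only says the two are combined, so the exact form and the function names (`rlcr_reward`, `binary_reward`, `expected_rlcr_reward`) are illustrative assumptions, not the authors' code.

```python
def binary_reward(is_correct: bool) -> float:
    """Plain binary reward: 1 for a correct answer, 0 otherwise."""
    return 1.0 if is_correct else 0.0


def rlcr_reward(is_correct: bool, confidence: float) -> float:
    """Assumed RLCR-style reward: correctness minus a Brier penalty.

    `confidence` is the model's stated probability that its answer is correct.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    correct = binary_reward(is_correct)
    brier_penalty = (confidence - correct) ** 2
    return correct - brier_penalty


def expected_rlcr_reward(p_correct: float, confidence: float) -> float:
    """Expected reward when the answer is right with probability p_correct."""
    return (p_correct * rlcr_reward(True, confidence)
            + (1.0 - p_correct) * rlcr_reward(False, confidence))


if __name__ == "__main__":
    # A confident wrong answer is penalized; a hedged wrong answer is not.
    print(rlcr_reward(False, 1.0), rlcr_reward(False, 0.0))
    # On a grid of stated confidences, the expected reward peaks at the true
    # success probability (here 0.7), which is what "proper" means here.
    best = max(range(101), key=lambda i: expected_rlcr_reward(0.7, i / 100))
    print(best / 100)
```

Under this assumed form, a binary reward treats a lucky guess and a well-founded answer identically, while the Brier term makes overconfident wrong answers strictly worse than hedged ones.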

🔑 Implications and Limitations

  • Stronger confidence calibration: RLCR substantially improves calibration with no loss in accuracy, outperforming both ordinary RL training and post-hoc calibration methods.
  • Generalizable reliability: Because calibration is optimized explicitly during training, the resulting model's verbalized confidence can be used at test time to further improve accuracy and calibration.
  • Overcoming the limits of binary rewards: Binary reward functions impose no penalty on guessed or low-confidence outputs. This work addresses that gap and points toward more reliable reasoning models.
  • Limitations and future work: The reward design is built on one specific proper scoring rule, the Brier score. Further research may be needed on other proper scoring rules and on applying RLCR to more complex reasoning tasks.
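One way verbalized confidence could be used at test time is to weight sampled answers by their stated confidence instead of counting votes equally. The sketch below is an illustration under that assumption; the summary does not say which test-time scheme is used, and the helper names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote(samples):
    """Pick the most frequent answer, ignoring confidence."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(answer for answer, _ in samples)
    return counts.most_common(1)[0][0]


def confidence_weighted_vote(samples):
    """Pick the answer with the largest summed verbalized confidence.

    `samples` is a list of (answer, confidence) pairs, one per sampled response.
    """
    if not samples:
        raise ValueError("need at least one sample")
    totals = defaultdict(float)
    for answer, confidence in samples:
        totals[answer] += confidence
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # Hypothetical samples: two low-confidence votes for "15", one confident "12".
    samples = [("12", 0.9), ("15", 0.3), ("15", 0.4)]
    print(majority_vote(samples))             # 15
    print(confidence_weighted_vote(samples))  # 12
```

Weighting only helps if the stated confidences are informative, which is why it pairs naturally with a model trained for calibration.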