CalArena: A Large-Scale Post-Hoc Calibration Benchmark

Posted by
  • Haebom

Authors

Eugene Berta, David Holzmuller, Francis Bach, Michael I. Jordan

💡 Overview

This work proposes CalArena, a large-scale, standardized benchmark for post-hoc calibration, to address the poorly calibrated probability estimates of modern machine learning models. CalArena covers roughly 2,000 experiments across a range of models and classification settings, and provides unified, reproducible implementations of dozens of calibration methods. In place of traditional calibration-error estimators, the authors propose the Post-Hoc Improvement (PHI) of proper scoring rules, which captures both calibration quality and any potential loss of predictive performance.
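The summary does not state the exact formula for PHI, so the following is only a rough sketch under one assumption: that PHI can be read as the reduction in a proper scoring rule on held-out data after a calibrator is applied. The function names (`log_loss`, `brier_score`, `post_hoc_improvement`) and the toy numbers are illustrative and not taken from the paper or its code.

```python
import numpy as np

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood, a proper scoring rule (lower is better)."""
    p_true = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(-np.mean(np.log(p_true)))

def brier_score(probs, labels):
    """Multiclass Brier score, a proper scoring rule (lower is better)."""
    onehot = np.eye(probs.shape[1])[labels]
    return float(np.mean(np.sum((probs - onehot) ** 2, axis=1)))

def post_hoc_improvement(score_fn, probs_before, probs_after, labels):
    """ASSUMED definition: score before calibration minus score after.

    Positive values mean the calibrator helped; negative values mean it
    made the held-out predictions worse.
    """
    return score_fn(probs_before, labels) - score_fn(probs_after, labels)

# Toy held-out set: an overconfident model that is right 3 times out of 4.
labels = np.array([0, 1, 0, 1])
probs_before = np.array([[0.99, 0.01], [0.99, 0.01], [0.99, 0.01], [0.01, 0.99]])
# Hypothetical calibrator output: same decisions, confidence shrunk to 0.75.
probs_after = np.array([[0.75, 0.25], [0.75, 0.25], [0.75, 0.25], [0.25, 0.75]])

phi_log = post_hoc_improvement(log_loss, probs_before, probs_after, labels)
phi_brier = post_hoc_improvement(brier_score, probs_before, probs_after, labels)
print(f"PHI (log loss): {phi_log:.4f}, PHI (Brier): {phi_brier:.4f}")
```

Because the score is computed on the full predictive distribution, a calibrator that lowers calibration error but damages the predictions would show up here as a negative value, which is the behavior the overview attributes to PHI.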

🔑 Implications and Limitations

• Smooth calibration functions consistently outperform binning-based methods.
• In high-dimensional settings, dedicated multiclass calibration methods become important.
• General-purpose machine learning models are less competitive when they lack calibration-specific design.
• The PHI metric offers a more principled alternative for comparing calibration methods.
• The data, code, and evaluation tools are publicly released, so future work can use them to develop and compare new calibration methods.
• Although the benchmark is the most comprehensive empirical study to date, further evaluation on new model architectures and more complex datasets may still be needed.
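To make the contrast between smooth and binning-based calibrators concrete, here is a minimal sketch of one well-known method from each family: temperature scaling (smooth) and histogram binning on top-label confidence (binning-based). It is not the CalArena implementation. The synthetic data, the grid search over temperatures, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(probs, labels, eps=1e-12):
    p = np.clip(probs[np.arange(len(labels)), labels], eps, 1.0)
    return float(-np.mean(np.log(p)))

def fit_temperature(logits, labels):
    """Smooth calibrator: pick T minimizing the NLL of softmax(logits / T)."""
    grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 400))
    grid = np.append(grid, 1.0)  # T = 1 guarantees we never do worse than no-op
    losses = [nll(softmax(logits / t), labels) for t in grid]
    return float(grid[int(np.argmin(losses))])

def fit_histogram_binning(conf, correct, n_bins=10):
    """Binning calibrator: map each confidence bin to its empirical accuracy."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.digitize(conf, edges[1:-1])  # bin index in 0..n_bins-1
    values = np.array([
        correct[idx == b].mean() if np.any(idx == b) else (edges[b] + edges[b + 1]) / 2
        for b in range(n_bins)
    ])
    return edges, values

def apply_histogram_binning(conf, edges, values):
    return values[np.digitize(conf, edges[1:-1])]

# Synthetic calibration set (assumption): labels follow softmax(z), but the
# "model" reports logits 3 * z, so it is overconfident by a factor of 3.
rng = np.random.default_rng(0)
n, k = 2000, 3
z = 2.0 * rng.standard_normal((n, k))
cum = np.cumsum(softmax(z), axis=1)
labels = np.minimum((rng.random(n)[:, None] > cum).sum(axis=1), k - 1)
logits = 3.0 * z

T = fit_temperature(logits, labels)
probs = softmax(logits)
conf = probs.max(axis=1)
correct = (probs.argmax(axis=1) == labels).astype(float)
edges, values = fit_histogram_binning(conf, correct)
binned = apply_histogram_binning(conf, edges, values)
print(f"fitted temperature: {T:.2f}")
print(f"distinct confidences after binning: {len(np.unique(binned))}")
```

Temperature scaling rescales every prediction through one continuous map, while histogram binning can only output as many distinct confidence values as it has bins. This sketch fits and applies both calibrators on the same data for brevity; a real comparison would evaluate them on a separate held-out split.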