
LiveMedBench: A Contamination-Free Medical Benchmark for LLMs with Automated Rubric Evaluation

Created by
  • Haebom

Authors

Zhiling Yan, Dingjie Song, Zhe Fang, Yisheng Ji, Xiang Li, Quanzheng Li, Lichao Sun

💡 Overview

This paper argues that clinical deployment of LLMs requires rigorous, trustworthy evaluation, and proposes LiveMedBench to address two problems in existing medical benchmarks: data contamination and temporal misalignment. LiveMedBench collects real-world clinical cases on a weekly basis and evaluates the clinical reasoning ability of LLMs objectively through expert validation and automated scoring rubrics.
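The contamination-free idea can be illustrated by keeping only cases that appeared after a model's training cutoff. The following is a minimal sketch under stated assumptions, not the paper's pipeline: the `ClinicalCase` record, its field names, the sample dates, and the cutoff are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ClinicalCase:
    case_id: str     # hypothetical identifier
    published: date  # date the case first appeared (assumed to be known)
    text: str


def post_cutoff_cases(cases, training_cutoff):
    """Keep only cases published strictly after the training cutoff."""
    return [c for c in cases if c.published > training_cutoff]


# Illustrative records only; these are not LiveMedBench data.
cases = [
    ClinicalCase("case-001", date(2023, 5, 1), "..."),
    ClinicalCase("case-002", date(2024, 11, 18), "..."),
    ClinicalCase("case-003", date(2025, 3, 3), "..."),
]

# Assumed cutoff for an imaginary model; real cutoffs differ per model.
fresh = post_cutoff_cases(cases, training_cutoff=date(2024, 6, 30))
print([c.case_id for c in fresh])  # ['case-002', 'case-003']
```

Because the cutoff is a parameter, the same case pool yields a different evaluation subset for each model, which is one way a weekly-updated benchmark could stay ahead of training data.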

🔑 Implications and Limitations

• Continuous updates and contamination prevention: LiveMedBench continuously adds real clinical cases so that it reflects current medical knowledge, and it strictly manages separation from training data to prevent performance from being overestimated because of data contamination.
• Reliable automated evaluation: Automated scoring rubrics grounded in expert knowledge give more accurate and consistent assessments than existing subjective LLM-based evaluation methods.
• Identifying the bottleneck in clinical application: A broad evaluation of LLM performance shows that the main bottleneck for clinical use is not a lack of factual knowledge but a weak ability to apply knowledge to each patient's specific context.
• Widespread impact of data contamination: In an evaluation of 38 LLMs, a substantial share of the models performed worse on recent data, demonstrating how serious data contamination is.
• Future work: Evaluation should be strengthened for more complex clinical scenarios and rare diseases, and further research is needed to improve the ability of LLMs to apply context-appropriate medical knowledge while accounting for patient-specific constraints.
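Rubric-based scoring of the kind described above can be sketched as a weighted checklist. This is an illustration under assumptions, not the benchmark's actual rubric format: the `Criterion` type, the weights, the negative-weight convention for harmful content, and the clipping to [0, 1] are all hypothetical. The per-criterion judgements would come from a grader and are passed in here as plain booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    description: str
    weight: float  # assumption: positive = desired, negative = harmful


def rubric_score(criteria, met):
    """Return earned weight / maximum achievable weight, clipped to [0, 1].

    met[i] states whether criterion i was judged present in the response.
    """
    if len(criteria) != len(met):
        raise ValueError("one judgement is required per criterion")
    max_points = sum(c.weight for c in criteria if c.weight > 0)
    if max_points == 0:
        raise ValueError("rubric needs at least one positive-weight criterion")
    earned = sum(c.weight for c, hit in zip(criteria, met) if hit)
    return min(1.0, max(0.0, earned / max_points))


# Hypothetical rubric for a single case.
rubric = [
    Criterion("Identifies the most likely diagnosis", 4.0),
    Criterion("Adjusts the plan for the patient's renal impairment", 4.0),
    Criterion("Recommends a contraindicated drug", -4.0),
]

print(rubric_score(rubric, [True, False, False]))  # 0.5
print(rubric_score(rubric, [True, True, True]))    # 0.5
```

In this sketch a response that names the right diagnosis but ignores the patient-specific constraint scores the same as one that covers both yet adds a harmful recommendation, which shows how a rubric can separate factual recall from context-appropriate application.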