AstroAlertBench: Evaluating the Accuracy, Reasoning, and Honesty of Multimodal LLMs in Astronomical Classification

Written by
  • Haebom

Authors

Claire Chen, Jiabao Sean Xiao, Shuze Daniel Liu, Facundo Perez Paolino, Luke Handley, Theophile Jegou du Laz, Ricky Nilsson, Alice Zou, Matthew Graham, Ashish Mahabal

💡 Overview

This study proposes AstroAlertBench, a new benchmark for evaluating the performance of multimodal large language models (LLMs) in astronomy. AstroAlertBench evaluates, in multiple stages, the accuracy, reasoning ability, and self-assessment ability (honesty) of LLMs as they interpret and classify complex astronomical data. The authors evaluated 13 state-of-the-art LLMs on 1,500 real observation records from the Zwicky Transient Facility (ZTF) and found that high accuracy does not necessarily guarantee that a model is reliable.
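To make the distinction between accuracy and honesty concrete, here is a minimal sketch in Python. It is not the benchmark's actual scoring code: the `AlertResult` fields, the `unflagged_error_rate` metric, and the class labels are all hypothetical and chosen only for illustration. It shows how a model can score well on accuracy while never flagging its own mistakes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AlertResult:
    # All fields are hypothetical, for illustration only.
    true_label: str        # reference class of the alert
    predicted_label: str   # class chosen by the model
    self_rated_ok: bool    # model's own verdict on its reasoning

    @property
    def correct(self) -> bool:
        return self.true_label == self.predicted_label


def accuracy(results: List[AlertResult]) -> float:
    """Fraction of alerts classified correctly."""
    return sum(r.correct for r in results) / len(results)


def unflagged_error_rate(results: List[AlertResult]) -> float:
    """Fraction of wrong answers that the model still rated as fine."""
    errors = [r for r in results if not r.correct]
    if not errors:
        return 0.0
    return sum(r.self_rated_ok for r in errors) / len(errors)


# Illustrative toy data, not taken from the paper.
results = [
    AlertResult("supernova", "supernova", True),
    AlertResult("variable star", "variable star", True),
    AlertResult("asteroid", "asteroid", True),
    AlertResult("bogus", "supernova", True),  # wrong, yet self-rated as fine
]

print(accuracy(results))              # 3 of 4 alerts correct
print(unflagged_error_rate(results))  # the single error went unflagged
```

In this toy case the model is right on three of four alerts, yet it flags none of its errors, which is the kind of gap a separate honesty score is meant to expose.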

🔑 Implications and Limitations

  • Multimodal LLMs show potential for processing large volumes of astronomical data, but their specialized scientific classification and interpretable reasoning still need improvement.
  • A model's "honesty," meaning its ability to evaluate its own reasoning, can serve as an important indicator of how far it can be trusted in real-world applications.
  • The study initiates a human-in-the-loop evaluation protocol, laying the groundwork for future community-scale participation and pointing to an important direction for LLM development in astronomy.
  • The current benchmark is limited to ZTF data; follow-up work covering a wider range of astronomical events and datasets is needed.