See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Created by
  • Haebom

Authors

Le Thien Phuc Nguyen, Zhuoran Yu, Samuel Low Yu Hang, Subin An, Jeongik Lee, Yohan Ban, SeungEun Chung, Thanh-Huy Nguyen, JuWan Maeng, Soochahn Lee, Yong Jae Lee

💡 Overview

๊ธฐ์กด ์˜์ƒ ๊ธฐ๋ฐ˜ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(MLLM) ๋ฒค์น˜๋งˆํฌ๋Š” ์ธ๊ฐ„์˜ ์Œ์„ฑ์— ๋Œ€ํ•œ ๋ฏธ์„ธํ•œ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ์ถฉ๋ถ„ํžˆ ํ‰๊ฐ€ํ•˜์ง€ ๋ชปํ•ฉ๋‹ˆ๋‹ค. ๋ณธ ๋…ผ๋ฌธ์€ ๋ˆ„๊ฐ€ ๋งํ•˜๊ณ , ๋ฌด์—‡์„ ๋งํ•˜๋ฉฐ, ์–ธ์ œ ๋งํ•˜๋Š”์ง€์— ๋Œ€ํ•œ ํ™”์ž ์ค‘์‹ฌ์˜ ์˜์ƒ-์Œ์„ฑ ์ถ”๋ก ์„ ํ‰๊ฐ€ํ•˜๋Š” ์ƒˆ๋กœ์šด ๋ฒค์น˜๋งˆํฌ์ธ AV-SpeakerBench๋ฅผ ์ œ์•ˆํ•ฉ๋‹ˆ๋‹ค. AV-SpeakerBench๋Š” 3,212๊ฐœ์˜ ๊ฐ๊ด€์‹ ๋ฌธ์ œ๋กœ ๊ตฌ์„ฑ๋˜๋ฉฐ, ํ™”์ž๋ฅผ ํ•ต์‹ฌ ์ถ”๋ก  ๋‹จ์œ„๋กœ ์‚ผ๊ณ , ์˜์ƒ-์Œ์„ฑ ์ข…์†์„ฑ์„ ์งˆ๋ฌธ ์˜๋ฏธ์— ํฌํ•จ์‹œํ‚ค๋ฉฐ, ์ „๋ฌธ๊ฐ€ ์ˆ˜์ค€์˜ ์ •๋ฐ€ํ•œ ์ฃผ์„์„ ํŠน์ง•์œผ๋กœ ํ•ฉ๋‹ˆ๋‹ค.

🔑 Implications and Limitations

• AV-SpeakerBench establishes an important new reference point for evaluating MLLMs' fine-grained speech understanding and speaker-centric reasoning abilities.
• Gemini models currently outperform open-source models by a wide margin, with Gemini 2.5 Pro achieving the best results. This suggests that a model's audiovisual fusion capability strongly influences performance.
• The current benchmark is a starting point for evaluating MLLM speech understanding in complex real-world situations; future work should extend it toward a broader range of speech characteristics and compound reasoning abilities.